DATA HANDLING IN SCIENCE AND TECHNOLOGY - VOLUME 15
Adaption of simulated annealing to chemical optimization problems
DATA HANDLING IN SCIENCE AND TECHNOLOGY
Advisory Editors: B.G.M. Vandeginste and S.C. Rutan

Other volumes in this series:
Volume 1  Microprocessor Programming and Applications for Scientists and Engineers by R.R. Smardzewski
Volume 2  Chemometrics: A Textbook by D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte and L. Kaufman
Volume 3  Experimental Design: A Chemometric Approach by S.N. Deming and S.L. Morgan
Volume 4  Advanced Scientific Computing in BASIC with Applications in Chemistry, Biology and Pharmacology by P. Valkó and S. Vajda
Volume 5  PCs for Chemists, edited by J. Zupan
Volume 6  Scientific Computing and Automation (Europe) 1990, Proceedings of the Scientific Computing and Automation (Europe) Conference, 12-15 June, 1990, Maastricht, The Netherlands, edited by E.J. Karjalainen
Volume 7  Receptor Modeling for Air Quality Management, edited by P.K. Hopke
Volume 8  Design and Optimization in Organic Synthesis by R. Carlson
Volume 9  Multivariate Pattern Recognition in Chemometrics, illustrated by case studies, edited by R.G. Brereton
Volume 10 Sampling of Heterogeneous and Dynamic Material Systems: theories of heterogeneity, sampling and homogenizing by P.M. Gy
Volume 11 Experimental Design: A Chemometric Approach (Second, Revised and Expanded Edition) by S.N. Deming and S.L. Morgan
Volume 12 Methods for Experimental Design: principles and applications for physicists and chemists by J.L. Goupy
Volume 13 Intelligent Software for Chemical Analysis, edited by L.M.C. Buydens and P.J. Schoenmakers
Volume 14 The Data Analysis Handbook, by I.E. Frank and R. Todeschini
Volume 15 Adaption of simulated annealing to chemical optimization problems, edited by J.H. Kalivas
DATA HANDLING IN SCIENCE AND TECHNOLOGY — VOLUME 15 Advisory Editors: B.G.M. Vandeginste and S.C. Rutan
Adaption of simulated annealing to chemical optimization problems
edited by JOHN H. KALIVAS Department of Chemistry, Idaho State University, Pocatello, ID 83209, U.S.A.
1995 ELSEVIER Amsterdam — Lausanne — New York — Oxford — Shannon — Tokyo
ELSEVIER SCIENCE B.V. Sara Burgerhartstraat 25 P.O. Box 211, 1000 AE Amsterdam, The Netherlands
ISBN 0-444-81895-2 © 1995 Elsevier Science B.V. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission of the publisher, Elsevier Science B.V., Copyright & Permissions Department, P.O. Box 521, 1000 AM Amsterdam, The Netherlands. Special regulations for readers in the USA – This publication has been registered with the Copyright Clearance Center Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923. Information can be obtained from the CCC about conditions under which photocopies of parts of this publication may be made in the USA. All other copyright questions, including photocopying outside of the USA, should be referred to the copyright owner, Elsevier Science B.V., unless otherwise specified. No responsibility is assumed by the publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. This book is printed on acid-free paper. Printed in The Netherlands
Contents

Introduction
  References

Chapter 1. Simulated annealing and generalizations (I.O. Bohachevsky, M.E. Johnson and M.L. Stein)
  1. Introduction
  2. The simulated annealing method
    2.1. Formulation
    2.2. The simulated annealing algorithm
    2.3. Termination
    2.4. Illustrative example
  3. Characteristics of simulated annealing
    3.1. Number of function evaluations
    3.2. Relative computational efficiency
    3.3. General properties of simulated annealing
  4. An application
  5. Closing observations
  References

Chapter 2. Comparison of algorithms for wavelength selection (U. Hörchner and J.H. Kalivas)
  1. Introduction
  2. Theory
    2.1. Spectroscopic calibrations and wavelength selection
    2.2. Simulated annealing, Boltzmann statistics and threshold acceptance
    2.3. SA type algorithms and convergence to the exact extreme
    2.4. Wavelength selection optimization functions
    2.5. Neighborhood definition for wavelength selection
  3. Results
    3.1. K-matrix analysis
    3.2. P-matrix analysis
  4. Conclusions
  References

Chapter 3. Robust principal component analysis and constrained background bilinearization for quantitative analysis (R. Yu, Y. Xie and Y. Liang)
  1. Introduction
  2. Robust principal component analysis by projection pursuit and simulated annealing
    2.1. Projection pursuit
    2.2. Projection pursuit algorithm for robust principal component analysis
    2.3. Generalized simulated annealing as an optimization algorithm for PP PCA
  3. Simulation for PCA treatment
    3.1. Linear structure
    3.2. Planar structure
    3.3. Contaminated distribution
    3.4. The PC directions obtained by SVD and PP PCA
  4. Constrained background bilinearization with generalized simulated annealing
    4.1. Constrained background bilinearization (CBBL)
    4.2. Generalized simulated annealing for CBBL
    4.3. Analytical systems for CBBL analysis
  References

Chapter 4. Kalman filter quantitative resolution of overlapped shifted peaks after optimal alignment by simulated annealing (T. Rotunno)
  1. Introduction
  2. The Kalman filter
    2.1. Basics
    2.2. Program overview
    2.3. The simulated annealing algorithm
  3. Applications
    3.1. Resolution of synthetic spectra
    3.2. Resolution of HPLC chromatograms
    3.3. Resolution of ESCA spectra
  4. Conclusions
  References

Chapter 5. Selection of molecular descriptors for quantitative structure-activity relationships (J.M. Sutter and P.C. Jurs)
  1. Introduction
  2. ADAPT methodology
  3. Selecting descriptors for linear regression
    3.1. Multiple linear regression
    3.2. Simulated annealing for descriptor selection
  4. Selecting descriptors for computational neural networks
    4.1. Computational neural networks
    4.2. Training neural networks
    4.3. Generalized simulated annealing for descriptor selection
  5. A QSAR study
    5.1. Experimental section
    5.2. Results and discussion
    5.3. Conclusions
  References

Chapter 6. Fundamentals of cluster analysis using simulated annealing (D.E. Brown and C.L. Huntley)
  1. Clustering
  2. The clustering optimization problem
    2.1. Partitional clustering
    2.2. Hierarchical clustering
  3. Simulated annealing for clustering
    3.1. Simulated annealing for partitional clustering
    3.2. Simulated annealing for hierarchical clustering
  4. Examples
    4.1. External clustering criteria
    4.2. Results for partitional clustering
    4.3. Results for hierarchical clustering
  5. Summary and conclusions
  References

Chapter 7. Classification of materials (R. Yu, L. Sun and Y. Liang)
  1. Introduction
  2. Cluster analysis by simulated annealing
    2.1. Principle of cluster analysis by simulated annealing
    2.2. Treatment of simulated data
    2.3. Classification of tea samples
    2.4. Some computational aspects of the simulated annealing algorithm
  3. Cluster analysis by K-means algorithm and simulated annealing
    3.1. Introduction
    3.2. Treatment of simulated data
    3.3. Classification of calculus bovis samples
    3.4. Classification of tea samples
    3.5. Comparison of methods with and without simulated annealing
  4. Classification of materials by projection pursuit based on generalized simulated annealing
    4.1. Introduction
    4.2. The iris data
    4.3. Classification of tea samples
    4.4. Classification of beer samples
    4.5. Classification of biological samples
  References

Chapter 8. Chemical batch process scheduling (I.A. Karimi and S. Hasebe)
  1. A serial multiproduct batch process
    1.1. SA algorithm details
    1.2. Heuristic algorithms
    1.3. Numerical evaluation
  2. Extensions to due-date penalties
    2.1. Heuristic algorithms
    2.2. Numerical evaluation
  3. A large scale scheduling problem
    3.1. A batch plant with two production lines
    3.2. Schedule definition
    3.3. Infeasible production sequences
    3.4. Simulation algorithm
    3.5. Improved simulation algorithm
    3.6. The SA algorithm
    3.7. Example
  4. Conclusions
  References

Chapter 9. Nuclear fuel management (G.T. Parks and D.J. Kropaczek)
  1. Introduction
  2. Loading pattern evaluation
    2.1. Reference neutronics model
    2.2. Generalized perturbation theory model
  3. Optimization methodology
    3.1. Solution generation
    3.2. Problem functions
    3.3. Annealing schedule
    3.4. Search strategies
    3.5. Archiving
  4. Application results and conclusions
  References

Chapter 10. Design of cost-effective emission control strategies (R.G. Derwent)
  1. Introduction
  2. An optimal strategy for retrofit flue gas desulphurisation in the United Kingdom
    2.1. The nature of the problem
    2.2. Selecting sensitive receptor sites
    2.3. Critical loads for sulphur deposition
    2.4. Attribution of sulphur deposition
    2.5. Finding optimal strategies
    2.6. Optimization by simulated annealing and retrofit flue gas desulphurisation strategies
  3. Retrofit FGD strategies and long range transboundary transport
    3.1. Long range transport from the United Kingdom to Scandinavia
    3.2. Optimal control strategies and pollution control technology costs
  4. Combatting the long range transport and deposition of nitrogen species in Europe
    4.1. NOx abatement strategies
    4.2. Selection of sensitive receptor sites for oxidized nitrogen deposition
    4.3. Emission deposition relationship for oxidized nitrogen
    4.4. Finding optimal strategies for NOx abatement
  5. Optimal strategies for pollutants in combination
    5.1. Photochemical oxidant control strategies
    5.2. Selecting sensitive receptors for ozone
    5.3. Ozone to hydrocarbons and NOx source-relationship
    5.4. Application of simulated annealing
  6. Conclusions
  References

Chapter 11. Determination of biexponential fluorescence lifetimes by using simulated annealing and simplex searching (S.L. Shew and C.L. Olson)
  1. Introduction
  2. Background
    2.1. Instrument description
    2.2. Instrumental distortions
  3. Data analysis
    3.1. Marquardt fitting
    3.2. Simplex searching
    3.3. Simulated annealing
    3.4. Combination of simulated annealing and simplex searching
    3.5. Error analysis
  4. Results
  References

Chapter 12. Simulated annealing applied to crystallographic structure refinement (A.T. Brunger and L.M. Rice)
  1. Introduction
  2. Crystallographic refinement
    2.1. The crystallographic residual Exray
    2.2. The chemical term Echem
    2.3. Additional restraints and constraints
    2.4. Weighting
    2.5. The free R value
  3. Simulated annealing refinement
    3.1. Monte Carlo
    3.2. Molecular dynamics
    3.3. Torsion angle molecular dynamics
    3.4. Temperature control
    3.5. Annealing schedule
    3.6. Annealing control
    3.7. Commonly used annealing schedules
  4. Radius of convergence
  5. Directionality of simulated annealing refinement
  6. Simulated annealing omit maps
  7. Refinement with phase restraints
  8. Conclusions
  References

Chapter 13. Multi-dimensional searches in macromolecular X-ray crystallography (S. Subbiah)
  1. Introduction
    1.1. Overview of macromolecular X-ray crystallography
  2. The real space search problem
    2.1. Statement of the problem
    2.2. Implementation of the simulated annealing solution
    2.3. Experiment 1
      2.3.1. The problem
      2.3.2. Results
    2.4. Experiment 2
      2.4.1. The problem
      2.4.2. Results
  3. The reciprocal space search problem
    3.1. Statement of the molecular replacement problem
    3.2. Implementation of the simulated annealing solution
    3.3. Experiment 1
      3.3.1. The problem
      3.3.2. Results
    3.4. Experiment 2
      3.4.1. The problem
      3.4.2. Results
  4. Conclusion
  References

Chapter 14. Simulated annealing in the calculation of NMR structures (D. Zhao and O. Jardetzky)
  1. Introduction
  2. Statement of the problem
  3. Types of NMR constraints
    3.1. Distance constraints
    3.2. Dihedral angle constraints
    3.3. Chemical shift constraints
    3.4. Types of constraint potentials
    3.5. Consistency of the experimental constraints
  4. Empirical force fields for SA of proteins and DNA
  5. Simulated annealing
    5.1. Molecular dynamics as an algorithm of SA
    5.2. Temperature controls
    5.3. Cooling rate
    5.4. Convergence tests
  6. Applications of SA
    6.1. Ab initio SA
    6.2. SA as a refinement procedure
    6.3. Simulated annealing used for making assignments
    6.4. SA using time averaged constraints
    6.5. 4D SA
    6.6. Back calculation of NMR spectra and direct refinement against NOE intensities
    6.7. Sequential simulated annealing
      6.7.1. Ergodic theorem
      6.7.2. Starting structure
    6.8. SA in dihedral angle space
    6.9. Monte Carlo combined with MD
    6.10. Using SA to calculate 3D structures without spectral assignment
    6.11. Relative weight of structures within a family of structures
  7. Accuracy and precision of calculated structures
    7.1. Analysis of the calculated structures
    7.2. Accuracy and precision
  References

Chapter 15. Structural models of tetrahedrally bonded amorphous materials (F. Wooten and D. Weaire)
  1. Introduction
  2. The WWW algorithm
    2.1. Rules for bond switches
    2.2. Randomization of the network
    2.3. The randomization temperature
    2.4. Relaxation of the structure
  3. The WWW567 algorithm
  4. The initial crystal structure
  5. Applications and results
    5.1. Amorphous silicon
    5.2. The crystalline-amorphous interface
    5.3. Amorphous diamond: ta-C
    5.4. Even-ring models
    5.5. Hydrogenated amorphous silicon: a-SiHx
  References

Chapter 16. Conformational analysis of flexible molecules (S.R. Wilson, W. Cui and F. Guarnieri)
  1. Introduction
    1.1. Simulated annealing of met-enkephalin and other peptides
    1.2. New simulated annealing methodology
  2. The simulated annealing pathway
    2.1. Monitoring structural changes
    2.2. Monitoring bond rotation frequency
    2.3. Bond statistics and flexibility monitoring
    2.4. Dihedral distribution functions
    2.5. Anneal-flex
    2.6. Conformational memories and bioactivity
  3. Conclusion
  References

Chapter 17. Simulated annealing-optimal histogram applications to the protein folding problem (D.M. Ferguson and D.G. Garrett)
  1. Introduction
  2. Protein model
  3. Computational algorithms
    3.1. Optimal histogram methodology
    3.2. Order parameters
  4. Results
  5. Concluding discussion
    5.1. Comparison to experiment
  References

Chapter 18. Optimization of linear and non-linear parameters in a trial wavefunction by the method of simulated annealing (P. Dutta and S.P. Bhattacharyya)
  1. Introduction
  2. The construction of the cost function
  3. Minimization of the cost function
  4. Results and discussion
    4.1. Ground states of two-electron atoms and ions
      4.1.1. Use of Davis' basis set: optimization of linear parameters only
      4.1.2. Use of STOs: optimization (cyclic or simultaneous) of many linear and single non-linear parameters
      4.1.3. The case of many non-linear parameters
    4.2. Excited states of two-electron atoms and ions
    4.3. Applications to two-electron diatomics
      4.3.1. Simultaneous optimization of R and α
      4.3.2. Simultaneous optimization of R and α with a modified choice of the trial wavefunction
      4.3.3. Simultaneous optimization of orbital exponent (α), internuclear distance (R) and the CI-coefficient (C)
  5. Further generalization of the scheme
  6. Postscript
  References

Chapter 19. Annealing to a moving target: first principles molecular dynamics (S.J. Singer)
  1. The basic scheme
    1.1. An example simulated annealing calculation
    1.2. Holonomic constraints
    1.3. Damped motion
    1.4. Further refinements of simulated annealing dynamics
  2. Models of electronic structure and applications
    2.1. Solvated electrons and other few-electron systems
    2.2. Ab initio Schrödinger equation based methods
    2.3. Density functional techniques
    2.4. Other electronic structure approaches
  3. New directions
  References

Chapter 20. A MATLAB algorithm for optimization of an arbitrary multivariate function (M.A. Curtis)
  1. Introduction
  2. Program requirements and operation
  3. Applications
    3.1. Multioptimum function in two variables
    3.2. Non-linear regression
    3.3. Binary classification using linear discriminant analysis
  4. Modifications for MATLAB 4.0
  5. Conclusions
  6. Source code listings
    6.1. Program gsaopt.m
    6.2. Program cosmaze.m (Application 1)
    6.3. Program wexpred.m (Application 2)
    6.4. Program clsdemo.m (Application 3)
  References

Epilogue
  1. Introduction
  2. Bibliography

Index
Introduction

John H. Kalivas
Department of Chemistry, Idaho State University, Pocatello, Idaho 83209, USA
Optimization problems occur regularly in chemistry. The problems are diverse and vary from selecting the best wavelength design for optimal spectroscopic concentration predictions to geometry optimization of atomic clusters and protein folding. Numerous optimization tactics have been explored to solve these problems. While most optimizers maintain the ability to locate global optima for simple problems, few are robust against convergence to local optima for hard or large scale optimization problems. Simulated annealing (SA) has shown a great tolerance to convergence to local optima and is often called a global optimizer. The optimization algorithm has found wide use in numerous areas such as engineering, computer science, communication, image recognition, operations research, physics, and biology. Recently, SA and variations on it have shown considerable success in solving numerous chemical optimization problems. One thrust of this book is to demonstrate the utility of SA in a wide range of chemical disciplines.

The SA method of optimization can be traced to Metropolis et al., who in 1953 proposed an algorithm for simulation of the progression of a solid to thermal equilibrium [1]. The algorithm is known as the Metropolis algorithm. In 1983, Kirkpatrick et al. realized that the Metropolis algorithm could also be used as a general optimizer and applied it to the physical design of computers [2]. Since this time, developments and applications of SA have significantly matured. Recently, SA was generalized (GSA), producing faster convergence to the global optimum. Simulated annealing has been described in terms of Markov chains and is similar to Wiener filters. This book describes SA, GSA, and other modifications of SA in their abilities to serve specific needs in a variety of chemical disciplines.

An assortment of books and special journal issues have been written that discuss unique theoretical aspects of SA and applications to non-chemical problems [3-6]. Simulated annealing is well characterized in these sources. Lacking from the literature for chemists is a collection of chemical applications that clearly demonstrate the power of SA. Gathered in this book are results from using SA and its variations on chemical problems. After reading this book, or chapters of interest, the reader should understand the strengths and weaknesses of SA and how to apply SA, or some variation, to their chemical optimization problem. The number of investigators using some form of SA is markedly increasing and this book does not cover all possible applications. I apologize in advance if your application of SA has not been included.

The book begins with a detailed discussion of SA and GSA in Chapter 1. This chapter presents the theoretical framework of SA and GSA from which a computer
program can be written by the reader. The remainder of this book goes on to describe applications of SA type algorithms to a diverse set of chemical problems. The final chapter contains an algorithm for GSA written in the MATLAB programming environment. This program can be easily adapted to any optimization problem and, with only slight modifications, it can be altered to perform SA. A general flowchart is also given.
REFERENCES
1. N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller, J. Chem. Phys., 21 (1953) 1087.
2. S. Kirkpatrick, C.D. Gelatt, Jr., and M.P. Vecchi, Science, 220 (1983) 671.
3. P.J.M. van Laarhoven and E.H. Aarts, Simulated Annealing: Theory and Applications, D. Reidel Publishing Company, Dordrecht, 1987.
4. E.H. Aarts and J. Korst, Simulated Annealing and Boltzmann Machines: A Stochastic Approach to Combinatorial Optimization and Neural Computing, Wiley, New York, 1989.
5. M.E. Johnson (Editor), Am. J. Math. Management Sci., 8 (1988) 205.
6. A. Sangiovanni-Vincentelli (Editor), Algorithmica, 6 (1991) 295.
Chapter 1

Simulated annealing and generalizations

Ihor O. Bohachevsky (a), Mark E. Johnson (b) and Myron L. Stein (c)
(a) 3 Loma Vista, Los Alamos, NM 87545
(b) Dept. of Statistics, University of Central Florida, Orlando, FL 32816
(c) Analysis and Assessment Div., Los Alamos National Lab, NM 87545
1. INTRODUCTION

The simulated annealing algorithm as it is known today was first employed by Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller [1] to determine equilibrium distributions of canonical ensembles of particles that interact through some specified potential. The calculated results and the principles of statistical mechanics were used to determine the pressure and the corresponding equation of state for the ensemble. The authors did not observe the applicability of the algorithm to general nonlinear optimization problems with multiple extrema, nor did they comment on the need for a provision to increase the energy in some steps when the objective is to minimize the ensemble energy.

Kirkpatrick, Gelatt, and Vecchi [2] recognized the general applicability of what they called the "Metropolis algorithm" to combinatorial optimization problems and used the analogy with the annealing process in metallurgy to discuss quantitatively the need to allow detrimental steps in the optimization process. They illustrated the utility of the algorithm with solutions of practical problems in which the objective function resembled the Hamiltonian of systems of interacting particles and used concepts and principles of statistical mechanics to gain insight into the performance of the algorithm. These authors also observed that as the size of the system increases the pathological ("worst case") behavior of the algorithm becomes increasingly irrelevant and the average performance becomes more critical. Their popularization made the simulated annealing method widely known and resulted in significant research activities in the stochastic optimization area.

The next step in the development of simulated annealing was the extension of the method to continuous variables; it was taken by D. Vanderbilt and S.G. Louie [3] and by I.O. Bohachevsky, M.E. Johnson and M.L. Stein [4]. Annealing is not the only process that may be used as a model for the development of probabilistic optimization algorithms. Bohachevsky, Johnson, and Stein [5] identified an analogy between stochastic optimization and a biased game of chance and used that analogy to estimate
the dependence of the expected number of function evaluations required for an optimizing random walk on the number of independent variables (dimensionality of the problem), albeit only for a special quadratic objective function.

In this introductory chapter we discuss in Sec. 2 the formulation of the simulated annealing approach to optimization, computations with the algorithm and their termination, and we illustrate the method with an example. In Sec. 3 we present attempts to model and to analyze the performance of the algorithm, in particular, the dependence of the computational effort on the dimensionality of the problem and the termination criterion. We combine the results presented in this section with observations of the results of many applications and discuss in Sec. 4 some of the characteristics of the simulated annealing method. Results of calculations that minimize the total energy of molecular conformation for several compounds and a summary conclude the chapter.

2. THE SIMULATED ANNEALING METHOD

2.1. Formulation
Optimization with the simulated annealing method requires:
a. specification of one single-valued objective function with either a closed form expression or a procedure and prescription of its computation;
b. description of the space of independent variables (arguments of the objective function) and the region in which the solution will be sought; the latter expresses the so-called optimization constraints;
c. definition of a neighborhood of a point in the space of independent variables; it should be small in the sense that values of the objective function in contiguous neighborhoods should not differ by large amounts. In mathematical terms it means introducing a topology; when the space is metrizable the definition of a neighborhood in terms of the metric is straightforward; for non-metrizable (Hausdorff) spaces encountered, for example, in combinatorial optimization (e.g. the traveling salesman problem) the success of the method may depend on a clever definition of a neighborhood;
d. a procedure that generates a pseudo-random walk through contiguous neighborhoods;
e. a criterion for the termination of the random walk. In the specification of this criterion it is convenient to distinguish two cases: (i) the optimal value of the objective function is known and only its location remains to be determined, (ii) neither the optimal value nor its location are known. In the first case the walk is terminated when the objective function attains a value sufficiently close to the known optimal value. The meaning of "sufficiently close" must be specified consistently with the definition of a small neighborhood; an approximate consistency analysis in the case of continuous variables is outlined in subsection 3 below. In the second case a satisfactory specification of a termination criterion requires knowledge of some special characteristics of the problem or several judicious replications of the optimization calculations.
2.2. The simulated annealing algorithm

We may assume without loss of generality [4] that the objective is to locate the minimum of a function φ(X), X = (x1, x2, ..., xn). (Capital letters designate vectors; lowercase letters with subscripts ranging from 1 to n are the components.) The simulated annealing algorithm to achieve that goal may be specified as follows:
(a) choose an initial point X0 randomly or judiciously on the basis of some knowledge of the problem and determine the corresponding value of the objective function φ(X0);
(b) choose a small random step ΔX in one of the several ways indicated in the next paragraph;
(c) at any point Xk calculate the provisional change in the value of the objective function Δφ = φ(Xk + ΔX) - φ(Xk);
(d) check the termination criterion; if satisfied, stop, otherwise:
(e) for Δφ ≤ 0 set Xk+1 = Xk + ΔX, and go to (b);
(f) for Δφ > 0 evaluate exp(-βΔφ), choose a random number p, 0 < p < 1, from some specified distribution on [0,1] (a uniform distribution suffices in most cases), and compare:
    (i) if exp(-βΔφ) ≥ p, go to (e);
    (ii) if exp(-βΔφ) < p, go to (b).
The parameter β controls the performance of the algorithm; it may depend on the value of the objective function φ(Xk) and/or the number of executed iterations, k, as discussed below.

The procedure for choosing a small random step ΔX depends on the characteristics of the particular problem. In discrete and combinatorial problems ΔX will be directed from the current location (state Xk) to an arbitrary point in a randomly chosen neighborhood contiguous to Xk. In metric spaces there are two basically distinct ways of determining ΔX: one is to choose a random point on a unit hypersphere (n direction cosines) and to step off in that direction a prescribed distance |ΔX| [4]; another is to choose a random point inside a hypercube centered on the current location Xk and having an edge length equal to 2 [3]. In the latter case the step size is random in the interval 0 ≤ |ΔX| ≤ √n with the expected value proportional to √n; stretching adjustments are necessary to satisfy specific step size requirements. S.W. Weller [6] compared these two procedures and found the space-filling property of the second recipe useful. In both cases the procedure may be refined to make the step size |ΔX| depend on the direction.

The control parameter β appearing in (f) provides opportunities for many refinements of the basic simulated annealing algorithm. When the global minimum value of the objective function, φ_min, is known we may set β = β0/(φ - φ_min); this substitution ensures that the probability of walking out of the global optimum becomes vanishingly small as it is approached while the probabilities of climbing out of local optima remain bounded away from zero. If the value of φ_min is not known, several successive adjustments after an initial guess may result in a reasonably efficient computation [4].

An effective way to control the performance of the simulated annealing algorithm is to make the parameter β depend on the number of past steps, k; a commonly used dependence is a multi-step function. The heights and the frequency of the stepwise changes may be prescribed a priori (a monotonically increasing β is called a "cooling schedule") or they can be determined from the statistics of past performance. Bohachevsky, Johnson, and Stein [4] determined that for a class of smooth objective functions with moderately deep local minima the algorithm is efficient when β is changed every 50-100 steps in such a way that exp(-βΔφ) remains in the range [0.5, 0.9].
In general, a histogram of the optimizing walk can and should be accumulated and exploited to enhance the performance of the algorithm, especially in applications where the evaluation of the objective function is computationally intensive (expensive). Vanderbilt and Louie [3] used the covariance of the sequence of accepted steps to bias the choice of the random step with the requirement of equal covariance. The merit of such a procedure will depend on the length of the histogram utilized and on the topography of the objective function. Forbes and Jones used a similar approach to develop an adaptive simulated annealing program for the optimization of optical lens designs [7,8].

Because analogies other than physical annealing may be used to motivate the simulated annealing optimization [5], one may conjecture that the probability of accepting or rejecting a step in (f) could be determined from distributions other than Boltzmann's. However, exploratory calculations carried out with normal and rational polynomial distributions (with the required properties) showed that none were as effective as the Boltzmann distribution.
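Steps (a)-(f) translate directly into a short program. The following minimal Python sketch implements the constant-step, fixed-β variant with termination after a run of consecutive rejected steps; the function and parameter names (simulated_annealing, max_fails, and the numerical defaults) are illustrative choices, not taken from the chapter.

import math
import random

def simulated_annealing(phi, x0, beta, step, max_steps=10000, max_fails=500):
    """Minimal sketch of steps (a)-(f): constant step size |dX|, fixed control
    parameter beta, termination after max_fails consecutive rejections."""
    x = list(x0)                                   # (a) initial point
    fx = phi(x)
    fails = 0
    for _ in range(max_steps):
        if fails >= max_fails:                     # (d) termination criterion
            break
        # (b) random direction on the unit hypersphere, constant length `step`
        d = [random.gauss(0.0, 1.0) for _ in x]
        norm = math.sqrt(sum(di * di for di in d))
        trial = [xi + step * di / norm for xi, di in zip(x, d)]
        dphi = phi(trial) - fx                     # (c) provisional change
        # (e) accept downhill steps; (f) accept uphill steps with
        #     probability exp(-beta * dphi)
        if dphi <= 0 or math.exp(-beta * dphi) >= random.random():
            x, fx = trial, fx + dphi
            fails = 0
        else:
            fails += 1
    return x, fx

A cooling schedule corresponds to increasing beta every 50-100 steps so that exp(-beta*dphi) stays within a prescribed range, and for combinatorial problems the random step would instead select a point in a randomly chosen contiguous neighborhood.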
2.3. Termination

Specification of a satisfactory termination criterion is the difficult part of the simulated annealing method. When the optimum value of the objective function is not known, the only viable option is to repeat the computation with different starting points, X0, and to continue each computation until the successive improvements are very small, or the value of the objective function becomes acceptably low, or the cost of computing becomes unacceptably high. In general, the decision is based on a combination of all such considerations.

When the value of the objective function at the global optimum, φ_min, is known we may obtain a reasonable stopping criterion from the following heuristic considerations. Assume that φ is smooth and φ_min = 0 at X = 0; near X = 0, φ may be approximated with

\phi = a \sum_{i=1}^{n} x_i^2 .

Let the stopping criterion be φ ≤ ε and the step size |ΔX| constant. The condition which ensures that the stopping criterion is satisfied within a reasonable number of steps was derived by Bohachevsky, Johnson, and Stein [5]; it is ε ≈ φ(|ΔX|). The following plausibility argument illustrates the reasoning. There is a δ given by

\delta^2 = \sum_{i=1}^{n} x_i^2

such that φ(δ) = ε. Consider concentric hyperspheres of radii |ΔX| and δ (≤ |ΔX|). The probability p0 that a random endpoint of a minimizing walk is inside the δ radius, given that it is inside the |ΔX| radius, is the ratio of volumes:

p_0 = \left( \frac{\delta}{|\Delta X|} \right)^{n} .

This ratio decreases as n increases unless δ = |ΔX|. Therefore, even for moderately large n (5-10) the probability that the walk approaches the origin closer than |ΔX| is very small, and the attainment of the termination criterion φ_min < ε < φ(|ΔX|) requires a large number of steps near the endpoint [5].
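To make the dimensionality effect concrete, the short calculation below evaluates p0 for an assumed value δ/|ΔX| = 0.5; the numerical ratio is chosen purely for illustration and is not given in the text.

ratio = 0.5                      # assumed delta/|dX|, illustrative only
for n in (2, 5, 10, 20):
    print(n, ratio ** n)
# output: 2 0.25, 5 0.03125, 10 0.0009765625, 20 9.5367431640625e-07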
2.4. Illustrative example

An illustration of simulated annealing optimization is presented in Figures 1-5. In this 2-dimensional example the objective function is a modified Rosenbrock's function:

\phi(x, y) = (1 - x)^2 + a (y - x^2)^2 + b \left[ \cos(\alpha x + \gamma y) + 1 \right]

with α = 3π, γ = 4π. We chose this function because it is one with which gradient descent has difficulties, and we added local minima to emphasize the global optimization challenge. The global minimum of this function is located at x = y = 1.

Figs. 1 and 2 illustrate optimizing random walks for Rosenbrock's function with a = 100 and b = 0. The walk in Fig. 1 was carried out with a constant step size |ΔX| = 0.10 and that in Fig. 2 with random steps picked from a uniform distribution over the interval [0, 0.10]; both were initiated at x = y = -1. The circular pattern of points surrounding the global minimum is characteristic of optimization with constant step size when the termination criterion is a large number (500 in Fig. 1) of consecutive failures to accept a step.

Figs. 3 and 4 illustrate optimizing random walks with a constant step size |ΔX| = 0.075 on the modified Rosenbrock surface characterized by a = 25 and b = 1. The walks were initiated at x = y = -1 in Fig. 3 and at x = -1, y = +1 in Fig. 4; the latter case illustrates the difficulty that may be encountered when the initial point is near a local minimum. Notwithstanding the initial difficulty, the total numbers of function evaluations in the two computations are approximately equal (8277 and 7834); a plausible reason for this is given below. Fig. 5 illustrates optimization on the same surface with a random step size uniformly distributed over the interval [0, 0.10]; the number of function evaluations in this case was 6356.

The examples presented in Figs. 3-5 graphically illustrate the increasing numbers of iterations required to climb out of progressively lower local minima. Therefore, to ensure success in the quest for the global minimum the termination criterion must be very loose; in the above 3 examples it was 7500, 5000, and 1000 successive failures to accept a step. Thus, in Figs. 3 and 4 almost 3/4 of the function evaluations are consumed in the process of satisfying the termination criterion and only a small fraction is used in the actual search for the global minimum. This underscores the difficulty of devising a satisfactory termination criterion for functions with many local extrema: a large fraction of the computational effort is expended to confirm that the extremum found is indeed global.
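The modified objective is simple to state in code. The sketch below defines it for the a = 25, b = 1 surface of Figs. 3-5 and, purely as an independent numerical check (using SciPy's general-purpose dual_annealing routine rather than the constant-step walk described here), confirms that the global minimum lies at x = y = 1.

import numpy as np
from scipy.optimize import dual_annealing

ALPHA, GAMMA = 3 * np.pi, 4 * np.pi

def phi(v, a=25.0, b=1.0):
    # Modified Rosenbrock function of the illustrative example; the cosine
    # term adds local minima and vanishes at the global minimum (1, 1).
    x, y = v
    return (1 - x) ** 2 + a * (y - x ** 2) ** 2 + b * (np.cos(ALPHA * x + GAMMA * y) + 1)

result = dual_annealing(phi, bounds=[(-2.0, 2.0), (-2.0, 2.0)], seed=1)
print(result.x, result.fun)   # expected: approximately [1, 1] and 0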
Figure 1. Biased random walk with a constant step size -- Rosenbrock's function.

Figure 2. Biased random walk with a random step size -- Rosenbrock's function.

Figure 3. Biased random walk with a constant step size over local minima; starting point on the steep part of the function.

Figure 4. Biased random walk with a constant step size starting near a local minimum.
Figure 5. Biased random walk with a random step size over local minima.

3. CHARACTERISTICS OF SIMULATED ANNEALING

The most emphasized property of simulated annealing optimization is the capability to traverse local optima in search of the global one. However, a random search may also be preferable to a systematic gradient descent method for some objective functions that have a unique optimum. This should not be surprising in view of successful applications of Monte Carlo calculations to obtain deterministic results (e.g. evaluations of definite integrals). Examples of functions that may pose problems for the gradient descent method are discontinuous or nondifferentiable functions, functions that are specified with a computational procedure instead of an analytic expression, or functions with peculiar topography (e.g., Rosenbrock's function). Therefore, it is natural to ask how simulated annealing compares to a gradient descent method with respect to the number of function evaluations per step and per solution (location of a stationary point). Both characteristics may be used with some justification as measures of computational efficiency.
Because the path that descends along the gradient leads directly to the stationary point while the optimization walk of simulated annealing meanders, one may expect that the former is more computationally efficient than the latter. That this is not true in general is shown in the next two sub-sections.

3.1. Number of function evaluations per step
Calculation of the gradient at a point Xk in n-dimensional space requires n function evaluations, one for each partial derivative. To estimate the expected number of tries required to find a point Xk+1 such that Δφ < 0, we consider a random walk in which a step is determined by choosing a random point from a uniform distribution on the unit hypersphere centered at Xk and moving in that direction a constant distance |ΔX|. The probability that the step so chosen has a component pointing toward the center of curvature of the contour φ = constant (toward the stationary point) equals the fraction f_n,

f_n = \frac{S_1}{S},

where S is the surface of the unit hypersphere and S1 is the part of that surface that lies on the same side as the center of curvature of the contour surface. For a Euclidean space and a spherically symmetric contour surface

f_n(\Psi) = \frac{\int_0^{\Psi} \sin^{n-2}\theta \, d\theta}{\int_0^{\pi} \sin^{n-2}\theta \, d\theta} \qquad (n \ge 2),

where Ψ is the half angle of the cone with vertex at Xk, axis along the radius of curvature, and base defined by the intersection of the contour φ = const = φ(Xk) and the hypersphere centered at Xk with radius equal to |ΔX|. In the 2-dimensional case shown in Fig. 6, S = 2π, S1 = 2Ψ, and cos Ψ = |ΔX|/2R, where R is the local radius of curvature of the contour line; therefore f_2 = Ψ/π. In general |ΔX|/R << 1 and |π/2 - Ψ| << 1.

From the definition of probability, the expected number of trial function evaluations, T, required to encounter an Xk+1 such that Δφ < 0 equals 1/f_n. To estimate the growth of T with n it is convenient to represent it in the following form:

T(n; \Psi) = \frac{2 \int_0^{\pi/2} \sin^{n-2}\theta \, d\theta}{\int_0^{\pi/2} \sin^{n-2}\theta \, d\theta - \int_{\Psi}^{\pi/2} \sin^{n-2}\theta \, d\theta}.

Figure 6. Illustration of the geometrical probability of random direction choice.

Using standard tables of integrals we find for n ≥ 2:

\int_0^{\pi} \sin^{n-2}\theta \, d\theta =
\begin{cases}
\dfrac{\pi \, (n-2)!}{2^{n-2} \, [((n-2)/2)!]^2}, & (n-2) \ \text{even}, \\[2ex]
\dfrac{2^{n-2} \, [((n-3)/2)!]^2}{(n-2)!}, & (n-2) \ \text{odd}.
\end{cases}

For large n, when the approximations n! ≈ √(2πn)(n/e)^n (Stirling's formula) and [1 - (1/n)]^n ≈ 1/e are valid, we obtain a simple approximation:

\int_0^{\pi/2} \sin^{n-2}\theta \, d\theta \approx \sqrt{\frac{\pi}{2(n-2)}}.

To evaluate the integral over the interval [Ψ, π/2] approximately, we introduce a new variable η = π/2 - θ and obtain

\int_{\Psi}^{\pi/2} \sin^{n-2}\theta \, d\theta = \int_0^{\pi/2 - \Psi} \cos^{n-2}\eta \, d\eta.

Because (π/2 - Ψ) << 1 we can use the approximation cos^n η ≈ exp(-nη²/2), with an error equal to nη⁴/12, to obtain

\int_0^{\pi/2 - \Psi} \cos^{n-2}\eta \, d\eta \approx \int_0^{\pi/2 - \Psi} e^{-(n-2)\eta^2/2} \, d\eta.

Using the standard approach that consists of evaluating the square of the right hand integral in polar coordinates over an area equal to the area of the rectangle determined by the limits of integration, we obtain the approximation

\int_0^{\pi/2 - \Psi} \cos^{n-2}\eta \, d\eta \approx \sqrt{\frac{\pi}{2(n-2)}} \left[ 1 - e^{-\frac{2}{\pi}(n-2)\left(\frac{\pi}{2} - \Psi\right)^2} \right]^{1/2}.

Substituting these values into the expression for T(n; Ψ) we obtain the approximation

T(n; \Psi) \approx T_a(n; \Psi) = \frac{2}{1 - \left[ 1 - e^{-\frac{2}{\pi}(n-2)\left(\frac{\pi}{2} - \Psi\right)^2} \right]^{1/2}};

for (2/π)(n - 2)(π/2 - Ψ)² << 1 this approximation simplifies to

T(n; \Psi) \approx 2\left[ 1 + \sqrt{\frac{2(n-2)}{\pi}} \left( \frac{\pi}{2} - \Psi \right) \right].

This result shows that for small n the number of trials required to find a step that reduces the value of the objective function, T_a(n; Ψ), grows only as the square root of n with a small coefficient. For moderately large n [provided n < (π/2 - Ψ)^(-2)] it is necessary to retain the quadratic term in the expansion of the exponential function, and T_a(n; Ψ) becomes proportional to n with a coefficient of proportionality less than one. This result was deduced by the authors in [5]. For values of (n - 2)(π/2 - Ψ)² greater than one the denominator becomes exponentially small and the approximation T_a(n; Ψ) begins to increase accordingly. The accuracy of the approximate expression for T(n; Ψ) derived in this section is illustrated in Figs. 7 and 8, where the dependence of T_a(n; Ψ) on n is plotted together with the exact values of T(n) obtained by recursive evaluation of the integrals for π/2 - Ψ = 0.05 and 0.10. In the range of values of n shown in these plots the approximation is very good; however, as Fig. 8 indicates, it deteriorates with increasing n because that variable appears in the exponent of the integrand.
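The exact expected number of trials and the approximation T_a(n; Ψ) can also be compared numerically. The sketch below evaluates the defining integrals with scipy.integrate.quad for π/2 - Ψ = 0.05; this is an illustrative reproduction of the comparison, not the authors' recursive evaluation.

import numpy as np
from scipy.integrate import quad

def T_exact(n, psi):
    # T(n; Psi) = [integral_0^pi sin^(n-2)t dt] / [integral_0^Psi sin^(n-2)t dt]
    num, _ = quad(lambda t: np.sin(t) ** (n - 2), 0.0, np.pi)
    den, _ = quad(lambda t: np.sin(t) ** (n - 2), 0.0, psi)
    return num / den

def T_approx(n, psi):
    # Approximation T_a(n; Psi) derived above
    c = (2.0 / np.pi) * (n - 2) * (np.pi / 2.0 - psi) ** 2
    return 2.0 / (1.0 - np.sqrt(1.0 - np.exp(-c)))

psi = np.pi / 2.0 - 0.05
for n in (5, 50, 200, 500):
    print(n, T_exact(n, psi), T_approx(n, psi))

Both values grow slowly with n and stay below about 8 for n up to 500, consistent with Fig. 7.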
Figure 7. Number of trials required to find a direction with negative slope; π/2 - Ψ = 0.05.

Figure 8. Number of trials required to find a direction with negative slope; π/2 - Ψ = 0.10.
3.2. Relative computational efficiency

The estimates of T(n; Ψ) derived in the preceding section enable us to compare the computational efficiency of simulated annealing to that of the gradient descent method. Because the comparison is meaningful only for convex objective functions (with a unique minimum), for which gradient descent is appropriate, we can assign to β such a large value that simulated annealing behaves like the "greedy" algorithm, i.e. accepts only the steps that result in improved values of the objective function. (This algorithm is sometimes called "Zero Temperature Monte Carlo.") A smooth objective function has a tangent plane at each point Xk of the random walk, as shown in Fig. 9. When the ratio of the step size |ΔX| to the radius of curvature of the contour, R, is small, the mean value of Δφ over random step directions approximately equals one half of the change in the gradient direction, i.e. ⟨Δφ⟩ ≈ Δφ_g/2. Therefore, an optimizing random walk requires approximately twice as many steps as the gradient descent. If we define computational efficiency as the number of steps required to complete the optimization, then random search will be more efficient than explicit gradient calculations when 2T < n. Using the simple approximate estimate T_a(n; Ψ) that includes only the first term in the expansion of the exponential, the condition 2T < n becomes

\frac{4}{n}\left[ 1 + \left( \frac{\pi}{2} - \Psi \right) \sqrt{\frac{2(n-2)}{\pi}} \right] < 1 .

Because (π/2 - Ψ) is small, the inequality is satisfied for n > 4.

Figure 9. Tangent plane and the locus of random steps of constant size.

The approximate expression for T(n; Ψ) with the exponential not expanded does not admit a convenient explicit solution of the inequality 2T < n; however, the numerical results illustrated in Figs. 7 and 8 show that the inequality remains satisfied also for large n.

3.3. General properties of simulated annealing
In addition to the computational efficiency analyzed above, the following general properties merit mention.
a. Simplicity. The basic simulated annealing algorithm does not require much more than a dozen lines of code (not including evaluation of the objective function) and the optimization constraints are easily handled by rejecting candidate points that violate them.
b. Flexibility. Prescription of different functional dependencies of β, incorporation of different stretchings for different variables, various recipes for pseudorandom step selection, and specifications of various termination criteria ensure that the basic simulated annealing algorithm can be modified to meet special requirements of most applications.
c. Generality. The method is applicable to nonconvex objective functions with multiple optima and to nondifferentiable (discontinuous) functions, and it may be used for discrete (combinatorial) and continuous variables optimization.
We have used simulated annealing to determine optimal designs of biological experiments [4] and optical lenses [9, 10], optimal deployments of missile interceptors [11], optimal allocations of resources in orbital engagements [12] and optimal molecular conformations [13]. A comprehensive summary of the theory and applications of simulated annealing has been published by van Laarhoven and Aarts [14].

4. AN APPLICATION
To illustrate the application of simulated annealing to problems of physical chemistry we present an example of molecular conformation optimization calculated by John H. Hall et al. [13] at the Los Alamos National Laboratory. The purpose of Hall's exploratory study was to demonstrate the feasibility of using simulated annealing to determine minimum energy configurations of the molecules of chemical compounds such as bicyclo-HMX, Tyr-Gly-Gly, or dibromoethane. In this study the objective was to minimize the total energy calculated from Allinger's MM2 potential energy functions. These functions are sums of the contributions of the Coulomb and van der Waals potentials and of the elastic energies of the bond lengths and bond angles. The expressions for the potentials contain empirical constants determined for each type of atom from studies of many compounds. The empirical constants in Hooke's law for the elastic energies were determined similarly for each atom-atom pair (bond lengths) and for each atom triplet (bond angles). Thus, the independent variables were bond lengths, bond angles, and torsion angles.

In the case of dibromoethane, utilization of symmetry and other properties results in a reduction of the expression for the total potential energy to a dependence on only one variable, the torsion angle. Because of this fortuitous circumstance it is convenient to present the conformation optimization calculations and results for that compound. The potential energy function of dibromoethane is shown in Fig. 10. The search for the minimum configuration was initiated at a value of the torsion angle equal to 30°; the corresponding molecular conformation is shown in Fig. 11. The histogram of the random search for the minimum energy conformation is illustrated in Fig. 12; the random walk was terminated at a value of the torsion angle equal to 180°. The plot shows that considerable "time" was spent in the two local minima. Nevertheless the algorithm located the global minimum energy conformation represented in Fig. 13. For the other compounds examined the potential energy function is a multidimensional hypersurface, computations are laborious, and the results are not amenable to concise graphical representation.
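For orientation, a molecular-mechanics total energy of the kind minimized in this study can be written schematically as the sum below; the functional forms and symbols are generic illustrations and are not the exact MM2 parameterization used by Hall et al.

E_{\mathrm{total}} = \sum_{\mathrm{bonds}} \tfrac{1}{2} k_b (l - l_0)^2
                   + \sum_{\mathrm{angles}} \tfrac{1}{2} k_\theta (\theta - \theta_0)^2
                   + \sum_{\mathrm{torsions}} E_{\mathrm{tor}}(\omega)
                   + \sum_{i<j} \left[ E_{\mathrm{vdW}}(r_{ij}) + \frac{q_i q_j}{4 \pi \varepsilon_0 r_{ij}} \right]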
Figure 10. Potential energy function of dibromoethane (potential energy versus torsion angle).

Figure 11. Initial conformation of dibromoethane; stereo view of Newman projection.

Figure 12. Potential energy histogram (potential energy versus iteration number).

Figure 13. Final conformation of dibromoethane; stereo view of Newman projection.

5. CLOSING OBSERVATIONS
Since its introduction by Kirkpatrick, Gelatt, and Vecchi [2] in 1983, simulated annealing has been applied by hordes of practitioners to diverse optimization problems. In view of such popularity and interest it is appropriate to close this introductory presentation of the method by addressing two questions: (1) What have we learned? (2) What more would we like to know?

Simulated annealing is analogous to an unfair game of chance biased in favor of the player [5]. The bias alleviates but does not eliminate the need for patience. Unfortunately, patience is expensive in terms of computer execution time. Therefore, it behooves the user to monitor recent trials and to use this information toward improving the bias (commonly referred to as "accelerating the convergence") without compromising the ability to escape local extrema. Attempts to accomplish this task benefit from knowledge of any relation among the quantities φ_min - φ_k, k, β, and |ΔX|. In particular, it would be useful to have an estimate of the expected number of trials (steps) required to climb out of a local minimum of a given depth in terms of β and |ΔX|. Such estimates would be helpful in the derivation of reasonable termination criteria. The approach developed in [5] and outlined in Sec. 3 offers a way to derive approximations to the above indicated dependencies. Such approximations need not be very good to be useful.

For example, J.H. Hall's work [15] on the parallelization of simulated annealing with a chaotic algorithm demonstrated that the method is not sensitive to imperfections of the pseudo-random number generator (i.e., its cycle length need not be terribly long). This and similar results [16] imply that simulated annealing computations are remarkably robust.

It is probably correct to say that successful application of simulated annealing requires a clever definition of neighborhoods in the configuration space (i.e., specification of individual "moves") and a judicious variation of the control parameter β. In the case of metric spaces we would like to have a rationale for choosing an appropriate step size and for specifying a priori an effective cooling schedule. The above indicated knowledge can be gained only from an understanding of the structure of the objective function, much of which evolves in the course of the search. Therefore, it is not likely that a simulated annealing code can be developed to be user-friendly to a degree that will allow efficient and successful applications by users not conversant with the configuration space and the objective function defined on this space [16].

Present applications of simulated annealing appear to be special adaptations of the method to specific problems because we have not yet uncovered general principles that may exist and underlie all optimization methods based on random searches. For example, does the Boltzmann factor provide the best expression for the probability of step acceptance in all applications? The observation that with sufficient ingenuity all probabilistic optimization methods can be interpreted and analyzed as variants of simulated annealing suggests that some fundamental principles may indeed exist. For example, the well known genetic algorithm [17] fits into the framework of the simulated annealing paradigm when mutations and cross-breeding are interpreted as a special choice of "elementary moves" and selective "breeding" of "good" solutions as an intelligent use of information gained from past trials. That the definition of elementary moves, the critical observations of past trials, and the bias of the random walk are motivated by the biological process of evolution does not change the fact that this algorithm fits into the structure of simulated annealing. As we pointed out in the Introduction, annealing is not the only process that can serve as a model for the development and analysis of biased random search optimization methods.

Acknowledgement: The authors gratefully acknowledge helpful discussions with J.H. Hall and the permission to use his results.

REFERENCES
1. Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A. and Teller, E., J. Chem. Physics, v. 21, pp 1087-1092 (1953).
2. Kirkpatrick, S., Gelatt, C.D. Jr., and Vecchi, M.P., Science, v. 220, pp 671-680 (1983).
3. Vanderbilt, D. and Louie, S.G., J. of Comp. Physics, v. 56, pp 259-271 (1984).
4. Bohachevsky, I.O., Johnson, M.E. and Stein, M.L., Technometrics, v. 28 (3), pp 209-217 (1986).
5. Bohachevsky, I.O., Johnson, M.E., and Stein, M.L., J. of Comp. and Graphical Statistics, v. 1 (4), pp 367-384 (1993).
6. Weller, S.W., in Proceedings of SPIE, v. 818, "Current Developments in Optical Engineering," R.E. Fischer and W.J. Smith, eds, pp 265-274 (18-21 Aug. 1987, San Diego, CA).
7. Forbes, G.W. and Jones, A.E.W., in SPIE Proceedings, v. 1354, pp 144-153 (International Lens Design Conference, 11-14 June, 1990, Monterey, CA).
8. Jones, A.E.W. and Forbes, G.W., submitted to J. of Global Optimization, Apr. 1992.
9. Bohachevsky, I.O., Viswanathan, V.K., and Woodfin, G., in SPIE Proceedings, v. 485, pp 104-112 (Applications of Artificial Intelligence, Arlington, VA, 3-4 May, 1984).
10. Viswanathan, V.K., Bohachevsky, I.O. and Cotter, T.P., in SPIE Proceedings, v. 554, pp 10-17 (International Lens Design Conference, 10-13 June, 1985, Cherry Hill, NJ).
11. Bohachevsky, I.O., Johnson, M.E. and Stein, M.L., Amer. J. of Math. and Management Sciences, v. 8, pp 361-387 (1988); also Los Alamos Nat'l Lab Report LA-10940-MS (1987).
12. Bohachevsky, I.O., Johnson, M.E., and Stein, M.L., Proceedings, 27th IEEE Conference on Decision and Control, Austin, TX, 7-9 December, 1988.
13. Ryan, R.R., Hall, J.H., Bohachevsky, I.O., Triay, I.R. and Stein, M.L., Am. Chem. Soc., Abstracts of Papers, v. 193, Phys-0134 (Apr. 1987).
14. van Laarhoven, P.J.M. and Aarts, E.H.L., Simulated Annealing: Theory and Applications, Kluwer, 1987.
15. Hall, J.H. and Hiromoto, R., LA-UR-88-1423, March 1988; Supercomputing '88 Conf., Orlando, Florida, Nov. 14-18, 1988.
16. Rutenbar, R.A., IEEE Circuits and Devices Magazine, Jan. 1989, pp 19-26.
17. Goldberg, D.E., Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley Publishing Co., Inc., 1989.
Chapter 2

Comparison of algorithms for wavelength selection

Uwe Hörchner and John H. Kalivas (1)
Department of Chemistry, Idaho State University, Pocatello, Idaho 83209, USA
1. INTRODUCTION

The simulated annealing (SA) algorithm has proven to be suitable for large scale optimization problems. However, optimization results are limited if applications of SA ignore problem specific issues. Accordingly, this chapter first introduces in section 2.1 the analytical problem of wavelength selection for spectroscopic multicomponent analysis. Section 2.2 goes on to briefly discuss the SA algorithm for discrete combinatorial searches. A variation of SA known as generalized SA (GSA) is described in this section as well. Section 2.2 also introduces threshold acceptance (TA) as a generalization of the probabilistic acceptance principle of SA. Section 2.3 targets the escape-and-catch problem encountered in locating exact extrema of discrete functions by means of any SA type algorithm. A method that modulates stepwidths for expansion and shrinkage is proposed to overcome the escape-and-catch dilemma. The method is named GSA with modulated stepwidth (GSAMS). Finally, sections 2.4 and 2.5 describe appropriate parameters specific to the investigated multidimensional optimization problem of wavelength selection. Section 3 reports results from applying GSA, GSAMS, and TA to the problem of wavelength selection for different spectroscopic situations. Further detailed results
and discussions of GSA, GSAMS, and TA applied to wavelength selection are available [1,2].

With regard to the general nature of the discussion in this chapter, the following terms will be used. The optimization (cost) function f(x) has k independent variables (parameters, wavelengths) such that x = (x1, x2, ..., xk)T. The function value f(x) for a certain parameter set (configuration, wavelength set, wavelength configuration) x is referred to as the response of the cost function. The term operational parameter settings (OPS) is not related to the parameters of the cost function, but to the control parameters for the respective optimization algorithms. The term quality of an optimization run is used to characterize how closely an optimization process approaches the sought extreme. Because a difference between maximization and minimization of a cost function does not technically exist, the general term increase in quality is used to characterize an optimization step towards the sought global extreme. A decrease in quality marks a move away from the optimal configuration. Wavelengths are referenced by an absolute index rather than by their real
¹Author to whom correspondence should be sent.
wavelength value. Capitalized bold characters denote matrices and lowercase bold characters denote column vectors. A superscript T symbolizes transposition of the corresponding matrix or vector. Superscripts -1 and + indicate the inverse and pseudo-inverse of a matrix, respectively.
2. THEORY

2.1. Spectroscopic calibrations and wavelength selection
Quantitative spectroscopic multicomponent analysis is based on the direct proportionality between absorbance at a particular wavelength (a) and concentration of the chemical constituent (c). This relationship is known as the Beer-Lambert law [3]. For multicomponent systems, the measured absorbance at the jth wavelength equals the sum of individual contributions from the n responding components and is expressed as

$a_j = \sum_{i=1}^{n} k_{ji} c_i$
where k_{ji} denotes the proportionality factor. Substituting absorbance with a more general spectroscopic response variable r, the matrix form of the system of linear equations for measurements at p wavelengths using m calibration samples is
$R = CK + E$  (1)
where R is the m x p response matrix, C denotes the m x n concentration matrix, and K signifies the n x p calibration (sensitivity) matrix. An additional error term E of the same dimensions as R describes deviations from the linear model, e.g., chemical interactions of constituents, polychromatic measurement conditions, random noise, and systematic errors. In order to predict concentrations of unknown samples, equation (1) must be solved for K. The most common approach is to approximate K by least squares using
$\hat{K} = (C^{T}C)^{-1}C^{T}R$  (2)

Once K̂ is calculated, unknown concentrations are predicted by
$\hat{c}_{\mathrm{unk}} = (\hat{K}\hat{K}^{T})^{-1}\hat{K}\, r_{\mathrm{unk}}$  (3)
Equations (2) and (3) outline the classical calibration and prediction approach and the combination is often referred to as K-matrix analysis. The K-matrix analysis approach requires quantitative calibration for all n components of the chemical system, even if they are of no interest for future predictions. Solution of equation
(2) demands m ≥ n and p ≥ n. K-matrix calibration yields estimates of pure-component spectra, i.e., the rows of K̂ are the corresponding pure-component spectra. While calibration is numerically a relatively stable process, predictions may become problematic. Collinearities are passed from the spectroscopic response matrix R to K̂, and thus, under certain circumstances, it may be difficult to obtain an accurate inverse of K̂K̂ᵀ. An alternative method, often referred to as P-matrix analysis, avoids disadvantages present with K-matrix analysis by viewing concentration as a function of the spectroscopic responses. Mathematically,
$C = RP + E$  (4)
where P denotes the p x n calibration matrix. Unknown concentrations are predicted by a simple multiplication of the sample spectrum r_unk with the estimated calibration matrix P̂ obtained from equation (4), that is, P̂ = R⁺C. A general requirement for P-matrix analysis is n = rank(R). Unfortunately, for most practical cases, the rank of R is greater than the number of components, i.e., rank(R) > n, and rank(R) = min(m, p). Thus, P-matrix analysis is associated with the problem of substituting R with an R' that produces rank(R') = n. This is mostly done by orthogonal decomposition methods, such as principal component analysis, partial least squares (PLS), or continuum regression [4]. Dimension requirements of the involved matrices for these methods are m ≥ n and p ≥ n. If the method of least squares is used, additional constraints on matrix dimensions are needed [4]. The approach of P-matrix analysis does not require quantitative concentration information for all constituents. Specifically, calibration samples with known concentrations of the analytes under investigation satisfy the calibration needs. The method of PLS will be used in this chapter for P-matrix analysis. In an error free environment, any combination of n wavelengths that respond to the components would be suitable for calibration and prediction. In practice, each wavelength contributes in an unknown way both noise and nonlinear effects to the error term E. Thus, the analyst is confronted with the problem of choosing a suitable set of wavelengths for calibration and prediction. One approach consists of selecting wavelengths with a known correlation between the response and the chemical structure of an analyte, e.g., absorbance at 1940 nm and 2310 nm and moisture in flour. A disadvantage of such a highly problem specific approach is that wavelengths suitable for predicting moisture in the specific matrix of flour may not be appropriate for substances with different chemical matrices. An alternative method would be to use complete spectra and hope that discrepancies between the linear calibration model and the real world data are leveraged out by the large pool of spectral information. For example, the PLS method is said to be suitable for such a type of P-matrix analysis. However, deterioration of measured spectra is more likely to be attributed to systematic physical effects than to uncorrelated random noise. Despite excellent results obtained using full spectra PLS calibrations and predictions compared to other linear projection methods, prediction errors can still be significantly higher than for
calibrations based on a few selected undisturbed wavelengths. A detailed discussion of K-matrix and P-matrix analysis with respect to wavelength selection, including references, is available [4]. In the past, wavelength selection was mostly performed by means of forward selection (FS) and backward elimination (BE). The BE method simply excludes the most unimportant or disturbing wavelengths from the spectra until a certain quality is reached or a predefined number of wavelengths remains. Analogously, FS adds the most suitable wavelength to the calibration pool. An advantage of these methods is their relative speed. For example, to find the 'best' subset with k = 4 wavelengths from a spectrum with p = 50 wavelengths (the typical size of an ultraviolet/visible spectrum), BE inspects 50 + 49 + 48 + ... + 5 = 1265 combinations. The number of all possible combinations with 4 wavelengths is $\binom{50}{4} = 2.3 \times 10^{5}$. However, the right number k of wavelengths to use is rarely known in practice, and thus, all combinations of 50 down to 4 wavelengths would have to be tested; the search space for this example increases to $\sum_{i=4}^{50} \binom{50}{i} \approx 1.13 \times 10^{15}$ combinations. Note that infrared
(IR) or near infrared (NIR) spectra typically include 1,000 wavelengths to choose from, creating a vast number of possible combinations. Large scale problems such as these make an SA type algorithm an interesting technique to search for suitable wavelength subsets.
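As a quick check on these search-space sizes, the counts quoted above can be reproduced with a few lines of Python; the snippet below is only an illustration and is not part of the original study.

from math import comb

# Number of distinct 4-wavelength subsets selectable from a 50-point spectrum.
print(comb(50, 4))                               # 230300, about 2.3 x 10^5

# Cost of backward elimination from 50 down to 4 wavelengths:
# at each step every remaining wavelength is tried for removal.
print(sum(range(5, 51)))                         # 1265 evaluated combinations

# If the subset size itself is unknown, every size from 4 to 50 must be searched.
print(sum(comb(50, i) for i in range(4, 51)))    # about 1.13 x 10^15

# Near-infrared spectra with ~1000 channels make enumeration hopeless:
print(comb(1000, 8))                             # about 2.4 x 10^19 eight-wavelength subsets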
2.2. Simulated annealing, Boltzmann statistics and threshold acceptance
Simulated annealing is based on the analogy between the process of arranging a cluster of particles in their ground state and minimization of a generic optimization function [5]. Because of this analogy, a Boltzmann type probability function is employed to decide whether or not to accept detrimental configurations. To find the parameter configuration with the highest quality at the system's current temperature, i.e., the equivalent of the ground state, SA performs a large number of function evaluations. Simulation of a slow cooling process requires the definition of a cooling schedule. The number of function evaluations at each temperature has been the subject of extensive work [6]. However, a general method to define the best settings of optimization parameters using SA for diverse optimization problems has not been found. To elude this problem, generalized simulated annealing (GSA) was introduced [7]. Instead of calculating the acceptance probability of a detrimental configuration p(x_new) from an external absolute temperature, GSA uses the optimization progress towards the expected extreme function value Φ(x_exp):

$p(x_{\mathrm{new}}) = \exp\left(-\beta\,\frac{\Phi(x_{\mathrm{new}}) - \Phi(x_{\mathrm{cur}})}{\Phi(x_{\mathrm{cur}}) - \Phi(x_{\mathrm{exp}})}\right)$  (5)

where Φ(x_cur) is the response value of the current position, i.e., the last accepted configuration, and β is an adjustable partition constant. Numerous applications of GSA in chemometrics and other fields of
physical sciences have indeed shown that it causes less trouble than the classical SA algorithm. Furthermore, equation (5) shows that GSA departs from the thermodynamical pattern of SA. Namely, GSA introduces an instant change of the system's temperature if a new configuration is accepted, rather than holding the temperature constant until the optimal arrangement has been found at that temperature. Thus, it seems reasonable to question the use of the Boltzmann statistic at all. Viewing the SA algorithm in terms of Markov chains, Greene and Supowit [8] pointed out that any type of function may be used for the decision making process about acceptance of new configurations, provided the detailed balance equation for the Markov process is satisfied. A rigorous generalization of the acceptance criterion was introduced by Dueck and Scheuer [9] with the so-called threshold acceptance algorithm (TA). The relatively computationally expensive Boltzmann statistic is substituted by the rule (for minimization): accept all improving configurations and all detrimental configurations with a response value equal to or less than Φ(x_cur) + t, with threshold t > 0. Explicitly,

$p(x_{\mathrm{new}}) = \begin{cases} 1, & \Phi(x_{\mathrm{new}}) - \Phi(x_{\mathrm{cur}}) \le t,\ t > 0 \\ 0, & \text{else} \end{cases}$

Similar to SA, the threshold value t is lowered during the optimization process. As with the SA algorithm, this can be done with an absolute schedule, i.e., assigning t to a list of predefined values, or with a relative procedure by associating the update of t with the progress of the optimization run. For the wavelength selection problem, t evolves in a semi-absolute manner. It is reduced after Φ(x_bsf), the best response value found so far, has not improved for a string of consecutive trials. A geometric series, t_{i+1} = δ t_i with δ ≤ 1, is used to reduce the threshold during wavelength optimization. Besides the advantage of a more liberal acceptance rule, TA does not demand any information about the expected extreme response value Φ(x_exp). See Appendix A for a pseudo-code listing of a generic TA algorithm.
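The two acceptance rules can be contrasted in a short sketch. The following Python fragment is a minimal illustration assuming the form of equation (5) given above and a multiplicative threshold schedule; the function names and the default reduction factor are illustrative and do not come from the chapter's appendices.

import math
import random

def gsa_accept(phi_new, phi_cur, phi_exp, beta, minimize=True):
    # Generalized simulated annealing acceptance, patterned after equation (5).
    improved = phi_new < phi_cur if minimize else phi_new > phi_cur
    if improved:
        return True
    # Detrimental step: acceptance probability shrinks as the current response
    # approaches the expected extreme phi_exp (denominator goes to zero).
    p = math.exp(-beta * (phi_new - phi_cur) / (phi_cur - phi_exp))
    return random.random() < p

def ta_accept(phi_new, phi_cur, threshold, minimize=True):
    # Threshold acceptance: accept every configuration that is not worse
    # than the current one by more than the threshold t > 0.
    delta = phi_new - phi_cur if minimize else phi_cur - phi_new
    return delta <= threshold

def update_threshold(t, delta=0.92):
    # Geometric reduction applied whenever the best-so-far response has
    # stalled for a fixed number of trials (delta < 1 is an assumed value).
    return t * delta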
2.3. SA type algorithms and convergence to the exact extreme
As already noted, all SA type algorithms require a mechanism by which detrimental steps can be evaluated for acceptance. They also have in common the requirement of some type of configuration generator, i.e., the part of the algorithm in charge of creating the next trial configuration. Most SA type algorithms use a random-based process, while some use a deterministic method to define the direction and stepwidth of the next trial [10], e.g., based on topological information collected during prior moves. In order to escape local extrema without operator supervision, the configuration generator must maintain stepwidths large enough to leave a local extreme within a few successive configurations. At the same time, the configuration generator must respect the basic rule of a neighborhood search. Namely, new configurations should classify as statistically small variations of the current configuration. Typically,
SA (GSA) employs a three-step mechanism to build a new trial configuration [7]; a minimal sketch of this generator follows the list:
1. Choose a random direction vector d with elements drawn from N(0, 1).
2. Normalize d to unit length, d = d/||d||.
3. Define the new configuration x_new based on the current parameter set: x_new,i = x_cur,i + d_i s_i, i = 1, 2, ..., k.
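A minimal sketch of this three-step generator, assuming continuous parameters and NumPy for the random direction (names and values are illustrative only):

import numpy as np

def next_configuration(x_cur, s, rng):
    # Three-step SA/GSA trial generator with a constant stepwidth vector s.
    d = rng.standard_normal(len(x_cur))   # step 1: random direction, elements from N(0, 1)
    d = d / np.linalg.norm(d)             # step 2: normalize to unit length
    return x_cur + d * s                  # step 3: x_new,i = x_cur,i + d_i * s_i

# Example: propose a trial move from a 5-parameter configuration.
rng = np.random.default_rng(0)
x_trial = next_configuration(np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
                             s=np.full(5, 0.5), rng=rng)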
The stepwidth vector s is defined prior to the optimization and kept constant during the optimization run. Due to the normalization of d (step 2), all generated configurations are placed on a rotation ellipsoid of dimensionality corresponding to s. This makes it difficult for the algorithm to locate the exact extreme of an optimization function. Figure 1 shows the final stage of the optimization of a two dimensional discrete function. Assume that the discrete function has a parabolic shape in the surroundings of the extreme E = (6, 6) and that the current configuration C = (5, 5) is close to E, i.e., Φ(C) ≈ Φ(E). For this example the configuration generator obeys the rule Σᵢ|x_new,i − x_cur,i| = 3, which is similar to the three-step rule listed above. Despite the close proximity of C to E, E cannot be reached with one direct step from C. Each of the parameter combinations marked with a diamond (open or filled) in Figure 1 can be selected as the next trial configuration when stepping away from C. Movements from parameter combinations marked with a filled diamond can reach E in the next step. Movements from parameter combinations marked with an open diamond cannot reach E in a second successive step. Thus, for Figure 1a, only six of the 12 possible configurations to choose from when moving away from C can hit the extreme in the next step. The formal probability of locating E with two consecutive trials for Figure 1a is 6/12 x 1/12 = 4.17 x 10⁻². Because the response function shown in Figure 1a is assumed to have a parabolic shape in the surroundings of E, all configurations with the possibility of hitting E from C in two successive steps have a response worse than C, i.e., Φ(x_filled) < Φ(C) < Φ(E). Thus, all filled-diamond positions represent detrimental moves from C whose acceptances are decided with the Boltzmann probability function. These positions will probably not be accepted because the acceptance probabilities are extremely low due to the denominator in equation (5) being close to zero, i.e., Φ(C) ≈ Φ(E). Note that either of the two possible moves from C to I₁ or I₂ shown in Figure 1a would improve the quality, i.e., Φ(C) < Φ(I₁), Φ(I₂) < Φ(E). From these configurations several other intermediate positions can be reached that have the ability to hit E in the next step. Because Φ(I₁) and Φ(I₂) represent the closest possible values to Φ(E), all of these intermediate steps are detrimental. Due to the closeness of Φ(I₂) to Φ(E), the acceptance probabilities of these intermediate steps are even lower than when moving from C to E in two steps, and they would also probably not be accepted. In general, because an optimization run terminates after a certain number of consecutive nonaccepted trials, SA (GSA) returns a configuration in the vicinity of the global extreme. A natural way to overcome this constant stepwidth flaw in SA (GSA) is to modify the configuration generator to search within a predefined radius. This approach shall be named modified GSA (MGSA).
Figure 1. Catch-and-escape dilemma of SA type algorithms for a discrete two dimensional function Φ(x₁, x₂). To escape local extrema, a certain minimal stepwidth must be observed by the configuration generator. The finite stepwidth at the same time makes it difficult to approach the exact extreme. (a) The GSA configuration generator searches at a predefined radius with respect to the rule Σᵢ|x_new,i − x_cur,i| = 3; (b) MGSA searches within the predefined radius Σᵢ|x_new,i − x_cur,i| ≤ 3.
Figure 1. Continued from previous page. (c) GSAMS in shrink mode with Σᵢ|x_new,i − x_cur,i| ≤ 2.
Movement within a radius can be accomplished by rescaling the direction vector d, such that step 2 becomes: 2. Normalize d to unit length, d = d/||d||, and rescale it, dᵢ = dᵢ rᵢ, with rᵢ drawn from a uniform distribution and 0 < rᵢ < 1. Note that GSA and MGSA are identical in all other aspects. With MGSA, the generation rule changes to Σᵢ|x_new,i − x_cur,i| ≤ 3. The statistics for the two dimensional discrete function are only marginally affected by this modification. As can be seen from Figure 1b, the number of configurations to choose from increases to 24 and the extreme E can now be found by a direct move from C with a probability of 1/24 = 4.17 x 10⁻². The probability for a second step hit, i.e., selection and acceptance of a filled-diamond configuration followed by a move to E, decreases to 16/24 x 1/24 = 2.78 x 10⁻² due to the larger number of total possible selections. The chance of locating E from C within one or two moves is given by the sum of both independent events and equals 6.94 x 10⁻². This can be compared to 4.17 x 10⁻² for the original rule Σᵢ|x_new,i − x_cur,i| = 3. For the situation illustrated in Figure 1b, only 4 out of the 16 potential candidates for a successful step to E can be accepted without a cross-check against the Boltzmann function. Significantly improved results for GSA can be achieved with a procedure that features modulated stepwidths (GSAMS). Three modes exist for operation of GSAMS:
1. Normal mode: the algorithm uses stepwidths as do GSA or MGSA, depending on the approach selected.
2. Expand mode: when the maximum number of consecutive nonaccepted steps is reached, it is assumed that the algorithm has become trapped at a local extreme because the stepwidth is too small. GSAMS continues with another set of trials using an increased search radius.
3. Shrink mode: a configuration has not been accepted for a certain number of successive trials while in expand mode. It must then be assumed that the sought extreme lies within the stepwidth used in normal mode. Another set of trials is performed using a decreased stepwidth. If no acceptable configuration is found, the stepwidth shrinks further.
The GSAMS algorithm terminates after a given number of shrink cycles fails to yield the acceptance of any configuration. Anytime a configuration is accepted during an expand or shrink mode, the algorithm switches back to normal mode and restores the initial search radius. In order to increase the speed of convergence, the expand mode can be discarded once the difference between the sought extreme value and the best value found so far is within a specified tolerance. Figure 1c illustrates an example of GSAMS in shrink mode using the reduced stepwidth Σᵢ|x_new,i − x_cur,i| ≤ 2. This produces a higher probability of locating the extreme E within one or two steps. The probability for a direct transition from C to E in one step is 1/12. A two step hit of E can occur from six of the 12 possible configurations. The overall probability of locating E in one or two steps is therefore 1/12 + 6/12 x 1/12 = 0.125. Further shrinkage of the search space to Σᵢ|x_new,i − x_cur,i| = 1 excludes the possibility of a direct move from C to E in one step, while the probability for a move to E in a second step remains the same as before, i.e., 2/4 x 1/4 = 0.125. Note that a shrink process is supposed to be preceded by an expand mode, but this aspect was not demonstrated in this brief discussion. See Appendix B for the pseudo-code listing of the GSAMS algorithm.
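The three modes can be tied together in a compact control loop. The sketch below is one possible reading of the normal/expand/shrink logic described above, written for minimization; the function names, the expand factor of 2, and the termination bookkeeping are assumptions of this sketch, not the authors' implementation (see Appendix B for that).

def gsams_search(phi, x0, generator, accept, radius, max_reject=30,
                 expand_factor=2.0, shrink_factor=0.6, max_shrink_cycles=4):
    # generator(x, r) proposes a trial within search radius r;
    # accept(phi_new, phi_cur) decides on detrimental steps (e.g., equation (5)).
    x_cur, r = x0, radius
    mode, shrink_cycles, rejected = "normal", 0, 0
    best = x_cur
    while True:
        x_new = generator(x_cur, r)
        if phi(x_new) <= phi(x_cur) or accept(phi(x_new), phi(x_cur)):
            x_cur, rejected = x_new, 0
            mode, r, shrink_cycles = "normal", radius, 0   # restore initial radius
            if phi(x_cur) < phi(best):
                best = x_cur
            continue
        rejected += 1
        if rejected < max_reject:
            continue
        rejected = 0
        if mode == "normal":                 # trapped: try larger steps
            mode, r = "expand", r * expand_factor
        elif mode == "expand":               # still stuck: search closer in
            mode, r = "shrink", radius * shrink_factor
        elif shrink_cycles < max_shrink_cycles:
            r *= shrink_factor               # keep shrinking
            shrink_cycles += 1
        else:
            return best                      # terminate after the final shrink cycle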
2.4. Wavelength selection optimization functions
As discussed in Section 2.1., two distinct calibration and prediction situations exist, i.e., K-matrix and P-matrix analysis. Because K-matrix and P-matrix analysis have distinct requirements and generate different calibration matrices, the optimization criteria are treated separately. However, regardless of the approach, good calibration and prediction results are generally expected with selected wavelengths that behave linearly and maintain good selectivity and sensitivity. In addition, ideal wavelengths are those with a large signal to noise ratio. Since this chapter and book focus on the SA optimization algorithm and not on criteria for wavelength selection, the separate criteria for K-matrix and P-matrix analysis are only briefly described. It should be noted that these criteria are not necessarily the best for obtaining acceptable prediction results. Further information about the effects of wavelength selectivity, sensitivity, and noise can be found in references 4 and 11. Two different measures of selectivity from the literature were chosen as optimization functions for K-matrix analysis. The first is an estimator of overall selectivity in K̂ called global selectivity (GSEL) [11], defined as

$\mathrm{GSEL}(\hat{K}) = \frac{\prod_{i=1}^{n} \sigma_i}{\prod_{i=1}^{n} \lVert \hat{k}_i \rVert}$

where σᵢ denotes the ith singular value of K̂, k̂ᵢ represents the ith row of K̂, ||·|| symbolizes the Euclidean norm, and it is assumed that rank(K̂) = n. The range of GSEL is 0 ≤ GSEL ≤ 1, with 0 representing no selectivity and 1 denoting perfect selectivity. The second selectivity criterion (SEL) is also a global measure of selectivity in K̂ and is a combination of the local selectivities SELᵢ for the i = 1, 2, ..., n components [12]. Briefly,
$\mathrm{SEL} = \frac{n}{\sum_{i=1}^{n} 1/\mathrm{SEL}_i}, \qquad \mathrm{SEL}_i = \frac{1}{\lVert \hat{k}_i \rVert\, \lVert \hat{k}_i^{+} \rVert}$

where k̂ᵢ⁺ denotes the ith column of the pseudo-inverse of K̂. The range of SEL is 0 ≤ SEL ≤ 1, with 0 representing no selectivity and 1 denoting perfect selectivity. For either measure, the set of wavelengths that maximizes selectivity in K̂ is deemed best. Both measures have been used in previous studies related to the wavelength selection problem [1,13,14].
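Assuming the forms of GSEL and SEL given above (the local selectivity SELᵢ in particular is an inferred form), both criteria reduce to a few NumPy calls on the estimated pure-component matrix K̂ (rows = spectra at the selected wavelengths):

import numpy as np

def gsel(K):
    # Global selectivity: product of singular values of K divided by the
    # product of the Euclidean norms of its rows (1 = perfectly selective).
    sv = np.linalg.svd(K, compute_uv=False)
    row_norms = np.linalg.norm(K, axis=1)
    return np.prod(sv) / np.prod(row_norms)

def sel(K):
    # Harmonic-mean combination of per-component local selectivities, here
    # taken as SEL_i = 1 / (||k_i|| * ||k_i^+||) with k_i^+ the ith column of
    # the pseudo-inverse -- an assumed reading of the criterion in [12].
    K_pinv = np.linalg.pinv(K)                      # p x n
    local = 1.0 / (np.linalg.norm(K, axis=1) * np.linalg.norm(K_pinv, axis=0))
    return len(local) / np.sum(1.0 / local)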
A commonly used measure of quality for a P-matrix analysis is the predicted residual sum of squares (PRESS) value, computed by

$\mathrm{PRESS} = \sum_{i=1}^{m} (\hat{c}_{\mathrm{cal},i} - c_{\mathrm{cal},i})^2$
The PRESS value is determined by leave-one-out cross-validation [15]. Basically, one spectrum at a time is removed from the set of calibration spectra, a calibration model is built from the remaining spectra, and the concentration for the excluded spectrum is estimated as ĉ_cal,i. The squared differences between these values and their respective known concentrations c_cal,i are summed up as PRESS. The set of wavelengths that minimizes the PRESS value is deemed best. If a sufficient number of calibration spectra are available, it is possible to split the spectra pool into a calibration set and one or more independent prediction set(s). The best set of wavelengths would then be the one that minimizes the standard error of prediction (SEP), defined as

$\mathrm{SEP} = \sqrt{\frac{\sum_{i=1}^{m} (\hat{c}_{\mathrm{pred},i} - c_{\mathrm{pred},i})^2}{m}}$

where m denotes the number of prediction samples. A significant advantage of a
SEP evaluation over PRESS is the time factor. To compute a single PRESS value, as many PLS models have to be formed as there are calibration samples. Conversely, only one PLS model is needed for SEP regardless of the number of spectra considered for calibration and prediction. In order to obtain a statistically valid SEP, it is crucial that the prediction set truly represents future prediction samples.
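For orientation, a hedged sketch of the two quality measures using scikit-learn's PLSRegression in place of the chapter's MATLAB implementation is given below; R is the response matrix restricted to the candidate wavelength subset and c the analyte concentrations, and all names are illustrative.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def press_loo(R, c, n_factors, subset):
    # Leave-one-out PRESS for the wavelength subset (column indices of R).
    X = R[:, subset]
    press = 0.0
    for i in range(len(c)):
        keep = np.arange(len(c)) != i                  # drop calibration sample i
        model = PLSRegression(n_components=n_factors)
        model.fit(X[keep], c[keep])
        c_hat = model.predict(X[i:i + 1])[0, 0]
        press += (c_hat - c[i]) ** 2
    return press

def sep(R_cal, c_cal, R_pred, c_pred, n_factors, subset):
    # Standard error of prediction for an independent prediction set:
    # a single PLS model, hence far cheaper than cross-validated PRESS.
    model = PLSRegression(n_components=n_factors)
    model.fit(R_cal[:, subset], c_cal)
    c_hat = model.predict(R_pred[:, subset]).ravel()
    return float(np.sqrt(np.mean((c_hat - np.asarray(c_pred)) ** 2)))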
2.5. Neighborhood definition for wavelength selection
Section 2.3. noted that the success of SA type algorithms strongly depends on the choice of the stepwidth. This requires an adequate definition of the term neighborhood for the wavelength search. The first application of GSA to wavelength selection simultaneously shifted all currently selected wavelengths by a constant number of wavelength units to generate the next trial combination [16]. This approach was followed by a more restrictive configuration generator that passed 25% to 50% of the current wavelengths unchanged into the new combination [17]. However, wavelengths designated for replacement were altered without restriction over the complete spectral range, i.e., the chosen wavelength was viewed as being independent of its previous position. The configuration generator in this study is regulated by significantly stricter rules:
1. Leave a certain fraction of the current subset unchanged, e.g., two thirds of the wavelengths.
2. Restrict the overall distance between the current subset and the new combination.
3. Restrict the maximum shift of each wavelength.
The modified Manhattan distance

$d_M = \sum_{i=1}^{k} \min_{j} \lvert x_{\mathrm{cur},i} - x_{\mathrm{new},j} \rvert$

is used to introduce a topology into the wavelength space and quantify the distance between two sets of wavelengths. Note that the absolute values of restrictions 2 and 3 depend on the bandwidth of spectral features and the digitization interval.
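One possible implementation of such a restricted configuration generator, with the modified Manhattan distance as reconstructed above and illustrative limits standing in for the actual OPS, is sketched below.

import random

def modified_manhattan(a, b):
    # Distance between two wavelength sets: for each index of the new set,
    # the distance to the nearest index of the current set, summed.
    return sum(min(abs(i - j) for j in a) for i in b)

def neighbor_subset(current, n_channels, keep_fraction=2/3,
                    max_total_dist=6, max_shift=3, rng=random):
    # Trial wavelength subset obeying the three rules above (values illustrative).
    current = sorted(current)
    k = len(current)
    keep_idx = set(rng.sample(range(k), int(round(keep_fraction * k))))  # rule 1
    while True:
        new = []
        for pos, w in enumerate(current):
            if pos in keep_idx:
                new.append(w)
            else:
                shift = rng.randint(-max_shift, max_shift)               # rule 3
                new.append(min(max(w + shift, 0), n_channels - 1))
        new = sorted(set(new))
        if len(new) == k and modified_manhattan(current, new) <= max_total_dist:
            return new                                                   # rule 2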
3. RESULTS

3.1. K-matrix analysis
A set of four pure-component ultraviolet-visible spectra, each containing 76 data points, was taken from reference 18, representing a K-matrix calibration situation, and is referred to as the UV1 data set (see Figure 2). To limit the discussion, results for subsets with 5 wavelengths are reported.
Figure 2. Frequency distribution of selected wavelengths using method 2 in Table 3, overlaid with the investigated ultraviolet spectra: (o) 3-carbethoxy-9-methyl-4-pyrido[1,2-a]pyrimidinone, (+) 4-hydroxy-butyrophenone, (x) 4-pyridinecarboxylic acid 4-methylanilide and (-) 3-acetylaminophenanthrene. The grouping of frequently selected wavelengths stems from the broad bandwidth of spectral features and permits exclusion of most parts of the spectra from further search.
From enumerating searches, i.e., checking all possible wavelength combinations, the best subsets are known for both optimization functions GSEL and SEL. Evaluation of the hitlists obtained from the enumerating searches reveals two important properties of the optimization functions. The SEL and GSEL response surfaces are nearly gradientless in the vicinity of the best wavelength subset, and the general shape of the response surfaces is disturbed by a certain roughness. Fragments of the complete hitlist for the SEL function given in Table 1 establish these attributes. From Table 1 it can be seen that the SEL value of the optimal wavelength subset for the UV1 data set differs from the 100th best subset by only 1.3%, verifying the generally flat shape of the SEL function. The roughness is substantiated by the distances between the respective wavelength sets and the best wavelength set listed in Table 1. Note that the smallest possible distance (d_M = 1) is found as far down as entry 74, while subsets with d_M ≥ 4 are found beginning at position 17.
Table 1
Various SEL values for the UV1 data set. Column 1 indicates the placement of the listed subset in the ordered list from the enumerating search. The last column gives the distance d_M obtained between the specified subset and the best subset (No. 1)

No.   SEL      Wavelength index   d_M
1     0.5804   15 38 50 51 76     0
2     0.5800   14 38 50 51 76     1
3     0.5793   15 38 50 51 75     1
4     0.5791   14 39 50 51 76     2
5     0.5791   15 39 50 51 76     1
6     0.5789   14 38 50 51 75     2
7     0.5781   14 39 50 51 75     3
13    0.5773   15 38 50 52 76     1
14    0.5773   13 39 50 51 76     3
15    0.5772   16 38 50 51 76     1
16    0.5772   14 38 50 52 76     2
17    0.5770   14 39 50 51 74     4
41    0.5754   14 39 50 51 73     5
42    0.5754   14 39 50 53 75     5
56    0.5747   13 38 50 53 76     4
57    0.5747   16 37 50 51 76     2
58    0.5746   14 39 50 52 74     5
73    0.5738   13 39 50 53 75     6
74    0.5738   15 38 49 50 76     1
75    0.5737   13 39 50 51 73     6
96    0.5730   14 39 50 52 73     6
97    0.5730   17 38 50 51 76     2
100   0.5729   13 37 50 51 75     4
The hitlist for the GSEL function is similar to the SEL hitlist reported in Table 1. The two observed properties can be explained by the spectroscopic character of the wavelength selection problem. The flat shapes of the selectivity functions stem from the broad spectral features of ultraviolet-visible spectra. Neighboring wavelengths contain highly collinear information at the given resolution, and thus, no major differences result from exchanging adjacent wavelengths. The roughness of the selectivity hypersurfaces arises from random noise in the measured data. Backward elimination (BE) was first applied to the data set using both selectivity criteria. In order to reduce the initial 76 wavelengths to a set of 5 wavelengths, 76 + 75 + ... + 6 = 2911 combinations must be calculated. For the GSEL criterion, the combination 15/ 37/ 51/ 52/ 76 with GSEL = 0.3254 was identified as best, while BE found the subset 16/ 37/ 50/ 51/ 76 with SEL = 0.5747 to be best. These values
represent the 26th best subset for GSEL and the 57th best combination for the SEL function. Because optimization paths for SA type algorithms vary profoundly due to their probabilistic features, 50 optimizations were performed to characterize the average fluctuations in search paths. The 50 repetitive runs used identical initial wavelength subsets, but different random number generator seed values. Table 2, methods 1A, 1B, and 1C, lists the SEL results. The GSA algorithm produces relative standard deviations for the total path length of about 70% and roughly 80% for the number of iterations needed to locate the best combination. The relative standard deviations decrease to approximately 25% and 40%, respectively, for GSAMS and to less than 15% and 30%, respectively, for TA. Modulating the stepwidth and eliminating the Boltzmann statistic appear to substantially reduce fluctuations in a search path. Table 2 also lists results for GSA, MGSA, GSAMS, and TA averaged over 50 consecutive runs where each run was started with a randomly selected wavelength subset.

Table 2
Averaged results for 50 runs optimizing subsets with five wavelengths for the UV1 data set using the SEL criterion

Optimization method^a   Φ(x_exp)   β       Φ(x_bsf)^b   No. of hits^c   D^d    E^e    F^f      G^g
1A                      0.59       1.0     0.5501       —               467    383    544      384
1B                      0.59       3.0     0.5801       39              414    178    601      262
1C                      —          —       0.5753       2               489    143    616      82
2a                      1.00       8.0     0.5444       —               524    444    873      533
2b                      0.59       1.5     0.5446       —               258    170    316      172
3                       0.59       1.5     0.5741       2               488    296    594      327
4a                      1.00       3.0     0.5546       —               811    449    1500^h   0
4b                      1.00       100.0   0.5585       30              348    196    815      450
4c                      0.59       3.0     0.5803       45              481    214    840      373
5                       —          —       0.5740       2               533    150    676      88

^aMethods 1 use the fixed initial subset 1/ 2/ 3/ 4/ 5: (A) GSA, (B) GSAMS, and (C) TA. All other entries used random initial subsets: methods 2 = GSA, 3 = MGSA, 4 = GSAMS, and 5 = TA. ^bBest found response averaged over 50 runs. ^cNumber of runs that found the known optimal subset. ^dAverage number of iterations to find the best subset in each run. ^eStandard deviation for column D. ^fAverage number of iterations per run. ^gStandard deviation of column F. ^hMaximum allowed number of iterations.
Using different initial wavelength sets avoids anisotropic effects of the optimization functions, i.e., a dependence of the optimization outcome on the initial position. As expected, GSA is not able to identify the best subset due to the fixed stepwidth. The MGSA approach performed slightly better, locating an average best subset with SEL = 0.5741 compared to 0.5444 for GSA. Excellent results were acquired with GSAMS for the case of good OPS (method 4c). Problems arise for the three variations GSA, MGSA, and GSAMS when poor OPS are used. For example, two particular OPS, among others, for GSA, MGSA, and GSAMS are β and Φ(x_exp) used in equation (5). From a previous investigation based on multidimensional mathematical functions [1], it was found that using β ≥ 5.0 worked well when Φ(x_exp) was close to the actual global optimum function value. Without knowledge of the actual global optimum value, it seems reasonable to initially set β = 3.0 and Φ(x_exp) = 1.00 for ideal selectivity. Because the actual global optimum is SEL = 0.5804, this initial estimate of Φ(x_exp) causes nearly all detrimental steps to be accepted for all three variations. For GSAMS method 4a in Table 2, the algorithm does not terminate until the maximum number of iterations set by the user is performed, in this case, 1500. One way to decrease the high acceptance rate of detrimental steps is to increase β. Method 4b in Table 2 uses β = 100.0 and Φ(x_exp) = 1.00. Notice that with this β value, GSAMS is able to locate the best subset. Figure 3 shows that many of these optimization runs succeed in locating wavelength subsets very near the best subset as well. However, with these OPS values, GSAMS becomes trapped at local extrema for approximately 20% of the 50 runs. The low average SEL listed in Table 2 for method 4b confirms this. Method 4c in Table 2 reveals that results for GSAMS improve significantly when the expected extreme response value is set slightly greater than the actual global SEL value. The global subset was identified in 45 out of the 50 runs, and the average SEL differs from the global value by only 0.02%. References 7, 16, 17, and 19 describe how to identify good β and Φ(x_exp) values for any optimization problem. Reference 1 contains additional information with regard to OPS for GSAMS, i.e., normal mode, expand mode, and shrink mode. Because TA does not use the Boltzmann statistic for acceptance of detrimental steps, manipulations of β and Φ(x_exp) are unnecessary. However, two other problem specific operational parameters must be defined: the initial threshold t₁ and the threshold factor δ. Results from optimization of generic mathematical functions indicate that the initial threshold should be set to less than 10% of the function range and decreased at a rate of about 10% each time it is updated [1]. For the wavelength selection problem based on SEL, the optimization was initialized with t₁ = 0.07 and δ = 0.92. Two additional OPS were set to values comparable to those of GSAMS. Specifically, the threshold was updated if there was no improvement of Φ(x_bsf) for 15 trials, and the process terminated after 30 nonaccepted steps. Inspection of Table 2 discloses that while TA converged to the best subset for only two of the 50 optimization runs, it maintained a high average SEL value of 0.5740. Additionally, Figure 3 verifies that TA rarely became trapped in local extrema. Values tabulated in Table 3 demonstrate that the algorithms work with GSEL as well as with SEL.
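To make the influence of β and Φ(x_exp) in the OPS discussion above concrete, consider an illustrative calculation with assumed values that are not taken from the tables: suppose the current configuration has Φ(x_cur) = 0.57 and a detrimental trial gives Φ(x_new) = 0.55. With the ideal-selectivity estimate Φ(x_exp) = 1.00 and β = 3, equation (5) yields p = exp(−3 · (0.55 − 0.57)/(0.57 − 1.00)) = exp(−0.14) ≈ 0.87, so the detrimental step is almost always accepted. Raising β to 100 with the same Φ(x_exp) gives p = exp(−4.65) ≈ 0.01, while the realistic estimate Φ(x_exp) = 0.59 with β = 3 gives p = exp(−3 · (0.55 − 0.57)/(0.57 − 0.59)) = exp(−3) ≈ 0.05.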
Figure 3. Best response values for each of the 50 runs: (x) GSAMS with Φ(x_exp) = 1.0 and β = 3.0, (o) GSAMS with Φ(x_exp) = 1.0 and β = 100, and (+) TA. Sequences are ordered for better graphical display and do not appear in their actual generated order.
The optimum five wavelength subset is 15/ 37/ 50/ 51/ 76, producing GSEL = 0.3314. Essentially, GSA performs poorly compared to the other methods, GSAMS converges in most cases to the best combination, and TA lies between GSAMS and MGSA. Despite the fact that GSAMS often terminates at the best subset, it was found to be impossible to guarantee convergence to the best subset. However, the spectroscopic nature of the investigated wavelength selection problem allows positive identification of the best subset. This is based on the fact that, due to the high collinearity between adjacent data points (assuming suitable digitization), it does not matter which of three or four neighboring wavelengths is selected. The frequency histogram of selected wavelengths based on GSEL is plotted in Figure 2. The plot reveals well separated groups of selected wavelengths. Based on the spectral properties of the problem, the best subset must come from these preferred wavelengths, and it can be found with the following three step strategy:
1. Run the SA type algorithm several times, e.g., 5 to 10 repetitions.
2. Eliminate wavelengths with a selection frequency less than a certain threshold from further search, e.g., require at least a twofold selection.
3. Perform a grid search within this reduced wavelength space.
Such a compound strategy is able to identify the globally best combination for both investigated selectivity criteria with absolute certainty. For example, exclusion of all wavelengths with a selection frequency less than 4 in Figure 2 reduces the search space from $\binom{76}{5} = 1.85 \times 10^{7}$ possible 5-wavelength subsets to $\binom{17}{5} = 6{,}188$ subsets. This pool of combinations was then checked by an exhaustive search and the best subset was indeed identified.
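A sketch of this compound strategy, with run_gsa_type() standing in for one GSAMS or TA run and phi() for the selectivity criterion evaluated on a subset (both hypothetical placeholders), could look as follows.

from collections import Counter
from itertools import combinations

def compound_search(run_gsa_type, phi, n_runs=10, min_frequency=2, k=5):
    # Step 1: repeated SA-type runs, counting how often each wavelength is selected.
    counts = Counter()
    for _ in range(n_runs):
        counts.update(run_gsa_type())
    # Step 2: keep only wavelengths selected at least min_frequency times.
    pool = sorted(w for w, c in counts.items() if c >= min_frequency)
    # Step 3: exhaustive (grid) search over the reduced pool, maximizing phi.
    best = max(combinations(pool, k), key=phi)
    return best, phi(best)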
Table 3
Averaged results for 50 runs optimizing subsets with five wavelengths for the UV1 data set using the GSEL criterion

Optimization method^a   Φ(x_exp)   β      Φ(x_bsf)^b   No. of hits^c   D^d    E^e    F^f    G^g
1                       0.332      1.25   0.2971       —               395    314    468    323
2                       0.332      1.25   0.3231       3               661    357    716    378
3a                      0.332      3.00   0.3296       11              240    104    273    107
3b                      0.332      1.25   0.3311       37              513    232    548    231
4                       —          —      0.3257       4               469    123    594    71

^aMethods 1 = GSA, 2 = MGSA, 3 = GSAMS, and 4 = TA. ^b–^gSame as for Table 2.
3.2. P-matrix analysis
A set of 100 NIR spectra of wheat samples was obtained from Bran & Luebbe (Germany) to evaluate the wavelength optimization algorithms for prediction of moisture. Samples were measured using diffuse reflectance as log(1/R) from 1100 nm to 2500 nm in 10 nm intervals (141 data points). The spectra pool was separated into a calibration set of 50 spectra (WC1) and two prediction sets of 20 spectra each (WP1 and WP2). Calibration and prediction samples were selected such that calibration sample concentrations bracketed all prediction sample concentrations. In the following discussion, the SEP values for the two prediction sets are denoted SEP_P1 and SEP_P2, respectively. The method of PLS is used for determining calibration models. For brevity, only the optimization algorithms Monte Carlo, BE, GSAMS, and TA are discussed. Before the final calibration model can be determined, the number of factors to retain must be ascertained. The common data scaling methods of mean centering (MC), multiplicative scatter correction (MSC), and first and second derivatives were investigated to observe their effects on determining the number of factors. Based on full spectra, Figure 4a displays the corresponding PRESS evolutions as a function of the number of factors used to construct the calibration model, and Figure 4b shows the SEP evolutions. From Figure 4a, it appears that, depending on the data scaling method, between 6 and 12 factors should be used. Figure 4b indicates that 10 to 12 factors may be appropriate. The plots disclose that PRESS and SEP generate the same general shapes for the different data scaling methods. Thus, wavelength selection should not be affected by the type of scaling used. The plots for PRESS and SEP show a relatively flat plateau for less than 5 factors. After 5 factors, PRESS and SEP begin to decrease as more factors become involved in the model. Thus, it was decided to use 4 factors for calibration and prediction. With only 4 factors and full spectra, calibrations and predictions are expected to be poor and extremely sensitive to slight variations in the spectral data. The number of wavelengths to search for was set to twice the number of factors. Thus, the wavelength optimization problem for P-matrix analysis of the wheat data set consists of finding the best subset of 8 wavelengths from spectra consisting of 141 data points, based on PLS and 4 factors. For this problem, the search space consists of 3.165 x 10¹² possible combinations. The vast solution space prohibits an exhaustive search as done for the K-matrix problem. As a benchmark for comparing optimization results using 8 wavelengths and 4 factors to full spectra and 4 or 12 factors, the following should be noted: full spectra calibration with MC data and 4 PLS factors results in PRESS = 19.57, SEP_P1 = 0.72, and SEP_P2 = 0.66, and full spectra calibration with MC data and 12 PLS factors generates PRESS = 5.28, SEP_P1 = 0.46, and SEP_P2 = 0.40. The following discussions focus on MC spectra and do not include MSC or derivative spectra. Monte Carlo searches were performed to test their applicability to the large wavelength search space. Separate searches for optimization of PRESS and SEP_P1 using 10,000 random combinations were conducted. Table 4 lists ten of the best found 8-wavelength subsets based on SEP_P1 (rows 1 - 10) and PRESS (rows 11 - 20). The SEP_P2 and PRESS values listed for rows 1 - 10 were computed using the respective wavelengths found by minimizing SEP_P1. Similarly, the SEP_P1 and SEP_P2 values listed for rows 11 - 20 were computed using wavelengths found by minimizing the corresponding PRESS value. Results presented in Table 4 clearly show the ineffectiveness of Monte Carlo searches. For example, SEP_P1 values range from 0.80 to 0.49 for rows 1 - 10 and PRESS values range from 18.28 to 9.70 for rows 11 - 20. These broad ranges demonstrate the poor precision of Monte Carlo searches on this problem. A more appropriate wavelength search algorithm should be more reproducible in the optimization criterion. As further evidence of the ineffectiveness of Monte Carlo searches, the lowest SEP_P1 value was obtained using wavelengths located by the Monte Carlo PRESS search (number 11 in Table 4). Notice that most of the SEP_P1 values (rows 1 - 10) and all of the PRESS values (rows 11 - 20) are lower than the respective values based on full spectra and 4 factors. None of the SEP_P1 and PRESS values are lower than the corresponding values obtained with full spectra and 12 factors.
Figure 4. (a) PRESS for full spectra PLS calibration as a function of the number of factors: (+) mean centered (MC), (x) MC and multiplicative scatter corrected (MSC), (*) first derivative spectra (MC), and (o) second derivative spectra (MC). (b) SEP for full spectra PLS calibration as a function of the number of factors: (+) WP1 (MC), (x) WP1 (MC and MSC), (*) WP2 (MC), and (o) WP2 (MC and MSC).
Unlike the K-matrix example, backward elimination does not deliver acceptable results. The best wavelength subset for an SEP_P1 optimization is 1/ 4/ 26/ 27/ 28/ 38/ 85/ 86 with SEP_P1 = 0.69. The PRESS value for this wavelength subset is 22.87. In looking over the SEP_P1 sequence obtained during the backward elimination process, it was found that the SEP_P1 value computed with the best 29 wavelengths was the lowest, with SEP_P1 = 0.42 and a corresponding PRESS of 30.77. Comparison of these SEP_P1 and PRESS values with the full spectra, 12 factor values signifies the ineffectiveness of BE. Backward elimination based on the PRESS criterion was not evaluated due to the excessive time requirement.

Table 4
Best eight wavelength subsets found with Monte Carlo searches (10,000 tested combinations) using SEP_P1 (No. 1 - 10) and PRESS (No. 11 - 20) based on PLS and four factors

No.   SEP_P1   SEP_P2   PRESS   Wavelengths
1     0.49     0.53     20.44   10 14 16 18 28 67 128 130
2     0.50     0.46     17.27   9 12 13 15 42 45 47 123
3     0.52     0.53     21.54   10 15 21 23 68 71 121 123
4     0.59     0.59     19.62   24 25 57 58 65 73 121 139
5     0.61     0.53     22.10   10 11 56 62 68 96 120 126
6     0.62     0.56     21.56   6 16 60 93 104 130 135 141
7     0.65     0.55     18.62   27 38 52 56 63 104 106 117
8     0.65     0.56     17.87   22 24 57 58 64 68 74 114
9     0.67     0.59     18.63   25 41 65 74 77 83 96 99
10    0.80     0.60     20.35   31 35 59 61 83 84 121 139
11    0.45     0.48     9.70    57 60 65 68 122 123 125 136
12    0.79     0.57     13.49   15 20 54 58 60 68 74 98
13    0.77     0.52     14.22   9 17 18 47 50 52 59 96
14    0.75     0.64     14.87   56 60 76 110 122 129 130 135
15    0.80     0.60     15.04   51 68 104 106 129 132 133 135
16    0.73     0.55     16.57   33 47 48 54 103 108 110 125
17    0.77     0.53     16.80   27 53 62 102 105 125 130 135
18    0.70     0.57     17.52   16 18 71 100 103 104 107 115
19    0.70     0.57     18.09   19 58 59 62 68 101 111 115
20    0.75     0.57     18.28   52 59 79 90 100 115 135 137
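For reference, the Monte Carlo baseline of Table 4 amounts to nothing more than repeated random sampling; a minimal sketch, with cost() standing in for a PRESS or SEP evaluation of a subset, is given below.

import random

def monte_carlo_search(cost, n_channels=141, subset_size=8, n_trials=10_000,
                       minimize=True, rng=random):
    # Evaluate the cost of randomly drawn wavelength subsets and keep the best one.
    best_subset, best_value = None, None
    for _ in range(n_trials):
        subset = sorted(rng.sample(range(n_channels), subset_size))
        value = cost(subset)
        if best_value is None or (value < best_value if minimize else value > best_value):
            best_subset, best_value = subset, value
    return best_subset, best_value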
For GSAMS, the combination generator permitted up to 6 of the 8 wavelengths to change in each step. The new combination could have a maximum distance of d_M = 40, with individual wavelengths changing by no more than 12 wavelength indices. The algorithm terminated after 26 consecutive nonaccepted steps, where nonaccepted steps consist of the total number of declined steps in normal, expand, and shrink modes.
Figure 5. (a) PRESS values for a series of 50 GSAMS runs. (b) SEP values for WP1 (+) and WP2 (o) using wavelengths from respective GSAMS/PRESS optimizations.
For both PRESS and SEP_P1 optimizations, up to 4 shrink cycles with a shrink factor of 0.6 were permitted. During the expand mode, the stepwidth radius was doubled. As with wavelength selection for K-matrix analysis, problems were encountered in determining good OPS for the GSAMS algorithm. At first, the expected minimum cost function value was set to zero. An unsuccessful attempt was made to correct the observed high acceptance rate of detrimental steps by increasing β. Repetitive runs showed a tremendous variation in the lengths of the searches, i.e., they terminated in local minima or moved across all barriers without converging. Results improved after the expected extreme value was set to 80% of the best optimization function value found and β was set to 2.55. The expected extreme value was then updated each time a lower Φ(x_bsf) was obtained. Fifty sequential optimization runs were performed for PRESS and SEP_P1. Using PRESS as the optimization function, Figure 5a shows the best found values for the sequence of 50 GSAMS runs. The mean PRESS equals 8.46 and is
Figure 6. (a) Frequency distribution for wavelengths included in the best found subsets from 50 repetitive GSAMS/PRESS runs. (b) Mean calibration spectrum (-) and column variances of Y_f (+). Note that both lines are normalized to the (0, 1) range and do not represent actual values.
50% less than the full spectra calibration value obtained with 4 factors and is 60% larger than the PRESS value for full spectra and 12 factors. The lowest PRESS value for the 50 runs is 5.46 (Figure 5a). The mean SEP_P1 and SEP_P2 values are 0.62 and 0.53, respectively. Plotted in Figure 5b are the SEP values for both prediction sets using the respective wavelengths determined optimal by GSAMS/PRESS. Inspection of Figure 5 reveals that certain trends in the error profiles are present. In Figure 5, for instance, 17 of the 50 PRESS values are lower than the average PRESS value in combination with respective SEP_P1 values lower than the average SEP_P1 value. Likewise, 19 of the 50 PRESS values are larger than the average PRESS value in conjunction with SEP_P1 values greater than the mean SEP_P1 value. A similar observation can be made with the PRESS and SEP_P2 results. Specifically, 18 PRESS and SEP_P2 values are concurrently lower than their respective average values, while 20 PRESS and SEP_P2 values are simultaneously greater than their respective average values. The correlation coefficients between PRESS and SEP_P1 and between PRESS and SEP_P2 are 0.62 and 0.48, respectively. These trends and correlation coefficients indicate that wavelength subsets selected on the basis of the calibration spectra are fairly representative of both calibration and prediction sets. A frequency plot for the appearance of wavelengths in the best found subsets for the 50 GSAMS/PRESS runs is shown in Figure 6a. An explanation for the preference of some wavelengths over others is not as easy as for the K-matrix approach. Obviously, the selected wavelengths are those that minimize prediction error, but why these wavelengths are effective in reducing prediction error is not clear. However, if one looks at how the PLS algorithm obtains the modified R, some insight is possible. As part of the PLS algorithm, the matrix Y_f is generated, whose vectors y_i are linear combinations of the eigenvectors obtained by singular value decomposition of R; f denotes the number of eigenvectors considered. The mean calibration spectrum of WC1 is plotted together with the column variances of Y_f in Figure 6b. It appears that wavelengths with a high variance in Y_f are rarely included in the best combinations found. Good examples are the wavelength index ranges 20 - 40, 85 - 95, and 110 - 120. Note that these ranges mostly coincide with sharp changes in the reflectance signals of the calibration spectra. Figure 6a also shows narrow gaps repeatedly appearing between two peaks of frequently selected wavelengths, e.g., wavelength indices 11, 56, or 61 - 64. This observation implies that additional information is being used in wavelength selection besides low variance in the orthogonal matrix Y_f from the PLS decomposition. These aspects are currently under investigation. Plotted in Figure 7a are the best found SEP_P1 values for the sequence of 50 GSAMS optimization runs. The mean SEP_P1 equals 0.44 and is approximately 50% less than the full spectra calibration value obtained with 4 factors and slightly smaller than the SEP_P1 value for full spectra and 12 factors. The lowest SEP_P1 value is 0.33. Figure 7a also shows the corresponding SEP_P2 values using wavelengths determined optimal for SEP_P1. The average SEP_P2 value is 0.51. Displayed in Figure 7b are the coinciding PRESS values using wavelengths deemed best in the GSAMS/SEP_P1 optimizations. The mean PRESS value is 19.78. Optimization for the SEP_P1 criterion appears to locate wavelength combinations more specific to the prediction set WP1 compared to the PRESS criterion.
That is, the average SEP_P1 value for 8 wavelengths and 4 factors (GSAMS/SEP_P1) is closer to the SEP_P1 value based on full spectra and 12 factors than the average
Figure 7. (a) SEP_P1 (o) and SEP_P2 (+) values based on 50 GSAMS/SEP_P1 optimizations. (b) Corresponding PRESS values.
PRESS value for 8 wavelengths and 4 factors (GSAMS/PRESS) is to the PRESS value for full spectra and 12 factors. Inspection of Figure 7 reveals trends in the error profiles similar to those present in Figure 5. In this case, 14 of the 50 PRESS values are low in combination with low SEP_P1 values, while 16 of the PRESS values are high in conjunction with high SEP_P1 values. As with Figure 5, a similar observation can be made with the PRESS and SEP_P2 trends shown in Figure 7. Specifically, 15 PRESS and SEP_P2 values are jointly lower than their corresponding averages and 17 PRESS and SEP_P2 values are simultaneously larger than their respective averages. Correlation coefficients between PRESS and SEP_P1 and between PRESS and SEP_P2 are 0.28 and 0.33, respectively. Although the correlation coefficients are relatively low, wavelength subsets selected on the basis of prediction spectra are indeed representative of both calibration and prediction sets. This is true because the number of prediction error values with similar trends is fairly large, 60% of the 50 values, and the low correlation coefficients reflect the magnitudes of the trends. That is, the correlation coefficients consider not only trends in the data but also the magnitude of
differences between PRESS/SEP pairs. Results listed for TA were achieved with a configuration generator changing up to 6 of the 8 wavelengths, with a maximal distance of d_M = 40 and a maximum individual wavelength shift of 12 index units. No stepwidth modulations were performed. The threshold was updated after 25 nonimproving steps and the termination criterion was set to 40 successive nonaccepted steps. To accommodate the larger search space compared to the K-matrix analysis problem, the threshold was lowered by 3%, i.e., δ = 0.97. A representative series of 50 TA optimization runs based on SEP_P1 achieved an average SEP_P1 value of 0.50. The lowest SEP_P1 value is 0.40. The average SEP_P2 and PRESS values based on wavelengths found optimal by minimizing SEP_P1 are 0.55 and 19.59, respectively. A plot similar to Figure 7 was obtained, indicating that optimization for the SEP_P1 criterion results in wavelength combinations more specific to the prediction set WP1 compared to the PRESS criterion. Although TA/PRESS optimizations were not performed, results similar to GSAMS/PRESS are expected. Figure 8 reveals that the distribution of wavelengths in the 50 best found subsets of the GSAMS/SEP_P1 and TA/SEP_P1 series does not differ significantly from the GSAMS/PRESS pattern shown in Figure 6. The GSAMS/SEP_P1 and TA/SEP_P1 selection processes avoid similar spectral regions, i.e., wavelengths with high variance in the orthogonal vectors y_i from the PLS decomposition. Additionally, the pattern of narrow gaps repeatedly appearing between two peaks of frequently selected wavelengths occurs in Figure 8. Rather than using the process applied in K-matrix analysis for determining the global wavelengths, i.e., exclusion of spectral regions from further search based on the selection frequency distribution, examination of wavelength subset structures may prove more valuable. It is expected that optimizations based on PRESS and SEP should yield both similar and different characteristic wavelength patterns. Ideal wavelength subsets would be those identified with acceptable PRESS and SEP values in conjunction with similar wavelength patterns regardless of the optimization criterion. Table 5 lists some representative wavelength combinations from searches optimizing for PRESS and SEP_P1. Group AI consists of some of the lowest PRESS values found by GSAMS/PRESS and mostly includes wavelengths with indices 55 - 75 and indices greater than 100. However, these combinations do not produce good SEP_P1 and SEP_P2 values. Groups BI and CI contain wavelength combinations found optimal based on GSAMS/SEP_P1 and TA/SEP_P1 optimizations, respectively. Wavelength patterns for these two groups primarily contain wavelengths with indices 9 - 16, indices in the upper 60 range, and indices greater than 110. Unfortunately, the corresponding PRESS values for these wavelength combinations are large. Wavelength combinations for groups A, B, and C indexed with II denote those wavelength patterns that yield good PRESS and SEP values regardless of the optimization criterion. These wavelength combinations predominantly exclude indices from the 50 range and indices greater than 100. It should be noted that each respective list of 50 optimization runs generates numerous wavelength combinations that cannot be classified into either of the two groups indexed with I and II. Additionally, wavelength patterns other than those
Figure 8. Frequency distribution for wavelengths included in the best found subsets from 50 repetitive runs for (a) GSAMS/SEP_P1 and (b) TA/SEP_P1. Histograms are overlaid with the mean calibration spectrum (-).
listed in Table 5 under the groups indexed with I yield low PRESS and SEP values (group AI) or low SEP_P1 and high PRESS values (groups BI and CI). It appears that an evaluation of the best subsets obtained from repetitive runs permits identification of wavelength regions that result in low values for both optimization criteria. A search of this reduced search space could then be performed. As for K-matrix analysis, it seems questionable to search for the global combination since results for similar subsets do not differ significantly. The 50 successive GSAMS/PRESS runs required an average of 630 iterations before termination. The GSAMS/SEP_P1 and TA/SEP_P1 optimizations needed, on average, 680 and 246 iterations, respectively. Note that the average number of steps for TA/SEP_P1 is less than 50% of the average number of steps for GSAMS/SEP_P1.
Table 5
Representative wavelength subsets from GSAMS/PRESS (group A), GSAMS/SEP_P1 (group B) and TA/SEP_P1 (group C) optimizations. See text for a description of subgroups I and II.

Group   SEP_P1   SEP_P2   PRESS   Wavelengths
AI      0.57     0.46     5.46    56 57 58 59 98 106 107 109
        0.60     0.50     6.33    57 59 60 72 74 97 105 128
        0.67     0.62     7.32    46 55 56 60 69 105 129 132
BI      0.34     0.44     19.55   9 18 65 69 70 110 122 137
        0.37     0.45     15.85   9 15 65 68 71 72 126 140
        0.38     0.40     22.59   9 14 16 28 73 129 135 139
CI      0.41     0.52     11.78   10 12 15 17 64 66 131 132
        0.43     0.56     17.43   9 14 16 66 68 78 112 114
        0.46     0.46     23.44   9 10 16 61 78 130 133 137
AII     0.49     0.40     6.09    10 12 14 56 65 68 73 75
        0.51     0.42     6.74    9 12 15 38 61 64 65 67
        0.54     0.42     7.04    9 12 14 37 42 62 64 67
BII     0.39     0.39     5.85    7 9 14 57 65 66 67 76
        0.39     0.49     9.89    6 9 13 14 16 65 66 67
        0.41     0.49     9.99    9 11 16 60 61 64 68 92
CII     0.41     0.51     10.42   9 15 16 20 65 68 69 102
        0.43     0.53     10.91   9 12 15 56 66 68 69 70
        0.46     0.53     13.20   9 14 59 60 64 73 88 94
While the 630 iterations for GSAMS/PRESS are a tiny fraction of the size of the search space, the time requirements are extraordinary due to the cross-validation process. With the given computational tools (a 486/60 MHz computer under MS-DOS/Windows and MATLAB V4.2b), a single GSAMS/PRESS run necessitates approximately two hours, compared to 3.5 and 1.5 minutes for GSAMS/SEP_P1 and TA/SEP_P1, respectively.
4. CONCLUSIONS
General principles of SA type algorithms, such as GSA and TA, were evaluated. Algorithm problems with regard to locating the exact extreme of discrete functions were also discussed. A typical configuration generator with a fixed search radius allows the optimization process to escape local extrema, but, combined with the probabilistic acceptance mechanism for detrimental steps, the fixed search radius in many cases prohibits convergence to the exact global extreme. The three level stepwidth modulation process of GSAMS was introduced and successfully applied to wavelength selection for spectroscopic calibration and prediction. Both GSAMS and TA were applied to situations typical for K-matrix and P-matrix analysis. The goal of the K-matrix problem was to locate the wavelength combination with the highest selectivity based on pure-component ultraviolet-visible spectra. Due to the probabilistic nature of the search heuristics, they were not able to identify the best subset with absolute certainty. Nevertheless, the broad bandwidth of spectral features causes distinct patterns in the selection frequency of wavelengths. It was found that by repeating optimization runs a few times, a reduced search space could be obtained. Enumerating searches were then performed on the reduced search space to identify the global combination. The P-matrix example consisted of NIR spectra of wheat samples with 141 data points. Due to the enormous extent of the search space, the best combination of wavelengths cannot be determined by an enumerating search. Thus, the results of GSAMS and TA were compared to the numbers generated using a traditional full spectra PLS model. Optimal subsets of 8 wavelengths and 4 PLS factors achieved PRESS and SEP values, in general, 50% to 75% less than the full spectra model with 4 PLS factors. The resulting error values are comparable to full spectra calibrations with approximately 12 factors. An objective of the investigation was to observe whether acceptable predictions are possible using a wavelength subset and fewer factors compared to classical PLS, i.e., full spectra and the number of factors derived from a plot of PRESS versus number of factors. Another objective of the study was to test the ability of GSAMS and TA to identify proper wavelength subsets. Thus, the described study investigated the empirically chosen situation of optimizing for 8 wavelengths using 4 factors. This does not necessarily represent the best wavelength to factor ratio. In other words, it may be possible to find a better wavelength/factor combination that minimizes prediction error. For instance, using 10 factors and optimizing for 15 wavelengths, GSAMS/PRESS resulted in an average PRESS equal to 2.85 based on 50 runs. The minimum PRESS found is 1.75. The corresponding SEP_P1 and SEP_P2 average values
53 based on wavelengths deemed optimal from the GSAMS/PRESS searches are 0.45 and 0.45, respectively. The fact that less wavelengths and factors can produce smaller prediction errors compared to the classical PLS approach stems from the ability of SA type algorithms to select wavelengths relevant to prediction of the desired property. Limiting calibration and prediction to pertinent wavelengths reduces the number of factors required to properly model the remaining data structure resulting in the lower prediction errors. Of course, the best prediction results would be realized if the ideal subset of wavelengths and number of factors were used. This represents a complex computational problem and was not attempted. When compared to using PRESS as the optimization criterion based on crossvalidation, the SEP for an independent set improves the optimization speed several hundred percents. To avoid selection of wavelength combinations specific to the prediction set, it is necessary to validate the predictive ability of selected wavelengths by using additional prediction sets. In addition, the PRESS value for the calibration spectra should also be acceptable. As for the ultraviolet spectra, the appearance of wavelengths in favorable combinations for the NIR problem is closely correlated with spectral properties of the calibration set. A good wavelength subset can be obtained from analysis of the internal composition of selected subsets. Overall, GSAMS yielded superior results compared to GSA and TA. However, the less restricted acceptance rule for TA makes two problem dependent OPS for GSAMS, i.e., the partition constant 13 and expectation value for the cost function, superfluous. As all outlined applications were solved with a nearly identical set of OPS, TA appears to be more robust than GSAMS.
ACKNOWLEDGMENTS
We want to express our thanks to Dr. H. Prüfer at Bran & Luebbe for the kind provision of the NIR spectra. The work described was supported by the NSF-Idaho EPSCoR Program and the National Science Foundation grant number OSR 9350539. U.H. is partially supported by a grant from the DAAD.
REFERENCES
1. U. Hörchner and J.H. Kalivas, J. Chemometrics, in press (1995).
2. U. Hörchner and J.H. Kalivas, submitted to Anal. Chem. (1995).
3. D.C. Harris, Quantitative Chemical Analysis, Third Ed., W.H. Freeman and Company, New York, 1991.
4. J.H. Kalivas and P.M. Lang, Mathematical Analysis of Spectral Orthogonality, Marcel Dekker, New York, 1994.
5. S. Kirkpatrick, C.D. Gelatt and M.P. Vecchi, Science, 220 (1983) 671.
6. P.J.M. van Laarhoven and E.H. Aarts, Simulated Annealing: Theory and Applications, D. Reidel Publishing Company, Dordrecht, 1987.
7. I.O. Bohachevsky, M.E. Johnson and M.L. Stein, Technometrics, 28 (1986) 209.
8. J.W. Greene and K.J. Supowit, IEEE Trans. Computer-Aided Design, CAD-5 (1986) 221.
9. G. Dueck and T. Scheuer, J. Comp. Phys., 90 (1990) 161.
10. D. Vanderbilt and S.G. Louie, J. Comp. Phys., 56 (1984) 259.
11. P.M. Lang and J.H. Kalivas, submitted to Chemometrics Intelligent Lab. Sys. (1995).
12. A. Lorber, Anal. Chem., 58 (1986) 1167.
13. C. Lucasius, M.L.M. Beckers and G. Kateman, Anal. Chim. Acta, 286 (1994) 135.
14. U. Hörchner and J.H. Kalivas, Anal. Chim. Acta, in press (1995).
15. S. Wold, Technometrics, 20 (1978) 397.
16. J.H. Kalivas, N. Roberts and J.M. Sutter, Anal. Chem., 61 (1989) 2024.
17. J.M. Sutter and J.H. Kalivas, Microchem. J., 47 (1993) 60.
18. E. Lang (ed.), Absorption Spectra in the Ultraviolet and Visible Region, Academic Press, New York, 1961-1979.
19. J.H. Kalivas, Chemometrics Intelligent Lab. Sys., 15 (1992) 1.
APPENDIX A
Pseudo-code for threshold acceptance (TA) for minimization of a generic function Φ(x)
1. Initialization
- generate a start configuration x_cur and estimate its quality Φ(x_cur)
- define an initial threshold t > 0 meaningful to the expected function range, e.g., 1/10 of it
- define a threshold factor α ≤ 1.0
- define an initial stepwidth vector s
2. Iteration
- generate a new configuration x_new = f(x_cur, s) and estimate its quality Φ(x_new)
- if Φ(x_new) − Φ(x_cur) ≤ t then set x_cur = x_new
- if Φ(x_new) < Φ(x_bsf) then set x_bsf = x_new
3. Adjustments
- if x_bsf is not updated for a certain number of steps, then decrease t = α·t
4. Break Conditions
- if the maximal number of iterations is performed
- if x_cur is not updated for a certain number of steps
- if t = 0
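For readers who prefer working code, the following is a minimal Python sketch of the threshold acceptance loop outlined above. The objective function, the perturbation scheme and all control settings (names such as n_iter, patience and alpha) are illustrative assumptions of this sketch, not part of the original algorithm description.

```python
import numpy as np

def threshold_acceptance(phi, x0, step, t0, alpha=0.9,
                         n_iter=10000, patience=200, rng=None):
    """Minimal threshold-acceptance minimizer of a generic function phi(x).

    t0       : initial threshold, e.g. ~1/10 of the expected function range
    alpha    : threshold factor (<= 1) used to decrease the threshold
    patience : decrease the threshold if the best-so-far stalls this many steps
    """
    rng = np.random.default_rng() if rng is None else rng
    x_cur = np.asarray(x0, dtype=float)
    f_cur = phi(x_cur)
    x_bsf, f_bsf = x_cur.copy(), f_cur          # best-so-far configuration
    t, stall = float(t0), 0

    for _ in range(n_iter):
        # new configuration x_new = f(x_cur, s): random step scaled by the stepwidth
        x_new = x_cur + step * rng.uniform(-1.0, 1.0, size=x_cur.shape)
        f_new = phi(x_new)
        # accept if not worse than the current configuration by more than t
        if f_new - f_cur <= t:
            x_cur, f_cur = x_new, f_new
        # keep track of the best configuration seen so far
        if f_new < f_bsf:
            x_bsf, f_bsf, stall = x_new.copy(), f_new, 0
        else:
            stall += 1
        # adjustment: decrease the threshold when the best-so-far stalls
        if stall >= patience:
            t *= alpha
            stall = 0
        if t <= 0.0:
            break
    return x_bsf, f_bsf

# Example: minimize a simple two-dimensional test function
if __name__ == "__main__":
    rosen = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
    x, f = threshold_acceptance(rosen, x0=[-1.0, 1.0],
                                step=np.array([0.1, 0.1]), t0=1.0)
    print(x, f)
```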
APPENDIX B
Pseudo-code for generalized simulated annealing with modulated stepwidth (GSAMS) for minimization of a generic function Φ(x)
1. Initialization
- generate a start configuration x_cur and estimate its quality Φ(x_cur)
- set the best-so-far configuration x_bsf = x_cur
- define a conservative expectation value for the extreme quality Φ(x_exp)
- define an initial stepwidth vector s
2. Iteration
- generate a new configuration x_new = f(x_cur, s) and estimate its quality Φ(x_new)
- if Φ(x_new) < Φ(x_cur) then set x_cur = x_new
  - if Φ(x_new) < Φ(x_bsf) then set x_bsf = x_new
- else set p = exp[−β (Φ(x_new) − Φ(x_cur)) / (Φ(x_cur) − Φ(x_exp))]
  - if p ≥ r (r ∈ U[0,1]) then set x_cur = x_new
3. Adjustments
- if x_cur is not updated for a certain number of steps, then expand stepwidth s
- if in expand mode and no update of x_cur occurs for a certain number of steps, then shrink stepwidth
- if in collapse mode and no update of x_cur occurs for a certain number of steps and the maximal number of shrink cycles is not reached, then collapse stepwidth further
- if x_cur is updated while in expand or collapse mode, return to the initial stepwidth s
4. Break Conditions
- if |Φ(x_bsf) − Φ(x_exp)| < ε
- if the maximal number of iterations is conducted
- if the maximal number of shrink cycles is performed
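A compact Python sketch of the GSAMS acceptance rule with a simplified stepwidth modulation is given below. The control settings (beta, patience, expand, shrink, max_shrink) are illustrative assumptions, and the modulation logic is reduced to its essentials rather than being a faithful reproduction of the full three-level scheme in the appendix.

```python
import numpy as np

def gsams(phi, x0, step0, phi_exp, beta=3.0, n_iter=20000,
          patience=200, expand=2.0, shrink=0.5, max_shrink=5,
          eps=1e-6, rng=None):
    """Sketch of generalized SA with modulated stepwidth, minimizing phi(x).

    phi_exp : conservative expectation value for the extreme quality
    beta    : partition constant of the probabilistic acceptance rule
    """
    rng = np.random.default_rng() if rng is None else rng
    x_cur = np.asarray(x0, float)
    f_cur = phi(x_cur)
    x_bsf, f_bsf = x_cur.copy(), f_cur
    step = np.asarray(step0, float).copy()
    stall, shrink_cycles = 0, 0

    for _ in range(n_iter):
        x_new = x_cur + step * rng.uniform(-1.0, 1.0, size=x_cur.shape)
        f_new = phi(x_new)
        if f_new < f_cur:
            accepted = True                    # downhill steps are always taken
        else:
            # probabilistic acceptance of detrimental steps
            p = np.exp(-beta * (f_new - f_cur) / max(f_cur - phi_exp, eps))
            accepted = p >= rng.uniform()
        if accepted:
            x_cur, f_cur, stall = x_new, f_new, 0
            step = np.asarray(step0, float).copy()   # return to initial stepwidth
            shrink_cycles = 0
            if f_new < f_bsf:
                x_bsf, f_bsf = x_new.copy(), f_new
        else:
            stall += 1
        # simplified stepwidth modulation: expand once, then shrink repeatedly
        if stall >= patience:
            stall = 0
            if shrink_cycles == 0:
                step *= expand
            elif shrink_cycles > max_shrink:
                break                          # maximal number of shrink cycles
            else:
                step *= shrink
            shrink_cycles += 1
        # stop when the best-so-far quality reaches the expectation value
        if abs(f_bsf - phi_exp) < eps:
            break
    return x_bsf, f_bsf
```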
Chapter 3 Robust principal component analysis and constrained background bilinearization for quantitative analysis
Ruqin Yu, Yulong Xie and Yizeng Liang
Department of Chemistry and Chemical Engineering, Hunan University, Changsha, 410082, People's Republic of China.
1. INTRODUCTION
One of the outstanding characteristic features of modern analytical instrumentation is its ability to provide analytical signals in the form of tensors of different orders, i.e. vectors (first order tensors), matrices (second order tensors) and even higher order tensors. The multivariate data supplied by such instruments embody more information than the traditional univariate signal and are more suitable for qualitative and quantitative analysis. Optimization procedures are frequently encountered in the treatment of analytical data. Different optimization procedures sometimes result in quite different outputs if there are local optima in the search field. In this chapter, we will discuss some applications of the simulated annealing technique to the treatment of multivariate analytical data. Simulated annealing, as a tool for searching for the global optimum on a multi-dimensional response surface, is especially suitable for solving multivariate calibration problems involving optimization procedures. First, a robust principal component analysis (PCA) based on projection pursuit (PP) and generalized simulated annealing (GSA) techniques is developed. Then a modified constrained background bilinearization (CBBL) algorithm based on GSA is described. The results obtained from computer simulation and experimental data show that the proposed methods compare favorably with the traditional ones.
2. ROBUST PRINCIPAL COMPONENT ANALYSIS BY PROJECTION PURSUIT AND SIMULATED ANNEALING
It is well known that PCA is an important technique for high-dimensional data reduction and exploratory analysis. It is also the basis of, and an indispensable part of, many multivariate quantitative methods developed in chemometrics, such as most curve resolution procedures, the widely used multivariate calibration methods like principal component regression (PCR) and partial least squares regression (PLS), many pattern recognition methods, and so on. Unfortunately, the classical PCA is non-robust. Sometimes a principal component might be created just by the presence of one or two outliers [1]. So if outliers exist, the coordinate axes of the principal component space might be misdetermined by the classical PCA, and reliable analytical results would not be obtainable from a non-representative subspace of the original variable space. On the other hand, statistical procedures used in chemical data treatment usually include some explicit or implicit a priori assumptions about the noise distribution. The most frequently assumed distribution is an independent and normal (Gaussian) error distribution with constant variance. The assumed distribution, unfortunately, does not always hold in practice. Clancey [2] examined some 250 error distributions involving about 50,000 chemical analyses of metals, and he found that only about 10-15% of them could be regarded as normal. A similar phenomenon has also been found in the analysis of blood constituents [3]. Deviations of this sort may be due either to the inherent properties of the error distribution or to the presence of outliers in the data. In analytical measurements, unexpected outliers in experimental data may result from the drift of instruments or from human errors made during the measuring process. In the past decades, it has become apparent that the most commonly used statistical techniques are excessively sensitive to minor deviations from the assumed error model, and incorrect results would be obtained if one tried to treat the data by using these techniques. For this reason, the search for alternative robust procedures is of considerable interest. One would expect that the robustification of PCA should substantially improve the performance of a whole spectrum of chemometric procedures. There are several routes to robust PCA, including M-estimators based on ellipsoidal distributions [4,5] and elementwise robust estimation of the dispersion
(covariance/correlation) matrix [1,4,6]. More recently, a new type of method for robust PCA making use of PP has been proposed by Li and Chen [7]. Monte Carlo simulation has shown that the new procedures compare favorably with other robust methods. They provide results as good as the best of the M-estimators in terms of robustness efficiency, and as good as the elementwise approaches with respect to the empirical breakdown point properties [7]. Inspired by the work of statisticians, chemists have also placed the use of robust methods on the agenda. Phillips and Eyring [8] were among the first to apply an M-estimator, the biweight function of Tukey, in regression analysis of analytical data. They concluded that the efficiency of the robust regression was about the same as or superior to that of least squares regression. Wolters and Kateman [9] studied thoroughly the performance of an M-estimator and of ordinary least squares regression in the calibration of analytical methods under non-normal noise distributions, and they claimed that, under certain conditions, robust regression offered some advantage over least squares regression. Another robust estimator, least median of squares (LMS), proposed by Rousseeuw [10], was used by Massart et al. [11] to detect or correct for outliers in analytical calibration. In a recent study comparing the performance of several outlier detection procedures in calibration, the superiority of the robust method was demonstrated [12]. Rousseeuw [13] has published a tutorial on robust statistics for chemists with emphasis on robust regression by the use of LMS. The robustness of many other chemometric techniques, however, has not yet received as much attention as robust regression. More recently, the present authors [14] introduced a limiting transformation into the innovation sequence of the Kalman filter and thus robustified the ordinary Kalman filter. Most of the research undertaken so far has focused on univariate analytical calibration; the extension of the robust concept to the multivariate situation is of considerable interest. In this chapter, robust PCA via PP techniques is used for chemical data treatment, since this method combines some of the excellent performance characteristics of other robust methods. As a key point of the technique, the GSA algorithm is used as the optimization procedure to guarantee the global optimum. The theory and algorithm of PP PCA together with GSA are described, and the necessity and applicability of robust PCA are demonstrated by numerical examples.
2.1. Projection pursuit

Multivariate data are very common in scientific research. If a multivariate observation is taken as a point in the d-dimensional variable space, n observations will form a point cloud in this space. The goal of multivariate data analysis is to find and describe the structure of the point cloud. For dealing with high-dimensional data, Kruskal suggested projecting the data from the high-dimensional space onto a lower-dimensional subspace [15] and then analyzing the data in this subspace of projection, which can be regarded as the basis of the concept of PP. PP techniques search for a lower-dimensional subspace such that the configuration of the data obtained in this subspace reflects the structure and features of the original higher-dimensional data in an optimal way. By working on the lower-dimensional projections, PP techniques are able to avoid the difficulty caused by the sparseness of the data in the high-dimensional space, i.e. the so-called "curse of dimensionality" [16]. The term projection pursuit was coined by Friedman and Tukey [17]. Later, a series of methods based on the PP concept were developed, including PP regression [18], PP classification (discrimination) [19] and PP PCA [5], etc. The rich content of PP techniques was reviewed by Huber [19]. This new tendency of treating high-dimensional data by PP techniques has attracted the attention of chemists. Kowalski et al. [20], for instance, applied PP regression to the interpretation of sensor array data and disclosed a logarithmic structure of the data that coincided with the inherent properties of electrochemical sensors. PP techniques possess many valuable characteristics. One of them is that they are easily made robust.
2.2. Projection pursuit algorithm for robust principal component analysis

The basic idea of PCA can be described as follows. Let x_1,...,x_n be d-dimensional data with x_i^T = (x_1i,...,x_di) forming a d-dimensional point cloud. The aim of PCA is to determine whether the point cloud is distributed over the entire d-dimensional space or mainly on a subspace with a lower dimensionality, say, of q. In order to do that, q orthogonal directions a_1,...,a_q are selected one after another so that the projection of the point cloud onto a certain direction a, z_i = a^T x_i (i = 1,...,n), is distributed as widely as possible. If one takes a projection index, V(z), which describes the dispersion of the one-dimensional projection z, then
a set of selected principal components a_1,...,a_q will satisfy:

λ_1 = max V(z) = max V(a^T X) = V(a_1^T X),   ||a_1|| = 1
λ_2 = max V(z) = max V(a^T X) = V(a_2^T X),   ||a_2|| = 1 and a_2 ⊥ a_1
...
λ_q = max V(z) = max V(a^T X) = V(a_q^T X),   ||a_q|| = 1 and a_q ⊥ a_1,...,a_(q-1)     (1)

where X = (x_1,...,x_n) is the data matrix, a represents a certain direction and a_1 to a_q signify the selected directions (principal components) with projective indices λ_1 to λ_q. If λ_(q+1) ((q+1) ≤ d) is small enough to be ignored, one concludes that the point cloud spreads mainly over the q-dimensional subspace spanned by a_1,...,a_q, and the data dimensionality can thus be reduced. Essentially, the classical PCA, which uses the variance as the one-dimensional projective index, is a special case of the PP algorithm. With this special projective index, the classical PCA can be accomplished directly by solving for the eigenvectors and eigenvalues of the covariance matrix XX^T. For the i-th eigenvalue of XX^T, λ_i, and the associated eigenvector a_i, one has the following relationship:

λ_i = V(a_i^T X) = a_i^T XX^T a_i     (2)
Since the variance is very sensitive to outliers, the classical PCA is non-robust. Instead, a more robust projective index V(z) is used in PP PCA. A well-known, extremely robust estimator of scale is the median of all absolute distances from the sample median (MAD), and it is taken as the one-dimensional projective index:

S = 1.483 × median(|z_i − z*|)     (3)

where z* is the median of the one-dimensional projection and 1.483 is a correction factor which makes the estimator consistent with the ordinary scale parameter of the Gaussian distribution. Following Li & Chen [7], a refining cycle for z* and S was adopted in which Huber-type weights were used.
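As an illustration, the MAD-based projective index of Eqn. 3 can be computed in a few lines of Python. The optional refinement shown here is only a rough sketch of the Huber-type refining cycle of Li and Chen, with an assumed tuning constant.

```python
import numpy as np

def mad_index(z, refine=True, c=1.345, n_cycles=2):
    """Robust scale S = 1.483 * median(|z_i - z*|) of a 1-D projection z."""
    z = np.asarray(z, dtype=float)
    z_star = np.median(z)                        # robust location (sample median)
    s = 1.483 * np.median(np.abs(z - z_star))    # Eqn. 3
    if refine:
        for _ in range(n_cycles):                # crude Huber-type refining cycle
            u = (z - z_star) / max(s, 1e-12)
            w = np.minimum(1.0, c / np.maximum(np.abs(u), 1e-12))  # Huber weights
            z_star = np.sum(w * z) / np.sum(w)   # weighted location estimate
            s = 1.483 * np.median(np.abs(z - z_star))
    return z_star, s
```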
The algorithm of robust PP PCA can be implemented in an iterative fashion as follows.

Step 1. Variable initiation. Assume that q (q ≤ d) significant principal components are to be found. Initialize the first q principal component directions a_1,...,a_q with the first q columns of a d by d identity matrix I, let A = (a_1,...,a_q), and preset the projection indices S_i with the MAD of the original data:

S_i = 1.483 × median_j(|x_ij − m_i|),   where m_i = median_j(x_ij),   i = 1,...,d; j = 1,...,n     (4)
The initial robust covariance matrix is denoted by C:

C = A D A^T     (5)

where D is a diagonal matrix with the projection indices S_i as its elements.

Step 2. Optimization.
Step 2a. Suppose the first i−1 principal components a_1,...,a_(i−1) and S_1,...,S_(i−1) are available; the direction with the largest projection index S_i in the orthogonal complement of the subspace spanned by the i−1 selected principal component directions a_1,...,a_(i−1) is selected as the i-th principal component direction a_i. For the determination of a_i, the global optimization procedure GSA is adopted. The initial guess, a_i0, of GSA could be a column of the residual matrix X. The i-th principal component direction is then determined according to

S_i = max V(z) = max V(a^T X) = V(a_i^T X),   ||a_i|| = 1     (6)
A brief description of GSA will be given in the next section. After the i-th principal component a_i has been determined, one calculates the residual matrix:

X = (I − a_i a_i^T) X     (7)
In this way, the computation is always carried out on the orthogonal complement of the subspace spanned by the already calculated principal component directions, and the resulting principal component direction is automatically orthogonal to the former ones. This is slightly different from the procedure adopted by Li & Chen [7], who searched for a lower-dimensional vector and then readjusted it to a direction in d-dimensional space, but there is no essential difference between the two approaches.

Step 2b. In order to improve the computational efficiency, the direction among a_(i+1),...,a_q which is most similar to the i-th direction a_i^new just calculated is replaced by the original i-th direction a_i. The process is to pick out the direction a_k from a_(i+1),...,a_q which has the largest projection length in the direction of a_i^new and then exchange their positions, i.e. a_k = a_i, a_i = a_i^new. In order to maintain the orthogonality of the columns of matrix A at all times, the successive directions are made orthogonal by the use of the QR decomposition method. After doing that, the q directions are sorted according to their values of S to guarantee the descending order of the principal components.

Step 2c. After the q significant directions have been calculated, the robust covariance matrix is then constructed from them:

C_new = A D A^T     (8)

This modified covariance matrix is compared with the preceding one. If the difference is smaller than a preset threshold, the computation is terminated; otherwise the procedure returns to step 2a, and a new cycle is invoked by using the estimated principal component directions as initial guesses, together with the untouched original data matrix X.
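The deflation scheme of Steps 1-2c can be sketched in Python as follows. For brevity, the inner maximization over unit directions is delegated to a generic maximize_index routine (in this chapter that role is played by GSA, described in the next section), and details such as the swap of the most similar remaining direction in Step 2b are simplified, so the code illustrates the structure rather than reimplementing the procedure faithfully.

```python
import numpy as np

def pp_pca(X, q, maximize_index, n_outer=10, tol=1e-6):
    """Skeleton of robust PP PCA by successive deflation.

    X              : d x n (mean-centred) data matrix, columns are observations
    q              : number of principal component directions sought
    maximize_index : callable(Xres, a0) -> unit direction maximizing the
                     robust projective index on the residual matrix Xres
    """
    d, n = X.shape
    A = np.eye(d)[:, :q].copy()          # Step 1: initialize with identity columns
    S = 1.483 * np.median(np.abs(X - np.median(X, axis=1, keepdims=True)), axis=1)[:q]
    C_old = A @ np.diag(S) @ A.T         # initial robust "covariance" C = A D A^T

    for _ in range(n_outer):
        Xres = X.copy()
        for i in range(q):
            a_i = maximize_index(Xres, A[:, i])      # Step 2a: search for a_i
            a_i = a_i / np.linalg.norm(a_i)
            z = a_i @ Xres
            S[i] = 1.483 * np.median(np.abs(z - np.median(z)))
            A[:, i] = a_i
            Xres = (np.eye(d) - np.outer(a_i, a_i)) @ Xres   # Eqn. 7: deflate
        A, _ = np.linalg.qr(A)           # Step 2b (simplified): re-orthogonalize
        order = np.argsort(S)[::-1]      # sort by projective index, descending
        A, S = A[:, order], S[order]
        C_new = A @ np.diag(S) @ A.T     # Step 2c: rebuild robust covariance
        if np.linalg.norm(C_new - C_old) < tol:
            break
        C_old = C_new
    return A, S
```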
2.3. Generalized simulated annealing as an optimization algorithm for PP PCA

Besides the measure of the dispersion of the one-dimensional projection, i.e. the projective index, another distinction of PP PCA from the classical PCA is the procedure of computation. Since the classical projective index is a quadratic form of X, as stated above, the extremal problem of Eqn. 1 can be turned into the problem of finding the eigenvalues and eigenvectors of the sample covariance matrix, for which many algorithms, such as SVD and QR, are available. Because of the adoption of the robust projective index in PP PCA, a nonlinear optimization approach must be used instead. In order to guarantee the global optimum, simulated annealing (SA), the main topic of this book, is adopted. SA, as a category of stochastic optimization algorithms, was first applied by Kirkpatrick [21] to the solution of combinatorial optimization problems and was later generalized (GSA) by Bohachevsky [22] for searching for the global optimum on a continuous multi-dimensional response surface. The GSA approach is very attractive for its mechanism of walking across local optima. Kalivas published an excellent tutorial for chemists entitled Optimization using variations of simulated annealing [23]. He studied the use of SA for principal component selection to
minimize the prediction errors from principal component regression [24]. GSA has been used in analytical calibrations [25-28]; the details of GSA can be found elsewhere [20-26]. The presentation here will be restricted to the clarification of the present application. For the maximization of the objective function in the present application, GSA is started by selecting a random initial direction a as the current direction and computing its objective function value, Φ = Φ(a), that is, the value of S for the given a as calculated by Eqn. 3. Then, a new random direction a' is generated in the neighborhood of the current direction according to the following equation:

a' = a + Δr·v     (9)

Here v is a random direction with elements determined by d random numbers u_i (i = 1,...,d) from N(0,1) according to v_i = u_i/(Σ u_i²)^(1/2), and Δr is the stepsize. The objective function value of the perturbed direction, Φ' = Φ(a'), is calculated together with its difference from Φ, δΦ = Φ' − Φ. If δΦ > 0, the perturbed direction is better than the current one and is accepted as the next current direction unconditionally. If δΦ ≤ 0, an acceptance probability p defined by Eqn. 10 is computed:

p = exp[−β(Φ − Φ')/(Φ0 − Φ)]     (10)
Here β is a controlling parameter and Φ0 is the value of the objective function at the global optimum. The calculated probability p is compared with a random number, P, drawn from a uniform distribution on the interval [0,1]. If p ≥ P, the detrimental perturbed new direction is accepted as the current direction; otherwise another random perturbation is executed on the current direction a and another new direction is computed, and the process is repeated until some termination criterion is satisfied. The influences of the parameters of GSA have been discussed extensively in the literature [26]; for the sake of simplicity the details of their selection are omitted. The parameter β is chosen so that the inequality 0.5 < p < 0.9 is satisfied, ensuring that 50% to 90% of the early detrimental directions are accepted. The stepsize is selected to ensure walking across any local optimum in two to three steps. In practice, unfortunately, no a priori information is available concerning the response surface. In order to search the whole space of directions, a relatively large initial stepsize is adopted. The objective function value of the global optimum, Φ0, is not always known a priori. A conservative initial value is used and the necessary modification is made in the course of the computation. Some modifications of GSA are adopted here [28]. First, the stepsize Δr is a constant in GSA, which limits the precision of the optimization results, so an additional cycle for reducing Δr is introduced in this algorithm. Second, since GSA accepts detrimental directions with non-zero probability, which keeps GSA from sinking into local optima, the ultimate direction in one cycle of GSA is usually not the best direction produced in this cycle. In order to improve the efficiency of the algorithm, the best direction that appeared during the cycle of computation is stored and taken as the initial direction of the successive cycle, or as the ultimate result if the algorithm is terminated. In this way, GSA can operate more efficiently.
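A hedged Python sketch of this GSA direction search (Eqns. 9 and 10), including the stepsize-reduction cycle and the storage of the best direction, is given below. The numeric settings (beta, dr, the cycle and stall limits) are illustrative choices, not values from the chapter.

```python
import numpy as np

def gsa_direction(Xres, a0, index, phi0, beta=3.0, dr=0.5,
                  n_cycles=4, n_iter=2000, stall_limit=30, rng=None):
    """GSA search for the unit direction a maximizing index(a^T Xres).

    phi0 : conservative value assumed for the objective at the global optimum
    """
    rng = np.random.default_rng() if rng is None else rng
    a = np.asarray(a0, float)
    a = a / np.linalg.norm(a)
    phi = index(a @ Xres)
    a_best, phi_best = a.copy(), phi

    for _ in range(n_cycles):                  # extra cycles with reduced stepsize
        stall = 0
        for _ in range(n_iter):
            u = rng.standard_normal(a.shape)   # Eqn. 9: random unit perturbation
            v = u / np.linalg.norm(u)
            a_new = a + dr * v
            a_new = a_new / np.linalg.norm(a_new)   # keep ||a|| = 1
            phi_new = index(a_new @ Xres)
            if phi_new - phi > 0:              # better direction: accept
                a, phi = a_new, phi_new
            else:                              # Eqn. 10: detrimental acceptance
                p = np.exp(-beta * (phi - phi_new) / max(phi0 - phi, 1e-12))
                if p >= rng.uniform():
                    a, phi = a_new, phi_new
                else:
                    stall += 1
            if phi > phi_best:
                a_best, phi_best = a.copy(), phi
                stall = 0
            if stall >= stall_limit:
                break
        a, phi = a_best.copy(), phi_best       # restart cycle from best direction
        dr *= 0.5                              # reduce stepsize for a finer search
    return a_best
```

This routine can be plugged into the previous PP PCA skeleton through a small wrapper, for example: maximize_index = lambda Xres, a0: gsa_direction(Xres, a0, index=lambda z: 1.483*np.median(np.abs(z - np.median(z))), phi0=10.0), where the value of phi0 is an assumed conservative estimate.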
3. SIMULATION FOR PCA TREATMENT
Numerical simulation is used to test the proposed PCA methods. The reason for using numerical simulation instead of experimental data is the generality and flexibility of the simulation approach. With the aid of simulation, one can easily and purposefully investigate various situations that are not likely to be encountered in a few experiments. Examples based on experimental data for cluster analysis, however, will be presented in the chapter "Classification of materials". Two pure-component spectra a and b were generated with Gaussian bands, each consisting of 50 points with a unit peak height. The peak width was fixed at 10 points, determined as the number of points between the peak centre and the location with 2% of the peak height. The locations of the centres of the two spectra were taken as 15 and 35, respectively.
3.1. Linear structure

Seventeen mixtures were constructed using a and b, and the compositions of a and b in the mixtures satisfied closure, i.e. the sum of a and b in each mixture was equal to a constant (Table 1); this data set was denoted LIN00. One of the mixtures (No. 2) of LIN00 was turned into an outlier by adding several Gaussian bands to its spectrum, and the data set containing one outlier was thus obtained (LIN01). Then, the spectrum of another mixture (No. 16) of LIN00 was altered in the same manner as that of mixture No. 2, and a data set with two outliers (LIN02) was thus obtained. In a similar way, a third outlier (mixture No. 7) was introduced to form a data set with three outliers (LIN03).
3.2. Planar structure

Seventeen mixtures were generated from the same two components a and b used above, and the relative contents of a and b in the mixtures were designed using the coordinates of points on a circle (Table 2); this simulated data set was denoted CYC00. Then the spectrum of mixture No. 2 was contaminated by several Gaussian bands and used as an outlier. The contaminated data set was called CYC01. In the above simulations, no noise was added.
3.3. Contaminated distribution

Nine error distributions were generated according to a contaminated normal distribution:

noise = (1 − ε)·N(0, σ1²) + ε·N(δ, σ2²)     (11)

The parameters of these distributions are tabulated in Table 3. The first of them is actually the normal distribution and the remaining ones have different heavy tails and skewness. All the generated data sets were treated by both the PP PCA and SVD methods. As the PP PCA algorithm includes a step (Eqn. 3) resembling data mean-centring, the data were mean-centred and variance-normalized before the SVD treatment for the convenience of comparison with the results obtained by PP PCA.
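The simulated spectra of section 3 and the contaminated noise of Eqn. 11 are straightforward to generate; the Python sketch below follows the stated recipe (two 50-point Gaussian bands centred at points 15 and 35, mixtures obeying closure, mixture noise drawn from Eqn. 11). Small details, such as the exact conversion of the 2% peak-width criterion into a Gaussian sigma and the example noise parameters, are choices of this sketch rather than values taken from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
points = np.arange(1, 51)                       # 50-point wavelength axis

def band(center, width=10, height=1.0):
    """Gaussian band; 'width' = points between the centre and the 2% height level."""
    sigma = width / np.sqrt(2 * np.log(1 / 0.02))   # assumed reading of the 2% rule
    return height * np.exp(-0.5 * ((points - center) / sigma) ** 2)

a, b = band(15), band(35)                       # the two pure-component spectra

# LIN00: 17 mixtures whose compositions satisfy closure (c_a + c_b = 1), cf. Table 1
c_a = np.array([0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50,
                0.55, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10])
LIN00 = np.outer(c_a, a) + np.outer(1.0 - c_a, b)   # 17 x 50 data matrix

def contaminated_noise(shape, eps, sigma1, delta, sigma2, rng=rng):
    """Eqn. 11: noise = (1 - eps)*N(0, sigma1^2) + eps*N(delta, sigma2^2)."""
    mask = rng.uniform(size=shape) < eps        # which points come from the tail
    clean = rng.normal(0.0, sigma1, size=shape)
    tail = rng.normal(delta, sigma2, size=shape)
    return np.where(mask, tail, clean)

# illustrative parameter values, not tied to a specific row of Table 3
noisy = LIN00 + contaminated_noise(LIN00.shape, eps=0.2, sigma1=0.001,
                                   delta=0.3, sigma2=0.05)
```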
3.4. The PC directions obtained by SVD and PP PCA
Table 1
Composition of mixtures of data set LIN00

No.    a       b
1      0.10    0.90
2      0.15    0.85
3      0.20    0.80
4      0.25    0.75
5      0.30    0.70
6      0.35    0.65
7      0.40    0.60
8      0.45    0.55
9      0.50    0.50
10     0.55    0.45
11     0.40    0.60
12     0.35    0.65
13     0.30    0.70
14     0.25    0.75
15     0.20    0.80
16     0.15    0.85
17     0.10    0.90
For the linear structure data set LIN00, there should be only one significant principal component after data standardization because of the condition of closure. Furthermore, the first principal component direction should be consistent with the spectrum of mixture No. 9 (0.5:0.5) after scaling. Besides, the linear structure of the data set should be reflected in the result of the principal component analysis. The scatter point plot of the mixtures in the PC1-PC2 plane shows that the points are indeed located on a straight line.
Table 2
Composition of mixtures of data set CYC00

No.    a        b
1      0.000    4.000
2      0.127    3.000
3      1.000    1.354
4      1.000    6.646
5      2.000    0.536
6      2.000    7.464
7      3.000    0.127
8      3.000    7.873
9      4.000    0.000
10     4.000    8.000
11     5.000    0.127
12     5.000    7.873
13     6.000    0.536
14     6.000    7.464
15     7.000    1.354
16     7.000    6.646
17     8.000    4.000
For this noise- and outlier-free data set, the classical PCA has no problem with data illustration. The presence of outlier(s) distorts the principal component directions seriously, and as the number of outliers increases, the number of significant principal components increases, which contradicts the inherent linear structure of the data sets.
Table 3
Parameters of the contaminated distributions (ε, σ1, δ, σ2 of Eqn. 11)

Case   ε      σ1      δ      σ2
a      0.0    0.001   0.0    0
b      0.1    0.001   0.3
c      0.2    0.001   0.3    20
d      0.2    0.001   0.5    50
e      0.2    0.001
f      0.3    0.001   0.1
       0.3    0.001   0.5    50
g      1.0    1.0     50     50
The scatter point plots of the mixtures in the SVD PC1-PC2 plane show that the first principal components of these three data sets with outlier(s) are actually dominated by the outlier(s) (Fig. 1, A). However, the principal components derived from PP PCA are affected only very slightly by the presence of the outlier(s), and the linear structure of the main body of the data is maintained. PP PCA brings another benefit in that the outliers are easily detected from the results of PP PCA (Fig. 1, B). Similar results were obtained for data sets CYC00 and CYC01. The presence of outliers degrades the results of the classical PCA considerably, and the intentionally designed circular structure in the plane no longer exists. In contrast, the PP PCs do not change very much and the circular pattern is retained both with and without the outliers. As stated in the introductory section, the deviation from the ideal normal distribution originates either from the presence of outliers or from the inherent properties of the error distribution. The common error distribution can be regarded as a contaminated normal distribution as formulated by Eqn. 11, where the parameters ε, δ, σ1 and σ2 determine the extent of tailing, skewness and kurtosis. To the linear structure data set LIN00, contaminated distributions with different parameters (Table 3) were added as the noise component.
Figure 1. Comparison of SVD and PP PCA for linear structure data with outliers. Scatter point plots in the PC1-PC2 plane for data sets LIN01, LIN02 and LIN03 (from top to bottom) calculated by SVD (A) and PP PCA (B). The corresponding singular values (λ_i) and projective indices (s_i) are given.
The first row of Table 3, i.e. case a, is actually the uncontaminated normal distribution. Similar results of PCA were obtained with SVD and PP for this case. For the other cases in Table 3, the heavier the contamination, the larger the departure of the results of SVD PCA of the contaminated distribution from those of the normal one. The results of PP are insensitive to the contamination of the distributions. The linear structure of the data set is maintained fairly well in all cases for PP PCA (Fig. 2). The classical PCA is non-robust and sensitive to deviations of the error distribution from the normal assumption, the PC directions being influenced by the presence of outlier(s). In PP PCA, the PC directions are determined by the inherent structure of the main body of the data. By using a robust projective index, the influence of the outliers is substantially reduced. The distorted appearance or misrepresentation of the projected data structure in the PC subspace caused by the presence of outlier(s) can be eliminated in PP PCA. This characteristic feature of PP PCA is essential for obtaining reliable results for exploratory data analysis, calibration and resolution in analytical chemometrics, where PCA is used for dimension reduction.
4. CONSTRAINED BACKGROUND BILINEARIZATION WITH GENERALIZED SIMULATED ANNEALING
Samples containing unexpected interferents are encountered very often in analytical practice. Quantifying the desired analytes in the presence of interferents without using time-consuming separation procedures is an ongoing target of analytical chemists. Osten and Kowalski [29] proposed methods that could provide possible solutions to this kind of problem. A hybrid method combining the generalized standard addition method and the iterative target transformation factor analysis technique was developed in this laboratory [30]. All of these approaches have the intrinsic limitation associated with low order tensor (vector) analytical signals, which cannot provide enough information for obtaining unique solutions. The situation changes dramatically when two-way bilinear data are adopted to attack this kind of problem. Ho et al. [31] introduced the rank annihilation factor analysis (RAFA) method to quantify the desired analytes in the presence of unknown interferents. The iterative procedure of RAFA was modified by Lorber [32], and a direct approach based on the generalized eigenvalue technique was presented.
Figure 2. Scatter point plots in the PC1-PC2 plane for the contaminated data sets a-g of Table 3: left, SVD; right, PP.
Sanchez and Kowalski [33] extended the method to a more universal version and called it generalized rank annihilation factor analysis (GRAFA). Later, Ohman, Geladi and Wold [34,35] developed the residual bilinearization (RBL) method and claimed that RBL could provide better predictions than GRAFA. RBL is an iterative optimization procedure, and there is a problem of reaching the global optimum. In order to avoid getting stuck in local optima, Liang, Manne and Kvalheim [36] suggested limiting the search region with the constraints of positive concentrations and positive (background) spectral intensities, and proposed the constrained background bilinearization (CBBL) method. In CBBL, the modified Powell algorithm was adopted as a more efficient search method. Since the optimized response surface could be very complicated because of its dependence upon the concentrations of the desired analytes and on the residual matrix, it was not absolutely certain that the response surface was convex and that only one optimum existed in the constrained region. In order to reduce the possibility of sinking into local optima and to further improve the CBBL method, a global search algorithm, GSA, was adopted here to replace the Powell algorithm.
4.1. Constrained background bilinearization (CBBL)

The theory of CBBL has been thoroughly discussed elsewhere [36]; the presentation here will be restricted to a clarification of the present method. Any analytical data obtained by hyphenated instruments or by two-way spectroscopic techniques such as excitation-emission fluorescence spectroscopy are bilinear. The bilinear data matrix has a very useful property, namely that the rank of such a matrix obtained for any chemical mixture is equal to the number of chemical components in the mixture. Thus, theoretically, the rank of a data matrix of any pure chemical component is unity. It can be expressed as the product of two vectors:

X = p t^T     (12)
Here X is the bilinear response data matrix of a pure chemical component obtained from a two-way instrument, such as GC/MS, LC/UV or EM/EX etc., without noise. The vectors p and t represent, for instance, the chromatographic concentration profile and the standard spectrum, respectively, for GC/MS and LC/UV, or the emission and excitation spectra, respectively, in fluorescence spectroscopy. In analytical practice, a sample may contain various unexpected components, and the instrument always introduces measurement error and noise. The two-way bilinear response model for an analytical sample with unexpected interferents can be expressed as

Y = Σ_(i=1..N) c_i X_i + Σ_(j=1..M) c_j R_j + E     (13)
In equation 13, Y is the bilinear data matrix of a sample containing some unknown background interferents. X_i (i = 1,...,N) are the bilinear data matrices of the N sought-for analytes and c_i (i = 1,...,N) are their concentrations to be estimated. R_j (j = 1,...,M) represent the bilinear data matrices of the M unexpected interferents, and both their number, M, and their concentrations c_j (j = 1,...,M) are unknown. E denotes the measurement noise matrix. Since the rank of Y can be estimated by a factor-analytical technique with consideration of the experimental noise, the number of unexpected interferents, M, can be obtained easily by subtracting N from the rank of Y. The information on the number of interferents is crucial in this situation; this makes the distinction between matrix calibration and vector calibration. Assuming the bilinear structure of the response, one can factor-decompose the overall background response of the M interferents into the product of two matrices:

R = Σ_(j=1..M) c_j R_j = Σ_(j=1..M) t_j p_j^T     (14)

where t_j and p_j are the orthonormal score and the orthogonal loading vectors, respectively.
In the CBBL method, the concentrations of the sought-for analytes are obtained by an optimization procedure described as follows. If the c_i (i = 1,...,N) are estimated correctly, the background matrix, say Y − Σ c_i^true X_i, can be expressed by equation 14. Under the constraint rank(R) = M, the residual part of the background matrix, E = Y − Σ c_i^true X_i − Σ t_j p_j^T, should be at the same level as the measurement noise. Here M principal components have been extracted. On the other hand, if the c_i (i = 1,...,N) are overestimated or underestimated, the background matrix Y − Σ c_i X_i cannot be expressed accurately by M principal components, since some information from the X_i (i = 1,...,N) would be embedded in it; thus the residual matrix, Y − Σ c_i X_i − Σ t_j p_j^T, should be significantly different from the measurement noise. So the sum of squares of the elements of the residual matrix E is a suitable objective function for this optimization problem, i.e.:

f(c, R) = Σ_(k=1..K) Σ_(l=1..L) e_kl²     (15)
where K and L denote the numbers of measurement points in the two directions of the two-way instrument. In order to locate the global optimum more efficiently, CBBL searches in a limited region derived from the reasonable assumption of positivity of the spectral intensities and of the concentrations of the analytes. The upper limit of the concentrations to be estimated can be derived from the spectral positivity:

c_j^max = min_(k,l) [(y_kl + ε) / x_j,kl],   j = 1,...,N     (16)

where ε is an error bound; the lower limit will be zero, i.e.:

0 ≤ c_j ≤ c_j^max,   j = 1,...,N     (17)
Unfortunately, it is not absolutely certain that the response surface of the objective function in this constrained region is convex. In order to avoid the possibility of sinking into local optima, generalized simulated annealing is adopted for the optimization step.

4.2. Generalized simulated annealing for CBBL

The optimization procedure by GSA has been described in section 2.3 in connection with the PP PCA problem. Here, only some aspects specific to the CBBL algorithm are emphasized. For the minimization of the objective function Φ, here f(c, R) defined by Eqn. 15, GSA is started by computing this objective function, Φ = Φ(x), for a random initial state x. Here a state denotes an estimate of the concentration vector of the N sought-for components. Then, a new random state x' is generated in the neighborhood of the current state by a small perturbation, and its objective function, Φ' = Φ(x'), is calculated together with its difference from Φ, δΦ = Φ − Φ'. The unconditional acceptance of the new state for δΦ ≥ 0 and the acceptance probability (Eqn. 10) for detrimental states are the same as described in section 2.3. Since this optimization procedure is a constrained one, each new state is checked to see whether it lies in the constrained region before proceeding to subsequent steps. It is discarded if it violates the constraints, and a new random state is generated. In order to ensure searching the whole region of the state space, the initial step size is set to about 10% of the size of the constrained region. The termination criterion is formulated as follows. The search is regarded as accomplished if 30 successive steps are not accepted; then the best state produced during the computation process is taken as the initial state and the whole process is started again with a 50% shrinkage of the step size. The computation is terminated when the current step size has been reduced to the size necessary for obtaining sufficient precision of the concentration estimates. In this work, the step size is finally reduced to one tenth of its initial size.
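To make the procedure concrete, the sketch below implements the CBBL objective of Eqn. 15 (squared residuals after removing M principal components from the estimated background matrix) and a constrained GSA loop in Python. The shrink schedule, step sizes and stopping numbers follow the description above only loosely, and the helper names and default parameter values are assumptions of this sketch rather than code from the original work.

```python
import numpy as np

def cbbl_objective(c, Y, X_list, M):
    """Eqn. 15: sum of squared residuals after removing M principal components
    from the estimated background matrix Y - sum_i c_i X_i."""
    B = Y - sum(ci * Xi for ci, Xi in zip(c, X_list))
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    E = B - (U[:, :M] * s[:M]) @ Vt[:M, :]
    return float(np.sum(E ** 2))

def cbbl_gsa(Y, X_list, M, c_max, beta=3.0, phi_exp=0.0, stall_limit=30,
             shrink=0.5, n_shrinks=4, max_iter=5000, rng=None):
    """Sketch of the constrained GSA minimization used in CBBL."""
    rng = np.random.default_rng() if rng is None else rng
    c_max = np.asarray(c_max, float)
    step = 0.1 * c_max                          # initial step ~10% of the region
    c = rng.uniform(0.0, c_max)                 # random start inside the region
    f = cbbl_objective(c, Y, X_list, M)
    c_best, f_best = c.copy(), f

    for _ in range(n_shrinks):
        stall = 0
        for _ in range(max_iter):
            if stall >= stall_limit:            # a cycle ends after 30 rejections
                break
            c_new = c + step * rng.uniform(-1.0, 1.0, size=c.shape)
            if np.any(c_new < 0.0) or np.any(c_new > c_max):
                continue                        # constraint check (Eqns. 16-17)
            f_new = cbbl_objective(c_new, Y, X_list, M)
            dphi = f - f_new                    # delta-phi as defined in the text
            if dphi >= 0:
                accept = True                   # improvement: accept unconditionally
            else:
                p = np.exp(beta * dphi / max(f - phi_exp, 1e-12))  # p < 1 here
                accept = p >= rng.uniform()
            if accept:
                c, f = c_new, f_new
                stall = 0
                if f_new < f_best:
                    c_best, f_best = c_new.copy(), f_new
            else:
                stall += 1
        c, f = c_best.copy(), f_best            # restart from the best state found
        step = step * shrink                    # 50% shrinkage of the step size
    return c_best, f_best
```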
4.3. Analytical systems for CBBL analysis

Numerical simulation and real experimental systems are used for the CBBL treatment to verify the proposed procedures, including the GSA algorithm. In the numerical simulation, a series of hyphenated chromatographic data sets of size 30×30 were generated for several four- and five-component systems. The pure spectra and chromatographic profiles were constructed using Gaussian bands or their combinations. Every two-component combination in the systems was taken in turn as the background interferents, and the remaining components were regarded as the analytes of interest whose concentrations were estimated. All simulations were conducted in the noise-free situation except in the experiments for studying the influence of noise. A four-component system was used in the investigation of the influence factors of GSA. In a similar way, a 50×50 two-way fluorescence EEM of a five-component system was generated. Components d and e were taken as the sought-for analytes with relative concentrations of 0.5 and 1.0, respectively, and the other three were regarded as unknown interferents with the same relative concentration of 0.5. Zero-mean normal random numbers were added to the mixture EEM to simulate the experimental noise, and the standard deviation of the noise was taken to be one-thousandth of the largest value in the mixture EEM. This data set is used for the comparison of GSA with the Powell algorithm.
The real analytical data were obtained by fluorimetric measurements with a Hitachi 850 fluorophotometer. Standard solutions of three organic dyes, rhodamine B, fluorescein and eosin Y, and their mixtures were prepared in 2.5×10^-3 mol/l NaOH. The two-way EM/EX matrices of size 30×30 were recorded in the range 454-570 nm of the excitation spectrum and 484-600 nm of the emission spectrum at 4 nm intervals. The data were processed by the proposed method after subtracting the reagent blanks.

All cases with two different components as interferents for the simulated hyphenated chromatographic analytical systems were investigated by CBBL using GSA. The concentration estimates for the analytes of interest converge near the true values, demonstrating the feasibility of using GSA as the optimization method for CBBL. For the sake of conciseness, the results are not tabulated. The computation process of GSA is random in essence. The initial values should have no influence on the performance of GSA, as shown by the results listed in Table 4. Also, the variation in the concentrations of the components has no substantial effect on the estimated results (Table 5). The influence of noise was inspected and the results are tabulated in Table 6. Noise of moderate level does not degrade the analytical results very much.

The motivation of this work is the suspicion that the response surface within the constrained region in CBBL may not be convex. It has been found that the complexity of the response surface is substantially reduced when the constraints are exerted. For most cases, the response surface within the constrained region can be regarded as convex, hence the superiority of CBBL over RBL. In RBL, the initial estimates of concentration often lie far away from the true values and are located outside the constrained region of CBBL. In some cases, however, the response surface within the constrained region is rather complicated, the simulated five-component fluorescence system being an example of this sort. In this case, the calculated constrained regions for components d and e are 0-0.6449 and 0-1.5469, respectively. This data set was treated by CBBL both with GSA and with the modified Powell algorithm. The coordinate axes were taken as the initial search directions in the Powell algorithm. The results obtained with different initial values and constraints for GSA are tabulated in Table 7. Compared with the Powell algorithm, GSA has some advantages. First, the performance of GSA is influenced only slightly by the initial values because of its intrinsic randomness; this may not
Table 4
Results for simulated chromatographic system c with different initial values

Initial state      Real concentrations    Estimated concentrations
0.0000, 0.0000     0.5000, 1.0000         0.4853, 0.9743
0.2000, 0.2000     0.5000, 1.0000         0.5026, 0.9827
1.0000, 0.0000     0.5000, 1.0000         0.4985, 1.0087
0.0000, 1.0000     0.5000, 1.0000         0.5000, 1.0023
0.5000, 0.5000     0.5000, 1.0000         0.4992, 1.0037
Table 5
Results for simulated chromatographic system c with different concentration profiles

Real concentrations    Estimated concentrations
1.0000, 1.0000         0.9936, 1.0036
0.1000, 1.0000         0.0969, 1.0041
1.8000, 0.1000         1.8015, 0.1023
0.1000, 0.9000         0.0992, 0.9037
5.0000, 0.0500         5.0025, 0.0602
be the case for the Powell algorithm. In principle, the initial search directions in the Powell algorithm can be arbitrarily selected as long as they are linearly independent. For the data set of Table 7, if the initial search directions were chosen as (0.707, 0.707) and (1, 0), with (0.4, 0.4) as the initial values, the Powell algorithm converged on the point (0.6449, 0.6449) (not shown in Table 7). This deviation seems to originate from the adoption of the so-called golden section method in the one-dimensional search, which missed the minimum and regarded the point (0.6449, 0.6449) as the best point in these two directions.
Table 6
Results for simulated chromatographic system c with different noise levels (the sought-for analytes are 1 and 2 and the interferents are 3 and 4)

Standard deviation of noise (σ)    Real concentrations    Estimated concentrations
0.001                              0.5000, 1.0000         0.4994, 0.9988
0.003                              0.5000, 1.0000         0.5005, 1.0069
0.005                              0.5000, 1.0000         0.5035, 0.9737
0.007                              0.5000, 1.0000         0.5098, 1.0057
It cannot be excluded that similar deviations could occur in situations with the coordinate axes as the initial search directions. In GSA, the direction between two successive iterations is a random one, and detrimental states are accepted with a non-zero probability, providing the mechanism that guarantees the global optimum. Second, the concentration constraints introduced in CBBL can be imposed more easily in the GSA computation than in the Powell algorithm. In GSA, the constraints can be used directly to check the admissibility of the perturbed state, while in the Powell algorithm the transfer of the multi-dimensional constraints to the one-dimensional search boundary might not be a trivial task for the quantitation of more than two analytes, as the constrained region is then no longer a rectangle in a plane. Furthermore, a strict constraint is, in fact, not necessary in GSA because of its intrinsic mechanism of walking across local optima. If a constraint is used for computational convenience, a relatively loose one can be chosen. The results with an expanded constrained region for GSA are given in Table 7. It must be pointed out that the computation of GSA is generally more time-consuming than that of the Powell algorithm, though the difference in computation time becomes smaller as the response surface becomes more complicated.
Table 7
Results for simulated fluorescence EEM

Initial values    GSA (A)                        GSA (B)                        Powell
0, 0              0.4937, 0.9920* (0.000959)†    0.4877, 0.9919 (0.001009)      0.5053, 1.0103 (0.001044)
0.4, 0.4          0.4868, 0.9845 (0.001001)      0.4973, 0.9836 (0.001077)      0.5014, 1.0091 (0.001037)
0.6, 1.5          0.4938, 0.9922 (0.000996)      0.4953, 0.9948 (0.000997)      0.4879, 0.9943 (0.001012)

A, results obtained with the constraint of maximum concentrations as calculated by equation 16, i.e., 0.6449 and 1.5699 for d and e, respectively. B, results obtained with an expanded constraint of maximum concentrations of 3 for both d and e.
* Concentrations of the sought-for analytes d and e are 0.5 and 1.0, respectively.
† The corresponding objective function is given in parentheses.
The data of five mixtures of the three dyes were treated by using the proposed method. The error bound, ε, was taken as ten fluorescence intensity units of the Hitachi 850 instrument. One or two of the three components were taken as unknown interferents and the concentrations of the remaining components were estimated. The true concentrations and the relative errors of the concentration estimates are shown in Table 8. It is clear that the estimated results are acceptable.
Table 8
Relative errors of concentration estimates*
(For each of the five dye mixtures, the table lists the relative errors of the concentration estimates obtained by GSA† and by the Powell algorithm‡ for treatments in which one or two of the three components A, B and C were taken as interferents, e.g. B(C), C(B), C(A), A(B,C), B(A,C) and C(A,B).)
* An underlined combination represents one treatment with the component(s) in parentheses taken as interferent(s). † GSA. ‡ Powell.

ACKNOWLEDGEMENT
This work was supported by the National Natural Science Foundation of P.R. China and partly by the Electroanalytical Laboratory of the Changchun Institute of Applied Chemistry, Chinese Academy of Sciences.
REFERENCES
1. H. Chen, R. Gnanadesikan and J.R. Kettenring, Sankhya, Ser. B, 36 (1974) 1.
2. V.J. Clancey, Nature, 159 (1947) 339.
3. E.K. Harris and D.L. DeMets, Clin. Chem., 18 (1972) 605.
4. P.J. Huber, Robust Statistics, John Wiley & Sons, New York, 1984.
5. R.A. Maronna, Ann. Statist., 4 (1976) 51.
6. S.J. Devlin, R. Gnanadesikan and J.R. Kettenring, J. Am. Stat. Assoc., 76 (1981) 354.
7. G. Li and Z. Chen, J. Am. Stat. Assoc., 80 (1985) 759.
8. G.R. Phillips and E.M. Eyring, Anal. Chem., 55 (1983) 1134.
9. R. Wolters and G. Kateman, J. Chemometrics, 3 (1989) 329.
10. P.J. Rousseeuw, J. Am. Stat. Assoc., 79 (1984) 871.
11. D.L. Massart, L. Kaufman, P.J. Rousseeuw and A. Leroy, Anal. Chim. Acta, 187 (1986) 171.
12. Y. Hu, Chemom. Intell. Lab. Sys., 9 (1990) 31.
13. P.J. Rousseeuw, J. Chemom., 5 (1991) 1.
14. Y. Xie, J. Wang, Y. Liang and R. Yu, Anal. Chim. Acta, 269 (1992) 307.
15. J.B. Kruskal, in: R.N. Shepard, A.K. Romney and S.B. Nerlove (eds.), Multidimensional Scaling: Theory and Applications in the Behavioral Sciences, Vol. 1, Theory, London, 1972.
16. R.E. Bellman, Adaptive Control Processes, New York, 1961.
17. J.H. Friedman and J.W. Tukey, IEEE Trans. Computers, C-23 (1974) 881.
18. J.H. Friedman and W. Stuetzle, J. Am. Stat. Assoc., 76 (1981) 817.
19. P.J. Huber, Ann. Statistics, 13 (1985) 435.
20. K.R. Beebe and B.R. Kowalski, Anal. Chem., 60 (1988) 2273.
21. S. Kirkpatrick, C.D. Gelatt, Jr. and M.P. Vecchi, Science, 220 (1983) 671.
22. I.O. Bohachevsky, M.E. Johnson and M.L. Stein, Technometrics, 28 (1986) 209.
23. J.H. Kalivas, Chemom. Intell. Lab. Sys., 15 (1992) 1.
24. J.H. Kalivas, Chemom. Intell. Lab. Sys., 15 (1992) 127.
25. J.H. Kalivas, N. Roberts and J.M. Sutter, Anal. Chem., 61 (1989) 2024.
26. J.H. Kalivas, J. Chemom., 5 (1991) 37.
27. N.E. Collins, Amer. J. Math. Manag. Sci., 8 (1988) 209.
28. Y. Xie, J. Wang and R. Yu, Chemical Journal of Chinese Universities, 14 (1993) 174.
29. D.W. Osten and B.R. Kowalski, Anal. Chem., 57 (1985) 908.
30. Y. Liang, Y. Xie and R. Yu, Chinese Science Bulletin (English Edition), 34 (1989) 1533.
31. C.N. Ho, G.D. Christian and E.R. Davidson, Anal. Chem., 50 (1978) 1108.
32. A. Lorber, Anal. Chim. Acta, 164 (1984) 293.
33. E. Sanchez and B.R. Kowalski, Anal. Chem., 58 (1986) 496.
34. J. Ohman, P. Geladi and S. Wold, J. Chemometrics, 4 (1990) 79.
35. J. Ohman, P. Geladi and S. Wold, J. Chemometrics, 4 (1990) 135.
36. Y. Liang, R. Manne and O.M. Kvalheim, Chemom. Intell. Lab. Syst., 12 (1992) 646.
Chapter 4
Kalman filter quantitative resolution of overlapped shifted peaks after optimal alignment by simulated annealing

Taddeo Rotunno

Dipartimento di Chimica, Università degli Studi di Bari, Via E. Orabona 4, 70126 Bari, Italy
1. INTRODUCTION

Curve resolution procedures are frequently used in analytical chemistry for the quantitative resolution of each individual component when the analytical response arises from the overlapping of several signals. The Kalman filter [1] in particular has recently been used successfully for solving various problems related to the overlapping of several signals [2-4]. This algorithm is a linear least-squares estimator aimed at obtaining the contribution of each component to the overlapped response by using the pure component signals in the filter model. The applicability of the Kalman filter requires an accurate knowledge of the response of each component and an efficient procedure for background removal. Background subtraction has recently been treated with cubic spline polynomials [5,6] as smoothing interpolators between peak valleys, and this has proved to be efficient for baseline resolution, particularly for very low signal-to-noise ratios [7]. The essential requisite for the use of the Kalman filter is the linearity of the model; i.e. the overall response must be a linear combination of the component signals. Because of this constraint, the alignment of each component signal with respect to the overall signal is critical. In other words, the correct application of the Kalman filter to the resolution of overlapped responses requires coincidence of the signals on the axis of the independent variable in the scans of each pure component and of the mixture, at least in the spectral window of the composite signal chosen for the deconvolution [8,9].
The signals of many analytical techniques may suffer significant shifts on the scale of the independent variable because of instrumental drifts, the inability to match backgrounds, etc. For instance, incorrect modelling due to inexact alignment may easily occur in the resolution of chromatographic peaks because of the relatively poor reproducibility of retention times, especially with gradient elution techniques. Another technique where the occurrence of unpredictable shifts of the transition bands may significantly affect the reliability of the measurement is Electron Spectroscopy for Chemical Analysis (ESCA). This is a powerful technique for surface analysis. The characteristic transition bands of the elements on the sample surface, promoted by X-ray excitation, contain important information on both the quantification of the surface elements (through the peak areas or the peak heights) and the speciation of the surface elements (through the chemical shifts exhibited by the bands on the binding-energy scale). Very often, though, the samples analysed by ESCA are insulators. Since ESCA involves the emission of electrons from the sample, the insulator surface is left positively charged after exposure to the X-ray source. The main consequence of this surface charging is the shifting of the bands on the binding-energy scale, in addition to the shifts due to instrumental instabilities. Thus, complicated calibration procedures are required for a correct interpretation of the chemical states. Moreover, the consequent lack of collinearity between the pure and overlapped bands diminishes the suitability of the Kalman filter, making the deconvolution of overlapped spectra hardly feasible. The correction for multiple spectral shifts in the quantitative resolution of overlapped multi-component systems is not straightforward, and there are very few computational procedures that cope with this problem. Examples of the resolution of overlapping peaks for multicomponent analysis when the composite spectrum is badly modelled because of various sources of error, but not for the type of error arising from incorrect alignment, have been reported in the literature [10-18]. In these instances the adaptive Kalman filter has been applied. It uses the properties of the noise sequences defined in the filter model to allow changes in the model so that the experimental data are better fitted. Adaptive filtering strictly applies to situations where the model is correct for a portion of the response and the errors are attributable to a single component. Among the different sources of error that may be compensated for by this digital procedure is the error in the model arising from small shifts suffered by the component spectra in mixtures. Rutan et al. [17] have recently described an iterative application of the adaptive Kalman filter to correct for small shifts of the fluorescence responses of some polyaromatic hydrocarbon mixtures. In this case, however, the exact wavelength positioning relative to the analyte composite spectrum has to be known for
each component. All these requirements may not be met in the case of the model error due to the loss of collinearity of the component and mixture peaks, because of the unpredictable direction and magnitude of the error. This is a typical situation found in gradient elution chromatography and in electron spectroscopy, where the random peak shifts may be so large as to alter the order of occurrence of the component peaks in the mixture envelope. The computational procedure described here is capable of finding the best alignment of the pure component peaks with respect to the composite peaks, regardless of the extent of the relative shifts and the order of occurrence of the components, so that the contribution of each component can be correctly estimated. The program uses the ordinary Kalman filter to resolve the composite peak by utilising, as model, the component signals aligned according to trial values of the peak position parameters. A minimisation procedure, based on the simulated annealing algorithm, iteratively searches for the position of the component peaks corresponding to the minimum on the surface of an error function. Under the hypothesis that the error in the model is mainly due to an incorrect alignment of the component peaks, the program converges to the optimal alignment. The procedure has been applied to the resolution of synthetic overlapped spectra as well as to real overlapped peaks in liquid chromatography and in ESCA.
2. THE KALMAN FILTER

The Kalman filter, a linear optimal estimator operating recursively on digital data, has found many applications in analytical chemistry [19-22], including the resolution of overlapped responses in voltammetric [8], spectrometric [19,22] and chromatographic [15,20] experiments. Provided that an accurate model is available for the individual components contributing to the overlapped response, accurate estimates of the concentration of each component can be obtained by the application of this algorithm. The theoretical aspects of the discrete Kalman filter have been thoroughly treated by Brown [8] and Rutan [3].
2.1. Basics

Briefly, the discrete Kalman filter applies only to linear models whose behaviour can be coded by two equations describing the system dynamics (Eqn.(1)) and the measurement process (Eqn.(2)), respectively.
X(k) = F(k|k-1)·X(k-1) + W(k)    (1)

Z(k) = H^T(k)·X(k) + v(k)    (2)
The system dynamics equation defines the model for the propagation of the state vector X(k) (which comprises the best estimates of the n parameters describing the system state) between the discrete measurement intervals 1, ..., k-1, k, ..., m. Its formulation makes use of the n×n state transition matrix F(k|k-1), which describes how the state vector propagates between the (k-1)th and kth measurement intervals. The measurement process equation connects the kth measurement of the observable, Z(k), to the current state vector through the n-dimensional observation vector H(k), which describes the dependence of the measurement on the state. The n×1 vector W(k) and the scalar v(k) are the noise contributions to the system and the measurement, respectively. The assumptions that need to be met in order to obtain minimum-variance parameter estimates are that both W(k) and v(k) be independent white-noise sequences with zero mean. When the state vector is deterministic (not randomly affected by system error) and invariant with respect to the independent variable (e.g. potential, wavelength, time, etc.), the simplifications W(k) = 0 and F = I, where I is the identity matrix, are possible in Eqn.(1). If Y is the overlapped spectral response formed by two measurable component spectra y1 and y2, the system dynamics (Eqn.(1)) becomes

[c1(k), c2(k)]^T = I · [c1(k-1), c2(k-1)]^T

and the measurement process (Eqn.(2)) becomes

Y(k) = [y1(k)  y2(k)] · [c1(k), c2(k)]^T + v(k)
where c1 and c2 are the relative contributions of the components y1 and y2 to the mixture. Once the model has been properly defined, the application of the Kalman filter is straightforward. The algorithm consists of the five equations of Table 1. Equations (3) and (4) predict the state vector X and the error covariance matrix of the state parameters, P, on the basis of the values estimated at the previous (k-1)th measurement and prior to assimilating the new kth measurement. The n×n error covariance matrix P is defined by P(k) = E{[X(k) - X̂(k)]·[X(k) - X̂(k)]^T}, where E{} is the expected value, and X(k) and X̂(k) are the true and the estimated state vectors at the same measurement point, respectively.

Table 1
The Kalman filter algorithm

(State estimate extrapolation)    X(k|k-1) = F(k,k-1)·X(k-1|k-1)    (3)

(Error covariance extrapolation)  P(k|k-1) = F(k,k-1)·P(k-1|k-1)·F^T(k,k-1) + Q(k-1)    (4)

(Kalman gain)                     K(k) = P(k|k-1)·H(k)·[H^T(k)·P(k|k-1)·H(k) + R(k)]^-1    (5)

(State estimate update)           X(k|k) = X(k|k-1) + K(k)·[Z(k) - H^T(k)·X(k|k-1)]    (6)

(Error covariance update)         P(k|k) = P(k|k-1)·[I - K(k)·H^T(k)]    (7)
The diagonal elements of P are the variances of the estimated parameters. The n×n matrix Q(k) is the covariance of the system noise, defined by E{W(j)·W^T(k)} = Q(k)·δjk, where δjk = 1 for j = k and δjk = 0 otherwise. The prior estimates of X and P allow the n×1 Kalman gain K to be calculated through Eqn.(5), where R(k) is the variance of the kth measurement. It is worth noting that the elements of K(k) are computed as the optimal filter weights producing the minimum-variance fit of the model to the data. Moreover, the inverted quantity in Eqn.(5) is a scalar, so only its reciprocal must be calculated. Equation (6) updates the current state vector by adding a correction term, given by the difference between the actual measurement Z(k) and the predicted measurement H^T(k)·X(k|k-1), weighted by the Kalman gain vector. Similarly, the Kalman gain is used to estimate the updated current error covariance matrix, as in Eqn.(7).
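As a concrete illustration, the following is a minimal sketch of the recursion of Eqns (3)-(7) for the time-invariant, deterministic model used in this chapter (F = I, W(k) = 0 and therefore Q = 0), written in Python with NumPy. The function and variable names are illustrative only, not those of the original program, and the initial guesses X(0) and P(0) discussed below are supplied as simple defaults.

```python
import numpy as np

def kalman_resolve(Y, components, p0=1.0e3, R=1.0):
    """Kalman filter resolution of an overlapped response (Eqns (3)-(7), F = I, Q = 0).

    Y          : (N,) measured composite spectrum Z(k)
    components : (N, n) pure-component spectra; row k is the observation vector H(k)
    Returns the concentration estimates X and their error covariance matrix P.
    """
    N, n = components.shape
    x = np.zeros(n)              # X(0): initial guesses for the component contributions
    P = np.eye(n) * p0           # P(0): large diagonal elements, i.e. low initial confidence
    for k in range(N):
        h = components[k]
        # Eqns (3)-(4): with F = I and Q = 0 the extrapolation leaves x and P unchanged.
        # Eqn (5): Kalman gain; the bracketed quantity is a scalar, so only a reciprocal is needed.
        K = P @ h / (h @ P @ h + R)
        # Eqn (6): state update using the innovation Z(k) - H^T(k)·X(k|k-1).
        x = x + K * (Y[k] - h @ x)
        # Eqn (7): error covariance update.
        P = P @ (np.eye(n) - np.outer(K, h))
    return x, P
```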
Applying Eqns (3)-(7) recursively to each measurement point of a spectrum constitutes the Kalman filter algorithm. To start the first iteration, i.e. for k = 1, some initial guesses for X(0) and P(0) are required, which allow the predicted estimates X(1|0) and P(1|0) to be obtained prior to filtering the first measurement.

2.2. Program overview
The iterative procedure for finding the optimal alignment starts (see Figure 1) with the request of initial values of both the state parameters refined by the Kalman filter (i.e. X(0), the concentrations of the components in the mixture, and P(0), the diagonal elements of the covariance matrix) and the parameters refined by the optimization procedure (i.e. the position parameters Tj, j = 1, ..., n, where n is the number of components in the mixture). The Tj values represent the co-ordinates on the abscissa scale of the peak maximum of each component with respect to a fixed point on the composite spectrum, namely the beginning of the chosen spectral window. Moreover, the Tj are integer quantities expressed in terms of number of data points. The following steps are carried out:
1) The subroutine SHIFT shifts each pure component peak, according to the current values of Tj, in the spectral window. The composite spectrum is always held fixed.
2) The KALMAN subroutine accomplishes the Kalman filter deconvolution, utilising in the filter model the pure component peaks shifted by SHIFT. The concentrations obtained for each component by the Kalman filter deconvolution, although they are the optimal estimates obtainable by the Kalman filter for that model, may be erroneous because of an incorrect alignment. Consequently, the calculated overall response will be misshaped with respect to the experimental mixture spectrum.
3) The subroutine ERSUM computes the error function, whose minimum should correspond to the best estimate of the Tj parameters and to the best Kalman filter performance.
4) The subroutine OPTIM manages the iterative optimization procedure to find the values of the position parameters Tj corresponding to the minimum of the error surface. Three optimization methods were optionally used in this subroutine: simplex, steepest descent and simulated annealing.
Steps 1) to 4) are repeated until convergence of the program is achieved. If the error in the model is mainly due to incorrect peak alignment, then the refined values of Tj at convergence will correspond to the co-ordinates of the component peaks with the minimum error in the model, and the last Kalman filter deconvolution will yield the best estimates of the concentrations of the components in the mixture.
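A schematic sketch of this loop is given below, with the simulated-annealing variant of OPTIM shown as a simple Metropolis-style acceptance rule. The helper names (shift_peaks, errsum, align_and_resolve) are illustrative only, kalman_resolve is the sketch given above, the error function anticipates Eqn (8), and the cooling schedule is simplified with respect to the one described in Section 2.3.

```python
import numpy as np

def shift_peaks(components, shifts):
    """SHIFT: displace each pure-component column by an integer number of channels.
    (For simplicity the Tj are treated here as shifts relative to the stored positions.)"""
    return np.column_stack([np.roll(components[:, j], int(s)) for j, s in enumerate(shifts)])

def errsum(Y, components, shifts):
    """ERSUM: error function S for one trial alignment (cf. Eqn (8)); also returns the
    concentration estimates from the KALMAN step."""
    model = shift_peaks(components, shifts)
    c, P = kalman_resolve(Y, model)                    # KALMAN (sketch above)
    Ycalc = model @ c
    w = 1.0 / np.clip(Y * Ycalc, 1e-12, None)          # Wk = 1/(Yk_meas * Yk_calc), guarded
    S = np.sum(w * (Y - Ycalc) ** 2) + np.sum(np.log(np.diag(P)))
    return S, c

def align_and_resolve(Y, components, n_iter=500, t=0.8, step=5, seed=None):
    """OPTIM loop: search the peak shifts minimising S, then return the Kalman estimates
    obtained with the best alignment found (simulated-annealing version)."""
    rng = np.random.default_rng(seed)
    n = components.shape[1]
    cur = np.zeros(n, dtype=int)
    S_cur, c_cur = errsum(Y, components, cur)
    best, S_best, c_best = cur.copy(), S_cur, c_cur
    for _ in range(n_iter):
        trial = cur + rng.integers(-step, step + 1, size=n)       # random move of the Tj
        S_new, c_new = errsum(Y, components, trial)
        # Metropolis-style acceptance: keep improvements, occasionally keep detrimental moves.
        if S_new < S_cur or rng.random() < np.exp(-(S_new - S_cur) / t):
            cur, S_cur = trial, S_new
            if S_new < S_best:
                best, S_best, c_best = trial.copy(), S_new, c_new
        t *= 0.99                                                 # simple geometric cooling
    return best, c_best
```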
[Figure 1 flow chart: acquire spectra (acquisition, handling, smoothing and baseline processing) → INPUT X(0), P(0), Tj(0) → OPTIM calculates new values of Tj → SHIFT shifts the peaks → KALMAN performs the deconvolution → ERSUM calculates the error function S → output of the results.]
Figure 1. Block diagram of the computer program.

In the present application of two combined techniques, the Kalman filter and the minimization procedure, an appropriate error function should be chosen such that the values of Tj at convergence correspond as closely as possible to the model with the minimum error and to the concentration estimates with the minimum variances. The error function employed is the sum of two terms:

S = Σ(k=1..N) Wk·(Yk^m − Yk^c)^2 + Σ(j=1..n) log(Pjj)    (8)
The first term is the pointwise weighted sum of the squared differences between the experimental mixture spectrum Yk^m and the overall response Yk^c calculated with the component concentrations obtained by the Kalman filter deconvolution at the current iteration. The sum extends over the N data points comprised in the chosen spectral window. Each residual is weighted by a factor Wk = 1/(Yk^m·Yk^c), inversely proportional to the modified square of the response [23]. The second term is the sum of the logarithms of the diagonal elements of the error covariance matrix, i.e. the variances of the n concentration values obtained by the current Kalman filtering. The simplex algorithm used was the modified Nelder-Mead [24] version. As for the steepest descent method, we used the program CFT4A developed by L. Meites [25], slightly modified to manage integer parameters and adapted to the present application.

2.3. The simulated annealing algorithm

Simulated annealing is a technique well suited to solving large-scale combinatorial optimization problems. Recently Kalivas et al. [26] have applied a modification of this
method, the generalized simulated annealing, as a global optimum location technique for multidimensional continuous functions. The theory and applications of this algorithm have been illustrated in the literature [27-29]. To apply the method, the configuration of the system must first be defined through the state parameters and an objective function S, whose minimization is the goal of the procedure. Then a control parameter t is introduced, which takes the role of the product kB·T of the Boltzmann factor in the physical annealing process. For a given value of t and an initial configuration of the system, a random perturbation of the state parameters is generated and the variation ΔS = Sc − So is evaluated, where So and Sc are the values of the objective function corresponding to the initial and the current perturbed states, respectively. If ΔS < 0, the new configuration is accepted and a new perturbation is generated; if ΔS > 0, the new configuration with a higher value of S may still be accepted, with a probability P calculated according to the Metropolis [30] criterion: P(Sc) = exp[−ΔS/(t·(Sc − S*))], where S* is the assumed global minimum of S (it may be zero). The criterion for acceptance of such a detrimental (ΔS > 0) perturbation is based on the generation of a random number p, drawn from a uniform distribution in the interval (0,1). If P(Sc) > p, the new configuration is accepted; otherwise a new random perturbation is generated, and the process repeats. The implication of the Metropolis acceptance test is that unfavourable states of the system are possible, but the probability of accepting one decreases as ΔS becomes larger and t becomes smaller. After a series of moves (whose number depends on the annealing schedule) the
control parameter t is lowered, thereby narrowing the acceptance criterion for detrimental states. In the present application, the state parameters were the co-ordinates Tj of the component peak maxima (see the program overview). The initial configuration of the system state was defined by guessed values of the Tj parameters and by the associated relative uncertainties dj (dj < 1), which establish the fraction dj·N of the spectral window, encompassing each value of Tj, within which the new values of the state parameters are drawn. The initial value of the control parameter t was set to 0.8, as this value was experimentally found to allow acceptance of 70%-90% of the unfavourable perturbations at the beginning of the procedure. After a certain number of successful moves, t was lowered by 10% of its previous value, thus diminishing the probability of acceptance of detrimental states. At a fixed value of t, new values of the Tj were generated by the relation

Tj = (Tj° − dj·N/2) + dj·N·rj

where N is the number of data points in the spectral window, rj is a random number picked from a N(0,1) distribution, and the Tj° are the co-ordinates of the last local minimum of S. If no successful move occurred, the intervals dj·N were reduced by 20% and the process was repeated with the same value of t. When five successive trials at the same value of t gave no successful move of the state system, the annealing procedure ended. This annealing schedule ensured convergence of the procedure.

3. APPLICATIONS

3.1. Resolution of synthetic spectra
This computational approach of finding the optimal alignment for the Kalman filter resolution of overlapped shifted spectra by the simulated annealing algorithm has been tested on simulated overlapped spectra obtained by linear combination of Gaussian-Lorentzian curves, synthetically generated using the mathematical model described by Eqn.(9):

f(x) = hp · [1 + r·(x − x0)^2/b^2]^-1 · exp{−(1 − r)·[ln2·(x − x0)^2]/b^2}    (9)

where x0 is the peak centre, b is 1/2 FWHM (half of the full width at half maximum), hp is the peak height, and r is the mixing ratio, which takes the value 1 for a pure Lorentzian peak and 0 for a pure Gaussian peak. Gaussian noise of 3% RSD was added to all the simulated spectra.
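For reference, a minimal sketch of this peak model and of the construction of a noisy synthetic mixture is given below (Python/NumPy). The function names and the example band parameters are illustrative, and the noise term assumes that the 3% RSD is applied proportionally to the signal.

```python
import numpy as np

def gl_peak(x, x0, b, hp, r):
    """Gaussian-Lorentzian band of Eqn (9): r = 1 gives a pure Lorentzian,
    r = 0 a pure Gaussian; b is half of the FWHM and hp the peak height."""
    d2 = (x - x0) ** 2 / b ** 2
    return hp * np.exp(-(1.0 - r) * np.log(2.0) * d2) / (1.0 + r * d2)

def synthetic_mixture(x, bands, coeffs, rsd=0.03, seed=None):
    """Linear combination of GL bands with proportional Gaussian noise (3% RSD).
    bands is a list of (x0, b, hp, r) tuples; coeffs are the mixing coefficients."""
    rng = np.random.default_rng(seed)
    y = sum(c * gl_peak(x, *p) for c, p in zip(coeffs, bands))
    return y * (1.0 + rsd * rng.standard_normal(x.size))

# Example: two identical Gaussian bands separated by half of their FWHM.
x = np.arange(400.0)
mix = synthetic_mixture(x, [(150.0, 10.0, 1.0, 0.0), (160.0, 10.0, 1.0, 0.0)], [1.0, 1.0])
```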
Figure 2. Error trend of the estimated contribution of one component in the Kalman filter resolution of two identical overlapped Gaussian bands vs. the model error due to the displacement of that component with respect to its true position. Curves 1, 2, 3 and 4 refer to errors in the deconvolution of envelopes with separations of 1/4, 1/2, 3/4 and 1 of the full width at half maximum (FWHM), respectively. The displacement unit was 1/100 of the FWHM.

Figure 2 shows the errors in the Kalman filter deconvolution of overlapped signals in the presence of incorrect alignment of the pure component peaks with the envelope. The curves refer to the errors in the estimate of one component concentration, in the resolution of two similar overlapped Gaussian curves and for different degrees of overlap, vs. the displacement of that component, on the abscissa axis, from its true position. The error, quite remarkable for the higher degrees of overlap, is significant even for an acceptable separation of the overlapped components. The general behaviour of simulated annealing in correcting for shift errors has been evaluated by comparing the performance of three optimization procedures (simplex, steepest descent and simulated annealing) in the resolution of two- and three-component overlapped synthetic band systems.
Figure 3. Synthetic Gaussian-Lorentzian curves. The values of the Gaussian-Lorentzian mixing factor r in Eqn.(9) were 0.3, 0.5 and 0.7 for curves 1, 2 and 3, respectively. The FWHM of curves 1B and 2B was doubled with respect to 1A and 2A. The peak maxima co-ordinates x0 were 150, 200 and 250 for curves 1, 2 and 3. (••••••) linear combination (unitary coefficients) of the three curves. (From Fresenius J. Anal. Chem. (1993) 345, 490, with permission.)

The Gaussian-Lorentzian curves in Figure 3, simulating ESCA peak doublets, were generated using always the same values of the parameters of Eqn.(9), except that the FWHM of curves 1B and 2B was doubled with respect to 1A and 2A, to give rise to a greater degree of overlap in the B convolutes. The examined models were obtained by the following linear combinations: [1A+2A], [1A+2A+3A], [1B+2B], [1B+2B+3B]. The parameters used for evaluating the performance of the optimization algorithms were the errors on the estimates of each pure component contribution and the computer time required by the procedure to achieve the optimal alignment, measured as the number of times the Kalman filter was executed until convergence of the program. The synthetic mixture curves were resolved using as starting co-ordinates Tj(0) a set of values not corresponding to the true positions of the pure component bands. Tables 2A-C summarize the results of the comparison.
Table 2
Kalman filter resolution of some linear combinations of the curves in Figure 3, after location of the optimal alignment by: (A) simplex, (B) steepest descent, (C) simulated annealing. The coefficients of the linear combinations were equal to 1. The true peak maxima co-ordinates were 150, 200, 250. (From Fresenius J. Anal. Chem. (1993) 345, 490, with permission.)
[For each model ([1A+2A], [1A+2A+3A], [1B+2B] and [1B+2B+3B], each run from two different sets of starting co-ordinates T1(0), T2(0), T3(0)) the table lists the optimum co-ordinates T1, T2, T3, the % errors on the contributions C1, C2, C3, and the number of iterations; n.c. = no convergence.]
As for the two-component models, simplex was the fastest to reach convergence, but it was also the most unreliable in estimating the component contributions. The steepest descent method gave results with the same good accuracy as those obtained by simulated annealing, but in fewer iterations. Generally, for the two-component systems, the steepest descent algorithm offered the best compromise between accuracy and computer time. For the three-component models, simplex was fast in those cases where it attained convergence, although it yielded poorly accurate results. Most frequently simplex was unable to
converge, exhibiting overflow errors, probably in the attempt to examine too large or too small values of the Tj parameters. The steepest descent method was always able to converge, but the accuracy of the results depended to some extent on the initial values of the peak co-ordinates Tj(0). Simulated annealing, although the slowest, showed a superior ability to converge to the correct alignment, thus producing a more acceptable accuracy of the results (the maximum error was less than 7%). Both the number of iterations of the procedure and the accuracy of the results obtained by simulated annealing were insensitive to the initial guesses of the parameters. Finally, the difference in the number of iterations between the steepest descent (or simplex) method and the simulated annealing method became smaller as the number of components in the model increased. Thus, it may be convenient to use simulated annealing for the sake of more reliable results at the expense of more computer time.
Figure 4. Three-dimensional graph of the error function, Eqn.(8), for different combinations of the two peak maxima co-ordinates in the Kalman filtering of the combined model [1A+2A]. The error function has been inverted for graphical enhancement. (From Fresenius J. Anal. Chem. (1993) 345, 490, with permission.)

Figures 4 and 5 show typical three-dimensional plots of the reciprocal of the error function, Eqn.(8), obtained during the iterative search of the optimal alignment for the resolution of the envelope models [1A+2A] and [1B+2B], for all the combinations of the parameters T1 and T2 in the interval from 120 to 220 (data points or channels).
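Surfaces such as those of Figures 4 and 5 can be generated simply by evaluating the error function on a grid of trial positions. The short sketch below reuses the hypothetical errsum helper from the Section 2.2 sketch (which takes the trial shifts of the two components); the figures plot the reciprocal of the resulting values.

```python
import numpy as np

def error_surface(Y, components, t_range):
    """Evaluate the error function S (Eqn (8)) over a grid of trial positions (T1, T2)."""
    grid = np.empty((len(t_range), len(t_range)))
    for i, t1 in enumerate(t_range):
        for j, t2 in enumerate(t_range):
            grid[i, j], _ = errsum(Y, components, np.array([t1, t2]))
    return grid   # plot 1/grid for the "inverted" representation used in the figures
```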
Figure 5. Three-dimensional graph of the error function, Eqn.(8), for different combinations of the two peak maxima co-ordinates of the pure components in the Kalman filtering of the combined model [1B+2B]. The error function has been inverted for graphical enhancement. (From Fresenius J. Anal. Chem. (1993) 345, 490, with permission.)

For the model [1A+2A] a clearly defined minimum was located in correspondence with the true co-ordinates (150, 200). For this relatively simple model, the three optimization methods performed reliably, falling very close to the minimum. In the case of the more poorly separated model [1B+2B], the error surface displayed an elongated region comprising two minima, the deeper of which was positioned very close to the true co-ordinates. Out of five optimization runs performed on the model [1B+2B], each with different initial values of Tj(0), simplex and steepest descent converged to the local minimum three and four times, respectively, while simulated annealing always converged to the global minimum.
3.2. Resolution of HPLC chromatograms

Alternaria, one of the most common moulds contaminating foods and feeds, produces many metabolites with different chemical structures. Two metabolites of major concern, because of their toxicity and/or natural occurrence in a number of fungus-contaminated commodities, are the dibenzo-α-pyrone derivatives alternariol (AOH) and altenuisol (ATS).
Recently an HPLC-diode array UV method capable of the simultaneous determination of AOH and ATS in maize, rice and tomato samples has been reported [31,32].
Figure 6. Typical chromatograms for the sequential injections of standards of AOH (60 ng injected, curve f) and ATS (100 ng injected, curve e) and of their mixtures. The AOH/ATS ratios (ng/ng injected) are: (a) 60/150; (b) 30/100; (c) 120/50; (d) 180/50. Dashed lines represent the individual components (1 = AOH, 2 = ATS) resolved after alignment and Kalman filtering. (From Chromatographia (1992) 34(1/2), 56, with permission.)

The use of diode-array detection has greatly improved peak identification, peak purity assessment and quantification. However, there are still some problems to be solved relating to peak purity assessment (compounds possessing identical spectral characteristics may invalidate peak purity criteria) and peak quantification (poorly resolved peaks cannot be quantified by the
techniques normally available with integrators). This is the case for the simultaneous determination of AOH and ATS, which are only slightly separated by reversed-phase chromatography even under gradient elution conditions. In such a case a chemometric approach to peak deconvolution, such as the Kalman filter, is essential. Figure 6 shows typical chromatograms for sequential injections of mixtures with different amounts of AOH and ATS. Using the Foley [33] apparent valley-to-peak ratio as an estimate of peak separation, values of 58% and 72% for a mixture of 104 : 125 ng of AOH and ATS, respectively, were calculated as the best case. The peak of AOH was visible in the mixture up to 30 : 300 ng of AOH and ATS, with apparent ratios of less than 80%. The peak of ATS was only slightly visible, with a ratio of about 95%, up to mixtures of 120 : 50 ng of AOH and ATS, and disappeared for the mixture of 180 : 50 ng of AOH and ATS, although a distinct inflection point on the fused peaks was visible. It is worth noting in Figure 6 a variability of about 20 sec in the retention times of both the fused peaks and the resolved component peaks. Also for the injections of the pure AOH and ATS samples, a variation of up to 25 sec in the retention times (which corresponds to twice the full width at half maximum) was observed. The randomness in the peak positions of the pure components and mixtures made it necessary to find the optimal alignment by simulated annealing before the Kalman filter resolution. The described computational sequence of aligning the component peaks according to the previous estimates of Tj, performing the resolution of the overlapping peaks, calculating the error function, checking for the minimum of the error surface, and estimating the values of Tj yielding a better Kalman filter resolution was repeated until convergence. The values of Tj yielding the minimum on the error surface corresponded to the positions of the component peaks with the minimum error in the model, and the Kalman filter resolution at convergence produced the best estimates of the contributions of the components in the mixture. The three-dimensional and contour plots of the reciprocal of the error function (Eqn.(8)) vs. the shift parameters, obtained for two different mixtures of AOH and ATS, are shown in Figures 7 and 8. For the mixture of Figures 7a and 8a, a clearly defined minimum was located at T1 = 28 and T2 = 52. After 22 iterations (starting from T1 = T2 = 40) the program converged to this minimum, where the Kalman filter resolution calculated injected amounts of 120.2 and 50.4 ng for AOH and ATS, respectively, very close to the true known amounts of 120 and 50 ng. In the case of the resolution of a more heavily overlapped mixture of AOH and ATS (Figures 7b and 8b) the error surface displayed an elongated region of minima. Even in this unfavourable condition the program converged, after 48 iterations (starting from T1 = T2 = 30), to T1 = 18 and
T2 = 32; the calculated amounts were 62.4 and 144.4 ng of AOH and ATS, respectively, compared to the true amounts of 60 and 150 ng.
Figure 7. Three-dimensional graphs of the error functions corresponding to the iterative resolution of (a) the mixture peak c of Figure 6 and (b) the mixture peak a of Figure 6, respectively. The error functions have been inverted for graphical enhancement. T1 and T2 are the position parameters for AOH and ATS, respectively. (From Chromatographia (1992) 34(1/2), 56, with permission.)

Figure 8. Contour plots for the graphs of Figures 7a and 7b, respectively. The contour values 1 to 5 correspond to equally spaced values from the minimum to the maximum of the inverse of the error function. (From Chromatographia (1992) 34(1/2), 56, with permission.)
The above findings proved that the error function defined in Eqn.(8) works properly, giving reliable convergence to the minimum at an acceptable speed. On the other hand, an iterative procedure using as optimization function only the second term, Σ log(Pjj), of Eqn.(8) yielded unsatisfactory convergences. In fact, although this error function corresponds to a minimum-variance parameter estimate, its surface appeared flat with only a narrow and steep descent to the minimum [34]. This caused the minimization procedure to wander for an extended period of time before converging to the minimum, especially with slightly noisy responses. The use of the first summation only in Eqn.(8) improved the convergence speed, but for resolutions of very poorly separated peaks the procedure was often unable to overcome local minima.
Figure 9. Kalman filter resolution of a chromatographic peak of an extract of an Alternaria culture on rice spiked with ATS standard (1 ng in the injected volume), after optimal alignment of the components (1 = AOH, 2 = ATS) by simulated annealing.

A series of six different standard mixtures of the two mycotoxins was analysed. The unsmoothed pure peaks obtained by direct injection of 104 and 250 ng of AOH and ATS, respectively, were used as models for the simulated-annealing-optimized Kalman filter
resolution of the mixtures. Both the accuracy and the precision of the concentration values obtained at convergence were satisfactory. Typical errors were around 5%. Larger errors of 16% and 12% were obtained for the minor components of the mixtures 30 : 300 and 180 : 50 ng of AOH and ATS, because of the large differences between the component peak areas. The correlation between found and true amounts was linear, with r2 = 0.994 and 0.997 for AOH and ATS, and with intercepts and slopes not significantly different from zero and 1 at the 95% confidence level. The validity of the computational approach described above was tested on chromatograms obtained from several extracts of Alternaria cultures on rice and maize. In the elution zone of AOH and ATS only one tailed peak was generally observed. Peak purity criteria, based on the technique of spectral overlaying after normalization and/or absorbance ratios at two different wavelengths, were also usually satisfied, leading to the conclusion that the observed peak had to be ascribed to the major metabolite AOH. However, a clear shoulder appeared on the chromatograms after the extracts were fortified with known amounts of ATS, indicating composite peaks instead of pure AOH peaks. It is worth noting that the very similar UV spectra of the two mycotoxins still led to a false result of the peak purity test. On the other hand, peak quantification could not be achieved by any of the ordinary techniques commonly available on LC integrators. Figure 9 shows the peak resolution by Kalman filtering, after the alignment procedure by simulated annealing, of the chromatogram relative to an extract of an Alternaria culture on rice. A rice extract analysed in triplicate gave the following concentrations: [AOH] = 100±14 ppm and [ATS] = 15±3 ppm. A maize extract analysed in the same way gave: [AOH] = 11±1 ppm and [ATS] = 5±1 ppm.
3.3. Resolution of ESCA spectra

Figure 10 shows some ESCA spectra of binary and ternary mixtures of some lead compounds, as well as of the pure components. The unpredictable shifts of the pure component peaks with respect to the mixtures, mainly due to the different surface charging of the samples, made it necessary to find the optimal alignment of the peak doublets before determining the percentage composition of the components in the mixtures through the Kalman filter. The smoothed Pb 4f peaks of each pure component were used in the filter model after background removal from the peak doublets by cubic spline interpolation. Also the unresolved Pb 4f peaks relative to the mixtures
were smoothed and background corrected by the cubic spline algorithm before processing the ESCA spectra.
Figure 10. Mg Kα-excited ESCA spectra of some lead compound mixtures, acquired in the region of the kinetic energy corresponding to the Pb 4f (4f5/2 and 4f7/2) peak-doublet transitions. The lack of collinearity among the components and mixtures makes the Kalman filter separation not feasible. (a) PbO2+PbSO4 (60:40); (b) PbO2+PbCO3 (60:40); (c) PbO2+Pb(NO3)2+PbCO3 (35:35:30); (d) PbO2+PbCO3+PbSO4 (20:20:60); (e) pure PbCO3; (f) pure PbSO4. (From Fresenius J. Anal. Chem. (1993) 345, 490, with permission.)
Table 3 reports the results of the quantitative resolution of some spectra of lead compounds, obtained after aligning the spectra by simplex, steepest descent and simulated annealing, respectively.

Table 3
Comparison of the Kalman filter resolution of the ESCA spectra of some binary and ternary lead compound mixtures after the search for the optimal alignment by simplex, steepest descent and simulated annealing. Nr is the number of iterations of the optimisation until convergence; n.c.: no convergence. (From Fresenius J. Anal. Chem. (1993) 345, 490, with permission.)
System                    Given (%)   Simplex (%)     Nr    Steepest descent (%)   Nr    Simulated annealing (%)   Nr
PbO2+Pb(NO3)2             25:75       22.7:77.3       21    26.3:73.7              28    24.7:75.3                 77
PbO2+Pb(NO3)2             60:40       50.1:49.9       19    62.2:37.8              25    61.8:38.2                 78
PbO2+Pb(NO3)2             75:25       69.6:30.4       23    73.8:26.2              27    74.6:25.4                 82
PbO2+PbSO4                60:40       71.8:28.2       21    64.7:35.3              25    63.4:36.6                 80
PbO2+PbCO3                60:40       66.5:33.5       16    57.1:42.9              20    57.4:42.6                 80
PbO2+Pb(NO3)2+PbCO3       35:35:30    59.2:3.8:37.0   95    28.7:40.8:30.5         114   31.7:37.1:31.2            201
PbO2+PbCO3+PbSO4          20:20:60    n.c.            -     15.7:25.3:59.0         134   18.0:21.6:60.4            208
The conclusions on the relative merits of the three optimization methods for solving this type of problem for the real spectra are in accordance with those inferred from the analogous comparison on the synthetic spectra. Simplex was fast in reaching convergence for the binary mixtures, but with poor accuracy of the results (the mean error was 15%). The results of the quantification of ternary mixtures by simplex were rather unreliable when there was convergence, while in the other cases the program was terminated by overflow errors. For the quantification of the binary mixtures, the steepest descent method was as fast as simplex in converging, with an acceptable accuracy of the percentage composition estimates (mean error of 5%), not significantly different from that obtained by simulated annealing. However, the accuracy of the steepest descent method worsened, with errors up to 20%, for the quantification of the ternary mixtures. Simulated annealing was the slowest to reach convergence, but it always converged to the best results, regardless of the complexity of the model and of the initial conditions of the parameters.
Figure 11. Kalman filter resolution of Mg Kα-excited ESCA spectra of some binary lead compound mixtures in the kinetic energy region of the overlapped Pb 4f (4f5/2 and 4f7/2) doublets. (A) (••••) spectrum of PbO2+Pb(NO3)2 (25:75); (- - - -) resolved doublets of: 1+2 Pb(NO3)2, 3+4 PbO2. (B) (••••) spectrum of PbO2+PbCO3 (60:40); (- - - -) resolved doublets of: 1+2 PbCO3, 3+4 PbO2. (C) (••••) spectrum of PbO2+PbSO4 (60:40); (- - - -) resolved doublets of: 1+2 PbSO4, 3+4 PbO2. (———) Resulting fits. (-•-•-•-•-) Cubic spline interpolation baselines. (From Fresenius J. Anal. Chem. (1993) 345, 490, with permission.)
Figure 12. Kalman filter resolution of Mg Kα-excited ESCA spectra of some ternary lead compound mixtures in the kinetic energy region of the overlapped Pb 4f (4f5/2 and 4f7/2) doublets. (A) (••••) spectrum of PbO2+PbCO3+PbSO4 (20:20:60); (- - - -) resolved doublets of: 1+2 PbSO4, 3+4 PbCO3, 5+6 PbO2. (B) (••••) spectrum of PbO2+Pb(NO3)2+PbCO3 (35:35:30); (- - - -) resolved doublets of: 1+2 PbCO3, 3+4 Pb(NO3)2, 5+6 PbO2. (———) Resultant fits. (-•-•-•-•-) Cubic spline interpolation baselines. (From Fresenius J. Anal. Chem. (1993) 345, 490, with permission.)

Figures 11-13 depict the resolution of some binary and ternary lead and chromium compound mixtures, as further evidence of the reliability of the iterative combined simulated annealing-Kalman filtering procedure for the quantification of shifted spectra. The mean errors were about 4% for the binary mixtures and larger, about 10%, for the ternary mixtures. This error was still acceptable considering the complexity of the unresolved ternary
systems, due both to the severe overlapping of the bands and to the lack of dissimilarity in the shapes of the Pb 4f and Cr 2p peak doublets.
Figure 13. Kalman filter resolution of the Mg Kα-excited spectrum of a powder of the K2Cr2O7+Cr2O3+Cr (40:45:15) mixture in the region of the overlapped Cr 2p doublets. (••••) Experimental spectrum. (- - - -) Component peak separation: doublet 1+2, Cr metal; doublet 3+4, Cr(III) oxide; doublet 5+6, potassium dichromate. (———) Resulting fit. (-•-•-•-) Cubic spline interpolation baseline.
4. CONCLUSIONS

The computational approach described here, based on the combination of the Kalman filter algorithm with iterative optimization by the simulated annealing method, was able to find the optimal alignment of the pure component peaks with respect to the shifted components in the overlapped spectra and, hence, to correctly estimate the contribution of each component in the mixture. Simulated annealing demonstrated a superior ability over the other optimization methods, simplex and steepest descent, in yielding more reliable convergences at the expense of not much more computer time, at least for resolving ternary shifted overlapped spectra. The proposed method, although applied here to the quantitative resolution of spectra in X-ray electron spectroscopy and high performance liquid chromatography, should also be suitable for correcting the spectral shifts of other real systems analysed with different analytical techniques, as long as the component interactions in the mixture are not so intimate as to render the quantification model inadequate.
REFERENCES
1. R.G. Brown, Introduction to Random Signal Analysis and Kalman Filtering, 2nd ed., Wiley, New York, 1992.
2. S.D. Brown, Anal. Chim. Acta 181 (1986) 1.
3. S.C. Rutan, J. Chemometrics 1 (1987) 7.
4. S.C. Rutan, 4 (1990) 103.
5. W.A. Halang, R. Langlais and E. Kugler, Anal. Chem. 50 (1978) 1829.
6. P. Gans and J.G. Gill, Appl. Spectrosc. 38 (1984) 370.
7. T. Rotunno, F. Palmisano, G. Tiravanti and P.G. Zambonin, Chromatographia 29 (1990) 269.
8. T.F. Brown and S.D. Brown, Anal. Chem. 53 (1981) 1410.
9. E.H. Van Veen and M.T.C. de Loos-Vollebregt, Spectrochim. Acta 45B (1990) 1109.
10. E.H. Van Veen and M.T.C. de Loos-Vollebregt, Anal. Chem. 63 (1991) 1441.
11. S.C. Rutan and S.D. Brown, Anal. Chim. Acta 160 (1984) 99.
12. S.C. Rutan and S.D. Brown, Anal. Chim. Acta 167 (1985) 39.
13. S.C. Rutan and P.W. Carr, Anal. Chim. Acta 215 (1988) 131.
14. D.D. Gerow and S.C. Rutan, Anal. Chim. Acta 184 (1986) 53.
15. D.D. Gerow and S.C. Rutan, Anal. Chim. Acta 60 (1988) 847.
16. H.R. Wilk and S.D. Brown, Anal. Chim. Acta 225 (1989) 37.
17. T.L. Cecil and S.C. Rutan, Anal. Chem. 62 (1990) 1998.
18. S.C. Rutan, Anal. Chem. 63 (1991) 1103A.
19. S.C. Rutan and S.D. Brown, Anal. Chem. 55 (1983) 1707.
20. S.C. Rutan and C.B. Motley, Anal. Chem. 59 (1987) 2045.
21. P.D. Wentzell, A.P. Wade and S.R. Crouch, Anal. Chem. 60 (1989) 905.
22. M. Redmont, S.D. Brown and H.R. Wilk, Anal. Lett. 22 (1989) 963.
23. T.P. Kohman, J. Chem. Educ. 47(9) (1970) 657.
24. J.A. Nelder and R. Mead, Computer J. 7 (1965) 308.
25. L. Meites, CRC Crit. Rev. Anal. Chem. 8(11) (1979) 1.
26. J.H. Kalivas, N. Roberts and J.M. Sutter, Anal. Chem. 61 (1989) 2024.
27. J.H. Kalivas, J. Chemometrics 5 (1991) 37.
28. I.O. Bohachevsky, M.E. Johnson and M.L. Stein, Technometrics 28 (1986) 209.
29. P.J.M. van Laarhoven and E.H.L. Aarts (eds.), Simulated Annealing: Theory and Applications, Reidel, Dordrecht, 1989.
30. N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller and E. Teller, J. Chem. Phys. 21 (1953) 1087.
31. F. Palmisano, P.G. Zambonin, A. Visconti and A. Bottalico, J. Chromatogr. 27 (1989) 425.
32. F. Palmisano, P.G. Zambonin, A. Visconti and A. Bottalico, J. Chromatogr. 465 (1989) 305.
33. J.P. Foley, J. Chromatogr. 384 (1987) 301.
34. S.D. Brown and S.C. Rutan, J. Res. Natl. Bur. Stand. 90 (1985) 403.
Chapter 5
Selection of molecular descriptors for quantitative structure-activity relationships
Jon M. Sutter and Peter C. Jurs Department of Chemistry, 152 Davey Laboratory, The Pennsylvania State University, University Park, PA 16802.
1. INTRODUCTION

The field of quantitative structure-activity relationships (QSARs) consists of studies relating molecular structures to a biological activity of interest. In order to build a successful QSAR model it is necessary to generate representations of the molecular structures with molecular structure descriptors. Then a mathematical relationship that links the descriptors to the biological activity of interest must be developed. Over the years, many different methods have been used to build a wide variety of QSARs that are capable of predicting many different biological activities. QSARs have been used to study pharmaceutical drug design, toxicity, local anesthetics, and agricultural chemicals. Once developed, QSARs can be used to predict biological activities for new structurally related compounds, quantities that might be difficult or impossible to obtain experimentally. QSARs may also lend insight into the important molecular features that are related to a biological activity, which may confirm, contradict, or lead to new theories concerning the activity. In the past, QSAR investigators were interested in obtaining descriptors with ease and simplicity [1]. With the availability of increased computer power, the information content of a descriptor has become more important than the ease of calculating it. Therefore, it is beneficial to create a relatively large descriptor pool that maximizes the total amount of information represented. After the descriptors are generated, a mathematical model that relates the numerical values of a subset of the descriptors to the biological activity can be created. These mathematical models are typically generated using either multiple linear regression or computational neural networks, the latter of which may be classified as a nonlinear modeling technique. The number of possible descriptor subsets grows factorially as the size of the descriptor pool increases, making testing all possible combinations of descriptors impractical. An optimization technique must therefore be employed to find a suitable subset of descriptors. This chapter discusses the possibility of using simulated annealing (SA) or generalized simulated annealing (GSA) to select optimal molecular structure descriptor subsets for both linear regression and computational neural network models. A method of training the neural networks using a combination of GSA and a quasi-Newton technique is also discussed.
2. ADAPT METHODOLOGY

Before a QSAR can be developed, a set of compounds, known as the training set, must be created. The training set should contain compounds with known biological activities that are structurally similar to the compounds of interest. A set of descriptors and a mathematical model that minimizes the differences between the calculated and the actual biological activity of the training set should lead to a QSAR that is capable of accurately predicting biological activities of unknown compounds. The methodology for creating QSARs using the Automated Data Analysis and Pattern Recognition Toolkit (ADAPT) [2,3] software system is shown in Figure 1, and the individual steps are as follows:
• Input and store the molecular structures under investigation as well as the biological activity of interest for each compound.
• Generate a three-dimensional molecular model for each compound.
• Calculate molecular structure descriptors for each molecule. The descriptors are derived directly from the stored topological representations of the structures or from the 3-D molecular models.
• Apply objective statistical tests to identify only those descriptors which are significant.
• Select a descriptor subset using an optimization technique and develop mathematical models using linear regression or computational neural networks.
• Test the predictive ability of the models.
[Figure 1 flow diagram: Structure Entry → Molecular Modeling → Descriptor Generation → Descriptor Reduction → Automated Descriptor Selection → Regression Analysis / Neural Networks → Model Validation.]

Figure 1. Flow diagram of the ADAPT method.
The molecular structure entry is performed by sketching the compounds on a graphics terminal and then storing them as connection tables. The geometries are optimized using ADAPT's molecular mechanics routine, MM2 [4], the semiempirical molecular orbital program MOPAC [5], or a newly developed extended Hückel method [6].
The next step in a QSAR study is to represent the structures using calculated descriptors. The descriptors are designed to encode the features of the structures that influence their ability to engage in intermolecular interactions such as dipole-dipole interactions, van der Waals interactions, and hydrogen bonding. Many descriptors can be generated for each compound using the ADAPT software. The descriptor pool contains topological, electronic, geometric, and physicochemical structural values. Topological descriptors can be derived directly from the connection tables and include quantities such as simple and valence-corrected connectivity indices and substructure counts. Geometric descriptors are calculated from the three-dimensional representations and include the length-to-breadth ratios, radius of gyration, and molecular volume and surface area. Electronic structural features include such quantities as the most negative or most positive atomic charge [7,8] in the molecule and the dipole moment. Physicochemical parameters include log P, molar refraction, and molecular refraction. There are also hybrid descriptors that combine aspects of several of these descriptor types, such as charged partial surface area (CPSA) descriptors. CPSA descriptors were designed to encode properties that depend on intermolecular interactions. They combine solvent-accessible molecular surface area and partial atomic charge information [8] to form charged partial surface area descriptors [9]. The descriptor pool can be reduced in size by using standard statistical procedures. For example, descriptors that contain little or no useful information or are similar in value and content to other descriptors are rejected. Therefore descriptors that have the same value for nearly every compound are rejected, as are ones that are highly correlated (r>0.95) with other descriptors. When pairwise correlation is encountered, the descriptor that has the more obvious feature representation or is easier to calculate is retained. Models can be generated using stepwise addition multiple linear regression as the descriptor selection criterion. Leaps-and-bounds regression [10] and simulated annealing (ANNLIN) can be used to find a subset of descriptors that yields a statistically sound model. The best descriptor subset found with multiple linear regression can also be used to build a computational neural network model. The root mean square (rms) errors and the predictive power of the neural network model are usually improved, owing to the larger number of adjustable parameters and the nonlinear behavior of the computational neural network model. Computational neural network models can also be generated using an automated descriptor selection routine (ANNDES) that uses the computational neural network results as the selection criterion instead of multiple linear regression. The ANNDES algorithm is computationally intensive because it is necessary to optimize the descriptor subset as well as the starting weights and biases of the neural network. ANNDES will generally find unique descriptor subsets yielding neural network models that are superior to the neural network models based on descriptor subsets found using multiple linear regression. The multiple linear regression models are validated using standard statistical techniques. These techniques include inspection of residual plots, standard deviation, and multiple correlation coefficient. Both regression and computational neural network models are validated using external prediction.
The prediction set is not used for descriptor selection, descriptor reduction, or model development, and it therefore represents a true unknown data set. In order to ascertain the predictive power of a model the rms error is computed for the prediction set.
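A minimal sketch of the objective descriptor-reduction step described above is given below (Python/NumPy): near-constant descriptors are dropped, and for every pair correlated above r = 0.95 only one member is kept. The function name and thresholds are illustrative, and the simple keep-the-first rule stands in for ADAPT's preference for the more interpretable or more easily calculated descriptor.

```python
import numpy as np

def reduce_descriptor_pool(X, names, var_tol=1e-8, r_max=0.95):
    """Objective descriptor reduction.

    X     : (n_compounds, n_descriptors) array of descriptor values
    names : list of descriptor labels for the columns of X
    Drops descriptors that are essentially constant, then removes one member of
    every pairwise correlation above r_max (keeping the earlier column).
    """
    keep = [j for j in range(X.shape[1]) if np.var(X[:, j]) > var_tol]
    X, names = X[:, keep], [names[j] for j in keep]
    corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = []
    for j in range(X.shape[1]):
        if all(corr[j, i] <= r_max for i in selected):
            selected.append(j)
    return X[:, selected], [names[j] for j in selected]
```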
3. SELECTING DESCRIPTORS FOR LINEAR REGRESSION

There are many techniques for selecting a descriptor subset that yields a good QSAR model. Some of the more common methods based on regression analysis, such as forward selection or backward elimination [11,12], often lead to models that are inadequate. The only way to ensure that the best subset is found is to investigate all possible subsets, which, of course, is not always possible. The leaps-and-bounds program is an efficient algorithm that does not evaluate all possible regressions; rather, it arranges the descriptor pool into subsets of descriptors and calculates the squared multiple correlation coefficient (R^2) values of the subsets. Smaller subsets are formed by splitting the larger subsets, and their R^2 values are calculated. Large subsets that yield low R^2 values are deleted. The algorithm generally gives excellent subsets, but it is possible to find a local minimum because the number of total descriptors evaluated is limited. Only 24 descriptors can be evaluated in a group because of the rapid increase in the number of possibilities, and unfortunately the descriptor pools used in ADAPT are usually much larger than this. In an attempt to avoid this problem, descriptors that do not appear to be important are removed from the group of 24 and replaced with other descriptors from the pool. Leaps-and-bounds is repeated in this manner until an adequate subset of descriptors is found. Simulated annealing is a more reliable descriptor selection routine for multiple linear regression models, since the total descriptor pool is not limited and the best subset of descriptors is found in most cases.

3.1. Multiple linear regression

A linear mathematical function that relates descriptor values to a biological activity can be created using multiple linear regression. For n observations and p independent variables the general linear regression model is represented as

Yi = β0 + β1·x1 + β2·x2 + ... + βp·xp    (1)
where β0, β1, ..., βp are regression coefficients, x1, x2, ..., xp are the independent variables (numerical descriptors), and Yi is the dependent variable of interest (the biological activity). The regression equation can be represented in matrix notation as

Y = X·β    (2)
where Y is an n×1 vector of responses, X is an n×(p+1) matrix of independent variables, and β is a (p+1)×1 vector of regression coefficients. The regression coefficients can be determined by the least-squares solution

β = (X^T·X)^-1·X^T·Y    (3)
Once β is calculated, Eqn (2) can be used to estimate the dependent variable, or activity, for other compounds. Multiple linear regression techniques have proven to be very successful in QSAR studies.
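The following is a minimal sketch of Eqns (1)-(3) in Python/NumPy, using a least-squares solver rather than the explicit matrix inverse for numerical stability; the rms error shown is the quantity used as the objective function in the descriptor-selection searches discussed next. The function names are illustrative, not those of the ADAPT routines.

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares fit of the linear model of Eqns (1)-(3).
    X: (n, p) descriptor matrix, y: (n,) activities; returns beta with the intercept first."""
    Xd = np.column_stack([np.ones(len(y)), X])       # prepend the beta_0 column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)    # equivalent to (X^T X)^-1 X^T Y
    return beta

def predict_mlr(X, beta):
    """Apply Eqn (2) to estimate activities for new compounds."""
    return np.column_stack([np.ones(X.shape[0]), X]) @ beta

def rms_error(y_true, y_pred):
    """Root mean squared error of calculated minus observed activities."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```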
3.2. Simulated annealing for descriptor selection

The least-squares solution of a multiple linear regression equation can be calculated in one step, so the number of iterations performed is not an important issue to consider when selecting descriptors. Therefore, simulated annealing is an appropriate choice for performing the optimization. The objective function used for this algorithm is the root mean squared (rms) error of the calculated minus observed biological activities for the descriptor subset being tested. A new set of descriptors is selected by replacing one descriptor in the subset with a randomly selected descriptor. As usual in simulated annealing, a detrimental step is accepted when p (from p = exp(-β·Δrms)) is greater than a random number from a uniform distribution on [0,1]. The control parameter, β, is adjusted so that the average p value is between 0.5 and 0.9 at the beginning of the algorithm (the first 20 iterations). After every 1000 iterations, β is multiplied by two, for a maximum of 50,000 iterations. If over 900 detrimental steps are rejected in a row, β is reset to its original value and the algorithm is allowed to continue. If over 900 steps are again rejected and there is no improvement in the optimal cost function, the algorithm is terminated. This cooling schedule has worked well for all the QSAR data sets investigated so far. For small data sets, simulated annealing finds the same subset as the leaps-and-bounds algorithm. For larger data sets, simulated annealing generally finds descriptor subsets that have lower rms errors than the subsets found by leaps-and-bounds.

4. SELECTING DESCRIPTORS FOR COMPUTATIONAL NEURAL NETWORKS

Multiple linear regression has been shown to adequately relate the structure of a compound to a biological activity of interest. Therefore, linear regression has been used extensively in QSARs. Recently, however, research has shown that computational neural networks can lead to better QSAR models [13-16]. As with linear regression, optimization opportunities arise for computational neural networks using SA or GSA. If the response is monotonic and the departures from linearity are not significant, then a linear model can be expected to adequately identify important parameters for a neural network. Consequently, descriptor subsets are frequently selected based on the computationally less expensive regression approach and then submitted to a neural network. However, since a linear function usually cannot accurately describe a nonlinear function without using extra nonlinear terms, it is possible that descriptor subsets that yield adequate neural network models will not yield adequate regression models. It follows that if a descriptor subset is selected based on neural network analysis rather than regression analysis, a superior model may result. An optimization technique is needed to generate a descriptor subset for neural network models, since investigating all possible combinations is not feasible. This section discusses the possibility of using generalized simulated annealing (GSA) in combination with neural networks to develop an automated descriptor selection routine.

4.1. Computational neural networks

Neural networks were originally designed as a model for the activity of the human brain. However, a computational neural network can be thought of simply as a nonlinear regression model when applied to QSAR studies. The computational neural network is a
mathematical function that can build models by a nonlinear least-squares method similar to regression. However, since the model is nonlinear, the regression coefficients cannot be found in one step, and an iterative process must be used to determine them. A typical neural network architecture is shown in Figure 2, which depicts a three-layer, feed-forward, fully connected neural network. The computational neural network processing begins by performing a linear transformation of the input layer values (the descriptor values) from their original ranges to the interval (0,1). The transformed values are then passed to the hidden layer neurons. The input value of a hidden layer neuron is the summation of the products of the weights and the corresponding outputs of the previous input layer, plus a bias term (θ). An example of this calculation is shown in Figure 2. The first neuron in the hidden layer has an input of In(2,1). The output of the neuron, a sigmoidal transformation of the input, is also shown in Figure 2. The output layer will give a value between 0 and 1, so an expansive linear transformation must be performed to obtain an estimate of the biological activity in the desired units. The weights and biases are adjusted iteratively to minimize the sum-squared error in the calculation of the target values (biological activity values) for the training set compounds. The neural network method is much more computationally intensive than linear regression, since the nonlinear regression coefficients (weights and biases) must be changed iteratively, which requires repeated evaluation of the network outputs. However, the greater mathematical flexibility found in neural networks leads to models that are superior to regression models, so the additional time required is worth the effort.

4.2. Training a neural network
Adjusting the weights and biases to fit the known target values constitutes the training of the network. Because of the increased mathematical flexibility and the large number of adjustable parameters, it is possible to obtain apparently good fits by chance or to overtrain the neural network. Recently, some guidelines for avoiding chance correlations and overtraining have been presented [17]. Chance correlations can be reduced by keeping the ratio of cases to connections larger than three, and overtraining can be avoided by employing a cross-validation set. If overtraining occurs, specific idiosyncrasies of the individual training set members will be assigned as significant contributors, and this can adversely affect the predictive ability of the network. To avoid this situation the data set is split into a training set and a cross-validation set. The weights and biases are adjusted based on the rms error of estimation for the training set compounds, and the rms error for the cross-validation set is calculated periodically throughout the training. Overtraining is believed to occur when the rms error of the cross-validation set begins to rise. If training is stopped when the cross-validation error is at a minimum, then the network can be used with reasonable confidence for future predictions. Figure 3 shows the progress of a typical training session of a neural network. The training set error curve and the cross-validation set error curve are both shown. The network would be considered fully trained at the minimum of the cross-validation rms error curve. Training the network is nothing more than an optimization problem. When a large data set is used, or when the network has a large number of weights and biases, this task can
[Figure 2 depicts a typical feed-forward neural network: an input layer of descriptors Desc(1), Desc(2), ..., Desc(p), a hidden layer, and an output layer, together with the neuron function
    In(2,1) = θ(2,1) + Σ_{j=1}^{p} w_{j,1} Out(1,j)
    Out(2,1) = (1 + exp[-In(2,1)])⁻¹ ]
Figure 2. Schematic of the neural network architecture used in this study.
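The hidden-neuron calculation shown in Figure 2 can be illustrated with a short sketch. Python is used here purely for illustration (the published work used Fortran 77), and the array names and numerical values below are assumptions, not values from the study.

import numpy as np

def hidden_neuron_output(out_prev, weights, bias):
    """Sigmoidal neuron: In = bias + sum of weighted inputs, Out = 1/(1+exp(-In))."""
    net_in = bias + np.dot(weights, out_prev)
    return 1.0 / (1.0 + np.exp(-net_in))

# Example: three scaled descriptor values feeding one hidden neuron
descriptors = np.array([0.2, 0.7, 0.5])   # inputs already scaled to (0, 1)
w = np.array([0.4, -1.1, 0.3])            # hypothetical weights
theta = 0.1                               # hypothetical bias term
print(hidden_neuron_output(descriptors, w, theta))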
[Figure 3 plots the training set rms error and the cross-validation set rms error against the number of training cycles; training is stopped, and the objective function evaluated, at the minimum of the cross-validation curve.]
Figure 3. Training curve of a neural network.
become quite expensive computationally. It makes sense, then, to utilize an efficient optimization method for training. For these reasons, networks are trained in our work using the quasi-Newton [18] BFGS (Broyden-Fletcher-Goldfarb-Shanno) [19-23] algorithm, as opposed to the more widely reported steepest descent, back-propagation (BP) algorithm [24]. Use of a BFGS optimization technique significantly speeds up training of neural networks compared to the widely used BP method. Training times are often reduced by an order of magnitude or more because the BFGS algorithm builds up and uses approximate second-derivative information about the error function. Moreover, there is no possibility for oscillation of the error, which can happen with BP because weights and biases are adjusted for one observation at a time. In addition, unlike BP, BFGS has no adjustable learning rate or momentum parameters that must be chosen by the user. The BFGS optimization routine is a very efficient algorithm for training neural networks. However, it is a directional routine, and therefore the final results depend on the initial weights and biases. Thus, in applying such a neural network to QSAR, a number of trainings are done. Once a descriptor subset has been selected using multiple linear regression, this subset is submitted to a neural network many times using different starting weights and biases. The different trainings produce a distribution of errors that follows a roughly Gaussian curve, such as that depicted in Figure 4. In order to obtain a reasonable training session the neural network is trained several hundred times with different random starting weights and biases. The fact that so many sessions are needed to obtain a good training set error and a good cross-validation error does not pose a problem when only one descriptor subset is being investigated. However, when GSA was implemented to optimize a descriptor subset, this was the first problem that had to be addressed. It would take excessive computation time to investigate several hundred neural network sessions for each descriptor subset selected by GSA. Therefore, it was necessary to reduce
[Figure 4 plots the number of sessions against rms error.]
Figure 4. A representation of a distribution of rms errors from numerous neural networks starting from different weights and biases.
the time needed while still obtaining approximately the same network results. First, the possibility of using a smaller number of neural network sessions and reporting the smallest error found as the cost function for that subset of descriptors was investigated. Five to ten neural network sessions were executed for each subset. When more than ten sessions were used, computational time became an issue (over 24 hours of computation). Also, the optimized descriptors often resulted in inadequate models when this method was used. It was a false assumption that a cost function would be found on the low end of the Gaussian curve each time. Often a good descriptor subset would be assigned a bad cost function, and a bad subset would consequently look fairly good. A method that consistently finds errors on the low end of the Gaussian curve was needed. A combination of GSA and the quasi-Newton BFGS method was investigated to solve this problem. A combination of BFGS and simulated annealing has been presented in the literature [25]. In that study, simulated annealing was allowed to walk into the global area and then BFGS was used to quickly converge to the global conditions. In the case of an automated descriptor selection routine for neural networks, the emphasis was not global optimum convergence, but the amount of time required to obtain an adequate training session. A good neural network result that was in the area of the global minimum and that was obtained in a short amount of time was desired. Therefore, the steps for annealed neural networks (ANN) are exactly the same as Bohachevsky's generalized simulated annealing [26], except that a short BFGS neural network training is performed after each step (a sketch of the procedure follows the list).
1. Select a random set of weights and biases (x0).
2. Perform a short BFGS training (10-20% of a normal training).
3. Evaluate the cost function φ0 = φ(x0) from the short BFGS training.
4. Set x* = x0 + (Δr)u.
5. Perform a short BFGS training and evaluate φ1 = φ(x*) and Δφ = φ1 - φ0.
6. If φ1 ≤ φ0, set x0 = x* and φ0 = φ1. Go to step 4.
7. If φ1 > φ0, set p = exp(-βΔφ/φ0). If p is greater than a random number from [0,1], then set x0 = x*, φ0 = φ1, and go to step 4.
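Below is a minimal sketch of the annealed-neural-network idea just listed, i.e., a simulated-annealing walk over starting weights in which each candidate receives only a short gradient-based refinement before its cost is evaluated. Python and SciPy are used only for illustration (the original programs were written in Fortran 77 with a BFGS routine), and the toy cost function, random-direction step, and acceptance test are assumptions based on the reconstructed steps above.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def cost(w):
    # Placeholder for the network cost of equation (4); here a toy quadratic.
    return float(np.sum((w - 1.0) ** 2))

def short_bfgs(w, frac_iters=20):
    # "Short" BFGS refinement (10-20% of a full training) of the current weights.
    res = minimize(cost, w, method="BFGS", options={"maxiter": frac_iters})
    return res.x, res.fun

def annealed_start(n_weights=10, beta=1.0, step=4.0, bound=20.0, max_steps=200):
    x0 = rng.uniform(-bound, bound, n_weights)            # step 1
    x0, phi0 = short_bfgs(x0)                             # steps 2-3
    for _ in range(max_steps):
        u = rng.normal(size=n_weights)
        u /= np.linalg.norm(u)                            # random unit direction
        x_new = np.clip(x0 + step * u, -bound, bound)     # step 4
        x_new, phi1 = short_bfgs(x_new)                   # step 5
        dphi = phi1 - phi0
        # steps 6-7: accept improvements, or worse moves with annealing probability
        if dphi <= 0 or np.exp(-beta * dphi / max(phi0, 1e-12)) > rng.random():
            x0, phi0 = x_new, phi1
    return x0, phi0

print(annealed_start())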
The cost is a function of the rms errors of the training set and the cross-validation set compounds and is discussed in more detail in the next section. The step size, Δr, was set to 4.0 and the weights and biases were bound between -20.0 and 20.0. There was no bound on the BFGS optimization. The ANN routine runs for a maximum of 200 iterations, and after the best starting weights and biases are found they are submitted to a full BFGS training. The ANN algorithm is called for each subset of descriptors selected by GSA for descriptor subset optimization, so automated descriptor selection for neural networks can be considered an optimization inside an optimization.

4.3. Generalized simulated annealing for descriptor selection

Descriptor selection using neural network performance as the selection criterion is an enormous computational task. Investigating all possible descriptor combinations would take years of computation time; therefore, an optimization technique is needed to reduce the time involved in performing this task. When choosing an optimization technique several factors must be considered. These include the simplicity of the algorithm, the likelihood of convergence to a global optimum, and the speed of the algorithm. Considering these factors, GSA seems to be an appropriate choice. Selecting an appropriate cost function that adequately describes the problem is perhaps the most important step in an optimization. Many different possibilities were investigated for the neural network descriptor selection problem. The cost function that yielded models with the most predictive power is represented as

Cost = T(err) + |T(err) - CV(err)|
(4)
where T(err) is the root mean square (rms) error for the biological activity values for the members of the training set, and CV(err) is the rms error for the members of the cross-validation set at the minimum of the training curve shown in Figure 3. The sets of descriptors that support models with the lowest T(err) but that simultaneously have a similar CV(err) seemed to have superior predictive power. As equation 4 reveals, a cross-validation set that is indicative of the entire set of compounds being investigated is essential for neural network descriptor selection by this approach. Two modifications were made to the GSA algorithm for this study. The first modification converted GSA from a continuous to a discrete function optimization routine for descriptor selection. In a previous study, a discrete function was given continuity by changing only a fraction (50-75%) of the variables [27], thereby passing some information from the previous subset to the next subset of variables. The technique worked reasonably well for that study and was applied to the descriptor selection routine. The second modification was to bias the descriptors selected for the subset in order to speed the optimization procedure. The descriptor selection is achieved by calculating a quality value for each descriptor. The quality value for a particular descriptor is the cost
function calculated by equation 4 when that descriptor is used as the sole input, with all other layers of the neural network the same. For example, if a five-member input layer, a three-member hidden layer, and an output layer (5-3-1) architecture is desired then each descriptor is subjected to a 1-3-1 neural network and the resulting cost function is used as the quality value for that descriptor. Since the cost function in equation 4 is being minimized, the lowest quality value corresponds to the best descriptor choice. The quality value is used to bias the descriptor selection during the optimization. The values are scaled so that the descriptor with the lowest (best) quality value has a 75% chance of being accepted and the descriptor with the highest quality value has a 25% chance of being accepted into the new subset. These percentages were obtained empirically. The quality values are mapped onto the interval [0.288, 1.39], then the probability of acceptance is calculated by P = exp(-SQUAL)
(5)
where SQUAL is the scaled quality value. This number is compared to a random number on the interval [0,1]. The new descriptor is accepted if P is greater than the random number. The steps of ANNDES are summarized below (a sketch of the biased selection follows the list).
1. Calculate a quality value for each descriptor by training a 1-hidden layer-1 neural network and calculating a cost function φq.
2. Start with the set of descriptors (x0) that have the best quality values.
3. Evaluate the cost function φ0 = φ(x0) from equation 4.
4. Get a new subset of descriptors (x*) by replacing 50-75% of the subset with descriptors selected with a bias from the descriptor pool. The bias is based on the quality value: the best quality value is assigned a value of 0.75 and the worst a value of 0.25, and the others are scaled in that range. The scaled value is compared to a random number on the interval [0,1]; if it is greater than the random number, the descriptor is accepted into the new subset.
5. If the number of iterations is equal to 200, then quit. Otherwise, evaluate the new cost function φ1 = φ(x*) and Δφ = φ1 - φ0.
6. If φ1 ≤ φ0, set x0 = x* and φ0 = φ1. Go to step 4.
7. If φ1 > φ0, set p = exp(-βΔφ/φ0). If p is greater than a random number from [0,1], then set x0 = x*, φ0 = φ1, and go to step 4.
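The quality-value bias of equation (5) and step 4 can be sketched as follows. Python is used for illustration only; the scaling interval [0.288, 1.39] and the 75%/25% acceptance probabilities are those quoted in the text, while the function and variable names and the example values are assumptions.

import numpy as np

rng = np.random.default_rng(1)

def acceptance_probabilities(quality):
    """Map quality values (lower = better) onto [0.288, 1.39] and convert to
    acceptance probabilities P = exp(-SQUAL); the best descriptor then has
    P of about 0.75 and the worst about 0.25, as described in the text."""
    q = np.asarray(quality, dtype=float)
    span = max(q.max() - q.min(), 1e-12)
    squal = 0.288 + (q - q.min()) * (1.39 - 0.288) / span
    return np.exp(-squal)

def biased_replacement(subset, pool, quality, frac=0.6):
    """Replace a fraction of the current subset with descriptors drawn from the
    pool, accepting each candidate with its quality-based probability."""
    subset = list(subset)
    n_replace = max(1, int(frac * len(subset)))
    probs = acceptance_probabilities(quality)
    for pos in rng.choice(len(subset), size=n_replace, replace=False):
        while True:
            cand = rng.integers(len(pool))
            if pool[cand] not in subset and probs[cand] > rng.random():
                subset[pos] = pool[cand]
                break
    return subset

# Example with a hypothetical pool of 10 descriptors and their quality values
pool = [f"D{i}" for i in range(10)]
quality = rng.uniform(0.3, 1.0, size=10)
print(biased_replacement(["D0", "D1", "D2", "D3", "D4"], pool, quality))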
In the interest of time the number of iterations had to be limited to 200, which, of course, is not a fair representation of the total number of possibilities. Even for small data sets the number of possible combinations is very large. For example, if a five-descriptor model were sought from a total of 66 possible descriptors, there would be approximately nine million subset combinations. This does not pose a problem when selecting descriptor subsets based on linear regression, because the amount of time needed to evaluate the cost function is small and therefore a large number of iterations (tens of thousands) can be performed in a short amount of time. The aforementioned modification that preferentially pulls good descriptors into the subset seems to find excellent subsets without losing the random nature of GSA. Generalized simulated annealing generally works very well for feature selection of neural
network models for small data sets. The upper bound of the GSA optimization capability is being approached for this problem. Therefore, occasionally subsets of descriptors that are only as good as the linear regression subsets are found, and inadequate descriptor subsets are often found when the data set is large.
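As a quick check of the subset count quoted above, the number of five-descriptor subsets that can be drawn from a 66-descriptor pool can be computed directly (a trivial illustration, not part of the original software):

import math

# C(66, 5): number of distinct five-descriptor subsets from a 66-descriptor pool
print(math.comb(66, 5))   # 8,936,928 - roughly the "nine million" quoted in the text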
5. A QSAR STUDY

With growing environmental concern, a need for predicting the toxicity of compounds has emerged. Experimental assessment of toxicity can be costly, time consuming, and hazardous. QSARs can be used to predict toxicity accurately without using experimental methods, provided the unknown compounds are close structural analogs of compounds with experimentally known toxicities. Many methods have been created for developing QSARs. Some of the first QSAR investigators were concerned with the simplicity of the model developed and the ease of obtaining the descriptor values. Two early methods that conform to this view are the Hansch [28] and the Free-Wilson [29] methods. Both approaches use mathematical functions to link biological activity and structure. They differ only in the descriptors used to encode structural characteristics. The Hansch method uses physicochemical constants and the Free-Wilson method uses the frequency and position of chemical substituents (group contribution). More recently, QSAR investigators have not been as concerned with the amount of computation time involved. In the present example, the ADAPT methodology and GSA/SA were used to create adequate QSAR models. The results are compared to a more traditional group contribution model created by Gao et al. [1].
5.1. Experimental section

Most computations were performed using the ADAPT software system on a Sun 4/110 workstation. However, the annealing programs and the neural network computations [15] were performed on a DEC 3000 AXP Model 500 workstation. Both systems run under Unix, and all programs are written in Fortran 77. The data set consisted of 140 functionally related compounds taken from reference 1. The compounds were split into a training set of 130 compounds and a prediction set of 10 compounds, the same split as in reference 1, for comparison. The dependent variable was -log(LC50), where LC50 is the concentration that causes 50% mortality in fathead minnows. The 96-hour LC50 values of these compounds were taken from the literature [1]. The compounds and their corresponding -log(LC50) values that were used in this study are presented in Table 1. For the multiple linear regression portion of the study, the training set consisted of compounds 1-130 and the prediction set consisted of compounds 131-140. For the neural network portion, the training set and the 13-member cross-validation set were taken from compounds 1-130 and the prediction set consisted of compounds 131-140. The cross-validation set, labelled with the superscript a in Table 1, consisted of 13 randomly chosen compounds and was believed to be representative of the entire data set. The structure entry was performed by sketching the compounds on a graphics terminal and then storing them as connection tables. The geometries were optimized using the semiempirical molecular orbital program MOPAC with the PM3 Hamiltonian [30]. A total of 165 descriptors were generated for each compound using the ADAPT descrip-
123 Table 1 Compounds and Corresponding -log(LC 50) Values Used in This Study No. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
Compound Benzene Bromobenzene Chlorobenzene Hydroxybenzene 1,2-Dichlorobenzene 1,3-Dichlorobenzenea 1,4-Dichlorobenzene 1-Chloro-2-hydroxybenzene 1-Chloro-3-methylbenzene 1-Chloro-4-methylbenzene 1,3-Dihydroxybenzene 1-Hydroxy-3-methoxybenzene 1-Hydroxy-2-methylbenzene 1-Hydroxy-3-methylbenzene 1-Hydroxy-4-methylbenzene 1-Hydroxy-4-nitrobenzene 1,4-Dimethoxybenzene 1,2-Dimethylbenzene 1.4-Dimethylbenzene 1-Methyl-2-nitrobenzene 1-Methyl-3-nitrobenzene 1-Methyl-4-nitrobenzene 1,3-Dinitrobenzene 1-Amino-2-methyl-3-nitrobenzenea 1-Amino-2-methyl-4-nitrobenzenea 1-Amino-2-methyl-5-nitrobenzene 1-Amino-2-methyl-6-nitrobenzene 1-Amino-3-methyl-6-nitrobenzene 1-Amino-2-nitro-4-methylbenzene 1-Amino-3-nitro-4-methylbenzene 1,2,3-Trichlorobenzene 1,2,4-Trichlorobenzenea 1,3,5-Trichlorobenzene 1,3-Dichloro-4-hydroxybenzenea 1,2-Dichloro-4-methylbenzene 1,3-Dichloro-4-methylbenzene 1-Hydroxy-2,4-dimethylbenzene 1-Hydroxy-2,6-dimethylbenzene 1-Hydroxy-3,4-dimethylbenzene 1-Hydroxy-2,4-dinitrobenzene 1,2,4-Trimethylbenzenea 1-Methyl-2,3-dinitrobenzene 1-Methyl-2,4-dinitrobenzene 1-Methyl-2,6-dinitrobenzene 1-Methyl-3,4-dinitrobenzene 1-Methyl-3,5-dinitrobenzene 1-Amino-2-methyl-3,5-dinitrobenzenea 1-Amino-2-methyl-3,6-dinitrobenzene
-log(LC50) 3.40 3.89 3.77 3.51 4.40 4.30 4.62 4.02 3.84 4.33 3.04 3.21 3.77 3.29 3.58 3.36 3.07 3.48 4.21 3.57 3.63 3.76 4.38 3.48 3.24 3.35 3.80 3.80 3.79 3.77 4.89 5.00 4.74 4.30 4.74 4.54 3.86 3.75 3.90 4.04 4.21 5.01 3.75 3.99 5.08 3.91 4.12 5.34
124 Table 1 Compounds and Corresponding -log(LC 50) Values Used in This Study No. 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
Compound 1-Amino-2,4-dinitro-3-methylbenzene 1-Amino-2,6-dinitro-3-methylbenzene 1-Amino-2,6-dinitro-4-methylbenzene 1-Amino-3,5-dinitro-4-methylbenzene 1,3,5-Tribromo-2-hydroxybenzene 1,2,3,4-Tetrachlorobenzene 1,2,4,5-Tetrachlorobenzene 1-Methyl-2,4,6-trinitrobenzene 1-Hydroxy-2,3,4,5,6-pentchlorobenzene 1-Amino-4-bromobenzene 1-Amino-3,4-dichlorobenzene 1-Amino-2,4-dinitrobenzene 1-Amino-2-chloro-4-methylbenzene 1-Amino-2-chloro-4-nitrobenzene 1-Amino-2,3,4,-trichlorobenzenea 1,3,5-Trichloro-2,4-dinitrobenzene 1-Amino-2,3,5,6-tetrachlorobenzene 1-Cyano-3,5-dibromo-4-hydroxybenzene 1-Cyano-2-amino-5-chlorobenzene 1-Cyano-2-chloro-6-methylbenzene 1-Cyano-2-methylbenzene 1-Aldehydo-2-chloro-5-nitrobenzene 1-Aldehydo-2,4-dichlorobenzene 1-Aldehydo-4-chlorobenzene 1-Aldehydo-2-nitrobenzene 1-Aldehydobenzene 1-Aldehydo-2,4-dimethoxybenzene 1-Aldehydo-2-hydroxy-5-bromobenzenea 1-Aldehydo-2-hydroxy-5-chlorobenzene 1-Aldehydo-2-hydroxybenzene 1-Aldehydo-2-methoxy-4-hydroxybenzene 1-Aldehydo-3-methoxy-4-hydroxybenzene 1-Aldehydo-2-hydroxy-4,6-dimethoxybenzene 1-Amino-2,3,4,5,6-pentaflourobenzene 1-Flouro-4-nitrobenzene 1-Amino-4-flourobenzene 1-Aldehydo-pentaflourobenzenea 1-Aldehydo-2-chloro-6-flourobenzenea 1-Acyl-4-chloro-3-nitrobenzene 1-Acy1-2,4-dichlorobenzene 1-Acylbenzene 3-Nitrobenzonitrile 4-Nitrobenzonitrile 2-Amino-4-nitrotoluene 2-Amino-6-nitrotoluene 3-Amino-4-nitrotoluene 4-Amino-2-nitrotoluene 3-Methyl-2-nitrophenol
-log(LC50) 4.26 4.21 4.18 4.46 4.70 5.43 5.85 4.88 6.06 3.56 4.33 4.07 3.60 3.93 4.74 6.09 5.93 4.38 3.73 4.00 3.42 4.72 4.99 4.81 4.02 4.14 3.92 5.19 5.31 4.73 4.02 4.81 4.83 3.69 3.70 3.82 5.25 4.23 4.56 4.21 2.87 3.39 3.79 3.34 3.48 3.80 3.77 3.52
Table 1 Compounds and Corresponding -log(LC50) Values Used in This Study No. 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
Compound
-log(LC5o)
3.51 5-Methyl-2-nitrophenol 4.39 1,5-Dimethyl-2,4-dinitrobenzene 4.12 2-Amino-4,6-dinitrotoluene 4.21 3-Amino-2,4-dinitrotoluene 4.26 3-Amino-2,6-dinitrotoluene 4.46 4-Amino-2,6-dinitrotoluene 4.92 2,4-Dinitro-5-methylphenol 5.29 1,3,5-Trinitrobenzene 2.58 1-Chloro-2-propanol 2.70 2,2,2-Trichloroethanol 3.49 2,3-Dibromopropanol Cyclohexanol 2.15 2.60 2-Phenoxyethanol Acetone 0.85 2-Butanone 1.35 2-Pentanone 1.75 2.00 3-Methyl-2-butanone 5-Methyl-2-hexanone 2.86 4-Methyl-2-pentanone 2.29 3,3-Dimethyl-2-butanone 3.07 Acetophenone 2.87 Benzophenone 4.09 2,3,4-Trichloroacetophenone 5.00 2,4-Dichloroacetophenone 4.16 tert-Butylmethyl ether 2.10 Diisopropyl ether 3.05 2,6-Dimethoxytoluene 3.88 Diphenyl ethers 4.63 p-Nitrophenyl phenyl ether 4.91 1,2-Dichloroethane 2.92 1,1,2-Trichloroethane 3.21 1,1,2,2-Tetrachloroethane 3.92 Pentachloroethane 4.44 5.19 Hexachloroethanea Methylbenzeneb 3.32 1-Amino-3-nitro-4-methylbenzeneb 3.65 1-Chloro-2-methyl-4-hydroxybenzeneb 4.27 1-Aminobenzeneb 2.84 1-Aldehydo-2-hydroxy-3,5-dibromobenzeneb 5.52 5-Amino-2,4-dinitrotolueneb 4.91 1-Chloronaphthaleneb 4.85 Pentachloronaphthaleneb 6.01 Carbon tetrachlorideb 3.75 1-Aldehydo-2-flourobenzeneb 4.96
The unit of LC50 is [mol/L] and was taken from reference [1]. a Compound used in the cross-validation set for computational neural networks. b Compound used in the prediction set for regression and neural network models.
tor development software. The descriptor pool contained topological, electronic, geometric, physicochemical, and CPSA descriptors. The number of descriptors was reduced to a set of 65 using standard statistical procedures. Descriptors that contained little or no useful information and descriptors that were similar in value and content to other descriptors were rejected.

5.2. Results and discussion

Numerous multiple linear regression models were created using leaps-and-bounds regression, and their quality was determined by examining the multiple correlation coefficient (R), standard deviations of regression, and the number of variables in the model. The best model contained five descriptors, with R = 0.914 and s = 0.370. Models that contained a larger set of descriptors did not significantly improve the R and standard deviation and were therefore not used. The same model was found using simulated annealing (ANNLIN).
The best regression model contained one electronic, one CPSA, and three topological descriptors. The electronic descriptor was the charge on the most negative atom (QNEG). The CPSA descriptor was the partial positive surface area (PPSA-1) developed by Stanton and Jurs [9]. The topological descriptors found were an aldehyde group contribution descriptor, a molecular connectivity descriptor (¹χ), and a path count and length descriptor. ¹χ is a path-one valence-corrected molecular connectivity descriptor that was introduced by Randić [31] and applied to structure-activity analysis by Kier and Hall [32]. ¹χ is a measure of the degree of branching in a molecule and thus correlates with many physical properties such as boiling points and retention indices. The path descriptor, originally reported by Randić [33], is the total weighted number of paths of length between 0 and 45 in the structure divided by the number of atoms. The total weighted number of paths is given as N(1) + N(2)exp(-1) + N(3)exp(-2) + ..., where N(k) is the number of paths of length k. The model was improved by detecting and then rejecting several outliers. Outlier detection was performed using traditional regression diagnostics [34]. Standard statistical values such as residuals, standardized residuals, leverage, studentized residuals, and DFITS were computed to detect possible outliers. The leverage is a measure of the weight a point has in determining the regression equation. The residual is the absolute difference between the actual dependent variable value and the predicted dependent variable value. The standardized residual allows the comparison with a given distance from the mean. The studentized residual is a direct comparison to the t-values for the distribution. The DFITS value provides a measure of the difference or change in the estimated value of the i-th dependent variable when the regression coefficients are recalculated without the i-th value. If any of the tests fail they are marked with an asterisk, and if three or more fail the compound is considered an outlier. The results are presented in Table 2. Three compounds were found to be outliers, and when they were removed the R value and the standard deviation of regression were improved to 0.922 from 0.914 and 0.351 from 0.370, respectively. The five descriptors found in this model were then fed to a computational neural network in an attempt to improve the predictive ability. The program ANN was used to optimize the starting weights and biases. The quality of the model was assessed by calculating the residuals [actual - predicted values of -log(LC50)] of the prediction set compounds.
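The weighted path count described above can be illustrated with a short sketch. Python is used for illustration only; the graph representation and helper names are assumptions, while the exponential weighting and the cap of 45 on the path length are taken from the text.

import math

def count_paths(adj):
    """Count simple paths of each length k >= 1 (in bonds) in a molecular graph
    given as an adjacency list {atom: set(neighbours)}."""
    counts = {}
    def dfs(node, visited, length):
        if length >= 1:
            counts[length] = counts.get(length, 0) + 1
        for nxt in adj[node]:
            if nxt not in visited and length < 45:   # path lengths 1..45 as in the text
                dfs(nxt, visited | {nxt}, length + 1)
    for start in adj:
        dfs(start, {start}, 0)
    # each undirected path was traversed from both ends, so halve the counts
    return {k: v // 2 for k, v in counts.items()}

def weighted_path_descriptor(adj):
    """Total weighted number of paths, N(1) + N(2)exp(-1) + N(3)exp(-2) + ...,
    divided by the number of atoms."""
    n_paths = count_paths(adj)
    total = sum(n * math.exp(-(k - 1)) for k, n in n_paths.items())
    return total / len(adj)

# Example: n-butane as a 4-atom chain (hydrogens suppressed)
butane = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(weighted_path_descriptor(butane))   # N(1)=3, N(2)=2, N(3)=1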
Table 2
Detection of Outliers

Compound Number   Leverage   Residual   Standardized Residual   Studentized Residual   DFITS
53                0.097*     -0.81      -2.31*                  -2.35*                 -0.77*
65                0.046       0.93       2.57*                   2.63*                  0.58*
84                0.044       0.74       2.03*                   2.06*                  0.44*

Compound 53 - 1,3,5-Tribromo-2-hydroxybenzene
Compound 65 - 1-Amino-2,3,5,6-tetrachlorobenzene
Compound 84 - 1-Amino-4-flourobenzene
An asterisk represents a failed test.
The regression model, the neural network model, and a group contribution model developed by Gao [1] were then compared. Gao used 16 structural groups to develop a model. Table 3 provides a comparison of the predictive power of the three approaches. The 16-descriptor group contribution model gave the smallest overall rms error of 0.40. The neural network model using only 5 descriptors gave an rms error of 0.41. ANNDES was then used to seek a better 5-3-1 neural network model. Since ANNDES is such a computationally intensive routine, it was believed that a good initial guess would improve the results. Therefore the quality value was calculated for each descriptor and the descriptors with the five best values were used as the starting point. ANNDES was started several times using different starting weights and biases for the neural network routine. Table 4 shows the rms errors for the training, cross-validation, and prediction sets of three separate ANNDES models. Figure 5 shows the calculated vs. observed -log(LC50) values for the ANNDES 1 model. The members of the training set, cross-validation set, and prediction set are plotted with different symbols to allow each to be seen clearly. In general, the optimized models were either comparable to or slightly better than the five-descriptor neural network model based on descriptors found using multiple linear regression. Table 5 is a list of the eight descriptors used for the three ANNDES models. The only descriptor present that was also found in the regression model is the aldehyde (CHO) group contribution descriptor. This descriptor was shown to be very important in all four models. It was also shown to have the highest toxicity contribution of all the chemical groups used in Gao's group contribution model. The CPSA descriptor from the regression model (PPSA-1) was not found in any of the ANNDES models. However, the PPSA-2 descriptor, a charge-weighted version of PPSA-1, was found for all three of the ANNDES models. The topological ¹χ descriptor, a path-one valence descriptor, was found for the regression model but not for the ANNDES models. However, all
Table 3
Residuals for the prediction set compounds using three different models

Compound                                    Residuals^a
                                            Gao Model^b   MLR Model^c   CNN Model^c
Methylbenzene                                  0.06         -0.17         -0.28
1-Amino-3-nitro-4-methylbenzene                0.29         -0.01          0.03
1-Chloro-2-methyl-4-hydroxybenzene             0.07          0.50          0.54
1-Aminobenzene                                 0.07          0.13         -0.05
1-Aldehydo-2-hydroxy-3,5-dibromobenzene        0.01         -0.30          0.40
5-Amino-2,4-dinitrotoluene                     0.51          0.60          0.59
1-Chloronaphthalene                            0.48         -0.33          0.57
Pentachloronaphthalene                        -0.49         -1.14         -0.36
Carbon tetrachloride                           0.33          0.00          0.00
1-Aldehydo-2-flourobenzene                     0.81          0.77          0.56
RMS Error                                      0.40          0.53          0.41

^a Residual = observed - predicted value of -log(LC50).
^b Group contribution model that uses 16 substituents as descriptors.
^c The MLR and CNN models use five descriptors.
Table 4
The rms errors of the three models found using generalized simulated annealing

                RMS errors
Model           Training Set   Cross-validation Set   Prediction Set
ANNDES 1            0.32              0.31                 0.35
ANNDES 2            0.36              0.33                 0.39
ANNDES 3            0.40              0.37                 0.28
[Figure 5 plots calculated vs. observed -log(LC50) values for the training set (117 compounds, rms error 0.32), the cross-validation set (13 compounds, rms error 0.31), and the prediction set (10 compounds, rms error 0.35).]
Figure 5. Calculated vs. observed -log(LC50) values using a computational neural network model with the descriptor subset selected by generalized simulated annealing (ANNDES).

the ANNDES models contained a molecular connectivity descriptor of some kind. Also, the electronic descriptor found by regression, QNEG, was not found in any of the ANNDES models, but the atomic charge was shown to be important. The descriptors PPSA-2 and SCSP2 were found by ANNDES and further confirmed the importance of atomic charge in good toxicity models. Although the models found using regression and ANNDES contain different descriptors, there are obvious similarities for each model. The most important features found in all the models used for predicting toxicity of chemicals of this class seem to be the aldehyde groups, molecular connectivity, and atomic charge.

5.3. Conclusions
An automated descriptor selection routine for neural networks using GSA has been shown to be successful. Neural network models found using this algorithm are similar or superior to neural network models found using regression techniques, which might be due to
Table 5
Definitions of the descriptors used in the three models found using generalized simulated annealing

Descriptor   Descriptor Definition
CHO          Aldehyde group contribution
PPSA-2       Total charge weighted partial positive surface areas^a
V5C          Five path valence molecular connectivity^b
S5C          Five path simple molecular connectivity^b
N5C          The total count of five path lengths
NN           The number of nitrogens
MREF         The whole-molecule molar refraction value^c
SCSP2        The smallest sp2 carbon atomic charge^d

^a See reference 9. ^b See reference 32. ^c See reference 35. ^d See reference 8.
nonlinear features found in the QSAR. Nonlinear features cannot be successfully modeled using multiple linear regression, but seemingly do not pose a problem for the neural network based algorithm, ANNDES. Therefore, there are many advantages to using generalized simulated annealing for descriptor selection. Most importantly, the models found by ANNDES seem to be superior to those found using regression. However, there are disadvantages to using this technique. The main disadvantage is the computational time required to perform the optimization. Since time is a concern, the number of iterations must be limited. A small number of iterations may cause convergence to local conditions due to a quick decrease in the annealing temperature. Considering the fact that the number of iterations is limited, ANNDES does quite well by finding subsets that are as good or slightly better than the regression subsets. The methods presented have been applied successfully to the development of a QSAR for toxicity. The descriptor subset for each model found using regression and neural network is unique, but there are many similarities. Therefore, as with any QSAR model, features that are important to toxicity are often uncovered using this technique. The descriptors in this study reveal the importance of an aldehyde group for toxicity, which is confirmed in Gao's study. Also molecular connectivity and atomic charge were found to be important. Several good neural network models were found that were based on just five calculated structural descriptors and which showed good predictive ability for unknown compounds not used during model development.
REFERENCES
1. C. Gao, R. Govind and H.H. Tabak, Environmental Toxicology and Chemistry, 11 (1992) 631.
2. A.J. Stuper, W.E. Brugger and P.C. Jurs, Computer-Assisted Studies of Chemical Structure and Biological Function, New York, 1979.
3. P.C. Jurs, J.T. Chou and M. Yuan, in Computer-Assisted Drug Design, E.C. Olson and R.E. Christoffersen (eds.), Washington, D.C., 1979.
4. U. Burkert and N.L. Allinger, Molecular Mechanics, Washington, D.C., 1982.
5. J.J.P. Stewart, MOPAC 6.0, Quantum Chemistry Program Exchange, Indiana University, Bloomington, IN, Program 455.
6. S.L. Dixon and P.C. Jurs, J. Comp. Chem., 15 (1994) 733.
7. R.J. Abraham, L. Griffiths and P. Loftus, J. Comp. Chem., 3 (1982) 407.
8. S.L. Dixon and P.C. Jurs, J. Comp. Chem., 13 (1992) 492.
9. D.T. Stanton and P.C. Jurs, Anal. Chem., 62 (1992) 2323.
10. G. Furnival and R. Wilson, Technometrics, 16 (1974) 499.
11. N.R. Draper and H. Smith, Applied Regression Analysis, 2nd Ed., New York, 1981.
12. D.A. Belsley, E. Kuh and R.E. Welsch, Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, New York, 1980.
13. T.A. Andrea and H. Kalayeh, J. Med. Chem., 34 (1991) 2824.
14. D.W. Salt and N. Yildiz, Pesticide Science, 36 (1992) 161.
15. L. Xu, J.W. Ball, S.L. Dixon and P.C. Jurs, Environmental Toxicology and Chemistry, 13 (1994) 841.
16. T. Aoyama, Y. Suzuki and H. Ichikawa, J. Med. Chem., 33 (1990) 905.
17. D.J. Livingstone and P.T. Manallack, J. Med. Chem., 36 (1993) 1295.
18. T. Schlick, in Reviews in Computational Chemistry, B. Lipkowitz and B. Boyd (eds.), New York, 1992.
19. C.G. Broyden, J. Inst. Maths. Appl., 6 (1970) 76.
20. R. Fletcher, Comput. J., 13 (1970) 317.
21. D. Goldfarb, Math. Comput., 24 (1970) 23.
22. D.F. Shanno, Math. Comput., 24 (1970) 647.
23. R. Fletcher, Practical Methods of Optimization, Vol. 1, Unconstrained Optimization, New York, 1980.
24. P.A. Jansson, Anal. Chem., 63 (1991) 357A.
25. I.M. Navon, F.B. Brown and D.H. Robertson, Comput. Chem., 14 (1990) 305.
26. I.O. Bohachevsky, M.E. Johnson and M.L. Stein, Technometrics, 28 (1986) 209.
27. J.M. Sutter and J.H. Kalivas, Microchemical Journal, 47 (1993) 60.
28. C. Hansch and A.J. Leo, Substituent Constants for Correlation Analysis in Chemistry and Biology, New York, 1979.
29. S.M. Free and J.W. Wilson, J. Med. Chem., 7 (1964) 395.
30. J.J.P. Stewart, J. Comput.-Aided Mol. Des., 4 (1990) 1.
31. M. Randić, J. Am. Chem. Soc., 97 (1975) 6609.
32. L.B. Kier and L.H. Hall, Molecular Connectivity in Structure-Activity Analysis, New York, 1986.
33. M. Randić, Comput. Chem., 3 (1979) 5.
34. D.A. Belsley, E. Kuh and R.E. Welsch, Regression Diagnostics, New York, 1980.
35. A.I. Vogel, Textbook of Organic Chemistry, Chaucer, 1977.
Chapter 6

Fundamentals of cluster analysis using simulated annealing

D.E. Brown* and C.L. Huntley
Department of Systems Engineering, University of Virginia, Charlottesville, Virginia 22903, U.S.A.

Clustering provides a mechanism for organizing or classifying things. The existence of large, complex data sets coupled with the widespread availability of computational resources has led to the growth of computer algorithms to perform clustering. In this chapter, we formalize clustering as a combinatorial optimization problem with a user-defined internal clustering criterion. This criterion defines the type of structure sought by the clustering algorithm. Our formalization provides the means to accommodate the two general classes of clustering algorithms: partitional and hierarchical. With these formulations we then show how simulated annealing can be used to generate near-optimal clusterings. Finally, we provide examples of the use of simulated annealing with different internal criteria on a problem from the domain of multi-sensor data fusion. In these examples, simulated annealing is used to find both a near-optimal partitioning and hierarchy with respect to each of several clustering criteria for a variety of simulated data sets.

1. CLUSTERING

Clustering concerns the organization and classification of things. Human understanding depends upon this ability to organize and classify as the basic method for dealing with vast quantities of information. In science the maturity of a field is frequently judged by the ability of its practitioners to organize knowledge and to effectively classify objects within its domain. For example, within chemistry the periodic table produced by Mendeleyev stands as an example of the genius required to see connections among disparate objects and to use those connections to organize elements in the environment into categories that effectively relate similar properties and attributes. Humans have a remarkable innate aptitude to cluster and classify. This ability appears to have been born of necessity. By grouping things together, we can more efficiently manage large quantities of information about the world. Babies seem to perform elementary clustering of objects into edible and inedible. As adults most of us are required to perform much more complex clustering, such as organizing bacterial infections into those
* This work was supported in part by the Jet Propulsion Laboratory under grant number 95722
treatable by the same antibiotics or grouping subordinates into teams to accomplish various work tasks. We typically cluster to accomplish one of three objectives: information retrieval, prediction, and understanding. Clustering for information retrieval provides an organizational framework for quickly finding relevant information. This library function has become more important as the quantity of information within most professions has experienced exponential growth. Clustering for prediction allows us to more easily estimate outcomes. For example, machines produced by similar manufacturing processes might have similar reliabilities and failure rates. Diseases with similar causes might also have similar treatments. Finally, clustering for understanding attempts to gain a deeper appreciation of the objects under study by using the organizational scheme provided by the cluster analysis. Grouping arrow points allows the archeologist to better appreciate the cultural relationships among various tribes. This search for understanding through structure motivates much scientific interest in clustering and has become more important as we attempt to find structure in complex data sets. In addition, once a structure is found in the data it can serve as the basis for communicating results to others. These three objectives do not necessarily or even frequently produce the same resulting clustering. We might organize novels for retrieval by alphabetic similarities in the last names of the authors, while students of literature would find this an absurd method to organize these works for understanding the contributions of the authors. This example illustrates another important point: there is not one globally correct clustering. Instead, a clustering can only be judged by its usefulness with respect to the objective sought. While humans provide unsurpassed clustering ability for small groups of objects, the sheer quantity of information in many domains has led to the need for automated clustering procedures. Clustering algorithms implement the rules that will guide the discovery of structure in the data. Once implemented, these clustering algorithms allow for the rapid organization of large quantities of data according to their underlying rules. Three requirements exist for the implementation of clustering algorithms: 1) a data structure used to define clusters; 2) a measure of similarity or distance between objects in clusters; and 3) an internal clustering criterion based on the similarity or distance measure and a model of the clusters expected in the domain. The data structure provides the method for describing the partitioning of the objects. Typically the data structure for the objects consists of a matrix in which each row represents a different object and each column represents a different attribute. This matrix can also be converted into a (possibly symmetric) matrix of similarities or distances in which the ij entry shows the similarity or distance between objects i and j. A convenient data structure for clustering is a vector of length equal to the number of objects and in which each entry labels the corresponding object with its assigned cluster. Similarity and distance (or dissimilarity) measures provide the means for converting the attributes of the objects into a relevant numerical score.
There are many choices in developing these measures and these involve the types of variables (ordered or categorical) and the methods for combining the variables. For instance we might not want to weight all variables equally and we might have both ordered and categorical variables. Anderberg (1973) provides a complete discussion of various methods to provide this measure. Suffice it to say that the choice of the similarity or distance measure dramatically affects the resulting outcome. The clustering criterion uses the measure of similarity or distance to judge among competing clusterings of the data. An internal clustering criterion uses only the information present in the data to make this judgment. Sometimes when we evaluate competing clustering approaches we start with a known partition of the data. We then evaluate the different approaches by examining how well they recover this existing partition. In this case we use an external clustering criterion because we use the information about the known clusters which is not normally available to a clustering algorithm. The internal clustering criterion allows us to formulate clustering as an optimization problem. Unfortunately, this optimization problem falls into the category of NP-hard, making it intractable for all but the smallest problem instances. Hence, a number of heuristic approaches have been advocated and in many cases these approaches do not explicitly specify the internal criterion being optimized. The advantage of simulated annealing, as shown in the first chapter of this book, is its generality to combinatorial optimization problems. It should come as no surprise therefore that it has applicability to clustering. However, this applicability is not straightforward. In a previous study by Klein and Dubes (1989), simulated annealing provided good clusterings, but proved impractical for repeated use on large clustering problems because of the computational effort involved. In this chapter we show practical applications of simulated annealing to clustering. In the next section of this chapter we formalize the clustering optimization problem. This formalization allows us to apply simulated annealing as a global optimization technique, which we describe in Section 3. Section 4 provides examples of the use of simulated annealing clustering algorithms and the importance of internal clustering criteria to these techniques. Section 5 contains our conclusions.

2. THE CLUSTERING OPTIMIZATION PROBLEM

Clustering problems can have numerous formulations depending on the choices for data structure, similarity/distance measure, and internal clustering criterion. This section first describes a very general formulation, then it details special cases that correspond to two popular classes of clustering algorithms: partitional and hierarchical. At a basic level, clustering is a combinatorial optimization problem:
Let
    Q be the set containing all objects to be clustered,
    C be the set of all feasible clusterings of Q,
    J: C → ℝ be the internal clustering criterion;
Then
    Minimize J(c)    (1)
    Subject to c ∈ C.    (2)
Equations (1) and (2) represent the most general form of the optimal clustering problem. The objective is to find the clustering c that minimizes an internal clustering criterion J. J typically employs a similarity/dissimilarity measure to judge the quality of any c. The set C defines c's data structure, including all the feasible clusterings of the set Q of all objects to be clustered. A clustering algorithm maps Q into C. There are two basic types of clustering algorithms. The first type is partitional, and these algorithms construct a simple partitioning of Q into a set of nonoverlapping clusters. The second type is hierarchical, and these algorithms decompose Q into several levels of partitionings. Hierarchical decomposition is structured as a dendrogram, a tree that iteratively splits Q into smaller subsets until each object is in its own subset. The dendrogram can be created from the leaves up to the root (the "agglomerative" approach) or from the root down to the leaves (the "divisive" approach). The most common agglomerative clustering schemes are described in Johnson (1967). Partitioning is most appropriate when one is only interested in the subsets or clusters, while hierarchical decomposition is most applicable when one seeks to show similarity relationships between clusters. Section 2.1 formalizes the combinatorics of the partitional strategy and Section 2.2 does the same for hierarchical methods. The formulations we derive here provide the basis for the application of the simulated annealing algorithm to the underlying optimization problem as we show in Section 3.
2.1. Partitional clustering

Building on the basic combinatorial problem in (1) and (2), we define optimal partitioning, where the vector p represents the assignment of objects to clusters:
Let
    Q be the set of all objects to be clustered,
    n = |Q| be the number of objects in Q,
    k ≤ n be the maximum number of clusters,
    P = {p : ∀i ∈ {1,...,n}, p_i ∈ {1,...,k}} be the set of all partitionings,
    J: P → ℝ be the internal clustering criterion;
Then
    Minimize J(p)    (3)
    Subject to p ∈ P.    (4)
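As a concrete illustration of this label-vector formulation, the sketch below evaluates one common internal criterion, the total squared error of the objects about their cluster means, for a given partitioning p. Python is used for illustration only, and the choice of criterion is an assumption drawn from the K-means discussion that follows; it is not the only possibility.

import numpy as np

def squared_error(X, p, k):
    """Internal criterion J(p): total squared distance of each object to the
    mean of its assigned cluster. X is an n-by-d data matrix and p is the
    length-n vector of cluster labels in {1, ..., k}."""
    cost = 0.0
    for label in range(1, k + 1):
        members = X[p == label]
        if len(members) > 0:
            centre = members.mean(axis=0)
            cost += float(((members - centre) ** 2).sum())
    return cost

# Example: five 2-d objects assigned to k = 2 clusters
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
p = np.array([1, 1, 1, 2, 2])
print(squared_error(X, p, k=2))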
Each cluster has a unique, integer cluster "label" in {1,...,k}, and the vector p assigns a cluster label p_i to the i-th object in Q. The function J maps elements of P into a real-valued cost. This formulation shows clustering as an assignment problem to facilitate direct implementation of combinatorial optimization techniques (e.g., simulated annealing). There are a variety of algorithms to solve such a problem. A thorough survey of partitional clustering algorithms is in Jain and Dubes (1988). Few partitional algorithms guarantee a global-optimum solution to their associated problem formulation. K-means (see Hartigan, 1975), for example, uses a greedy improvement heuristic to approximate the best "squared error" clustering. Thus, the algorithm is based on minimizing the total squared distance of the objects to their associated cluster means. There are many variants on K-means, and many of them converge rapidly on a locally optimal clustering, but none is guaranteed to converge on the global optimum. As shown by Klein and Dubes (1989), simulated annealing tends to find significantly better clusterings, but often requires much greater computational effort. Unlike K-means or simulated annealing, some algorithms do not have any clear objective. Often, they solve a constraint-satisfaction problem. Consider, for example, ISODATA (Ball and Hall, 1965), a popular partitioning algorithm based on a squared error criterion with k = n. Since minimizing squared error with k = n is solved by placing each object in its own cluster, ISODATA translates this underlying objective into a set of "splitting" and "lumping" constraints on the clusters. The algorithm starts with an arbitrary clustering and splits or joins clusters until all clusters satisfy the splitting and lumping constraints, settling on some number k′ ≤ n of clusters. Although simulated annealing cannot be applied directly to constraint satisfaction problems, one can often define a J that approximates the meaning of one or more constraints, as we show in the examples of Section 4.

2.2. Hierarchical clustering

As stated earlier, hierarchical clustering algorithms operate on a type of tree called a dendrogram. Each leaf of the dendrogram contains one and only one element of Q and all elements have a leaf node. From these leaf nodes
the dendrogram shows the iterative grouping of elements into clusters and these clusters into larger clusters until at the root node all elements are combined in one cluster. Each cluster in the dendrogram is described by a subtree; for any given subtree u of a dendrogram t there exists a subset S(u) of objects in Q. Each subtree with |S(u)| > 1 has nonempty left and right subtrees (l(u) and r(u), respectively) that divide S(u) into two nonoverlapping subsets. All subtrees u ≠ t also have a parent a(u), for which u ∈ l(a(u)) ∪ r(a(u)). Note that there is a subtle distinction between a dendrogram and any other subtree: a dendrogram is a subtree with no parent that clusters all the objects in Q. Also note that a dendrogram leaf is a subtree that clusters a single object and has no left or right subtrees. We can now formulate hierarchical clustering as a combinatorial optimization problem. To do this we use the general procedure of Wallace and Kanade (1990).
Let
    Q be the set containing all objects to be clustered,
    T be the set of all feasible dendrograms of Q,
    U ⊇ T be the set of all subtrees u of dendrograms t ∈ T,
    l: U → U be the left subtree of u, ∀u ∈ U,
    r: U → U be the right subtree of u, ∀u ∈ U,
    e: U → ℝ+ be a nonnegative, real function on U,
    J: U → ℝ be the internal clustering criterion;
Then
    Minimize J(t)    (5)
    Subject to t ∈ T,    (6)
where, ∀u ∈ U,
    J(u) = 0 if u = ∅, and J(u) = e(u) + J(l(u)) + J(r(u)) otherwise.    (7)
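The recursion in (7) is easy to express on a simple binary-tree representation of a dendrogram. The sketch below is only an illustration (Python); the node class and the particular level values are assumptions, not part of the original formulation.

from dataclasses import dataclass
from typing import Optional, List

@dataclass
class Subtree:
    objects: List[int]                     # S(u), the objects clustered under this subtree
    level: float = 0.0                     # e(u); zero for leaves
    left: Optional["Subtree"] = None       # l(u)
    right: Optional["Subtree"] = None      # r(u)

def J(u: Optional[Subtree]) -> float:
    """Internal criterion of equation (7): 0 for an empty subtree, otherwise
    e(u) plus the criterion of the left and right subtrees."""
    if u is None:
        return 0.0
    return u.level + J(u.left) + J(u.right)

# Example dendrogram over four objects: {0,1} and {2,3} merged at the root
leaves = [Subtree([i]) for i in range(4)]
left = Subtree([0, 1], level=1.0, left=leaves[0], right=leaves[1])
right = Subtree([2, 3], level=1.5, left=leaves[2], right=leaves[3])
root = Subtree([0, 1, 2, 3], level=4.0, left=left, right=right)
print(J(root))   # 4.0 + 1.0 + 1.5 = 6.5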
In this general formulation of the hierarchical clustering problem, the internal criterion J(t) is calculated recursively from all the subtrees u of t. The value e(u) is sometimes called the level of the subtree u in the dendrogram. In keeping with this interpretation, e is nonincreasing along paths from the root to the leaves. Associated with each level is a partitioning of the objects. More specifically, if the level is v then we define the partitioning as follows:
Let
    j(u) = ∞ if u = ∅, and j(u) = e(u) otherwise,
Then
    p(v) = {S(u) : j(u) ≤ v and j(a(u)) > v}.    (8)
This definition selects the largest clusters S(u) such that j(u) ≤ v. The function j is necessary because there is no parent subtree for a dendrogram (i.e., a(t) = ∅ for all t ∈ T). Note that p(∞) contains the single cluster of all objects (i.e., p_i = p_j for all i < j ≤ n) and p(0) contains the singleton clusters (i.e., p_i ≠ p_j for all i < j ≤ n). Many hierarchical clustering algorithms are based on ultrametric partitionings of the objects. As described by Johnson (1967), an ultrametric d between clusters satisfies the usual metric conditions plus the ultrametric inequality,
    d[a,c] ≤ max{d[a,b], d[b,c]}  ∀ clusters a, b, c.    (9)
Given an ultrametric, Johnson's hierarchical clustering schemes define the level e(u) of a subtree u as
e(u) = d[S(l(u)), S(r(u))]
(10)
and construct the dendrogram agglomeratively, merging at each iteration the two least distant clusters. Consider, for example, the complete and single linkage methods described in Sokal and Sneath (1963). For complete linkage, d[a,c] = max{d[a,b], d[b,c]}, and e(u) is the maximum distance between objects in the cluster S(u). Thus, the complete linkage method attempts to minimize the diameters of the clusters at each level of the dendrogram. For single linkage, d[a,c] = min{d[a,b], d[b,c]}, and e(u) is the minimum distance from an object in l(u) to an object in r(u). With this interpretation, the single linkage method is equivalent to finding the minimal spanning tree of the objects. Agglomerative methods, such as single link and complete link, are stepwise procedures. The formulation in (5)-(7) allows us to define the hierarchical clustering problem in terms of combinatorial optimization. To do this, however, we need an appropriate internal clustering criterion. The most obvious is squared error. Squared-error is one of the most common of all clustering criteria. Provided that the clusters are fairly spherical and are of approximately the same size, squared-error performs extremely well. Thus, in the absence of any prior information, squared-error is often a suitable choice for exploring a new data set. This explains why squared-error is fundamental to many partitioning algorithms like ISODATA (Ball and Hall, 1965) and K-means (see Hartigan, 1975). It provides a compact, accurate measure of clustering
quality whose behavior can be understood on both intuitive and theoretical levels. What is not clear is exactly how one might use squared-error in a formulation like (5)-(7). If one follows the example provided by Wallace and Kanade (1990) in their use of a criterion with characteristics similar to squared-error's, then we should define the level of a cluster by its total squared-error:
    e(u) = Σ_{i=1}^{n_u} Σ_{j=1}^{d} (x_ij^(u) - x̄_j^(u))²    (11)
where n_u is the number of objects in S(u) = {x^(u)} and d is the dimensionality of the data. Equation (11) corresponds to minimizing the squared-error summed over all clusters. Note, however, that most clusters appear in more than one level partitioning (equation (11)). Hence, another possible squared-error level function is
    e(u) = (1/l_u) Σ_{i=1}^{n_u} Σ_{j=1}^{d} (x_ij^(u) - x̄_j^(u))²
(12)
i=1
where 4, is the number of level partitions in which S(u) appears. Equation (12) corresponds to minimizing the squared-error summed over all partitions. The idea for this second criterion came from examining the performance of Ward's clustering algorithm (Ward, 1963), a popular squarederror technique. Ward's algorithm builds its dendrogram from the bottom up, starting with a level partition of all objects into singleton clusters. At each iteration, the algorithm forms a new cluster m by merging the two clusters p and q that minimize the increase in squared-error: AError 2
n„, d
=
(i ;) EE(x V))2i Elf(x,(;) - .,(q))2 (m)
EE x , - x
(m).2
-i=1j
np d
(p) _
i=1 j=1 nm d
_
(13)
i=1 j=1
n =
d
(Since the change in squared-error is based solely on the cluster means and the sizes, (12) can be calculated very efficiently. Such efficiencies are also exploited by the simulated annealing implementation.) The use of the change in squared-error is intriguing because the sum of these cost increases along any path from a leaf node to the root of the dendrogram is constant. Hence, if one were to use (12) as a level function in the Wallace and Kanade sense, then
no dendrogram would be any better than any other. However, because large clusters with relatively small squared-error are created as early as possible, the squared-error summed over the sequence of partitions created is nearly optimal. This is just the property that (11) tries to capture: minimize the squared-error of the partitions, rather than concentrating on any particular cluster.

3. SIMULATED ANNEALING FOR CLUSTERING

As noted earlier in chapter 1 of this book, simulated annealing is a powerful optimization technique that attempts to find a global minimum of a function using concepts borrowed from statistical mechanics. Although it was first described in its entirety by Kirkpatrick et al. (1983), significant portions of the method were described as early as 1953 by Metropolis et al. (1953). Simulated annealing exploits the obvious analogy between the physical annealing process and combinatorial optimization problems, where the "molecules" are the variables in the data structure and the "energy" function is the objective function. For clustering, the objective function is one of the internal clustering criteria shown in equations (3) and (5) for partitional and hierarchical clustering, respectively. In these clustering problems, and more generally for combinatorial optimization problems, the "temperature" is a real-valued scalar that controls the degree of randomness of the search. Simulated annealing slowly decreases the temperature (by a factor of α each iteration) from the initial temperature T0 to the final temperature Tf, by which time the values of the decision variables have "frozen" into a very stable state. As shown in Aarts and Korst (1989), the limiting state as the temperature approaches zero is the global minimum.

3.1. Simulated annealing for partitional clustering

The application of simulated annealing to the partitional clustering formulation in equations (3) and (4) is straightforward. Algorithm 1 shows simulated annealing in the context of the general combinatorial optimization problem for partitional clustering. To implement this algorithm we need to provide two remaining problem-dependent details: the perturbation operator δ and the annealing schedule (MaxIt, T0, α, Tf) for partitional clustering. The perturbation operator for partitional clustering switches a randomly chosen object i in Q from one cluster to another randomly chosen cluster. Algorithm 2 shows the basic procedure. The set L contains the cluster labels used in p. Similarly, L^C contains the labels not used in p. The switching procedure first selects an integer m in the range [0, |L|]. If m equals 0 and there exists an unused cluster label (i.e., |L|
Algorithm 1 Simulated annealing for partitional clustering.

Procedure SA-P(δ, MaxIt, T0, α, Tf)
Let
    C be the set of all feasible clusterings,
    c, c′ ∈ C be the current and perturbed clusterings, respectively,
    δ: C → C be a randomized perturbation operator,
    J: C → ℝ+ be the internal clustering criterion,
    T ∈ ℝ+ be a "temperature" parameter that controls the "greediness",
    U[0,1) be a function that returns a random number between 0 and 1,
    MaxIt ∈ ℤ+ be the number of iterations of the Metropolis algorithm,
    α ∈ ℝ+, α < 1 be an "attenuation" constant for reducing the temperature,
    T0 and Tf be the initial and final temperatures.

T ← T0
REPEAT
    FOR i ← 1 TO MaxIt DO
        c′ ← δ(c)
        Δ ← J(c′) − J(c)
        IF Δ < 0 OR (e^(−Δ/T) ≥ U[0,1)) THEN c ← c′
    ENDFOR
    T ← αT
UNTIL T ≤ Tf
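A compact illustration of Algorithms 1 and 2 combined, using the squared-error criterion discussed in Section 2.1, is sketched below. Python is used for illustration only; the cooling parameters and the simplified perturbation (which always moves an object to a different label in 1..k) are arbitrary assumptions rather than the schedule and operator derived in the text.

import numpy as np

rng = np.random.default_rng(2)

def squared_error(X, p, k):
    # Internal criterion J(p): squared distance of objects to their cluster means.
    return sum(float(((X[p == c] - X[p == c].mean(axis=0)) ** 2).sum())
               for c in range(1, k + 1) if np.any(p == c))

def perturb(p, k):
    # Simplified Algorithm 2: move one randomly chosen object to a different cluster label.
    p_new = p.copy()
    i = rng.integers(len(p))
    choices = [c for c in range(1, k + 1) if c != p[i]]
    p_new[i] = rng.choice(choices)
    return p_new

def sa_partition(X, k, T0=1.0, Tf=1e-3, alpha=0.95, max_it=None):
    # Algorithm 1: Metropolis sweeps with a geometrically decreasing temperature.
    n = len(X)
    max_it = max_it or 4 * n
    p = rng.integers(1, k + 1, size=n)
    cost = squared_error(X, p, k)
    T = T0
    while T > Tf:
        for _ in range(max_it):
            p_new = perturb(p, k)
            delta = squared_error(X, p_new, k) - cost
            if delta < 0 or np.exp(-delta / T) > rng.random():
                p, cost = p_new, cost + delta
        T *= alpha
    return p, cost

X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])
print(sa_partition(X, k=2))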
Algorithm 2 Perturbation operator for partitional clustering.

Function δ(p)
Let
    n = |Q| be the number of objects to be clustered,
    L = {i ∈ {1,...,k} : ∃m ∈ {1,...,n} with p_m = i} be the set of cluster labels used in p,
    L^C = {i ∈ {1,...,k} : i ∉ L} be the set of cluster labels unused in p,
    SELECT(range) be a function that returns a random element from the set range,
    p, p′ ∈ P be the original and perturbed partitionings, respectively.

p′ ← p
i ← SELECT({1,...,n})
REPEAT
    m ← SELECT({0,...,|L|})
    IF |L| = k OR m > 0 THEN
        p′_i ← SELECT(L)
    ELSE
        p′_i ← SELECT(L^C)
    ENDIF
UNTIL p′_i ≠ p_i
RETURN p′

We have designed the annealing schedule to standardize the computational effort among implementations without compromising the quality of the resulting clustering solutions. The computational effort is made fair among implementations by allowing each run a fixed number of trial perturbations. The total number of perturbations tried in any run is MaxIt · NumTemp, where MaxIt is a fixed multiple of the number of objects to be clustered and NumTemp is a user-defined constant. The solution is made accurate using a very conservative annealing schedule. We calculate the initial temperature with the formula from Aarts and Korst (1989), which uses statistics compiled from MaxIt random perturbations:
T0 = μ+ / ln[ m+ / (χ m+ − (1 − χ)(MaxIt − m+)) ]                                    (14)

where

m+ = the number of cost increases in MaxIt random perturbations;
μ+ = the average cost increase over the perturbations; and
χ = the acceptance ratio, a real-valued scalar in (0,1).

For the final temperature, we require that

e^(−β μ+ / Tf) = ε,                                                                  (15)

where 0 < ε < 1 and 0 < β < 1, meaning that at the final temperature simulated annealing accepts a cost increase of βμ+ with probability ε. This simplifies to the following:

Tf = −βμ+ / ln(ε).                                                                   (16)

This formula is analogous to the estimate in White (1984), with βμ+ representing the smallest cost increase caused by a perturbation from a local minimum and 1/ε representing the number of perturbations possible at each step. With this interpretation, equation (15) implies that ε approximates the probability of escaping a local minimum at the final temperature. Given NumTemp, T0, and Tf, the calculation of α is straightforward:

α = (Tf / T0)^(1 / NumTemp)                                                          (17)
With sufficiently large NumTemp and MaxIt and sufficiently small (1 − χ), β, and ε, the annealing schedule ensures slow, steady convergence to a near-global optimum clustering. For the runs reported in this paper, the settings were NumTemp = 200, MaxIt = 4n, (1 − χ) = 0.25, β = 0.125, and ε = 10⁻¹¹.
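The schedule of equations (14)-(17) can be estimated directly from a short random walk over perturbations. The following Python sketch assumes the criterion J and the perturbation operator are available as callables, and that at least one cost increase is observed during the walk; all names and defaults are illustrative.

    import math

    def annealing_schedule(c0, J, perturb, max_it, chi=0.75, beta=0.125,
                           eps=1e-11, num_temp=200):
        """Estimate (T0, Tf, alpha) from MaxIt random perturbations of c0.

        chi       -- desired initial acceptance ratio (the text uses 1 - chi = 0.25)
        beta, eps -- final-temperature parameters of equation (15)
        """
        increases = []
        c = list(c0)
        for _ in range(max_it):
            c_new = perturb(c)
            d = J(c_new) - J(c)
            if d > 0:
                increases.append(d)
            c = c_new                      # random walk over clusterings
        m_plus = len(increases)            # assumed > 0 here
        mu_plus = sum(increases) / m_plus
        # Equation (14): initial temperature (Aarts and Korst, 1989); the
        # argument of the logarithm must be positive for chi close to 1.
        T0 = mu_plus / math.log(m_plus / (chi * m_plus
                                          - (1.0 - chi) * (max_it - m_plus)))
        # Equation (16): final temperature
        Tf = -beta * mu_plus / math.log(eps)
        # Equation (17): attenuation constant
        alpha = (Tf / T0) ** (1.0 / num_temp)
        return T0, Tf, alpha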
3.2. Simulated annealing for hierarchical clustering
We now turn to the application of simulated annealing to hierarchical clustering as formulated in (5)-(7). Algorithm 3 shows our simulated annealing approach to this problem. As with partitional clustering, we also need to specify the temperature schedule and the perturbation operator. The temperature schedule we use here is identical to that of partitional clustering. However, for hierarchical clustering the perturbation operator must work on dendrograms rather than level partitions as in partitional clustering. Finding a good perturbation operator is in many ways the key to a successful implementation of simulated annealing for hierarchical clustering.
Algorithm 3. Simulated annealing for hierarchical clustering.

Procedure SA-H(δ, MaxIt, T0, α, εMAD)
Let C be the set of all feasible dendrograms,
    t, t' ∈ C be the current and perturbed dendrograms,
    δ : C → C be a randomized perturbation operator,
    J : C → ℜ+ be a hierarchical clustering criterion,
    T ∈ ℜ+ be the "temperature" parameter,
    U[0,1) be a uniform random number generator,
    T0 be the initial temperature,
    MaxIt be the number of Metropolis iterations,
    α ∈ ℜ+, α < 1 be an "attenuation" constant,
    εMAD be a stopping criterion.

T ← T0
REPEAT
    MAD ← 0
    FOR i ← 1 TO MaxIt DO
        t' ← δ(t)
        Δ ← J(t') − J(t)
        IF Δ < 0 OR (e^(−Δ/T) > U[0,1)) THEN
            t ← t'
            MAD ← MAD + |Δ|/MaxIt
        ENDIF
    ENDFOR
    T ← αT
UNTIL MAD < εMAD
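A minimal Python sketch of Algorithm 3 follows. The dendrogram representation, the criterion J and the perturbation operator (for example the grab operator described next) are assumed to be supplied by the caller; the names are illustrative, not the original code.

    import math
    import random

    def sa_hierarchical(t0, J, perturb, max_it, T0, alpha, eps_mad):
        """Sketch of Algorithm 3: simulated annealing over dendrograms.

        t0      -- initial dendrogram (any representation accepted by J and perturb)
        J       -- hierarchical clustering criterion to minimize
        perturb -- randomized dendrogram perturbation operator
        The loop stops when the mean absolute accepted change (MAD) per
        temperature falls below eps_mad.
        """
        t, cost, T = t0, J(t0), T0
        while True:
            mad = 0.0
            for _ in range(max_it):
                t_new = perturb(t)
                delta = J(t_new) - cost
                if delta < 0 or math.exp(-delta / T) > random.random():
                    t, cost = t_new, cost + delta
                    mad += abs(delta) / max_it
            T *= alpha
            if mad < eps_mad:
                return t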
The perturbation operator δ we use, called a grab [11], relocates two randomly selected subtrees p and q within the dendrogram. Provided that neither p nor q is an ancestor of the other, grab(p, q) modifies the dendrogram so that p and q are siblings (i.e., share the same parent). If p or q is an ancestor of the other, then the result would be a cycle in the dendrogram (i.e., how can a node be both an ancestor and a sibling?). Also, if p and q are already siblings, then the call to grab(p, q) is wasted because the dendrogram does not change. Such cases are avoided by the following procedure:

1. Select p at random such that p is neither the dendrogram root t nor one of its two immediate children, l(t) and r(t).
2. Randomly select one of p's ancestors to be the common ancestor c.
3. If l(c) contains p, then randomly select q from among the subtrees in r(c), inclusive. Otherwise, randomly select q from among the subtrees in l(c), inclusive.
4. If q is p's sibling and q is not a dendrogram leaf, then repeat step 3.

The above procedure selects siblings only when q is a dendrogram leaf.

4. EXAMPLES

As noted in Section 1, clustering can have several objectives: retrieval, understanding, or prediction. In order to avoid confounding the evaluation of clustering approaches, it is convenient to conduct comparisons on data with known structure and to use external criteria as the basis for the evaluation. As we will see, the simulated annealing approach also lends itself to providing a fair comparison of different internal clustering criteria (cf. Brown & Huntley, 1992). In this section we provide examples of the usefulness of simulated annealing for clustering and also show the power of simulated annealing when used with different internal clustering criteria. The problem domain for our examples comes from multi-sensor data fusion. Essentially, we have objects in two-dimensional space and sensors that detect these objects with some uncertainty about the actual location. Our goal is to group or cluster the detections and thus predict the number of actual objects in the plane. Figure 1 shows an example of this problem. In this case we have 194 sensor detections represented by the dots. The circles show at their centers the means of the "true" clusters. There are twenty such circles in the figure, corresponding to twenty actual entities present in the plane for this example. Obviously, the true cluster means are unknown to the clustering algorithm, but they provide us with a way to compare the performance of different approaches. To conduct this comparison we need a formal method for evaluation. The next subsection describes the Jaccard and Rand scores as external criteria for evaluating clustering performance. Following that we provide results from applying simulated annealing to both partitional and hierarchical clustering of the data from this example domain.
Figure 1. A sample data set.

4.1. External clustering criteria

The Rand (1971) and Jaccard (see Sokal and Sneath, 1963) scores provide the most accepted and used external criteria for clustering. Both of these criteria require that one know which objects truly cluster together (i.e., one needs to know the true partitioning). Using the notation from partitional clustering in Section 2.1, the criteria measure the similarity between the true partitioning g and the partitioning p returned by a clustering method. Both measures use the following statistics:

s++ = the number of times that gi = gj when pi = pj,
s-- = the number of times that gi ≠ gj when pi ≠ pj,
s+- = the number of times that gi = gj when pi ≠ pj, and
s-+ = the number of times that gi ≠ gj when pi = pj.

The first two statistics count the number of times that p agrees with g, while the last two count the number of disagreements. The Rand criterion calculates the ratio of agreements to the total number of comparisons:

Rand(g, p) = (s++ + s--) / (s++ + s-- + s+- + s-+) = (s++ + s--) / (n(n−1)/2)         (22)

The Jaccard criterion is calculated similarly, except for the omission of the (negative-negative) agreement statistic:

Jaccard(g, p) = s++ / (s++ + s+- + s-+)

Because Jaccard's measure is monotonic with Rand's, improvement in one of the measures implies improvement in the other. Therefore, since the Jaccard criterion is more sensitive than the Rand criterion, we report only the Jaccard score in the next sections.
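The four statistics and both external criteria are straightforward to compute by looping over all object pairs. The following Python sketch (the function name is illustrative) returns the Rand and Jaccard scores for a clustering p against the true partitioning g, both given as sequences of labels.

    from itertools import combinations

    def rand_and_jaccard(g, p):
        """Compute the Rand and Jaccard external criteria for a clustering p
        against the true partitioning g (both are sequences of labels)."""
        s_pp = s_mm = s_pm = s_mp = 0
        for i, j in combinations(range(len(g)), 2):
            same_g = g[i] == g[j]
            same_p = p[i] == p[j]
            if same_g and same_p:
                s_pp += 1          # agreement: together in both
            elif not same_g and not same_p:
                s_mm += 1          # agreement: apart in both
            elif same_g and not same_p:
                s_pm += 1          # disagreement
            else:
                s_mp += 1          # disagreement
        total = s_pp + s_mm + s_pm + s_mp          # = n(n-1)/2
        rand = (s_pp + s_mm) / total
        jaccard = s_pp / (s_pp + s_pm + s_mp)
        return rand, jaccard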
4.2. Results for partitional clustering A major advantage to the formulation of the clustering problem in Section 2.1 is that we can apply the most appropriate internal criterion to the domain of interest. Most existing clustering algorithms come with a fixed internal criterion. The use of simulated annealing with this formulation frees us from this restriction. Thus we have the capability to employ the most appropriate internal clustering criterion for the problem domain. This section illustrates this use of simulated annealing within our optimization formulation for the example multi-sensor fusion data described above. We emphasize that the choice of the internal criterion for use in a partitional clustering problem is critical to the interpretability and usefulness of the resulting output. However, for computational reasons, it is desired that the criterion be simple. One of the simplest of the internal clustering criteria is total within-cluster distance:
W(p) = Σ_{pi = pj} d_ij
This criterion is simpler than squared error (and other, similar criteria) because all distance calculations can be preprocessed and stored in a static matrix. However, since minimizing W(p) with k=n places each report in its own cluster, the user must estimate the true number of clusters, which is non-trivial for many real surveillance situations because the number of entities sensed often varies over time. So, although W(p) is simple enough for straightforward application of most combinatorial optimization algorithms (e.g., integer programming techniques) to small problem instances, applicability is limited to cases in which one can accurately estimate the true number of clusters. Barker (1989) eliminated the need to accurately estimate the number of clusters by incorporating a distance threshold v into W(p):
B(p) = Σ_{pi = pj} (d_ij − v)
Barker's new formulation, which is computationally identical to W(p) once all of the (d_ij − v) terms have been preprocessed, is analogous to the constraint satisfaction problem solved by ISODATA. The criterion penalizes large clusters by adding in more d_ij values. It penalizes small clusters by subtracting fewer v values. Hence, Barker's formulation captures the spirit of the two ISODATA constraints with penalties for large and small clusters. There is a practical advantage to Barker's formulation, however, because it makes the tradeoff between large and small clusters explicit. Barker's criterion and total within-cluster distance are competing criteria for use in clustering the multi-sensor data and for clustering problems in general. Simulated annealing provides us with a way to objectively compare these two criteria. Additionally, one of the most popular partitional clustering algorithms, K-means, employs W(p) in a stepwise greedy procedure. Hence, any solution obtained with W(p) using simulated annealing, which provides a global perspective on the optimization problem, should dominate K-means. To compare the two criteria we examined 32 data sets, where each data set has between 150 and 600 objects, split into 20 true clusters on average. We applied simulated annealing partitional clustering 5 times to each set with both B(p) and W(p). We then compared the best test results for W(p) and B(p) over the 32 data sets. Table 1 shows for each data set the best Jaccard score for each of the criteria. In addition, for W(p) it shows the best "number of clusters" parameter k in {18, 19, 20, 21, 22}. Similarly, for B(p) the table shows the best "median distance" parameter v in the set {2.0, 2.5, 3.0, 3.5, 4.0} and the associated number of clusters in the final partitioning. At the bottom of the table are the minimum, maximum, and mean of each column. Tables 2 and 3 may be of use for practitioners, who do not have the benefit of knowing the best parameter values for a given data set. For each of the parameter values tested, the tables show the average performance of the criteria over the 32 data sets. This allows the user to select the "best" default parameter value in a given range. From Table 2, it appears that a good initial estimate for the number of clusters k in a future data set is 18, for which W(p)'s average Jaccard score is approximately 72%. Since the best Jaccard scores were achieved with k = 18 (whereas the mean number of true clusters is 20), it is likely that Jaccard-optimal W(p) clusterings underestimate k by at least 2 clusters. From Table 3, good estimates for v are in the range [3.0, 3.5], for which B(p)'s average Jaccard score is approximately 95%. Note also that B(p)'s worst performance over the range [2.0, 5.0] is 86%. Hence, even if the best v for a particular data set is not in the range [3.0, 3.5], B(p) still seems to outperform W(p). These results show the power of the simulated annealing approach to finding a good clustering by appropriate choice of internal criterion.
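For reference, both internal criteria reduce to short functions over a precomputed distance matrix d, with same-cluster pairs counted once (i < j); that convention, and the names below, are assumptions of this sketch.

    def total_within_cluster_distance(p, d):
        """W(p): sum of distances d[i][j] over pairs of objects in the same cluster."""
        n = len(p)
        return sum(d[i][j] for i in range(n) for j in range(i + 1, n) if p[i] == p[j])

    def barker_criterion(p, d, v):
        """B(p): Barker's thresholded variant, summing (d[i][j] - v) over
        same-cluster pairs; the threshold v trades off large against small clusters."""
        n = len(p)
        return sum(d[i][j] - v for i in range(n) for j in range(i + 1, n) if p[i] == p[j])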
Table 1
The best results for each of the 32 data sets.

(For each data set, Table 1 lists the best Jaccard score and the best parameter k for W(p), and the best Jaccard score, the best parameter v and the resulting number of clusters for B(p). The column summary is as follows.)

                   W(p)                              B(p)
          Best Jaccard   Parameter k      Best Jaccard   Parameter v   Resulting k
Min          0.589          18.0             0.807           2.0          18.0
Avg          0.747          18.9             0.963           3.2          21.3
Max          0.978          21.0             1.000           4.0          28.0
Table 2
Average Jaccard score by parameter k for W(p).

Parameter k      18       19       20       21       22
Avg. score     0.726    0.705    0.684    0.655    0.620
Table 3
Average Jaccard score by parameter v for B(p).

Parameter v     1.0      1.5      2.0      2.5      3.0      3.5      4.0      4.5      5.0
Avg. score    0.636    0.771    0.879    0.937    0.948    0.944    0.916    0.893    0.867
4.3. Results for hierarchical clustering

We applied three hierarchical clustering methods to the same 32 data sets used in the previous section for partitional clustering: (1) simulated annealing with the criterion in equation (11); (2) simulated annealing with the criterion in equation (12); and (3) Ward's algorithm. Each method generated a separate dendrogram for each data set. The ability of each method to find partitions with small total squared-error varied. For each of the level partitions from K = 2 to K = 26 in the dendrograms, the total squared-error was calculated. The sum over K of the average of these values for each method is:

SA with (11): 4674.8
SA with (12): 4531.2
Ward's:       4595.1

These results show clearly the importance of the optimization criterion to clustering. The computationally simple Ward's method performs better than the simulated annealing approach with a simplistic criterion. However, a criterion that more correctly accounts for the hierarchy, by minimizing the sum of squared error at each level, performs much better. As with partitional clustering, the application of simulated annealing to hierarchical clustering requires careful selection of the internal clustering criterion. Although hierarchical methods are designed to produce dendrograms to organize the data, they are also frequently used to define partitions of the data. Hence, it is of interest to compare the partitions found by these methods over the range of values which we know are actually present in these data sets. To do this we again employ the Jaccard score. The results are in Table 4. As with partitional clustering, a method with a high Jaccard score is said to have effectively "recovered" the known clusters.
These results again demonstrate the generality of the simulated annealing algorithm with respect to the choice of internal criterion and confirm the importance of the selection of this criterion. In this case, the simplistic measure in equation (11) is dominated by the other two approaches. However, the results also show the power of the simulated annealing approach. Ward's method is stepwise and is designed to provide low within-cluster sum of squared error for higher values of K. In contrast, the simulated annealing approach employed here attempts to optimize the sum of squared error over the entire dendrogram. Hence, we would expect Ward's method to do better than simulated annealing over these values of K. In fact, simulated annealing with equation (12) does better than Ward's method for 7 of the 10 values in this range and does particularly well at the high end.

Table 4
The Jaccard scores versus the number of clusters for each method.

(For each number of clusters K from 15 to 25, Table 4 lists the Jaccard scores obtained with SA using equation (12), SA using equation (11), and Ward's method.)
5. SUMMARY AND CONCLUSIONS Clustering is an important technique for many problems in science and engineering. Depending on the exact nature of the problem we may employ one of two primary approaches to perform the clustering: partitional and hierarchical. Traditional algorithms for both approaches have sought quick, locally optimal solutions. The advent of useful global optimization techniques, such as simulated annealing, has made it possible to consider obtaining better clusterings. However to apply simulated annealing to clustering we need a problem formulation that admits the use of an optimization technique. This chapter has provided the needed formulations for both partitional and hierarchical clustering problems. For the partitional problem we viewed clustering as the assignment of a data point to a label. Our problem then becomes one of minimizing some measure of performance for this assignment. Convenient measures of performance are typically variants of within cluster distance or squared error. With this formulation we provided a general simulated annealing algorithm for obtaining near-optimal partitional clusterings. We also showed a useful perturbation operator which met the requirements of simulated annealing for a local neighborhood structure. We also described a convenient temperature schedule.
For hierarchical clustering we also provided a combinatorial optimization formulation, but in this case our search is over a set of dendrograms. The measure of performance for hierarchical clustering is more delicate, requiring careful consideration of how we value the partitioning at each level. Similarly, the perturbation operator requires cutting and splicing dendrograms rather than interchanging labels as we did for partitional clustering. We provided an algorithm to accomplish this perturbation and the higher-level simulated annealing algorithm for hierarchical clustering. Lastly, we demonstrated the use of simulated annealing on examples from multi-sensor data fusion. These examples showed the effectiveness of simulated annealing in performing both hierarchical and partitional clustering. They also showed the importance of the internal criterion to the results obtained. Our results also demonstrated how simulated annealing can help choose the most appropriate internal criterion. In both the partitional and hierarchical cases clustering performance was dramatically affected by this choice. In the partitional case our testing results showed that Barker's criterion outperformed within-cluster distance. In fact, the worst Jaccard score for Barker's criterion was better than the average Jaccard score for within-cluster distance. For the hierarchical case the performance differences for internal criteria were equally dramatic. The simple version of within-cluster distance was dominated both by a simulated annealing algorithm with a criterion that fairly weighted the levels of the dendrogram and by a traditional hierarchical method proposed by Ward. We also found that the simulated annealing approach with the criterion in equation (12) outperformed Ward's method across the most interesting region of the dendrogram. This work provides clear evidence that simulated annealing is useful for both partitional and hierarchical clustering. Additional work can extend the use of simulated annealing to study internal clustering criteria. We anticipate a range of choices, similar to the single link, complete link, and average link choices now available for greedy heuristics in hierarchical clustering.

REFERENCES

Aarts, E. and Korst, J. (1989). Simulated Annealing and Boltzmann Machines, Wiley, New York.
Anderberg, M. (1973). Cluster Analysis for Applications, Academic Press, New York, NY.
Ball, G. and Hall, D. (1965). ISODATA, a Novel Method of Data Analysis and Classification, Research Report AD-699616, Stanford Research Institute, Stanford, CA.
Barker, A. (1989). Neural Networks for Data Fusion, Masters Thesis, University of Virginia, Charlottesville, VA.
Brown, D. and Huntley, C. (1992). A Practical Application of Simulated Annealing to Clustering, Pattern Recognition 25, 401-412.
Hartigan, J. (1975). Clustering Algorithms, Wiley, New York, NY.
Jain, A. and Dubes, R. (1988). Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ.
Johnson, S. (1967). Hierarchical Clustering Schemes, Psychometrika 32, 241-254.
Kirkpatrick, S., Gelatt, C. and Vecchi, M. (1983). Optimization by Simulated Annealing, Science 220, 671-680.
Klein, R. and Dubes, R. (1989). Experiments in Projection and Clustering by Simulated Annealing, Pattern Recognition 22, 213-220.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A. and Teller, E. (1953). Equations of State Calculations by Fast Computing Machines, Journal of Chemical Physics 21, 1087-1092.
Rand, W. (1971). Objective Criteria for the Evaluation of Clustering Algorithms, Journal of the American Statistical Association 66, 846-850.
Sokal, R. and Sneath, P. (1963). Principles of Numerical Taxonomy, W. H. Freeman, San Francisco.
Wallace, R. and Kanade, T. (1990). Finding Natural Clusters Having Minimal Description Length, In Proceedings of the 1990 IEEE Conference on Pattern Recognition, 438-442, Atlantic City, NJ.
White, S. (1984). Concepts of Scale in Simulated Annealing, In Proceedings of the IEEE International Conference on Computer Design, 646-651, Port Chester.
Adaption of Simulated Annealing to Chemical Optimization Problems, Ed. by J.H. Kalivas © 1995 Elsevier Science B.V. All rights reserved.
Chapter 7
Classification of materials
Ruqin Yu, Lixian Sun and Yizeng Liang
Department of Chemistry and Chemical Engineering, Hunan University, Changsha, 410082, People's Republic of China
1. INTRODUCTION
Cluster analysis as an unsupervised pattern recognition method is an important tool in exploratory analysis of chemical data. It has found wide application in many fields such as disease diagnosis, food analysis, drug analysis, classification of materials, etc. Hierarchical and optimization-partition algorithms are the most widely used methods of cluster analysis [1]. One of the major difficulties for these conventional clustering algorithms is to guarantee a globally optimal solution to the corresponding problem. Simulated annealing as a stochastic optimization algorithm [2] could provide a promising way to circumvent such difficulties. Recently, generalized simulated annealing has been introduced into chemometrics for wavelength selection [3] and calibration sample selection [4]. The use of cluster analysis methods based on simulated annealing for chemometric research is of considerable interest. Three modified clustering algorithms based on simulated annealing, the K-means algorithm and principal component analysis (PCA) are proposed and applied to chemometric research. A modified stopping criterion and perturbation method are also proposed. These algorithms are all tested by using simulated data generated on a computer and then applied to the classification of materials such as Chinese tea, bezoar (the traditional Chinese medicine calculus bovis), beer samples and biological samples. The results compare favourably with those obtained by conventional clustering methods.
2. CLUSTER ANALYSIS BY SIMULATED ANNEALING

2.1. Principle of cluster analysis by simulated annealing

Simulated annealing (SA), which derives its name from the statistical mechanics of simulating atomic equilibrium at a fixed temperature, belongs to a category of stochastic optimization algorithms. According to statistical mechanics, at a given temperature T and under thermal equilibrium, the probability fi of a given configuration i obeys the Boltzmann-Gibbs distribution:

fi = k exp(−Ei / T)                                                                   (1)

where k is a normalization constant and Ei is the energy of the configuration i [5, 6]. SA was proposed by Kirkpatrick et al. [7] as a method for solving combinatorial optimization problems which minimizes or maximizes a function of many variables. The idea was derived from an algorithm proposed by Metropolis et al. [8], who simulated the process of atoms reaching thermal equilibrium at a given temperature T. The current configuration of the atoms is perturbed randomly and then a trial configuration is obtained according to the method of Metropolis et al. [8]. Let Ec and Et denote the energy of the current and trial configuration, respectively. If Et < Ec, which means that a lower energy has been reached, the trial configuration is accepted as the current configuration. If Et ≥ Ec, then the trial configuration is accepted with a probability which is directly proportional to exp(−(Et − Ec)/T). The perturbation process is repeated until the atoms reach thermal equilibrium, i.e. the configuration determined by the Boltzmann distribution at the given temperature. New lower-energy states of the atoms will be obtained as T is decreased and the Metropolis simulation process is repeated. When T approaches zero, the atomic lowest-energy state or the ground state is obtained. Cluster analysis can be treated as a combinatorial optimization problem. Selim and Alsultan [5] as well as Brown and Huntley [9] described the analogy between SA and cluster analysis. The atomic configuration of SA corresponds to the assignment of patterns or samples to a cluster in cluster analysis. The energy E of the atom configuration and temperature T in the SA
process correspond to the objective function Φ and the control parameter T in cluster analysis, respectively. Suppose n samples or patterns in d-dimensional space are to be partitioned into k clusters or groups. Different clustering criteria could be adopted. The sum of the squared Euclidian distances from all samples to their corresponding cluster centers is used as the criterion here.

Let A = [a_ij] (i = 1, 2, ..., n; j = 1, 2, ..., d) be an n × d sample data matrix, and W = [w_ig] (i = 1, 2, ..., n; g = 1, 2, ..., k) be an n × k cluster membership matrix, where w_ig = 1 if sample i is assigned to cluster g and w_ig = 0 otherwise, with Σ_{g=1}^{k} w_ig = 1. Let Z = [z_gj] (g = 1, 2, ..., k; j = 1, 2, ..., d) be a k × d matrix of cluster centers, where

z_gj = Σ_{i=1}^{n} w_ig a_ij / Σ_{i=1}^{n} w_ig                                       (2)

The sum of the squared Euclidian distances is used as the objective function to be minimized:

Φ(w_11, ..., w_1k; ...; w_n1, ..., w_nk) = Σ_{i=1}^{n} Σ_{g=1}^{k} Σ_{j=1}^{d} w_ig (a_ij − z_gj)²      (3)
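Equations (2) and (3) translate directly into a few lines of code. The following Python sketch (using NumPy; the function names are illustrative) assumes the 0/1 membership matrix W leaves no cluster empty.

    import numpy as np

    def cluster_centres(A, W):
        """Equation (2): k x d matrix of cluster centres from the n x d data
        matrix A and the n x k 0/1 membership matrix W (no empty clusters)."""
        counts = W.sum(axis=0)                 # number of samples per cluster
        return (W.T @ A) / counts[:, None]

    def objective_phi(A, W):
        """Equation (3): sum of squared Euclidian distances from each sample to
        the centre of the cluster it is assigned to."""
        Z = cluster_centres(A, W)
        diff = A[:, None, :] - Z[None, :, :]   # n x k x d differences
        return float((W * (diff ** 2).sum(axis=2)).sum())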
The clustering is carried out in the following steps:

Step 1. Set initial values of the parameters. Let T1 be the initial temperature, T2 the final temperature, μ the temperature multiplier, N the desired number of Metropolis iterations, IGM the counting number of a Metropolis iteration and i the counting number of a sample in the sample set. Take T1 = 10, T2 = 10⁻⁹⁹, μ = 0.7-0.9, N = 4n, IGM = 0 and i = 0.

Step 2. Assign an initial class label among the k classes to all n samples randomly and then calculate the corresponding value of the objective function Φ. Let both the optimal objective function value Φb and the current objective function value Φc be Φ, and let the corresponding cluster membership matrix of all samples be Wb. Tb is the temperature corresponding to the optimal objective function Φb; Tc and Wc are the temperature and cluster membership matrix, respectively, corresponding to the current objective function Φc. Let Tc = T1, Tb = T1, Wc = Wb.

Step 3. While the counting number of Metropolis sampling steps is less than N, i.e. IGM < N, go to step 4; otherwise, go to step 7.

Step 4. Let flag = 1 and let p be the threshold probability: if IGM ≤ N/2, p = 0.80; otherwise, p = 0.95. A trial assignment matrix Wt is obtained from the current assignment Wc in the following way. If i > n, let i = i − n; otherwise, let i = i + 1. Take sample i from the sample set; the class assignment (w_ig) of this sample is expressed by f (where f belongs to one of the k classes), i.e. f = w_ig. Then draw a random number u (u = rand, where rand is a random number uniformly distributed in the interval [0,1]); if u > p, generate a random integer r in the range [1, k] with r ≠ f, put sample i from class f into class r, and let w_ig = r and flag = 2. Otherwise, take another sample and repeat the above process until flag = 2.

Step 5. Let the trial assignment obtained after the above perturbation be Wt. Calculate the objective function value Φt of this assignment. If Φt ≤ Φc, let Wc = Wt, Φc = Φt. If Φt < Φb, then Φb = Φt, Wb = Wt, IGM = 0.

Step 6. Produce a random number v, here v = rand; if v < exp(−(Φt − Φc)/Tc), then Wc = Wt, Φc = Φt. Otherwise, IGM = IGM + 1. Go to step 3.

Step 7. Let Tc = μTc, IGM = 0, Φc = Φb, Wc = Wb. If Tc < T2 or Tc/Tb < 10⁻¹⁰, then stop; otherwise, go back to step 3.

A flow chart of the program is shown in Figure 1.
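A condensed Python sketch of Steps 1-7 is given below. It simplifies the bookkeeping of Step 4 (each trial simply moves one randomly chosen sample to a different class, and the counter IGM is advanced on every trial rather than only on rejection); the function name sac and its arguments are illustrative, not the original program.

    import math
    import random
    import numpy as np

    def sac(A, k, initial=None, T1=10.0, T2=1e-99, mu=0.8, ratio_stop=1e-10):
        """Condensed sketch of Steps 1-7.  A is an n x d data array; cluster
        labels are integers 0..k-1."""
        n = A.shape[0]
        N = 4 * n                                  # Metropolis iterations per temperature

        def phi(labels):                           # equation (3) for a label vector
            return sum(((A[labels == g] - A[labels == g].mean(axis=0)) ** 2).sum()
                       for g in range(k) if np.any(labels == g))

        current = (np.random.randint(0, k, size=n) if initial is None
                   else np.asarray(initial).copy())            # Step 2
        best, phi_c = current.copy(), phi(current)
        phi_b, Tc, Tb = phi_c, T1, T1
        while True:
            igm = 0
            while igm < N:                         # Steps 3-6
                trial = current.copy()
                i = random.randrange(n)            # Step 4 (simplified perturbation)
                trial[i] = random.choice([g for g in range(k) if g != trial[i]])
                phi_t = phi(trial)
                if phi_t <= phi_c:                 # Step 5
                    current, phi_c = trial, phi_t
                    if phi_t < phi_b:
                        best, phi_b, Tb = trial.copy(), phi_t, Tc
                elif random.random() < math.exp(-(phi_t - phi_c) / Tc):   # Step 6
                    current, phi_c = trial, phi_t
                igm += 1
            Tc *= mu                               # Step 7
            current, phi_c = best.copy(), phi_b
            if Tc < T2 or Tc / Tb < ratio_stop:
                return best, phi_b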
2.2. Treatment of simulated data

The algorithm of cluster analysis based on SA was tested by using simulated data generated
Figure 1. Flow chart of a cluster analysis by simulated annealing.
on the computer and composed of 30 samples containing 2 variables (x, y) each. These samples were supposed to be divided into 3 classes. The data were processed by using cluster analysis based on SA, hierarchical cluster analysis [10] and the K-means algorithm [11-12]. The optimal objective function (Φb) values obtained are shown in Table 1. Comparing the results in column 4 with those of column 7 in Table 1, one notices that for SA there is only one disagreement of the class assignment, for sample No. 6, which is actually misclassified by all three methods. This shows that the SA method is preferable to hierarchical cluster analysis and the K-means algorithm. The objective function values of the K-means algorithm with 50 iterative cycles are listed in Table 2; the lowest value is 168.7852. One notices that the behavior of the K-means algorithm is influenced by the choice of initial cluster centers, the order in which the samples are taken, and, of course, the geometrical properties of the data. The tendency of sinking into local optima is obvious. Clustering by SA can provide more stable computational results.

2.3. Classification of tea samples

Liu et al. [13] studied the classification of Chinese tea samples by using hierarchical cluster analysis and principal component analysis. In their study, three categories of tea of Chinese origin were used: green, black and oolong. Each category contains two varieties: Chunmee (C) and Hyson (H) for green tea, Keemun (K) and Feng Quing (F) for black tea, Tikuanyin (T) and Se Zhong (S) for oolong tea. Each sample in these groups was assigned a number according to its quality tested by tea experts on the basis of taste, and the best quality was numbered as 1. One notices that the assignment of quality by the tea experts is valid only for samples belonging to the same category and variety. The names of the samples are composed of the first letter of the variety name, followed by the number indicating the quality. The data, which involve the concentrations (% w/w, dry weight) of cellulose, hemicellulose, lignin, polyphenols, caffeine and amino acids for the various tea samples, were processed by using cluster analysis based on SA and the K-means algorithm. The results obtained by using the two methods were compared with those given by Liu et al. [13] using hierarchical cluster analysis. The results are summarized in Table 3, where the numbers 1, 2, 3 refer to the groups the tea samples
Table 1
Comparison of results obtained by using different methods in the classification of simulated data

           Simulated data        Actual class of            Classification results
No.         x          y         simulated data     Hierarchical   K-means*   Simulated
                                                     clustering                annealing
 1        1.0000     0                 1                 1             1           1
 2        3.0000     3.0000            1                 2             1           1
 3        5.0000     2.0000            1                 2             1           1
 4        0.0535     4.0000            1                 1             1           1
 5        0.3834     0.4175            1                 1             1           1
 6        5.8462     3.0920            1                 2             2           2
 7        4.9103     4.2625            1                 2             2           1
 8        0          2.6326            1                 1             1           1
 9        4.0000     5.0000            1                 2             1           1
10        2.0000     1.0000            1                 1             1           1
11       11.000      5.6515            2                 2             2           2
12        6.0000     5.2727            2                 2             2           2
13        8.0000     3.2378            2                 2             2           2
14        7.0000     2.4865            2                 2             2           2
15       10.000      6.0000            2                 2             2           2
16        9.5163     3.9866            2                 2             2           2
17        6.9478     1.5007            2                 2             2           2
18       11.5297     3.9410            2                 2             2           2
19       10.8278     2.0159            2                 2             2           2
20        9.0000     4.7362            2                 2             2           2
21        7.2332    12.000             3                 3             3           3
22        6.5911     9.0000            3                 3             3           3
23        7.2693     9.5373            3                 3             3           3
24        4.1537     8.0000            3                 3             3           3
25        3.5344    11.000             3                 3             3           3
26        7.5546    11.6248            3                 3             3           3
27        4.7147     8.0910            3                 3             3           3
28        5.0269     7.0000            3                 3             3           3
29        4.1809     9.8870            3                 3             3           3
30        6.3858     9.4997            3                 3             3           3

Objective function
value, Φb                           169.3275          189.3531     168.7852     166.9288

* The result with the lowest value of Φb among 50 independent iteration cycles.
Table 2
Φb values obtained by using the K-means algorithm with 50 different random initial clusterings
307.3102 177.2062 168.7852 169.3275 181.4713 181.4713 347.3008 169.3275 306.3492 173.0778 173.0778 178.9398 347.3008 169.3275 173.0778 177.2062 169.3275 173.0778 171.6456 177.2062 181.4713 178.9398 176.4388 168.7852 373.6821 173.0778 202.4826 168.7852 173.0778 171.6456 181.4713 173.0778 306.3492 169.3275 177.2062 202.4826 373.6821 173.0778 202.4826 178.9398 347.3008 169.3275 181.4713 176.4388 202.4826 173.0778 178.9398 177.2062 347.3008 177.2062
are classified into (tea samples denoted by the same number are classified into the same group). The objective Φb for the hierarchical clustering obtained by Liu et al. [13] was calculated according to equation 3. As shown in Table 3, different results could be obtained by using different methods with the same criterion. The objective function Φb obtained by using cluster analysis based on SA was the lowest among all methods listed in Table 3. It seems that cluster analysis based on SA can really provide a globally optimal result. Column 4 of Table 3 shows the classification results of the K-means algorithm with the lowest Φb among 50 independent iterative cycles. Comparing the results in column 5 with those of column 2 in Table 3, one notices that there is only one case of disagreement of the class assignment, for the tea sample K2. K2 was assigned to class 1 and class 2 by hierarchical clustering and SA, respectively. As mentioned above, the assignment of quality by the tea experts is valid only for samples belonging to the same category and variety. According to hierarchical clustering, sample K2 is classified as 1,
Table 3
Comparison of results obtained by using different methods in the classification of Chinese tea samples

                               Classification results
Sample    Hierarchical    K-means^a    K-means^b    Simulated
          clustering                                annealing
C1             1              1            1             1
C2             1              1            1             1
C3             1              1            1             1
C4             1              2            1             1
C5             2              2            1             2
C6             2              2            1             2
C7             2              2            1             2
H1             1              1            1             1
H2             1              1            1             1
H3             1              2            1             1
H4             2              2            1             2
H5             2              2            1             2
K1             1              2            1             1
K2             1              2            1             2
K3             2              2            1             2
K4             2              2            1             2
F1             1              1            1             1
F2             1              1            1             1
F3             1              2            1             1
F4             1              2            1             1
F5             2              2            1             2
F6             2              2            1             2
F7             2              2            1             2
T1             3              3            2             3
T2             3              3            2             3
T3             3              3            3             3
T4             3              3            3             3
S1             3              3            2             3
S2             3              3            2             3
S3             3              3            3             3
S4             3              3            3             3

Objective function
value, Φb   50.8600        140.6786     119.2311      50.6999

a. The result obtained in one iteration cycle with an arbitrary initial clustering.
b. The result with the lowest value of Φb among 50 independent iteration cycles.
meaning K2 is at least as good as C4, H3, etc. SA gave a classification of class 2 for K2, qualifying it as the same quality as C5, H4, etc. As the value of Φb for SA is slightly lower, the results of the proposed algorithm seem to be more appropriate in describing the real situation. The clustering results obtained by the K-means algorithm seem not sufficiently reliable.
2.4. Some computational aspects of the simulated annealing algorithm

Selim and Alsultan [5] pointed out that no stopping point was computationally available for cluster analysis based on SA. Searching for stopping criteria for use in SA cluster analysis deserves further investigation. It is rather time-consuming to continue the calculation until Tc < T2, as theoretically T2 itself should approach zero (T2 = 10⁻⁹⁹ is taken here). In general, the exact value of T2 is unknown in practical situations. The present authors propose a stopping criterion based on the ratio of the current temperature Tc to the temperature Tb which corresponds to the optimal objective function Φb. When Tc/Tb < 10⁻¹⁰, one stops the computation (step 7, vide supra). For example, during the data treatment, when Tc = 3.8340 × 10⁻⁵⁴ and Tb = 9.6598 × 10⁻⁴⁴, one stops the computation. This is a convenient criterion; it saves computing time substantially compared to the traditional approach using an extremely small T2. The methods used to carry out the perturbation of the trial states of sample class assignment in cluster analysis, as well as the introduction of perturbation in the SA process, also deserve consideration. The present authors propose a perturbation method based on changing the class assignment of only one sample at a time (Figure 1). Such a method seems to be more effective, as usually only the class assignment of one sample is wrong, and only the class assignment of this sample has to be changed. Brown and Huntley [9] took the sample to be perturbed on a random basis; the corresponding perturbation operation is described as follows:
do perturbation operation to obtain Wt
    flag = 1;
    if IGM ≤ N/2; p = 0.8; else; p = 0.95; end
    while flag ≠ 2
        i = rand(1, n);  where i is a random integer which lies in the interval [1, n]
        f = w_ig; u = rand;
        if u > p; w_ig = r; flag = 2; end
    end

It seems that in such a method equal opportunity in perturbation for each sample might not really be guaranteed. Every sample has an equal opportunity of perturbation in step 4 (vide supra), which takes less computation time while obtaining comparable results (Table 4). On the other hand, Selim and Alsultan [5] proposed the perturbation method below, in which the class assignments of several samples might be changed simultaneously in each perturbation:

do perturbation operation to obtain Wt
    flag = 1;
    if IGM ≤ N/2; p = 0.8; else; p = 0.95; end
    i = 0;
    while flag ≠ 2 or i ≤ n
        if i > n; i = i - n; else; i = i + 1; end
        f = w_ig; u = rand;
        if u > p; w_ig = r; flag = 2; end
    end

The comparison of the different perturbation operations is shown in Table 4. One notices that the method proposed by Selim and Alsultan [5] takes the longest time, i.e. this method converges to the globally optimal solution rather slowly, while the present method converges quite quickly. Cluster analysis based on SA is a very useful clustering algorithm, although it has some insufficiencies. As mentioned above, the modified algorithm is more effective than the K-means algorithm and is also preferable to hierarchical cluster analysis. A globally optimal solution may be obtained by using the algorithm. A feasible stopping criterion and perturbation method are important aspects of the computational algorithm. The present authors use minimization of the
Table 4
Comparison of results obtained by using different perturbation methods in the classification of simulated data

                                  Classification results
Actual class of        Selim's method    Brown's method    Present method
simulated data
      1                      1                  1                 1
      1                      1                  1                 1
      1                      2                  2                 1
      1                      1                  1                 1
      1                      1                  1                 1
      1                      2                  2                 2
      1                      2                  1                 1
      1                      1                  1                 1
      1                      1                  1                 1
      1                      1                  1                 1
      2                      2                  2                 2
      2                      1                  2                 2
      2                      2                  2                 2
      2                      1                  2                 2
      2                      2                  2                 2
      2                      2                  2                 2
      2                      2                  2                 2
      2                      2                  2                 2
      2                      2                  2                 2
      2                      2                  2                 2
      3                      3                  3                 3
      3                      3                  3                 3
      3                      3                  3                 3
      3                      2                  3                 3
      3                      3                  3                 3
      3                      3                  3                 3
      3                      3                  3                 3
      3                      2                  3                 3
      3                      3                  3                 3
      3                      3                  3                 3

Objective function (Φb)   209.4124          173.7650          166.9288
Computation time (hrs)     23.5069           11.7061            1.1627
sum of the squared Euclidian distances as the clustering criterion. As Euclidian distances are suitable only for spherically distributed data sets, the search for other clustering criteria suitable for different kinds of data sets for use in SA cluster analysis deserves further investigation.
3. CLUSTER ANALYSIS BY K-MEANS ALGORITHM AND SIMULATED ANNEALING
3.1. Introduction

From the above research results, we notice that cluster analysis based on SA is a very useful algorithm, but it is still time-consuming, especially for large data sets. Although we have made some improvements on the conventional algorithm and proposed a modified cluster analysis based on simulated annealing (SAC), further modification of this sort of algorithm is of considerable interest. On the basis of the above work, a modified clustering algorithm based on a combination of simulated annealing with the K-means algorithm (SAKMC) can be used. In this procedure the initial class labels among the k classes of all n samples are obtained by using the K-means algorithm instead of random assignment. A flow chart of the SAKMC program is shown in Figure 2. The algorithm is first tested on two simulated data sets, and then used for the classification of calculus bovis samples and Chinese tea samples. The results show that the algorithm, which is guaranteed to obtain a global optimum with a shorter computation time, compares favorably with the original SAC and K-means algorithms.
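The only difference between SAKMC and SAC is the starting assignment, so a code sketch is short. The kmeans_labels helper below is an ordinary K-means pass (cf. the Appendix), and sac refers to the SAC sketch given after Step 7 in Section 2.1; all names are illustrative.

    import numpy as np

    def kmeans_labels(A, k, n_iter=100, rng=None):
        """Plain K-means used only to supply a starting assignment for SA."""
        rng = np.random.default_rng() if rng is None else rng
        z = A[rng.choice(len(A), size=k, replace=False)]
        labels = np.zeros(len(A), dtype=int)
        for _ in range(n_iter):
            labels = ((A[:, None, :] - z[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
            z_new = np.array([A[labels == g].mean(axis=0) if np.any(labels == g) else z[g]
                              for g in range(k)])
            if np.allclose(z_new, z):
                break
            z = z_new
        return labels

    def sakmc(A, k, **sa_kwargs):
        # sac is the SAC routine sketched earlier; here the K-means labels
        # simply replace the random initial assignment of Step 2.
        return sac(A, k, initial=kmeans_labels(A, k), **sa_kwargs)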
3.2. Treatment of simulated data

The simulated data sets were composed of 30 samples (data set I) and 60 samples (data set II) containing 2 variables (x, y) each (see Figure 3 and Figure 4, respectively). These samples were supposed to be divided into 3 classes. The data were processed by using cluster analysis based on simulated annealing (SAC) and cluster analysis by the K-means algorithm and simulated annealing (SAKMC), respectively. As shown in Table 5, the computation times needed to obtain the corresponding optimal objective function value (Φb) for clustering based on the K-means algorithm, simulated annealing (SAC) and SAKMC are about 3 min., 70 min. and 55 min. for data set I, and 5 min., 660 min. and 183 min. for data set II, respectively. From the above results, one notices that the larger the data set, the longer the computation time.
Figure 2. Flow chart of a cluster analysis by K-means algorithm and simulated annealing.
Table 5
The comparison of results of different approaches

                             Simulated data set I               Simulated data set II
                         K-means      SAC      SAKMC       K-means       SAC       SAKMC
Objective function
value, Φb                63.7707    60.3346   60.3346     114.6960    114.5048   114.5048
Computation time (min.)     3          70        55            5         660        183
Figure 3. A plot of simulated data set I.
Figure 4. A plot of simulated data set II.

3.3. Classification of calculus bovis samples

Calculus bovis or bezoar is a widely used traditional Chinese medicine suitable for the treatment of fever and sore throat. The microelement contents in natural calculus bovis and cultivated calculus bovis samples were determined by using a Jarrell-Ash 96-750 ICP instrument [14]. The data after normalization and the results obtained by using different methods are listed in Table 6. The K-means algorithm takes the shortest time to obtain the final result (Φb = 96.7396), which is really a locally optimal solution. Cultivated calculus bovis samples No. 4 and No. 7 were misclassified as natural ones by the K-means algorithm. Both SAC and SAKMC can reach a globally optimal solution (Φb = 94.3589); only sample No. 4, belonging to cultivated calculus bovis, was classified as a natural one corresponding to Φb = 94.3589. If sample No. 4 is classified as a cultivated one, the corresponding objective function Φb would be 95.2626. This indicates that sample No. 4 is closer to natural calculus bovis. From the above results, one notices that calculus bovis samples can be correctly classified into natural and cultivated ones on the basis of their microelement contents by means of SAC and SAKMC, except for sample No. 4. The computation times for SAC and SAKMC were 21 and 12 minutes, respectively.
Table 6
The normalized data of microelement contents in natural and cultivated calculus bovis samples

(Table 6 lists, for samples No. 1-13, the normalized contents of Cr, Cu, Mn, Ti, Zn, Pb, Mo, Ca, K and Na.)

* No. 1-3 are cultivated calculus bovis and No. 4-13 are natural calculus bovis samples.
3.4. Classification of tea samples

The data, which involve the concentrations (% w/w, dry weight) of cellulose, hemicellulose, lignin, polyphenols, caffeine and amino acids for the various tea samples, were processed by using SAC, SAKMC and the K-means algorithm. The results obtained by using the three methods were compared with those given by Liu et al. [13] using hierarchical cluster analysis. The results obtained are summarized as follows:

Hierarchical clustering: Class 1: C1-4, H1-3, K1-2, F1-4; Class 2: C5-7, H4-5, K3-4, F5-7; Class 3: T1-4, S1-4. Objective function value Φb: 50.8600 (the objective function Φb was calculated according to equation 3). Computation time: 10 min.

K-means: Class 1: C1-7, H1-5, K1-4, F1-7; Class 2: T1-2, S1-2; Class 3: T3-4, S3-4. Objective function value Φb: 119.2311. Computation time: 6 min.

SAC and SAKMC: Class 1: C1-4, H1-3, K1, F1-4; Class 2: C5-7, H4-5, K2-4, F5-7; Class 3: T1-4, S1-4. Objective function value Φb: 50.6999. Computation time: 107 min. (SAC); 68 min. (SAKMC).

One notices that there is only one case of disagreement of the class assignment, for the tea sample K2. K2 was classified into class 1 by hierarchical clustering and K-means, and it was classified into class 2 by SAC and SAKMC. Both SAC and SAKMC give a relatively low objective function value. The K-means algorithm is inferior, as shown by the objective function: it puts all the green and black teas into the same group and separates the oolong teas into a high and a low quality group.

3.5. Comparison of methods with and without simulated annealing

Comparing the results obtained for the simulated data, the calculus bovis samples and the tea samples, one notices that different results can be obtained by using different methods with the same criterion. The K-means algorithm in these cases converges to locally optimal solutions in the shortest time; its behavior is influenced by the choice of initial cluster centers, the order in which the samples are taken, and the geometrical properties of the data. The tendency of sinking into local optima is obvious. Both SAC and SAKMC can obtain globally optimal solutions, but SAKMC converges faster than SAC. Both SAC and SAKMC adopt the same stopping criterion. The main reason why SAKMC converges faster than SAC is that SAKMC uses the result of the K-means algorithm as the initial guess. As the K-means algorithm is very quick, one gets faster convergence with the SAKMC version.
4. CLASSIFICATION OF MATERIALS BY PROJECTION PURSUIT BASED ON GENERALIZED SIMULATED ANNEALING

4.1. Introduction

Classical principal component analysis (PCA) is the basis of many important methods for the classification of materials, since the eigenvector plots are extremely useful for displaying n-dimensional data while preserving the maximal amount of information in a space of reduced dimension. The classical PCA method is, unfortunately, non-robust, as the variance is adopted as the optimality criterion. Sometimes a principal component might be created just by the presence of one or two outliers [15]. So if outliers exist, the coordinate axes of the principal component space might be misdetermined by classical PCA, and a reliable classification of materials could not be obtained. The robust approach to PCA using simulated annealing has been proposed and discussed in detail in the chapter "Robust principal component analysis and constrained background bilinearization for quantitative analysis". Projection pursuit (PP) is used to carry out PCA with a criterion which is more robust than the variance [16], the generalized simulated annealing (GSA) algorithm being introduced as the optimization procedure in the PP calculation to guarantee the global optimum. The results for simulated data sets show that PCA via PP is resistant to deviation of the error distribution from the normal one, and the method is especially recommended for use in cases with possible outlier(s) in the data. The theory and algorithm of PP PCA together with GSA are described in [16]. Three practical examples are given to demonstrate its advantages.
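As a rough illustration of the idea (not the algorithm of reference [16]), a single projection-pursuit direction can be sought by simulated annealing over unit vectors, maximizing a robust spread measure of the projected scores. The sketch below uses the median absolute deviation as a stand-in projection index and plain simulated annealing rather than GSA; all names, parameter values and the choice of index are assumptions of this sketch.

    import numpy as np

    def robust_index(z):
        """Illustrative projection index: median absolute deviation of the scores.
        (The actual robust index of reference [16] is not reproduced here.)"""
        return np.median(np.abs(z - np.median(z)))

    def pp_direction(X, n_iter=2000, T0=1.0, alpha=0.995, step=0.1, rng=None):
        """Find one projection-pursuit direction by simulated annealing:
        maximize the robust index of the projected, column-centred data X (n x d)."""
        rng = np.random.default_rng() if rng is None else rng
        Xc = X - X.mean(axis=0)
        w = rng.normal(size=X.shape[1])
        w /= np.linalg.norm(w)
        best_w, best_val = w, robust_index(Xc @ w)
        val, T = best_val, T0
        for _ in range(n_iter):
            trial = w + step * rng.normal(size=w.size)
            trial /= np.linalg.norm(trial)        # stay on the unit sphere
            v = robust_index(Xc @ trial)
            # maximize: accept improvements, or worse directions with
            # Boltzmann probability exp((v - val) / T)
            if v > val or rng.random() < np.exp((v - val) / T):
                w, val = trial, v
                if v > best_val:
                    best_w, best_val = trial, v
            T *= alpha
        return best_w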
4.2. The IRIS data

A set of IRIS data [17] which consists of three classes, setosa, versicolor and virginica, was used to determine the applicability of the PP PCA algorithm for analyzing multivariate chemical data. Figure 5 shows the classification results of PP PCA and SVD. It can be seen that the PP PCA solutions provide a more distinct separation between the different varieties.
Figure 5. The plot of PP and SVD for the IRIS data. * Setosa samples; o Versicolor samples; − Virginica samples.
4.3. Classification of tea samples

The tea samples mentioned in section 2.3 are classified into three classes according to their origin by using PP PCA and SVD. As shown in Figure 6, the tea samples are clearly classified into three classes by the PP PCA algorithm, which uses SA. The method is more feasible than the classical SVD approach.
Figure 6. The comparison of results for tea samples by using the PP and SVD classification. * Green tea; o Black tea; Oolong tea.
4.4. Classification of beer samples

Xu and Miu determined the contents of alcohol and many other chemical parameters in beer samples and classified them by using PCA and a nonlinear mapping technique [18]. This data set was processed by PP PCA and SVD. The results are shown in Figure 7. One sample (No. 17), which was first classified as "plain" beer by the manufacturer, is classified as an "imperial" one. The beer expert's further examination confirmed this classification. PP PCA compared favourably with the traditional SVD algorithm and was at least as good as the nonlinear mapping technique.
Figure 7. The comparison of results of beer samples by using the PP and SVD classification. * "Imperial" beer; o "Plain" beer.
4.5. Classification of biological samples

The contents of Sr, Cu, Mg and Zn in the serum of patients with coronary heart disease and of normal persons were determined by using ICP-AES [19]. The data were evaluated by using ordinary principal component analysis, cluster analysis and stepwise discriminant analysis. It was found that ordinary principal component analysis and cluster analysis could not give satisfactory results, with four samples misclassified. There were two samples misclassified in stepwise discriminant analysis. These data sets were treated by PP PCA and SVD. The PC1-PC2 plot of the PP classification shown in Figure 8 has only two samples misclassified. The results further demonstrate that PP PCA is preferable to the traditional SVD algorithm.
Figure 8. The comparison of results of biological samples by using the PP and SVD classification. * Serum samples of patients with coronary heart disease; o Serum samples of normal persons.
ACKNOWLEDGEMENT This work was supported by National Natural Science Foundation of P.R.C. and partly by the Electroanalytical Laboratory of Changchun Institute of Applied Chemistry, Chinese Academy of Sciences.
REFERENCES

1. N. Bratchell, Chemometrics and Intelligent Laboratory Systems, 6 (1989) 105.
2. N. E. Collins, R. W. Eglese and B. L. Golden, American Journal of Mathematics and Management Science, 8 (1988) 209.
3. J. H. Kalivas, N. Roberts and J. M. Sutter, Analytical Chemistry, 61 (1989) 2024.
4. J. H. Kalivas, Journal of Chemometrics, 5 (1991) 37.
5. S. Z. Selim and K. Alsultan, Pattern Recognition, 24 (1991) 1003.
6. V. Cerny and S. E. Dreyfus, Thermodynamical Approach to the Traveling Salesman Problem: An Efficient Simulation Algorithm, Journal of Optimization Theory and Applications, 45 (1985) 41.
7. S. Kirkpatrick, C. Gelatt and M. Vecchi, Science, 220 (1983) 671.
8. N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller and E. Teller, J. Chem. Phys., 21 (1953) 1087.
9. D. E. Brown and C. L. Huntley, Pattern Recognition, 25 (1992) 401.
10. J. H. Ward, J. Am. Stat. Ass., 58 (1963) 236.
11. G. Li and G. Cai, Computer Pattern Recognition Technique, Chap. 3, Shanghai Jiaotong University Press (1986).
12. Q. Zhang, Q. R. Wang and R. Boyle, Pattern Recognition, 24 (1991) 331.
13. X. D. Liu, P. Van Espen, F. Adams and S. H. Yan, Anal. Chim. Acta, 200 (1987) 424.
14. Q. Zhang, K. Yan, S. Tian and L. Li, Chinese Medical Herbs (Chinese), 14 (1991) 15.
15. H. Chen, R. Gnanadesikan and J. R. Kettenring, Sankhya, B36 (1974) 1.
16. Y. Xie, J. Wang, Y. Liang, L. Sun, X. Song and R. Yu, Journal of Chemometrics, 7 (1993) 527.
17. R. A. Fisher, Annals of Eugenics, 7 (1936) 179.
18. C. Xu and Q. Miu, Computers and Applied Chemistry (in Chinese), 3 (1986) 21.
19. L. Xu, Y. Sun, C. Lu, Y. Yao and X. Zeng, Analytical Chemistry (in Chinese), 19 (1991) 277.
APPENDIX Principle of K-means algorithm Consider n samples in d dimensions, a1 , a 2 ,..., an , where the sample vector a1 = [ail , ai, ,..., aid], i =1, 2, ..., n, and assume that these samples are to be classified into k groups. The algorithm[ 11-121 is stated as follows:
Step 1. Select k arbitrary samples from the n samples as the k initial cluster centers z_1^(0), z_2^(0), ..., z_k^(0), where z_g^(0) = [z_g1^(0), z_g2^(0), ..., z_gd^(0)], g = 1, 2, ..., k; the superscript (0) refers to the initial center assignment.
Step 2. At the h-th iterative step, distribute the n samples among the k clusters using the following criterion: a_m (m = 1, 2, ..., n) is assigned to s_g^(h) if ||a_m - z_g^(h)|| < ||a_m - z_i^(h)|| for all i = 1, 2, ..., k, i ≠ g, where s_g^(h) denotes the set of samples whose cluster center is z_g^(h) and ||·|| represents the norm; otherwise the clustering of a_m remains unchanged.
Step 3. According to the results of Step 2, the new cluster centers z_g^(h+1), g = 1, 2, ..., k, are calculated such that the sum of the squared Euclidean distances from all samples in s_g^(h) to the new cluster center, i.e. the objective function

J = Σ_{g=1}^{k} Σ_{a_m ∈ s_g^(h)} ||a_m - z_g^(h+1)||²

is minimized. z_g^(h+1) is the sample mean of s_g^(h); therefore

z_g^(h+1) = (1/n_g) Σ_{a_m ∈ s_g^(h)} a_m,    g = 1, 2, ..., k,

where n_g is the number of samples in s_g^(h). The name "K-means" is derived from the manner in which the cluster centers are sequentially updated.
Step 4. If z_g^(h+1) = z_g^(h) for g = 1, 2, ..., k, stop; the algorithm has converged. Otherwise, go to Step 2.
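For readers who wish to experiment, the four steps above translate almost line for line into code. The following is a minimal NumPy sketch of the K-means procedure described in this appendix, not the implementation used in the chapter; the random initial-center selection, nearest-center assignment and convergence test on the centers follow Steps 1-4, while the array layout and seed handling are illustrative assumptions.

```python
import numpy as np

def k_means(a, k, seed=0):
    """K-means following the appendix; a is an (n, d) array of samples."""
    rng = np.random.default_rng(seed)
    n, _ = a.shape
    # Step 1: select k arbitrary samples as the initial cluster centers.
    z = a[rng.choice(n, size=k, replace=False)].astype(float)
    while True:
        # Step 2: assign each sample to the nearest current center.
        dists = np.linalg.norm(a[:, None, :] - z[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned samples.
        z_new = np.array([a[labels == g].mean(axis=0) if np.any(labels == g) else z[g]
                          for g in range(k)])
        # Step 4: stop when the centers no longer change; otherwise repeat Step 2.
        if np.allclose(z_new, z):
            return labels, z_new
        z = z_new
```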
Adaption of Simulated Annealing to Chemical Optimization Problems, Ed. by J.H. Kalivas © 1995 Elsevier Science B.V. All rights reserved.
Chapter 8
Chemical batch process scheduling
I. A. Karimi^a and S. Hasebe^b
^a Benger Laboratory, E.I. du Pont de Nemours & Co., Waynesboro, VA 22980, U.S.A.*
^b Department of Chemical Engineering, Kyoto University, Sakyo-ku, Kyoto 606-01, Japan

Chemical manufacturing processes can be classified into two types based on their modes of operation. A continuous process or unit is one that produces product in the form of a continuously flowing stream, while a batch unit or process is one that produces it in discrete batches. Continuous processing has been the most prevalent and sought-after mode in the chemical processing industry (CPI). In recent years, noncontinuous or batch processing has received increased attention for many reasons. The batch CPI employs combinations of three types of processing units: continuous, semicontinuous and batch. A semicontinuous unit is a continuous unit that runs intermittently with starts and stops, e.g. a pump, filter, etc. A continuous process, in most cases, is dedicated to producing a fixed product with little or no flexibility to produce another, while most batch processes possess the flexibility to produce multiple products. Batch plants are well suited for producing multiple low-volume, high-value products requiring similar processing paths and/or complex synthesis procedures, as in the case of specialty chemicals such as pharmaceuticals, cosmetics, polymers, biochemicals, food products, electronic materials, etc. In addition, they are forgiving in the face of seasonal or uncertain demands and a lack of process or product knowledge, and they offer flexibility in terms of operation times. A survey by Parakrama (1985) clearly highlighted the importance of batch processing and the need for computer aids. Since sharing of resources (time, equipment, manpower, utilities, raw materials, etc.) to manufacture multiple products is the principal feature of most batch plants, the need for optimization invariably arises both in the design and in the operation of such plants. These optimization problems are often complex and combinatorial in nature because there are usually many ways in which a plant can be designed and configured (the design problem) to produce a set of products and, similarly, many ways in which a plant and its resources can be allocated or scheduled (the scheduling problem) to produce a given slate of product orders. Two main approaches, mathematical programming and heuristic algorithms, have been used to solve design and scheduling problems arising in the batch CPI. The former

*This work was performed while the author was at the Department of Chemical Engineering, Northwestern University, Evanston, IL 60208-3120, U.S.A.
has been the principal approach for the design problems, while the latter has been used mainly for the scheduling problems. The mathematical programming approach results in either Nonlinear Programming (NLP), Mixed Integer Linear Programming (MILP) or Mixed Integer Nonlinear Programming (MINLP) formulations. When the full range of structural or scheduling options is modeled in these formulations, the MILPs or MINLPs can easily become intractable. Similarly, as the problem size and complexity increase, most formulations fail to guarantee globally optimal solutions. On the other hand, heuristic algorithms are normally designed to exploit special features of a problem, hence are generally problem-specific. There is very little carry-over from one problem to another. Furthermore, the ability of most heuristic algorithms to give good suboptimal solutions deteriorates rapidly with increasing problem size. Simulated annealing (SA), because of its simplicity, generality and success in solving large scale combinatorial problems, appears to be very promising for the above problems. Batch plants have usually been classified (Mah 1990) into two types: multiproduct and multipurpose. A multiproduct plant is more structured and consists of a series of processing stages, each stage comprising one or more units. Products produced by multiproduct plants are similar and hence follow the same processing path. A multipurpose plant is more like a pool of processing units that can be configured into different production lines to produce nonsimilar products. Multipurpose plants are the more complex of the two from the standpoints of design as well as scheduling. Being the simpler of the two, the multiproduct plants have been well studied in the literature (Reklaitis 1990, Musier and Evans 1990). In this chapter, we restrict ourselves to the batch process scheduling applications of SA. For applications in the design area, we refer the reader to Patel et al. (1991). We apply SA to three scheduling problems arising in multiproduct batch plants. The problems illustrate the complexity of batch problems in general and also demonstrate the potential of SA. The first two problems involve the scheduling of a serial multiproduct process (or a flowshop) with a single production line of M batch units. The difference between the two problems is in the scheduling objective. In the first problem, we minimize the total time required to produce a set of product batches, i.e. maximize productivity, while in the second we minimize the total penalty due to delays in meeting customer orders. The third problem represents a large scale application that involves the scheduling of a more complex plant in that it has two production lines and several additional features and constraints.
1. A SERIAL MULTIPRODUCT BATCH PROCESS The process (Figure 1) consists of a series of M batch units (reactors,
Figure 1. Schematic diagram of a serial multiproduct process (batch units linked by semicontinuous units, with storage units between them).
blending tanks, crystallizers, batch stills, dryers, furnaces, etc.) that are linked together by various semicontinuous units (pumps, heat exchangers, filters, packagers, bottlers, etc.). The products from this process are similar in nature (e.g. various types of lubricating oils, paints, cosmetics or beverages), hence require similar recipes and raw materials. The raw materials for all products are processed in the same processing sequence of M units, except that some products may skip a few units. The batch of material coming out of unit M represents a batch of finished product. It is desired to produce a slate of N product batches of various products. The goal is to determine the sequence in which to produce them so as to minimize the total time, or to maximize productivity. The times required to process each batch on the various units are given, and it is assumed that there are at least (N-1) storage units available between every two adjacent batch units to store intermediate products temporarily. Such a system is called a flowshop with unlimited intermediate storage (UIS), because a storage unit will always be available to an intermediate product batch whenever it needs it. Furthermore, we assume the following. 1. The N batches will be processed in the same order on each batch unit. 2. Once a batch begins processing on a unit, no other batch can interrupt its operation until completion. 3. A unit cannot process more than one batch at a time. 4. A batch cannot be processed by more than one unit simultaneously. For the time being, we assume that the times to transfer batches between processing and storage units and the times required to change over from one batch to another are negligible. We chose the above problem merely to evaluate the potential of SA for batch process scheduling problems. It was an obvious choice, because it has been studied extensively in the literature (Ku et al. 1987), so algorithms for comparison already exist. This problem of finding a sequence with minimum total time (called the makespan) to produce all N batches has been shown to be NP-complete for M > 2 by Garey et al. (1976); thus no polynomial-time algorithms exist for finding optimal solutions. We now describe our implementation of the SA algorithm. 1.1. SA algorithm details If we represent the N batches by the integers 1 to N, then the solutions to this scheduling problem are simply all permutation sequences of the N integers. By a sequence k_1-k_2-...-k_N, we mean that batch k_i is produced ith in the production sequence. Let t_ij denote the time to process batch i on unit j; then the time at which the ith product in the sequence finishes processing on unit j, C_ij, is given by (Rajagopalan and Karimi 1989)

C_ij = max[C_i(j-1), C_(i-1)j] + t_(ki)j,    i = 1, ..., N; j = 1, ..., M    (1)
where C_ij = 0 for i = 0 or j = 0. The objective in this problem is then to minimize C_NM, the completion time of the Nth batch on unit M. In this study (Ku and Karimi 1991a), we experimented with several strategies for perturbing a current sequence to generate a new sequence. We tested combinations of random batch insertions, random batch interchanges and pairwise interchanges of adjacent batches. Although
several strategies were equally effective, none worked well unless the pairwise interchange of adjacent batches was part of the strategy; therefore, for simplicity, we selected the pairwise interchange strategy for sequence generation. During the algorithm, we keep a record of the total number of new sequences that have been generated. To get the (r+1)th sequence from the rth, we identify the pair of batches to switch by calculating i = mod(r-1, N-1) + 1. Then we switch the ith and the (i+1)th batch in the rth sequence to create the (r+1)th sequence. We used a simple annealing schedule of decreasing the temperature kT in the Metropolis algorithm by a constant factor α (= 0.95 here) in 20 steps and used 3N³ total sequence generations as the termination criterion. Thus, at each value of kT, we evaluated int(0.15N³) sequences and then reduced kT by 5% for the next step. We determined the initial value of kT for each pair of M and N by generating 3000 random sequences via the pairwise interchange strategy. The initial value of kT was taken as 1.5 times the average difference in makespans resulting from the interchanges. Clearly, our version of the SA algorithm is not optimal. Das et al. (1990) and Malone (1989) have studied this problem further to optimize the SA procedure. Even this fairly crude version of SA was quite successful compared with some heuristic algorithms that we discuss below.
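To make the procedure of section 1.1 concrete, the sketch below implements the completion-time recursion of Equation 1 and the pairwise-interchange annealing schedule described above (3N³ sequence generations split over 20 temperature steps, with kT reduced by the factor α = 0.95). It is an illustrative reconstruction rather than the authors' code; the determination of the initial kT from 3000 random interchanges is omitted, and all names are assumptions.

```python
import math
import random

def makespan(seq, t):
    """Equation 1: C[i][j] = max(C[i][j-1], C[i-1][j]) + t[k_i][j] for a UIS flowshop."""
    N, M = len(seq), len(t[0])
    C = [[0.0] * (M + 1) for _ in range(N + 1)]
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            C[i][j] = max(C[i][j - 1], C[i - 1][j]) + t[seq[i - 1]][j - 1]
    return C[N][M]

def anneal_flowshop(t, kT0, alpha=0.95, steps=20):
    """SA with pairwise interchange of adjacent batches; t is an N x M matrix."""
    N = len(t)
    seq = list(range(N))
    cur = best_val = makespan(seq, t)
    best, kT = list(seq), kT0
    for _ in range(steps):
        for r in range(int(0.15 * N ** 3)):
            i = r % (N - 1)                     # pair cycled as i = mod(r-1, N-1) + 1
            cand = seq[:]
            cand[i], cand[i + 1] = cand[i + 1], cand[i]
            val = makespan(cand, t)
            if val <= cur or random.random() < math.exp(-(val - cur) / kT):
                seq, cur = cand, val
                if cur < best_val:
                    best, best_val = list(seq), cur
        kT *= alpha
    return best, best_val
```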
1.2. Heuristic algorithms Several heuristic algorithms exist in the literature (Ku et al. 1987) for finding good suboptimal solutions to the above UIS flowshop problem. Of them, IMS (Idle Matrix Search), developed by Rajagopalan and Karimi (1987), has proven the best. It is able to handle cleaning and transfer times and tries to reduce the idle time of units between batches in order to reduce the makespan. On larger (N > 8) problems, it gives as much as 8% lower makespans than other heuristics. As shown by Ku and Karimi (1991a), the fraction of sequences that are optimal decreases rapidly as N or M increases, and the worst sequence can have a makespan as much as 50% greater than the minimum makespan. Thus, it becomes harder and harder to get sequences close to the optimal as N or M increases, and most heuristic algorithms become less effective. Having developed several heuristic algorithms ourselves, we found it hard to believe that a general randomization algorithm such as SA would do as well as specially tailored heuristic algorithms such as IMS. This led us to devise two more control algorithms to better evaluate SA. The first control algorithm, call it A1, simply evaluates 3N³ randomly generated sequences and selects the best. This was designed just to see how different SA is from just another random algorithm. The other control algorithm, call it A2, is similar to SA, but does not use any annealing schedule. It uses the pairwise interchange strategy as in our SA algorithm, but accepts no uphill moves unless no better sequence is found after (N-1) pairwise interchanges. Thus it makes an uphill move only when it gets stuck, and this is done by randomly interchanging two adjacent batches in the current sequence. A2 also evaluates only 3N³ sequences.
1.3. Numerical evaluation We examined several combinations of N×M for evaluating SA and used 50-100 (100 for N < 9, 50 for N > 8) simulated test problems for each combination. For the test problems, we generated batch processing times on units from a
Table 1. Evaluation of SA for problems with N < 9. For each N×M combination (N = 6, 7, 8; M = 3, 5, 8, 10) the table lists the initial kT (h), the percentage of test problems for which each of SA, A1 and A2 found the optimal solution, and the percentage mean deviations of each algorithm from the optima. (Reprinted with permission from Industrial & Engineering Chemistry Research, Vol. 30, No. 1, 1991. Copyright 1991 American Chemical Society.)
uniform distribution over [0.1, 20] h. We used one value of initial kT (determined as described before) for all the problems in each N×M combination. We did the evaluation in two parts according to the problem size. For small-size (N < 9) problems, we compared SA, A1 and A2 with the optimal solutions found by complete enumeration. For larger (N > 8) problems, we compared the relative performances of SA, IMS and A1. Since the optimal solutions are expensive to obtain for the larger problems, we took the best of the solutions obtained by the three algorithms and compared each algorithm with respect to this best. The results are shown in Tables 1 and 2. In Table 1, the % of solutions optimal is the percentage of test problems for which an algorithm gives the optimal solution, and the % mean deviation is the mean of the differences between the algorithm makespans and the optimal makespans, calculated as percentages of the latter. Similarly, in Table 2, the % of solutions best is the percentage of test problems for which an algorithm gives the best makespan, and the % mean deviation is the mean of the differences between the algorithm makespans and the best makespans, calculated as percentages of the latter. Note that the mean deviations are based on only those solutions that differ from the best or the optimal. From Table 1, it is clear that SA has the highest chance, greater than 90%, of finding the optimal sequence for small-size problems. Furthermore, unlike the two control algorithms, its performance does not deteriorate rapidly with increasing N or M. Even when SA fails to get the optimum, it gets a very good suboptimal solution, as evident from the % mean deviation values. Surprisingly, A1, the purely random control algorithm, does better than A2, the
Table 2. Evaluation of SA for problems with N > 8. For each N×M combination (N = 10, 15, 20, 30; M = 3, 5, 8, 10) the table lists the initial kT (h), the percentage of test problems for which each of SA, IMS and A1 gave the best makespan, and the percentage mean deviations of each algorithm from the best. (Reprinted with permission from Industrial & Engineering Chemistry Research, Vol. 30, No. 1, 1991. Copyright 1991 American Chemical Society.)
algorithm with uphill moves. The above results indicate that SA is quite good at solving this scheduling problem and that it is not just another random search. Because of the poor performance of A2 for N < 9, we omitted it for problems with N > 8 and used IMS instead. It is clear from Table 2 that SA almost always gives the best solutions of the three algorithms. In fact, of the 800 test problems, SA failed to find the best solution in only 8. The effectiveness of IMS and A1 decreases rapidly for M > 3, as judged by the two criteria in Table 2. Thus, SA gives 5-6% lower makespans than the other two algorithms. A1 does quite well because of the sheer number of sequences that it examines as compared to IMS. The computation times required by the algorithms are available in Ku and Karimi (1991a). As we would expect, they are the major difference between SA and IMS. SA requires 10 to 50 times more computation time than IMS. With ever-increasing computing power and the better solutions from SA, one can justify the computational effort required by SA. Moreover, many aspects of the algorithm used here can be refined further to yield even better SA performance.
2. EXTENSION TO DUE-DATE PENALTIES In the serial multiproduct process of section 1, we assumed the scheduling objective of minimizing makespan and also used simplifying assumptions about set-up and transfer times. To illustrate the generality of SA, we now apply essentially the same algorithm to a serial flowshop in which meeting the due dates of customer orders is more important than maximizing productivity. A survey by Chaudhary (1988) suggests that customer service is crucial for many batch plants. To this end, we assume that a list of N product batches is given and each batch carries a due date by which it should be produced. Such a due date can be assigned to each batch based on the due dates and the amounts of the customer orders. Let us say that batch i has a due date d_i; if the batch is not ready by d_i, it is said to be late or tardy and incurs a penalty of p_i dollars per unit time due to the loss of customer satisfaction. No penalty or reward is incurred for producing the batch early. We now relax some of the assumptions in the process of section 1. We no longer assume unlimited storage between batch units, but allow any one of the storage policies discussed by Ku and Karimi (1990a). Briefly, these allow for no or few storage tanks between units, so that a batch may wait in a batch unit if the downstream batch unit is busy and no storage unit is available. Moreover, we no longer assume the times required to transfer batches between units and storage and the times for set-up and cleaning to be negligible. Let a_ij denote the time required to transfer batch i out of unit j, and let s_ij denote the set-up or clean-up time required on batch unit j before batch i can begin processing on it. However, we assume that s_ij does not depend on what was processed on unit j prior to batch i. With these additional features, the penalty incurred by the ith batch in a sequence k_1-k_2-...-k_N is given by

P_i = p_(ki) max[0, C_iM + a_(ki)M - d_(ki)],    i = 1, ..., N    (2)
and the total due-date penalty for the sequence is P = P_1 + P_2 + ... + P_N. This scheduling objective is known as the weighted tardiness criterion. Ku and Karimi (1990b) developed a branch and bound algorithm for solving the above scheduling problem exactly but, as with most scheduling problems, such optimal algorithms are not useful for N > 8. As before, we again devise a few simple heuristic algorithms to evaluate SA effectively. 2.1. Heuristic algorithms The first algorithm, called the DOP (Due-date Over Penalty) algorithm, is the one used by Ku and Karimi (1990b). Its solution, called the DOP sequence, is simply the sequence of batches in increasing order of d_i/p_i. The second algorithm, called the SB (Sequence Building) algorithm, uses a priority list of batches and gradually builds a full sequence in N iterations. We use the DOP sequence as the priority list. The algorithm develops an n-batch partial sequence (n = 1, ..., N) at the end of the nth iteration. At the (n+1)th iteration, it takes the (n+1)th batch from the DOP sequence and generates (n+1) new sequences from the current n-batch sequence by inserting that batch at every position (1, 2, 3, ..., n+1, in that precise order). During the insertion process, the relative positions of all the
batches in the n-batch sequence are not changed. It then evaluates the total penalty for each of the newly generated n+1 sequences and uses the best as the (n+1)-batch sequence for the next iteration. If multiple sequences with the same penalty are generated at any iteration, then it uses the one that was generated first. The third algorithm, called the MPR (Maximum Penalty Reduction) algorithm, is a simple iterative procedure that starts with the DOP sequence and successively generates better and better sequences. Given a current sequence, it finds the batch with the highest due-date penalty. To reduce this penalty, it tries to insert that batch forward at all earlier positions in the sequence, while keeping the relative positions of the other batches intact. If such insertions result in better sequences, then it takes the best of them as the next current sequence and starts over again. If a better sequence does not result, it identifies the batch with the second highest penalty and employs the same insertion strategy on it. If it still fails to generate a better sequence, it tries the batch with the third highest penalty, and so on. The algorithm stops when the penalty is not reduced by forward-inserting any batch in the current sequence. The fourth algorithm, called the FPE (Four-Product Enumeration) algorithm, is a partial enumeration scheme. The algorithm does (N-3) iterations and generates a new current sequence at the end of each iteration. In the first iteration, the DOP sequence is taken as the current sequence. At the nth iteration, it selects a zone of four consecutive batches in the current sequence, which consists of the nth, the (n+1)th, the (n+2)th and the (n+3)th batches in that sequence. The algorithm then enumerates the 24 4-batch sequences of the above four batches. It then constructs 24 full (N-batch) sequences from these 4-batch sequences by appending the remaining batches from the current sequence. Specifically, the 1st through the (n-1)th batches in the current sequence are placed in front of each of the 24 sequences, while the (n+4)th through the Nth are placed at the rear, without changing their positions in the sequence. The algorithm chooses the best of them as the current sequence for the next iteration. The sequence obtained after the (N-3)th iteration is the best. 2.2. Numerical evaluation To evaluate the above algorithms and SA, we used 50 randomly simulated test problems for several N×M combinations. Although the approach is fully applicable to problems with arbitrary storage policies, we used the UIS policy for the test problems. For each problem, we picked the t_ij as real numbers from a uniform distribution over [0, 30] h, and the s_ij, the a_ij and the p_i as integers from a uniform distribution over [1, 5]. However, we assigned the d_i using a special procedure so as to ensure that the optimal sequence would have zero penalty for each problem. For each problem, after generating its data, we selected a random sequence and calculated the completion times of its batches. We used these completion times as the due dates for the respective batches. Such a construction makes it possible to evaluate SA even for large problems, because the optimal solutions are known a priori. Table 3 shows a comparison of the various algorithms using two criteria. The first is the percentage of problems for which each algorithm obtained the optimal solution of zero penalty, while the second compares the penalty of the DOP sequence with that of the sequence from each algorithm. The percentage
Table 3. Evaluation of SA for due-date problems. For each N×M combination (N = 8, 10, 20, 30, 50; M = 2, 4, 6) the table lists the percentage of test problems for which each of SB, FPE, MPR and SA found the optimal (zero-penalty) solution, and the percentage mean reductions over the DOP penalty achieved by each algorithm. (Reprinted with permission from Computers & Chemical Engineering, Vol. 15, No. 5, H.-M. Ku and I. A. Karimi, Scheduling Algorithms for Serial Multiproduct Batch Processes with Tardiness Penalties, pp. 283-286, Copyright 1991, with kind permission from Elsevier Science Ltd., The Boulevard, Langford Lane, Kidlington OX5 1GB, UK.)
reduction over the DOP is given by 100(1 - algorithm penalty/DOP penalty). However, note that the mean reduction is based only on those solutions that are suboptimal. If the solutions for all the test problems are optimal, then the % reduction over DOP is 100%. It is clear that SA outperforms the other three algorithms. Although the percentages of optimal solutions also go down for SA for large problems, they are still far better than those of the rest. SB and FPE fail to attain the optimal solutions for N > 10. The MPR algorithm does quite well, while the FPE algorithm is the worst. It is noteworthy that SA obtains 97-100% reductions over the DOP sequences even for large problems, i.e. it gives solutions within 3% of the optima. FPE fails to perform well for large N because of its local enumeration strategy. SB is better in that respect because it gives each batch a chance to be anywhere in the sequence. In terms of computational effort (see Ku and Karimi 1991b), SB is the cheapest and SA is the costliest. MPR comes second after SA. In comparison to SB, MPR needs 1-8 times, FPE needs 5-7 times and SA needs 200-700 times more computation time. MPR requires more time for large N because it has to examine many sequences just to terminate.
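As a concrete illustration of the DOP ordering and of the weighted-tardiness objective of Equation 2, the sketch below reuses the simple UIS completion-time recursion of Equation 1 and only adds the final transfer-out time; the chapter's full model also charges set-up, clean-up and intermediate transfer times, which are omitted here. All function and variable names are illustrative assumptions, not the authors' code.

```python
def dop_sequence(d, p):
    """DOP heuristic: batches in increasing order of the ratio d_i / p_i."""
    return sorted(range(len(d)), key=lambda i: d[i] / p[i])

def total_due_date_penalty(seq, t, a_M, d, p):
    """Equation 2 (simplified to UIS with zero set-up/transfer inside the line):
    P = sum_i p_{k_i} * max(0, C_{iM} + a_{k_i,M} - d_{k_i})."""
    N, M = len(seq), len(t[0])
    C = [[0.0] * (M + 1) for _ in range(N + 1)]
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            C[i][j] = max(C[i][j - 1], C[i - 1][j]) + t[seq[i - 1]][j - 1]
    return sum(p[k] * max(0.0, C[i + 1][M] + a_M[k] - d[k])
               for i, k in enumerate(seq))
```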
190 In this problem, the DOP represents a more intuitive scheduling procedure that a human scheduler may use in practice. The evaluations tell us that much better schedules can be obtained by using heuristics such as the ones presented in this paper. SA, especially with its ability to get excellent solutions, is quite attractive in spite of its computational burden. 3. A LARGE SCALE SCHEDULING PROBLEM The batch plant considered so far was relatively simpler than most plants in real practice. Although it had some realistic features such as nonzero transfer and cleaning times and allowed arbitrary storage policies between units, it had only one production line. The products were similar and did not need to use different processing units. This feature made it possible to define a schedule by a single sequence of batches. Even the scheduling objectives were simpler in that they consisted of just one aspect (productivity or customer satisfaction) of plant operation. Finally there were no constraints on the plant operating times and unit availabilities. In other words, units did not have scheduled maintenance or down times and the plant operated round the clock. Also there were no constraints on the availability of other resources such as manpower or utilities. As it is, SA demands extensive computational effort. If one introduces more complex features in a batch plant, it becomes even more difficult to use SA for such problems. This is because it needs more details to define a schedule and even if processing orders of batches on different units are specified, a large amount of computation is required just to arrive at the times at which each operation should begin so that none of the operating constraints is violated. In this section, we extend the applicability of SA to a large scale scheduling problem. To achieve this, we propose an efficient procedure for determining the start times of operations and we also make a simple modification to SA. Then we apply the methodology to a real life scheduling problem. We begin with a description of the batch plant. 3.1. A batch plant with two production lines
Figure 2. A batch plant with two production lines.
191 A schematic diagram of the plant is shown in Figure 2. The plant consists of 17 batch units that are configured into roughly two production lines. For simplicity, the semicontinuous units are not shown in the figure. The plant also has three storage tanks that can be used to hold intermediate products temporarily between units 4 and 5 and units 12 and 17. These tanks offer the flexibility that the processing orders of batches can be changed before and after the tanks by holding batches in these tanks. There are no storage tanks available for other processing units, hence if a batch of materials is completed on any of them, then the batch must be held in the same unit until the downstream unit is ready to receive it for further processing. In some cases, the intermediate product thus produced may be unstable and hence must begin processing on the next unit without delay, then the start times of operations must be adjusted so that the intermediate product batch will not need to wait. In this plant, unit 8 is common to both production lines, hence there are two ways in which batches can be assigned to each production line, namely one from units 1 or 9 and the other from unit 8. The set of units that a batch of materials must follow to produce a given product will be called a production path. For each product, there is one fixed production path and it is given a priori. From Figure 2, the following five production paths are possible. 1. Units 1 > 2 > 3 > 4 > (Storage Tank) > 5 > 6 > 7 2. Units 8 > 2 > 3 > 4 > (Storage Tank) > 5 > 6 > 7 3. Units 8 > 10 > 11 > 12 > 13 > 14 > 15 > 16 4. Units 9 > 10 > 11 > 12 > 13 > 14 > 15 > 16 5. Units 8 > 10 > 11 > 12 > (Storage Tank) > 17 > 16 Although a production path may show a storage unit, a batch following that path does not have to go through the storage and can skip it to proceed directly to the next unit. The processing of a batch of material on any batch unit consists of four steps which we call "basic operations." First is the filling step in which the batch of material from the previous unit is discharged into the current unit. Second is the processing step in which various physical or chemical operations are carried out. Third is the discharging step in which the processed batch is either sent to the next batch unit or a storage unit. Fourth is the cleaning step in which the current batch unit is cleaned out and made ready to receive the next batch. In many cases, it is possible to insert delays or idle times between successive basic operations. For instance, if there is no empty storage or the downstream batch unit is not available, there will be a delay between the end of the processing step and the start of the discharging step. Lastly, any proposed schedule must satisfy the following operational constraints. 1. If an intermediate product is unstable, there can be no delay in its further processing after the processing step that produces it. 2. Once a batch begins processing on a unit, no other batch can interrupt its operation until completion. 3. A batch unit cannot process and a storage tank cannot hold more than one batch at a time. 4. A batch cannot be processed by more than one unit simultaneously. 5. There is no operation on Sunday, so processed or unprocessed material can be held in a batch or storage unit.
192 6. Some units have scheduled maintenance periods and they must be empty during those periods. 7. Every filling operation to a batch unit requires a unit of manpower and the manpower is restricted during the nights and weekends. 8. The times required to clean a batch unit and the cost involved in changing from one product to another on a batch unit depend on the sequence in which the different products are processed and they are known a priori. 9. There is a due date penalty for every late or tardy batch, which is proportional to its tardiness. It is desired to produce a list of N batches of different products so as to minimize a composite performance index that consists of a weighted sum of three terms; namely due date penalties, change over costs and completion times of last batches or jobs on the two production lines. The scheduling algorithm is to determine the processing orders of various jobs on each batch unit and the start times of all operations. 3.2. Schedule definition In the serial multiproduct process studied earlier, the definition of a schedule was simply a sequence of batches, as the processing order on all units was the same. In this plant, we do allow different processing orders on different batch units, so a schedule cannot be defined by just one sequence. If we had complete freedom to use arbitrary processing orders on all unit, we could define a sequence of processing for every batch unit. That would indeed be a large problem. Fortunately we do not have that much freedom because the processing order of jobs on two consecutive batch units must be the same, if there is no intermediate storage unit between them. Thus, the processing order has to be the same on units 2, 3 and 4 or units 5, 6 and 7 or units 13, 14 and 15 etc. Note that there are only two places in Figure 2, where a change in processing order can take place, namely between units 4 and 5 and units 12 and 17. Clearly then, some optimization is in order for defining the processing orders. First of all, we need to know how many sequences must be defined to completely specify the processing orders on all units uniquely. Similarly, we need to know for which units they must be defined. At present, we do not have a
Figure 3. Five unit groups in the plant.

Figure 4. A production sequence of 13 jobs.
formal algorithm to answer the above two questions. The criterion that will determine the answers to the above questions is that processing orders should be defined in such a way that eliminates any redundancy in possible schedules. In this instance, we find that five sequences seem to be an appropriate choice and these sequences are defined on five unit groups as shown in Figure 3. Each set of units encircled by dashed lines constitutes a unit group. We define one processing order for all units in a group. We call this order a group sequence. A set of group sequences, called a production sequence, defines the processing orders on all units. The unit groups should be selected in such a way that different sets of group sequences will result in different production sequences. For instance, if units 1, 2, 3, 4, 8, 9, 10, 11 and 12 were to be made one group, then the group sequences A-H-B-C-D-E-I-J-K-F-G-L-M, A-B-H-C-DE-I-J-K-F-G-L-M, A-B-C-H-D-E-I-J-K-F-G-L-M, A-B-C-D-H-E-I-J-K-F-G-L-M, etc. will all result in the same production sequence shown in Figure 4. In Figure 4, circles represent the jobs, solid arrows define the processing order of these jobs on a unit group and dashed arrows indicate the production path of a job through different unit groups. The information contained in the production sequence of Figure 4 can be expanded further to extract the processing orders of jobs on individual units. For instance, Figure 5 shows the processing orders of jobs J, K and L on units that belong to groups 2, 3 and 5 for the production sequence in Figure 4. Here the interpretation of circles and arrows is the same as in Figure 4 except that unit groups are now replaced by individual units. Having decomposed a group sequence into sequences for its constituent units, each unit sequence can be further decomposed into a sequence of basic operations. For example, the unit sequence K-L on units 10 and 11 is
Figure 5. Processing orders of jobs on the units in unit groups 2, 3 and 5 (units 8-17).
decomposed into the sequences of basic operations in Figure 6. These are the real primary elements or the building blocks of every schedule and the start and end times of these basic operations on each unit define a complete schedule. Also note that even if a job sequence on a unit is given and the start times of the filling steps are given, the start and end times of other operations are not fixed automatically, as delays may have to be added between various steps to satisfy other operational constraints. Even though the basic operations are the building blocks of a schedule, we do not generate new schedules by resequencing them in SA. In SA, we generate new schedules only by modifying group sequences. There are several ways to modify a group sequence. For instance, we could exchange
Figure 6. Sequencing of the basic operations (filling, processing, discharging and cleaning) of jobs K and L on units 10 and 11.
two randomly selected jobs in a given group sequence to generate a new one, or we could insert a randomly selected job after another randomly selected job in a group sequence to generate a new one. In this paper, we use the latter strategy. Thus the new group sequences A-D-B-C-G-E-F for unit group 4 and A-B-D-C-E-F-G for unit group 1 in Figure 4 are created by inserting job C after job B in the unit group 4 sequence and job C after job D in the unit group 1 sequence. Thus, to create a new production sequence in SA, we randomly select a job and then insert it at randomly selected positions in each of the group sequences in which it appears.
3.3. Infeasible production sequences We eliminated a number of infeasible sequences by defining appropriate unit groups in the last section. This elimination was due to our realization that two consecutive batch units must have the same job sequence if there is no intermediate storage between them. However, this may not be the case when there are some storage tanks. Now let us say that we have S storage tanks between two consecutive batch units u (upstream) and d (downstream). Does this give us the freedom to change the job sequence on unit d in any manner as compared to the sequence on unit u? The answer is no, because one storage tank cannot hold more than one job at a time. To hold two jobs at the same time, we must have two tanks. In other words, all those sequences that require more than S tanks are clearly infeasible. How do we identify such sequences? For this, consider Figure 7, which shows the processing sequences for jobs on units 4 and 5 of the plant in Figure 2. Note that when job D is being processed on unit 5, job B has not yet been processed on it, but it has already completed processing on unit 4. Clearly job B must be in a storage tank. In this case there are two storage tanks, so it is possible to do this. However, realize that a sequence such as A-E-B-C-D-F-G is infeasible, because the production of the three jobs B, C and D is delayed and that requires at least three storage tanks. Thus a sequence on unit 5 in which a job has been advanced by more than two (the number of storage tanks) positions as compared to its position in the sequence on unit 4 is infeasible. To detect such infeasible schedules, we use the following equation to find the set (and thus the number) of jobs in the storage tanks at every job on the unit after the storage units:

{Jobs in storage while job j is on unit d} = {Jobs succeeding job j on unit d} ∩ {Jobs preceding job j on unit u}    (3)

Figure 7. Calculation of jobs in storage tanks.
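The storage-count test of Equation 3 is straightforward to implement once the job sequences on the upstream unit u and the downstream unit d are known. The sketch below assumes both sequences contain the same set of jobs; the function name is an illustration rather than the authors' code.

```python
def max_jobs_in_storage(seq_u, seq_d):
    """Equation 3: while job j is on unit d, the jobs in storage are those that
    precede j on unit u but still succeed j on unit d; return the largest count."""
    pos_u = {job: i for i, job in enumerate(seq_u)}
    worst = 0
    for idx, j in enumerate(seq_d):
        preceding_u = {job for job in seq_u if pos_u[job] < pos_u[j]}
        succeeding_d = set(seq_d[idx + 1:])
        worst = max(worst, len(preceding_u & succeeding_d))
    return worst

# A production sequence is feasible at this storage location only if
# max_jobs_in_storage(seq_u, seq_d) <= S, the number of storage tanks.
```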
196 In Figure 7, the numbers below the job circles on unit 5 sequence give the numbers of jobs in storage tanks during the processing of those jobs on unit 5. Since the number of jobs in the storage tanks at any time does not exceed the number of storage tanks, it is a feasible sequence. Thus, if the maximum number of jobs in storage at any time exceeds the available number of storage tanks in any production sequence, then that production sequence is infeasible and should be eliminated. Now that we have identified infeasible production sequences, let us see how we can determine the start times of all operations in a feasible production sequence. 3.4. Simulation algorithm As mentioned earlier, the start and end times of various operations must satisfy the various operational constraints. We classify these constraints into four categories and use an iterative algorithm to satisfy them. Type 1: Precedence Relationship Constraints These merely specify the orders of basic operations. From the group and the unit sequences, an order of basic operations is derived for each unit. When two basic operations are connected by an arrow as in Figure 6, their start and end times satisfy certain inequality constraints. For instance, for a job j on unit i, the time at which the processing step starts must be greater than or equal to the time at which the filling step starts plus the time required to fill, etc. Type 2: Work Pattern Constraints These constraints arise because of special circumstances such as night operation, weekend operation, Sunday, maintenance schedule, etc. They specify times during which certain basic operations are not possible. Type 3: Utility Constraints Most utilities in a real plant have certain constraints. For instance, there may be a constraint on the manpower or the maximum electric current or maximum steam demand, etc. If simultaneous processing of some basic operations violates any of these constraints, then those operations cannot be done simultaneously. Clearly there are no pre-specified periods for these and they will depend on the start times of basic operations. Type 4: Finite Intermediate Storage Constraints When all the storage tanks are occupied between two units and the downstream batch unit is busy, the discharge time of a batch from the upstream unit will be delayed. The duration of this delay will depend on the filling and discharge times of previous batches to and from the storage units. Now the complete algorithm (Hasebe et al. 1992) is as follows. 1. Set the start time of every basic operation as the earliest start time which is given as input data. 2. Delay the start times of some basic operations to satisfy the Type 1 constraints. 3. If the schedule from Step 2 satisfies all the Type 2 constraints, then proceed to Step 4. Otherwise, delay the start times of some operations to satisfy all the Type 2 constraints and return to Step 2. 4. If the schedule from Step 3 satisfies all the Type 3 constraints, then proceed to Step 5. Otherwise, find the earliest time at which a Type 3 constraint is not satisfied and delay the start time of one of the basic operations being done at that time so as to satisfy that constraint. Then return to Step 2.
197 5. If the schedule from Step 4 satisfies all the Type 4 constraints, then stop. The desired schedule satisfying all the constraints has been obtained. Otherwise, find the earliest time at which a Type 4 constraint is not satisfied and delay the start time of a discharge operation at that time so as to satisfy that constraint. Then return to Step 2. The above algorithm is iterative in nature, because the process has a complex structure and the processing orders of jobs can be changed between the unit groups. If the process consists of a single production line and the processing orders of jobs are the same on all units, then it is possible to use a more efficient completion time algorithm (Ku and Karimi 1990a) that is not iterative. The methodology used by Ku and Karimi (1990a), even though it deals with storage as a shared resource, can be applied to other shared resources such as operation time, manpower, utility, etc. and thus the algorithm can, in principle, be generalized to determine the start times for this problem. Its main feature is that every batch is scheduled in such a way that the constraints are automatically satisfied, so iterations are not necessary. Both algorithms show that it is not possible to guarantee the optimality of the completion times for a given production sequence because of a complex interplay of shared resources. 3.5. Improved simulation algorithm In the above simulation algorithm, Steps 2-4 must be repeated every time the start time of a basic operation is delayed to satisfy a Type 4 constraint. Therefore, a large amount of computation time is required. We now show that Type 4 constraints can be converted to Type 1 constraints, thus reducing the computational effort considerably. Let us consider the sequences shown in Figure 7 for units 4 and 5. In this case, the minimum number of jobs held in the storage tanks at the start of a processing step on unit 5 was given by Equation 3 for each job. However, as shown for a specific schedule in Figure 8, satisfying this constraint does not mean that the discharge step on unit 4 can be started at any time. In this case, because of the large processing time of job A, both jobs B and C have to go to storage and the discharge of job D on unit 4 cannot start until job C has been sent from the storage to unit 5 or time T as indicated in Figure 8. Let us say that we are at Step 5 of the simulation algorithm and we are trying to decide by how much the start time of the discharge step of job k on unit 4
Figure 8. Gantt chart for the schedule in Figure 7: filling, processing, discharging, cleaning and holding operations of jobs A-E on units 4 and 5 and the two storage tanks.
should be delayed so that we will not use more storage than is available. This depends on how many jobs are in the storage tanks at this time. Let t_k^4d denote the start time of the discharge of job k from unit 4 and t_k^5f the time at which job k starts to fill into unit 5. Using Equation 3, we can calculate the number of jobs in storage when job k starts processing on unit 5. Let that number be H_k. If H_k = S, the number of storage tanks between units 4 and 5, then all the storage tanks are full; job k cannot go to a storage tank and must wait until unit 5 becomes free, i.e. until job k starts filling into unit 5. Therefore, we get a precedence constraint as follows:

t_k^4d = t_k^5f    (4)

Now if H_k < S, then at least one storage tank must be empty and hence job k can start discharging into a storage tank at some time earlier than the time given by Equation 4. But that time must be the time at which a storage tank becomes free. Let G_k be the set of jobs completed on both units 4 and 5 before job k, and let k' be the job that finished (S - H_k) jobs prior to job k on unit 5. Note that only the jobs in G_k are counted to determine the job k'. Then the earliest time at which job k can discharge into a storage tank is the time at which job k' is fully inside unit 5. The precedence constraint that describes this is

t_k^4d = t_k'^5f + filling time for job k' on unit 5    (5)
The constraints in Equations 4 and 5 are similar to those of Type 1 and hence can be included in them. For any given production sequence, these can be generated uniquely and included in Type 1 constraints. This reduces the number of iterations in the algorithm, thus reduces the computation time.
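Under the assumptions just stated, Equations 4 and 5 reduce to a simple rule for the earliest discharge start of job k from unit 4. The following sketch encodes that rule; the inputs (the per-job fill start and fill times on unit 5, the set G_k and the count H_k) are assumed to have been computed beforehand, and all names are illustrative.

```python
def earliest_discharge_start(k, S, H_k, G_k, seq_5, fill_start_5, fill_time_5):
    """Equations 4 and 5: earliest start of the discharge of job k from unit 4.
    H_k: number of jobs in storage when job k starts on unit 5 (Equation 3).
    G_k: jobs completed on both units 4 and 5 before job k."""
    if H_k == S:
        # All tanks occupied: job k must wait until it can fill unit 5 (Equation 4).
        return fill_start_5[k]
    # Otherwise wait until a tank frees, i.e. until the job k' that finished
    # (S - H_k) jobs before job k, counting only jobs in G_k, is fully in unit 5 (Equation 5).
    done_before_k = [j for j in seq_5 if j in G_k]   # assumed to contain at least S - H_k jobs
    k_prime = done_before_k[-(S - H_k)]
    return fill_start_5[k_prime] + fill_time_5[k_prime]
```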
3.6. The SA algorithm For this problem, we use exactly the same algorithm as in Section 1.1, except that a different sequence rearrangement strategy is used and the initial values of kT are determined differently. We specify appropriate values for initial kT, the final kT, the number of schedules (NS) actually evaluated at each kT and the reduction factor a by which kT is reduced at each iteration. As mentioned in section 3.2, we select a job randomly and insert it at randomly selected positions in each group sequence to create a new production sequence from the current one. Then we check its feasibility as discussed in section 3.3. Only if the sequence is feasible, we determine its completion times using the simulation algorithm and we do not count the infeasible sequences in the number of schedules NS at each kT. The algorithm, as described above, requires a huge amount of computation time, when applied to the plant in Figure 2, as the simulation algorithm is time consuming. To eliminate unnecessary start time calculations, we modify the probabilistic step in SA slightly. In most cases, the newly created production sequences in SA are worse than the current candidate for the best schedule. However, to realize this, we have to execute all the steps of the simulation algorithm, which is a waste of time. If these bad schedules can be eliminated at
an early stage in the simulation algorithm, then the computation time can be decreased drastically. To this end, note that the performance index used for this problem is a monotonically nondecreasing function of the completion time of each job. Since the start times of operations always increase during the progress of the simulation algorithm, the performance index also increases accordingly. In the Metropolis algorithm, we generate a random probability p for accepting a solution with a higher performance index than the current one. Instead of generating p afterwards, we generate it before we do the start time calculations. From this number, we calculate a critical value of the performance index that cannot be exceeded if the new schedule is to be accepted. If, during the simulation algorithm, the performance index exceeds this critical value, then we stop the algorithm and reject that schedule. In other words, we replace a random p by a random critical performance index and use it in the simulation algorithm to avoid unnecessary calculations. With this modification, the computation time of SA is reduced considerably.
3.7. Example To demonstrate the applicability of the improved SA algorithm discussed above, let us now apply it to a fairly large scheduling problem for the plant in Figure 2. The data for the example are summarized in Table 4.

Table 4
Data for the example
Number of Jobs (N): 50
Number of Product Types: 26
Initial kT: 500
Final kT: 0.12
kT Reduction Factor (α): 0.92
No. of schedules at each kT (NS): 600

Figures 9-11 show the results of the computation. The algorithm was coded in the C language and the CPU times are for a SUN SPARCstation 10. Figure 9 shows the change in the performance index with the computation time. Figure 10 shows the effect of changes in the SA parameter NS. With small NS, the algorithm is more sensitive to different random seeds, but the time required is less, as expected. In Figure 10, all other parameters are the same as in Table 4 except NS. Figure 11 shows the final detailed schedule for this example. The storage units are not shown in the figure. It is difficult to say whether the schedule in Figure 11 is optimal. However, it is the best available at present, as we could not improve it by manually changing the processing orders on the basis of experience. When the same problem was solved without the modification to SA, the computation time was four times as long. This clearly indicates the impact of a small modification in the SA algorithm.
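The early-rejection device of section 3.6 amounts to drawing the Metropolis random number before the start-time calculation and converting it into a cutoff on the performance index. The sketch below illustrates the idea only; `simulate_schedule` stands for a hypothetical simulation routine of the kind described in section 3.4, assumed to return None as soon as its running performance index exceeds the supplied bound.

```python
import math
import random

def metropolis_step_with_cutoff(current_index, kT, simulate_schedule):
    """Draw the acceptance random number first, turn it into a critical performance
    index, and let the expensive schedule simulation abort early if it is exceeded."""
    p = 1.0 - random.random()        # in (0, 1], avoids log(0)
    # A candidate with index J is accepted when p < exp(-(J - current)/kT),
    # i.e. when J < current - kT * ln(p); use that bound as the cutoff.
    critical_index = current_index - kT * math.log(p)
    new_index = simulate_schedule(abort_above=critical_index)
    if new_index is not None and new_index <= critical_index:
        return new_index             # candidate accepted
    return None                      # candidate rejected (possibly aborted early)
```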
Figure 9. Change in performance index with CPU time.
Figure 10. Effect of changing NS on the SA solution: performance index against CPU time (days) for NS = 300, 600 and 1200.
4. CONCLUSIONS Our evaluation of SA suggests that it offers great promise for solving a variety of batch process scheduling problems. Its most attractive features are its simplicity and versatility. In this work, we applied essentially the same algorithm to three different (in size, scheduling objective, plant configuration,
Figure 11. The final SA schedule for the example: a Gantt chart of the filling, processing, discharging, cleaning and holding (waiting in the batch unit) operations on units 1-17 over the 42-day horizon.
operational constraints, etc.) scheduling problems, and SA proved the best in all of them. Albeit at the expense of greater computational effort, SA gave significantly better solutions than those offered by the various problem-specific scheduling heuristics. Finally, by proposing an efficient simulation algorithm and by modifying the probabilistic step in SA, we demonstrated that the computational demands of SA can be reduced, so that SA can be used even for real-life large scale scheduling problems with several constraints.
REFERENCES
Chaudhary, J. (1988), "Batch Plants Adapt to CPI's Flexible Game Plans," Chemical Engineering, 95, 2, 31.
Das, H., Cummings, P. T., and LeVan, M. D. (1990), "Scheduling of Serial Multiproduct Batch Processes via Simulated Annealing," Computers & Chemical Engineering, 14, 1351-1362.
Garey, M. R., Johnson, D. S., and Sethi, R. (1976), "Complexity of Flowshop and Jobshop Scheduling," Mathematics of Operations Research, 1, 117.
Hasebe, S., Murakami, Y., and Hashimoto, I. (1992), "A Flexible Simulation Algorithm for Scheduling of Batch/Semicontinuous Plants," American Institute of Chemical Engineers Annual Meeting, Miami Beach, November, Paper No. 136c.
Ku, H.-M., and Karimi, I. A. (1990a), "Completion Time Algorithms for Serial Multiproduct Batch Processes with Shared Storage," Computers & Chemical Engineering, 14, 1, 49-69.
Ku, H.-M., and Karimi, I. A. (1990b), "Scheduling in Serial Multiproduct Batch Processes with Due Date Penalties," Industrial & Engineering Chemistry Research, 29, 4, 580-590.
Ku, H.-M., and Karimi, I. A. (1991a), "An Evaluation of Simulated Annealing for Batch Process Scheduling," Industrial & Engineering Chemistry Research, 30, 1, 163-169.
Ku, H.-M., and Karimi, I. A. (1991b), "Scheduling Algorithms for Serial Multiproduct Batch Processes with Tardiness Penalties," Computers & Chemical Engineering, 15, 5, 283-286.
Ku, H.-M., Rajagopalan, D., and Karimi, I. A. (1987), "Scheduling in Batch Processes," Chemical Engineering Progress, 83, 8, 35-45.
Mah, R. S. H. (1990), Chemical Process Structures and Information Flows, Boston: Butterworths, 244-332.
Malone, M. (1989), "Batch Sequencing by Simulated Annealing," American Institute of Chemical Engineers Annual Meeting, San Francisco, November.
Musier, R. F. H., and Evans, L. B. (1990), "Batch Process Management," Chemical Engineering Progress, 86, 78.
Parakrama, R. (1985), "Improving Batch Chemical Processes," Chemical Engineering, Sept. 24.
Patel, A. N., Mah, R. S. H., and Karimi, I. A. (1991), "Preliminary Design of Multiproduct Noncontinuous Plants Using Simulated Annealing," Computers & Chemical Engineering, 15, 7, 451-469.
Rajagopalan, D., and Karimi, I. A. (1987), "Scheduling in Serial Mixed Storage Multiproduct Processes with Transfer and Set-up Times," Computer-Aided Process Operations, New York: CACHE/Elsevier, 679.
Rajagopalan, D., and Karimi, I. A. (1989), "Completion Times in Serial Mixed Storage Multiproduct Processes with Transfer and Set-up Times," Computers & Chemical Engineering, 13, 1/2, 175-186.
Reklaitis, G. V. (1990), "Progress and Issues in Computer Aided Batch Process Design," Foundations of Computer-Aided Process Design, CACHE/Elsevier, 241-275.
Adaption of Simulated Annealing to Chemical Optimization Problems, Ed. by J.H. Kalivas © 1995 Elsevier Science B.V. All rights reserved.
Chapter 9
Nuclear fuel management
G. T. Parks^a and D. J. Kropaczek^b
^a Cambridge University Engineering Department, Trumpington Street, Cambridge CB2 1PZ, United Kingdom
^b Department of Nuclear Engineering, North Carolina State University, Raleigh, NC 27695-7909, U.S.A.
1. INTRODUCTION The Pressurized Water Reactor (PWR) reload core optimization problem, though easily stated, is far from easily solved. The designer's task is to identify the arrangement of fresh and partially burnt fuel (fissile material) and burnable poisons (BPs) (control material) within the core which optimizes the performance of the reactor over that operating cycle (until it again requires refueling), while ensuring that various operational (safety) constraints are always satisfied.
Figure 1. A Typical PWR Loading Pattern (legend: 4.8 w/o fresh fuel with BPs; 4.4 w/o fresh fuel with BPs; 4.2 w/o once burned fuel; 3.8 w/o once burned fuel; 3.8 w/o twice burned fuel)

A typical PWR core contains, depending on its rated thermal power, anywhere from 121 to 241 fuel assemblies arranged with quarter core (reflective or rotational) symmetry, as shown in Figure 1. At each refueling between one third and one half of these assemblies may be replaced by an incoming (fresh) fuel region. In order to achieve improved fuel utilization, a given fuel region may be subdivided according to varying enrichments (w/o) of Uranium-235. It is common practice for (some of) the fresh fuel assemblies to carry BPs, of which there can be many different possible loadings. It is also usual to shuffle (some of) the burnt fuel assemblies remaining in order to improve the characteristics of the new core. This shuffling can entail the
exchange of corresponding assemblies between core quadrants, which is equivalent to changing the assembly 'orientations', or the exchange of different assemblies, which changes their locations and possibly their orientations also — see Figure 1. Thus, purely from an optimization viewpoint, the search for the best loading pattern (LP) represents a combinatorial problem of tremendous magnitude — there are approximately 10^43 possible LPs even if no BPs are used. Nuclear power reactors inherently have some highly nonlinear characteristics, which means that whatever objective function (there are several possibilities) and constraints are used to define and quantify acceptable LPs, some of these system variables will inevitably be nonlinear functions of the problem's control (decision) variables. Examples of such nonlinearities include the effects of local thermal and hydraulic feedback and the time dependence that results from radiation exposure (an accumulated history effect). Particularly with respect to the latter, the computational expense associated with analyzing a single LP solution can be substantial. When considered within the context of tens of thousands of solutions, as is often required by modern optimization routines, the CPU time becomes prohibitive.
Figure 2. Typical Fuel Assembly Exchanges (orientation changing; location changing; location and orientation changing)

For the fuel management problem, numerous local minima (or maxima) can be created at constraint boundaries. Such behavior is particularly common in large combinatorial problems. Of course, only one of these extrema represents the global optimum sought, and, hence, it is desirable that the optimization technique employed not be deceived by local minima. The rigorous licensing regulations governing the nuclear industry require that, before implementation, LPs must satisfy scrutiny using extremely detailed (and time consuming) computer codes. Clearly, it would be totally impractical to use these codes to assess potential LPs within an optimization routine. Thus, a long-standing difficulty facing engineers tackling the reload core optimization problem has been to find a means of modeling the core with sufficient accuracy for the results to be meaningful without expending prohibitive computational effort in calculating them. The unavoidable necessity of using computer-based core models introduces the further disadvantage that, in general, the derivative information required by most sophisticated optimization algorithms is not directly available. To evaluate these derivatives by finite difference techniques can be a very lengthy process when there are many control variables (as there are in realistic reload core problems) and is not always reliable in the face of highly nonlinear functions.
In combination these attributes:
• high combinatorial dimensionality
• nonlinear objectives and constraints
• multimodality (multiple maxima or minima)
• computationally intensive objective and constraint evaluations
• lack of direct derivative information
describe an extremely difficult optimization problem, which can be expected to defeat the vast majority of (if not all) available optimization library routines. Of course, progress in this field of research has always been inextricably linked to that in the field of computer technology. This is perhaps the main reason why, although substantial advances have been made in the last 30 years, investigators seeking to design optimal reload cores have, until now, been obliged to solve problems which are, to varying degrees, not true-to-life because of the simplifying assumptions required to facilitate solution. This, in turn, has reduced confidence in their results and led to a reluctance among utilities to apply optimization techniques in practice [1]. Utilities continue to employ design methodologies relying primarily on experience, which produce acceptable, but not optimal, LPs. Any attempt to tackle the PWR reload core optimization problem requires a combination of:
• an evaluation method of sufficient speed that tens of thousands of solutions can be examined in an overnight run on an engineering workstation and of sufficient accuracy that the results obtained are meaningful; and
• an optimization technique that does not require derivative information, is not deceived by local optima, is not upset by nonlinear functions, and can incorporate active constraints.
The code FORMOSA-P has been developed over a number of years at North Carolina State University [2-4] for the purpose of automating the process of determining the family of near optimum fuel and BP LPs, while taking into account, with a minimum of assumptions, the complexities of the reload design problem. FORMOSA-P couples the stochastic optimization technique of Simulated Annealing (SA) [5] with a computationally efficient neutronics solver based on second-order accurate, nodal generalized perturbation theory (GPT) [6-7] for evaluating core physics characteristics over the cycle.

2. LOADING PATTERN EVALUATION
2.1. Reference neutronics model
The wealth of previous work in the field has established that a two-dimensional (axially homogenized) reactor model using the two-group neutron diffusion equations calculates the core characteristics of interest with sufficient accuracy for PWR loading pattern optimization purposes. However, to evaluate every candidate pattern by solving these time (i.e. exposure) dependent, coupled, elliptic, eigenvalue partial differential equations directly would make any optimization routine prohibitively slow. Generalized perturbation theory provides a means of overcoming this limitation. GPT is a method of evaluating the effects of cross-section perturbations on quantities that can be formulated as integral responses, such as reactivity and power density. An initial requirement is an 'exact' solution of a reactor physics model for a reference core configuration. In FORMOSA-P the reference neutronics model is a two-dimensional Cartesian (x-y) geometry implementation of the nodal expansion method (NEM) to solve the two-group, steady-state neutron diffusion equation:

    A^NEM Φ = λ B Φ                                                    (1)
where Φ is the flux, λ the eigenvalue, A^NEM the removal operator and B the production (i.e. fission) operator. The principal characteristics of this polynomial nodal method are its quartic expansion of the one-dimensional transverse integrated flux and quadratic leakage model for the transverse leakages [8]. The solution algorithm is based on the nonlinear iterative NEM strategy [9], a key feature of which is the preservation of a coarse-mesh finite difference (CMFD) matrix structure. This is achieved by updating the CMFD diffusion coupling coefficients to force agreement with the node surface average current determined from a spatially decoupled NEM calculation spanning two adjoining nodes. Thus, the nonlinear operator of equation (1) may be written as:

    A^NEM = A^CMFD(Φ) + D^NEM(Φ)                                       (2)
Each two-group, two-node NEM problem produces a 16 by 16 linear system of equations which is reducible to a single 8 by 8 and two 4 by 4 systems of equations [10]. These correspond to the continuity of current and discontinuity of flux across adjoining node interfaces and the preservation of the 0th, 1st and 2nd flux moments within each node. By maintaining the CMFD matrix structure a conventional outer-inner iterative solution strategy may be employed, with periodic updating of the matrix structure performed to address local thermal and hydraulic and fission product feedbacks, soluble boron criticality search (if required), and NEM correction of diffusion coupling coefficients. Acceleration of the outer iterations is achieved through the use of eigenvalue shift [11], while the inner iterations are accelerated through the use of SOR with optimum relaxation parameter estimation [12]. Additional computational speed is obtained through the functionalization of tabular group constants to piecewise cubic spline polynomials, with automatic determination of region boundary knot values to minimize the least squares fit error [13]. The use of cubic splines has been shown to reduce the cross-section evaluation time by a factor of eight compared to a four point Lagrangian interpolation scheme, and is crucial for obtaining peak performance within the GPT neutronics solver. This reference neutronics model gives excellent agreement with nuclear design licensing codes, such as the Électricité de France code COCCINELLE (Version 2.2) [4], and therefore provides a suitable basis for the GPT model that is essential to the computational viability of the optimization process.

2.2. Generalized perturbation theory model
As FORMOSA-P offers the user a variety of objective function and constraint formulations from which to choose, including ones that require the evaluation of feed enrichment, local power peaking and discharge exposure, its GPT model must maintain accuracy with respect to an equivalent forward NEM solution for a wide range of LP perturbations. A change in a core response, such as power density, at a core location 'l' is related to the current estimates of the governing system operators, flux and eigenvalue through the following GPT functional:

    ΔR_l,p = −⟨ Γ_l*, (A_p^est − λ_p^est B_p^est) Φ_p^est ⟩            (3)
The superscript 'est' denotes an estimate for the condition 'p' of a perturbed LP, the subscript 'o' refers to the reference LP, and Γ_l* denotes the adjoint flux for the generalized core response in location 'l'. For linear operators, equation (3) is accurate to one higher order than the estimate of the flux and eigenvalue used in the functional. In such cases, linear superposition of single assembly contributions (e.g. flux changes) to the LP perturbation can be performed, thereby providing the first-order estimates required to obtain a second-order GPT response. This approach has been successfully extended to the estimation of the operator A^NEM of equation (2) for perturbations in the coupling coefficients of the nonlinear NEM strategy [7]. For the treatment of local thermal and hydraulic and fission product feedbacks, an additional second order accurate, nonlinear correction is applied in the evaluation of equation (3). This correction is obtained by direct substitution of the 'uncorrected' power density response of equation (3) into analytical sensitivities for the change in cross-section with local power [14]. As noted above, first-order accurate estimates of the core reactivity, flux, and nodal coupling coefficients corresponding to any given loading pattern may be expressed employing linear superposition as follows:

    λ_p^est = λ_o + Σ_{s=1}^{N} Δλ_s                                   (4)

    Φ_p^est = Φ_o + Σ_{s=1}^{N} ΔΦ_s                                   (5)

    D^est = D_o + Σ_{s=1}^{N} ΔD_s                                     (6)
All possible single assembly perturbations 's' to the reference LP are executed individually prior to the start of an optimization, and the changes in flux ΔΦ, eigenvalue Δλ and NEM coupling corrections ΔD are found using the exact NEM neutronics solver and stored for later use, as in equations (4)-(6). Once within the GPT neutronics solver during optimization, an estimate of the CMFD operator of equation (2) for a given perturbed LP is readily obtained by performing a single feedback update using the first-order, reconstructed flux results of equation (5). An estimate of the NEM operator, required for use in the GPT functional, is then obtained as follows:

    A^est = A^CMFD(Φ^est) + D^est                                      (7)
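The linear superposition of equations (4)-(6) can be illustrated with the short Python sketch below, which reconstructs first-order estimates of the eigenvalue, flux and coupling corrections for a trial LP from a library of precalculated single-assembly perturbation results. It is a minimal schematic under assumed data structures (the array shapes, the perturbation library and the function name are invented for illustration), not the FORMOSA-P implementation.

```python
import numpy as np

# Illustrative problem size: 2 energy groups x 160 nodes (quarter core symmetry)
NGROUPS, NNODES = 2, 160
rng = np.random.default_rng(0)

# Reference LP solution from the exact NEM solver (toy numbers)
reference = {
    "lam": 1.000,
    "flux": np.abs(rng.normal(1.0, 0.1, (NGROUPS, NNODES))),
    "dhat": np.zeros((NGROUPS, NNODES)),
}

# Hypothetical precalculated library: for each possible single-assembly
# perturbation 's', the exact changes in eigenvalue, flux and NEM coupling
# corrections relative to the reference LP (equations (4)-(6)).
perturbation_library = {
    s: {
        "dlam": rng.normal(0.0, 1e-3),
        "dflux": rng.normal(0.0, 1e-3, (NGROUPS, NNODES)),
        "ddhat": rng.normal(0.0, 1e-4, (NGROUPS, NNODES)),
    }
    for s in range(10)
}

def first_order_estimate(perturbations, ref, library):
    """Superpose the stored single-assembly changes on the reference solution
    to obtain first-order estimates for a perturbed LP (equations (4)-(6))."""
    lam = ref["lam"] + sum(library[s]["dlam"] for s in perturbations)
    flux = ref["flux"] + sum(library[s]["dflux"] for s in perturbations)
    dhat = ref["dhat"] + sum(library[s]["ddhat"] for s in perturbations)
    return lam, flux, dhat

# A trial LP described by the set of single-assembly perturbations it implies;
# the estimates would then feed the GPT functional of equation (3).
lam_est, flux_est, dhat_est = first_order_estimate([1, 4, 7], reference,
                                                   perturbation_library)
print(f"estimated eigenvalue = {lam_est:.5f}")
```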
Depletion is handled by solving for exposure using a forward difference operator, both for the single assembly perturbations and the second-order reconstructed power density responses. Since this approach yields group constants accurate through second-order in exposure, there is no need to consider sensitivities with respect to exposure explicitly within the GPT functional. Using this GPT model, the evaluation of a single LP is completed about 8 times faster than it would be using the reference forward nonlinear NEM solver [4]. Of course, the overhead of calculating the effects of all the single assembly perturbations and the requisite adjoint functions up-front does reduce this advantage somewhat, but, even so, the benefit gained in terms of computational speed is substantial — at the price of a reduction in the accuracy of (some of) the evaluations. The evaluation of 30 000 LPs for a typical problem with quarter core symmetry (160 nodes) and 8 exposure steps takes 4.8 hours on a DECstation-5000 (which gives a floating point performance of approximately 1.5 MFLOPS), with the GPT precalculations representing 15-20% of total CPU time. Recent results obtained for an IBM RS6000 (Model 375) showed actual turnaround times of approximately 80 minutes for the same problem in a multi-user environment. If the errors in the GPT evaluations are too large, the optimization search may be misled, so, in order to control their magnitude, a restriction is placed on the size of the reactivity change (relative to the reference LP) associated with any single assembly perturbation. A typical limit for assuring the accuracy of GPT is on the order of 18% Δk. This has the possible effect of precluding the placement of some assemblies in some locations, and, hence, not all configurations may be accessible from a single reference LP. In order to span the solution space, it may be necessary to execute multiple, successive simulated annealing 'cooling iterations', utilizing the optimal LP located in one cooling iteration as the reference LP for the next. However, as Section 3 will explain, the use of more than one cooling iteration may be desirable from a Simulated Annealing optimization viewpoint. The basic structure of FORMOSA-P is as shown in Figure 3.
Figure 3. The Basic Structure Of FORMOSA-P (flowchart: read in reference LP and control data; perform GPT precalculations; search for optimal LP using Simulated Annealing; if another cooling iteration is required, store the new reference LP and repeat; otherwise stop)
3. OPTIMIZATION METHODOLOGY
When considered as a technique for tackling realistic PWR reload core optimization problems with the characteristics described in Section 1, optimization by Simulated Annealing (OSA) has a number of attractive features:
• It is well established that OSA is a comparatively efficient method on problems with high dimensionality.
• Because they employ a random search technique, OSA routines do not require (or deduce) any functional derivative information. Thus, they are unaffected by the types of nonlinearity that cause problems for other optimization routines.
• Because of their ability to 'climb uphill', OSA routines can escape from the local minima of multimodal problems.
• Because they employ a random search technique, OSA routines identify families of near-optimum LPs that can be further examined by the reload core designer for 'soft' attributes and utilized to quantify the cost of constraint margin.

3.1. Solution generation
A key component in the optimization process itself is the generation of new solutions. For a PWR reload core problem, the control (decision) variables to be determined are:
• the fuel assembly to be loaded in each location,
• the BP loading with each assembly,
• the 'orientation' of each assembly.
These are manipulated to create new LPs by executing binary (A → B, B → A) or ternary (A → B, B → C, C → A) exchanges between randomly selected locations, the BP loadings and/or orientations of the assemblies in these locations being randomly changed simultaneously (a code sketch of this move generator is given below). However, not all assembly exchanges, BP loadings or orientations are necessarily permitted:
• Some are excluded in order to maintain the core symmetry specified — FORMOSA-P can handle every conceivable variant of full, half, quarter or eighth core symmetry.
• Some can be excluded in order to limit the search space (e.g. local versus global) and the magnitude of GPT errors generated in evaluating the LPs, as explained in Section 2.2.
• Some can be excluded explicitly by the user, who is able to fix assembly locations, BP loadings or orientations, if so desired.

3.2. Problem functions
A further advantage of OSA is that the algorithm structure allows easy transition between different objective function formulations. FORMOSA-P caters for the most commonly used PWR reload core objectives:
• cycle length maximization,
• radial power peaking (FΔH) minimization,
• region* average discharge exposure maximization,
• feed enrichment minimization.
In addition, various inequality constraints, which define acceptable (safe) operating conditions, can be specified. These set limits on:
• the maximum radial power peaking,
• the maximum octant power tilt,
• the maximum individual assembly discharge exposure,
• the maximum region average discharge exposure,
• the maximum BP loading in the core,
• the maximum hot full power (HFP) soluble boron (poison) level.
* A region denotes a grouping of fuel assemblies that was loaded at the same time.
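To make the solution-generation step of Section 3.1 concrete, the following Python sketch proposes a new candidate LP by a binary or ternary exchange between randomly selected locations, randomising the BP loading and orientation of the assemblies involved and respecting user-fixed locations. The data structures, the ternary-move fraction and the treatment of symmetry are simplifying assumptions for illustration only.

```python
import copy
import random

# A loading pattern here is simply one assembly record per core location.
# Locations listed in 'fixed' are never selected for exchange.
def make_pattern(n_locations, bp_options, orientations, fixed=()):
    return {
        "assemblies": [f"A{i:03d}" for i in range(n_locations)],
        "bp": [random.choice(bp_options) for _ in range(n_locations)],
        "orient": [random.choice(orientations) for _ in range(n_locations)],
        "fixed": set(fixed),
        "bp_options": bp_options,
        "orientations": orientations,
    }

def propose_exchange(lp, ternary_fraction=0.3):
    """Generate a candidate LP by a binary (A -> B, B -> A) or ternary
    (A -> B, B -> C, C -> A) exchange, with the BP loadings and orientations
    of the assemblies involved changed at random."""
    new = copy.deepcopy(lp)
    free = [i for i in range(len(lp["assemblies"])) if i not in lp["fixed"]]
    k = 3 if random.random() < ternary_fraction else 2
    locs = random.sample(free, k)
    # Cyclic shift of the assemblies among the chosen locations
    shifted = [new["assemblies"][locs[-1]]] + [new["assemblies"][i] for i in locs[:-1]]
    for loc, asm in zip(locs, shifted):
        new["assemblies"][loc] = asm
        new["bp"][loc] = random.choice(lp["bp_options"])
        new["orient"][loc] = random.choice(lp["orientations"])
    return new

random.seed(1)
lp0 = make_pattern(40, bp_options=[0, 4, 8, 12], orientations=list("NESW"), fixed=(0,))
lp1 = propose_exchange(lp0)
changed = [i for i in range(40) if lp0["assemblies"][i] != lp1["assemblies"][i]]
print("locations exchanged:", changed)
```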
Obviously, for some choices of objective function, not all these constraints are needed. One option for handling inequality constraints in an OSA routine is simply to reject any infeasible solutions generated. This is the approach used for all but the first two constraints noted above. Treating these two constraints in the same way may well prevent an optimal LP from ever being found, because the binary or ternary exchange of fuel assemblies with widely varying nuclear characteristics will inevitably create a power peaking problem. If such exchanges are always rejected, the routine will never get the opportunity to move other fuel with attributes that reduce peaking in the problem location. The solution is to construct an augmented objective function incorporating any violations of these constraints as a penalty function:

    f_A = f + γΘ                                                       (8)
where f_A and f represent the augmented and true objective functions, respectively, Θ is a positive-valued function that quantifies the extent of constraint violation, and γ is the penalty coefficient. The value of γ is gradually increased as the optimization proceeds, thus biasing the search progressively more heavily towards feasible space. By expanding the search space to include both feasible and infeasible solutions, a smoother path with more rapid convergence is obtained.
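A minimal sketch of the augmented objective of equation (8) follows. The particular constraint measure and the linear ramp applied to γ are illustrative assumptions; they are not the scheme actually coded in FORMOSA-P.

```python
def constraint_violation(fdh, tilt, fdh_limit=1.38, tilt_limit=1.02):
    """Theta in equation (8): a positive-valued measure of how far the
    power-peaking and octant-tilt limits are exceeded (zero if feasible).
    The limits used here are purely illustrative."""
    return max(0.0, fdh - fdh_limit) + max(0.0, tilt - tilt_limit)

def augmented_objective(f_true, fdh, tilt, gamma):
    """Equation (8): f_A = f + gamma * Theta."""
    return f_true + gamma * constraint_violation(fdh, tilt)

# The same infeasible LP becomes progressively less attractive as the penalty
# coefficient gamma is increased during the optimization.
for gamma in (10.0, 50.0, 200.0):
    print(gamma, augmented_objective(f_true=1.30, fdh=1.45, tilt=1.00, gamma=gamma))
```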
3.3. Annealing schedule
Of course, the path followed during the optimization search is also strongly dependent on the frequency with which 'uphill' moves are accepted according to the standard SA probabilistic criterion, written here as:

    P = exp(−Δf_A / T)                                                 (9)
where T is the relevant control parameter, which, by analogy with the algorithm's original application [15], is known as the system 'temperature'. An appropriate initial temperature T_0 is found by making a series of random changes to the starting (reference) LP, i.e. a search at infinite temperature is conducted. At the end of this search, the standard deviation σ_f of the distribution of the objective function values of the solutions accepted is calculated. The initial temperature is then calculated using White's formula [16]:

    T_0 = C_i σ_f                                                      (10)

in which C_i is a constant, the value of which depends on whether a global or local search is to be performed (see the next section for more details). During the optimization the temperature is held constant until either a given number of acceptances, L_tran, have been achieved or a given number of trials, L_chain, have been attempted, whichever comes first. The values of these parameters are given by:

    L_tran = ξ_i N_tot                                                 (11)

    L_chain = ς_i N_tot                                                (12)
where N_tot is the total number of possible single assembly perturbations, which is a good measure of the size of the search space. Again the values of the constants ξ_i and ς_i depend on the type of search to be performed. A new temperature is then calculated according to the formula:

    T_{n+1} = α T_n                                                    (13)

where α is either a user input constant or adaptively calculated using Huang's formula [17]:

    α = exp(−ν T_n / σ_f)                                              (14)

where ν is a constant, for which Huang recommends a value of 0.7. To avoid the temperature falling too fast, α is restricted to values greater than 0.5.
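The elements of this annealing schedule — the infinite-temperature sampling behind White's formula (10), the chain lengths (11)-(12), the acceptance test (9) and the Huang temperature decrement (13)-(14) — are gathered in the Python sketch below. The constants, the toy objective values and the use of a single fixed σ_f are illustrative assumptions; they do not reproduce the FORMOSA-P settings.

```python
import math
import random
import statistics

def accept(delta_fA, T):
    """Equation (9): always accept improvements; accept an uphill move with
    probability exp(-delta_fA / T)."""
    return delta_fA <= 0 or random.random() < math.exp(-delta_fA / T)

def initial_temperature(accepted_values, C_i=2.0):
    """Equation (10), White's formula: T0 = C_i * sigma_f, with sigma_f the
    standard deviation of objective values accepted at infinite temperature."""
    return C_i * statistics.pstdev(accepted_values)

def chain_lengths(N_tot, xi_i=1.0, zeta_i=4.0):
    """Equations (11)-(12): L_tran acceptances or L_chain trials per temperature.
    The constants xi_i and zeta_i here are arbitrary illustrative values."""
    return int(xi_i * N_tot), int(zeta_i * N_tot)

def next_temperature(T, sigma_f, nu=0.7, alpha_min=0.5, alpha_user=None):
    """Equations (13)-(14): T_{n+1} = alpha * T_n, with alpha from Huang's
    formula alpha = exp(-nu*T/sigma_f) unless a user value is supplied,
    and bounded below by 0.5."""
    alpha = alpha_user if alpha_user is not None else math.exp(-nu * T / sigma_f)
    return max(alpha, alpha_min) * T

# Toy usage: objective values from an 'infinite temperature' random search
random.seed(0)
samples = [random.gauss(1.40, 0.02) for _ in range(200)]
sigma_f = statistics.pstdev(samples)
T = initial_temperature(samples)
L_tran, L_chain = chain_lengths(N_tot=160)
print(f"T0 = {T:.4f}, L_tran = {L_tran}, L_chain = {L_chain}")
for n in range(5):
    print(f"stage {n}: T = {T:.4f}, uphill move of +0.01 accepted: {accept(0.01, T)}")
    T = next_temperature(T, sigma_f)
```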
3.4. Search strategies
As discussed above, the size of the search space and the accuracy of the LP evaluations depend on the size of the permitted reactivity change associated with any single assembly perturbation. Thus, the more wide-ranging a search is, the less accurate it will be. To take this into account, two search strategies are implemented within FORMOSA-P:
• Global search — a wide-ranging search, permitting quite large reactivity changes (Δk ≈ 0.15), designed to locate the neighborhood of the optimum.
• Local search — a concentrated, more accurate search (Δk ≈ 0.075) designed to converge onto the nearest optimum.
It is standard practice for global and local searches to be alternated, so that the search space is spanned in the manner shown schematically in Figure 4. An important difference between the annealing schedules for global and local searches, in addition to those outlined in the previous section, is that during global searches the temperature is held constant once the ratio of acceptances to total number of trials falls below a threshold of 30%. This ensures that the search is able to escape local minima and allows continued sampling over a greater portion of the search space. This contrasts with local searches, where the temperature continues to decrease until convergence is detected. In both types of search a manoeuvre known as a 'return to base' is executed if no progress is being made. During a return to base the current solution is replaced by the best solution encountered so far and the search is then resumed. A return to base is executed if a new best solution has not been encountered in ω_i N_tot trials. Global searches are terminated after a given number of trials:

    L_lngth = χ_i N_tot                                                (15)
Local searches are terminated when 'convergence' is detected, where convergence is defined to have occurred if the ratio of acceptances to trials falls below 10% and more than N_tot trials have passed since a new best solution was found, or, as a backstop, after L_lngth total trials. Tables 1 and 2 present results illustrating the effectiveness of the global and local search strategies. These results show average performances over 20 runs initiated from different initial conditions (i.e. a different reference LP and/or random number seed), on a sample power peaking minimization problem. Each run was performed with a global search followed by a local search. Also shown, for comparison purposes, are results obtained using a 'traditional' SA implementation (with a fixed value of α etc.). The global and local searches were performed both with and without the ternary exchange feature.
Figure 4. A Schematic Representation Of FORMOSA-P Searches (successive optima serve as reference LPs for successive searches of the search space)
For the global search algorithm it can be seen that, at the cost of a slight increase in the average number of trials per run M, there is a significant improvement in the quality of the solutions, i.e. a lower average objective function f̄, and in the consistency of performance, i.e. a lower standard deviation in the distribution of solutions σ_f̄. The use of ternary exchanges does not improve the average performance significantly but does further improve its consistency.

For the local search algorithm it can be seen that the average performance is improved both in terms of solution quality f̄ and run time, as measured by M. The use of ternary exchanges not only improves consistency, as measured by σ_f̄, but also solution quality. Similar performance improvements have been observed for other objective function formulations.

Table 1
A Comparison of Global Search Performances

Parameter   Traditional SA   Global search (no ternary exchanges)   Global search (with ternary exchanges)
f̄           1.3562           1.3421                                 1.3420
σ_f̄         0.0137           0.0102                                 0.0079
M           21177            23909                                  23979

Table 2
A Comparison of Local Search Performances

Parameter   Traditional SA   Local search (no ternary exchanges)    Local search (with ternary exchanges)
f̄           1.3028           1.2956                                 1.2931
σ_f̄         0.0130           0.0100                                 0.0074
M           30655            24512                                  25409
3.5. Archiving
Convergence onto the global minimum itself cannot, in general, be guaranteed with OSA, but empirical evidence shows that for many problems the algorithm converges successfully on a solution in the neighborhood of the global minimum, identifying a number of near optimum solutions in the process. In the present application, this feature of OSA proves to be a distinct advantage. Two different 'archiving' schemes are implemented within FORMOSA-P to exploit this ability to find many good solutions. The first, known as 'FΔH binning', saves the best solutions found with power peaking (FΔH) values in user-prescribed ranges. This enables the 'cost of margin' to be assessed without the need to perform multiple optimizations. For example, Figure 5 presents some of the information yielded by FΔH binning during a FORMOSA-P run on the problem of maximizing the end of cycle (EOC) critical soluble boron concentration (ppm), which is equivalent to maximizing the cycle energy production. The fact that the best solutions archived are all stored in the top two bins, and, indeed, are clustered towards the top of these bins, accords with expectations of a trade-off between cost and margin. The more random distribution of solutions in other bins can be attributed to less extensive searching of non-optimal regions by the algorithm.
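A minimal sketch of such an FΔH-binning archive is given below: each user-prescribed power-peaking bin retains the best (here, highest EOC boron) solution encountered with FΔH in that range. The bin edges, the solution records and the class interface are illustrative assumptions rather than the FORMOSA-P data structures.

```python
import bisect

class FdhBinArchive:
    """Keep the best solution seen in each user-prescribed FdH range."""
    def __init__(self, bin_edges):
        # e.g. edges [1.42, 1.44, 1.46, 1.48, 1.50] define four bins
        self.edges = sorted(bin_edges)
        self.best = {}                       # bin index -> (objective, solution)

    def offer(self, fdh, objective, solution):
        if not (self.edges[0] <= fdh <= self.edges[-1]):
            return
        i = bisect.bisect_right(self.edges, fdh) - 1
        i = min(i, len(self.edges) - 2)      # top edge belongs to the last bin
        if i not in self.best or objective > self.best[i][0]:
            self.best[i] = (objective, solution)

archive = FdhBinArchive([1.42, 1.44, 1.46, 1.48, 1.50])
archive.offer(fdh=1.431, objective=25.0, solution="LP-A")   # EOC boron in ppm
archive.offer(fdh=1.493, objective=38.0, solution="LP-B")
archive.offer(fdh=1.495, objective=39.5, solution="LP-C")   # better LP in the same bin replaces LP-B
for i, (obj, lp) in sorted(archive.best.items()):
    print(f"bin {i}: best EOC boron {obj} ppm from {lp}")
```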
Figure 5. A Typical Set of Solutions Stored Using FΔH Binning (EOC soluble boron concentration, ppm, plotted against power peaking FΔH; archived solutions are grouped into Bins 1-4 with lower bin limits from 1.42 to 1.48 and the FΔH limit at 1.50)
The second archiving scheme, known as 'dissimilarity archiving', records the best solutions that are dissimilar by a specified amount. The degree of dissimilarity between two LPs, X and Y, is defined in terms of their beginning of cycle (BOC) reactivity distributions:

    S_XY = (1/N) Σ_{i=1}^{N} { [k(c_Xi,0) − k(c_Yi,0)]² + [k(c_Xi,0) − k(c_Xi,b_Xi) − k(c_Yi,0) + k(c_Yi,b_Yi)]² }    (16)

where k(r,p) is the BOC reactivity value for fuel assembly r with BP loading p, c_Xi is the fuel assembly loaded in location i of pattern X, b_Xi is the BP loading in location i of pattern X, and N is the number of fuel assemblies in the core. The variation in the LPs archived gives some indication of how large and flat the region of optimality is, and, indeed, whether there is more than one region where good solutions are obtained. Figure 6 plots the objective function value (EOC soluble boron concentration) against dissimilarity relative to the best LP identified for the solutions archived on another FORMOSA-P run. This diagram shows that very few dissimilar LPs were found in the neighborhood of the optimum, indicating that the region of optimality is quite small. There is also evidence of another, very dissimilar, local optimum. The large number of inferior LPs archived confirms that the feasible space was widely traversed during optimization. The wide gaps in the distribution of solutions are due to intermediate solutions accepted during the search being infeasible and therefore not being archived.
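The dissimilarity measure of equation (16) can be sketched as a short Python function, assuming a simple lookup table of BOC reactivities k(r, p) indexed by assembly and BP loading. The names and the toy reactivity values are illustrative assumptions; the sign convention for the BP-worth term follows the reconstruction of equation (16) above.

```python
def dissimilarity(lp_x, lp_y, k):
    """Equation (16): average, over core locations, of the squared difference in
    unpoisoned BOC reactivity plus the squared difference in BP reactivity worth."""
    n = len(lp_x["assembly"])
    total = 0.0
    for i in range(n):
        cx, bx = lp_x["assembly"][i], lp_x["bp"][i]
        cy, by = lp_y["assembly"][i], lp_y["bp"][i]
        d_fuel = k[(cx, 0)] - k[(cy, 0)]
        d_bp = (k[(cx, 0)] - k[(cx, bx)]) - (k[(cy, 0)] - k[(cy, by)])
        total += d_fuel ** 2 + d_bp ** 2
    return total / n

# Toy reactivity table k(assembly, BP loading); BP insertion lowers reactivity
k = {("F1", 0): 1.25, ("F1", 8): 1.18,
     ("F2", 0): 1.10, ("F2", 8): 1.05,
     ("B1", 0): 1.02, ("B1", 8): 1.00}
X = {"assembly": ["F1", "F2", "B1"], "bp": [8, 0, 0]}
Y = {"assembly": ["F2", "F1", "B1"], "bp": [0, 8, 0]}
print(f"S_XY = {dissimilarity(X, Y, k):.5f}")
```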
Figure 6. A Typical Set of Solutions Stored Using Dissimilarity Archiving (EOC soluble boron concentration, ppm, plotted against dissimilarity ×1000 relative to the best solution)
Both of these archiving schemes provide the designer with a lot of information about the nature of the problem and with a family of good solutions (many of them near optimum), which can be subjected to further scrutiny, considering other more subtle core performance attributes, before an LP is finally selected.
4. APPLICATION RESULTS AND CONCLUSIONS
In practice, FORMOSA-P typically needs to examine on the order of 50 000 LPs over two OSA cooling iterations (alternating global and local searches) in order to solve realistic reload design problems. If the fuel cycle is modeled over 8 exposure steps, such a run takes about 3 hours to execute on an IBM RS6000 (Model 375). Figure 7 shows the best solution found by FORMOSA-P during a reanalysis of the fuel management strategy for a currently operating reactor. The objective was to minimize the fresh fuel enrichment. The active constraints included: cycle length, FΔH, assembly discharge exposure, HFP soluble boron concentration, and a restriction on the placement of BPs under control rod positions. Also shown is the actual LP loaded for the cycle, which was the initial starting pattern for the optimization performed.
Figure 7. Initial and Optimized Loading Patterns (initial loading pattern: 3.60 w/o fresh fuel; optimized loading pattern: 3.40 w/o fresh fuel; legend: fresh fuel with BPs, once burned fuel, twice burned fuel)
Figure 8. Peaking Factors Versus Cycle Exposure (FΔH for the reference and optimized LPs plotted against cycle exposure, 0-350 effective full power days)
The optimization reduced the fresh fuel enrichment required to satisfy the cycle energy requirements from 3.60 w/o to 3.40 w/o. As shown in Figure 8, the value of FΔH predicted by FORMOSA-P over the fuel cycle was also reduced from 1.41 to the constraint limit (user input) value of 1.38. The optimized fuel pattern achieves substantial improvements in fuel cycle economics while at the same time improving the margin to thermal operating limits. The progress of the OSA algorithm with respect to improving the objective function and satisfying the FΔH (penalty) constraint limit is displayed, respectively, in Figures 9 and 10 for the first 16 000 OSA solutions (histories). The results shown were achieved during the global search portion of the algorithm. From Figure 9, it is observed that the global search exhibits non-monotonic behavior consistent with the 'freezing' of the OSA temperature once the acceptance ratio has dropped below the prescribed limit (at approximately history 6 000). In contrast to the local search option, which attempts to converge on the true objective function minimum, the global search seeks to archive a broad spectrum of good solutions of equal merit. Also note that feasibility with respect to the power peaking constraint limit is not achieved until nearly half of the total number of histories within the global search have been examined. This behavior is consistent with the penalty function implementation for treating constraint limits. In conclusion, Simulated Annealing as implemented in the code FORMOSA-P enables in-core PWR fuel management problems to be solved within a reasonable computational time frame, offering the designer, through the archiving schemes, a selection of near-optimum solutions from which to choose and considerable insight into the individual characteristics of each problem.
Figure 9. Objective Function Behavior (Global Search): objective function value plotted against history number, 0-16 000
Figure 10. Penalty Constraint Behavior (Global Search): FΔH plotted against history number, 0-16 000
ACKNOWLEDGEMENTS This work was supported by the North Carolina State University Electric Power Research Center (EPRC), a membership consortium of nuclear vendors, utilities and national laboratories.
REFERENCES
1. T.J. Downar and A. Sesonske, Adv. Nucl. Sci. Tech. 20 (1988) 71.
2. G.H. Hobson and P.J. Turinsky, Nucl. Tech. 74 (1986) 5.
3. D.J. Kropaczek and P.J. Turinsky, Nucl. Tech. 95 (1991) 9.
4. D.J. Kropaczek, P.J. Turinsky, G.T. Parks and G.I. Maldonado, Reactor Physics and Reactor Computations (Edited by Y. Ronen and E. Elias), Ben Gurion University of the Negev Press (1994) 572.
5. S. Kirkpatrick, C.D. Gelatt, Jr., and M.P. Vecchi, Science 220 (1983) 671.
6. D.J. Kropaczek and P.J. Turinsky, Trans. Am. Nucl. Soc. 61 (1990) 362.
7. G.I. Maldonado, P.J. Turinsky and D.J. Kropaczek, Proc. Joint Int. Conf. Mathematical Methods and Supercomputing in Nuclear Applications, 1 (1993) 787.
8. R.D. Lawrence, Prog. Nucl. Energy 17 (1986) 271.
9. K.S. Smith, Trans. Am. Nucl. Soc. 44 (1983) 265.
10. P.R. Engrand, G.I. Maldonado, R.M. Al-Chalabi and P.J. Turinsky, Trans. Am. Nucl. Soc. 65 (1992) 221.
11. E.L. Wachspress, Iterative Solution to Elliptic Systems and Applications to the Neutron Diffusion Equations of Reactor Physics, Prentice-Hall, Englewood Cliffs, NJ (1966).
12. R.S. Varga, Matrix Iterative Analysis, Prentice-Hall, Englewood Cliffs, NJ (1962).
13. A.M. Yacout, R.P. Gardner and K. Verghese, Nucl. Instr. Meth. Phys. Res. 220 (1984) 461.
14. G.I. Maldonado, P.J. Turinsky and D.J. Kropaczek, Trans. Am. Nucl. Soc. 68 (1993) 218.
15. N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller and E. Teller, J. Chem. Phys. 21 (1953) 1087.
16. S.R. White, Proc. IEEE Int. Conf. Computer Design (1984) 646.
17. M.D. Huang, F. Romeo and A. Sangiovanni-Vincentelli, Proc. IEEE Int. Conf. Computer Aided Design (1986) 381.

GLOSSARY
b_Xi    the burnable poison loading in location i of pattern X
c_Xi    the fuel assembly loaded in location i of pattern X
f       the objective function
f̄       the average objective function value over several optimization runs
k       reactivity
P       the acceptance probability
A       the neutron removal operator in the neutron diffusion equation
B       the neutron production operator in the neutron diffusion equation
D       the nodal coupling coefficients in the neutron diffusion equation
FΔH     a measure of radial power peaking
L       a number of trials
M       the total number of trials in a search
N       the number of fuel assembly locations in the core
N_tot   the total number of possible single assembly perturbations
R       a generalized core response
T       the system temperature
α       the temperature decrement parameter
γ       the penalty function coefficient
δ       a measure of dissimilarity between loading patterns
C_i     a constant used in the calculation of an initial system temperature
λ       the eigenvalue of the neutron diffusion equation
ξ_i     a constant used in the control of the annealing schedule
σ_f     the standard deviation of objective function values accepted
σ_f̄     the standard deviation of objective function values averaged over multiple runs
ς_i     a constant used in the control of the annealing schedule
ν       a constant used in the calculation of the temperature decrement parameter
Φ       the neutron flux
ω_i     a constant used in the control of the search pattern
χ_i     a constant used in the control of the search pattern
Γ_l*    the adjoint flux for the generalized core response in core location 'l'
Θ       a positive-valued function quantifying constraint violation
Adaption of Simulated Annealing to Chemical Optimization Problems, Ed. by J.H. Kalivas © 1995 Elsevier Science B.V. All rights reserved.
Chapter 10
Design of cost-effective emission control strategies for pollutants

R.G. Derwent

Atmospheric Processes Research Branch, Meteorological Office, Bracknell, Berkshire, RG12 2SZ, United Kingdom
1. INTRODUCTION
Over the last several decades, Europe has become increasingly aware that the air and rain over its continental areas have become grossly polluted with acidic sulphur compounds from the burning of fossil fuels [1]. Human activities on a global scale now add roughly the same quantities of sulphur compounds to the atmosphere as natural sources. However, most man-made sources are concentrated in densely populated and highly industrialised areas such as Europe. Natural sulphur emissions in Europe have been overwhelmed, by about an order of magnitude, by man-made sources [2]. The concentrations of sulphur compounds in air and rain are many times greater than they were in preindustrial times [3] and they exceed by a wide margin the levels at which deleterious environmental effects begin to appear. Sulphur dioxide emissions in Europe are highest in a broad region stretching from the United Kingdom in the west, through the Low Countries and Germany, to Poland and eastern Europe [4]. Once emitted into the atmosphere the sulphur dioxide travels with the prevailing winds. Some of the sulphur dioxide is absorbed by the earth's surface and the vegetation growing upon it. This is the process of dry deposition and it is the main removal process for sulphur dioxide, accounting for about two-thirds of the total. It has a timescale of several days, which limits the geographical scale of pollution impacts from sulphur dioxide itself to within several hundred kilometres of its point of emission. Sulphur dioxide takes part in chemical reactions with substances naturally present in the atmosphere and with other pollutants, some of them driven by sunlight and others by the presence of cloud droplets. The end product of the oxidation of sulphur dioxide is sulphuric acid, together with ammonium sulphate, in the form of suspended particles. These sulphur particles, known collectively as sulphate aerosol, tend not to be removed particularly efficiently by dry deposition and have lifetimes limited only by scavenging during rain events. Sulphate aerosols may have lifetimes of up to 10 days and may travel hundreds or thousands of kilometres before encountering rain. The capture of sulphate aerosol by rain leads to the process of wet deposition and this process accounts for the remaining one third of the total removal of sulphur species. By whatever route, whether wet or dry deposition, ultimately all the sulphur dioxide emitted into the atmosphere comes back to the earth's surface in one form
or another. The location of the deposition will however depend on the prevailing wind directions and the patterns of the rain events in the downwind environment. In Europe, sulphur deposition does not necessarily occur in the same country as emissions and long range transboundary pollutant transport is an important policy issue [5]. Fluctuations in the weather however mean that imports and exports of sulphur deposition change from one day to the next. As the airstreams pass over high emission regions they become loaded with sulphur dioxide and acidic sulphur compounds which can, given particular weather patterns, lead to episodes of high deposition in remote environments. Fluctuations in deposition can occur over longer periods following the influence of prolonged dry and wet periods, cold or warm and wet or dry years. The dry and wet deposition of acidic sulphur compounds arrives on the soil and on the vegetation. Most rain occurs over land surfaces and washes the pollutants absorbed on vegetation through the soil into freshwater systems. As the rain passes through the soil it is affected by many chemical reactions which can change its composition before it reaches a head-water stream or upland lake. These chemical reactions can be driven by the increased deposition of acidic sulphur compounds and can lead to the acidification of the soils and the freshwaters. Acidified soils are not as productive as well-buffered soils and have depleted communities of flora and fauna. Acidified freshwaters have depleted fisheries and populations of insects, amphibians, mammals and birds. Understanding of the processes involved in the long range transport and deposition of acidic sulphur compounds has expanded rapidly. It is now realised that other pollutants, such as the oxides of nitrogen and ammonia, also contribute to the acidification of remote, sensitive environments. Oxides of nitrogen are emitted from motor vehicles as well as from fuel combustion in power stations. Oxides of nitrogen are oxidised to nitric acid and nitrate aerosols which can undergo long range transport and deposition. Ammonia is emitted mainly from agriculture and once in the atmosphere it can react with the acidic sulphur and nitrogen species to form ammonium aerosols. These also undergo long range transport and deposition. In remote environments the deposition of the oxidised nitrogen and ammonia compounds can trigger soil and freshwater acidification, though the mechanisms have been much more difficult to unravel compared with those involving sulphur compounds. At low deposition levels, of course, nitrogen deposition can exert a beneficial or fertilising impact on semi-natural ecosystems, in contrast with sulphur deposition. In an attempt to reverse the environmental deterioration driven by the long range transport of sulphur and nitrogen compounds, the countries of Europe have agreed an international convention on Long Range Transboundary Air Pollution under the aegis of the United Nations Economic Commission for Europe [6]. Protocols to this convention address sulphur dioxide emissions and their control, together with those of nitrogen oxides and volatile organic compounds. In agreeing to combat environmental acidification, countries have been careful to agree to act in concert to control pollution emissions and in ways which are seen to be efficient and cost-effective. The search is for strategies which deliver maximum environmental benefit for the minimum investment in expensive pollution
abatement technologies. In general terms, most pollution control strategies are technology-based. Engineering considerations are used to define the best available technical means for pollution control and regulation sees that these are implemented. Monitoring is not necessarily carried out to check whether the anticipated environmental improvement is actually delivered or not. The Protocols to the UNECE convention attempt to implement effects-based strategies [7]. Environmental understanding is used to define the capacity available in the environment to accept pollution loads and the aim of policy is to move towards the situation in which these capacities are no longer exceeded. There is an infinite range of strategies with this aim in mind but only a limited number will achieve this aim at least cost. The effects-based approaches to acid rain abatement require a large amount of information about emissions, long range transport and deposition, environmental impacts, emission controls and their costs. Much of this information has been collected together into the integrated assessment models which provide the framework for the analysis of the policy options. The aim of optimising a strategy to maximise benefit whilst minimising cost, subject to constraints, is a familiar problem in applied mathematics. With the acid rain debate, there are a very large number of possibilities which cannot be analysed exhaustively. An efficient optimisation algorithm is required so that the amount of simplification needed to render the problem tractable can be limited. This is where optimisation by simulated annealing [8-10] has come into its own, to identify optimal solutions, that is solutions which are close to the theoretical limit of cost efficiency. Here cost efficiency or effectiveness is measured by the reduction in environmental impacts divided by the investment in pollution controls. It is generally considered that the theoretical optimum solution would be so fragile an edifice that it would be impossible to guarantee the agreement of all the countries of Europe to it. Optimisation by simulated annealing has allowed us to develop a conceptual framework in which optimal strategies can be identified [11,12]. It has been important to learn what makes good strategies optimal and the rules have been more useful than the solutions themselves. These rules, once learnt, will ensure that the Protocols negotiated between the countries of Europe move in the direction of the optimum balance between environmental protection and economic development.
2. AN OPTIMAL STRATEGY FOR RETROFIT FLUE GAS DESULPHURISATION IN THE UNITED KINGDOM
2.1. The nature of the problem
The processes which are involved in environmental acidification in Europe are becoming better understood. It is now clearly recognised that the United Kingdom is a major source of sulphur dioxide and must accept some responsibility for the increased burdens of sulphur compounds in both air and rain over Europe [13]. In the long term, United Kingdom sulphur dioxide emissions can be reduced by investment in a range of low-polluting technologies including nuclear, hydro and
renewable electricity generation. In the medium term, say up to the year 2010, United Kingdom sulphur dioxide emissions can only be reduced significantly by the retrofitting of flue gas desulphurisation (FGD) to the existing large coal-fired electricity generating stations or by switching to combined cycle natural gas-fired power stations. In the formulation of a programme in which the existing large coal-fired power stations are retrofitted with flue gas desulphurisation, many practical engineering constraints and considerations must be taken into account. Some of these will be of overriding importance and may dictate the order in which some of the existing United Kingdom power stations are tackled. Nevertheless, there may be good environmental protection reasons for retrofitting particular power stations before others in an optimal strategy. In this section, techniques are developed which allow the formulation of a strategy for retrofitting flue gas desulphurisation which maximises environmental benefit. The adopted approach involves a number of stages:
* the selection of sensitive ecosystems, receptor sites or catchment areas,
* establishing critical deposition loads for these receptor sites,
* attribution of the sources of the sulphur deposition at each site,
* finding the optimal strategy for reducing sulphur dioxide emissions which gives the greatest reductions in deposition at the receptor sites.
2.2. Selecting sensitive receptor sites
In the analysis of environmental acidification from sulphur dioxide emissions, an important first step is the selection of possible locations for receptor sites which may be particularly sensitive to the deposition of acidic sulphur species. Table 1 contains the locations of arbitrary receptor sites which have been chosen to reflect the contribution from sulphur deposition to environmental damage caused by the acidification of soils, surface waters and freshwater ecosystems [14].
Table 1
Illustrative receptor sites which may be sensitive to sulphur deposition.

receptor site          location
central Wales          52°20'N  2°40'W
south west Scotland    54°40'N  5°00'W
southern Norway        58°00'N  6°40'E
south west Sweden      57°40'N 12°20'E
The choice of sites is purely illustrative and made solely to facilitate the analysis. In the final analysis, the choice of ecosystems to be protected and the extent of protection to be given are for society to decide.
2.3. Critical loads for sulphur deposition
Having established the locations of the sensitive receptors, the next step is to quantify their sensitivity. This involves answering questions about the nature of the pollutant causing the damage and both the damage and deposition mechanisms. At this stage, the concept of critical loads is brought into the analysis to provide the required quantification of the deposition-damage relationship. The critical load for a particular deposition-damage combination is defined as the highest deposition load that a particular ecosystem can tolerate without long term damage occurring. Critical loads are being developed as deposition standards, acceptable to the scientific community, and subject to continuous verification and adjustment in the light of new scientific information and changing circumstances [7]. Critical loads data are usually presented as maps showing the sensitivity of receptors present at each location or as a percentile in the frequency distribution of the receptors present within a particular grid square. Critical loads maps are available for the United Kingdom [15] for the acidification of soils and surface waters by sulphur and nitrogen compounds and for Europe [16] for the acidification of soils. Based on the critical loads maps for the acidification of UK soils and freshwaters by the deposition of acidic sulphur compounds, critical loads of 5 kg S/ha/yr have been adopted to reflect the contribution to environmental damage caused by sulphur deposition in the regions with sensitive geologies selected in Table 1. These estimates are purely illustrative and have been developed to facilitate the analysis. In all cases it is assumed that both wet and dry sulphur deposition are equally damaging and henceforward the term total sulphur deposition refers to their sum.
2.4. Attribution of sulphur deposition
A crucial step in the design of rational sulphur abatement strategies is an ability to attribute sulphur deposition at a particular sensitive receptor site to its emission sources. Currently this is only possible with some form of theoretical model approach which addresses the known processes of emission, transport, transformation and deposition. Here, a simple trajectory model approach has been adopted to relate the total deposition of oxidised sulphur, oxidised and reduced nitrogen species to the respective emissions of sulphur dioxide, nitric oxide and ammonia. An air parcel extending from the ground surface to the top of the boundary layer is advected by the wind over the emissions grid to reach the arrival or receptor site. The pollutants within the air parcel undergo chemical transformations and are removed by dry and wet deposition. Full details of the model are given elsewhere [17].
The trajectory model was used to determine the deposition pattern away from a source of sulphur dioxide with unit emission. This pattern defines the source-receptor relationship for sulphur, see equation (1):

    TSD = (E/100) exp(3.07 − 3.99R + 1.57R^2 − 0.366R^3 + 0.048R^4 − 0.0034R^5 + 0.0000097R^6)    (1)
where TSD is the total sulphur deposition in kg S/ha/yr, E is the sulphur dioxide emission in ktonnes SO2/yr, and R is the downwind distance in km divided by 150 km. Deposition loads at particular receptor sites are calculated by overlaying the deposition contributions from all the sources within the emissions grid, using the above source-receptor relationship. Pollutant emissions on a 150 km × 150 km grid are available for the entire European continent for sulphur dioxide, oxides of nitrogen, ammonia and volatile organic compounds through the UN ECE EMEP programme for each year from 1985-1993 [18].
2.5. Finding optimal strategies
The aim of the optimisation is to determine the spatial pattern of retrofit flue gas desulphurisation (FGD) which, for a given total installed capacity of abatement, minimises the magnitude of the difference between the deposition loads and the critical loads for total sulphur deposition at the receptor sites. Such differences between deposition loads and critical loads are termed critical load exceedences. For a near continuous distribution of emission controls at about 50 power stations and 11 receptor sites, there are a large number of possible strategies to work through in an exhaustive analysis. The problem was solved using optimisation by simulated annealing [8-10], a specialised iterative improvement technique; a schematic implementation is sketched at the end of Section 2.6. In the initial application of simulated annealing, the baseline, uncontrolled emission inventory was set up and an initial configuration of retrofit FGD technologies was established. The given total installed capacity of FGD retrofit was divided up into a number of equal-sized discrete components which were free to move independently and at random, executing a 'drunkard's walk' over the emissions grid. At any point in the optimisation, the actual SO2 emissions from a particular grid element were therefore the net of the baseline emissions and the total of the discrete components of FGD which were located in that grid square. The sulphur deposition and critical load exceedences at the receptor points were then calculated from the new emissions field and the source-receptor relationship. A rearrangement was applied to the spatial configuration of the discrete retrofit flue gas desulphurisation technologies and the new sulphur dioxide emission field was calculated. If the rearrangement decreased the critical load exceedence at the receptor sites, then it was kept and the process continued. These steps were repeated until no further improvements in critical load exceedence could be found. Eventually, the improvements get harder to find and the method gets stuck in a minimum which is generally a local minimum and not the required global minimum. Using the Metropolis algorithm [19], controlled uphill steps were incorporated
in the search for better solutions. Initially in each optimisation, many uphill steps are accepted but progressively through the optimisation, fewer and fewer uphill steps are accepted. This progression was controlled throughout the optimisation using a 'temperature' parameter. The 'temperature' cycle of the optimisation contained a 'hot soak' followed by a 'cooling curve'. Eventually, when all uphill and downhill steps cease, the configuration of the retrofit FGD technologies 'freezes' and the spatial pattern is optimised.
2.6. Optimisation by simulated annealing and retrofit flue gas desulphurisation strategies
To illustrate the application of simulated annealing to the design of optimal retrofit flue gas desulphurisation (FGD) strategies, the case of a single, sensitive receptor site in central Wales was first taken. The baseline total sulphur deposition was found to be 12.5 kg S/ha/yr, giving a critical load exceedence of 7.5 kg S/ha/yr. In steps of 100 ktonnes S/yr (200 ktonnes SO2/yr), progressively more total installed capacity of retrofit FGD abatement was made available in discrete 10 ktonnes SO2/yr increments. Each discrete increment was placed by the optimisation algorithm where it could produce the greatest reduction in deposition at the one receptor site without reducing any of the points in the new sulphur dioxide emission inventory below zero. As the steps of 100 ktonnes S/yr built up, critical load exceedence at the Welsh receptor site declined monotonically, as expected. The plot of critical load exceedence against cumulative installed retrofit capacity exhibited a marked effect of "diminishing returns". That is, the decrement in critical load exceedence with each additional 100 ktonnes S/yr tranche of retrofit FGD capacity declined progressively. From the point of view of this illustrative study, the locations for siting the retrofit FGD capacity are of most interest. The analysis shows quite straightforwardly that there is an optimal strategy which maximises benefit to the United Kingdom. Such an optimal strategy for a sensitive site in central Wales would entail installing retrofit FGD capacity with highest priority in Wales, then in the west Midlands, the south west and finally, at lowest priority, in the east Midlands. This same procedure was repeated for an illustrative site in south west Scotland, see Table 1. The order of priority as revealed by simulated annealing is found to be quite different from that for the receptor site in central Wales. In the optimal strategy for south west Scotland, retrofit FGD capacity is installed preferentially in Northern Ireland, south east Scotland and north west England. None of these conclusions about optimal strategies were particularly counterintuitive or different from those which would have been generated from an exhaustive search of the large, but not unreasonably large, number of possibilities. They served to show that optimisation by simulated annealing could find reliable solutions in cases which could be checked manually. Furthermore they showed that a flexible representation of control strategies could be built to handle all the features of the acid rain problem in Europe. This flexible representation was then extended to tackle more of the complexities of the situation which would take the
analysis out of reach of the exhaustive methods.
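To make the procedure just described concrete, the following is a minimal Python sketch and not the authors' code: the emission inventory, receptor coordinates, critical load, grid geometry and the deposition_per_unit_emission() stand-in for the source-receptor relationship are all illustrative assumptions, and random relocation of the FGD units is a simplification of the 'drunkard's walk' described above.

```python
import math
import random

def deposition_per_unit_emission(distance_km):
    """Assumed stand-in for the source-receptor relationship (kg S/ha/yr per
    ktonne SO2/yr); it is not the chapter's fitted expression."""
    r = max(distance_km, 10.0) / 150.0
    return 0.1 * math.exp(-r) / r

def exceedence(emissions, receptors, critical_load, cell_km=150.0):
    """Sum of positive (deposition - critical load) over all receptor sites.
    `emissions` maps grid coordinates (gx, gy) to ktonnes SO2/yr."""
    total = 0.0
    for rx, ry in receptors:
        dep = sum(e * deposition_per_unit_emission(
                      cell_km * math.hypot(gx - rx, gy - ry))
                  for (gx, gy), e in emissions.items())
        total += max(dep - critical_load, 0.0)
    return total

def anneal_fgd(baseline, receptors, critical_load, n_units, unit_size,
               n_steps=20000, t_start=1.0, t_end=1e-3):
    cells = list(baseline)
    units = [random.choice(cells) for _ in range(n_units)]   # initial configuration

    def net_emissions(units):
        net = dict(baseline)
        for cell in units:
            net[cell] = max(net[cell] - unit_size, 0.0)      # never below zero
        return net

    cost = exceedence(net_emissions(units), receptors, critical_load)
    for step in range(n_steps):
        temp = t_start * (t_end / t_start) ** (step / n_steps)   # cooling curve
        i = random.randrange(n_units)
        old = units[i]
        units[i] = random.choice(cells)                      # random rearrangement
        new_cost = exceedence(net_emissions(units), receptors, critical_load)
        if new_cost <= cost or random.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost                                  # keep (Metropolis rule)
        else:
            units[i] = old                                   # reject and restore
    return units, cost
```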
3. RETROFIT FGD STRATEGIES AND LONG RANGE TRANSBOUNDARY TRANSPORT
3.1. Long range transport from the United Kingdom to Scandinavia
Using the illustrative sensitive sites in southern Norway and south west Sweden from Table 1, simple optimal strategies can be developed with the methodology described in the previous section. Optimal strategies found with simulated annealing inevitably involve the installation of sulphur abatement technologies preferentially in Norway, Sweden and Denmark. As the extent of reduction in total sulphur deposition is increased, the geographical regions in which reductions in SO2 emissions are required spread further away from the receptor site. For southern Norway, some small amount of SO2 retrofit capacity is found in optimal strategies which seek relatively modest reductions in total sulphur deposition. Reducing total deposition from a baseline figure of 5.67 kg S/ha/yr down to about 5.0 kg S/ha/yr may not necessarily require significant SO2 emissions abatement in the United Kingdom, whereas reducing total sulphur deposition much below 5 kg S/ha/yr appears to entail the installation of significant retrofit FGD capacity in the United Kingdom. For south west Sweden, the picture is somewhat different. In all optimal strategies, retrofit FGD capacity is installed locally at first and then progressively further away as further reductions in deposition are sought. Reducing total deposition from a baseline figure of 9.42 kg S/ha/yr down to about 5.4 kg S/ha/yr can be achieved in optimal strategies without requiring any significant emissions reductions from United Kingdom sources. To reduce deposition at both Scandinavian receptor sites below the 5 kg S/ha/yr illustrative critical loads, the orders of priority for the retrofitting of FGD to United Kingdom coal-fired power stations are similar and entail retrofitting FGD capacity in Scotland and north east England. This pattern of retrofit is distinctly different from that required to meet critical loads in south west Scotland and central Wales.
3.2. Optimal control strategies and pollution control technology costs
The above optimisation studies using fixed and constant abatement costs in $ per tonne SO2 are useful for exploring spatial trade-offs across Europe, but they are far from realistic. A next step is to consider a range of pollutant control technologies with different abatement costs. An optimal strategy would consider deploying a range of technologies over Europe and reduce deposition below critical or target levels for the minimum total expenditure on all control technologies. This is an exceedingly detailed and complex task to perform with any realism. However, by making some simplifying assumptions, it is possible to learn some robust rules about technology substitution on the European scale. If attention is directed to the long range transport scale, then the main policy issue becomes the transboundary
transport of acidic sulphur species and the circumstances in which SO2 control technologies such as retrofit FGD are installed in one country to reduce acidic deposition in another. The question of a range of SO2 pollution control technologies with different unit costs has been studied using optimisation by simulated annealing. The range of technologies has been represented by assuming two types of pollution sources, 'low cost' and 'high cost'. The difference between these two source types depends on the cost of control technology required to achieve a unit tonne S reduction in emissions. European scale emission inventories are required for each pollution source, both resolved spatially over the model domain. For a fixed total investment cost, different installed capacities of 'low cost' and 'high cost' control technologies could be installed. The object is to use simulated annealing to study the trade-off between the 'high cost' and 'low cost' control technologies. In formulating an optimal strategy, the starting point would seem to be to install abatement technologies on both the 'low cost' and 'high cost' sources as close to the sensitive receptor point as possible at first and then at progressively increasing distance until each capacity limit is reached. The question would then be about the merits of exchanging x tonnes of abatement capacity of the 'high cost' technology for px tonnes of the 'low cost' technology, where p>1 and where p represents the ratio of the respective technology costs in $ per tonne SO2 abated. The situation would inevitably arise where the x tonnes of the 'high cost' technology may be in one country and the px tonnes of the 'low cost' technology may be at a significant distance away in another country. Optimisation by simulated annealing shows how the trade-off between 'high cost' and 'low cost' technologies may work in the simple case of one sensitive receptor site in south west Sweden. For 'high cost' technologies close in to the sensitive receptor to be efficiently traded against 'low cost' technologies in a more distant country, the cost differential factor, p, must be high enough to counter the declining deposition per unit emission with increasing distance. Experience with pollution control costs shows that this is unlikely to be the case and that sulphur abatement technologies span only about an order of magnitude in cost per tonne S removed. By contrast, deposition per unit emission may well span two orders of magnitude over the scale of Europe. On the long range transport scale, the costs of optimal strategies are therefore more likely to be influenced by the spatial deployment of the control technologies than by the costs of the individual abatement technologies deployed. This is because the cost differential factors which would be required to drive the substitution of expensive abatement technology in one country with cheap abatement technology in another appear to be large compared with the expected range of technology costs. The further apart the expensive and cheap abatement opportunities are, the more significant long range transport becomes, and the less cost-effective technology substitution becomes.
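The substitution argument above can be stated as a one-line criterion. In the hedged sketch below, d(r) is a purely illustrative deposition-per-unit-emission function with roughly the fall-off discussed in the text, and the function names are ours rather than the chapter's: swapping x tonnes of 'high cost' abatement at distance r_high for px tonnes of 'low cost' abatement at distance r_low does not worsen deposition at the receptor only if p·d(r_low) ≥ d(r_high).

```python
import math

def d(r_km):
    """Illustrative deposition per unit emission (kg S/ha/yr per ktonne SO2/yr)."""
    r = max(r_km, 10.0) / 150.0
    return 0.1 * math.exp(-r) / r

def substitution_worthwhile(p, r_high_km, r_low_km):
    """True if p*x tonnes of 'low cost' abatement at r_low delivers at least as
    much deposition benefit as x tonnes of 'high cost' abatement at r_high."""
    return p * d(r_low_km) >= d(r_high_km)

# Example: a cost ratio of 10 cannot compensate for moving abatement from
# 150 km to 1500 km away when d() falls by far more than a factor of 10.
print(substitution_worthwhile(p=10.0, r_high_km=150.0, r_low_km=1500.0))
```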
4. COMBATTING THE LONG RANGE TRANSPORT AND DEPOSITION OF NITROGEN SPECIES IN EUROPE
4.1. NOx abatement strategies
Here the geographical development of NOx abatement strategies across Europe is considered, following on from previous sections which have dealt with sulphur dioxide emission controls. The simplest abatement strategy is the so-called "blanket emission controls" strategy. In this, emissions from all NOx sources would be reduced at some date in the future by an identical percentage below the emissions of some base year. Deposition would then be "rolled back" with the emissions, leading to the same percentage reduction in deposition in the remote, sensitive areas, making due allowance for any background or uncontrolled deposition contributions and any non-linearities in the emission-deposition relationship for NOx. In this section, we show how alternative abatement strategies can be drawn up using optimisation by simulated annealing that give the same reduction in deposition despite requiring a substantially smaller capacity of costly NOx abatement technologies. These optimal abatement strategies appear highly cost-effective when viewed against "blanket" percentage emissions reductions. This is because all controls are installed within a limiting distance of the sensitive deposition sites, with this distance depending on the lifetime of the oxidised nitrogen compounds in the atmosphere. The design and formulation of abatement strategies to achieve cost-effective deposition reductions employs theoretical models to unravel the complex processes which link NOx emissions to oxidised nitrogen deposition in Europe. Many uncertainties still remain in these theoretical studies and it is therefore important to demonstrate that any abatement strategy developed is robust to changes in understanding of the processes involved. The adopted approach involves a number of stages, as before with sulphur:
* the selection of sensitive ecosystems, receptor sites or catchment areas,
* attribution of the sources of the oxidised nitrogen deposition at each site,
* finding the optimal strategy for reducing NOx emissions which gives the greatest reductions in deposition at the receptor sites.
4.2. Selection of sensitive receptor sites for oxidised nitrogen deposition
The first stage in the development of the theoretical techniques for the evaluation of NOx abatement strategies is the selection of possible locations for deposition receptor sites which may be environmentally sensitive to the deposition of oxidised nitrogen species. Table 2 contains the locations of seven arbitrarily-chosen receptor sites which have been selected, in the absence of authoritative criteria, to add to those in Table 1. The sites were chosen to reflect the possible contribution from oxidised nitrogen deposition to environmental damage in the following ecosystems [14]:
* acidification of surface waters and aquatic ecosystems,
* acidification of forest soils,
* eutrophication of the marine environment,
* nitrogen saturation effects in forests, heathlands and bogs.
The choice of sites is purely illustrative and has been made to facilitate the analysis. The assumption has been made that both wet and dry oxidised nitrogen deposition are equally damaging.
Table 2
The additional illustrative receptor sites used in the analysis of optimal strategies for NOx control

receptor site          location
The Netherlands        52°40'N 5°40'E
west Norway            60°40'N 4°40'E
south west Germany     48°20'N 7°20'E
North Sea              54°20'N 7°40'E
south east Germany     49°00'N 12°00'E
Baltic Sea             55°20'N 15°40'E
south Finland          59°40'N 23°20'E
4.3. Emission-deposition relationship for oxidised nitrogen
The behaviour of the oxides of nitrogen emitted from a source of pollution is much more complicated than for sulphur dioxide. The oxides of nitrogen are much more chemically reactive and form a whole family of oxidised nitrogen compounds. Each family member has a different fate and behaviour. Dry deposition and wet deposition occur downwind from the NOx source through the respective contributions from nitrogen dioxide, nitric acid and nitrate aerosol [17]. The long-term total oxidised nitrogen deposition downwind of an NOx source takes the form:

LTD = (1.2E/x)[exp(-0.0017x) - exp(-0.0057x)]    (2)

where LTD is the long term deposition in kg N/ha/yr, x is the downwind distance of the receptor point in km and E is the NOx emission in thousands of tonnes as NO2/yr.
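Equation (2) is simple enough to evaluate directly. The short sketch below merely tabulates the deposition from a single source at a few downwind distances; the function name is ours and nothing beyond Equation (2) is assumed.

```python
import math

def long_term_deposition(E_ktonnes, x_km):
    """Equation (2): long-term oxidised nitrogen deposition (kg N/ha/yr) at
    downwind distance x_km from a source of E_ktonnes NO2/yr."""
    return 1.2 * E_ktonnes / x_km * (math.exp(-0.0017 * x_km)
                                     - math.exp(-0.0057 * x_km))

# Example: deposition from a 100 ktonne NO2/yr source at several distances.
for x in (50, 150, 500, 1500):
    print(x, round(long_term_deposition(100.0, x), 4))
```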
4.4. Finding optimal strategies for NOx abatement
To represent a NOx abatement strategy, it is imagined that abatement devices are installed in particular emission grid squares and that the total installed capacity of abatement is divided up into a large number of these discrete equal-sized devices. Any number of these discrete devices can be installed in any emissions grid square, up to the limit implied by the uncontrolled emission. The devices were given random displacements so that they execute a 'drunkard's walk' over the emissions grid, until the simulated annealing algorithm 'freezes' them in their optimal locations, having minimised the objective function. With a set of receptor points, the objective function requires some thought before the trade-offs in the locations of the abatement devices can work efficiently. In a first series of optimisations, the objective function was defined as the sum, over the receptor sites, of the ratios of the total oxidised nitrogen deposition to that in the uncontrolled base case. This formulation of objective function expresses the desire to maximise the fractional improvement at all sites and gives the trade-offs scope to produce a fully European strategy. For a total installed capacity of NOx abatement of 1 million tonnes NO2/yr, corresponding to a reduction in emissions of 6.7%, optimal strategies were found using simulated annealing. The reductions in deposition found simultaneously at the receptor sites in optimal strategies varied between 2.5% and 39%, implying efficiency factors of between 0.4 and 5.8 times that of "blanket reduction" strategies. Cost-effective abatement strategies can thus be found which improve on "blanket reduction" strategies for nine out of the eleven receptor sites. For the remaining two sites, the reductions fall significantly below 6.7%, showing that their deposition has hardly been optimised at all in the optimal strategy. The two receptor sites were those sites in the south east and south west of Germany which received the highest deposition in the uncontrolled case. From the optimal control strategy point of view, there are clearly seen to be two distinctly separate groups of receptor sites, with the more remote and sensitive sites in one group and the high deposition sites in the other. The results from this first optimisation would suggest that there are two distinct problems to solve:
* high deposition of oxidised nitrogen in central Europe,
* deposition in the more remote regions of Europe.
In a second series of optimisations, an objective function given by the mean deposition over the eleven receptor sites was used. This expresses the desire to minimise the total deposition over all eleven sites, accepting that any reduction in absolute deposition is equally valuable wherever it is produced. Again, the same 1 million tonnes NO2/yr total abatement was assumed. The optimal strategy in this case dramatically reduced the deposition at the two sites in Germany which responded least in the first series of optimisations. In the second series, their depositions were reduced by 4.9% and 14.0% instead of 2.5% and 5.6%, respectively. However, for the other nine sites, except the Netherlands site, the second series gave lower NOx emission reduction requirements compared
with the first series. The results from the second series confirm the view gained from the first series that there are two problems to solve with the two areas showing different responses to abatement. Experience from these simulated annealing optimisations dictates separate abatement strategies for the two sensitive regions, whether the remote fringes of Europe or central Europe. Optimal strategies always involve local controls ahead of controls on more distant sources. The difference lies in the groupings of sources between which cooperation is required to secure cost-effective reductions in deposition.
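The two objective-function choices compared above can be written down compactly. In the sketch below, `deposition` and `baseline` are assumed to be equal-length sequences holding the receptor-site depositions for the trial strategy and for the uncontrolled base case; the function names are ours, not the chapter's.

```python
def sum_of_ratios(deposition, baseline):
    """First series: favour the largest fractional improvement at every site."""
    return sum(d / b for d, b in zip(deposition, baseline))

def mean_deposition(deposition):
    """Second series: any absolute reduction counts equally, wherever it occurs."""
    return sum(deposition) / len(deposition)

# A strategy that halves deposition at a lightly loaded remote site improves
# sum_of_ratios as much as halving a heavily loaded central European site,
# whereas mean_deposition rewards cuts at the high-deposition sites most.
```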
5. OPTIMAL STRATEGIES FOR POLLUTANTS IN COMBINATION
5.1. Photochemical oxidant control strategies
The studies described above have each involved one single pollutant in turn, and the acidic deposition derived from each has been examined separately. To an extent these pollution problems have few interactions between them and so there is likely to be little conflict between the strategies so derived [20]. However, NOx abatement strategies are under consideration in Europe for other policy reasons, in addition to environmental acidification. NOx emissions contribute to regional scale ozone formation and long range transport [21]. However, it is not possible to define optimal strategies for implementation in Europe for the control of photochemical oxidants through NOx emission reductions alone, without considering the possible contribution which may be made by hydrocarbon emission reductions. In the paragraphs below, the optimisation by simulated annealing method is applied to the joint optimisation of hydrocarbon and NOx emission controls in the formulation of optimal photochemical oxidant control strategies. Regional scale episodes of elevated concentrations of photochemical oxidants occur every summer in Europe. During summertime anticyclonic weather conditions, ozone concentrations steadily build up over several days and may exceed internationally-accepted criteria values set to protect human health, crops and trees [22]. There are no emissions of ozone into the atmosphere, and all the ozone found close to the ground in pollution episodes has been formed there by chemical reactions involving the precursors, hydrocarbons and the nitrogen oxides, in the presence of sunlight. Strategies for the control of photochemical oxidants therefore contain information not only on the extent of both hydrocarbon and NOx controls but also on their separate locations. Optimisation by simulated annealing has been applied to the investigation of optimal strategies for ozone control, involving the simultaneous abatement of hydrocarbon and NOx emissions in concert. This application involves the following stages:
* the selection of sensitive ecosystems and receptor sites, and establishing critical levels for ozone at these receptor sites,
* defining the ozone to hydrocarbons and NOx source-receptor relationship,
* representing the emissions of hydrocarbons and NOx and their control,
* finding the optimal strategy which, for a given level of investment in pollution control, brings about the maximum reduction in ozone.
Again, little of this information is available for Europe and illustrative assumptions are appropriate to make progress and to establish some guiding principles.
5.2. Selecting sensitive receptors for ozone
During regional scale ozone episodes, elevated concentrations become widespread across Europe [20]. Transboundary transport is an important feature of the problem. The concentrations exceed air quality guidelines set to protect human health [22] and critical levels set to protect crops and trees from ozone damage. To illustrate the approach, the target ecosystem is taken to be human health, so that the sensitive ecosystems can be located in the major urban centres of London, Paris, Brussels, Bonn, Amsterdam, Copenhagen, Stockholm and Oslo. For these ecosystems, the maximum hourly mean in a worst-case episode is the most appropriate index of harm.
5.3. Ozone to hydrocarbons and NOx source-receptor relationship
The source-receptor relationship for ozone has been derived using a photochemical trajectory model [23], and its shape is dramatically different from that for acidic sulphur and nitrogen deposition. This difference is almost entirely due to differences in averaging time. With environmental acidification it has been generally assumed that long term deposition is the appropriate index of harm. With photochemical episodes, concern is assumed to lie in the episodic peak concentrations found infrequently during summer periods. As a result of the focus on pollution episodes, the 1/r² decline in long term potential damage with increasing distance, r, between source and receptor does not apply. The photochemical ozone source-receptor relationship instead shows a growth phase at short range, a maximum at mid-range and an exponential decline due to dry deposition to the earth's surface.
5.4. Application of simulated annealing
The application of optimisation by simulated annealing [8-10] entails considering a configuration of both hydrocarbon and NOx abatement devices undergoing separate 'drunkard's walks' across their respective emissions grids. Initially the spatial distribution of abatement devices was chosen at random. The actual hydrocarbon and NOx emissions at each point in the grids were calculated from the baseline emissions and the capacity of abatement that had been located there. The ozone concentrations at each receptor point could then be calculated using the source-receptor relationship. The value of the objective function was then calculated from the ozone concentrations at the receptor sites, taking into account
the exceedence of the critical loads. The configurations of the hydrocarbon and NOx abatement devices were then given a random rearrangement, which was kept if either the objective function decreased or if the Metropolis condition [19] was satisfied and an uphill rearrangement allowed. The optimisation was controlled using some form of temperature parameter and an annealing schedule. The optimal strategies generated gave the minimum exceedence of ozone critical loads for a given installed total capacity of hydrocarbon and NOx controls. By steadily changing these total capacities and rerunning the optimisation, it has been possible to construct 'trade-off curves' between hydrocarbon and NOx controls. The results of the optimisations showed that geographically differentiated abatement strategies could be found quite readily. The spatial patterns taken up by the hydrocarbon and NOx abatement devices could not have been more different. No preferred locations were found for the NOx abatement devices, whereas the hydrocarbon devices congregated in the high emission density areas of north west Europe. Since it is the role of hydrocarbon emissions to stimulate the photochemical ozone production, it is advantageous to concentrate controls in the major source areas in the centre of Europe where the maximum number of trajectories cross. The role of NOx emissions is to maintain the long range transport, so it is advantageous not to concentrate controls in the centre of Europe but to move them out to the fringes of north west Europe. In these NOx-limited regions, NOx control can be particularly effective in protecting the more remote receptors from elevated ozone concentrations.
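The 'trade-off curves' mentioned above come from rerunning the whole optimisation over a grid of installed capacities. A minimal sketch follows; optimise_strategy is a hypothetical stand-in for one complete simulated annealing run returning the minimised ozone exceedence for the given capacities, and is not part of the chapter's software.

```python
def trade_off_curves(optimise_strategy, hc_capacities, nox_capacities):
    """Map each (hydrocarbon capacity, NOx capacity) pair to the minimised
    ozone critical-level exceedence found by the optimisation."""
    curves = {}
    for hc in hc_capacities:
        for nox in nox_capacities:
            curves[(hc, nox)] = optimise_strategy(hc_capacity=hc,
                                                  nox_capacity=nox)
    return curves

# Contours of equal exceedence in `curves` then show how much NOx control can
# be exchanged for hydrocarbon control while giving the same ozone benefit.
```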
6. CONCLUSIONS These rudimentary conclusions are based on a range of tentative assumptions concerning the spatial pattern of emissions, source-receptor relationships, choices of sensitive sites, indices of harm and critical loadings of pollutants. Each of these assumptions has been formulated to show how they link together using some powerful, novel optimisation techniques to generate optimal pollution control strategies. Optimisation by simulated annealing has shown us that it may well be possible to control pollution much more cheaply than could be achieved by arbitrary percentage cuts in emissions 'across the board'. Future work will need to consider all the details and intricacies which we have avoided in these illustrative examples. Nevertheless, with optimisation by simulated annealing, we have been able to make an encouraging start but the tentative conclusions reached here may be modified both spatially and quantitatively.
7. ACKNOWLEDGEMENTS The author wishes to express his gratitude to Professor Sir Sam Edwards of the Cavendish Laboratory, University of Cambridge for drawing attention to the potential of optimisation by simulated annealing. This work forms part of the Air
Quality Research Programme of the United Kingdom Department of the Environment under contract EPG 1/3/17.
REFERENCES
1. S. Oden, The acidification of the atmosphere and precipitation and its consequences in the natural environment, SNSR (1968) Stockholm, Sweden.
2. H. Rodhe, Ambio, 18 (1989) 155.
3. P. Brimblecombe and J.I. Pitman, Tellus, 32 (1980) 261.
4. A. Eliassen and J. Saltbones, Atmospheric Environment, 17 (1983) 1457.
5. OECD, The OECD programme on long range transport of air pollutants, Measurements and Findings, OECD (1979) Paris, France.
6. H. Wuster, Acidification Research: Evaluation and Policy Applications, pp. 221-239, Elsevier Science Publishers B.V., Amsterdam, 1992.
7. UN ECE, The critical loads approach, EB.Air/WG.5/R.1 (1989), United Nations Economic Commission for Europe, Geneva, Switzerland.
8. S. Kirkpatrick, C.D. Gelatt and M.P. Vecchi, Science, 220 (1983) 671.
9. E.H.L. Aarts and P.J.M. van Laarhoven, Statistica Neerlandica, 43 (1989) 31-52.
10. W.H. Press, S.A. Teukolsky, W.T. Vetterling and B.P. Flannery, Numerical Recipes in Pascal, Cambridge University Press, Cambridge, UK, 1992.
11. R.G. Derwent, Nature, 331 (1988) 575.
12. R.G. Derwent, Environmental Models: Emissions and Consequences, pp. 437, Elsevier Science Publishers B.V., Amsterdam, 1990.
13. HMSO, This common inheritance, H.M. Stationery Office, London, UK, 1990.
14. J. Nilson and P. Grennfelt, Critical Loads for Sulphur and Nitrogen, Nord 1988:15, Nordic Council of Ministers, Copenhagen, Denmark, 1988.
15. CLAG, Critical loads of acidity in the United Kingdom, Institute of Terrestrial Ecology, Edinburgh, Scotland, 1994.
16. CCE, Mapping critical loads for Europe, RIVM, Bilthoven, The Netherlands, 1991.
17. R.G. Derwent, G.J. Dollard and S.E. Metcalfe, Quarterly Journal of the Royal Meteorological Society, 114 (1988) 1127.
18. J.-H. Tuovinen, K. Barrett and H. Styve, Transboundary acidifying pollution in Europe, EMEP/MSC-W report 1/94, Oslo, Norway, 1994.
19. N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller and E. Teller, Journal of Chemical Physics, 21 (1953) 1087.
20. P. Grennfelt, O. Hov and R.G. Derwent, Ambio, 23 (1994) 425-433.
21. B. Lubkert, R. Derwent, J. Alcamo and J. Bartnicki, Environmental Pollution, 58 (1989) 237.
22. WHO, Air Quality Guidelines for Europe, World Health Organisation, Geneva, Switzerland, 1987.
23. R.G. Derwent, Environmental Pollution, 63 (1990) 299.
Chapter 11
Determination of biexponential fluorescence lifetimes by using simulated annealing and simplex searching Sanford L. Shew* and Carter L. Olson Division of Medicinal Chemistry, College of Pharmacy, The Ohio State University, 500 West 12th Avenue, Columbus, Ohio 43210 1. INTRODUCTION In analytical chemistry, developing a calibration curve or modelling a phenomenon often requires the use of a mathematical fitting procedure. Probably the most familiar of these procedures is linear least-squares fitting [1]. Criteria other than least-squares for defining the best fit have been developed for linear parameters when the data possibly contain outliers [2,3]. Sometimes, the model equation to be fit is nonlinear in the parameters. This requires appeal to other fitting methods [4]. In this chapter we report implementation of several of these techniques for determining nonlinear fit parameters. In particular, we want to characterize biexponential decays of fluorescence induced in a system consisting of a fluorescent ligand and a non-fluorescent protein [5]. A ligand bound to the protein may have a different fluorescent lifetime from a free ligand [6, 7]. The importance of measuring the fluorescence lifetime parameters from a mixture of such free and bound ligands lies with the fact that the information obtained can provide an estimate of the extent of binding to the protein. This estimate is thus accomplished without a physical separation of free and bound ligands. Data analysis for determination of the lifetimes is complicated by several factors. In addition to a nonlinear decay law, the instrument itself introduces distortions that can be fairly severe. Furthermore, because of the nature of the light source used, the data sets are sometimes a bit noisy. We initially had some success in using simplex searching to determine the fluorescence lifetimes. As is well-known, however, simplex searching is susceptible to finding fit parameters that correspond to a local optimum on the goodness-of-fit surface, but not the global optimum. We then turned to simulated annealing [8].
* Current address: Extrel FTMS, Inc., 6416 Schroeder Road, Madison, Wisconsin, 53711.
This gave much more reliable results. However, we had not implemented a variable step-size algorithm, and therefore obtained fit parameters that were only in the neighborhood of the best fit parameters. Given our previous experience with simplex searching, we combined the two techniques in order to obtain more precise parameter estimates. In this chapter we will outline the mathematics of fluorescence decays, briefly describe the instrumentation used in the measurements, and detail our implementation of simplex searching, simulated annealing, and the combination of the two as applied to ligand-protein binding.
2. BACKGROUND
In order to appreciate the different methods of data analysis, it is helpful to have a clear idea of the mathematics of the physical phenomenon of fluorescence [9]. On a Jablonski diagram as shown in Figure 1, S0 represents the energy level of the ground electronic state of a particular molecule, S1 an excited singlet state, and T1 a triplet state; energy of the states is represented vertically, with S1 at higher energy than S0. The more closely spaced levels within each labeled electronic level represent different vibrational energy levels. Changes in electronic energy levels, e.g., from S0 to S1, can be brought about by interaction with ultraviolet or visible light, while changes in vibrational energy levels can be probed with electromagnetic radiation in the infrared region. As represented in this diagram, the processes leading to observation of fluorescence begin with the left-hand vertical arrow. This arrow represents absorption of a photon of frequency v (and hence energy hv) by a molecule originally in the lowest vibrational energy level of the ground electronic state. After absorption, the molecule resides in an excited vibrational energy level in the excited electronic state S1; in order for absorption of the photon to occur, the initial and final energy levels must be separated by energy of exactly hv. There are several paths for the excited molecule to shed this excess energy. The first step is usually vibrational relaxation to the lowest vibrational level of the excited electronic state. From here, radiationless relaxation by internal conversion is typically the major event. This is indicated by the vertical line labeled IC in Figure 1, and the excess energy acts to increase the kinetic energy of the un-excited molecule. A second, less common pathway is radiationless intersystem crossing to the triplet state, represented by the line labeled ISC. From the triplet state, phosphorescence can occur, which is the emission of a photon of lower energy hv" than the exciting photon.
Figure 1. Jablonski diagram.

Fluorescence occurs when a photon, of energy hv' in Figure 1, is emitted from an excited singlet state. Not every compound that absorbs light fluoresces, but for those that do, it is often a very efficient process. In fact, fluorescence quantum yields (the ratio of the number of emitted photons to the number of absorbed photons) can approach unity. A fourth fate not explicitly shown in the diagram in Figure 1 arises when certain species collide with an excited molecule [9]. The outcome is similar to internal conversion (that is, the excited molecule relaxes back to the S0 energy level with perhaps higher kinetic energy) but the extent of the effect is dependent on the concentration of quencher. We can write a rate equation for the disappearance of the excited molecule due to such collisional quenching; this involves a bimolecular rate constant kQ so that

d[M*]/dt = -kQ[Q][M*],    (1)

where [M*] is the concentration of excited molecules and [Q] is the concentration of the quencher. Often, [Q] >> [M*], and [Q] may be regarded as a constant. Similar first-order rate equations can be written for all other immediate fates from S1 described above. Allowing the rate constants kIC, kISC and kF for internal conversion, intersystem crossing and fluorescence, respectively, we can write
d[M*]/dt = -kIC[M*] - kISC[M*] - kF[M*] - kQ[Q][M*].    (2)

Taking [Q] as constant, this gives

[M*]/[M*]0 = exp(-(kIC + kISC + kF + kQ[Q])t).    (3)

Since the intensity of fluorescence is proportional to the concentration of excited state, the observed fluorescence intensity follows the same time course, or

F(t) = α exp(-(kIC + kISC + kF + kQ[Q])t),    (4)

where F(t) is the intensity of fluorescence as a function of time and α is the initial intensity of fluorescence (at zero time). The fluorescence lifetime, denoted by τ, is defined as the time for the fluorescence intensity to decay to 1/e, or about 37%, of its initial value. From Equation (4) it is obvious that the lifetime is equal to the reciprocal of the sum of the rate constants,

τ = 1/(kIC + kISC + kF + kQ[Q]),    (5)

and the monoexponential decay law becomes

F(t) = α exp(-t/τ).    (6)

A collection of fluorescing molecules in a particular homogeneous environment has a characteristic lifetime. In a different environment, the same molecules may have a different lifetime. A collection of two species, either different molecules or similar molecules in different microenvironments, may exhibit a fluorescence decay with a time course that contains contributions from both species. Thus we would observe a biexponential decay with two distinct lifetimes,

F(t) = α1 exp(-t/τ1) + α2 exp(-t/τ2).    (7)

Of course, it is possible to have more than two fluorescing species in solution, in which case F(t) is multiexponential, with each species having a characteristic lifetime. In practice, one assumes that the decay law with the fewest terms that adequately fits the experimental data is indicative of the number of distinct fluorescent species. In other words, if no monoexponential decay can be found to fit
the observed fluorescence, but a biexponential decay does fit the observed data, we assume that we have two separate fluorescent species. In the case of biexponential (and other multiexponential) decays, the product αiτi is proportional to the concentration of the ith species. Since under linear conditions the number of molecules excited is proportional to the number of molecules present, αiτi is proportional to the total number of molecules of species i and thus similarly proportional to the concentration. The fluorescence decay of a ligand in the absence of protein is (normally) monoexponential, and is characterized by a particular lifetime, for example, τ1. The fluorescence decay from a mixture of the ligand and a protein to which the ligand binds is biexponential (assuming only one type of binding site on the protein). Determining the biexponential decay parameters from such a system then allows us to estimate the extent of binding by comparing the quantities α1τ1 (proportional to the amount of free ligand) and α2τ2 (proportional to the amount of bound ligand).
2.1. Instrument description
The instrument used in this work [10] is shown schematically in Figure 2. The light source is a nitrogen/dye laser combination that provides excitation from 337 nm through the visible region, with pulse widths of 200-300 ps and energies of 50-60 μJ. The sample is housed in a holder (SCH) attached to the front of the emission monochromator (MONO). Fluoresced or scattered light is emitted in all directions, and a fraction passes through the monochromator. The monochromator allows the wavelength of emitted light to be selected before impinging on the photomultiplier tube (PMT). The electrical output from the photomultiplier tube passes through a delay line before being digitized by the boxcar averager. The delay line is necessary because of a fixed delay, internal to the boxcar averager, between the receipt of a trigger and the actual start of digitization. The trigger is provided by the output of the trigger photodiode, which receives a small fraction of the exciting laser light reflected from the beam splitter (BSP). In this way, the exciting light pulse itself provides the reference for the start of the time base. This is required by the short time scales of fluorescence decays, on the order of nanoseconds. Experiments are controlled by the PC, which is connected to the boxcar averager by a GPIB line. Boxcar parameters such as the number of data points to collect, the dwell time, and the number of scans to average, are set by using the PC. Data sets are then transferred from the boxcar averager to the PC for storage. The PC is connected by ethernet to a workstation computer, where data analysis is performed.
Figure 2. Schematic diagram of fluorescence lifetime determination apparatus. L1 and L2 are lenses; M, mirror; BST, beam steerer; BSP, beam splitter; SCH, sample cell holder; MONO, monochromator; PMT, photomultiplier tube. Other components (nitrogen laser, dye laser, trigger photodiode, delay line, boxcar averager, GPIB line, PC, Ethernet link and UNIX workstation) are described in the text.
245
response [13]. The data sets in this work are acquired by the impulse-response method. Here, fluorescence is initiated by a brief, intense pulse of light. If this excitation pulse were infinitely short, as a delta pulse, and if the detector responded infinitely quickly, then the observed decay in fluorescence intensity would be exactly as predicted by Equation (6) or Equation (7). In practice, however, neither the exciting light source nor the detector are ideal. As a result, the time-course of the observed fluorescence signal is distorted. Fortunately, the effect of this distortion on the ideal fluorescence decay (Equations (6) or (7)) is well understood. The nonidealities of the light source and the detector, as well as any other distortions introduced by, for example, the detection electronics, can be grouped together as the instrument response function. The observed fluorescence signal is then the convolution of the ideal fluorescence decay with the instrument response function [14]. Because we can detect the instrument response function by observing scattered light from a non-fluorescent sample, this nonideality presents only an additional step in our data analysis algorithm. Examples of the instrument response function and observed fluorescence are shown in Figure 3. The data set in the top plot of Figure 3 was obtained by observing light scattered from a glucose solution with both the output monochromator of the dye laser and the emission monochromator set to 366 nm. Thus this represents the instrument's intrinsic response. Note that the full width at half the maximum height of the peak (FWHM) is about 2 ns, greater than the laser's quoted. 200-300 ps; this implies that the detection electronics do not respond instantaneously on this time scale. The lower plot of Figure 3 shows the observed fluorescence from 10 p,M dansylamide, with excitation at 366 nm and emission monitored at 560 nm. The monoexponential lifetime was determined to be about 2.5 ns.
3. DATA ANALYSIS Exponential decays are common forms for a wide variety of physical phenomena. Consequently, many tools have been developed for their analysis. The present application has several complications. We are dealing for the most part with biexponential decays, so that we have two lifetimes to determine. We are also interested in the relative contribution from each lifetime component to the total fluorescence. Furthermore, the intrinsic response of the instrument (Figure 3, top) is not insignificant with respect to many of the lifetimes that we determine. Thus our observed data are distorted, albeit in a well-defined way, from the simple biexponential in Equation (7). For these reasons, many of the literature techniques for exponential analysis, such as semilogarithmic plots, the phase-plane method [15,16], and Fourier division [17,18], are not useful to us.
Figure 3. Plots showing instrument response function (top plot) and observed fluorescence decay from dansylamide (bottom plot).

To simplify the data analysis somewhat, we work with normalized preexponential factors in Equation (7). In other words, we assume that the intensity of fluorescence at zero time is unity, i.e., α1 + α2 = 1. This is acceptable because we scale the expected fluorescence decay (see below) to the observed data set. This scaling is necessary because, although we try to match the intensities of observed fluorescence decays to the measured instrument response function during acquisition, it is nearly impossible to match the intensities exactly by using neutral density fil-
ters. Thus, instead of working with Equation (7), we substitute α2 = 1 - α1 and obtain

F(t) = α1 exp(-t/τ1) + (1 - α1) exp(-t/τ2)    (8)

as the ideal decay law. This eliminates one fit parameter. Some further simplification is obtained by restricting the guesses for the parameters to physically reasonable values. For example, the lifetimes, τ1 and τ2, are both greater than zero, and α1 is between zero and one. Because we are working with discretely sampled data, we can rewrite Equation (8), substituting ti for t. The expression for the ideal fluorescence decay in this case is

F(ti) = α1 exp(-ti/τ1) + (1 - α1) exp(-ti/τ2).    (9)
By measuring the instrument response function (Figure 3, top) and knowing that the observed fluorescence decay (Figure 3, bottom) is equal to the convolution of the ideal fluorescence decay (Equation (8) or (9)) with the instrument response function, we can construct an expected or calculated fluorescence decay for given fit parameters

Fcalc(τ1, τ2, α1, t) = ∫[0,t] F(τ1, τ2, α1, T) I(t - T) dT,    (10)

or in discrete terms

Fcalc(τ1, τ2, α1, ti) = Σ(j=1 to N) F(τ1, τ2, α1, tj) I(ti - tj),    (11)
where N is the number of data points acquired across the time base (usually 256 points are collected). The practical data analysis techniques for us include iterative methods in which estimates for the parameters are refined until we get an acceptable fit. The quality of the fit is judged by how closely the calculated fluorescence decay (generated from the instrument response function and an assumed biexponential decay law) and the observed fluorescence decay match in a least-squares sense. The goodness-of-fit parameter is defined as
χ²(τ1, τ2, α1) = (1/N) Σ(i=1 to N) (Fcalc(τ1, τ2, α1, ti) - Fobs(ti))².    (12)
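For readers who prefer code to notation, the following NumPy sketch implements Equations (9), (11) and (12) under the assumption of a uniform time base; the array and function names are ours rather than the authors', and the normalization step merely stands in for the intensity scaling discussed earlier.

```python
import numpy as np

def ideal_decay(t, tau1, tau2, alpha1):
    """Equation (9): normalized biexponential ideal decay law."""
    return alpha1 * np.exp(-t / tau1) + (1.0 - alpha1) * np.exp(-t / tau2)

def calculated_decay(t, irf, tau1, tau2, alpha1):
    """Equation (11): discrete convolution of the ideal decay with the
    instrument response function, truncated to the length of the time base."""
    f_ideal = ideal_decay(t, tau1, tau2, alpha1)
    f_calc = np.convolve(f_ideal, irf)[: t.size]
    return f_calc / f_calc.max()          # crude stand-in for intensity scaling

def chi_squared(f_calc, f_obs):
    """Equation (12): mean squared deviation between calculated and observed."""
    return np.mean((f_calc - f_obs) ** 2)

# Example with synthetic data (256 points, as in the text).
t = np.linspace(0.0, 25.0, 256)                      # ns
irf = np.exp(-0.5 * ((t - 2.0) / 0.8) ** 2)          # stand-in Gaussian IRF
f_obs = calculated_decay(t, irf, tau1=2.5, tau2=22.0, alpha1=0.85)
print(chi_squared(calculated_decay(t, irf, 2.6, 22.6, 0.86), f_obs))
```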
The important theoretical, calculated, and measured quantities for biexponential fluorescence decay analysis are collected in Table 1.

Table 1
Important quantities for biexponential fluorescence decay parameter estimation.

Quantity                     Description                          Origin
F(t)                         ideal decay law (continuous time)    Equation (8)
F(ti)                        ideal decay law (discrete time)      Equation (9)
I(ti)                        instrument response function         measured from non-fluorescent scattering sample
Fobs(ti)                     observed fluorescence                measured from fluorescent sample
Fcalc(τ1, τ2, α1, ti)        calculated fluorescence              Equation (11)
χ²(τ1, τ2, α1)               goodness-of-fit of Fcalc to Fobs     Equation (12)
The general process, then, begins by measuring both the instrument response function I(ti) and the observed fluorescence decay Fobs(ti). Then a starting guess is made at the decay parameters, defining the ideal decay law, and a calculated decay Fcalc(τ1, τ2, α1, ti) is obtained by convolving the instrument response function with the ideal decay law. The fit parameters are then refined, in a process dependent on the particular fit technique, using the least-squares fit of the observed and expected decays as a guide. A flowchart for the general case is shown in Figure 4.
3.1. Marquardt fitting
Non-linear least-squares fitting by the Marquardt method [19,20] appears to be the most commonly used technique for biexponential fluorescence decay analysis, at least for a time-domain measurement such as used here [21,22]. Fitting by this method requires evaluation of the derivatives of the model equation (Equation
(11)) with respect to the parameters to be fitted. Since we do not know an analytic expression for the instrument response function I(ti) in Equation (11), we considered the model equation actually to be the ideal fluorescence decay given by Equation (9). We then did the discrete convolution in Equation (11), and calculated the χ² indicator as in Equation (12). Derivatives of F(ti) with respect to τ1, τ2 and α1 are then easily evaluated, and used with the Marquardt algorithm (as given in Reference 20) to iteratively estimate the parameters.
Figure 4. Flowchart showing the generic approach to determining biexponential fluorescence decay parameters: acquire the instrument response function I(ti) from scattered light; acquire the observed fluorescence Fobs(ti); make an initial guess at τ1, τ2 and α1, calculate Fcalc(τ1, τ2, α1, ti) from Equation (11) and the initial goodness-of-fit response χ² from Equation (12); then use the fitting algorithm to generate new estimates of τ1, τ2 and α1 and the corresponding χ² until the convergence criterion is met.
In spite of its prevalence in the fluorescence decay literature, we were not universally successful with this fitting method. Most reports of bi- or multiexponential decay analysis that use a time-domain technique (as opposed to a frequency-domain technique) use time-correlated photon counting, not the impulse-response method described in Section 2.1. In time-correlated photon counting, noise in the data is assumed to have a normal distribution. Noise in data collected with our instrument is probably dominated by the pulse-to-pulse variation of the laser used for excitation; this variation can be as large as 10-20%. Perhaps the distribution or the level of noise, or the combination of the two, accounts for our inconsistent results with Marquardt fitting.
3.2. Simplex searching
Another data fitting technique that we assessed was simplex searching. Following its development by Nelder and Mead [23], simplex searching was soon applied to chemical problems [24]. Algorithms are readily available, and in-depth theory and extensions to the original simplex search have been published [24,25]. Because the simplex method is well documented in the literature, we give only a cursory description here. For our discussion, we will appeal to a picture of the goodness-of-fit indicator (the least-squares metric as described in Section 3) as an undulating surface, with the fit parameters as abscissas (for this discussion it is perhaps easier to consider only two parameters to be fit; in actual practice, the dimensionality of the problem is essentially unlimited). Since the best fit values for the parameters are considered to be those that minimize the least-squares indicator, we are really looking for the values of the parameters that correspond to the minimum on this response surface. The simplex method describes an algorithm for finding such a minimum without calculating any derivatives. The "simplex" itself is a geometrical figure defined by vertices, initially given by the user, which are sets of guesses for the parameters. In simple terms, the algorithm forms a new simplex by replacing the vertex having the worst goodness-of-fit indicator with a new vertex. The basic algorithm can be made more sophisticated by altering the means by which it determines the new vertex. The algorithm used for determining biexponential fluorescence decays by simplex searching is shown in Figure 5. Not requiring the evaluation of partial derivatives is an advantage, not only in terms of the computational burden, but also in the case of a model that is not analytically derivable. Also, other criteria for goodness-of-fit, such as the sum of the absolute values of the differences between points in the observed and the expected decays, are just as easily used as least-squares. Unfortunately, simplex searching possesses a significant disadvantage. The response surface may have many local minima in addition to the global minimum. The simplex, once it has begun
Figure 5. Flowchart for determining biexponential decay parameters by using simplex searching: define the starting simplex (four vertices); find the vertices with the best, worst and second-worst responses; reflect the worst vertex through the opposite face of the simplex; if the reflected vertex is better than the existing best, extend the reflection by a factor of 2; if it is worse than the existing second-worst, shorten the reflection by a factor of 2, and if it is still worse, contract the simplex around the best vertex; repeat until convergence.
descending into such a local minimum, will become trapped there, unaware of the global minimum. We used a published algorithm for the simplex search [20], with a minor modification. Since we know that the parameters that we are trying to fit have reasonable values only in certain ranges, we can restrict the search to these regions by setting the response for a vertex outside the valid region to a large value. As we were fitting three parameters (τ1, τ2, and α1), we supplied the simplex routine with four starting vertices. Because of familiarity with the chemical systems, we had a reasonable idea of the values for the lifetimes. Often, however, estimating α1 was more difficult. In practice, we often found that the final estimated parameters were dependent on the starting simplex. Sometimes the parameters to which the simplex search converged were obviously unreasonable for the physical system under study. Often, though, the parameters were reasonable, even though these parameters were not identical to other reasonable parameters obtained by starting with a different simplex. We were led to conclude not only that simplex searching often failed to converge to the global minimum, but also that the local minima that we did find could be close together in parameter space. Thus, we were skeptical of the accuracy of fitted parameters. As in the case of Marquardt fitting, the poor performance could be an artifact of our data.
3.3. Simulated annealing
The failure of simplex searching to consistently provide reliable parameter estimates led us to consider alternative data analysis routines. The undesirable behavior of simplex searching was its willingness to become trapped at a local minimum. A parameter estimation technique that offers resistance to this pathology is simulated annealing [8,26]. This resistance is provided by the willingness of simulated annealing to accept a "bad" move in parameter space (one in which χ² increases) and actually move away from what appears to be a desirable region. In describing this application of simulated annealing, we adhere to the notation of Bohachevsky et al. [8]. The algorithm in our implementation follows the outline given in that work closely. For purposes of our discussion, a point in parameter space is denoted by the vector x = [τ1, τ2, α1]T. To begin the parameter estimation, the user selects a starting point x0. With an appropriate instrument response function, an initial calculated fluorescence decay Fcalc(x0, ti) is developed and a value for χ²(x0) is calculated by comparison with Fobs(ti). Next a unit vector with random direction is generated and a predefined fraction of it is added to x0. In Reference 8, this fraction is a scalar Δr. However, in this
particular instance, since the values of the lifetimes in nanoseconds (typically 1 to 30) are much different from the values of the preexponential factor α1 (always between 0 and 1), Δr is replaced with a vector r. The components of r are specified in accord with the expected absolute values of the three parameters. In specifying the components of r, a compromise must be struck that allows parameters to be fit with reasonable accuracy, while still allowing reasonably rapid exploration of possible values for each parameter. The new parameter vector is called x*. A check can be inserted here to ensure that x* is in bounds (i.e., that each component of x* is within its limits of physically reasonable values). If x* is out of bounds, a new x* is generated from x0 as above. Following generation of a valid x*, a new Fcalc(x*, ti) is developed. Then a comparison is made of χ²(x*) and χ²(x0). If χ²(x*) < χ²(x0), then we consider that Fcalc(x*, ti) is a better fit to Fobs(ti) than is Fcalc(x0, ti). This is a favorable step, and is unequivocally accepted by setting x0 = x*. Then a new trial vector x* is generated and the above process is repeated. If, however, Fcalc(x*, ti) is a poorer fit than Fcalc(x0, ti) (i.e., χ²(x*) > χ²(x0)), x* may or may not be accepted. The probability of accepting such a detrimental move in parameter space is defined in Reference 8 as

p = exp(-β (χ²(x0))^g Δχ²),    (13)

where Δχ² = χ²(x*) - χ²(x0) and β and g are constants to be defined. The probability p is between 0 and 1. To determine whether a detrimental step should be accepted, p is compared with a random number from a uniform distribution over the same range. If p is greater than this random number, the detrimental step is accepted and we set x0 = x*. Regardless of whether the step is accepted, we generate a new x* from x0, and repeat the comparison of χ²(x*) and χ²(x0).

The constant β in Equation (13) is chosen to regulate the fraction of detrimental steps accepted. In Reference 8 it is recommended that this acceptance rate be maintained between 50% and 90%. In practice, we found that this rate is achieved with values of β between 2 and 4. Also, the constant g is a negative number, typically -0.5 or -1.0, and makes the probability of accepting a detrimental step decrease as the goodness-of-fit indicator improves. A flowchart for our implementation of simulated annealing is shown in Figure 6.
Figure 6. Flowchart for determining biexponential decay parameters by using simulated annealing: define a starting vertex and calculate its response; add a fraction of a random unit-length vector to the current vertex; if the response at the new vertex is better than that at the current vertex, accept it, otherwise accept it with the probability p of Equation (13); repeat until the convergence criterion or the iteration limit is reached.
Convergence is defined by χ²(x0) being less than some predefined value. However, termination of fitting is almost always due to the maximum number of iterations being reached. The performance of simulated annealing in determining fluorescence decay parameters was far superior to any of the other methods we had tried. In particular, it was much more robust than either Levenberg-Marquardt fitting or simplex searching. Regardless of the initial estimates for the parameters, the final estimates were close to each other. Because the fraction r of the random unit vector added to the current parameter vector x0 is fixed, we expect that the final parameter estimates (obtained with different starting parameter estimates for the same data set) in some sense surround the best parameter estimate. One approach to obtaining a better estimate of this best parameter estimate might be to average the results from a large number of simulated annealing runs.
3.4. Combination of simulated annealing and simplex searching
Another approach to obtaining more accurate parameter estimates with simulated annealing analysis is the use of a variable step size algorithm. This was done by Sutter and Kalivas [27]. However, given our previous experience with simplex searching, we elected to combine the techniques of simulated annealing and simplex searching. The combination is effected by performing an appropriate number of simulated annealing analyses and using these parameter estimates as the starting vertices for a subsequent simplex search. This approach takes advantage of the assurance provided by simulated annealing that the fitted parameters are near the global optimum of the goodness-of-fit indicator. Because all vertices of the starting simplex are near the global optimum, we anticipate that the simplex search is now much less likely to become trapped in a local optimum. Also, the results from the simulated annealing analyses act to define a range within which we would expect each parameter to reside. Thus, at least in a qualitative sense, the experience one gains with the simulated annealing analysis of each data set increases the confidence one has in the results from the subsequent simplex analysis. This combination has proven to be the most robust means of data analysis. It has enabled us to determine biexponential fluorescence decay parameters from dansylated amino acids in the presence of bovine serum albumin (BSA) [28,29]. An example of this type is shown in Figure 7. Here a plot of the observed fluorescence from dansylproline and BSA is overlaid with the calculated fluorescence obtained from Equation (11), the measured instrument response function, and the parameters determined from the hybrid analysis. These results, including the intermediate simulated annealing results, are shown in Table 2.
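A hedged sketch of the hybrid scheme follows. Here `anneal` and `chi_squared` are assumed to be callables of the kind sketched above, and SciPy's Nelder-Mead routine stands in for the authors' own simplex code.

```python
import numpy as np
from scipy.optimize import minimize

def hybrid_fit(chi_squared, anneal, starting_guesses):
    """starting_guesses: four different [tau1, tau2, alpha1] starting points.
    Each annealing run supplies one vertex of the starting simplex."""
    # Stage 1: independent simulated annealing runs, one per starting guess.
    sa_estimates = [anneal(chi_squared, list(x0))[0] for x0 in starting_guesses]

    # Stage 2: use the annealing end points as the initial simplex and refine.
    simplex = np.asarray(sa_estimates, dtype=float)        # shape (4, 3)
    result = minimize(lambda x: chi_squared(list(x)), simplex[0],
                      method="Nelder-Mead",
                      options={"initial_simplex": simplex})
    return result.x, result.fun, sa_estimates
```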
Also shown in Figure 7 is a plot of the point-by-point differences between Fobs(ti) and Fcalc(τ1, τ2, α1, ti). Although the sum of the squares of these residuals is the basis for the goodness-of-fit indicator (and thus the residuals have already guided the fit to the parameters that are optimum for this model), one may wish to see the residuals plotted as a function of data point. With this plot, systematic deviation is obvious. Such deviation is indicative of an improper choice of model equation, so that we can confirm that a biexponential model is correct.
3.5. Error analysis
In addition to allowing fine-tuning of the fitted parameters, the final step of simplex searching offers a convenient means of estimating the error associated with each parameter. This process has been described by Phillips and Eyring [30]. Briefly, one determines a quadratic approximation to the error surface, from which an error matrix is developed. This matrix can then be used to calculate standard deviations of the fitted parameters. These standard deviations are reported as error estimates of the parameters in Table 2.
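The sketch below gives one standard way of obtaining such error estimates from a quadratic approximation of the error surface: a finite-difference Hessian of the sum of squared residuals, inverted and scaled by the residual variance. It follows the spirit of the error-matrix approach cited above but is not claimed to reproduce the Phillips and Eyring procedure; f_calc is assumed to map a parameter vector to a calculated decay array, and the step sizes are illustrative.

```python
import numpy as np

def sum_of_squares(x, f_calc, f_obs):
    """Sum of squared residuals S(x) between calculated and observed decays."""
    return float(np.sum((f_calc(x) - f_obs) ** 2))

def parameter_errors(x_fit, f_calc, f_obs, steps=(0.01, 0.1, 0.001)):
    """Standard deviations of the fitted parameters from the quadratic
    approximation Cov ~ 2*s^2*H^(-1), with s^2 = S_min/(N - p)."""
    x_fit = np.asarray(x_fit, dtype=float)
    n_par = x_fit.size
    s = lambda x: sum_of_squares(x, f_calc, f_obs)
    hess = np.zeros((n_par, n_par))
    for i in range(n_par):
        for j in range(n_par):
            # Central-difference estimate of the mixed second derivative.
            xpp = x_fit.copy(); xpp[i] += steps[i]; xpp[j] += steps[j]
            xpm = x_fit.copy(); xpm[i] += steps[i]; xpm[j] -= steps[j]
            xmp = x_fit.copy(); xmp[i] -= steps[i]; xmp[j] += steps[j]
            xmm = x_fit.copy(); xmm[i] -= steps[i]; xmm[j] -= steps[j]
            hess[i, j] = (s(xpp) - s(xpm) - s(xmp) + s(xmm)) / (4 * steps[i] * steps[j])
    n_data = f_obs.size
    s_min = s(x_fit)
    covariance = 2.0 * (s_min / (n_data - n_par)) * np.linalg.inv(hess)
    return np.sqrt(np.diag(covariance))
```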
Figure 7. Plots of observed fluorescence (scatter plot) and calculated fluorescence (smooth line) from 10 μM dansyl-L-proline and 5 μM bovine serum albumin (top). The fitted parameters are τ_1 = 2.61 ns, τ_2 = 22.6 ns, α_1 = 0.864 and α_2 = 0.136. Also shown is a plot of the residuals (bottom).
4. RESULTS
The hybrid data analysis method has allowed us to study semi-quantitatively the binding of fluorescent ligands to BSA. Furthermore, by titrating a mixture of a fluorescent ligand and BSA with a nonfluorescent ligand, we can observe binding of the nonfluorescent ligand indirectly as changes in the biexponential decay as the fluorescent ligand is displaced from the protein.
Table 2
Determination of biexponential fluorescence decay parameters for the data shown in Figure 7.

method                             τ_1, ns   α_1      τ_2, ns   α_2
simulated annealing (four runs)    2.726     0.874    24.510    0.126
                                   2.543     0.859    21.740    0.141
                                   2.568     0.864    22.726    0.136
                                   2.711     0.872    23.998    0.128
simplex searching                  2.61      0.864    22.6      0.136
error estimates                    0.04      0.004    0.7       0.004
REFERENCES
1. T. A. Brubaker, R. Tracy, and C. L. Pomernacki, Anal. Chem., 50 (1978) 1017A.
2. S. C. Rutan and P. W. Carr, Anal. Chim. Acta, 215 (1988) 131.
3. S. L. Shew, Amer. Soc. for Mass Spectrometry 40th Conf. on Mass Spectrometry and Allied Topics, Washington, D.C., May, 1992, pp. 1500-1501.
4. T. A. Brubaker and K. R. O'Keefe, Anal. Chem., 51 (1979) 1385A.
5. S. L. Shew and C. L. Olson, Anal. Chem., 64 (1992) 1546.
6. J. Yguerabide, Chapter 24 in Methods in Enzymology, Volume 26, Ed. C. H. W. Hirs and S. N. Timasheff, Academic Press, New York (1972).
7. M. G. Badea and L. Brand, Chapter 17 in Methods in Enzymology, Volume 61, Ed. C. H. W. Hirs and S. N. Timasheff, Academic Press, New York (1979).
8. I. O. Bohachevsky, M. E. Johnson, and M. L. Stein, Technometrics, 28 (1986) 209.
9. J. D. Ingle and S. R. Crouch, Spectrochemical Analysis, Prentice Hall, Englewood Cliffs, NJ, 1988.
10. S. L. Shew, Ph.D. Dissertation, The Ohio State University (1990).
11. S. S. Brody, Rev. Sci. Instrum., 28 (1957) 1021.
12. E. W. Small, L. J. Libertini, and I. Isenberg, Rev. Sci. Instrum., 55 (1984) 879.
13. L. B. McGown and F. V. Bright, Anal. Chem., 56 (1984) 1400A.
14. A. E. W. Knight and B. K. Selinger, Spectrochim. Acta, 27A (1971) 1223.
15. J. N. Demas and A. W. Adamson, J. Phys. Chem., 75 (1971) 2463.
16. J. R. Bacon and J. N. Demas, Anal. Chem., 55 (1983) 653.
17. U. P. Wild, A. R. Holzwarth, and H. P. Good, Rev. Sci. Instrum., 48 (1977) 1621.
18. J. C. Andre, L. M. Vincent, D. O'Connor, and W. R. Ware, J. Phys. Chem., 83 (1979) 2285.
19. P. R. Bevington, Data Reduction and Error Analysis for the Physical Sciences, McGraw-Hill, New York, 1969.
20. W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numerical Recipes in C, Cambridge University Press, Cambridge, 1988.
21. A. Grinvald, Anal. Biochem., 75 (1976) 260.
22. A. Grinvald and I. Z. Steinberg, Biochim. Biophys. Acta, 427 (1976) 663.
23. J. A. Nelder and R. Mead, Computer J., 7 (1965) 308.
24. S. N. Deming and S. L. Morgan, Anal. Chem., 45 (1973) 278A.
25. F. H. Walters, L. R. Parker, S. L. Morgan, and S. N. Deming, Sequential Simplex Optimization, CRC Press, Boca Raton, FL, 1991.
26. J. H. Kalivas, N. Roberts, and J. M. Sutter, Anal. Chem., 61 (1989) 2024.
27. J. M. Sutter and J. H. Kalivas, Anal. Chem., 63 (1991) 2383.
28. G. Sudlow, D. J. Birkett, and D. N. Wade, Mol. Pharmacol., 11 (1975) 824.
29. G. Sudlow, D. J. Birkett, and D. N. Wade, Mol. Pharmacol., 12 (1976) 1052.
30. G. R. Phillips and E. M. Eyring, Anal. Chem., 60 (1988) 738.
Adaption of Simulated Annealing to Chemical Optimization Problems, Ed. by J.H. Kalivas © 1995 Elsevier Science B.V. All rights reserved.
Chapter 12
Simulated annealing applied to crystallographic structure refinement
A.T. Brünger* and L.M. Rice
The Howard Hughes Medical Institute and Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520
*This work was supported in part by the Howard Hughes Medical Institute and by a grant from the National Science Foundation to ATB (DIR 9021975). LMR is an HHMI Predoctoral Fellow.
1. INTRODUCTION
X-ray crystallography contributes increasingly to our understanding of the structure, function, and control of biological macromolecules. Developments in molecular biology and X-ray diffraction data collection have allowed nearly exponential growth of macromolecular crystallographic studies during the past decade. Diffraction data, however, must generally be analyzed by sophisticated computational procedures including methods of phasing, density modification, chain tracing, refinement, and structure validation. Many of these procedures can be formulated as non-linear optimization problems of a target function, which usually measures the agreement between observed diffraction data and data computed from a model. This target function generally depends on several parameters such as phases, scale factors between structure factors, or atomic coordinates. A well-refined crystal structure is clearly necessary for a detailed understanding of molecular interactions. Here we focus on crystallographic refinement, a technique aimed at optimizing the agreement between an atomic model and observed diffraction data.

Optimization problems in macromolecular crystallography generally suffer from the multiple minima problem, which arises in part from the high dimensionality of the parameter space (typically at least three times the number of atoms in the model). There exist many local minima of the target function which tend to defeat gradient descent optimization techniques such as conjugate gradient or least-squares methods [1]. These methods are simply not capable of shifting the atomic coordinates enough to correct errors in the initial model. This limited radius of convergence arises from the high dimensionality of the parameter space, but also from what is known as the crystallographic "phase problem" [2]. With monochromatic diffraction experiments on single crystals one can measure the amplitudes, but not the phases, of the reflections. The phases, however, are required to compute electron density maps by Fourier transformation of the structure factors, each of which is described by a complex number for each reflection. Phases for new crystal structures are usually obtained from experimental methods such as multiple isomorphous replacement [3]. Electron density maps computed by a combination of native crystal amplitudes and multiple isomorphous
replacement phases are sometimes insufficient to allow a complete and unambiguous tracing of the macromolecule. Furthermore, electron density maps for macromolecules are usually obtained at lower than atomic resolution and are therefore prone to human error upon interpretation. A different problem arises when structures are solved by molecular replacement [4,5] using an appropriate search model of similar structure. In this case the resulting electron density maps can be severely model-biased, i.e. "confirming" the chain trace of the search model without providing clear evidence of differences between it and the crystal structure. In either case, initial atomic models usually require extensive refinement.

This review addresses the common case in which experimental phases are either unavailable or inaccurate. In the unusual case that very good experimental phases are available, refinement is much more straightforward [6]. Experimental phase information tends to increase the degree to which the global minimum of the target function can be distinguished from local minima. Its omission from the refinement process exacerbates the multiple minima problem to a point that gradient descent methods have little chance of finding the global minimum. Simulated annealing [7-9] is an optimization technique particularly well suited to the multiple minima characteristic of crystallographic refinement. Unlike gradient descent methods, simulated annealing can overcome barriers between minima and thus explore a greater volume of the parameter space to find "deeper" minima. After its introduction in 1987 [10], crystallographic refinement by simulated annealing (often referred to as molecular dynamics refinement) was quickly accepted in the crystallographic community because it significantly reduced the amount of human labor required to determine a crystal structure. In fact, more than 75% of all published crystal structures during the past three years were refined by this method [11-13]. This review summarizes the theory, applications, and recent developments of crystallographic refinement by simulated annealing.

2. CRYSTALLOGRAPHIC REFINEMENT
An initial atomic model obtained from multiple isomorphous replacement or molecular replacement is likely to contain errors which will hinder the understanding of the chemistry of the crystallized macromolecule, and hence the need for crystallographic refinement. Crystallographic refinement can be formulated as a search for the global minimum of the target [14]
E = E_chem + w_xray E_xray    (1)
where E_chem comprises empirical information about chemical interactions, E_xray describes the difference between observed and calculated data, and w_xray is a weight chosen to balance the forces arising from each term. E_chem is a function of all atomic positions, describing covalent (bond lengths, bond angles, torsion angles, chiral centers, and planarity of aromatic rings) and non-covalent (van der Waals, hydrogen bonding, and electrostatic) interactions. Several algorithms to minimize E have been developed, including least-squares optimization [15-17], conjugate gradient minimization [14,18], and simulated annealing refinement [10].
2.1. The crystallographic residual E_xray
The most common form of E_xray consists of the crystallographic residual, defined as the sum over the squared differences between the observed (|F_obs(h)|) and calculated (|F_calc(h)|) structure factor amplitudes:

E_xray = \sum_{h} ( |F_obs(h)| - k |F_calc(h)| )^2    (2)

where h = (h, k, l) are the Miller indices of the reciprocal lattice points of the crystal. The scale factor k is usually obtained by minimization of Eq. 2. This can be accomplished analytically by setting it to the value which makes the derivative of E_xray with respect to k zero. The structure factor of the atomic model is given by

F_calc(h) = \sum_{s} \sum_{i} q_i f_i(h) \exp(-B_i (T^{*} h)^2 / 4) \exp(2\pi i\, h \cdot (O_s T r_i + t_s)).    (3)

The first sum extends over all space group symmetry operators (O_s, t_s), composed of a rotation matrix O_s and a translation vector t_s. The second sum extends over all unique atoms i of the system. The quantity r_i denotes the orthogonal coordinates of atom i in Å. T is the 3 x 3 matrix that converts orthogonal (Å) coordinates into fractional coordinates; T^{*} is its transpose. B_i and q_i are respectively the atomic temperature factor and occupancy for atom i. The atomic form factors f_i(h) are typically approximated by an expression consisting of several Gaussians and a constant [19]

f_i(h) = \sum_{k} a_{ki} \exp(-b_{ki} (T^{*} h)^2 / 4) + c_i.    (4)
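As a worked step that is not spelled out in the text above, the analytic minimization over the scale factor k mentioned after Eq. 2 follows directly by differentiation:

\frac{\partial E_{xray}}{\partial k} = -2 \sum_{h} |F_{calc}(h)| \left( |F_{obs}(h)| - k |F_{calc}(h)| \right) = 0
\quad\Longrightarrow\quad
k = \frac{\sum_{h} |F_{obs}(h)|\, |F_{calc}(h)|}{\sum_{h} |F_{calc}(h)|^{2}} .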
The structure factor expression given by Eq. 3 is too computer time intensive for practical purposes. Approximations are usually made in order to make crystallographic refinement feasible. One such approximation consists of computing F_calc(h) by numerical evaluation of the atomic electron density on a finite grid followed by Fast Fourier transformation of the electron density. This speeds up the calculation by at least an order of magnitude [20,21]. Another approximation keeps the first derivatives of E_xray constant during the refinement process until any atom has moved by more than a specified small distance relative to the position at which the derivatives were last computed [22].

The crystallographic residual (Eq. 2) only incorporates information about the amplitudes of the observed reflections. A penalty term ("phase restraints") [23] based on the difference between observed phases and those calculated from the model can be added to the residual:

E_xray = \sum_{h} ( |F_obs(h)| - k |F_calc(h)| )^2 + w_p \sum_{h} f( \phi_obs(h) - \phi_calc(h) )    (5)

w_p is the weight given to the phase restraints, and f is a square-well function with a width equal to the arccosine of the figure of merit (fom(h)) for each reflection. Another possible form of E_xray, which we call the "vector residual", does not use the amplitude residual at all but instead simultaneously restrains the real (A) and imaginary (B) parts of the structure factor [24]. It has the form

E_xray = \sum_{h} fom(h) [ (A_obs(h) - k A_calc(h))^2 + (B_obs(h) - k B_calc(h))^2 ].    (6)
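To make the FFT approximation mentioned above concrete, the following sketch treats the structure factors as Fourier coefficients of an electron density sampled on a grid covering one unit cell. The density array rho and the mapping of Miller indices onto grid indices are assumptions of this illustration; a production implementation would also handle atom-to-grid sampling, anti-aliasing, symmetry, and the overall volume scale.

import numpy as np

def structure_factors_from_grid(rho):
    """Fourier coefficients of a gridded unit-cell density (schematic).

    Crystallographic convention F(h) ~ sum_x rho(x) exp(+2*pi*i h.x), so the
    inverse FFT (which uses exp(+i...)) is rescaled to undo its 1/N factor;
    a constant voxel-volume factor is omitted here.
    """
    return np.fft.ifftn(rho) * rho.size

# usage sketch: F = structure_factors_from_grid(rho)
# reflection (h, k, l) then lives at F[h % n1, k % n2, l % n3]
# for a density grid of shape (n1, n2, n3)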
2.2. The chemical term E_chem
A possible choice of E_chem is an empirical potential energy function [25-30]

E_chem = \sum_{bonds} k_b (r - r_0)^2 + \sum_{angles} k_\theta (\theta - \theta_0)^2 + \sum_{dihedrals} k_\phi \cos(n\phi + d) + \sum_{chiral,planar} k_\omega (\omega - \omega_0)^2 + \sum_{atom\ pairs} ( a r^{-12} + b r^{-6} + c r^{-1} )    (7)
Empirical energy functions were originally developed for energy minimization and molecular dynamics studies of macromolecular structure and function (see [31] for an introduction). The parameters of the empirical potential energy E_chem are inferred from experimental as well as theoretical investigations, in particular vibrational spectroscopy and small-molecule crystallography [25-30]. Since these energy functions were designed for another purpose, it is not surprising that they require some modification for use in crystallographic refinement. For example, empirical energy functions must be extended to simulate crystal contacts between molecules related by crystallographic or non-crystallographic symmetry [22,32]. Empirical energy functions also behave poorly at the high simulation temperatures characteristic of simulated annealing, and they must be modified to cope with the addition of experimental restraints (E_xray). To prevent distortions of aromatic rings, peptide bonds, and chiral centers, certain energy constants (k_\phi, k_\omega) in Eq. 7 often need to be increased [23]. Furthermore, the energy constant for the proline \omega angle can be decreased to enable cis to trans transitions. However, experience has shown [32] that this constant should be set to its original value during the final stages of refinement in order to obtain acceptable geometry around these peptide bonds. Finally, since bulk solvent is usually omitted from refinement, the charged groups of Asp, Glu, Arg and Lys residues have to be screened in order to avoid formation of artificial interactions with backbone atoms. This can be accomplished either by setting the charges to zero [33] or by neutralizing the charged groups [34].

Crystallographic refinement is not very sensitive to the accuracy of the empirical energy function [35]. The electrostatic term in Eq. 7 is sometimes purposely omitted in order to avoid possible bias due to the empirical energy function. Furthermore, one can use a "geometric" energy function consisting of terms for covalent bonds, bond angles, chirality, planarity, and nonbonded repulsion, where the corresponding parameters are derived from equilibrium geometry and root-mean-square (r.m.s.) deviations of bond lengths and angles observed in a small-molecule data base [36]. The differences between a geometric energy function and an empirical energy function mainly affect regions that are not well determined by the experimental information; little difference is observed for well-defined structures. For instance, the r.m.s. difference for backbone atoms between a structure of crambin refined at 2 Å resolution by PROLSQ [16], a program which effectively uses a geometric energy function, and the same structure refined by conjugate gradient minimization using X-PLOR [37] with the CHARMM20 empirical energy function [26] was only 0.05 Å [22].
2.3. Additional restraints and constraints
Additional constraints or restraints may be used to improve the ratio of observables to parameters. For example, atoms can be grouped such that they move as rigid bodies during refinement, or bond lengths and bond angles can be kept fixed [15,38,39]. The existence of non-crystallographic symmetry in a crystal can be used to average over equivalent molecules and thereby reduce noise in the data. This is especially useful for virus structures: non-crystallographic symmetry can be used to "overdetermine" the problem, assisting the primary phasing and the subsequent refinement [40,41,32].

2.4. Weighting
The weight w_xray (Eq. 1) balances the forces arising from E_xray and E_chem. The choice of w_xray can be critical: if w_xray is too large, the refined structure will show unphysical deviations from ideal geometry; if w_xray is too small, the refined structure will not satisfy the diffraction data. Jack and Levitt [14] proposed that w_xray be chosen so that the gradients of E_chem and E_xray have the same magnitude for the current structure. This approach implies that w_xray has to be frequently re-adjusted during the course of the refinement. Brünger et al. [22] developed an empirical procedure for obtaining a value for w_xray that can be kept constant throughout the refinement. It consists of performing a short molecular dynamics simulation with w_xray set to zero and then calculating the final r.m.s. gradient which is due to the empirical energy term E_chem alone. Next one calculates the gradient due to the experimental restraints E_xray alone, and chooses w_xray to balance the two. A recent correction to this procedure is to divide the resulting w_xray by two; this produces optimal phase accuracy as judged by the free R value (Brünger, unpublished data).
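A minimal sketch of this gradient-matching recipe is given below. It assumes that flattened arrays of first derivatives of E_chem and E_xray for the current model (for instance evaluated after the short dynamics run with w_xray set to zero) are already available; the optional halving reflects the free-R-based correction mentioned above.

import numpy as np

def choose_xray_weight(grad_chem, grad_xray, halve=True):
    """Balance the r.m.s. gradients of E_chem and E_xray (schematic)."""
    rms_chem = np.sqrt(np.mean(np.square(grad_chem)))
    rms_xray = np.sqrt(np.mean(np.square(grad_xray)))
    w = rms_chem / rms_xray          # equal force magnitudes in Eq. 1
    return 0.5 * w if halve else w   # optional halving (free R correction)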
2.5. The free R value
The most common measure for the agreement between electron density calculated from a model and the observed diffraction data is the R value, defined as

R = \frac{ \sum_{h} \big|\, |F_obs(h)| - k |F_calc(h)| \,\big| }{ \sum_{h} |F_obs(h)| }    (8)

The R value is closely related to the crystallographic residual E_xray (Eq. 2), which has interesting mathematical properties. It can be shown that E_xray is a linear function of the negative logarithm of the likelihood of the atomic model if one assumes that all observations are independent and normally distributed. Crystallographic refinement consists of minimizing E_xray and thereby maximizing the likelihood of the atomic model. E_xray can be made arbitrarily small, however, by simply increasing the number of model parameters used during refinement. The theory of linear hypothesis tests has been employed in order to decide whether the addition of parameters or the imposition of fixed relationships between parameters results in a significant improvement or a significant decline in the agreement between atomic model and diffraction data [42]. This theory strictly applies to the situation where the restraints can be expressed as holonomic boundary conditions, e.g., fixed bond lengths, and thus does not apply to non-linear restraints such as E_chem (Eq. 7).

Brünger [43,44] proposed the free R (R_free) statistic as a better way to gauge the progress of a refinement. In close analogy to the statistical method of cross-validation,
R_free measures the agreement between data calculated from the atomic model and a subset of the observed diffraction data (the "test" set) that has been omitted from the refinement. Interestingly, R_free is highly correlated with the phase accuracy of the atomic model. In practice, about 10% of the observed diffraction data (chosen at random from the unique reflections) are sequestered into the test set. The size of the test set is a compromise between the desire to minimize statistical fluctuations of R_free and the need to avoid a deleterious effect on the atomic model by omission of too much experimental data. In some cases, fluctuations in R_free cannot be avoided because the test set is too small; this is especially true at low resolution. R_free can be generalized by performing complete cross-validation [45], which minimizes the dependence of R_free on a particular test set by repeating the same refinement with different test sets and combining the results. In a first approximation, R_free is related to Bricogne's likelihood estimation [46]. Although R_free is a reciprocal-space measure, it is obtained from computer simulation in real space; this allows the inclusion of arbitrarily complex restraints such as geometric energy functions. The likelihood estimation is formulated in reciprocal space, which makes it prohibitively difficult to include useful real-space restraints.

The principle of cross-validation requires that the model not be refined against data included in the test set. One cannot therefore compute a free R value from a model that has been refined against all data. However, one can try to remove the model's "memory" of a test set by extensively refining the (already refined) model against all data except this new test set. This is best accomplished using a refinement method capable of producing significant change, such as simulated annealing refinement. In this approximation, R_free is defined as the R value computed for the new test set.

The free R value provides an objective way to optimize the overall weighting between diffraction data and chemical restraints in crystallographic refinements (Eq. 1) [43]. For example, R_free was used to obtain the optimal relative weighting between bond length, bond angle, dihedral angle, and van der Waals restraints in Eq. 7 [36,44]. It was found that the relative distribution of bond lengths and bond angles as it is found in the Cambridge Structural Database, though derived purely from small molecules, is in fact optimal for protein structures as well. The observed deviations of the geometry from ideality were surprisingly small [44] (0.008 Å and 1° for bond lengths and bond angles, respectively).
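The cross-validation bookkeeping can be sketched as follows. The amplitude arrays, the scale factor k, and the 10% test fraction are assumptions of the illustration; the R value formula follows Eq. 8.

import numpy as np

def r_value(f_obs, f_calc, k):
    """R value of Eq. 8 for arrays of observed and calculated amplitudes."""
    return np.sum(np.abs(f_obs - k * f_calc)) / np.sum(f_obs)

def split_reflections(n_reflections, test_fraction=0.1, seed=0):
    """Sequester roughly 10% of the unique reflections into a test set."""
    rng = np.random.default_rng(seed)
    test = rng.random(n_reflections) < test_fraction
    return ~test, test                     # working set, test set

# usage sketch:
# work, test = split_reflections(len(f_obs))
# r_work = r_value(f_obs[work], f_calc[work], k)   # drives the refinement
# r_free = r_value(f_obs[test], f_calc[test], k)   # monitors overfitting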
3. SIMULATED ANNEALING REFINEMENT
Annealing denotes a physical process wherein a solid is heated until all particles randomly arrange themselves in the liquid phase, and then slowly cooled so that all particles arrange themselves in the lowest energy state. By formally identifying the target E (Eq. 1) with the potential energy of the system, "simulated" annealing can be carried out [7]. Simulated annealing is an approximation algorithm: there is no guarantee that it will find the global minimum (except asymptotically) [8]. Compared to gradient descent methods, where search directions must follow the gradient, simulated annealing achieves better solutions by allowing motion against the gradient as well [7]. The likelihood of counter-gradient motion is determined by a control parameter referred to as "temperature": the higher the temperature, the more likely the optimization will overcome barriers. It should be noted that the simulated annealing temperature normally has
no physical meaning and merely determines the likelihood of overcoming barriers of the target function. The simulated annealing algorithm requires a generation mechanism to create a Boltzmann distribution at a given temperature T,

B(q_1, ..., q_n) = \exp\!\left( \frac{ -E(q_1, ..., q_n) }{ k_b T } \right)    (9)
where E is given by Eq. 1, k_b is the Boltzmann constant, and q_1, ..., q_n are adjustable parameters. Simulated annealing also requires an annealing schedule, that is, a sequence of temperatures T_1 > T_2 > ... > T_l at which the Boltzmann distribution is computed. Implementations of the generation mechanism differ in the way they generate a transition or "move" from one set of parameters to another which is consistent with the Boltzmann distribution at a given temperature. The two most widely used generation mechanisms are Metropolis Monte Carlo [47] and molecular dynamics [48] simulations. Metropolis Monte Carlo can be applied to both discrete and continuous optimization problems, but molecular dynamics is restricted to continuous problems.

3.1. Monte Carlo
The Metropolis Monte Carlo algorithm [47] simulates the evolution to thermal equilibrium of a solid for a fixed value of the temperature T. Given the current state of the system, characterized by the parameters q_i of the system, a "move" is applied by a shift of a randomly chosen parameter q_i. If the energy after the move is less than the energy before, i.e. \Delta E < 0, the move is accepted and the process continues from the new state. If, on the other hand, \Delta E > 0, then the move may still be accepted with probability

P = \exp\!\left( \frac{ -\Delta E }{ k_b T } \right)    (10)
where k_b is Boltzmann's constant. Specifically, if P is greater than a random number between 0 and 1 then the move is accepted. In the limiting case of T = 0, Monte Carlo is equivalent to a gradient descent method; the only moves allowed are the ones that lower the target function until a local minimum is reached. At a finite temperature, however, Monte Carlo allows uphill moves and hence can cross barriers between local minima. The advantage of the Metropolis Monte Carlo algorithm is its simplicity. A particularly troublesome aspect concerns the efficient choice of the parameter shifts that define the Monte Carlo move. Ideally, this choice should in some way reflect the topology of the search space q_i [8]. In the case of a mono-atomic liquid or gas, for example, the coordinates of the atoms of the gas are essentially uncoupled so that the coordinate shifts can be chosen in random directions. In the case of a covalently connected macromolecule, however, random shifts of atomic coordinates have a high rejection rate: they immediately violate geometric restrictions such as bond lengths and bond angles. This problem can in principle be alleviated by carrying out the Monte Carlo simulation in a suitably chosen set of internal coordinates, or by relaxing the strained coordinates through minimization [49-51].
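The acceptance rule of Eq. 10 can be sketched in a few lines. The energy callable, the step size, and the convention of folding k_b into the temperature scale are assumptions of this illustration.

import numpy as np

def metropolis_step(params, energy, temperature, step_size, rng):
    """One Metropolis Monte Carlo move (temperature given in energy units)."""
    trial = params.copy()
    i = rng.integers(len(params))            # shift one randomly chosen parameter
    trial[i] += step_size * rng.normal()
    delta_e = energy(trial) - energy(params)
    if delta_e < 0 or rng.random() < np.exp(-delta_e / temperature):
        return trial                         # downhill, or accepted uphill, move
    return params                            # uphill move rejected

# usage sketch:
# params = metropolis_step(params, energy, 300.0, 0.05, np.random.default_rng(0))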
3.2. Molecular dynamics
A suitably chosen set of continuous parameters q_i can be viewed as generalized coordinates that are propagated in time by the Hamilton equations of motion [52]

\frac{\partial H(p, q)}{\partial p_i} = \frac{d q_i}{d t}, \qquad -\frac{\partial H(p, q)}{\partial q_i} = \frac{d p_i}{d t}    (11)

where H(p, q) is the Hamiltonian of the system and p_i are the generalized momenta conjugate to q_i. If the generalized coordinates represent the degrees of freedom of a molecular system, this approach is referred to as molecular dynamics [48]. If one makes the assumption that the resulting trajectories cover phase space (or, more specifically, are ergodic) then they generate a statistical mechanical ensemble [53]. Molecular dynamics can be coupled to a heat bath (see below) so that the resulting ensemble asymptotically approaches that generated by the Metropolis Monte Carlo acceptance criterion (Eq. 10). Thus, molecular dynamics and Monte Carlo are in principle equivalent for the purpose of simulated annealing, although in practice one implementation may be more efficient than the other. Recent comparative work (Adams, Rice, & Brünger, in preparation) has shown the molecular dynamics implementation of crystallographic refinement by simulated annealing to be more efficient than the Monte Carlo one.

In the special case that the generalized coordinates q_i represent the Cartesian coordinates of n point masses and, furthermore, that momenta can be separated from coordinates in the Hamiltonian H, the Hamilton equations of motion reduce to the more familiar Newton's second law:

m_i \frac{\partial^2 r_i}{\partial t^2} = -\nabla_i E = F_i.    (12)

The quantities m_i and r_i are respectively the mass and coordinates of atom i, F_i is the force acting on atom i, and E is the potential energy. In the context of simulated annealing, E denotes the target function being optimized (Eq. 1), which contains "physical" energies such as covalent and nonbonded energy terms as well as "non-physical" energies that correlate observed and calculated diffraction data. The solution of the partial differential equations (Eq. 12) is normally achieved numerically using finite-difference methods [48]. Initial velocities are usually assigned from a Maxwell distribution at the appropriate temperature.
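As an illustration of the finite-difference solution of Eq. 12, the sketch below performs one velocity Verlet step. The forces callable (returning -grad E), the array shapes, and the unit system are assumptions of the example, not part of any particular refinement program.

import numpy as np

def velocity_verlet_step(x, v, f, masses, dt, forces):
    """One finite-difference integration step of Eq. 12 (schematic).

    x, v, f: (n_atoms, 3) arrays of positions, velocities and forces;
    masses: (n_atoms,) array; forces(x) must return -grad E at x.
    """
    a = f / masses[:, None]
    x_new = x + v * dt + 0.5 * a * dt**2
    f_new = forces(x_new)
    v_new = v + 0.5 * (a + f_new / masses[:, None]) * dt
    return x_new, v_new, f_new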
3.3. Torsion angle molecular dynamics
Although conventional Verlet-type molecular dynamics places restraints on bond lengths and bond angles, one could conceivably want to implement these restrictions as holonomic constraints. This is supported by the observation that the deviations from ideal bond lengths and bond angles are usually small in X-ray crystal structures. There are essentially two possible approaches to solve Newton's equations (Eq. 12) with holonomic constraints. The first involves a switch from Cartesian coordinates r_i to generalized internal coordinates q_i. Having thus redefined the system, one would solve equations of motion for the generalized coordinates analogous to the Cartesian ones. This formulation has the disadvantage that it is difficult (but not impossible) to calculate the generalized gradients. Since the gradients are functions of the generalized coordinates only, however, conventional finite-difference integration schemes [48] can be used.

A second possible approach is to retain the Cartesian formulation, where the gradient calculations remain relatively straightforward and topology independent [39]. In this formulation, the expression for the acceleration becomes a more complicated function of positions and velocities:

\vec{a} = M^{-1} \vec{C}    (13)
where \vec{a} represents the system acceleration vector, and M and \vec{C} denote the (6 x 6) system inertia matrix and the (6 x 1) generalized force vector, respectively. This does not present insurmountable difficulties, but it does require different integration schemes, such as a fourth-order Runge-Kutta scheme [54]. The equations of motion for constrained dynamics in Cartesian coordinates are derived in complete generality by Bae and Haug [55,56]. A slightly simpler derivation specific for fixed bond lengths and bond angles can be found in [39]. What follows is a simple sketch of one particular implementation of molecular dynamics with holonomic constraints.

Consider two bodies, i and j, connected by a bond of fixed length |h_ij|. We make the assumption that the only allowable relative motion between the two bodies is a rotation about h_ij. Let r_i and r_j locate (with respect to an inertial frame) the centers of mass of bodies i and j respectively, and let s_ij and s_ji define the points of attachment for each body with respect to its center of mass. The position of the center of mass of body j with respect to that of body i is simply r_ij = r_j - r_i. Finally, the scalar q_ij measures the relative angle of rotation about the bond. The assumption that the only allowable relative motion between the two bodies is a rotation about the bond connecting them implies a relationship between the angular velocities of their respective centers of mass measured in an inertial ("lab") frame:

\vec{\omega}_j = \vec{\omega}_i + \hat{h}_{ij} \dot{q}_{ij}    (14)

where \dot{q}_{ij} denotes the time derivative of the relative angle between the two bodies and \hat{h}_{ij} = \vec{h}_{ij} / |\vec{h}_{ij}| is the unit vector along the bond connecting them. The expression for \vec{r}_j can be re-written:

\vec{r}_j = \vec{r}_i + \vec{r}_{ij} = \vec{r}_i + \vec{s}_{ij} + \vec{h}_{ij} - \vec{s}_{ji}.    (15)
This expression can be differentiated and then rearranged, resulting in an expression for the center of mass velocity of body j in terms of that of body i:

\dot{\vec{r}}_j = \dot{\vec{r}}_i + \vec{\omega}_i \times \vec{s}_{ij} + \vec{\omega}_i \times \vec{h}_{ij} - \vec{\omega}_j \times \vec{s}_{ji} = \dot{\vec{r}}_i + \vec{\omega}_i \times \vec{r}_{ij} - \dot{q}_{ij}\, \hat{h}_{ij} \times \vec{s}_{ji}.    (16)
Thus, assuming certain constraints act between atoms or groups of atoms, one can obtain an expression for the velocity of one group in terms of another. This relationship can be differentiated to give a relationship between accelerations, and integrated to give a relationship between positions. The algorithm is recursive, so the equations of motion for two bodies easily extend to many. Our implementation groups atoms into rigid bodies, allowing only torsion-angle motions between bodies. The connectivity of these bodies defines a tree-like topology for the macromolecule, with one arbitrarily chosen body identified as the base (or root). Like any molecular dynamics algorithm, the torsion angle one begins with positions, velocities, and forces for all atoms. Then center of mass positions, velocities, and forces are computed for each rigid group. Starting at the outer ends of the tree topology, each chain is "reduced" one body at a time by solving for the relative acceleration between the tip and the directly inner body. Then the tip's inertial properties are mapped, or aggregated, into the inner body's, resulting in a chain one link shorter. This process continues until an expression for the acceleration of the base body is obtained. After solving for the base's acceleration (which only requires inversion of a 6 x 6 matrix), the aggregation of bodies is reversed. The acceleration of the body "outboard" of the base is determined by the base's acceleration and the relative acceleration between the bodies. This outward expansion is continued until the tree has been completely covered (see [39] for more details). Then a Runge-Kutta integration step updates positions and velocities. Finally, new forces are calculated, and the whole process begins anew. The formalism is a general one: several tree-like topologies can be handled, as can closed topological loops such as those formed by disulfide bonds.

3.4. Temperature control
Simulated annealing requires the control of the temperature during molecular dynamics. The three most commonly used methods are velocity scaling, Langevin dynamics, and temperature coupling. Velocity scaling consists of periodic uniform scaling of the velocities v_i, i.e.,

v_i^{new} = v_i^{old} \sqrt{ \frac{T}{T_{curr}} }    (17)
for all atoms i, where T is the target temperature and T_curr is the current temperature. Langevin dynamics incorporates the influence of a heat bath into the classical equations of motion,

m_i \frac{\partial^2 r_i}{\partial t^2} = -\nabla_i E_{total} - m_i \gamma_i \frac{\partial r_i}{\partial t} + R_i(t)    (18)

where \gamma_i specifies the friction coefficient for atom i, and R_i(t) is a random force. R_i(t) is assumed to be uncorrelated with the positions and velocities of the atoms. It is described by a Gaussian distribution with mean of zero and variance

\langle R_i(t) R_i(t') \rangle = 2 m_i \gamma_i k T \, \delta(t - t')    (19)

where k is Boltzmann's constant and \delta(t - t') is the Dirac delta function.
The temperature coupling method of Berendsen [57] is related to Langevin dynamics except that it does not use random forces and it applies a temperature-dependent scale factor to the friction coefficient:

m_i \frac{\partial^2 r_i}{\partial t^2} = -\nabla_i E_{total} - m_i \gamma_i v_i \left( 1 - \frac{T}{T_{curr}} \right).    (20)
The second term on the right hand side of Eq. 20 represents positive friction if T_curr > T, thus lowering the temperature; it represents negative "friction" if T_curr < T, thus increasing the temperature.
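A small sketch of the velocity-scaling control of Eq. 17 is shown below. The velocity and mass arrays, the unit system, and the value supplied for k_b are assumptions of this illustration; Berendsen coupling would instead damp the velocities only partway toward the target at each step.

import numpy as np

def current_temperature(v, masses, k_b):
    """Instantaneous temperature from the kinetic energy, T = 2 E_kin / (3 n k_b)."""
    e_kin = 0.5 * np.sum(masses[:, None] * v**2)
    return 2.0 * e_kin / (3.0 * len(masses) * k_b)

def rescale_velocities(v, masses, k_b, t_target):
    """Velocity scaling of Eq. 17: multiply all velocities by sqrt(T / T_curr)."""
    t_curr = current_temperature(v, masses, k_b)
    return v * np.sqrt(t_target / t_curr)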
3.5. Annealing schedule
The success and efficiency of simulated annealing depends on the choice of the annealing schedule [58], that is, the sequence of numerical values T_1 > T_2 > ... > T_l for the temperature. Note that multiplication of the temperature T by a factor s is formally equivalent to scaling the target E by 1/s. This applies to both the Monte Carlo and the molecular dynamics implementation of simulated annealing. It is immediately obvious upon inspection of the Metropolis Monte Carlo acceptance criterion (Eq. 10). For molecular dynamics it can be seen as follows. Let E be scaled by a factor 1/s while maintaining a constant temperature during the simulation,

m_i \frac{\partial^2 r_i}{\partial t^2} = -\nabla_i \frac{E}{s}    (21)

E_{kin} = \sum_i \frac{m_i}{2} \left( \frac{\partial r_i}{\partial t} \right)^2 = const.    (22)

This is equivalent to

m_i \frac{\partial^2 r_i}{\partial t'^2} = -\nabla_i E    (23)

E'_{kin} = \sum_i \frac{m_i}{2} \left( \frac{\partial r_i}{\partial t'} \right)^2 = s E_{kin}    (24)

with t' = t / \sqrt{s}, i.e. the kinetic energy, and thus the temperature defined as T = 2 E_{kin} / (3 n k_b), is scaled by s. The equivalence between temperature control and scaling of E suggests a generalization of simulated annealing schedules where, in addition to the overall scaling of E, relative scale factors between or modifications of the components of the target E are introduced, i.e., simulated annealing is carried out with a variable target function. In this case, the annealing schedule denotes the sequence of scale factors or modifications of components of E. A particular example of this type of generalized annealing schedule is the use of a "soft" van der Waals potential during high-temperature molecular dynamics followed by a normal van der Waals potential during the cooling stage [39].
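To make the notion of a pre-defined schedule concrete, the sketch below drives a slow-cooling protocol of the kind discussed in Section 3.7 (fixed temperature decrements between short dynamics segments). The md_segment routine, the starting temperature, and the decrement sizes are assumptions of the illustration, not values prescribed by any particular program.

def slow_cool(state, md_segment, t_start=3000.0, t_final=300.0,
              decrement=25.0, segment_fs=25.0):
    """Run temperature-coupled dynamics segments along a decreasing schedule.

    md_segment(state, temperature, duration_fs) is assumed to advance the
    model by a short temperature-coupled simulation and return the new state.
    """
    t = t_start
    while t > t_final:
        state = md_segment(state, temperature=t, duration_fs=segment_fs)
        t -= decrement
    return state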
3.6. Annealing control
The analogy of simulated annealing with the physical annealing of solids can be more formally expressed by a connection to statistical mechanics. Both Monte Carlo and molecular dynamics simulations can create statistical mechanical ensembles [53]. Approximately, at least, one can use a statistical mechanical language to describe the progress of simulated annealing. For example, changes in the degree of order of the system can be viewed as phase transitions. They can be detected by finding large values of the specific heat c during the simulation,

c = \frac{ \langle (E(t) - \langle E(t) \rangle)^2 \rangle }{ k_b T^2 }    (25)
where the brackets \langle \rangle denote the mean computed over appropriate intervals of the simulation. It has been suggested [7] that the cooling rate be reduced at phase transitions, since the system is then in a critical state where fast cooling might trap it in a meta-stable state. The observed fluctuations in E are relatively small during simulated annealing refinement, however, indicating local conformational changes rather than global phase transitions [33]. Thus, control of the annealing schedule by monitoring c has not yet been attempted, and annealing schedules consisting of a pre-defined sequence of temperatures and modifications of E_chem are used.

3.7. Commonly used annealing schedules
The two early implementations of simulated annealing refinement made use of the equivalent methods of temperature scaling [10] or energy scaling [59]. The influence of the temperature control method, energy term weighting, cooling rate, and duration of the heating stage was studied [33], and it was found that temperature coupling [57] is preferable to velocity scaling since velocity scaling sometimes causes large temperature fluctuations at high temperatures. Temperature coupling also outperformed Langevin dynamics in the context of simulated annealing since the always positive friction of Langevin dynamics tends to slow atomic motions. Slow-cooling protocols (typically 25 K temperature decrements every 25 fsec) produced lower R values than faster-cooling protocols. An alternative to the slow-cooling approach was recently presented [39]. It consists of a constant high temperature molecular dynamics stage at 5000 K over a period of 4 psec, followed by a fast cooling stage at 300 K over a period of 0.1 psec. The more robust torsion angle molecular dynamics algorithm outlined above allowed conformational sampling at much higher temperatures than are possible with conventional unconstrained molecular dynamics.

4. RADIUS OF CONVERGENCE
A number of realistic tests on crambin [22], aspartate aminotransferase [23], myohemerythrin [60], phospholipase A2 [34], thermitase complexed with eglin c [59], and immunoglobulin light chain dimers [61] have shown that simulated annealing refinement starting from initial models (obtained by standard crystallographic techniques) produces significantly improved overall R values and geometry compared to those produced by least-squares optimization or conjugate gradient minimization.
In recent tests [39], arbitrarily "scrambled" models were generated from an initial model of α-amylase inhibitor built using experimental phase information from multiple isomorphous replacement diffraction data [62]. Scrambling of this initial model was introduced by increasingly long molecular dynamics simulations at 600 K computed without reference to the X-ray data. Errors are thereby distributed throughout the structure, and are probably typical of those found in molecular replacement models or in poorly built initial models. In order to compare the power of refinement techniques, a series of these models was refined using two standard methods: conjugate gradient minimization and slow-cooling simulated annealing.
Figure 1. Radius of convergence of conjugate gradient minimization, slow-cooling, and torsion-angle simulated annealing, plotted against the initial backbone RMSD (Å). Convergence is measured by the final backbone atom r.m.s. coordinate deviation to the crystal structure. Thin lines show the result from one conjugate gradient minimization (dashed) or one slow-cooling simulated annealing refinement (solid). The thick dot-dashed line shows the average backbone atom r.m.s. coordinate deviation obtained from ten high-temperature torsion-angle refinements at 5000 K, and the thick solid line shows the backbone atom r.m.s. coordinate deviation achieved by the torsion angle refinement with the lowest free R value.
Results are presented in Fig. 1, which depicts the backbone atom r.m.s. coordinate deviations before and after refinement for a number of different refinement methods. A similar graph for a perfect refinement technique would simply be a straight line along the horizontal axis: no matter how great the initial errors, the result would be in good agreement with the answer. Clearly this is not the case for conjugate gradient minimization, or even for slow-cooling simulated annealing, although slow-cooling simulated annealing is
clearly a more powerful refinement technique than conjugate gradient minimization. For refinements carried out between 5 and 2 Å resolution, slow-cooling simulated annealing can correct backbone atom r.m.s. coordinate deviations of around 1.3 Å. Constant temperature torsion angle refinements (Fig. 1) outperform the slow-cooling protocol on average, dramatically so if one only considers the best model from each series. The torsion angle refinements are able to correct backbone atom r.m.s. coordinate deviations of at least 1.65 Å. Clearly, the backbone atom r.m.s. coordinate deviation is only available if one knows the answer in advance. Fig. 2, however, shows the strong correlation between R_free and backbone r.m.s. coordinate deviations. Thus in practice R_free can be used to consistently identify the best models from a series of refinements.
Figure 2. Free R value (%) vs. backbone atom r.m.s. coordinate deviation (Å) for torsion-angle constant-temperature refinements using α-amylase inhibitor [62] as a test case.
Simulated annealing has made crystallographic refinement more efficient by automatically moving sidechain atoms by more than 2 Å, changing backbone conformations, or flipping peptide bonds without direct human intervention. Figure 3 shows a representative case where simulated annealing refinement has essentially converged to a manually refined structure of the enzyme aspartate aminotransferase [23]. The imidazole ring of the histidine sidechain has undergone a 90° rotation around the χ_1 bond during simulated annealing refinement. This rotation was accompanied by significant structural changes of the backbone atoms, and resulted in convergence of the refined structure to the manually refined structure. These conformational changes were not accomplished by conjugate gradient minimization without re-building. Large rigid-body-like corrections of up to 10° through simulated annealing refinement were observed by Gros et al. [59].
Figure 3. The segment consisting of residues Cys-192 and His-193 of the 2.8 Å resolution structure of a single site mutant of aspartate aminotransferase [23]. Superimposed are the initial structure (dotted lines) obtained by fitting the atomic model to a multiple isomorphous replacement map, the structure obtained after several cycles of rebuilding and restrained least-squares refinement (thick lines), the structure obtained after simulated annealing refinement (thin lines), and the structure obtained after conjugate gradient minimization (dashed lines).
Simulated annealing refinement is most useful when the initial model is relatively crude. Given a well-refined model, it offers little advantage over conventional methods, with the possible exception of providing information about the accuracy and conformational variability of the refined structure [63]. However, when only a crude model is available, simulated annealing refinement is able to greatly reduce the amount of human intervention required. The initial model can be as crude as one that is obtained by automatic building based on Ca positions alone [64]. In spite of the success of simulated annealing refinement, the importance of manual inspection of the electron density maps after simulated annealing refinement cannot be over-emphasized. This is essential for the placement of surface sidechains and solvent molecules, for example, and for checking regions of the protein where large deviations from idealized geometry occur. Figure 4 illustrates a problem that occurred during simulated annealing refinement of influenza virus hemagglutinin [32]. A poorly defined tryptophan sidechain moved into strong density belonging to N-linked carbohydrate that was not included in the first round of simulated annealing refinement (Fig. 4a). Simulated annealing can move atoms far enough to at least partially compensate for missing parts of the model. The model was manually rebuilt in this region, and the missing carbohydrate added. In subsequent rounds of simulated annealing, proper model geometry and fit to
Figure 4. Simulated annealing refinement can move atoms far from their initial positions to compensate for missing model atoms. Residue Trp 222 of influenza virus hemagglutinin, which has weak sidechain density, moved into strong N-linked carbohydrate density in the first round of simulated annealing refinement, before carbohydrate was added to the model. The electron density maps were computed from (2F_obs - F_calc) amplitudes using F_calc phases corresponding to the atomic models at 3.0 Å resolution. They show density corresponding to missing atoms as well as the current model. The electron density maps are displayed as a "chicken wire" which represents the density at a constant level of one standard deviation above the mean. The maps have been averaged about the 3-fold non-crystallographic symmetry axis. (a) Electron density and coordinates after round 1, showing missing density for Trp 222. (b) As in (a), showing properly built N-linked carbohydrate and Trp 222.
the electron density map were maintained (Fig. 4b). Simulated annealing refinement can produce R values in the twenties for partially incorrect structures. For example, after refinement of the protease from human immunodeficiency virus HIV-1, a partially incorrect structure [65] produced an R value of 0.25 whereas the correct structure produced an R value of 0.184 [66] with comparable geometry.

5. DIRECTIONALITY OF SIMULATED ANNEALING REFINEMENT
The goal of any optimization problem is to find the global minimum of a target function. In the case of crystallographic refinement, one searches for the conformation or conformations of the molecule that best fit the diffraction data while at the same time maintaining reasonable covalent and non-covalent interactions. As the above examples have shown, simulated annealing refinement has a much larger radius of convergence than
gradient descent methods. It must therefore be able to find a lower minimum of the target E (Eq. 1) than the local minimum found by simply moving along the negative gradient of E. Paradoxically, the very reasons which make simulated annealing such a powerful refinement technique (the ability to overcome barriers in the target energy function) would seem to prevent it from working at all. If it so easily crosses barriers, what allows it to stay in the vicinity of the global minimum? The answer lies in the temperature coupling. By specifying a fixed kinetic energy, the system essentially gains a certain inertia which allows it to cross energy barriers. The target temperature must be low enough, however, to ensure that the system will not "climb out" of the global minimum if it manages to arrive there. While temperature itself is a global parameter of the system, temperature fluctuations arise principally from local conformational transitions, for example from an amino acid sidechain falling into the correct orientation. These local changes tend to lower the value of the target E, thus increasing the kinetic energy, and hence the temperature, of the system. Once the temperature coupling has removed this excess kinetic energy, the reverse transition is very unlikely, since it would require a localized increase in kinetic energy where the conformational change occurred in the first place. Temperature coupling maintains a sufficient amount of kinetic energy to allow local conformational corrections, but does not supply enough to allow escape from the global minimum. This explains the directionality of simulated annealing refinement, i.e., on average the agreement with the data will improve rather than get worse. It also explains the occurrence of small spikes in E during the simulated annealing process [33]. If the temperature of the simulated annealing refinement is too high, the system can get out of control due to numerical instabilities, eventually resulting in unreasonably large conformational changes. By suppressing high-frequency bond vibrations, torsion angle dynamics has significantly reduced the potential for this to happen. In fact, we are now able to use much higher temperatures than previously possible with the conventional (flexible bond and bond angle) molecular dynamics implementation [39].
6. SIMULATED ANNEALING OMIT MAPS
Simulated annealing refinement is usually unable to correct very large errors in the atomic model or to correct for missing parts of the structure. The atomic model needs to be corrected by inspection of a difference Fourier map. In order to improve the quality and resolution of the difference map, the observed phases are often replaced or combined with calculated phases as soon as an initial atomic model has been built. These combined electron density maps are then used to improve and to refine the atomic model. The inclusion of calculated phase information brings with it the danger of biasing the refinement process towards the current atomic model. This model bias can obscure the detection of errors in atomic models if sufficient experimental phase information is unavailable. In fact, during the past decade several cases of incorrect or partly incorrect atomic models have been reported where model bias may have played a role [67]. Difference maps phased with simulated annealing refined structures often show more details of the correct chain trace [23]. However, the omission of some atoms from the computation of a difference map does not fully remove phase bias towards those atoms if
they were included in the preceding refinement. More precisely, small re-arrangements of the included atoms can bias the phases towards the omitted atoms [68]. Thus, the structure needs to be re-refined with the questionable region omitted before the difference map can be computed. Simulated annealing is a particularly powerful tool for removing model bias [69]. The improved quality of simulated annealing refined omit maps has been used to bootstrap about 50% of missing portions of an initial atomic model of a DNase-actin complex [70]. It should be noted that this is a rather extreme case for the amount of omitted atoms; usually, omit maps are computed with about 10% of the atoms omitted. In general, the improvement of the electron density map achieved in simulated annealing refinement is a consequence of conformational changes distributed throughout the molecule. This reflects the fact that the first derivatives of the crystallographic residual (Eq. 1) with respect to the coordinates of a particular atom depend not only on the coordinates of that atom and its neighbors but also on the coordinates of all other atoms, including solvent atoms, in the crystal structure.

7. REFINEMENT WITH PHASE RESTRAINTS
As shown in Fig. 1, simulated annealing has a large radius of convergence. The use of torsion angle molecular dynamics combined with a repeated high-temperature annealing schedule significantly increased the radius of convergence compared to a slow-cooling protocol at relatively high resolution. For refinements at 5 to 3 Å resolution, no significant extension of the radius of convergence was observed [39]. In fact, convergence in this resolution range can be poor: the limited resolution drastically reduces the number of reflections (observables), and can result in a severely underdetermined problem. This adverse observable to parameter ratio can be improved using experimental phase information, for example phases obtained from multiple isomorphous replacement diffraction data. Using phase restraints (Eq. 5) improves the radius of convergence somewhat, while the vector residual (Eq. 6) shows a significantly increased radius of convergence [39]. Figure 5 summarizes convergence for refinements at 5 to 3 Å resolution, again using backbone atom r.m.s. coordinate deviations from the crystal structure as a measure of convergence. As for high resolution refinements, torsion angle refinements consistently outperform slow-cooling (Fig. 5). These refinements were performed without cross-validation because of the adverse parameter to observable ratio, which makes R_free prone to larger fluctuations. There is, however, a strong correlation between the R value and backbone atom r.m.s. coordinate deviations. This is in general not the case, as R values for medium resolution refinements performed without phase information do not distinguish as well between good and bad models.

8. CONCLUSIONS
Simulated annealing has greatly improved the efficiency of crystallographic refinement. However, simulated annealing refinement alone is still insufficient to automatically refine a crystal structure without human intervention. Thus, crystallographic refinement of macromolecules proceeds in a series of steps, each of which consists of simulated annealing or minimization of E (Eq. 1) followed by re-fitting the model structure to difference electron density maps with interactive computer graphics [64]. During the final stages of
Figure 5. Convergence of medium resolution (a) slow-cooling (starting at 5000 K) and (b) constant temperature (10,000 K) torsion angle refinements against the vector residual (Eq. 6) of increasingly worse models for α-amylase inhibitor [62], plotted against the initial backbone RMSD (Å). Convergence is measured by backbone atom r.m.s. coordinate deviations from the crystal structure, with lowest values shown in solid lines, average values in long-dashed lines, and highest values in short-dashed lines. The best models can be identified by a low R value [39]. Averages were calculated over ten refinements for every model.
refinement, solvent molecules are usually included and alternate conformations for some atoms or residues in the protein may be introduced. With currently available computing power, tedious manual adjustments using computer graphics to display and move positions of atoms of the model in the electron density maps represent the rate-limiting step in the refinement process. The availability of high-performance computers has opened the door for "experimentation" with new mathematical techniques such as simulated annealing. We can expect that with ever increasing computing power and the application of novel mathematical algorithms the crystallographic structure determination process for macromolecules could be fully automated in the near future.

REFERENCES
1. W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling, Numerical Recipes, Cambridge University Press, Cambridge, 1986.
2. H.A. Hauptman, Physics Today, 42 (1989) 24.
3. D.M. Blow and F.H.C. Crick, Acta Cryst., 12 (1959) 794.
4. W. Hoppe, Acta Cryst., 10 (1957) 750.
5. M.G. Rossmann and D.M. Blow, Acta Cryst., A15 (1962) 24.
6. W.I. Weis, R. Kahn, R. Fourme, K. Drickamer, and W.A. Hendrickson, Science, 254 (1991) 1608.
7. S. Kirkpatrick, C.D. Gelatt, and M.P. Vecchi, Jr., Science, 220 (1983) 671.
8. P.J.M. Laarhoven and E.H.L. Aarts (eds.), Simulated Annealing: Theory and Applications, Dordrecht: D. Reidel Publishing Company, 1987.
9. M.E. Johnson, American Journal of Mathematical and Management Science, 8 (1988) 205.
10. A.T. Brünger, J. Kuriyan, and M. Karplus, Science, 235 (1987) 458.
11. W.A. Hendrickson and K. Wüthrich, Macromolecular Structures 1991, Atomic Structures of Biological Macromolecules Reported During 1990, Current Biology Ltd., London, 1991.
12. W.A. Hendrickson and K. Wüthrich, Macromolecular Structures 1992, Atomic Structures of Biological Macromolecules Reported During 1991, Current Biology Ltd., London, 1992.
13. W.A. Hendrickson and K. Wüthrich, Macromolecular Structures 1993, Atomic Structures of Biological Macromolecules Reported During 1992, Current Biology Ltd., London, 1993.
14. A. Jack and M. Levitt, Acta Cryst., A34 (1978) 931.
15. J.L. Sussman, S.R. Holbrook, G.M. Church, and S.-H. Kim, Acta Cryst., A33 (1977) 800.
16. J.H. Konnert and W.A. Hendrickson, Acta Cryst., A36 (1980) 344.
17. W.A. Hendrickson, Meth. Enzymol., 115 (1985) 252.
18. D.E. Tronrud, L.F. Ten Eyck, and B.W. Matthews, Acta Cryst., A43 (1987) 489.
19. J. Ibers and W.C. Hamilton (eds.), International Tables for X-ray Crystallography, International Union of Crystallography, The Kynoch Press, Birmingham, 1974.
20. L.F. Ten Eyck, Acta Cryst., A29 (1973) 183.
21. A.T. Brünger, Acta Cryst., A45 (1989) 42.
22. A.T. Brünger, M. Karplus, and G.A. Petsko, Acta Cryst., A45 (1989) 50.
23. A.T. Brünger, J. Mol. Biol., 203 (1988) 803.
24. E. Arnold and M.G. Rossmann, Acta Cryst., A44 (1988) 270.
25. S. Lifson and P. Stern, J. Chem. Phys., 77 (1982) 4542.
26. B.R. Brooks, R.E. Bruccoleri, B.D. Olafson, D.J. States, S. Swaminathan, et al., J. Comput. Chem., 4 (1983) 187.
27. G. Nemethy, M.S. Pottle, and H.A. Scheraga, J. Phys. Chem., 87 (1983) 1883.
28. J. Hermans, H.J.C. Berendsen, W.F. van Gunsteren, and J.P.M. Postma, Biopolymers, 23 (1984) 1513.
29. L. Nilsson and M. Karplus, J. Comp. Chem., 7 (1986) 591.
30. S.J. Weiner, P.A. Kollman, D.T. Nguyen, and D.A. Case, J. Comp. Chem., 7 (1986) 230.
31. M. Karplus and G.A. Petsko, Nature, 347 (1990) 631.
32. W.I. Weis, A.T. Brünger, J.J. Skehel, and D.C. Wiley, J. Mol. Biol., 212 (1989) 737.
33. A.T. Brünger, A. Krukowski, and J. Erickson, Acta Cryst., A46 (1990) 585.
34. M. Fujinaga, P. Gros, and W.F. van Gunsteren, J. Appl. Cryst., 22 (1989) 1.
35. M. Hahn and U. Heinemann, Acta Cryst., D49 (1993) 468.
36. R.A. Engh and R. Huber, Acta Cryst., A47 (1991) 392.
37. A.T. Brünger, X-PLOR, Version 3.1. A System for X-ray Crystallography and NMR, Yale University Press, New Haven, 1992.
38. R. Diamond, Acta Cryst., A27 (1971) 436.
39. L.M. Rice and A.T. Brünger, Proteins: Structure, Function, and Genetics, 19 (1994) 277.
40. J.N. Champness, A.C. Bloomer, G. Bricogne, P.J.G. Butler, and A. Klug, Nature, 259 (1976) 20.
41. M.G. Rossmann, E. Arnold, J.W. Erickson, E.A. Frankenberger, J.P. Griffith, et al., Nature, 317 (1985) 145.
42. W.C. Hamilton, Acta Cryst., 18 (1965) 502.
43. A.T. Brünger, Nature, 355 (1992) 472.
44. A.T. Brünger, Acta Cryst., D49 (1993) 24.
45. J.-S. Jiang and A.T. Brünger, J. Mol. Biol., 243 (1994) 100.
46. G. Bricogne, Acta Cryst., A40 (1984) 410.
47. N. Metropolis, M. Rosenbluth, A. Rosenbluth, A. Teller, and E. Teller, J. Chem. Phys., 21 (1953) 1087.
48. L. Verlet, Phys. Rev., 159 (1967) 98.
49. M. Saunders, J. Am. Chem. Soc., 109 (1987) 3150.
50. Z. Li and H.A. Scheraga, Proc. Natl. Acad. Sci. USA, 84 (1987) 6611.
51. R. Abagyan and P. Argos, J. Mol. Biol., 225 (1992) 519.
52. H. Goldstein, Classical Mechanics, 2nd ed., Addison-Wesley Pub. Co., Reading, Massachusetts, 1980.
53. D.A. McQuarrie, Statistical Mechanics, Harper & Row, New York, 1976.
54. M. Abramowitz and I. Stegun, Handbook of Mathematical Functions, Applied Mathematics Series, vol. 55, Dover Publications, New York, 1968.
55. D.-S. Bae and E.J. Haug, Mech. Struct. & Mach., 15 (1987) 359.
56. D.-S. Bae and E.J. Haug, Mech. Struct. & Mach., 15 (1988) 481.
57. H.J.C. Berendsen, J.P.M. Postma, W.F. van Gunsteren, A. DiNola, and J.R. Haak, J. Chem. Phys., 81 (1984) 3684.
58. D.G. Bounds, Nature (London), 329 (1987) 215.
59. P. Gros, M. Fujinaga, B.W. Dijkstra, K.H. Kalk, and W.G.J. Hol, Acta Cryst., B45 (1989) 488.
60. J. Kuriyan, A.T. Brünger, M. Karplus, and W.A. Hendrickson, Acta Cryst., A45 (1989) 396.
61. Z.-B. Xu, C.-H. Chang, and M. Schiffer, Protein Engineering, 3 (1990) 583.
62. J.W. Pflugrath, G. Wiegand, R. Huber, and L. Vértesy, J. Mol. Biol., 189 (1986) 383.
63. F.T. Burling and A.T. Brünger, Israel Journal of Chemistry, (1994) in press.
64. T.A. Jones, J.-Y. Zou, S.W. Cowan, and M. Kjeldgaard, Acta Cryst., A47 (1991) 110.
65. M.A. Navia, P.M.D. Fitzgerald, B.M. McKeever, C.-T. Leu, J.C. Heimbach, et al., Nature, 337 (1989) 615.
66. A. Wlodawer, M. Miller, M. Jaskólski, B.K. Sathyanarayana, E. Baldwin, et al., Science, 245 (1989) 616.
67. C.I. Branden and A. Jones, Nature, 343 (1990) 687.
68. R.J. Read, Acta Cryst., A42 (1986) 140.
69. A. Hodel, S.-H. Kim, and A.T. Brünger, Acta Cryst., A48 (1992) 851.
70. W. Kabsch, H.G. Mannherz, D. Suck, E.F. Pai, and K.C. Holmes, Nature, 347 (1990) 37.
Chapter 13
Multi-dimensional searches in macromolecular X-ray crystallography

S. Subbiah

Department of Structural Biology, Stanford University Medical School, Stanford, California 94305, United States of America

It is often the case in the X-ray crystallographic studies of biological macromolecules that only noisy or insufficient experimental data is available. If an approximation of the expected macromolecular structure is available beforehand, the situation can be remedied without recourse to further, more complete or more accurate data collection. However, the remedy requires that the independently available rough model be correctly oriented with respect to the crystal axes. In principle, the formulation of this orientation problem involves exhaustive search calculations in vast multi-dimensional spaces. In practice, such enormous calculations cannot be done with present-day computers. However, simulated annealing strategies can overcome such limitations. This article will focus on such strategies.
1. INTRODUCTION
This article will describe the successful use of simulated annealing techniques to overcome two difficult types of massive search problems in modern X-ray crystallographic studies of large biomolecules. With this in mind, a brief but sufficient overview of the current practice of biological X-ray crystallography will first be presented. Then two separate search problems will be discussed. Next, a simulated annealing solution to these problems will be described. Finally, the successful application of the developed methodology will be demonstrated with the help of real-life examples.
1.1. Overview of macromolecular X-ray crystallography
The typical biological macromolecule of structural interest, be it protein, DNA or RNA, is on the order of 10 to 100 A in its dimensions. Despite these relatively large sizes, much of what makes these molecular structures biologically interesting is at the atomic scale - 1 to 2A. To date, as there are no X-ray microscopes that function at this resolution, only two practical ways remain for obtaining 3-dimensional atomic models (or equivalently, electron density distributions) of biological macromolecules. While nuclear magnetic resonance (NMR) is making significant contributions in this regard, X-ray crystallography has long been the method of choice. In order to begin this time-consuming and arduous task that has no guarantee of eventual success, one needs to have good crystals of the desired macromolecule. (Since most crystals have been of proteins, the remainder of this article will consider protein crystallography to be representative of all biological crystallography.) Typically,
protein crystals are obtained only after a combination of much effort and a little serendipity. They are then subjected to an intense beam of X-rays. These X-rays are scattered by the many ordered copies of the protein that make up the crystal lattice. The scattered rays form a regular pattern of diffracted spots that is recorded on either film or another suitable electronic detector. The individual intensities associated with each of the diffracted X-ray spots correspond directly to the amplitudes (or magnitudes) of individual, unique Fourier terms of a Fourier series in 3 dimensions. Mathematically speaking, the observed diffraction pattern is a simple Fourier transform of the 3n Euclidean real-space atomic coordinates - x, y, z - of the n atoms comprising each protein molecule. Since each term in a Fourier series is defined by two variables - the amplitude, F, and the phase, φ - knowing the values of F and φ for all theoretically possible diffracted spots is exactly equivalent to knowing all 3n of the unknown coordinates of the n protein atoms. Although in theory a perfect crystal can generate an infinite number of diffracted spots, in practice the imperfect protein crystals generate only a limited number of readily detectable Fourier diffraction spots. This practical limitation in Fourier space, which is also known as reciprocal space, reciprocally translates to a corresponding lack of precision in defining the exact coordinates of the protein atoms in real space. This imprecision, and thus the implicit real-space resolving power of the available crystal diffraction data, is routinely quantified by assigning a resolution limit (in A) to the data. By extension, this quantitative limit applies equally to any atomic model resulting from this data. Typical high-quality atomic models of proteins can be obtained to better than 2A resolution, while anything less than 3.5A resolution is unlikely to result in more than a sketchy atomic skeleton of the protein's backbone. Elements of protein secondary structure appear only at resolutions higher than 5 A (i.e. higher resolution meaning smaller numerical values in A). When Fourier data is not available beyond 5 A, only macromolecular electron density envelopes, corresponding to a coarse van der Waals surface about the whole protein, can ultimately be obtained. Assuming that atomic-resolution diffraction is observed, there is a crucial hurdle to be crossed before the available reciprocal space experimental data can be turned into a real-space electron density map and subsequently interpreted to create the desired atomic model. The recorded diffraction spots - also known as Fourier reflections - can be readily converted to their associated Fourier amplitudes, F. On the other hand, owing to fundamentally insurmountable limitations in X-ray recording technology, the phase, φ, associated with each reflection is forever lost. However, the standard inverse Fourier transform that relates reciprocal space to real space depends on knowing both the amplitudes and phases of each reflection. Thus it cannot be used to compute the desired real-space electron density distribution. Nevertheless, in practice it is possible to overcome this hurdle - the so-called phase problem - by a tedious and time-consuming, hit-or-miss experimental procedure called the heavy-atom method. Today, assuming that crystals are already available, this is the major obstacle to overcome.
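As a concrete, if highly simplified, illustration of the amplitude/phase decomposition just described, the sketch below evaluates one Fourier term of a point-atom model. The function name is illustrative, and thermal factors, crystal symmetry and anomalous scattering are all neglected; it is a sketch of the general relationship, not of any particular crystallographic program.

```python
import numpy as np

def structure_factor(hkl, frac_xyz, f_atom):
    """One Fourier term F(hkl) = sum_j f_j * exp(2*pi*i*(h*x_j + k*y_j + l*z_j)).

    hkl      -- reflection indices (h, k, l)
    frac_xyz -- (n, 3) array of fractional atomic coordinates
    f_atom   -- (n,) array of atomic scattering factors
    """
    phase = 2.0 * np.pi * (np.asarray(frac_xyz, dtype=float) @ np.asarray(hkl, dtype=float))
    F = np.sum(np.asarray(f_atom, dtype=float) * np.exp(1j * phase))
    # The diffraction experiment records only abs(F); the phase angle is not measured.
    return np.abs(F), np.angle(F)
```

Only the first returned quantity has an experimental counterpart; recovering the second is the phase problem discussed next.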
While it is beyond the scope of this overview to detail the heavy-atom trick, it should be understood that it is a highly-leveraged boot-strap process which at best gives a noisy start on the road to an eventual atomic model. It involves repeating the entire diffraction experiment under many varied sets of perturbing conditions. These heavy-atom induced perturbations are by nature destabilizing to the crystal lattice. Therefore, it is not surprising that heavy-atom derived phases cannot be obtained to the
same high resolution as the amplitudes. So, even when amplitude data is available to better than 2A, the absence of high-resolution phases usually restricts the application of the inverse Fourier transform to 3.5A or worse. Both the error in the far-from-ideal heavy-atom phases and the error in the experimentally measured amplitudes contribute significantly to the overall error in reciprocal space. This inaccuracy manifests itself in real space as noise in the electron density map. The interpretation of these imperfect maps to produce an atomic model of the polypeptide chain is called tracing. Tracing is a labor-intensive process carried out by an expert. Piece by piece, atoms are fitted by hand into the map to create as complete an atomic model as subjective interpretation will allow. Typically, the electron density map is too noisy to produce anything better than a fragmented model that accounts for some 50 to 60% of the atoms expected to be in the macromolecule. Nevertheless, such an incomplete model is a sufficient starting point for the next step, known as crystallographic refinement. Here massive amounts of computer time are spent adjusting the current partial model to better fit the available Fourier amplitude data (N.B. the relatively more error-prone heavy-atom derived phase data is not used in assessing fitness). This fitness criterion is commonly referred to as the R-factor and the object of the refinement step is to reduce this value. The adjusted partial model coordinates are then used to create a new set of phases via a Fourier transform calculation. These model-derived calculated phases are assumed to be a better estimate of the true phases than the original error-prone heavy-atom phases. So these calculated phases, in conjunction with the original experimental Fourier amplitudes, are subjected to an inverse Fourier transformation. The resulting new electron density map typically fits the partial model better. More importantly, it also displays additional interpretable regions of density. This allows further tracing and consequently leads to an embellishment of the model. Moreover, previously mistraced segments can be identified and corrected as well. The entire sequence of map synthesis, interpretation, tracing and model refinement is iteratively repeated while the R-factor steadily decreases. When this boot-strap cycling procedure converges to a minimal value of R, the structure is deemed solved. This final atomic model is said to have a residual discrepancy of R against the experimental Fourier amplitude data. As can be seen, the solution of a crystal structure entails many steps of intensive computation where combinatorial optimization algorithms could be expected to expedite the process. One area where simulated annealing has been beneficial to crystallography is that of the aforementioned crystallographic refinement step. This is described in considerable detail by A. Brunger elsewhere in this book. Another application of simulated annealing is in the real-space search problem of crystallography. This problem arises when the initial electron density map obtained by the heavy-atom method is so poor that no obvious tracing of the polypeptide chain is even partly possible. Typically one expects to see connected tubes of electron density corresponding to main-chain atoms strung along the polypeptide backbone. When the heavy-atom phases are poor, much of this connectivity is lost and the remaining bubbles of isolated density are impossible to interpret.
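The R-factor mentioned above as the refinement fitness criterion is essentially a normalized discrepancy between the observed amplitudes and those calculated from the current model. A minimal sketch of one common convention, assuming the two amplitude lists refer to the same set of reflections and using a single overall least-squares scale factor, is:

```python
import numpy as np

def r_factor(f_obs, f_calc):
    """R = sum| |Fobs| - k*|Fcalc| | / sum|Fobs|, with the overall scale k
    chosen by least squares (k = sum FoFc / sum Fc^2)."""
    f_obs = np.asarray(f_obs, dtype=float)
    f_calc = np.asarray(f_calc, dtype=float)
    k = np.sum(f_obs * f_calc) / np.sum(f_calc ** 2)
    return np.sum(np.abs(f_obs - k * f_calc)) / np.sum(f_obs)
```

Refinement programs differ in weighting and scaling details, but all drive some discrepancy of this general form downward.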
Frequently, when a crude approximation of the expected model is already available from other unrelated sources, this otherwise fatal situation can be overcome.
Such prior models are often available since the repertoire of protein structural motifs is evolutionarily limited and crude folding themes frequently repeat themselves. Moreover, proteins can have more than one domain and these occur in different modular combinations. Thus, the macromolecule under crystallographic investigation could in part contain a domain similar to one whose atomic structure is already known. Additionally, many versions of the same protein are found in different biological species. Almost always they share a very similar overall structure and differ only in their atomic detail. Hence, crudely related structural models are often available to aid the crystallographer who is stuck with an uninterpretable electron density map (Huber, 1965; Colman & Webster, 1985; Reynolds et al., 1985). Once the crude model is correctly fitted into the electron density, the tracing outlined by the model can be used to guide the interpretation of the correct connectivity in the noisy map. However, the correct fitting of the crude model into the map is a non-trivial 6-dimensional problem. The 3 translational and 3 rotational degrees of freedom that can be applied to the center of mass of the model relative to the map constitute a substantial search problem. In practice, this task requires the exhaustive 6-dimensional enumeration of all possible rigid-body fits of the model in the map. Such a brute force enumeration is not possible with current computing resources. Nevertheless, a simulated annealing approach has been successfully applied to overcome this real space search problem in crystallography (Subbiah & Harrison, 1989). This topic will be discussed in the first part of this article. Simulated annealing has also proven useful with the reciprocal space equivalent of the real space search problem. More conventionally known as the molecular replacement problem, this reciprocal space search problem occurs when, at the outset of the crystallographic experiment, a reasonably detailed approximate model of the macromolecule is already available and the intention is to altogether avoid the painful acquisition of heavy-atom phases (Rossmann, 1972). The absence of heavy-atom derived phases differentiates this reciprocal space version of the search problem from its real space analog previously discussed. In fact, to date, the majority of crystal structures are solved without heavy-atom derived phases. Once the atomic structure of a macromolecule of medicinal or industrial interest is initially solved by tedious heavy-atom methodology, the next step is often the rational design of drugs and inhibitors or the design of mutant enzymes by protein engineering. Almost by definition, this implies the solving of crystal structures that are generally only slight modifications of the parent structure. As is to be expected, these instances far outnumber the solution of crystal structures that are entirely new. In these instances, the need for heavy atoms can frequently be side-stepped by performing a reciprocal space search in 6 dimensions - 3 rotational and 3 translational. For each point in this vast 6-dimensional space, the calculated Fourier amplitudes from the suitably rotated and translated model can be compared with the experimental Fourier amplitudes. Such an exhaustive search can in principle give the correct orientation and location of the available approximate model in the new crystal.
This allows the calculation of approximate phases for the crystal structure and ultimately leads to an accurate atomic structure. However, such a molecular replacement solution does not always work. This is because, in practice, a truly exhaustive 6-dimensional search is not possible given present day computing resources. So this 6-dimensional problem is routinely split into two far smaller and consecutive 3-dimensional problems - a 3-dimensional
rotation followed by a 3-dimensional translation. The best solution to the so-called rotation function is assumed to reflect the true orientation of the model (Rossmann & Blow, 1962). The model is then kept fixed in this orientation over the course of the subsequent translation function (Rossmann, 1972). This divide-and-conquer technique is based on the assumption that the inverse Fourier transform of typical macromolecular shapes can be linearly decomposed into translational and rotational components. This assumption is only approximately true. So, the best solution of the rotation function can frequently be a false input for the translation function. This is particularly true when the available model is increasingly different from (or only a progressively smaller fragment of) the structure in the crystal. Since computational limitations preclude a full 6-dimensional search, such cases either default to the heavy atom approach or are abandoned. However, simulated annealing can successfully tackle the full 6-dimensional calculation and this is addressed in the second part of this article.
2. THE REAL SPACE SEARCH PROBLEM

2.1. Statement of the problem
The problem is a general one of placing a rigid molecular structure in an electron density map in the most objective fashion. There are a number of different types of such problems in protein crystallography. One particular instance is the placement of small substrate (or equally well an inhibitor) structures in poor electron density maps of protein binding sites. Another case is when homologue structures are used in the initial location of new protein structures within the unit cell of poor electron density maps. The location of a known structure within a poor electron density map from a different space group is yet another variant. The size, C, of all such search problems is determined by the following parameters: (1) the dimensions of the electron density map to be searched, a, b and c (often this is no more than the dimensions of the asymmetric unit); (2) the resolution of the available map, k; and (3) the number of atoms in the search object, n. Ideally, a complete 6-dimensional search would involve an inner exhaustive translational search for every rotational setting. At each such configuration of the search object some measure of the quality of fit has to be calculated. Given the limited computer resources available, this measure is often simply the sum of the electron density at each of the n atomic locations for any given configuration of the search object,

E = Σᵢ₌₁ⁿ (electron density at atom i)

Each such determination of electron density corresponding to a single atom can be viewed as a basic computational operation (b.c.o.). The size of the crystallographic search problem, C, can then be estimated in terms of the number of b.c.o. required.
(1) Using the standard 1/3 resolution criterion for interpolation, the number of translational configurations is given by

T ≈ (abc)(3/k)³
Figure 1. Given a resolution k and using the standard 1/3 resolution criterion for the choice of grid spacing of an electron density map, the arc length = rθ formula can be used to estimate the required angular interval Δθ₁ for a real-space rotational search. With a the length of an edge of the map to be searched, (a/2)Δθ₁ = 1 grid unit = k/3 implies that the number of intervals for a complete search of 2π rad is ≈ 3πa/k. Reprinted from: S. Subbiah and S.C. Harrison, Acta Cryst., 1989, A45, 337-342.
(2) With reference to Figure 1, the number of rotational configurations is given by

W ≈ (2π/Δθ₁)(2π/Δθ₂)(π/Δθ₃) ≈ (1/2)(abc)(3π/k)³

(3) Therefore, the size of the crystallographic search problem is given by

C ≈ nTW ≈ (1/2)π³n(abc)²(3/k)⁶

2.2. Implementation of the simulated annealing solution
First a general outline of the approach will be described. Then the details and parameters particular to the real space search problem will be presented. In general, the minimization of the quantity E(x) over the vast range of the generalized coordinate variable x by simulated annealing proceeds as follows.
(1) A random start point x0 is chosen and E(x0) is computed.
(2) A single small displacement Δx is made and E(x0+Δx) is evaluated.
(3) ΔE = E(x0+Δx) - E(x0) is computed.
(4) If ΔE is negative or equal to 0, the new value of x is accepted: x0 is then replaced by x0+Δx and the process returns to step (2) for another cycle. If ΔE > 0, a random number r lying between 0 and 1 is generated. If r ≤ exp(-ΔE/T), the new value of x is accepted; x0 is replaced by x0+Δx and the process returns to step (2) for another cycle. If r > exp(-ΔE/T), the new value of x is rejected; then, with x0 unaltered, the process returns to step (2) for another cycle.
The cycle outlined above is controlled by the parameter T. The value of T governs the probability of accepting displacements, Δx, that in the short run increase E. At the outset of the annealing procedure T is set to be very much larger than the typical values of ΔE. A large T allows nearly all random displacements to be accepted with little regard to the primary objective of reducing E. However, after a fixed number, Ic, of passes through step (2), the value of T is reduced by some empirically derived factor g. Typically Tnew = g·Told, where g is some fraction between 0 and 1, like 0.9. Then another set of Ic cycles is attempted at this new T. These sets of Ic cycles at decreasing values of T are repeated until T becomes much smaller than the ΔEs encountered. When this annealing procedure appears to have converged, the minimum so obtained (probably a locally minimal value of E) is recorded as a candidate good solution for the desired global minimum. In practice, it is worthwhile to also keep a list of any good solutions that may have been encountered during the entire annealing procedure. Since a priori one does not know the globally minimal value of E, estimates for what constitutes such good solutions have to be empirically determined. Such estimates can be pre-determined by performing a coarse but exhaustive search of the entire solution space prior to the annealing process itself. From this pre-calculation one can get an average, <E>, and a standard deviation, σE, for the solution space. Suitable empirical cutoffs for E can be derived in terms of standard deviation units. The criterion for deciding on these cutoffs is the simple one of keeping the length of the good solution list from being too short to be useful or too long to be practical. Finally, each entry in this list of all encountered good solutions beyond the cutoff is independently refined to its nearest local optimum. Duplicate solutions are removed and any obvious clustering of the remaining solutions is trimmed by pruning to a single representative for each cluster. The parameters of the annealing protocol are largely determined empirically and are characteristic of the general topology of the particular solution space. But, in theory, there is a lower bound on the value of Ic. The expected distance traversed by a random walk of Ic steps in an l-dimensional grid is of order (Ic/l)^(1/2) (Van Kampen, 1981). Thus, to give the annealing scheme an optimal chance of
equilibrating at any temperature, the number of annealing steps, Ic, must be such that (Ic/l)^(1/2) is greater than half the longest edge of the search space.
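A minimal Python sketch of the general minimization loop just outlined is given below. It keeps only the bare Metropolis test and geometric cooling; the cooling-cycle/freezing-cycle bookkeeping (I1, I2) of the specific implementation described later, the good-solution list and the final local optimization are omitted, and all function names and parameter defaults are illustrative.

```python
import math
import random

def anneal(E, random_start, random_step, T0, g=0.98, Ic=1000, n_temps=200):
    """Generic simulated annealing minimizer following the outline above.

    E            -- objective function of the generalized coordinate x
    random_start -- function returning a random starting coordinate x0
    random_step  -- function returning x0 displaced by a small random shift
    T0           -- initial temperature, chosen much larger than typical |dE|
    g            -- cooling factor, Tnew = g * Told
    Ic           -- number of trial displacements attempted at each temperature
    """
    x0 = random_start()
    e0 = E(x0)
    best_x, best_e = x0, e0
    T = T0
    for _ in range(n_temps):               # successive temperature levels
        for _ in range(Ic):                # Ic trial displacements at this T
            x1 = random_step(x0)
            e1 = E(x1)
            dE = e1 - e0
            # Metropolis test: always accept downhill moves, accept uphill
            # moves with probability exp(-dE/T)
            if dE <= 0 or random.random() <= math.exp(-dE / T):
                x0, e0 = x1, e1
                if e0 < best_e:
                    best_x, best_e = x0, e0
        T *= g                             # cool
    return best_x, best_e
```

In practice the caller would supply problem-specific `E`, `random_start` and `random_step` functions and tune T0, g and Ic empirically, as discussed in the text.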
[Figure 2: flow diagram of the simulated annealing protocol - initial search object and random start, small random shift Δx, computation of the new E and ΔE, Metropolis test with acceptance and rejection counters Ia and Ir, cooling and freezing cycles controlled by I1, I2 and the temperature schedule, recording of good fits, termination of annealing, minimization to the nearest minimum, discarding of identical solutions and output of the final list of good solutions.]
Figure 2. A flow diagram illustrating the particular implementation of the simulated annealing protocol for the real-space search problem of crystallography. All variable and parameter names are as described in the text. The inputs to the program include initial values for the various parameters (e.g. g, T0, I1, I2), the coordinates of the search model and a suitably large search map. Reprinted from: S. Subbiah and S.C. Harrison, Acta Cryst., 1989, A45, 337-342.
Furthermore, the value of T0 can be selected by empirically determining the typical values of ΔE and arranging for T0 to be greater than this. A value between 0.9 and 1.0 for g, the rate of cooling, has been empirically shown to give consistently good results in a wide range of problems (Kirkpatrick et al., 1983). All these parameters - T0, g and the ratio by which (Ic/l)^(1/2) exceeds the longest edge of the search space - are largely determined by the particular nature of the solution space and consequently the type of problem that generates this space. Hence, once determined for a given type of problem, like for instance the real space search problem considered here, these parameters are expected to remain relatively constant from one specific example to the next. The particular implementation of the simulated annealing protocol for the real space search problem is outlined in detail in the accompanying flow diagram (Figure 2). The one obvious difference is that the quantity
Exyzθ₁θ₂θ₃ = Σᵢ₌₁ⁿ (electron density at atom i)
is maximized rather than minimized. So, ΔE is taken to be ΔE = -(Enew - Eold). All rotations of the model corresponding to the three Euler angles θ₁, θ₂ and θ₃ are done about the origin of the electron density map before any translation is applied. In all cases, a very coarse search (C ≈ n × 10⁶) was conducted beforehand to estimate both the mean, <E>, and the standard deviation, σE, of the solution space. Prior experience with exhaustive searches suggested 3σE to be a reasonable lower cutoff for the assessment of the goodness of any candidate solution. Accordingly all such solutions encountered along the annealing path were recorded. The particular implementation of the general annealing protocol described earlier involves some modification of the parameter Ic. At each value of T at most I1 steps were allowed. Since all steps result in either the new configuration being accepted or rejected, the sum of the number of acceptances, Ia, and the number of rejections, Ir, must equal I1. When Ia exceeded some pre-set value I2 (arranged to be less than I1), T was decreased and the next round of equilibration was initiated. Such equilibration rounds, where I2 was reached before I1, were deemed cooling cycles. Conversely, when I1 steps were completed without obtaining I2 acceptances a freezing cycle was said to have occurred. The optimal ratio of I1 to I2 was empirically determined to be 2. Similarly, the optimal values for g and T0 were found to be in the range 0.98-0.99 and 2-3σE respectively. Once the annealing was complete, both the converged solution and the list of good solutions (i.e. >3σE) were independently maximized by a standard least-squares maximization procedure to their nearest local maxima. These maxima were all confirmed to be truly independent maxima and duplicate maxima were dropped from the list. This pruned list was accepted as the desired collection of best solutions for any given random start.
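For the real space search the quantity being maximized is thus the sum of map density over the atoms of the rotated and translated model. The sketch below is one way such a score could be evaluated; the z-y-z Euler convention, the orthogonal map grid with its origin at zero, the nearest-grid-point density lookup and the function names are all illustrative assumptions rather than a description of the original program.

```python
import numpy as np

def euler_rotation(t1, t2, t3):
    """Rotation matrix for Euler angles t1, t2, t3 (z-y-z convention assumed)."""
    c1, s1 = np.cos(t1), np.sin(t1)
    c2, s2 = np.cos(t2), np.sin(t2)
    c3, s3 = np.cos(t3), np.sin(t3)
    Rz1 = np.array([[c1, -s1, 0], [s1, c1, 0], [0, 0, 1]])
    Ry  = np.array([[c2, 0, s2], [0, 1, 0], [-s2, 0, c2]])
    Rz3 = np.array([[c3, -s3, 0], [s3, c3, 0], [0, 0, 1]])
    return Rz1 @ Ry @ Rz3

def density_score(atoms, rho, grid_spacing, x, y, z, t1, t2, t3):
    """Sum of map density at the atomic positions of the model rotated about
    the origin and then translated by (x, y, z).

    atoms -- (n, 3) array of model coordinates; rho -- 3-D density map on a
    regular orthogonal grid with spacing grid_spacing and origin at zero."""
    rotated = atoms @ euler_rotation(t1, t2, t3).T + np.array([x, y, z])
    # Nearest-grid-point lookup; trilinear interpolation would be smoother.
    idx = np.rint(rotated / grid_spacing).astype(int)
    idx = np.clip(idx, 0, np.array(rho.shape) - 1)   # stay inside the map
    return rho[idx[:, 0], idx[:, 1], idx[:, 2]].sum()
```

A score of this kind, negated, plays the role of E in the annealing loop sketched earlier.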
2.3. Experiment 1
2.3.1. The problem
The structure of the bromelain-released influenza virus hemagglutinin protein was solved to 3.0 A resolution by Wilson, Skehel & Wiley (1981). This glycoprotein, a trimer of 503 × 3 amino acids, binds to sialic-acid-containing receptors. The crystal structure of a 21-atom sialic acid-hemagglutinin complex has recently been solved to 2.9A (Weis et al., 1988). The placement of the sugar in the binding-site electron density was confirmed by performing a six-dimensional rigid-body search. Here, while the P41 asymmetric unit is 163.2A × 163.2A × 177.4A, the portion of the map that adequately includes the receptor binding site is much smaller - 13.53A × 10.63A × 13.53A. Accordingly, the parameters of the problem were: a = 14, b = 11, c = 14, k = 2.9, n = 21 and C = 6.1 × 10⁸ b.c.o. This search required 6.5 hours of central processor unit (c.p.u.) time on a 10 mip (i.e. millions of instructions per second) machine. There were four good solutions in the top 1.0σE of the distribution. In the top 2.0σE range there were 67 independent good solutions. Of these solutions, the top solutions happened also to make the most chemical sense. However, in principle, this correspondence between the objective best fit and the chemically correct solution need not occur. This solution has since been confirmed by crystallographic refinement and NMR (Weis et al., 1988). To what extent can simulated annealing reproduce these results? How much faster will simulated annealing achieve this?
2.3.2. Results
Table 1 presents the results of five consecutive trials of the annealing protocol starting from five different random positions. The parameters used were <E> = 0.0, σE = 7.5, g = 0.98, T0 = 2σE, I2 = 10 000, I1 = 2I2. The cumulative success of obtaining the pool of good solutions, as ascertained by the exhaustive search above, is indicated in fractional form. Each trial took approximately 3 c.p.u. minutes on a 10 mip machine. Within two trials, the top 1.0σE peaks were attained. After five trials, approximately 75% of all the top 67 2.0σE peaks were obtained. Therefore, in this case, the top 1.0σE solutions were generated with some 80-fold reduction in computer time. Similarly, the top 2.0σE peaks were found in some 25-fold less time.
Table 1
The results of an exhaustive 6-dimensional search of a sialic acid model against an influenza hemagglutinin-sialic acid complex electron density map compared with those from a number of simulated annealing trials. a) The number of good solutions to the exhaustive search lying in the top 1.0, 1.5 and 2.0 σE ranges are presented in the first row. The results of the five different random trials are presented in the same ranges. (These are also expressed as percentages of the ideal results obtained in the exhaustive calculation.) The same data from the five random trials are also presented cumulatively. b) Reprinted from: S. Subbiah and S.C. Harrison, Acta Cryst., 1989, A45, 337-342.
                         Number of peaks
                     Top 1.0σE    Top 1.5σE    Top 2.0σE
Exhaustive Search    4 (100%)     26 (100%)    67 (100%)
Random Trial 1       2 (50%)       7 (27%)     17 (25%)
Random Trial 2       2 (50%)       7 (27%)     14 (21%)
Random Trial 3       0 (0%)        6 (23%)     12 (18%)
Random Trial 4       2 (50%)       9 (35%)     19 (28%)
Random Trial 5       0 (0%)        3 (12%)     12 (18%)
Cumulative (1-2)     4 (100%)     13 (50%)     29 (43%)
Cumulative (1-3)     4 (100%)     17 (65%)     36 (54%)
Cumulative (1-4)     4 (100%)     21 (81%)     45 (67%)
Cumulative (1-5)     4 (100%)     22 (85%)     50 (75%)
2.4. Experiment 2
2.4.1. The problem
The crystal structure of HLA-A2 has been recently determined to 3.5A resolution (Bjorkman et al., 1987). The structure was obtained after averaging electron density maps for two space groups, monoclinic P21 and orthorhombic P212121. The initial orientation was established in the monoclinic form by two independent methods. The relatively poor map was traced by eye, while concurrently an exhaustive computer search was performed using the 101 alpha-carbon locations of the homologous immunoglobulin constant region (i.e. CH3 from human Fc was expected, on grounds of sequence similarity, to have the same fold as the β2-microglobulin domain of HLA-A2) (Bjorkman et al., 1987). The same trace was confirmed. Following this, the relative orientation of the partially traced P21 model was sought in the P212121 electron density map. Packing considerations that suggested a possible relative orientation failed to give clear results. Thereafter an exhaustive search based on a fortuitous choice of a coarse grid using a model of 89 alpha-carbon atoms resulted in the unambiguous discovery of this relative orientation (Bjorkman, personal communication). This search was conducted on a very coarse grid spacing of 3A and required approximately 150 c.p.u. hours on a 10 mip machine. However, a grid spacing of at least 1A would certainly have been necessary if the model had not been in a favorable location relative to the coarse grid spacing used. Such a search could not have been contemplated, as estimates of the required 10 mip processor time would have been on the order of 18 years. The parameters of this hypothetical 18-year search are a = 60, b = 40, c = 60, k = 3, n = 89 and C = 2.9 × 10¹⁸ b.c.o. Can simulated annealing address this problem? If so, then how effectively?
Table 2
The results of a 6-dimensional simulated annealing search of a partially built HLA-A2 model of 89 alpha-carbon atoms against an electron density map of the complete protein in space group P212121. a) An exhaustive search to find the top peak at 10σE would have taken some 18 c.p.u. years on a 10 mip machine. Five simulated annealing searches from five different random starting points all converged onto the correct/top solution. Each trial took some 4 c.p.u. hours on a 10 mip machine and found on the order of 100 good candidate peaks above 6σE. b) Reprinted from: S. Subbiah and S.C. Harrison, Acta Cryst., 1989, A45, 337-342.
                 Number of peaks above 6σE    Includes correct/top peak?
Random Trial 1              65                          Yes
Random Trial 2              87                          Yes
Random Trial 3             126                          Yes
Random Trial 4              96                          Yes
Random Trial 5             110                          Yes
2.4.2. Results
Five trials were attempted from five different random positions, with the following parameters: <E> = 0.0, σE = 32.5, g = 0.98, T0 = 3σE, I2 = 25 000 and I1 = 2I2 (Table 2). In all cases the optimum, at 10σE, corresponding to the true (crystal-packing) solution was attained in approximately 4 hours on a 10 mip machine. Other good solutions were found. However, since the exhaustive search cannot be performed in practice, no detailed assessment of the relative success in attaining these can be made. These results nominally suggest an 88,000-fold increase in speed.
3. THE RECIPROCAL SPACE SEARCH PROBLEM

3.1. Statement of the molecular replacement problem
The problem is simple. Experimental Fourier amplitudes are available. A good polypeptide backbone model is also available. Therefore the expected Fourier amplitudes can be calculated for a given orientation of the model. These can be compared with the experimental Fourier amplitudes and a fitness criterion readily calculated. Since the rotational and translational setting of the model with respect to the crystal axes is not known, the model has to be sequentially placed in all possible rotation/translation settings and the corresponding fitness criterion calculated. In the absence of any significant noise the setting that gives the best fitness is the desired answer. In practice, the asymmetric unit of a typical macromolecular crystal is about 50A in each direction. Available homologue search models cannot be expected to
be perfectly similar to the macromolecule in the crystal. Typically they have root mean square (r.m.s.) errors in the coordinates of their main-chain atoms of 1A or more. Accordingly, invoking Shannon's criterion, the comparison of the Fourier amplitude data between the experimental set and the model-derived set can only be conducted to a resolution of 3A or lower. Hence, at a minimum, this implies that the translational part of the search has to be conducted on a grid spacing Δt of 1A. Therefore, the number of possible translational settings is 50 × 50 × 50 ≈ 10⁵. The same simple geometric consideration discussed with regard to the real space search problem (Figure 1) implies that a molecule whose diameter, d, is on the order of the unit cell dimensions must be sampled at a rotational spacing consistent with the translational grid spacing, Δt. Using a similar derivation as before, this rotational spacing, Δθ, must be on the order of Δt/(d/2) = 1/(50/2) rad ≈ 3 degrees. Thus, the number of possible rotational settings is 360/3 × 360/3 × 180/3 ≈ 10⁶. Therefore, the total number of possible settings is 10⁵ × 10⁶ = 10¹¹. For each of these settings a Fourier transform of the entire atomic model has to be computed. In order to do this at the resolutions considered here, an electron density map over the entire 50 × 50 × 50 = 10⁵ pixel unit cell has to be first generated from the model. Accordingly, this requires at least 10⁵ b.c.o. per setting. Subsequently a Fourier transform of this map needs to be calculated. A Fast Fourier Transform (FFT) of p = 10⁵ pixels requires some p log p = 10⁵ log 10⁵ ≈ 10⁶ b.c.o. So a full 6-dimensional exhaustive search would require at least ≈ 10¹¹ × 10⁶ = 10¹⁷ b.c.o. Thus, even allowing for clever modifications to the standard FFT formalism (Brunger, 1989), the fastest computer chips of today that run at a billion instructions per second would take a few years to complete such a calculation. This conclusion was even more true when practical molecular replacement solutions were first attempted in the 1960's (Rossmann and Blow, 1962). So, the exhaustive 6-dimensional calculation was altogether avoided by taking advantage of an approximation that allows a crude divorcing of the rotational component from the translational component. In general, the optimal solution to the rotation of an incorrectly translated model will not be the same as the correct rotation for the correctly translated model. However, since the searches considered here are to be done relative to a repeating unit cell of a crystal, a special crystal-induced mathematical property known as the Patterson function can significantly alter this general rule (Rossmann & Blow, 1962). While a proper discussion of this function is beyond the scope of this article, it is sufficient to note that this function allows the splitting of the rotation/translation problem. But this decomposition of the problem in Patterson space requires the rotational search to be carried out first, without knowing the correct translation beforehand. Residual noise due to the very approximate nature of the Patterson decomposition ensures that the rotation search produces multiple broad peaks rather than a single sharp peak (Figure 3a). This makes the selection of the correct rotational setting relatively difficult. Nevertheless, the model is then rotated as accurately as possible to the candidate rotational setting. The translation function search is then conducted.
If the selected rotational orientation is indeed the correct one, the translational search should have relatively little noise and the correct translational solution can be expected to clearly stand out (Figure
3b). The contrast between the signal-to-noise situations in the inferior rotational function as compared to the superior translation function is illustrated in Figure 3.
[Figure 3: (a) rotation function signal plotted against a generalized rotation variable and (b) translation function signal plotted against a generalized translation variable, with the true answer marked in each panel.]
Figure 3. Illustration of typical molecular replacement results. a) The relatively smooth rotation function has broad peaks that are difficult to interpret accurately. b) In contrast, the sharp spike in the translation function clearly signals the correct translational solution of a correctly pre-rotated model.
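A common choice for the fitness criterion mentioned in Section 3.1, and the quantity maximized in the simulated annealing implementation described below, is a scaling-independent linear correlation coefficient between the observed and the model-derived Fourier amplitudes. A minimal sketch, assuming the two amplitude lists have already been paired reflection by reflection, is:

```python
import numpy as np

def amplitude_correlation(f_obs, f_calc):
    """Scaling-independent linear correlation coefficient between observed
    and calculated Fourier amplitudes for a common set of reflections."""
    f_obs = np.asarray(f_obs, dtype=float)
    f_calc = np.asarray(f_calc, dtype=float)
    do = f_obs - f_obs.mean()
    dc = f_calc - f_calc.mean()
    return np.sum(do * dc) / np.sqrt(np.sum(do ** 2) * np.sum(dc ** 2))
```

Because the mean and overall scale of the calculated amplitudes cancel out, this score can be compared directly across different rotation/translation settings of the model.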
Thus, the Achilles' heel of traditional molecular replacement is the uncertainty in the accurate deduction of the exact rotational setting from the broad rotation function peaks. In practice, this uncertainty is further compounded by any error or incompleteness present in the search model itself. Although models that are less than 70% complete in their polypeptide backbone have been known to succeed, a higher degree of completeness is generally required (Rossmann, 1972). Further, molecular replacement seldom succeeds when the error in the coordinates of the main chain atoms is more than 2A r.m.s. But to be relatively confident of success the model error has to be closer to 1A r.m.s. than 2A r.m.s. On the other hand, if a full 6-dimensional search can be performed, the signal-to-noise problems associated with the rotation function can be avoided by effectively seeking the stronger signal normally associated with the translation of a correctly rotated model. In principle, this should allow more error, beyond the traditional 1 to 2A, to be tolerated in the search model. In order to exploit such a benefit, the full 6-dimensional search has to be addressed. It will be shown that simulated annealing can effectively accomplish this.

3.2. Implementation of the simulated annealing solution
The specific implementation of the simulated annealing approach was very similar to that for the real space search problem discussed earlier. The particular differences, as implemented in the program OBELIX, follow:
(1) Instead of the previous Exyzθ₁θ₂θ₃, the standard scaling-independent correlation coefficient, rxyzθ₁θ₂θ₃, was maximized. Correspondingly, ΔE was replaced by Δr = -(rnew - rold).
(2) All rotations and translations were conducted in the same coordinate system as before, with the rotations always conducted about the origin prior to any translation.
(3) As before, a prior coarse search of C ≈ n × 10⁶ was carried out to estimate the mean, <r>, and standard deviation, σr, of the solution space.
(4) Empirical observations with exhaustive searches suggested 2σr to be a reasonable cutoff for the assessment of the goodness of any candidate solution.
(5) The scheduling of the annealing protocol, including the notion of cooling and freezing cycles, was as before. The previous ratio of I1 to I2 of 2 and the optimal values of g of 0.98-0.99 were found to be adequate. The value of T0 was a little lower, at about 2σr.
(6) The steepest descent maximization and the subsequent pruning of the list of good solutions (i.e. >2σr) were as before.

3.3. Experiment 1
3.3.1. The problem
This test case was designed to address the question of the benefits of simulated annealing over conventional molecular replacement when considering
progressively poorer starting models. To this end, a real-life example using real experimental Fourier data was searched with a series of atomic models that were progressively deformed from the correct one. The test case was that of the N-terminal 63 residues (i.e. n = 63) of the bacteriophage 434 repressor protein, 1r69 (Mondragon et al., 1989). High-quality Fourier amplitude data to 2A had been collected about five years prior to the eventual structure determination. Volume estimates had strongly suggested that the P212121 crystals contained only one molecule in the 16.4A × 18.8A × 44.6A asymmetric unit. Since a crude and very sketchy partial model (~55 residues) was already available from an unrelated structure determination to 6A, traditional molecular replacement with this starting model was attempted (Anderson, Ptashne & Harrison, 1985). All attempts over the course of the five years failed. The structure was eventually solved by traditional molecular replacement. However, a different starting model obtained from an unrelated 2A crystal structure determination of a closely related homologue, the 434 cro protein (~60% sequence identity), had to be used (Mondragon, Wolberger & Harrison, 1989). After a final 2A model with excellent stereochemistry and an excellent fit to the Fourier data (R = 19.3%) was obtained for the 1r69 protein, a retrospective analysis of the previous failed molecular replacement effort was carried out. It turned out that the original model had a 4.1A r.m.s. difference with the final correct 1r69 model. (N.B. for clarity, we shall henceforth assume that despite the non-zero R-factor of 19.3% this final model is the true and correct model against which all others are judged.) Given that the common rule of thumb for successful traditional molecular replacement is a model error of less than 1 to 2A r.m.s., the repeated failure of the original molecular replacement attempt is not surprising. In contrast, the 434 cro model was only 1.09A r.m.s. from the true 1r69 model. Given this historical backdrop, and that the current author was awarded his Ph.D. thesis primarily for this structure determination effort, the 1r69 example is ideally suited for comparing the relative merits of the simulated annealing approach over traditional methods (Subbiah, 1988). First, the true 1r69 model was subjected to varying lengths of molecular dynamics simulation using the program ENCAD (Levitt, 1983). This produced a series of increasingly different conformations of the 1r69 structure, all with good stereochemistry. The five models produced this way had r.m.s. differences of 0.9A, 1.7A, 2.4A, 3.7A, and 4.5A from the true model and are shown in Figure 4.
3.3.2. Results
Traditional molecular replacement was performed using all data to 6A resolution. The rotation function was performed using the Crowther fast rotation program (Crowther, 1972). The translation function was also done at 6A using the program TSEARCH from the CCP4/MRC package (SERC Daresbury Laboratory, 1979). Each of the five deformed models, as well as the true one, was in turn used as a starting point to answer the question whether it could have led to successful traditional molecular replacement. As can be seen from Figure 5, both the true model and the 0.9A model produced a clear solution and led to the correct answer. The solution obtained from the 1.7A model was not as clear-cut.
Nevertheless, with some persistence in following up alternative solutions to the rotation function it could have still led to eventual success. But the 2.4A, 3.7A and 4.5A models did not produce any solutions that could have led to eventual success.
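The model errors quoted in this section and illustrated in Figure 4 are root-mean-square coordinate differences between equivalent atoms of two superimposed models. For reference, a minimal sketch of that measure, assuming the two models are already optimally superimposed and the atoms are paired one to one, is:

```python
import numpy as np

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two already-superimposed models,
    given matching (n, 3) coordinate arrays of equivalent atoms."""
    diff = np.asarray(coords_a, dtype=float) - np.asarray(coords_b, dtype=float)
    return np.sqrt(np.mean(np.sum(diff ** 2, axis=1)))
```

As noted later in the text, r.m.s. error is a rather non-linear measure, so visual comparison of the models (Figure 4) is a useful complement to the single number.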
Figure 4. Comparison of the deformed 1r69 structures with the true structure. Alpha-carbon models of four deformed 1r69 structures with r.m.s. errors of 0.9A, 2.4A, 3.7A and 4.5A are shown (in thin lines) superimposed onto the true 1r69 structure (in thick lines) in (a), (b), (c) and (d) respectively.
[Figure 5 (tabulated results): for the true (0.0 A) model and the deformed models of 0.9, 1.7, 2.4, 3.7 and 4.5 A r.m.s. error, the figure tabulates the outcome of the 6 A Crowther rotation function and the 6 A TSEARCH translation function, and, for the OBELIX 6-dimensional search, the rank of the true answer before and after refinement (Peak No. Bef./Aft.), its positional (A) and angular (deg.) error before refinement, and whether the refined top solution corresponded to the true answer (Refined?).]
Figure 5. Molecular replacement and simulated annealing results for 1r69. The results of the application of both traditional molecular replacement and the simulated annealing approach to the five deformed and one true model of 1r69 are shown. On the left, the success or failure of a rotation function followed by a translation function for each of the models is presented. The question marks refer to scenarios where the true solution had a high score but was not the highest. Nevertheless, with some persistence in following up alternative solutions, these scenarios could have led to eventual success. For comparison, the results from the simulated annealing approach are detailed, on the right. For each model, the column labeled Peak Number-Before contains the position of the true answer in the rank-ordered good candidate solution list from the 6-dimensional simulated annealing search at 9A resolution. The column labeled Peak Number-After contains the rank position after rigid-body CORELS refinement of each candidate solution using 6A data. The distance and angular errors reported in the columns labeled Error were derived as follows. After the simulated annealing search, the unrefined candidate solutions were inspected and the one closest to the correct answer was determined. The distance and angle separating this solution from the true answer appears in the table. The column labeled Refined? simply reports whether the top solution after CORELS refinement corresponded to the true answer. Thus, all models with error less than 3.7A could have led to success by simulated annealing. In the 3.7A case, success was probable since the true answer was next to best. None of the better solutions could have led to success with the 4.5A case.
Using 9A resolution data the program OBELIX was run with each model in turn. All simulated annealing runs took about 20 c.p.u. minutes on a 10 mip machine. As can be seen from Figure 5, a simulated annealing protocol using the true model resulted in a set of pruned good solutions over 2σr. Each of these candidate orientations was applied to the model. Using these as starting points, the correlation coefficient to 6A resolution was maximized by conventional rigid-body refinement with the program CORELS (Sussman et al., 1977). The top solution prior to the refinement, which was some 0.8A and 6 degrees away from the true answer, rapidly converged to the expected answer. Similarly, the best solution prior to refinement for the 0.9A model moved some 1.5A and 10 degrees to converge to the true answer. With the 1.7A model the second best solution prior to refinement moved some 3.5A and 12 degrees to the true answer. For the 2.4A model the third best solution prior to refinement converged to the true answer. In the 3.7A case, the fourth best solution prior to refinement converged to the true answer. However, this did not have the highest correlation coefficient. One of the top 3 pre-refinement solutions refined to a wrong answer that had a somewhat higher correlation coefficient than that for the true answer. Thus, this 3.7A model was sufficiently inaccurate to allow marginally better fits to the data with an incorrect rotational/translational setting. None of the good solutions obtained with the 4.5A model refined to anywhere near the true answer. Thus, it is clear that the simulated annealing approach allows a wider radius of convergence from poorer models. Specifically, while a 1.7A r.m.s. error is about the limit for eventual success with the two consecutive 3-dimensional searches of traditional molecular replacement, a single 6-dimensional simulated annealing search can allow success with a model error of as much as 3.7A. Since r.m.s. error is a rather non-linear measure, an inspection of Figure 4c clearly illustrates the rather large structural distortions associated with a 3.7A r.m.s. error. So, in retrospect, if a simulated annealing alternative to conventional molecular replacement had been available when the original Fourier amplitude data was collected, it could perhaps have led to early success with the original 4.1A r.m.s. model. In fact, when OBELIX was applied to that model the correct answer was number 4 on the list of good solutions.
3.4. Experiment 2
3.4.1. The problem
This case was the real-life situation where the closed form of the elastase molecule from Pseudomonas aeruginosa was under crystallographic investigation (Thayer, Flaherty & McKay, 1991). These crystals were in space group P212121. A P21 model of this 298-residue protein (n = 298) was already available in the open form and had been previously used successfully in the molecular replacement solutions of other open form structures in different space groups. Thus molecular replacement, using the open form model, was expected to lead to the closed form structure. However, molecular replacement failed and so heavy-atom data was collected and an interpretable electron density map obtained. Concurrently, an OBELIX-based 6-dimensional search using 122 reflections to 12A resolution was attempted.
3.4.2. Results
The top solution confirmed the heavy-atom based result. Taking some 40 c.p.u. minutes on a 10 mip machine, 3 out of 5 runs produced the same top peak, which was CORELS-refined at 6A to give the correct answer. The top candidate directly after simulated annealing was 1.1A and 4 degrees away from the correct answer.
4. CONCLUSION
The application of the simulated annealing procedure to crystallographic search problems results in producing a list of good solutions on much reduced time scales. With respect to the real-space search problem, in a real application where the exhaustive search can itself be performed, it finds a major proportion of all good solutions that may be suitable candidates for the true fit. It is not surprising that substantial savings, on the order of 88,000-fold, accompany the application of this protocol. It is a fundamental property of simulated annealing that, as the size of the problem grows, the time required for annealing scales in a manner much less than linear. Perhaps more important than this sheer improvement in speed is the possibility of conducting searches that were previously impossible. In the reciprocal space search situation, the previously impossible 6-dimensional search can be effectively addressed. Moreover, test case results suggest that the input search model can be significantly poorer with a 6-dimensional simulated annealing search as compared to conventional molecular replacement using two approximate 3-dimensional searches. Furthermore, this superiority has been demonstrated with real-life cases. In summary, it is clear that simulated annealing approaches can be of benefit in addressing crystallographic search problems that cannot otherwise be addressed.
REFERENCES
Anderson, J.E., Ptashne, M. & Harrison, S.C., Nature, 316 (1985) 596.
Bjorkman, P.J., Saper, M.A., Samraoui, B., Bennett, W.S., Strominger, J.L. & Wiley, D.C., Nature, 329 (1987) 506.
Brunger, A.T., Acta Cryst., A45 (1989) 42.
Colman, P.M. & Webster, R.G., Biological Organisation: Macromolecular Interactions at High Resolution, P&S Biomedical Sciences Symposia, 1985.
Crowther, R.A., in The Molecular Replacement Method (Rossmann, M.G., ed.), New York: Gordon & Breach (1972) 173.
Huber, R., Acta Cryst., 19 (1965) 353.
Kirkpatrick, S., Gelatt, C.D. Jr. & Vecchi, M.P., Science, 220 (1983) 671.
Levitt, M., J. Mol. Biol., 168 (1983) 595.
Mondragon, A., Subbiah, S., Almo, S., Drottar, M. & Harrison, S.C., J. Mol. Biol., 205 (1989) 189.
Mondragon, A., Wolberger, C. & Harrison, S.C., J. Mol. Biol., 205 (1989) 179.
Reynolds, R.A., Remington, S.J., Weaver, L.H., Fisher, R.G., Anderson, W.F., Ammon, H.L. & Matthews, B.W., Acta Cryst., B41 (1985) 139.
Rossmann, M.G. & Blow, D.M., Acta Cryst., 15 (1962) 24.
Rossmann, M.G., in The Molecular Replacement Method (Rossmann, M.G., ed.), New York: Gordon & Breach, 1972.
SERC Daresbury Laboratory, CCP4: A Suite of Programs for Protein Crystallography, SERC Daresbury Laboratory, Warrington WA4 4AD, England, 1979.
Subbiah, S., Ph.D. Thesis, Harvard University, 1988.
Subbiah, S. & Harrison, S.C., Acta Cryst., A45 (1989) 337.
Sussman, J.L., Holbrook, S.R., Church, G.M. & Kim, S.-H., Acta Cryst., A33 (1977) 800.
Thayer, M.M., Flaherty, K.M. & McKay, D.B., J. Biol. Chem., 266 (1991) 2864.
Van Kampen, N.G., Stochastic Processes in Physics and Chemistry, Amsterdam: Elsevier, 1981.
Weis, W., Brown, J.H., Cusack, S., Paulson, J.C., Skehel, J.J. & Wiley, D.C., Nature, 333 (1988) 426.
Wilson, I.A., Skehel, J.J. & Wiley, D.C., Nature, 289 (1981) 366.
Chapter 14
Simulated annealing in the calculation of NMR structures
Daqing Zhao and Oleg Jardetzky*
Stanford Magnetic Resonance Laboratory, Stanford University, Stanford, CA 94305-5055, USA
* The authors are supported by NIH grants RR07558 and GM33385.
1. INTRODUCTION
There are more than 10,000 known sequences of proteins, but barely a thousand whose three dimensional (3D) structures have been determined, mostly by x-ray diffraction. With the development of multidimensional NMR (nuclear magnetic resonance) (Ernst et al., 1987) and the improvement in computing power, as well as the development of suitable computational algorithms, NMR spectroscopy has emerged as a second important method for the structure determination of moderately sized proteins (up to M.W. 37,000 thus far) in solution. There is a fundamental difference between x-ray crystallography and NMR in the way experimental data are processed to obtain the atomic positions of a molecule. In x-ray crystallography, the experimental diffraction pattern is a simple Fourier transform of the electron density and vice versa. The diffraction pattern can be converted into an electron density map and the geometry of the molecule can be constructed. Therefore, in principle, there is a one-to-one correspondence between the observed diffraction data and the structure, obtainable through a simple mathematical transformation. With NMR spectra, however, such a direct transformation is not possible, even in principle. Measurable NMR parameters depend not only on the distance between a pair of atoms (usually protons, 15N, 13C), but also on their distances to other neighboring atoms. Furthermore, NMR parameters depend on the rates of motion of the atomic or spin pair and their environment and on the rates of conformational averaging. One has to make assumptions about the dynamics and other features of the structure in order to obtain geometric distances between a given pair of spins. In addition, to obtain a set of coordinates consistent with the data, it is necessary to construct a model based on short distances between pairs of atoms. This requires a search of the possible conformational space, usually adding information not obtainable directly from NMR experiments. NMR does not measure the shape of proteins directly. The accuracy and precision of NMR solution structures are therefore dependent on the judgment of the experimentalist in interpreting the data and on the choice of methods of conformational searching. The process of NMR structure determination can be divided into two stages: (1) spectral data acquisition, cross peak collection and sequence-specific assignment, and (2) interpretation of spectral intensities, extraction of geometric information and calculation of structures. An NMR spectrum usually contains
thousands of overlapping spectral lines. In order to resolve the spectrum into contributions from individual atoms and assign these lines to a specific nucleus in a specific residue in the sequence, it is essential to use multidimensional NMR techniques. Equally important are assignment strategies. This first stage of NMR structure determination is, however, beyond the scope of this review, whose aim is to summarize the newest techniques of structure calculation, in particular simulated annealing (SA). Different simulated annealing procedures have been designed for the interpretation of NMR data. For reviews of other aspects of NMR structure calculation, see Crippen (1977) and Havel et al. (1983) for distance geometry, and Lichtarge et al. (1987), Brinkley et al. (1988), Altman and Jardetzky (1989) and Altman et al. (1991) for heuristic refinement and optimal filtering. Brünger and Karplus (1991) and Brünger and Nilges (1993) also cover NMR structure calculations in general.
2. STATEMENT OF THE PROBLEM
Proteins are linear heteropolymers formed from 20 different amino acids connected by peptide bonds. Atoms along the main chain are called backbone atoms and those branching off the main chain are called sidechain atoms. Protein conformational space can be described in Cartesian space (real space) or in dihedral angle space with fixed bond lengths and bond angles (for definitions of dihedral angles, see Stryer, 1988). A 3D protein structure can be described at several levels. First is the primary structure, the covalently bonded structure of the protein without any information about the noncovalent relations or the conformation; then the secondary structure, defining hydrogen bonded elements, which can be classified into α-helices, β-sheets, 3₁₀ helices, etc.; and the tertiary structure, which is the 3-dimensional arrangement of the secondary structure elements. Without experimental constraints, the problem of calculating protein structures is the protein folding problem, which remains unsolved. Proteins often fold on time scales from 10 milliseconds to a few minutes, but current techniques of molecular dynamics can be integrated for only a few nanoseconds. In addition, current potential energy surfaces are not realistic enough to give the correct global minimum and the correct kinetic properties, so that a search usually fails, remaining trapped in local minima. In NMR protein structure determination, the folding of the structure is guided by a set of experimentally determined distance constraints and dihedral angle constraints, together with information obtained using techniques other than NMR -- the primary sequence and empirical energy functions including bond length potentials, bond angle potentials, van der Waals interactions, chirality potentials, etc. The problem now becomes the search for a global minimum in a constrained conformational space. The more experimental constraints we have and the more accurate they are, the deeper this global minimum. With large numbers of NMR constraints, smaller proteins can be folded from a random configuration using a searching strategy of high temperature molecular dynamics and simulated annealing. Larger proteins are more difficult to fold, even with a sufficient number of geometric constraints, because of their large conformational space. A successful calculation of such 3D structures of proteins at atomic resolution requires an adequate conformational searching algorithm to find all the structures satisfying the given constraints as well as the given empirical energies.
3. TYPES OF NMR CONSTRAINTS
3.1. Distance constraints
Most of the protein structural information from NMR is obtained in the form of nuclear Overhauser effects, or NOEs, between pairs of protons that are less than 6 Å apart through space. An NOE between a spin pair carries distance information, but only short distances are observed because NOEs have an inverse sixth power dependence on distance. However, the distance cannot be uniquely determined from a measured NOE intensity without making some assumption about the environment of the spin pair and the motion of the vector between them. The simplest model for obtaining a distance from cross peak intensities in a nuclear Overhauser effect spectrum (NOESY) is the isolated rigid spin pair (RRNN, rigid rotor nearest neighbor) approximation (Jardetzky and Roberts, 1981). In this approximation the observed cross peak intensity, which is proportional to the cross relaxation rate σ, is related to a single internuclear distance r by

\sigma = \frac{\gamma^4 \hbar^2}{10\, r^6}\, \tau_{\mathrm{eff}} \qquad (1)

where γ and ℏ are constants and τ_eff is the single rigid-body correlation time.
The above equation holds approximately for small structures observed at short mixing times. Even for these simplest cases, one needs a specific assumption about the unknown local correlation time of the internuclear vector between the spin pair. The commonly used assumption is that the overall rotational tumbling time can be used for all internuclear vectors. This assumption turns out to be more successful than might be expected because of the relative insensitivity of the internuclear distance to the correlation time and the cross relaxation rate, which comes from the sixth-power dependence of the distance on the cross relaxation rate or the correlation time. For structures of larger molecular weight (>15,000), the use of the RRNN model introduces serious systematic errors when converting the NOEs to distance constraints. Lane and Jardetzky (1987) and Borgias and James (1990) showed that the error associated with the use of this approximation can be from 30-200% for a mixing time of only 100 ms. Without further knowledge, only semiquantitative distance constraints can be obtained. A commonly used classification is into 'strong', 'medium', 'weak' and 'very weak' NOE ranges, which correspond to 1.8-2.7 Å, 1.8-3.3 Å, 1.8-5.0 Å and 3.0-6.0 Å distance ranges, respectively (Forman-Kay et al., 1991). These distance ranges can be used directly in a conformational search.
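The conversion just described can be made concrete with a short sketch. It is illustrative only: the class names and cutoff values are the ranges of Forman-Kay et al. (1991) quoted above, while the function names and the reference-calibration helper (a direct consequence of eq (1) when the same correlation time is assumed for both pairs) are my own.

# Sketch: turn semiquantitative NOE classes into distance bounds (angstroms),
# using the ranges of Forman-Kay et al. (1991) quoted above.
NOE_CLASS_BOUNDS = {
    "strong":    (1.8, 2.7),
    "medium":    (1.8, 3.3),
    "weak":      (1.8, 5.0),
    "very weak": (3.0, 6.0),
}

def distance_bounds(noe_class):
    """Return (lower, upper) distance bounds for an observed NOE class."""
    return NOE_CLASS_BOUNDS[noe_class]

def rrnn_distance(sigma, sigma_ref, r_ref):
    """Isolated spin pair calibration implied by eq (1) when the same correlation
    time is assumed for both pairs: r = r_ref * (sigma_ref / sigma)**(1/6)."""
    return r_ref * (sigma_ref / sigma) ** (1.0 / 6.0)

# Example: a 'medium' cross peak gives bounds of 1.8-3.3 A, and a cross peak
# half as intense as a reference at 2.5 A corresponds to roughly 2.8 A.
lower, upper = distance_bounds("medium")
r = rrnn_distance(sigma=0.5, sigma_ref=1.0, r_ref=2.5)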
In reality, the cross relaxation rate between a spin pair also depends on their distances to the other spins surrounding them. The evolution of magnetization (peak intensity or peak volume) in the presence of multiple interactions, or 'spin diffusion', is governed by the generalized Bloch equation,

\frac{dM_i}{dt} = -\rho_i\,\bigl(M_i - M_i(0)\bigr) - \sum_{j \neq i} \sigma_{ij}\,\bigl(M_j - M_j(0)\bigr), \qquad (2)
where ρ_i is the spin-lattice relaxation rate of the ith spin, σ_ij the cross-relaxation rate between spin i and spin j, and t is time. Taking into account the influence of all other spins requires some knowledge of the structure of the molecule. Once an
approximate structure and a model to account for internal motions are available, distance constraints obtained based on the isolated spin pair approximation can be improved (Borgias and James, 1990). Madrid et al. (1989) have achieved this by numerically integrating the coupled generalized Bloch equations from model coordinates, which determine relaxation. In matrix form (Macura and Ernst, 1980; Keepers and James, 1984), the generalized Bloch equations can be written as

\frac{d\vec{M}}{dt} = -R\,\vec{M}, \qquad (3)

where \vec{M} is the relaxation vector (with components M_i - M_i(0)) and R is the relaxation matrix. Diagonal matrix elements of R are the ρ_i's, while off-diagonal matrix elements are the σ_ij's. The R_ij's are functions of the spectral density function and the distance between the two spins,

R_{ij} = \sigma_{ij} = 6\,d_{ij}\,J(2\omega) - d_{ij}\,J(0), \qquad (4a)

R_{ii} = \rho_i = \sum_{k \neq i} \bigl[ d_{ik}\,J(0) + 3\,d_{ik}\,J(\omega) + 6\,d_{ik}\,J(2\omega) \bigr] + R_i^{leak}, \qquad (4b)

where d_ij = γ⁴ℏ²/(10 r_ij⁶) and R_i^leak describes the non-NOE magnetization losses from the lattice. We can see that R_ij depends on distances and spectral density functions. Because of the 1/r⁶ dependence, many matrix elements of R drop to zero quickly as distances increase. Therefore the R matrix is sparse. Eqn. 3 can be formally solved for mixing time τ_m as

\vec{M} = e^{-R\tau_m}\,\vec{M}_0 = A\,\vec{M}_0, \qquad (5)

where A is the normalized NOE matrix. Since R is a symmetric matrix, we have

A = L\, e^{-\Lambda \tau_m}\, L^{-1}, \qquad (6)

where Λ is the eigenvalue matrix of R and L is the eigenvector matrix, given by

L^{-1} R L = \Lambda. \qquad (7)

The relaxation matrix can be obtained by diagonalizing the complete NOE matrix A (see Olejniczak et al., 1986; Ernst et al., 1987),

L^{-1} A L = D = e^{-\Lambda \tau_m}, \qquad (8)

and we have

R = -\frac{1}{\tau_m}\, L\,(\ln D)\, L^{-1}. \qquad (9)
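As a concrete numerical illustration of eqs (5)-(9), the following sketch back-calculates a normalized NOE matrix from a relaxation matrix and recovers R from A by diagonalization. It is a toy example only: the 3x3 matrix and mixing time are invented and stand in for a real spin system.

import numpy as np

def noe_matrix(R, tau_m):
    """A = exp(-R * tau_m) via eigendecomposition of the symmetric matrix R (eqs 6-7)."""
    lam, L = np.linalg.eigh(R)                     # L^T R L = diag(lam)
    return L @ np.diag(np.exp(-lam * tau_m)) @ L.T

def relaxation_matrix(A, tau_m):
    """Recover R = -L (ln D) L^T / tau_m from the complete NOE matrix A (eqs 8-9)."""
    d, L = np.linalg.eigh(A)
    return -L @ np.diag(np.log(d)) @ L.T / tau_m

# Toy example: an invented 3-spin relaxation matrix (s^-1) and a 100 ms mixing time.
R = np.array([[ 2.0, -0.5, -0.1],
              [-0.5,  2.5, -0.3],
              [-0.1, -0.3,  1.8]])
A = noe_matrix(R, tau_m=0.100)
R_back = relaxation_matrix(A, tau_m=0.100)         # reproduces R to numerical precision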
Knowing R, distances between spin pairs can be easily calculated. The exponential can be expanded as

e^{-R\tau_m} = 1 - R\tau_m + \frac{(R\tau_m)^2}{2} - \cdots, \qquad (10)

so that at very short mixing times we have

A \approx 1 - R\tau_m, \qquad (11)
and the NOE intensity is linear in τ_m. The calculation of R requires a model for the spectral density function. R can also be used to calculate the NOE spectra for a given structure. Using a simple model, the isotropic tumbling of the molecule with a single correlation time τ_c, the spectral density function is

J(\omega) = \frac{\tau_c}{1 + \omega^2 \tau_c^2}, \qquad (12)

where ω is the spectrometer frequency. More complicated formulations of the spectral density function, taking internal motions into account, are necessary for accurate calculations. In large part such formulations still need to be developed. A caveat for relaxation matrix calculations is that they may not significantly reduce the errors of structure determination, because of the usual assumption of a single correlation time and the lack of any motional averaging. The precision and accuracy of structures so determined are always limited by the approximations used in data analysis.

3.2. Dihedral angle constraints
Backbone dihedral angle constraints can be obtained from a Karplus-type relationship of the form (Karplus, 1959; Bystrov, 1976)

{}^{3}J_{HN\alpha} = A + B\cos(\phi - 60°) + C\cos^2(\phi - 60°), \qquad (13)

where ³J_HNα is the coupling constant, which can be measured by J-correlated spectroscopy (COSY) type experiments, and A, B and C are empirical constants. Sidechain dihedral angles can also be measured through similar Karplus-type relations. For example, χ₁ sidechain torsion angle constraints can be obtained by analyzing the ³J_αβ coupling constants and the relative intensities of the intraresidue NOEs from the NH and CαH protons, on the one hand, to the two CβH protons on the other, and in the case of valine to the CγH₃ protons. If both ³J_αβ coupling constants are small (< 4 Hz), then χ₁ must lie in the range 60±60°. If, on the other hand, one of them is large and the other small, then χ₁ can be either in the range 180±60° or -60±60°. The intraresidue NOEs then need to be used to make consistent assignments. Clore and Gronenborn (1989) showed a scheme to distinguish the different possibilities. Nilges et al. (1990) proposed a procedure to make stereospecific assignments based on an x-ray database. Based on sequential NOE patterns and proton exchange rates, as well as Hα chemical shifts, backbone φ, ψ angles for well formed α-helices can be derived (Forman-Kay et al., 1991). For example, in the calculation of E. coli trp repressor solution structures, Zhao et al. (1993) used φ = -65°±45° and ψ = -35°±45° for residues that had three or more characteristic helical NOEs and/or a slowly exchanging amide proton. The conservative range of ±45° was used for the dihedral angle constraints so as not to bias the result excessively.
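A small sketch shows how a Karplus-type relation such as eq (13) can be evaluated and inverted by a simple grid search. The coefficient values below are illustrative placeholders, not the empirical constants of any particular parametrization.

import math

def karplus_3j(phi_deg, A, B, C):
    """Eq (13): 3J(HN-alpha) = A + B*cos(phi - 60) + C*cos^2(phi - 60), phi in degrees."""
    x = math.radians(phi_deg - 60.0)
    return A + B * math.cos(x) + C * math.cos(x) ** 2

# Example: with illustrative coefficients, list phi values (5-degree grid) whose
# predicted coupling lies within 1 Hz of a measured 8 Hz coupling constant.
A, B, C = 1.9, -1.4, 6.4                       # illustrative values only
consistent_phi = [phi for phi in range(-180, 180, 5)
                  if abs(karplus_3j(phi, A, B, C) - 8.0) < 1.0]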
3.3. Chemical shift constraints
Numerous studies have shown that Hα proton chemical shifts contain information about the secondary structure of proteins (Markley et al., 1967; Szilagyi and Jardetzky, 1989; Wishart et al., 1991; 1992). Wishart et al. (1991) analyzed chemical shift information for 78 proteins and more than 5000 residues. The authors first calculated the 'random coil' chemical shifts by averaging all chemical shift values in α-helices, β-sheets and coils for each of the 20 amino acids over all residues. Then they compared the chemical shift distributions of α-helices and β-sheets relative to the 'random coil' chemical shifts. The authors found that the chemical shift distribution for Hα in an α-helix is shifted upfield by about 0.39 ppm from the random coil chemical shift, while that for a β-sheet is shifted downfield by 0.37 ppm, and that the distributions are largely disjoint. They suggested that an Hα chemical shift at least 0.15 ppm less than the 'random coil' chemical shift value correlates significantly with a well formed α-helix, and one at least 0.15 ppm larger correlates with a well formed β-sheet. Based on these studies, Zhao et al. (1993) used the chemical shift information in their calculation of solution structures by adding backbone dihedral angle constraints of φ = -65°±60° and ψ = -35°±60° for residues with α-helical Hα chemical shifts.

3.4. Types of constraint potentials
The types of potentials used to represent experimental constraints are biharmonic, square-well, soft square-well and related smoothing functions. The biharmonic function (Clore et al., 1985) is given by

E_{NOE} = \begin{cases} \dfrac{S k_B T}{2 d_-^2}\,(R - d)^2 & \text{when } R < d \\[6pt] \dfrac{S k_B T}{2 d_+^2}\,(R - d)^2 & \text{when } R > d \end{cases} \qquad (14)
where S is the scale factor, R is the averaged distance between the two groups of protons, k_B is the Boltzmann constant, T is the temperature and c is a specified maximum energy value. d is the estimated distance, with lower and upper errors of d_- and d_+, respectively. If there are only approximate lower (d_low) and upper (d_up) limits of the target distances, a square-well of the form

E_{NOE} = \begin{cases} k_{NOE}\,(R - d_{up})^2 & \text{when } R > d_{up} \\ 0 & \text{when } d_{low} \le R \le d_{up} \\ k_{NOE}\,(R - d_{low})^2 & \text{when } R < d_{low} \end{cases} \qquad (15)
where k_NOE is the force constant, can be used (Nilges et al., 1988). Usually one enters an upper bound based on the NOE intensity and a lower bound based on the sum of the van der Waals radii of the two protons. Since the van der Waals radii are explicitly expressed in the empirical energy function, the lower bound can also simply be set to zero without affecting the final result. Using zero lower bounds may help the searching during the high temperature dynamics, when the weight of the van der Waals interaction is reduced. When a finite value of the lower bound is used, the two atoms will be restricted from coming too close to each other and prevented from passing through each other. A soft square-well is used when the violations are large and a soft asymptote is needed (Brünger, 1992):

E_{NOE} = a + \frac{b}{R - d_{up}} + c\,(R - d_{up}) \quad \text{when } R \ge d_{up} + r_{sw}. \qquad (16)

The potential is the same as the square-well potential when R < d_up + r_sw; r_sw is the switching distance, a and b are chosen so that E_NOE is a smooth function at the point R = d_up + r_sw, and c is the asymptote. When the violation is very large, the constraint energy increases only linearly with distance, so that the potential remains effective at long range.
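A minimal sketch of the square-well restraint of eq (15) with the soft asymptote of eq (16) attached for large violations follows. The matching conditions used to fix a and b (continuity of the energy and its first derivative at R = d_up + r_sw) are my own working assumption, and the function name and default parameter values are illustrative.

def soft_square_noe(R, d_low, d_up, k_noe=50.0, r_sw=1.0, c=10.0):
    """NOE restraint energy: square well (eq 15) with a linear soft asymptote for
    large violations (eq 16).  a and b are fixed here by requiring the energy and
    its first derivative to be continuous at R = d_up + r_sw (an assumption)."""
    if R < d_low:
        return k_noe * (R - d_low) ** 2
    if R <= d_up:
        return 0.0
    if R <= d_up + r_sw:
        return k_noe * (R - d_up) ** 2
    # Soft asymptote region of eq (16): E = a + b/(R - d_up) + c*(R - d_up)
    b = (c - 2.0 * k_noe * r_sw) * r_sw ** 2
    a = 3.0 * k_noe * r_sw ** 2 - 2.0 * c * r_sw
    return a + b / (R - d_up) + c * (R - d_up)

# Example: a 'medium' NOE (bounds 1.8-3.3 A) evaluated at a 6 A violation grows
# only linearly with the violation instead of quadratically.
energy = soft_square_noe(6.0, 1.8, 3.3)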
3.5. Consistency of the experimental constraints
If all constraints are correctly assigned and are represented by correct distance values, and the dihedral angle constraint values are accurate, the data should be completely consistent, assuming there is no conformational averaging. According to model studies, constraint energies obtained from generated distance constraint sets are very small, but they increase appreciably when the distance constraints are inconsistent (for example, see Zhao and Jardetzky, 1993; Zhao and Jardetzky, 1994a). For larger proteins, assignment errors and distance errors may make the constraints inaccurate and inconsistent: NOE cross peaks in larger proteins may overlap severely. One can use simulated annealing to test the change in the consistency of the constraints, during which a few questionable constraints may be removed. An intermediate model may also be used to help discriminate the obvious assignment errors. Many researchers use consistency to assign ambiguous cross peaks - for example, stereospecific assignments. It is conceivable, however, that when the number of incorrectly assigned constraints is large, looking for constraint consistency may lead to the removal of correct constraints. The remaining constraints may be consistent, but they lead to the calculation of an incorrect structure. As we will discuss later, the inconsistency of NMR constraints can also be due to motional averaging. In that case, accurately assigned but simply interpreted NOEs still cannot all be satisfied at the same time, and the use of consistency to make assignments can lead to wrong structures.
4. EMPIRICAL FORCE FIELDS FOR SA OF PROTEINS AND DNA Quantum mechanical calculations of the potential energy surfaces of smaller molecules or solids are widely used by physicists and chemists. In principle, they would also provide a more accurate basis for calculating the potential energy surfaces of proteins and DNAs. For these large biomacromolecules, however, such calculations are prohibitive because of the required amount of CPU time, especially during dynamics simulations, where energy calculations are extensive. Empirical energy functions are much faster to calculate and they have been
proven to provide useful information about the structure and dynamics of macromolecules, where contacts among non-bonded atoms contribute significantly to the total energy (Brooks et al., 1988). The total energy E_pot of a system during simulated annealing consists of two parts,

E_{pot} = E_{empirical} + \omega\, E_{constraint}, \qquad (17)

where E_empirical is the energy obtained from sources outside of NMR, E_constraint is the NMR experimental constraint energy and ω is an adjustable parameter, the weight of E_constraint relative to E_empirical. A typical E_empirical energy for proteins takes the following form,

E_{empirical} = E_{bond} + E_{angle} + E_{improper} + E_{dihedral} + E_{vdW} + E_{elect}. \qquad (18)
E_bond indicates the covalent bond energy, E_angle the bond angle energy, E_improper the improper energy, E_dihedral the dihedral energy, E_vdW the van der Waals energy and E_elect the electrostatic energy. Improper energies are additional constraints used to preserve known geometric features (chirality, planarity of rings) which the usual set of approximate potentials does not adequately safeguard. The usual form of an improper energy is a four-atom term constraining the dihedral angle about an axis defined by the middle pair of atoms. The parameters of the potential field are obtained and refined based on both experimental data and theoretical calculations (Brooks et al., 1988). A large number of short, approximate NMR experimental constraints can determine an approximate structure, because the long range NOE constraints between residues distant in the primary sequence eliminate a large segment of the conformational space. In such calculations, the ranges of the experimental constraints can be set to represent their precision, while the strengths and shapes of the constraint potentials can be designed to suit the searching efficiency. Since NMR constraints are mostly proton-proton distances, a full atomic representation is often used in calculations, especially at the later stages of refinement. The extended atom representation, which explicitly includes all heavy atoms and polar hydrogens, has been used to reduce the number of degrees of freedom and the number of nonbonded interactions (Brooks et al., 1983). A reduced representation of two 'atoms' per residue was suggested by Hoch and coworkers (Connolly et al., 1994). The reduced representations are occasionally used during a coarse searching of the conformational space, or when the number of constraints obtainable is relatively small, or the quality of the constraints is low. Since in these cases the resolution is not very high and only the correct globular fold is obtainable, it is generally preferable to use the more efficient reduced representations. Solvent molecules are usually not included in NMR structure calculations, except possibly at the last stage of the refinement, where room temperature molecular dynamics are calculated. Chiche et al. (1990) refined the structure of the small trypsin inhibitor in water. They found that the refinement with solvent gives better free energies and a closer agreement with the crystal structure.
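The weighting scheme of eqs (17)-(18) can be illustrated with a very small sketch; the term names mirror the equations, while the function and all numerical values are invented for illustration only.

def total_energy(empirical_terms, e_constraint, omega):
    """Eq (17): E_pot = E_empirical + omega * E_constraint, with E_empirical
    assembled from the bonded and nonbonded terms of eq (18)."""
    return sum(empirical_terms.values()) + omega * e_constraint

# Example with invented energies (kcal/mol); omega sets how strongly the NMR
# constraint energy is weighted against the chemical (empirical) energy.
terms = {"bond": 120.0, "angle": 310.0, "improper": 45.0,
         "dihedral": 520.0, "vdW": -830.0, "elect": -1400.0}
e_pot = total_energy(terms, e_constraint=95.0, omega=50.0)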
Force fields (or potential energy surfaces) used in simulated annealing are different from those in pure molecular dynamics simulations. The former are optimized for an efficient and stable search of the conformational space, while the latter are set up to describe correctly the physics of protein motion. During high temperature dynamics and simulated annealing, a modified force field (Nilges et al., 1988; Nerdal et al., 1989; Scheek et al., 1989; Havel, 1991a) is used which retains the correct bonding and nonbonding geometry and which has more easily adjustable force constants. If realistic potentials are used in high temperature dynamics, a protein will disintegrate long before the system can be heated to 1000 K. At room temperature, however, it takes milliseconds to seconds for a protein to fold. To improve the searching efficiency in a constrained conformational space and to maintain the basic bond structure and globular fold of the protein, the force field needs to be modified. In principle, we can use any potential field to our advantage during the high temperature search period, but we have to gradually change it to a potential realistically representing the physical parameters of bonds, angles and van der Waals radii. We first need to increase the force constants of the bond lengths, so that at 1000 K no chemical bond is broken and the protein does not disintegrate. The improper torsional force constants are increased so that the planarity of aromatic rings is maintained during the 'high' temperature search. We need to reduce the bond angle force constant and the improper potential force constant so that the chain is more flexible. We also have to reduce the weight of the van der Waals interaction, so that one part of the chain can pass through another part of the chain easily. The slope of the NOE constraint potential is linear, so that when some constraints have larger violations, simulated annealing can still proceed to restrain the structure toward the correct one. Without solvent water molecules to shield the charges, the electrostatic energy terms set up by the usual force fields are no longer correct. Since most SA coarse searching is done in vacuo, the electrostatic energy terms are usually turned off (Brünger, 1992). The van der Waals interaction is usually assumed to be infinite at zero distance, but during the conformational search such barriers are reduced. This helps one part of the chain to pass through another to satisfy the longer range constraints, and the searching efficiency is improved. Nilges et al. (1988) proposed the use of a purely repulsive term for the van der Waals interaction. It is known that dynamically averaged van der Waals radii are smaller than static ones (Chandler et al., 1983). Levitt (1983a) suggested the use of smaller van der Waals radii in dynamics simulations of proteins, but the original set of van der Waals parameters in the final energy minimization stage. Smaller final van der Waals radii will give a smaller number of violations. Because the geometric potential is optimized to preserve the bond geometry of the structure, rather than to describe the physics of the bond interaction, the value of the geometric potential may sometimes appear quite unreasonable. This is not important as long as the geometry of the structure satisfies the parameters for covalent bonds and non-covalent interactions. We found, during our calculation of the E. coli trp repressor/ODNA complex (Zhang et al., 1994), that a SA potential for DNA by X-PLOR (Brünger, 1992) would give a high energy value (33,000 kcal/mol). However, the energy of essentially the same structure when calculated by CHARMm is only -6400 kcal/mol.
When we used the unchanged physical potential CHARMM (Brooks et al., 1983), the structure deformed too far during the high temperature MD and could not be regularized during the simulated annealing period. During the simulated annealing, the mass of all particles can also be set to a larger value to reduce the overall weight of the potential energy. From Newton's
equation, it is obvious that scaling the mass by a factor of s is equivalent to scaling E_pot by a factor of 1/s. At the last stages of the refinement, more realistic potentials may be used to further add physical information to the structures. CHARMM (Brooks et al., 1983), GROMOS (van Gunsteren and Berendsen, 1987), AMBER (Singh et al., 1986) and Discover (Biosym Technologies, San Diego, CA) are among the widely used force fields.
5. SIMULATED ANNEALING
The conformational space of polypeptides is very large. By simply considering two Ramachandran angles per residue (around each peptide bond) and allowing each only 2 values, for a 100-residue protein the number of allowed conformers is of the order of 2²⁰⁰, or about 10⁶⁰. A systematic search to find those satisfying the NMR constraints is out of the question, and computations have to rely on random searches. Computationally, the calculation of a family of NMR solution structures can be accomplished in two stages: first, obtaining one approximate (starting) structure that satisfies the given constraints, and second, finding all the structures satisfying the constraints. Conventional optimization methods such as gradient descent fail because of the existence of a large number of local minima. Here simulated annealing techniques can make the difference between success and failure. Their advantage is that they not only go downhill, but can also temporarily go uphill to overcome a barrier. The usual SA uses the Monte Carlo method in Cartesian space with a Metropolis sampling scheme (Metropolis et al., 1953; Valleau and Whittington, 1977). The application of such a procedure to a protein system is rather inefficient, even when it is applied in dihedral angle space (Bassolino et al., 1988). However, the Verlet algorithm of molecular dynamics (Verlet, 1967) is also an irreducible Markov chain with transition probabilities that satisfy the detailed balance condition. Constant temperature molecular dynamics (Nosé, 1984) and similar implementations can therefore be used to generate a set of conformations satisfying the Boltzmann thermal distribution at a desired temperature. Therefore, by controlling the temperature, molecular dynamics can also be used as a simulated annealing procedure.
5.1. Molecular dynamics as an algorithm of SA
Molecular dynamics is a powerful tool for simulating molecular events from reaction dynamics to surface adsorption (see, for example, Allen and Tildesley, 1987). It is very efficient in dealing with systems in which cooperative motions are very important. During a molecular dynamics calculation, many degrees of motional freedom are allowed at the same time. McCammon et al. (1977) pioneered the simulation of the dynamics of proteins and the field is still rapidly developing. The success of simulated annealing in calculating NMR solution structures is based on the development of molecular dynamics simulations of biopolymers (Levitt, 1983b; Clore et al., 1985; Kaptein et al., 1985). Molecular dynamics numerically integrates Newton's equations of motion on a potential energy surface, given initial positions and velocities,
r_{i+1} = 2\,r_i - r_{i-1} - \frac{\nabla V(r_i)}{m}\,\delta t^2 \qquad (19a)

and

v_i = \frac{r_{i+1} - r_{i-1}}{2\,\delta t}, \qquad (19b)
where the r_i's are position vectors at step i, the v_i's are velocity vectors at step i, δt is the time step of integration and m is the particle mass. This is the so-called Verlet algorithm. It is very important that the Verlet algorithm is time-reversible. It is easy to see that detailed balance conditions exist for the Verlet algorithm. Together with the obvious irreducibility of the underlying Markov chain, one can prove that the conformations generated by the Verlet algorithm have a Boltzmann-like distribution. Swope et al. (1982) proposed a velocity Verlet algorithm, which gives better accuracy for velocity calculations than Verlet's original version. The two algorithms are mathematically equivalent. Since the conformational search is mainly in the space of possible combinations of nonbonded interactions, covalent bond lengths and bond angles may be treated as fixed (invariant) constraints. SHAKE (Ryckaert et al., 1977) is an algorithm to fix the bond lengths during a molecular dynamics simulation using the Verlet algorithm and often saves computer time. Bond angle constraints are found not to be economical in terms of CPU time (Allen and Tildesley, 1987).
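A minimal one-dimensional sketch of the Verlet update of eqs (19a)-(19b) is given below; the harmonic potential and all numerical values are invented for illustration.

def verlet_step(r_prev, r_curr, grad_v, m, dt):
    """One Verlet position update (eq 19a) and the centered velocity estimate (eq 19b)."""
    r_next = 2.0 * r_curr - r_prev - grad_v(r_curr) / m * dt ** 2
    v_curr = (r_next - r_prev) / (2.0 * dt)
    return r_next, v_curr

# Toy example: a particle in a harmonic well V(r) = 0.5*k*r**2, so grad V = k*r.
k, m, dt = 1.0, 1.0, 0.01
grad_v = lambda r: k * r
r_prev = r_curr = 1.0                 # start at rest at r = 1
for _ in range(1000):
    r_next, v = verlet_step(r_prev, r_curr, grad_v, m, dt)
    r_prev, r_curr = r_curr, r_next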
5.2. Temperature controls
The temperature of the system is calculated from the mean kinetic energy by

\sum_{i=1}^{N} \frac{m_i \langle v_i^2 \rangle}{2} = \frac{3N}{2}\, k_B T, \qquad (20)
where N is the number of atoms, m_i and ⟨v_i²⟩ are the mass and mean squared velocity of the ith atom, T is the temperature, k_B is the Boltzmann constant and 3N is the total number of degrees of freedom of the system. During high temperature dynamics, the temperature has to be kept constant. The control of temperature is essential to the success of simulated annealing. The temperature is raised so that larger barriers can be overcome and the system is less likely to be trapped in a local minimum. As already noted, raising the temperature by a factor of s is formally equivalent to scaling the target E_pot by a factor of 1/s (Brünger, 1991). When the potential energy is lowered during the search, it is released in the form of kinetic energy. This extra kinetic energy has to be removed from the total energy in order for the system to be stabilized in the lower energy state. There are different ways to control the temperature. The Nosé-Hoover thermostat (Nosé, 1984) can keep the system within a canonical ensemble at the desired temperature T. Langevin dynamics controls the temperature of a system by including a friction term and a random force term in the equation of motion of a particle. Temperature coupling (T-coupling; Berendsen et al., 1984; Brünger et al., 1984) is a simplified version of Langevin dynamics, with a scaled friction constant and zero random forces. The method controls the temperature of the system efficiently, but it does not preserve the trajectory in a canonical ensemble.
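In practice, T-coupling amounts to rescaling the velocities toward the target temperature each step. The sketch below uses the standard weak-coupling scaling factor; that functional form, the function name and the unit conventions are assumptions made for illustration rather than a quotation from the chapter.

import math

def t_coupling_rescale(velocities, masses, t_target, dt, tau_t, k_b=0.0019872):
    """Weak-coupling (Berendsen-type) temperature control: rescale velocities toward
    the target temperature.  velocities are (vx, vy, vz) tuples; k_b in kcal/(mol K);
    all numerical conventions here are illustrative."""
    n_atoms = len(velocities)
    kinetic = sum(0.5 * m * (vx * vx + vy * vy + vz * vz)
                  for m, (vx, vy, vz) in zip(masses, velocities))
    t_current = 2.0 * kinetic / (3.0 * n_atoms * k_b)          # invert eq (20)
    scale = math.sqrt(1.0 + (dt / tau_t) * (t_target / t_current - 1.0))
    return [(vx * scale, vy * scale, vz * scale) for vx, vy, vz in velocities], t_current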
The temperature in simulated annealing is no longer a measure of a real thermodynamic quantity; rather, it is used as a searching parameter, indicating the size of the barriers which can be overcome or the degree of coarseness of the search. Since a Boltzmann distribution in phase space is not needed during high temperature restrained molecular dynamics and simulated annealing, T-coupling is preferable to Langevin dynamics because the former has better searching efficiency (Brünger and Karplus, 1991). During refinement at room temperature, and where an ensemble is generated for time averaging, it is more rigorous to use the Nosé-Hoover thermostat than the T-coupling method.

5.3. Cooling rate
After the high temperature search, one cannot cool the system down infinitely slowly. Once the system is cooled down, or even during the cooling process, only limited conformational changes can be achieved. We know that short molecular dynamics runs at room temperature cannot achieve large conformational changes. It is of crucial importance that the conformational search for the global fold is done properly during the high temperature molecular dynamics. Before cooling, one also needs to gradually increase the weight of the van der Waals interaction and the bond angle and improper force constants. The slower the cooling rate, the better the chances of finding the global minimum. It is often sufficient to integrate 50 steps of MD for every 25 degrees by which the temperature is lowered.

5.4. Convergence tests
If the statistics of a family of structures no longer change as the length of the high temperature molecular dynamics is increased, the calculation can be said to have converged. The exact condition for convergence is different for each system. It is very important not to assume that the refinement converges within a certain number of picoseconds or a certain number of steps of high temperature MD. It is more prudent to run the high temperature MD for an additional length of time to ensure that the NOE constraint violations, energy values and RMSD values do not drift as the MD continues. Another test of whether convergence has been reached is to use different starting structures. In our calculation of the trp repressor and its complex with ODNA (Zhao et al., 1993; Zhang et al., 1994), we used various starting structures to ensure that the results remained the same. For ideal data, there should be no conflict between the chemical energies and the experimental energies: the constraints are completely self consistent. In such cases, the result will have minimal experimental constraint violations and very little constraint energy. There is only the chemical energy, which may or may not follow the Boltzmann distribution, depending on the refinement procedure. In reality, however, due to errors in the constraints and inaccuracies in the empirical energies, it comes down to a judgment by the person performing the computation as to which one should be trusted more. Because the NOE constraints are large in number but rather approximate in precision and accuracy, one cannot get the constraint energy to be very low. The weight ω of E_constraint relative to E_empirical (see Eqn. 17) can be adjusted to make the SA depend more or less on the experimental constraint energy relative to the system chemical energy. Jack and Levitt (1978) suggested that ω be adjusted so that the gradients from E_constraint and E_empirical are of the same order. In their case, ω is periodically adjusted so that, in the current structure, the gradient from E_empirical is equal to that from E_constraint. Normally, the weight stays the same during the course of the calculation (Brünger, 1992). During the calculation of the trp repressor/ODNA
complex structure (Zhang et al., 1994), we found that the local structure can be distorted due to an uneven distribution of the constraints.
6. APPLICATIONS OF SA
6.1. Ab initio SA
The implementation of simulated annealing is conceptually straightforward. For proteins with fewer than 1,200 atoms and with a sufficiently large number of constraints, one can start with a random structure and use SA alone to search for the global minimum in the restrained conformational space. A flow chart of the procedure is shown in Figure 1; it is very similar to the standard protocols (Nilges et al., 1988; Brünger, 1992). As the size of the protein increases, the procedure becomes less and less feasible or efficient. When it is difficult to achieve a successful simulated annealing calculation using all experimental constraints, it is useful to introduce the constraints in groups. For example, sequential constraints for regular secondary structure may be included in the simulated annealing calculation before the most constraining long range NOEs.

Figure 1. A simulated annealing procedure for NMR structure calculations. The flow chart proceeds through the following steps:
(1) random structure, or starting structure from a coarse searching algorithm;
(2) 200 steps of unrestrained minimization;
(3) 15 ps of RMD at 1000 K with reduced angle and improper force constants, small vdW weight, small asymptote for the soft square restraint potential and weak dihedral constraints;
(4) 10 ps of RMD at 1000 K with increasing asymptote of the soft square potential, increasing vdW weight and increasing angle and improper force constants;
(5) cooling to 300 K with 50 steps of RMD for each temperature step of 25 degrees, with strong dihedral constraints and the vdW weight increased to its full value;
(6) 200 steps of restrained minimization, changing from the geometric potential to the realistic potential;
(7) SA structure.
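The staged protocol of Figure 1 can be summarized as a driver loop. Everything below is schematic: the stubs run_rmd and minimize, their parameter names, and the scale labels stand in for whatever refinement engine is actually used; this is not the protocol of any specific program.

# Schematic driver for the Figure 1 protocol.  The two stubs stand in for the
# restrained MD and minimization routines of the refinement engine in use.
def run_rmd(structure, **params):      # stub: restrained molecular dynamics
    print("RMD", params)
    return structure

def minimize(structure, **params):     # stub: energy minimization
    print("minimize", params)
    return structure

def anneal(structure):
    structure = minimize(structure, steps=200, restrained=False)
    structure = run_rmd(structure, ps=15, temp=1000, angle_scale="reduced",
                        vdw_weight="small", soft_asymptote="small",
                        dihedral_weight="weak")
    structure = run_rmd(structure, ps=10, temp=1000, angle_scale="increasing",
                        vdw_weight="increasing", soft_asymptote="increasing")
    for temp in range(1000, 300 - 1, -25):          # cool in 25-degree steps
        structure = run_rmd(structure, steps=50, temp=temp,
                            dihedral_weight="strong", vdw_weight="full")
    return minimize(structure, steps=200, restrained=True, potential="realistic")

sa_structure = anneal(structure="random or coarse-search starting structure")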
6.2. SA as a refinement procedure The amount of time needed to search from a random structure with more than 1,200 atoms to a correctly folded protein can be uneconomical or impractical. For larger structures (>1,200 atoms), SA is often used to refine approximate
structures obtained from more robust and/or efficient algorithms such as distance geometry, optimal filtering or interactive graphic model building. When structures are generated from purely geometric methods such as distance geometry or optimal filtering, violations in bond distances and bond angles are large. Sometimes the search in the conformational space is limited or biased. The empirical energy of such structures is quite high, although the structure is more or less 'near' the global minimum except for some minor and local modifications. In these cases, restrained molecular dynamics and simulated annealing can help remove the violations by adding the chemical energy information and regularizing the bond lengths, bond angles, van der Waals interactions, chirality of chiral centers, etc. This was the conventional usage of simulated annealing for larger structures until Zhao and Jardetzky (1993) proposed sequential simulated annealing, which uses simulated annealing to search the spread of the conformational space allowed by the NMR constraints. The procedure is similar to that for ab initio SA, except for the starting structure (Figure 1).

6.3. Simulated annealing used for making assignments
Simulated annealing can also be used to make assignments. Weber et al. (1988) proposed 'floating chirality' to make stereospecific assignments automatically, based on the consistency of the constraints, by allowing the equivalent protons to switch places during the refinement. Habazettl et al. (1990) proposed a high dimensional potential function for resolved but unassigned methylene cross peaks. If there are n protons Ai having NOEs to two resolved but stereospecifically unassigned protons, B1 and B2, their potential function sets up an n-dimensional potential surface which has two wells corresponding to the two possible assignments. If there are enough NOE constraints so that the assignment is fully determined, a simulated annealing run with this potential will select the correct assignment. Nilges (1993) proposed a method for assigning NOE peaks in a dimer. The method is based on the observation that in a symmetric dimer an NOE cross peak will generally contain contributions from both the intramonomer and the intermonomer proton pairs, since the relevant chemical shifts are identical, i.e.:
NOE_{ij}^{obs} = 2\,NOE_{ij}^{intra} + 2\,NOE_{ij}^{inter}. \qquad (21)
Assuming that the two pairs of spins have identical internal motions and internal correlation times, which need not be rigorously true,
\frac{d\,NOE_{ij}}{d\tau_m} \approx 2c\,\bigl[(R_{ij}^{inter})^{-6} + (R_{ij}^{intra})^{-6}\bigr] = 2c\,\bar{R}_{ij}^{-6}, \quad \text{with } \bar{R}_{ij} = \bigl[(R_{ij}^{inter})^{-6} + (R_{ij}^{intra})^{-6}\bigr]^{-1/6}. \qquad (22)
Except when both the intermonomer and the intramonomer distances are short and comparable, one of the two terms will dominate the above equation and can be identified, given enough other NOEs in the surroundings. According to our experience, however, there are NOEs for which both the intra- and the intermonomer distances are short. For these cases the combined distance R̄_ij is the quantity observed by NMR, and this is not a useful constraint.
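The effective distance defined in eq (22) is simply an r⁻⁶-weighted combination, as the one-line helper below makes explicit (the function name and the numerical example are illustrative).

def effective_dimer_distance(r_intra, r_inter):
    """R-bar of eq (22): the r^-6-combined distance seen in a symmetric dimer when
    intra- and intermonomer contributions overlap in a single cross peak."""
    return (r_intra ** -6 + r_inter ** -6) ** (-1.0 / 6.0)

# Example: a 2.5 A intramonomer contact paired with a 4.5 A intermonomer one is
# observed as an apparent distance of about 2.49 A -- the short contact dominates.
r_bar = effective_dimer_distance(2.5, 4.5)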
Using this procedure it is sometimes possible to distinguish NOEs between an intra- and an intermonomer spin pair, but in general an experimental distinction, based on the use of isotope hybrids (Arrowsmith et al., 1990), is more rigorous and is to be preferred.

6.4. SA using time averaged constraints
In normal simulated annealing calculations, several individual structures are calculated to represent an ensemble. Each structure, representing an instantaneous 'snapshot', is optimized to agree with all of the experimental constraints. We know that NMR experimental constraints are measured as a time average, on various time scales, over all possible conformations. The assumption of a single, rigid structure becomes inadequate if the molecule, or part of the molecule, is very flexible. There are observations that several conformations exist in solution and that distance and torsion angle constraints obtained from NOE experiments cannot all be satisfied at the same time by a single rigid structure (Kessler et al., 1988; Schmitz et al., 1992; Stolarski et al., 1992). Weisz et al. (1994) pointed out that this is particularly true for torsion angles of deoxyribose, where sugar repuckering occurs. Kim and Prestegard (1989) and Scheek et al. (1991) reported the refinement of two or more conformations simultaneously to better satisfy the observed experimental constraints. Torda et al. (1989, 1990) and Torda and van Gunsteren (1991) suggested molecular dynamics with time-averaged constraints, in which the constraints are not enforced at any snapshot of the simulation but only for the ensemble, with some memory function,

\bar{R}(t) = \left[ \frac{\int_0^t e^{-(t-t')/\tau}\, R(t')^{-6}\, dt'}{\int_0^t e^{-(t-t')/\tau}\, dt'} \right]^{-1/6}, \qquad (23)
where τ is the time constant of the damping factor; τ is used so that more recent values of R(t) are weighted more heavily. The forces are calculated by

F_{NOE,x}(t) = -\left(\frac{\partial E_{NOE}}{\partial \bar{R}(t)}\right) \left(\frac{d\bar{R}(t)}{dR(t)}\right) \left(\frac{\partial R(t)}{\partial x}\right). \qquad (24)
Practical considerations of trajectory stability led to the assumption of a nonconservative 'pseudoforce' (Torda et al., 1990),

F_{NOE,x}(t) = -\left(\frac{\partial E_{NOE}}{\partial \bar{R}(t)}\right) \left(\frac{\partial R(t)}{\partial x}\right). \qquad (25)

This still restrains the conformation to the target distance, and it does so with a much smaller NOE force constant. The omitted factor d\bar{R}(t)/dR(t) is of the order δt/τ, where δt is the time step of the trajectory.
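The exponentially damped running average behind eq (23) can be written as a discrete per-step update, as sketched below; the recursive form is a standard discretization and is an assumption on my part rather than the chapter's own implementation.

import math

class TimeAveragedDistance:
    """Exponentially damped <r^-6> average of eq (23), updated once per MD step."""
    def __init__(self, tau):
        self.tau = tau
        self.weighted_sum = 0.0   # integral of exp(-(t-t')/tau) * r(t')^-6
        self.weight = 0.0         # integral of exp(-(t-t')/tau)

    def update(self, r, dt):
        decay = math.exp(-dt / self.tau)
        self.weighted_sum = self.weighted_sum * decay + r ** -6 * dt
        self.weight = self.weight * decay + dt
        return (self.weighted_sum / self.weight) ** (-1.0 / 6.0)

# Example: a distance oscillating between 2.5 and 4.5 A is restrained through its
# r^-6 time average, which stays close to the shorter distance.
avg = TimeAveragedDistance(tau=10.0)          # tau in ps, dt below in ps
for step in range(1000):
    r = 2.5 if (step // 100) % 2 == 0 else 4.5
    r_bar = avg.update(r, dt=0.001)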
Pearlman and Kollman (1991) generated simulated NOE data and refined an ensemble of structures using time averaging. They found that the range of conformations sampled using time-averaged constraints is larger than that obtainable by
the conventional method, and that the violations of the experimental constraints are also lower. Torda et al. (1993) also used time-averaging for the J-coupling constants. Because MD can only be integrated for no more than a few nanoseconds, the procedure is not very effective for including conformational transitions on the microsecond or longer time scales seen by NMR. It should be noted that MD-tar (molecular dynamics with time-averaged restraints) is much less efficient in searching parts of the conformational space far away from the starting structure, and relatively long simulations are necessary, but it is effective in providing a correct conformational search around some specific geometry. It gives a more accurate ensemble that satisfies the constraints on a limited time scale. Therefore MD-tar can be used to refine structures obtained from regular RMD (Schmitz et al., 1993). One should always keep in mind that during a finite run of molecular dynamics, the representation of the true ensemble may not be accurate for all time scales.
6.5. 4D SA In order to flatten the potential surface and reduce the chance of being trapped in a local minimum, an additional degree of freedom can be added to the 3D space during high temperature restrained molecular dynamics and simulated annealing (Havel, 1991a; Nakai et al., 1993). This procedure was also reported as able to remove the memory of the starting structures. In this protocol, distances in the model structure during simulated annealing are defined in 4D space (3D real space plus an additional degree of freedom) and distance constraint energies are also calculated in 4D space. Molecular dynamics is also performed in 4D space. After the system is cooled down in 4D space, the force constant in the fourth dimension is increased to compress the fourth dimension to zero so that the model structure is recovered in 3D space. In principle, adding the fourth dimension is another way of modifying the potential surface and/or raising the temperature during the conformational searching period. The conventional potential surface with more local minima is recovered only after the fourth dimension is compressed to zero.
6.6. Back calculation of NMR spectra and direct refinement against NOE intensities
If we have the relaxation matrix and an approximate structure, we can back calculate the NOESY spectra. The problem with the relaxation matrix method is that some of the cross relaxation rates are not observed, due to spectral overlap, dynamic averaging and exchange. Boelens et al. (1988; 1989) attempted to solve the problem by supplementing the unobserved NOEs with those calculated from a model structure. From a starting structure, the authors use NOE build-ups, stereospecific assignments and model-calculated order parameters to construct the relaxation matrix. An NOE matrix is then calculated. This NOE matrix is used to calculate the relaxation matrix, which is in turn used to calculate new distances. The new distances are then used to calculate a new model structure. The new structure can be used again to construct a new NOE matrix, and the process can be iterated to improve the structures. The procedure is called IRMA, or iterated relaxation matrix analysis. Baleja et al. (1990) used finite differences to compute the search gradient. Yip and Case (1989) developed an analytic expression for the gradient of the exponential (exp(-Rτ_m)) with respect to the atomic coordinates, and this is the method more frequently used. For each observed NOE, the analytical expression requires
operations of order N³, where N is the number of spins. Nilges et al. (1991a) proposed the use of a spherical cutoff for individual spins to reduce the amount of computation time. An overall cutoff can also be used (Baleja et al., 1990; Forster, 1991). Yip (1993) described a procedure for evaluating the gradient without diagonalization of matrices. With an algorithm for calculating the gradient of the NOE intensities, one can fit the structure to a NOESY spectrum, minimizing the difference between the observed NOE intensities and those calculated based on a full relaxation matrix. This takes spin diffusion into account in the calculation of structures, but it requires at least some understanding of the dynamics of the protein. Since Peng and Wagner (1992) formulated spectral density mapping techniques which can directly determine the spectral density function at several frequencies, the isotropic tumbling or the Lipari-Szabo (1982) models may be too simplistic. Finding an acceptable spectral density function then requires an adequate motional model. The recent version of the BLOCH program by Madrid and Jardetzky (unpublished) can take any spectral density function as input and optimize the structure ensemble relative to the NOE pattern. However, the basic problem of defining the correct spectral density function for each case remains. In addition to corrections using an appropriate spectral density function, in principle one also needs to consider an ensemble of structures. Bonvin et al. (1993) used an 'ensemble' iterative relaxation matrix approach in which the NOE is measured as an ensemble property. A relaxation matrix is built from an ensemble of structures, using ⟨1/r⁶⟩ averaging of contributions from different structures. The needed order parameters for fast motions were obtained from a 50-ps molecular dynamics calculation. The relaxation matrix is then used to refine the individual structures. The new structures are used again to reconstruct the relaxation matrix, and a second new set of structures is defined. One repeats the process until the ensemble of structures has converged. The caveat expressed earlier, that the accuracy of the result is limited by the accuracy of the spectral density function, applies to all calculations of this type.
6.7. Sequential simulated annealing
It has been shown that SA can be used to find global minima of the restrained conformational space when the proteins are small or when there exist structures that are close to the global fold. SA will refine the approximate structures from coarse search procedures, such as distance geometry and optimal filtering, for their chemical energies, incorporating information about the bond lengths, bond angles, etc. SA was not used, in this application, to explore the possible conformations. The question is: when the constraints do not uniquely define the global minimum, will SA give the correct range of conformations? Zhao et al. (1993) found that sequential simulated annealing can also give the correct spread of conformations, indicating the error of the structure determination and the protein motion.
6.7.1. Ergodic theorem A fundamental hypothesis of statistical mechanics is the ergodic theorem. Basically it says the system evolves so quickly in the phase space that it visits all of the possible phase points during the time considered. If the system is ergodic, the ensemble average is equivalent to the time average over the trajectory for the time period. The ergodicity of a system depends on the search procedure, force
field parameters, temperature, length of simulation, and the size of the system. One can never assume that a system is ergodic, but when it is, ensemble averages can be calculated easily by using time averages. If the system is not ergodic, then the statistical average along the trajectory will change as the length of the simulation increases. Ordinary MD simulation of proteins is not ergodic, and the result will depend on the starting structure. According to our tests, a single high temperature MD run at 1000 K for 15 ps followed by simulated annealing is not ergodic, even with reduced force constants and reduced van der Waals weights. However, Zhao and Jardetzky (1993) investigated the extent of conformational searching of simulated annealing and found that the sequential use of standard simulated annealing (Nilges et al., 1988; Brünger, 1992) will generate a converged family of structures. In order to use the sequential simulated annealing procedure, one fully refined structure is required as a starting structure. This starting structure can be obtained by a coarse search using distance geometry, optimal filtering, manual model building, ab initio simulated annealing, and/or a combination of the above, followed by simulated annealing refinement. Then high temperature MD is run and the system is cooled down slowly. The output coordinates are saved, another round of high temperature MD and simulated annealing is run, and another set of structures is obtained. Thus, a family of structures can be generated sequentially. The simulated annealing procedure used can replace the standard procedure described earlier. The length of the high temperature MD has to be relatively long, so that the resulting structure is significantly different from the starting one. The initial velocities and the momenta for each round of simulated annealing can also be made different by using different random generator seeds. Zhao and Jardetzky (1993) found that sequential simulated annealing is 40% more efficient for the small system tested. It is relatively time consuming to refine an approximate structure from the output of a distance geometry calculation to a structure satisfying all the empirical energies, and it is easier simply to use sequential simulated annealing to change one refined structure into another that is significantly different.
6.7.2. Starting structure
The simplest way of obtaining a starting structure is manual model building with a graphics program such as Quanta (Molecular Simulations Inc., Waltham, MA), Insight (Biosym Technologies) or MacImdad (Molecular Applications Group, Palo Alto, CA), or a coarse definition of the topology using an automated model building program such as PROTEAN I (Carrara et al., 1990). This gives only an approximate structure and gives no information on the spread of the conformational space. Distance geometry permits the definition of structures from internal distances alone (Blumenthal, 1953) and it has been applied to NMR structure calculations (Crippen, 1977; Kuntz et al., 1979; Havel et al., 1983). It is known that the sampling of distance geometry is somewhat biased. Structures calculated from distance geometry need to be refined with molecular dynamics and simulated annealing to add information about the empirical energy functions and to remove some of the sampling bias (Nilges et al., 1991b). The sampling of metric matrix distance geometry algorithms can also be improved by random metrization (Havel, 1990; Kuszewski et al., 1992). The structures obtained from random
metrization may be more difficult to refine. DGEOM (Blaney et al., 1989), DIANA (Güntert and Wüthrich, 1991), DGII (Havel, 1991b) and X-PLOR (Brünger, 1992) are among the more widely used distance geometry programs. Optimal filtering was proposed by Altman and Jardetzky (1989) as a heuristic refinement method for NMR structure determination and has also been applied in dihedral angle space (Koehl et al., 1992). Optimal filtering uses the exclusion paradigm: during the search all possible conformations are retained except where they are incompatible with the data. This allows a more systematic search of the allowed conformational space. As in the case of distance geometry, it is a purely geometric method, and it calculates the mean positions and standard deviations of each atom. The output also needs to be refined to add information from the empirical force field.

6.8. SA in dihedral angle space
Since protein structures can be described in dihedral angle space, there have been many efforts to design SA procedures using dihedral angle coordinates. In their effort to apply the Metropolis sampling scheme to protein studies, Bassolino et al. (1988) used the backbone dihedral angle space. The attempt has a very important element in that, by changing dihedral angle variables, all bond angles and bond lengths are preserved during the random walk, the degrees of freedom of the system are reduced and the acceptance probability for the trial moves is increased. However, because the dihedral angle motions can be very non-localized, the derivative (dR_j/dθ_i) of a position vector R_j with respect to a dihedral angle θ_i can be very large. As a result, a small step at one dihedral angle can lead to a very large motion for residues far away in the sequence. Therefore, the procedure is not as effective in studying folded proteins.

6.9. Monte Carlo combined with MD
Zhao and Jardetzky (1994b) recently proposed a new Monte Carlo simulated annealing method, which is a combination of molecular dynamics and a Monte Carlo random walk in dihedral angle space. The procedure allows the cooperative motion of three adjacent residues, so that backbone dihedral angles can make random walks with step sizes of up to ±90°. The procedure is basically the following: with a sequence of random numbers, a backbone φ or ψ angle in the protein is randomly chosen and a dihedral angle constraint is set up for a rotation by an amount randomly chosen between a negative and a positive maximum step size. The constraint is imposed by temporarily adding a steep potential well to the total Hamiltonian. Then a subset of atoms along the chain near the chosen dihedral angle is selected and allowed to move and interact with each other. All other atoms are fixed and their interactions turned off. Using the conjugate gradient method, one then energy minimizes the subsystem. With the constraint in place, a short constant temperature MD run (e.g., 0.2 ps) is performed so that the subsystem can move cooperatively to further accommodate the trial move, and the subsystem is equilibrated at 300 K. One runs MD further without the long range interactions to allow the chain to translate in a reptation-like motion. The trial move is then accepted or rejected following the Metropolis sampling procedure (Metropolis et al., 1953) according to the total energy change, with the constraint turned off. Then another random dihedral angle constraint is selected, and the process is repeated.
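The accept/reject decision at the heart of this combined MC/MD scheme is the usual Metropolis criterion; the sketch below shows only that step, with the energy difference supplied by whatever restrained MD/minimization engine is used (the function name and the value of k_B are illustrative).

import math, random

def metropolis_accept(delta_e, temperature, k_b=0.0019872):
    """Metropolis criterion for a trial dihedral move: accept if the total energy
    goes down, otherwise accept with probability exp(-dE / (k_B T)).
    delta_e in kcal/mol, k_b in kcal/(mol K)."""
    if delta_e <= 0.0:
        return True
    return random.random() < math.exp(-delta_e / (k_b * temperature))

# Example: a trial move that raises the energy by 1.0 kcal/mol at 300 K is
# accepted with probability exp(-1.0/0.596), about 0.19.
accepted = metropolis_accept(1.0, 300.0)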
Thus, a Markov chain similar to those in the Metropolis sampling scheme is constructed and the configuration space is sampled efficiently. The section of the
interacting chain effectively forms a 'bead' in the Zimm model and each MC step is equivalent to a trial move by the bead. Related work on the subject is in progress.

6.10. Using SA to calculate 3D structures without spectral assignment
Recently, Kraulis (1994) reported a simulated annealing procedure which is more similar to crystallographic refinement than other NMR refinement protocols. The algorithm first calculates, using molecular dynamics and simulated annealing, 3D positions of the unassigned proton atoms from the set of NOEs. Primary sequence and covalent bond constraints are not used at this stage. Then an exhaustive search is made for plausible residue-type assignments among the different possible proton position distributions in 3D space. Finally, sequence-specific assignments are made by searching among the plausible residue assignments for an optimal fit to the primary sequence of the protein and the positions of protons in space, using Monte Carlo simulated annealing to choose the path of the sequence in space. The procedure has been tested on small systems with 70% or more of the NOE distances. It is yet to be shown that it can be applied to larger systems with relatively sparse NOE distance constraints, when the problem is very underdetermined. The exclusion of primary sequence constraints and covalent bond constraints at the initial stage makes the system even more underdetermined. For systems with most of the cross peaks resolved, however, the method may save some time on spectral assignment.

6.11. Relative weight of structures within a family of structures
Due to the nature of the simulated annealing algorithm, the relative thermal statistical weight of a structure within a family is unknown. For simplicity, it is often assumed that all structures have the same statistical weight, but this is conceptually incorrect. If a relatively large number of structures are calculated for the family, the relative frequency of various portions of the conformational space represents the relative weight of the structures. MD can improve the weights toward a Boltzmann distribution, provided that the propagation of the Markov chain is sufficiently long. The longer the MD refinement, the better the statistical weights.

7. ACCURACY AND PRECISION OF CALCULATED STRUCTURES
7.1. Analysis of the calculated structures
Average structures can be generated by first RMS-superimposing the structures and then taking the average of the positions for each atom. The energies of the bond, angle, van der Waals and improper terms can also be monitored for possible misassignments or failed calculations. The energy values cannot be used without considering the nature and number of the violations. Often the geometric energy terms are used to reduce the violations in bond lengths, bond angles and van der Waals radii. The energy values are not meant to have any physical significance. Altman and Jardetzky (1989) proposed the use of a single set of atomic positions together with their standard deviations to represent an ensemble of structures. To assess how well defined the family of structures is, RMSDs with respect to the average are usually calculated. Backbone atoms (or heavy atoms) are usually better determined and are the most
important in determining the fold of the protein. Sidechains (or hydrogen atoms) are more mobile; therefore RMSDs including only the backbone atoms or only the heavy atoms usually superimpose structures better than RMSDs over all atoms. Residue-by-residue RMSDs are calculated to reveal poorly defined segments. For example, the trp repressor (Mao et al., 1993) has a very rigid core but an apparently very flexible DNA binding region. Only separate RMSDs of the core and of the 'flexible' DNA reading heads, compared with those for the whole protein, will demonstrate this phenomenon, which may be important to the biological function of the protein. A more detailed discussion of the dynamics of this region, which creates the appearance of flexibility, is given in Gryk et al. (1995). For intensity-based refinements, an NMR R factor can be defined,
R = \frac{\sum_{\mathrm{NOEs}} \left| (I_{\mathrm{NOE,obs}})^{n} - k\,(I_{\mathrm{NOE,calc}})^{n} \right|}{\sum_{\mathrm{NOEs}} k\,(I_{\mathrm{NOE,calc}})^{n}} \qquad (26)
where n is chosen to be 1/6 (James et al., 1991; Thomas et al., 1991) or -1/6 (Gonzalez et al., 1991). The R value will include the biases of the motional model used and the assumptions about the auto-correlation function of the bond vectors. If an ensemble of structures is used, it is likely that the R factor will also be influenced by the different empirical potentials and the refinement protocol.

7.2. Accuracy and precision
Accuracy in this context is the faithfulness of the measurement of the structure or ensemble of structures with respect to the true conformational distribution in solution. This true state usually is not known, but may be estimated with structures measured using other techniques. The precision of the structures is indicated by the reproducibility of the measured or calculated structures. NMR structures contain systematic errors due to spectral overlap, bias from the approximate interpretation of spectra and bias from the calculation strategy and algorithm. Because of the large number of pieces of information, human errors in assigning the spectra may also affect the accuracy of the structure. Gross human errors can be detected by an independent determination of the structure by a different group. It is generally accepted that NMR protein structures have an accuracy of no more than one or two Å RMSD, and the average structure is often more accurate than the individual structures (Zhao and Jardetzky, 1994a). Liu et al. (1992) found that the systematic errors in the structures calculated by different methods are within a range of about 1-2 Å, with restrained molecular dynamics and simulated annealing generating structures with smaller errors than the other methods. A protein with N atoms has N(N-1)/2 interatomic distances. NMR only sees a fraction of the distances that are less than 6 Å. Often, especially for larger structures, NMR solution structures are underdetermined (Liu et al., 1992; Zhao and Jardetzky, 1994a). There exists a range of structures that would satisfy all the constraints. Even with a relatively large number of NOEs, say 20 per residue, the structure can still be underdetermined for some residues because the constraints are not uniformly distributed. It is possible to calculate a family of structures with the data made completely consistent, with minimal violations, and still get a result that has poor accuracy. Nothing can replace large
quantities of high quality data, unambiguously assigned NOEs and J-coupling constants. When model-structure-based assignments, database-based assignments and simulation-based parameters are used, the quality of the structure will depend on the validity of the shortcuts used. Liu et al. (1992) have studied the influence of the number of NOE constraints on the accuracy and precision of the calculated structures. They found that when 30% of the NOE constraints were included, the RMSD accuracy more or less converged. There is no question that, initially, the larger the number of experimental constraints and the more accurate the constraints, the better the structures will be determined, but at 20 constraints per residue one is reaching the point of diminishing returns. One should be cautious not to over-interpret the data. Repeatedly using intermediate structures to 'assign' additional ambiguous constraints can lead to highly biased structures with very high precision but low accuracy. Liu et al. (1992) also found that both the accuracy and precision of larger calculated structures are less than for smaller structures, given the same data abundance. This is due to the fact that the number of short distances does not rise as rapidly as the total number of distances, and yet only short distances are provided by NMR data. Zhao and Jardetzky (1994a) have also studied the effect of errors in distance constraints on the accuracy and precision of NMR solution structures and concluded that an RMS deviation better than 1 Å cannot be obtained even if the distance errors are less than 1 Å. If the number of constraints is high but the quality is low, then the structure can be overly precise. Conformational flexibility of certain segments of the peptide chain can be inferred from NMR solution structures through the lack of NOE distance constraints. Motional averaging often results in the disappearance of NOE peaks. The detection of conformational flexibility is a significant advantage of NMR solution structure determination compared to x-ray crystal structure determination. For example, in the E. coli trp repressor structures (Zhao et al., 1993), the flexibility of the N-termini and the DNA reading heads can be defined in the solution structure, while the x-ray structure shows these segments as well-formed helices, though with a large B factor. The lack of a precise understanding of the dynamics of the various motions and their time scales may easily lead to errors in interpreting the NOE data. Therefore, the interpretation of flexibility in calculated structures cannot be made correctly without a good understanding of the dynamics. Without a good understanding of the dynamics, we are unlikely to have an accurate understanding of the structure. NMR solution structures, when compared with crystal structures, are less well defined. This is because NMR experiments are done in solution and at room temperature, where Brownian motion of the proteins is observed. When a family of structures is calculated, we use the spread of the different conformations within the family to represent the precision of the coordinates of the atoms. When optimal filtering (the Kalman filter) is used, the output automatically gives a measure of uncertainty by giving the standard deviation as well as the mean value of the coordinates.
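The 'spread within the family' used here as a precision measure is commonly reported as the RMSD of each structure from the mean coordinates after superposition, together with per-atom standard deviations. A minimal sketch, assuming the structures have already been RMS-fitted (e.g. on the backbone atoms); the array layout is our own choice, not a standard from the cited work.

```python
import numpy as np

def family_spread(coords):
    """Precision measures for a family of superimposed structures.

    coords: array of shape (n_structures, n_atoms, 3), already RMS-fitted.
    Returns the RMSD of each structure from the mean structure and the
    positional standard deviation of each atom about its mean position.
    """
    mean = coords.mean(axis=0)                                # average structure
    diff = coords - mean                                      # displacements from the mean
    rmsd = np.sqrt((diff ** 2).sum(axis=2).mean(axis=1))      # one value per structure
    atom_sd = np.sqrt((diff ** 2).sum(axis=2).mean(axis=0))   # one value per atom
    return rmsd, atom_sd
```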
Since NMR structures cannot be obtained through a simple transformation, they inevitably include the judgment of the investigator and his or her understanding of the dynamics of the protein, so that it is very difficult to compare the quality of structures from different laboratories. Just as the same spectra may be assigned
with some ambiguity, depending on the investigator, the structures can be calculated on the basis of the same spectra but with different judgments on the parameters. The resulting structures may show differences in precision or may even be somewhat different and still represent a reasonable interpretation of the data. If the accuracy and precision of distance constraints and dihedral angle constraints are over-estimated, the structures will be less accurate, although more precise. More NOEs give more precise and more accurate structures, but more precise structures can also be the result of an over-interpretation of the data. Since proteins are known to be flexible in solution, pursuit of a highly refined 'high resolution structure' without considering its dynamics is of questionable validity and value. The ultimate criterion for judging the quality of structures is the correct prediction of their biological function as well as the prediction of physical properties measured using chemical or physical methods.
REFERENCES
Allen, M.P. and Tildesley, D.J. (1987) Computer Simulation of Liquids, Clarendon Press, Oxford.
Altman, R.B. and Jardetzky, O. (1989) Methods Enzymol. 177, 218-246.
Altman, R., Arrowsmith, C., Pachter, R. and Jardetzky, O. (1991) In: Computational Aspects of the Study of Biological Macromolecules by NMR Spectroscopy (Hoch, J.C., Poulsen, F.M. and Redfield, C., eds.), Plenum Publ. Corp., New York, p. 363-374.
Arrowsmith, C.H., Pachter, R., Altman, R.B., Iyer, S. and Jardetzky, O. (1990) Biochemistry 29, 6332-6341.
Baleja, J.D., Moult, J. and Sykes, B.D. (1990) J. Magn. Reson. 87, 375-384.
Bassolino, D.A., Hirata, F., Kitchen, D.G., Kominos, D., Pardi, A. and Levy, R.M. (1988) Int. J. Supercomp. Applic. 2, 41-61.
Berendsen, H.J.C., Postma, J.P.M., van Gunsteren, W.F., DiNola, A. and Haak, J.R. (1984) J. Chem. Phys. 81, 3684-3690.
Blaney, J.M., Crippen, G.M., Dearing, A. and Dixon, J.S. (1990) QCPE, Program 590, Indiana University, Bloomington, IN.
Blumenthal, L. (1953) Theory and Applications of Distance Geometry, Clarendon Press, Oxford.
Boelens, R., Koning, T.M.G. and Kaptein, R.J. (1988) J. Molec. Struct. 173, 299-311.
Boelens, R., Koning, T.M.G., Van der Marel, G.A., van Boom, J.H. and Kaptein, R. (1989) J. Magn. Reson. 82, 290-308.
Bonvin, A.M.J.J., Boelens, R. and Kaptein, R. (1991) J. Biomol. NMR 1, 305-309.
Bonvin, A.M.J.J., Vis, H., Breg, J.N., Burgering, M.J.M., Boelens, R. and Kaptein, R. (1994) J. Mol. Biol. 236, 328-341.
Borgias, B.A. and James, T.L. (1990) J. Magn. Reson. 87, 475-487.
Braun, W. and Go, N. (1985) J. Mol. Biol. 186, 611-626.
Brinkley, J.F., Altman, R.B., Duncan, B.S., Buchanan, B.G. and Jardetzky, O. (1988) J. Chem. Inf. Comput. Sci. 28:4, 194-210.
Brooks, B.R., Bruccoleri, R.E., Olafson, B.D., States, D.J., Swaminathan, S. and Karplus, M. (1983) J. Comput. Chem. 4, 187-217.
Brooks III, C.L., Karplus, M. and Pettitt, B.M. (1988) Proteins: A Theoretical Perspective of Dynamics, Structure and Thermodynamics (Adv. Chem. Phys. 71), John Wiley and Sons, New York.
Brünger, A.T. (1991) Ann. Rev. Phys. Chem. 42, 197-223.
Brünger, A.T. (1992) X-PLOR (Ver. 3.0) Manual, Yale University, New Haven, CT.
Brünger, A.T. and Karplus, M. (1991) Acc. Chem. Res. 24, 54-61.
Brünger, A.T. and Nilges, M. (1993) Quart. Rev. Biophys. 26, 49-125.
Brünger, A.T., Brooks, C.L. and Karplus, M. (1984) Chem. Phys. Lett. 105, 495-500.
Bystrov, V.F. (1976) Prog. NMR Spectrosc. 10, 41-81.
Carrara, E.A., Brinkley, J.F., Cornelius, C.C., Altman, R.B., Brugge, J., Pachter, R., Buchanan, B. and Jardetzky, O. (1990) QCPE Bulletin 10:4, Program 596.
Chandler, D., Weeks, J.D. and Andersen, H.C. (1983) Science 220, 787-794.
Chiche, L., Gregoret, L.M., Cohen, F.E. and Kollman, P.A. (1990) Proc. Natl. Acad. Sci. USA 87, 3240-3243.
Clore, G.M. and Gronenborn, A.M. (1989) Critical Rev. in Biochem. and Mol. Bio. 24(5), 479-564.
Clore, G.M., Gronenborn, A.M., Brünger, A.T. and Karplus, M. (1985) J. Mol. Biol. 186, 435-455.
Connolly, P.J., Stern, A.S. and Hoch, J.C. (1994) J. Am. Chem. Soc. 116, 2675-2676.
Crippen, G.M. (1977) J. Comput. Phys. 24, 96-107.
Ernst, R.R., Bodenhausen, G. and Wokaun, A. (1987) Principles of Nuclear Magnetic Resonance in One and Two Dimensions, Clarendon Press, Oxford.
Forman-Kay, J.D., Clore, G.M., Wingfield, P.T. and Gronenborn, A.M. (1991) Biochemistry 30, 2685-2698.
Forster, M.J. (1991) J. Comput. Chem. 12, 292-300.
Gonzalez, C., Rulmann, J.A.C., Bonvin, A.M.J.J., Boelens, R. and Kaptein, R. (1991) J. Magn. Reson. 91, 659-664.
Gryk, M.R., Finucane, M.D., Zheng, Z. and Jardetzky, O. (1995) J. Mol. Biol. 246, 618-627.
Güntert, P. and Wüthrich, K. (1991) J. Biomol. NMR 1, 447-456.
Habazettl, J., Cieslar, C., Oschkinat, H. and Holak, T.A. (1990) FEBS Lett. 268, 141-145.
Havel, T.F. (1990) Biopolymers 29, 1565-1586.
Havel, T.F. (1991a) J. Symb. Comput. 11, 579-593.
Havel, T.F. (1991b) Prog. Biophys. Mol. Biol. 56, 43-78.
Havel, T.F., Kuntz, I.D. and Crippen, G.M. (1983) Bull. Math. Biol. 45, 665-720.
Jack, A. and Levitt, M. (1978) Acta Crystallogr. A34, 931-935.
James, T.L., Gochin, M., Kerwood, D.J., Pearlman, D.A., Schmitz, U. and Thomas, P.D. (1991) In: Computational Aspects of the Study of Biological Macromolecules by Nuclear Magnetic Resonance Spectroscopy (Hoch, J.C., Poulsen, F.M. and Redfield, C., eds.), Plenum Press, New York, p. 331-347.
Jardetzky, O. (1984) In: Progress in Bioorganic Chemistry and Molecular Biology (Proc. of the Internatl. Conference on the Frontiers of Bio-organic Chemistry and Molecular Biology, Moscow-Alma Ata, June 17-24, 1984) (Ovchinnikov, Yu.A., ed.), Elsevier Science Publishers B.V., Amsterdam, p. 55-63.
Jardetzky, O. and Roberts, G.C.K. (1981) NMR in Molecular Biology, Academic Press, New York, 681 pp.
Kaptein, R., Zuiderweg, E.R.P., Scheek, R.M., Boelens, R. and van Gunsteren, W.F. (1985) J. Mol. Biol. 182, 179-182.
Karplus, M. (1959) J. Chem. Phys. 30, 11-15.
Keepers, J.W. and James, T.L. (1984) J. Magn. Reson. 57, 404-426.
Kessler, H., Griesinger, C., Lautz, J., Milner, A., van Gunsteren, W.F. and Berendsen, H.J.C. (1988) J. Am. Chem. Soc. 110, 3393-3396.
Koehl, P. and Lefèvre, J.-F. (1990) J. Magn. Reson. 86, 565-583.
Koehl, P., Lefèvre, J.-F. and Jardetzky, O. (1992) J. Mol. Biol. 223, 299-315.
Kraulis, P.J. (1994) J. Mol. Biol. 243, 696-718.
Kuntz, I.D., Crippen, G.M. and Kollman, P.A. (1979) Biopolymers 18, 939-957.
Kuntz, I.D., Thomason, J.F. and Oshiro, C.M. (1989) Methods Enzymol. 177, 159-204.
Kuszewski, J., Nilges, M. and Brünger, A.T. (1992) J. Biomol. NMR 2, 33-56.
Lancelot, G., Guesnet, J.L. and Vovelle, F. (1989) Biochemistry 28, 7871-7878.
Lane, A.N. and Jardetzky, O. (1987) Eur. J. Biochem. 164, 389-396.
Levitt, M. (1983a) J. Mol. Biol. 168, 595-620.
Levitt, M. (1983b) J. Mol. Biol. 170, 723-764.
Lichtarge, O., Cornelius, C., Buchanan, B.G. and Jardetzky, O. (1987) Proteins Struct. Funct. Genet. 2, 340-358.
Lipari, G. and Szabo, A. (1982) J. Am. Chem. Soc. 104, 4546-4559.
Liu, Y., Zhao, D., Altman, R. and Jardetzky, O. (1992) J. Biomol. NMR 2, 373-388.
Macura, S. and Ernst, R.R. (1980) Mol. Phys. 41, 95-117.
Madrid, M., Mace, J.E. and Jardetzky, O. (1989) J. Magn. Reson. 83, 267-278.
Markley, J.L., Meadows, D.H. and Jardetzky, O. (1967) J. Mol. Biol. 27, 25-40.
McCammon, J.A., Gelin, B.R. and Karplus, M. (1977) Nature 267, 585-589.
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H. and Teller, E. (1953) J. Chem. Phys. 21, 1087-1092.
Nakai, T., Kidera, A. and Nakamura, H. (1993) J. Biomol. NMR 3, 19-40.
Nerdal, W., Hare, D.R. and Reid, B.R. (1989) Biochemistry 28, 10008-10021.
Nilges, M. (1993) Proteins Struct. Funct. Genet. 17, 297-309.
Nilges, M., Clore, G.M. and Gronenborn, A.M. (1988) FEBS Lett. 229, 317-324.
Nilges, M., Clore, G.M. and Gronenborn, A.M. (1990) Biopolymers 29, 813-822.
Nilges, M., Habazettl, J., Brünger, A.T. and Holak, T.A. (1991a) J. Mol. Biol. 219, 499-510.
Nilges, M., Kuszewski, J. and Brünger, A.T. (1991b) In: Computational Aspects of the Study of Biological Macromolecules by Nuclear Magnetic Resonance Spectroscopy (Hoch, J.C., Poulsen, F.M. and Redfield, C., eds.), Plenum Press, New York, p. 451-455.
Nosé, S. (1984) Mol. Phys. 52, 255-268.
Olejniczak, E.T., Gampe Jr., R.T. and Fesik, S.W. (1986) J. Magn. Reson. 67, 28-41.
Pearlman, D.A. and Kollman, P.A. (1991) J. Chem. Phys. 94, 4532-4545.
Peng, J.W. and Wagner, G. (1992) J. Magn. Reson. 98, 308-332.
Ribeiro, A.A., Wemmer, D., Bray, R.P. and Jardetzky, O. (1981) Biochem. Biophys. Res. Comm. 99, 668-674.
Ryckaert, J.P., Ciccotti, G. and Berendsen, H.J.C. (1977) J. Comput. Phys. 23, 327-341.
Scheek, R.M., van Gunsteren, W.F. and Kaptein, R. (1989) Methods Enzymol. 177, 204-218.
Schmitz, U., Sethson, I., Egan, W.M. and James, T.L. (1992) J. Mol. Biol. 227, 510-531.
Schmitz, U., Ulyanov, N.B., Kumar, A. and James, T.L. (1993) J. Mol. Biol. 234, 373-389.
Singh, U.C., Weiner, P.K., Caldwell, J. and Kollman, P.A. (1986) AMBER 3.0, University of California, San Francisco.
Stolarski, R., Egan, W. and James, T.L. (1992) Biochemistry 31, 7027-7042.
Stryer, L. (1988) Biochemistry (3rd edition), W.H. Freeman, New York.
Swope, W.C., Andersen, H.C., Berens, P.H. and Wilson, K.R. (1982) J. Chem. Phys. 76, 637-649.
Szilagyi, L. and Jardetzky, O. (1989) J. Magn. Reson. 83, 441-449.
Thomas, P.D., Basus, V.J. and James, T.L. (1991) Proc. Natl. Acad. Sci. USA 88, 1237-1241.
Torda, A.E. and van Gunsteren, W.F. (1991) Comp. Phys. Communic. 62, 289-296.
Torda, A.E., Scheek, R.M. and van Gunsteren, W.F. (1989) Chem. Phys. Lett. 157, 289-294.
Torda, A.E., Scheek, R.M. and van Gunsteren, W.F. (1990) J. Mol. Biol. 214, 223-235.
Valleau, J.P. and Whittington, S.G. (1977) In: Statistical Mechanics A. Modern Theoret. Chem., vol. 5 (Berne, B.J., ed.), Plenum Press, New York, p. 137-168.
van Gunsteren, W.F. and Berendsen, H.J.C. (1987) Groningen Molecular Simulation (GROMOS) Library Manual, Biomos, Groningen, The Netherlands.
Verlet, L. (1967) Phys. Rev. 159, 98-103.
Weber, P.L., Morrison, R. and Hare, D. (1988) J. Mol. Biol. 204, 483-487.
Weisz, K., Shafer, R.H., Egan, W. and James, T.L. (1994) Biochemistry 33, 354-366.
Wishart, D.S., Sykes, B.D. and Richards, F.M. (1991) J. Mol. Biol. 222, 311-333.
Wishart, D.S., Sykes, B.D. and Richards, F.M. (1992) Biochemistry 31, 1647-1651.
Yip, P.F. (1993) J. Biomol. NMR 3, 361-365.
Yip, P. and Case, D.A. (1989) J. Magn. Reson. 83, 643-648.
Zhang, H., Zhao, D., Revington, M., Lee, W., Jia, X., Arrowsmith, C. and Jardetzky, O. (1994) J. Mol. Biol. 238, 592-614.
Zhao, D. and Jardetzky, O. (1993) J. Phys. Chem. 97, 3007-3012.
Zhao, D. and Jardetzky, O. (1994a) J. Mol. Biol. 239, 601-607.
Zhao, D. and Jardetzky, O. (1994b) Abstracts of Papers (208th National Meeting of the American Chemical Society, Washington, D.C.), American Chemical Society, Washington, DC, PHYS 237.
Zhao, D., Arrowsmith, C.H., Jia, X. and Jardetzky, O. (1993) J. Mol. Biol. 229, 735-746.
Chapter 15
Structural models of tetrahedrally bonded amorphous materials
F. Wooten(a) and D. Weaire(b)
(a) Department of Applied Science, University of California, Davis, CA 95616, U.S.A.
(b) Physics Department, Trinity College, Dublin 2, Ireland
1. INTRODUCTION
A theoretical picture of the structure of an amorphous solid can be based on indirect arguments or upon the direct testing of models, where the latter is interpreted literally. The computer algorithm we have developed satisfies the test by generating models of tetrahedrally bonded amorphous semiconductors that agree remarkably well with experiment [1,2]. The prototypical example is amorphous silicon (a-Si). However, with simple modifications the algorithm has also been used for modeling tetrahedrally bonded amorphous diamond-like carbon (ta-C) [3], SiO2 glass (by inserting an O atom between each pair of covalently bonded Si atoms), hydrogenated a-Si (a-SiHx) [4] and disordered ice [5,6].

Amorphous silicon is a particularly interesting and important amorphous material. It is the most promising material for a wide range of solar energy conversion applications. As such, it has been the subject of considerable effort devoted to the measurement of its optical, electrical and vibrational properties [7]. Tetrahedrally bonded amorphous carbon (ta-C), often conveniently referred to as amorphous diamond (surely an oxymoron), has been a subject of considerable interest because of a hardness comparable to crystalline diamond, a high thermal conductivity (which makes it potentially useful for electronic components), optical transparency and chemical inertness [8]. It differs from a-Si and a-Ge in that it generally has a mixture of sp3 and sp2 bonding in the network, but there has been much progress in depositing ta-C films that approach pure sp3 bonding. Thus a satisfactory model of pure ta-C is needed to serve as a benchmark.

A fundamental understanding of the properties of these amorphous materials, whether electrical, optical or mechanical, requires a detailed knowledge of their microscopic structure. With the advent of realistic models of their structure, much effort has gone into calculations to elucidate these properties. The models of a-Si to be described here have been used for calculations of the density of electronic states [9,10], thermal conductivity [11], optical properties [12] and charge fluctuations [13]. Models of ta-C are just beginning to be used, an example being the calculation of the electronic structure [14]. The need for structural models is clear, but where does one begin? In the case of a-Si and a-Ge, a number of experiments have given rise to a generally accepted model that consists of a random network of tetrahedrally bonded atoms [15-17]. Hand-built models with free surfaces were consistent with this interpretation but not generally useful [2].
Crystal structures can be determined exactly by means of x-ray diffraction, and the periodicity of the lattice introduces some simplicity into the mathematical analysis. There is no such simplicity for amorphous materials. Only a statistical description is possible. In particular, a one-dimensional correlation function is often presented in the form of a radial distribution function, which is a pair-distribution function averaged over all atomic pairs. It is compatible with a large number of possible structures. The challenge is to separate out one of these realistic and compatible structures from the even greater number of random networks that are poor representations of the structure. Ideally one would like to know the coordinates of atoms in a structural model which is believed to be realistic, representative of the material and of sufficient size to allow comparison with diffraction data to within experimental error [2,17]. This typically requires models with linear dimensions of order 20-50 Å [17], a requirement that is easily satisfied by our algorithm. The model can then be used to calculate from basic principles the equilibrium properties of the material as well as its response to an external stimulus.

The lack of realistic models for many materials is perhaps the single greatest impediment to a fundamental understanding of their structure [17]. The dearth of such models is an indication of the difficulty of devising algorithms to generate them. Thus one can hardly overestimate the importance of developing methods to produce models that agree with experimental results. The conceptual simplicity of the algorithm described in this chapter is deceptive in that regard.

A major difficulty in the application of simulated annealing to structural modeling is finding a suitable move, that is, a suitable rearrangement of bonds. One needs a clever and simple way to change configurations. The Lin-Kernighan move [18], which works so well for the traveling salesman problem [19], essentially unties knots. It is closely related to the bond switch we use for modeling tetrahedrally bonded amorphous materials [20]. But for a-Si the network is more complicated. Each atom is bonded to four other atoms in three dimensions, unlike the traveling salesman's route, which connects a city to only two other cities.

This use of simulated annealing can be accorded various degrees of physical significance, according to one's taste. The most cautious viewpoint regards it as a search for a minimally distorted network. This raises various interesting mathematical questions. What is the meaning of 'random' [21]? Is there really such a minimal structure, for any reasonable measure of the distortion of ideal tetrahedral bonds? The first question is a deep one, but seems to be the source of little difficulty in practice. The second presents an opportunity for a useful theorem, but this is as yet unproven. Alternatively, we may choose to regard the simulated annealing process as a simulation of a real physical process. This is suspect, on two grounds. Firstly, the formation of a-Si must involve other processes, especially those in which dangling bonds play a role. Secondly, the role of the barriers between states is ignored. (It is precisely this that makes more realistic molecular dynamics simulations too slow.) For these reasons, we have tended to adopt the middle ground, in which our calculations are regarded as simulations of an artificial model, which we have called sillium [2].
2. THE WWW ALGORITHM
The original model of a-Si produced by the general algorithm to be described here has come to be known as the WWW model, and the algorithm as the WWW algorithm. Here we describe that algorithm as modified over the years by a gradual evolution in some of the techniques and procedures. The conceptual simplicity of the original algorithm [1,2] is unchanged. The starting point is a supercell of the diamond-cubic lattice (FC-2) consisting of N = 8n³ atoms, where n is the number of FC-2 cells along one dimension. Typically, the supercells have contained from 216 to 4096 atoms. These cells are randomized, roughly speaking melted, by progressively and randomly introducing bond switches as illustrated in Fig. 1.
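As a concrete illustration, the sketch below builds the 8n³ atom positions of such an n × n × n diamond-cubic supercell and a brute-force periodic neighbor table. It is our own minimal version, not the published code; the Si lattice constant of 5.431 Å and the 2.6 Å bonding cutoff are assumed values.

```python
import numpy as np

def diamond_supercell(n, a=5.431):
    """Positions (8*n**3, 3) of an n x n x n diamond-cubic (FC-2) supercell.

    a is the conventional cubic lattice constant (5.431 A for Si); returns
    Cartesian coordinates and the supercell edge length.
    """
    basis = np.array([[0, 0, 0], [0, .5, .5], [.5, 0, .5], [.5, .5, 0]], float)
    basis = np.vstack([basis, basis + 0.25])        # 8 atoms per conventional cell
    cells = np.array([[i, j, k] for i in range(n)
                                for j in range(n)
                                for k in range(n)], float)
    frac = (cells[:, None, :] + basis[None, :, :]).reshape(-1, 3) / n
    return frac * (n * a), n * a

def neighbor_table(pos, box, cutoff=2.6):
    """Four nearest neighbors of each atom, using the minimum-image convention."""
    table = []
    for i in range(len(pos)):
        d = pos - pos[i]
        d -= box * np.round(d / box)                # periodic boundary conditions
        r = np.sqrt((d ** 2).sum(axis=1))
        table.append([j for j in np.argsort(r)[1:5] if r[j] < cutoff])
    return table

# Example: a 216-atom cell (n = 3), as used for the original WWW model.
pos, box = diamond_supercell(3)
nbrs = neighbor_table(pos, box)
```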
Figure 1. Local rearrangement of bonds used to generate random networks from the diamond cubic structure. (a) Configuration of atoms and bonds in the diamond cubic structure; and (b) relaxed configuration of atoms and bonds after switching bonds.

The bond switch of Fig. 1 was inspired by an analogous process used by Weaire and Lambert in studies of two-dimensional networks such as soap froths, reviewed by Weaire and Rivier [22]. In three dimensions it is characterized by switching two second-neighbor bonds that are parallel to each other in the perfect crystal, or nearly so in the randomized structure. This bond switch introduces less strain into the otherwise perfect FC-2 structure than any other elementary rearrangement. It is the simplest possible topological rearrangement in the FC-2 structure and it is the only type of bond switch used in the process of creating a tetrahedrally bonded model. Although it was originally chosen solely on the basis of its simplicity, and had apparently not been discussed prior to the work of Wooten and Weaire in 1984 [23], it later cropped up as the precursor to melting in molecular dynamics simulations by Stillinger and Weber [24], as a suggested mechanism for self-diffusion in silicon by Pandey [25] and even as a mechanism of interconversion between the C2v and D3h isomers of C78 [26]. The supercell conforms to periodic boundary conditions, so there are no free surfaces. Each atom is bonded to four neighbors, so there are no dangling bonds. However, for
modeling a-SiHx it is a simple matter to change the latter requirement as needed. For purposes of energy calculations, bonding is described by the Keating potential [27], which is a simple sum of bond-bending and bond-stretching terms:

V = \frac{3\alpha}{16 d^{2}} \sum_{l,i} \left( \mathbf{r}_{li} \cdot \mathbf{r}_{li} - d^{2} \right)^{2} + \frac{3\beta}{8 d^{2}} \sum_{l,\{i,i'\}} \left( \mathbf{r}_{li} \cdot \mathbf{r}_{li'} + \frac{d^{2}}{3} \right)^{2} \qquad (1)
where α and β are the bond-stretching and bond-bending force constants, respectively, and d is the strain-free equilibrium bond length in the crystal, which for Si is 2.35 Å. The first sum is over all atoms l and their four neighbors designated by i; the second sum is over all atoms l and pairs of distinct neighbors {i, i'}; and r_li is the vector from atom l to its ith neighbor. The bond-stretching force constant, α = 4.75 × 10⁴ dyn/cm, is from Alben et al. [28], and this value was used for the original models. Later models used α = 4.85 × 10⁴, but all used the ratio β/α = 0.285 (when building the model), following Martin [29]. Only the ratio matters for model building. The Keating potential arose in the context of fitting the elastic properties of Group-IV elements. Its virtue is that it provides a good semiempirical description of the bonding forces with only two parameters. Central bond-stretching forces are insufficient for stabilizing a tetrahedrally bonded structure, hence the bond-bending term is essential. The Keating potential has been found to be quite adequate for model building. For other purposes, such as vibrational properties or thermal conductivity, the final structure has been relaxed with a different potential, such as that of Stillinger and Weber [24] or the Weber bond-charge potential [30], depending upon which potential seemed most suitable for the calculations at hand.

The structure is relaxed and the energy is calculated after each bond switch at all stages in the process, both during the original randomization and the subsequent annealing. Bond switches are made according to the usual Monte Carlo Metropolis rules. Switches are made at random on a trial basis and the energies of the two structures are compared. If the new structure is of lower energy it is accepted. If it is of higher energy it is accepted with probability P = exp(−ΔE/kT). To initially randomize the structure, a temperature T > Tm must be chosen, where Tm is a temperature corresponding to an order-disorder phase transition, which is roughly analogous to a melting point. Then the sample is annealed by continuing to make bond switches while slowly lowering the temperature. In this way a model is obtained that is in remarkably good agreement with experiment for a-Si. This highly idealized model for the description of the equilibrium structure of Si, including crystalline and amorphous phases, is what we have called sillium, in the spirit of jellium [2]. In the following sections we elaborate on the procedure and discuss in more detail some of the important technical points required to successfully implement the algorithm.

2.1. Rules for bond switches
Each atom in the original crystalline supercell is numbered, its coordinates are listed in a coordinate table and its four covalently bonded neighbors are listed in a neighbor table. Two neighboring atoms are then selected at random, corresponding to atoms 2 and 5 in the example of Fig. 1. Then neighboring atoms, illustrated by atoms 1 and 6 in Fig. 1, are
chosen at random, but such that the two bonds to be switched are approximately parallel. This is the switch that introduces minimum strain. It can be easily decided in the crystal, but for the random network it requires a more careful analysis. It can be achieved by using geometrical arguments to find the most nearly parallel bonds. However, in practice, the choice has generally been made using topological restrictions: it is required that the four atoms chosen not be members of the same 5-, 6- or 7-fold ring. This restriction also ensures that all bond switches satisfy microscopic reversibility. If a trial bond-pair switch is accepted, the coordinate table and the neighbor table are updated.

In the original algorithm it was required that unrelaxed bonds not be excessively long. Without this restriction, a sequence of bond switches could result in a bond eventually extending over very long distances, creating a Gordian knot and yielding unphysical structures [2]. However, this came about largely because randomization in the first models [1,2] was carried out at infinite temperature. If randomization is carried out at the minimum randomization temperature, as described below, and stopped when randomness is achieved, this restriction is not necessary. Another approach, which restricts the size of rings produced in a bond switch, has been described by O'Mard [5]. It is discussed in section 3.

2.2. Randomization of the network
When bond switches are made the topology of the structure changes. In the original FC-2 crystal structure the topology is characterized by 6-fold rings in the network. There are also 8-fold and higher rings, but in the crystal these are all reducible to 6-fold rings. Reducible and irreducible rings are defined in the Appendix.
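Both the topological restriction of section 2.1 and the ring statistics discussed below require finding the shortest ring that contains a given bond. A minimal sketch of such a check, using the shortest-path ring definition and a breadth-first search, is given below; it is our own illustration, not the authors' code.

```python
from collections import deque

def shortest_ring(nbrs, a, b, max_len=8):
    """Length of the shortest ring containing the bond a-b (shortest-path sense).

    nbrs is the neighbor table.  The bond a-b itself is excluded and a
    breadth-first search looks for the shortest remaining path from a to b.
    Returns None if no ring of length <= max_len exists.
    """
    seen = {a}
    queue = deque([(a, 0)])                 # (atom, number of bonds walked from a)
    while queue:
        atom, dist = queue.popleft()
        if dist + 2 > max_len:              # any ring closed now would be too long
            return None
        for nxt in nbrs[atom]:
            if atom == a and nxt == b:
                continue                    # do not use the a-b bond itself
            if nxt == b:
                return dist + 2             # path a ... atom-b plus the closing a-b bond
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

In the perfect FC-2 crystal this returns 6 for every bond; a WWW567-style restriction (section 3) could be imposed by rejecting any trial switch that creates a ring outside the 5- to 7-fold range.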
Figure 2. The bond-pair switch in a supercell viewed in the [110] direction. The switch introduces 5-fold and 7-fold rings.
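Operationally, the switch of Figs. 1 and 2 is a swap of four entries in the neighbor table, followed by local relaxation and a Metropolis test on the Keating energy of eq. (1). The sketch below is our own minimal version: `relax` stands in for the shell-by-shell relaxation of section 2.4, the ring restriction of section 2.1 is omitted, and the absolute energy scale depends on the unit system chosen for α, β and kT.

```python
import math
import random
import numpy as np

def keating_energy(pos, nbrs, box, d=2.35, alpha=4.85e4, beta=0.285 * 4.85e4):
    """Keating strain energy as written in eq. (1); only the ratio beta/alpha matters here."""
    def bond(l, i):                                  # minimum-image bond vector r_li
        r = pos[i] - pos[l]
        return r - box * np.round(r / box)
    stretch = bend = 0.0
    for l in range(len(pos)):
        vecs = [bond(l, i) for i in nbrs[l]]
        for v in vecs:                               # sum over atoms l and neighbors i
            stretch += (v @ v - d * d) ** 2
        for p in range(len(vecs)):                   # sum over distinct neighbor pairs {i, i'}
            for q in range(p + 1, len(vecs)):
                bend += (vecs[p] @ vecs[q] + d * d / 3.0) ** 2
    return 3 * alpha / (16 * d * d) * stretch + 3 * beta / (8 * d * d) * bend

def bond_switch(nbrs, a, b, c, f):
    """Replace bonds a-c and b-f by a-f and b-c (the switch of Figs. 1 and 2)."""
    nbrs[a][nbrs[a].index(c)] = f
    nbrs[b][nbrs[b].index(f)] = c
    nbrs[c][nbrs[c].index(a)] = b
    nbrs[f][nbrs[f].index(b)] = a

def metropolis_step(pos, nbrs, box, kT, relax):
    """One trial bond-pair switch with Metropolis acceptance (sketch only)."""
    e_old, pos_old = keating_energy(pos, nbrs, box), pos.copy()
    a = random.randrange(len(pos))
    b = random.choice(nbrs[a])                       # central bond a-b (atoms 2 and 5 of Fig. 1)
    c = random.choice([x for x in nbrs[a] if x != b])
    f = random.choice([x for x in nbrs[b] if x != a])
    if f in nbrs[a] or c in nbrs[b] or c == f:
        return False                                 # switch would create a double bond
    bond_switch(nbrs, a, b, c, f)
    relax(pos, nbrs, box, center=(a, b))             # local relaxation about the central bond
    d_e = keating_energy(pos, nbrs, box) - e_old
    if d_e <= 0.0 or random.random() < math.exp(-d_e / kT):
        return True                                  # accept the switch
    bond_switch(nbrs, a, b, f, c)                    # otherwise undo it and restore coordinates
    pos[:] = pos_old
    return False
```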
Introducing a single pair of bond switches into the network introduces four 5-fold rings, brought about by the shortening of 6-fold rings, and sixteen 7-fold rings, eight of which are reducible to a 6-fold ring and one of the four 5-fold rings. See Fig. 2. It is common practice to quote ring statistics for random network models, and we do so in Table 1 for several models. It is almost the only way of comparing the topology of random networks. In earlier work we defined a ring as any closed non-returning path of bonds [23], rather than using the shortest-path definition of Etherington et al. [15]. The latter has the advantage of yielding a distribution function for n-fold rings that is zero at high n, but it is less closely tied to physical properties. For some purposes it is convenient to use irreducible rings, and we do so in Table 1. Irreducible rings and ring statistics are discussed in Appendix A.

Because the randomization process introduces odd rings, one might suppose that the structure has been randomized once the ring statistics have reached a steady state with only small fluctuations about the steady-state value, but this is not so. The steady-state value for the ring statistics is reached well before the structure has been randomized [23]. The most sensitive and stringent test of any long-range order is the structure factor |S(q)|² associated with those reciprocal lattice vectors labeled [111] for the diamond cubic structure [21]. When a supercell is constructed from n³ FC-2 cells, the basis vectors for the reciprocal lattice are smaller by a factor 1/n and their density in reciprocal space is a factor of n³ higher. The structure factor is initially zero for all the additional q values appropriate to the supercell. It is nonzero only for some of the q values that relate to the original smaller cell of the diamond structure. The introduction of disorder by bond switching progressively redistributes the structure factor among all the q values for the supercell. We therefore require the structure factor to be of roughly equal intensity (~1/N) for all q values associated with the supercell.

The randomization takes place by a diffusion of atoms that is implicit in our earlier description of the initial randomization process as being akin to melting [1]. Later it was shown that the root-mean-square displacement of each atom must be of the order of the nearest-neighbor distance in order that the network lose all memory of the original crystal structure as measured by the structure factor S(q) [21]. In this context, the melting point can be defined as that temperature for which the mean square displacement increases linearly with time. It appears, though, that a sequence of bond switches as illustrated in Fig. 1 is not the primary mechanism for self-diffusion in silicon [31,32].

2.3. The randomization temperature
Starting with the structure for the perfect crystal, bond switches are made according to the usual Monte Carlo prescription. The initial strain energy for the structure is zero. When a single pair of bond switches is introduced into the otherwise perfect crystal, the strain energy given by the Keating potential is 4.5 eV. Because the perfect crystal is the energy minimum, no trial bond switches will be accepted on a permanent basis except at a finite temperature. The appropriate temperature can be determined by systematic trial.
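While experimenting with the temperature, the randomness criterion of section 2.2 can be monitored directly by evaluating |S(q)|² on the reciprocal lattice of the supercell and checking that no value stands out. A minimal sketch of that check (our own; the normalization is chosen so that a fully randomized N-atom cell gives values of order 1/N):

```python
import numpy as np

def structure_factor(pos, box, hkl_max=4):
    """|S(q)|^2 at the supercell reciprocal lattice vectors q = (2*pi/box)(h, k, l).

    For the perfect crystal only q values belonging to the small diamond cell
    are nonzero; for a well-randomized network every value should be ~1/N.
    """
    n_atoms = len(pos)
    result = {}
    for h in range(hkl_max + 1):
        for k in range(hkl_max + 1):
            for l in range(hkl_max + 1):
                if h == k == l == 0:
                    continue
                q = 2.0 * np.pi * np.array([h, k, l]) / box
                amp = np.exp(1j * (pos @ q)).sum()       # sum_j exp(i q . r_j)
                result[(h, k, l)] = abs(amp) ** 2 / n_atoms ** 2
    return result
```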
If the initial temperature is nonzero but too low, the four atoms randomly chosen for a bond switch that is actually made are likely to be chosen again after a sufficiently large number of trials. This time the reverse switch will be considered and, of course, the switch will always be accepted because the original state is of lower energy. On the other
hand, if the temperature is sufficiently high, a number of switches will be made before returning to the original one. Some of these switches will be in the same general region of the structure, and the strain field will be changed considerably in that region. There will then be a spectrum of strain energies associated with bond switching. Some switches requiring considerably less than 4.5 eV will be possible. These low-energy switches will be accepted at a much higher rate and the structure will quickly randomize. For silicon, if 4-fold rings are explicitly prohibited by the algorithm, which was the case for the first model constructed [1], the randomization temperature is found to be kTm = 1.0 eV. This order-to-disorder transformation is quite sharp and well-defined. The reverse is not.
Figure 3. A small concentration of bond-pair switches increases the density of 5-fold rings. Further randomization by bond switching is aided by the allowance of 4-fold rings created by the shortening of 5-fold rings.

The region between neighboring bond-pair switches is characterized by a number of 5-fold rings. To take full advantage of potential bond switches in this region that belong to the spectrum of energies lower than 4.5 eV requires the allowance of 4-fold rings that are created by the shortening of 5-fold rings. Thus allowing 4-fold rings makes randomization easier, and the randomization temperature is then found to be kTm = 0.8 eV. See Fig. 3.

2.4. Relaxation of the structure
After each bond-pair switch, the structure is relaxed to the geometrical configuration of minimum energy. This is accomplished by the prescription of Steinhardt et al. [33], modified to provide a relaxation that spreads spherically, in a topological sense, about the central bond involved in the switch, the bond between atoms 2 and 5 in Fig. 1.
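As detailed below, each atom in this region is moved in turn to its local minimum by the Newton step of eq. (2), sweeping over topological shells about the switched bond. A minimal sketch of that loop follows, with `grad` and `hess` as assumed helpers returning the Keating force and force-gradient tensor for a single atom; they are not written out here.

```python
import numpy as np

def relax_atom(pos, atom, grad, hess):
    """Move one atom to its local energy minimum, all other atoms fixed.

    grad(pos, atom) -> 3-vector of dV/dr; hess(pos, atom) -> 3x3 force-gradient
    tensor.  Implements delta_r = -H^(-1) grad V, i.e. eq. (2) below.
    """
    pos[atom] -= np.linalg.solve(hess(pos, atom), grad(pos, atom))

def relax_shells(pos, shells, grad, hess, cycles=2):
    """Shell-by-shell relaxation about the switched bond.

    shells[0] holds the eight atoms of Fig. 1, shells[1] their nearest
    neighbors, and so on.  Shell 1 is relaxed for `cycles` cycles, then
    shells 1-2, then shells 1-3, and so forth.
    """
    for outer in range(1, len(shells) + 1):
        for _ in range(cycles):
            for shell in shells[:outer]:
                for atom in shell:
                    relax_atom(pos, atom, grad, hess)
```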
Contributions to the force on an atom arise from the stretching or contraction of bonds to neighboring atoms and from angular forces arising from deviations from perfect tetrahedral bonding. The angular forces involve nearest and next-nearest neighbors. The force (a vector) and the force gradient (a tensor) are calculated from the Keating potential for each atom in turn. The atom is then moved to its position of equilibrium keeping all other atoms fixed. The vector distance through which each atom is to be moved is found from a Taylor series expansion for the (zero) force about the position of minimum energy for the atom. This gives

\nabla V(\mathbf{r}) = -\left( \nabla \nabla V(\mathbf{r}) \right) \Delta \mathbf{r} \qquad (2)
which then requires a matrix inversion of the gradient tensor to find the distance Δr. When bonds are switched, there is a strong torque on the central atoms, as illustrated by the resultant change in coordinates for atoms 2 and 5 of Fig. 1, and as embedded in the supercell shown in Fig. 2. To first order, only these two atoms move. This is the reason for relaxing atoms in spherical topological shells centered about the central bond. All eight atoms of Fig. 1 are considered to constitute the first shell. The second shell consists of the 18 atoms which are the nearest neighbors of atoms in the first shell. The third shell consists of all nearest neighbors of atoms in the second shell, and so forth for other shells. Each of the atoms in the first shell is relaxed in turn, beginning with atoms 2 and 5 of Fig. 1, until all have been relaxed once. This is the first cycle of relaxation. It has been found in practice that it is generally most efficient to relax the first shell for two cycles before proceeding to higher shells. Then, beginning with the central bond again, both the first shell and the second shell are relaxed for two cycles. This is followed by relaxing the first three shells for two cycles, and so forth for higher shells. It was generally found best to relax over more topological shells than to increase the number of cycles of relaxation beyond two for a given set of shells. The number of shells over which to relax depends upon the expected range of energy differences between configurations. During the initial randomization and the beginning stages of annealing, these differences are usually sufficiently large that energy differences can be calculated using local relaxation over a single shell. During the final stages of annealing, when energy differences are often quite small, three shells might be required to obtain sufficiently accurate energy values.

Our models have been made with an annealing schedule that evolved in an ad hoc way. The first model was the result of computer experiments that, while quite educational, were more than one would be able to publish, but which led to an intuitive feeling for the process. The historical process can be summarized as follows: studies of randomization at infinite temperature showed that one needed to make roughly as many bond-pair switches as there are bonds in the model to lose all memory of the original crystal structure [21]. Note that at infinite temperature these are actual switches, not just trial switches. It was also found that during the annealing process the energy reached a steady-state value, with small fluctuations about the average, after making 2N bond switches, if the temperature change was of the order of several percent. This led us to a prescription in which the temperature is initially lowered in increments of about 0.05 eV, for a-Si,
until the temperature reaches about kT = 0.6 eV. From there until kT = 0.3 eV, the temperature is typically lowered in increments of 0.01 eV. As the temperature is lowered and the model begins to approach its asymptotic final configuration, it takes an exponentially increasing number of trial switches to achieve 2N actual switches. One simply makes a judicious choice of how many trial switches are likely to be useful. When it becomes apparent that essentially no lowering of the energy is practically feasible, the annealing is terminated and a systematic search is made for any remaining bond switches that will lower the energy at 0 K. For other materials, one simply scales the above values according to the new value of Tm found by trial.

The advantage of the procedure as just described is that it allows one to experiment with different annealing schedules and perhaps develop some physical intuition for the process. On the other hand, it is possible to use a standard method of applying a simulated annealing schedule to an optimization problem, as discussed by O'Mard [5]. For generating models of about 512 atoms or more, the geometrical region over which local relaxation is made will be small compared with the model size. It would then be possible to generate bond switches simultaneously in several well-separated regions and thus speed up the entire process. This lends itself to parallel computation. We have not done this, but O'Mard has implemented such a parallel modeling system [5].

3. THE WWW567 ALGORITHM
O'Mard [5] has made a systematic study of the WWW algorithm and made some improvements in its implementation. He calls the modified algorithm the WWW567 algorithm, for reasons that are made clear below. O'Mard recognized that the WWW process, when run for long periods of time at high temperature, produces structures that become increasingly convoluted. This was a serious problem with the first few models we constructed because they were randomized at infinite temperature. It took an excessive amount of time to anneal out the regions of high strain. A partial solution was to require that no bonds be excessively long, defined as meaning that an unrelaxed bond length could be no longer than 1.7 times the ideal bond length. Later, this restriction was discontinued and the procedure was modified to randomize the structure at the randomization temperature Tm and immediately lower the temperature and begin annealing. Indeed, it is even possible to lower the temperature before achieving full randomization. The structure will continue to randomize if the temperature is only 10-20% below Tm. However, this is not a well-defined process. It introduces judgement on the part of the modeler. O'Mard has resolved the problem in a different way, by prohibiting the creation of any non-valid rings, where valid rings are defined as being 5-, 6- or 7-fold. Here the definition of a ring is that of a shortest path, as defined in Appendix A. The restriction on ring size subsumes the original restriction on bond lengths by constraining the allowed topological structures within the model. O'Mard states that "Because no large rings are allowed to be created, the necessity of permitting four-membered
rings, which provide an outlet for such structures, becomes redundant." However, the allowance of 4-fold rings was originally introduced not as an outlet for large rings nor as a means of increasing the convergence rate, but simply because it seemed there was no reason to arbitrarily exclude the formation of 4-fold rings. The greatly increased convergence rate was an unexpected bonus. Whether to allow or disallow 4-fold rings is a matter of judgement. They can easily be allowed, while still retaining the advantages of restricting shortest-path rings to 7 members, by using a WWW4567 algorithm.

4. THE INITIAL CRYSTAL STRUCTURE
The choice of the diamond cubic structure (FC-2) as the initial configuration to be randomized is a natural one. It is the most common structure for group IV elements and related semiconductors with tetrahedral bonding, and it is the one of highest symmetry. It has periodicity built in from the outset and it allows for the possibility that the randomized structure can in principle return to the initial crystal structure. Indeed, with insufficient randomization, the nearly-randomized structure will sometimes return to the perfect FC-2 crystal structure [34]. Without exactly N = 8n³ atoms this option is precluded.

It has been suggested that models could be generated based on the body-centered cubic structure (BC-2) as the starting configuration [35]. For the BC-2 structure, with 2 atoms per unit cell, a supercell would contain N = 2n³ atoms. A 4-fold coordinated network could then be constructed by assigning to each atom, on a stochastic basis, a set of four neighbors to which bonds would be made. Periodicity is built in from the outset, but the constraint imposed by the number of atoms in the cell makes this a bit unsatisfying, even with various schemes for switching bonds to topologically relax the structure and lower the energy. There is no accessible crystalline state of lower energy, because tetrahedral bonding without angular distortion is incompatible with the number of atoms in the supercell, and so the structure is necessarily disordered.

There are, however, other possibilities for an ordered starting structure. One of these is the BC-8 structure of Si, which is a crystalline polymorph created at high pressure [36,37]. The BC-8 structure is body-centered cubic with a basis of 8 atoms per BC-2 lattice site. Thus the BC-8 unit cell consists of 16 atoms. This offers the possibility of supercells containing N = 16n³ atoms with periodicity and strained tetrahedral bonding built in. Weaire and Taylor describe the BC-8 structure as a kind of half-way house between the simplicity of the FC-2 structure and the complexity of the random network that characterizes the amorphous material [38]. The BC-8 Si structure is approximately 10% denser than FC-2 Si, but the important difference is that it introduces both geometric strain and a different topology. One-quarter of the bonds are about 2% shorter and the rest about 2% longer than in the FC-2 structure. Furthermore, there are two types of bond angles, one of approximately 100° and the other of 118°, thus introducing deviations from the ideal tetrahedral angle. Like the FC-2 structure, it has only even rings, but unlike FC-2 there are 8-fold and 10-fold irreducible rings. We have generated a 432-atom supercell of a-Si starting from the BC-8 structure. The correlation function is essentially identical to that of all other models of a-Si generated with the
WWW algorithm. Ring statistics for the model are presented in Appendix A, along with those for a-Si models generated from the FC-2 structure.

5. APPLICATIONS AND RESULTS
In this section we describe idealized models (sillium) generated by the WWW algorithm as well as models of more complicated systems generated by modified versions of the WWW algorithm. Modeling more complicated systems, such as hydrogenated amorphous silicon (a-SiHx), generally requires the incorporation of additional features on an ad hoc basis. There is more intervention on the part of the modeler to achieve the desired results. These arbitrary procedures are ultimately unsatisfactory but seem to be the best one can do at present.
Figure 4. A computer-generated picture of a 216-atom model of a-Si generated by the WWW algorithm.

5.1. Amorphous silicon
A computer-generated picture of a 216-atom model of a-Si generated by the WWW algorithm is shown in Fig. 4. It is impossible to determine by visual inspection whether the model is a good one or not. Even random network models with large angular distortions that have not been topologically relaxed [23] seem to be visually indistinguishable from a model whose characteristics are in agreement with experiment. The primary test of a model, and the simplest and most physically appealing, beyond having the correct density (which in any case is a difficult quantity to measure experimentally), is that the correlation function be in agreement with experiment. Figure 5
shows the correlation function for the original WWW model [1]. Because of the size of the model (the lattice constant for the supercell is 16.28 Å), the function is meaningful only to an r value of about 8 Å. A definition and interpretation of the correlation function is given in Appendix B.
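For orientation, the sketch below computes the underlying pair statistics: a histogram of minimum-image interatomic distances normalized to the bulk density. It is our own minimal version and omits the Gaussian broadening and weighting used for the figures that follow.

```python
import numpy as np

def pair_distribution(pos, box, r_max=8.0, dr=0.05):
    """Pair-distribution function g(r) averaged over all atom pairs (cf. Appendix B).

    Distances use the minimum-image convention in a cubic box of edge `box`;
    the result is meaningful only up to about half the box edge.
    """
    n = len(pos)
    edges = np.arange(0.0, r_max + dr, dr)
    hist = np.zeros(len(edges) - 1)
    for i in range(n - 1):
        d = pos[i + 1:] - pos[i]
        d -= box * np.round(d / box)                  # periodic boundary conditions
        r = np.sqrt((d ** 2).sum(axis=1))
        hist += np.histogram(r, bins=edges)[0]
    rho = n / box ** 3                                # bulk number density
    shell = 4.0 * np.pi * (edges[1:] ** 3 - edges[:-1] ** 3) / 3.0
    g = 2.0 * hist / (n * rho * shell)                # normalized so that g -> 1 at large r
    return 0.5 * (edges[:-1] + edges[1:]), g
```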
Figure 5. Comparison of correlation functions for a-Ge (experimental) and the original 216-atom model of a-Si (scaled to a-Ge), from reference 1.

The proper comparison of correlation functions of models with experimental results is important and nontrivial. To avoid an all-too-common superficial comparison, refer to the work of Wright et al. [17,39]. Although the correlation function in Fig. 5 is generally agreed to be in very good agreement with experiment, later models produced by us and others are even better. Whereas the first model of a-Si [1] had an rms angular deviation of 10.9° from perfect tetrahedral bonding and an rms bond-length deviation of 2.7% from the crystalline value [1], recent 4096-atom models have had angular deviations of only 10.5° and bond-length deviations of 2.5%. This latter result should put to rest the once-expressed idea that distortions in a continuous random network model would increase as the size of the model is increased, leading eventually to such a build-up in strain as to limit the size of the model. Of course, for a hand-built model, there may well be a limit to the size. There is no way to return to the interior of a hand-built model to make rearrangements that would reduce the strain for surface atoms. Figure 6 shows the correlation functions for a recent 4096-atom model of a-Si and the original 216-atom model [1]. The functions have been smoothed by including a Gaussian line broadening corresponding to a full width of 0.23 Å at half-maximum, which corresponds roughly to typical experimental broadening. The 4096-atom correlation function has less noise, because of better statistics. It is also a better model because it was randomized
at 1.0 eV, rather than at infinite temperature. This accounts for the difference in positions of the third peak.
Figure 6. The correlation functions for a 4096-atom model of a-Si (solid line) and the original 216-atom WWW model (thin line) of Fig. 5 and reference 1.

5.2. The crystalline-amorphous interface
The crystalline-amorphous boundary in silicon is an interesting and technologically important one [40]. An example of the technological importance of the interface arises in nanometer-machining techniques using a diamond-turning machine to produce highly polished optical surfaces that are flat to an accuracy of three atomic layers [41,42]. An understanding of the interface and its creation might lead to the development of parameter guidelines to speed up the production rate, which is too slow at present. Nonequilibrium molecular dynamics studies of the indentation process show a transition to the amorphous phase in a region a few atomic layers thick surrounding the lateral faces of the indentor [42], as has been suggested by experimental results [43]. This possibility has also been suggested by modeling of the crystalline-amorphous interface [40].

The problem of describing an interface model has been described by Spaepen [44] as follows: the interface between two phases of a one-component system is, at any given temperature and pressure, inherently unstable. The system can always lower its free energy by removing the interface and converting to the more stable phase or, at equilibrium, to either one of the two phases.

Figure 7 shows a computer-generated picture of a crystalline-amorphous silicon interface generated by the WWW algorithm. A 512-atom supercell of Si was partitioned into two regions, separated along a (100) plane. A 192-atom region was held at 0 K while the other region of 320 atoms was randomized and then annealed [40]. The method could easily be extended in principle to the generation of interfaces in a temperature gradient. However, realistic gradients would require computations over far greater distances than is feasible for modeling purposes.
342
Figure 7. A computer-generated picture of the crystalline/amorphous silicon interface. The structure factor, averaged over the two [100] directions for slabs parallel to the interface, drops to a value characteristic of the bulk amorphous region over a distance of about 3 A, in agreement with molecular dynamics [42] and the experiment of Minowa and Sumino [43]. The stability of the crystalline and amorphous phases at the interface is a matter of kinetics and temperature. At high temperatures, the amorphous region remains randomized. At low temperatures, the structure is effectively frozen in, with only small changes in structure possible over long time periods. It is only at intermediate temperatures, roughly half the randomization temperature Tm , that the crystalline substrate can effectively serve as a seed for crystallization to extend into the amorphous region on a feasible time scale. With the annealing schedule we use, the time (measured in Monte Carlo steps) required for crystallization at T7,12, is two orders of magnitude greater than the time spent in passing through this temperature region during the annealing process. Thus there is insufficient time for crystallization. Presumably a similar kinetic effect dominates in the experimental case [43]. 5.3. Amorphous diamond: ta-C In 1991, McKenzie et al. [45] reported the growth of an amorphous form of carbon which has been shown to be mostly fourfold coordinated — about 86% according to Gaskell et al. [46]. Because it has a mechanical hardness comparable to diamond and is largely tetrahedrally bonded, this form of amorphous carbon is commonly known as amorphous diamond (ta-C). Its experimentally determined radial distribution function is quite similar
343
to a-Si. There is great interest in ta-C as for electronic devices. Already a heterojunction diode has been fabricated from ta-C [47]. To test whether the diamond structure for C, when randomized and annealed according to the WWW algorithm, would yield ta-C and not just recrystallize because of the strong angular forces for spa-bonding in diamond, a 216-atom model [3] was constructed with the force constants appropriate to diamond: a = 1.293 x 10 4 , = 0.8476 x 10 4 dyn/cm. Even increasing by a factor of 10 did not result in recrystallization. Recently, larger and better models have been generated. All are essentially the same as a-Si, apart from having smaller angular deviations and larger deviations in bond length. This is evident from the correlation function shown in Fig. 8 for a 4096-atom model of t a- C .
12-
10
8
6 !.‹ 4
2
4.0
6.0
8.0
10.0
r(A)
Figure 8. The correlation function for a 4096-atom model of amorphous diamond. The 216-atom model of ta-C has been used by Drabold et al. [14] as an important idealization of the tetrahedrally coordinated network in their studies of the theory of diamondlike amorphous carbon. Of course it is known that ta-C as grown thus far is, at best, a complicated mixture of spa and sp2 bonding. Still, idealized models serve a useful purpose as a benchmark. More than that, though idealized as described throughout this chapter, a zero defect concentration is closer to experiment than the results of molecular dynamics modeling. These typically have defect concentrations too high by at least two orders of magnitude and rms angular distortions that are about 4° too high. In practice, the structure of amorphous carbon films are highly dependent upon growth conditions. There remains much controversy about both structure and bonding in the various forms of a-C, although in general terms it is known that films deposited by evaporation or sputtering have a mass density comparable to graphite and are characterized by a high degree of sp2 -bonding whereas cathodic-arc deposition can produce high density
344
diamond-like amorphous (ta-C) films. Jungnickel et al., have used molecular dynamics and simulated annealing to investigate the atomic-scale structural and electronic properties of both low and high density a-C films [48,49]. Frauenheim et al. have also studied hydrogenated a-C [50]. The spa-character of ta-C is strongly correlated with high mass density, which is achieved by a deposition process in which the atoms or clusters of atoms that form the film have high kinetic energy [51,52]. This favors the formation of sp a-bonding which is characteristic of diamond. Although the diamond structure is relatively empty (only 34% of its volume can be filled by hard spheres) diamond C nonetheless has the highest known number density of atoms. Jungnickel et al. generated a 512-atom model of ta-C using a modification of the WWW algorithm. After generating a randomized but not topologically relaxed network, bonds were cut at random until a network with 35% three-fold coordinated carbon atoms was generated. Then hydrogen was introduced randomly to saturate some of the dangling bonds until the hydrogen concentration reached 11% on an atomic basis. Finally, simulated annealing was used to relax the structure both topologically and geometrically. A special set of (unspecified) parameters for the three-fold coordinated carbon atoms was applied in the Keating potential, whereas the hydrogen atoms were allowed to interact via an additional harmonic potential term. Models generated by both molecular dynamics and simulated annealing were in generally good agreement with experimental diffraction results from films with the same overall composition. Hydrogen atoms were found to be of minor importance for the structure and did not significantly change the diffraction results for concentrations below 15 at.%. Low density a-C films were modeled by a completely three-fold coordinated network using a modification of the WWW algorithm [49]. Although it represents an interesting extension of the WWW algorithm, this and other extensions generally incorporate somewhat arbitrary judgements on the part of the modeler and lack the generality and simplicity that is so appealing in the WWW algorithm. These studies offer significant progress toward a fundamental understanding of amorphous carbon systems on the molecular level of chemical bonding. On the other hand, comparisons between theory and experiment have yet to be made with the degree of precision called for by Wright [39]. 5.4. Even-ring models For such binary compounds as GaAs there can be only a few "wrong bonds" (Ga-Ga or As-As), and in some idealized descriptions there are no wrong bonds and thus only even rings. There are algorithms that ensure the generation of an all-even-ring random network starting from a network with only even rings such as the diamond-cubic structure, but the bond switching mechanism is more complicated than that of Fig. 1 and, what is more important, it introduces much more strain into the network. One of these methods has been described by Rivier et al [53]. However, this model and all similar models that we have generated by computer suffer from very large geometric distortions and are quite unphysical. It might be that the best prospect for the computer generation of an even-ring model
345
would be to return to the simplicity of the original bond-switch of Fig. 1 but modified by a weighting factor to bias against the creation of odd rings. The goal is the generation of an idealized model of the homogeneous bulk region of materials such as GaAs. On the other hand, even a model with just a few odd rings would be of interest, as the occasional inclusion of odd rings is not unexpected as a topological defect in real structures. We have successfully eliminated at least 85% of the 5-fold rings usually present in a random network structure like sillium [54]. However, it seems likely that to eliminate all odd rings in a physically reasonable structure will require a much more subtle move or, more likely, a more general algorithm such as that of constrained global optimization [55].
5.5. Hydeogenated amorphous silicon: a-SiHx Amorphous silicon films used in microdevices are generally hydrogenated. Hydrogen stabilizes the structure and passivates dangling bonds that would otherwise be present in concentrations too high to make electronic devices feasible. Hydrogenation makes controlled doping possible. In many applications the hydrogen concentration can be in the range 10-20%, so it is no mere incidental feature of the structure. The modeler must decide how much hydrogen is to be incorporated into the model. These atoms can be included by cutting Si-Si bonds at random and bonding H atoms at the sites of dangling bonds. Then a trivial modification of the WWW algorithm allows one to generate a model. The increased difficulty arises from the need to include extra potential terms to stop unphysical configurations being generated, in which atoms that are not considered to be bonded come together. Note that the original WWW model has no provision for this, but it seems that the low-energy fully-bonded tetrahedral networks raise no such problems. The only change reqired in the WWW algorithm is to ensure that the two atoms that define the central bond (atoms 2 and 5 of Fig. 1) are Si atoms. The H atoms that have been incorporated into the structure will diffuse during the randomization process and the subsequent annealing. Some Si atoms may end up bonded to more than one H atom in this process. Mousseau and Lewis [4] have generated a-Sill x by first making an a-Si model, then introducing a slightly more complicated bond switch in order to create dangling bonds that are more widely separated than in the simple bond switch of Fig. 1. Models generated in this way are often used as the input for further relaxation by molecular dynamics [56].
APPENDIX A: RINGS AND RING STATISTICS There are several common ways to define rings in network structures. One is any closed nonreturning path of bonds [33]. Another is through a shortest path analysis [15,57]. In a shortest path analysis one focuses on each atom in turn. All angles defined by pairs of nearest neighbors of that atom are considered. For a tetrahedrally bonded network there are six such angles, corresponding to the six pairs of nearest neighbors. One limits the count of rings for the particular atom being considered to the shortest path connecting the atoms in each pair of nearest neighbors. This limits the number of rings passing through an atom to six. However, in the FC-2 structure, there are actually two
346 equal shortest paths for each pair of nearest neighbors, even though only one is counted. A shortest path analysis has the advantage of yielding a distribution function for n—fold rings that is zero at high n, but which is less closely tied to physical properties than a count of all nonreturning paths of bonds. Here we define and use irreducible rings for ring statistics. An irreducible ring is one that has no shortcuts across it. That is, given any two atoms (vertices) on the ring, there is no shorter path between the two atoms (as measured by the number of bonds along the path) than a path on the ring itself. One advantage of such rings for topological purposes is that the number of n-fold rings goes to zero for n large, but no topologically important rings are omitted, and the complete table of ring statistics remains finite. In the FC-2 structure, twelve 6-fold irreducible rings pass through each atom. Since each 6-fold ring belongs to six atoms, the counting is normalized to two 6-fold rings per atom. All n—fold rings are normalized in a like manner. In Table 1, we list the irreducible ring statistics for two crystal structures, FC-2 and BC-8, and for four models of amorphous silicon. When a bond-pair switch is introduced into the otherwise perfect FC-2 structure, the number of irreducible rings is conserved. Four 5-folds and eight 7-folds are created, but twelve 6-folds are eliminated. This conservation rule holds until the regions of bond-switching overlap, and is not grossly violated even then in the randomization process. Note that the total number of irreducible rings per atom is exactly two for the FC-2 structure and it is almost two for all the amorphous structures. The BC-8 structure differs markedly from FC-2, with 4.5 irreducible rings per atom. When the structure is randomized, many rings rapidly disappear. A final amorphous structure consisting of 432 atoms (the model denoted N432 in Table 1) generated from a BC-8 supercell again has nearly two irreducible rings per atom. Table 1 Irreducible ring statistics: rings per atom Structure 4-fold 5-fold 6-fold 7-fold 8-fold 9-fold 10-fold Total 2 0 0 0 0 2 0 0 FC-2 4.5 1.5 0 1.5 0 1.5 0 0 BC-8 1.944 0 0.459 0.769 0.546 0.148 0.019 0 PRL 1.963 0.002 0.003 0.444 0.745 0.528 0.171 0.042 N216 1.949 0 0.019 0.466 0.746 0.491 0.166 0.052 N1000 1.888 0.005 0.012 0.518 0.669 0.430 0.190 0.060 N432
APPENDIX B: THE CORRELATION FUNCTION The simplest and most appealing comparison between models and experiment is through the radial distribution function, which is here denoted by g(r) and defined by the equation g(r) = 47rr 2 p(r),
(3)
347
where p(r) is the local number density of atoms at a distance r, averaged with respect to the choice of atom. This gives a direct physical picture of the spatially averaged structure as illustrated in Fig. 8. The area under the first peak of g(r) is
I
g(r)dr =
I 47T-r2 p(r)dr = Z
(4)
is the number of atoms in the first coordination shell surrounding an average atom, assuming no overlap with the second peak. The second peak defines the rms angular deviation. While the radial distribution function holds the greatest appeal, the closely related correlation function t(r) should be used for quantitative comparisons. It is related to g(r) by
t(r) = g(r)Ir.
(5)
The correlation function is more fundamental in that it is in t(r) and not g(r) that experimental broadening is symmetric and r independent [15].
Figure 9. A schematic illustration of the origin of structural features in the radial distribution function. Atoms are shown as lying on sharply defined rings for simplicity. Broadening is incorporated in g(r).
348 REFERENCES 1. 2. 3. 4. 5. 6.
F. Wooten, K. Winer and D. Weaire, Phys. Rev. Letters 54 (1985) 1392. F. Wooten and D. Weaire, Solid State Physics 40 (1987) 1. M.F. Thorpe, B. Djordjevic and F. Wooten (to be published) N. Mousseau and L.J. Lewis, Phys. Rev. B41 (1990) 3702. L.P. O'Mard, Modelling Simul. Mater. Sci. Eng. 1 (1993) 485. L.P. O'Mard, Studies of Disorder in Ice using Simulation Techniques, PhD thesis, Univ. of Kent at Canterbury (1991). 7. S.R. Elliot, Physics of Amorphous Materials (Longman, 1984) 8. J. Robertson, Prog. Solid State Chem. 21 (1991) 199. 9. B.J. Hickey and G.J. Morgan, J. Phys. C19 (1986) 6195. 10. S.K. Bose, K. Winer and O.K. Andersen, Phys. Rev. B37 (1988) 6262. 11. J.L. Feldman, M.D. Kluge, P.B. Allen and F. Wooten, Phys. Rev. B48 (1993) 12589. 12. D. Weaire, D. Hobbs, G.J. Morgan, J.M. Holender and F. Wooten, J. Non-Crystalline Solids 164-166 (1993) 877. 13. S. Kugler, P.R. Surjan and Naray-Szabo, Phys. Rev. B37 (1988) 9069. 14. D.A. Drabold, P.A. Fedders and P. Stumm, Phys. Rev. B49 (1994) 16415. 15. G. Etherington, A.C. Wright, J.T. Wenzel, J.C. Dore, J.H. Clarke and R.N. Sinclair, J. Non-Crystalline Solids 48 (1982) 265. 16. A.C. Wright, G. Etherington, J.A.E. Desa, R.N. Sinclair, G.A.N. Connell and J.C. Mikkelsen Jr., J. Non-Crystalline Solids 49 (1982) 63. 17. A.C. Wright, R.A. Hulme, D.I. Grimley, R.N. Sinclair, S.W. Martin, D.L. Price and F.L. Galeener, J. Non-Crystalline Solids 129 (1991) 213. 18. S. Lin and B.W. Kernighan, Operations Research 21 (1973) 498. 19. S. Kirkpatrick, C.D. Gelatt, Jr. and M.P. Vecchi, Science 220 (1983) 671. 20. F. Wooten and D. Weaire, Key Engineering Materials 13-15 (1987) 109. 21. F. Wooten and D. Weaire, J. Phys. C: Solid State Physics 19 (1986) L411. 22. D. Weaire and N. Rivier, Contemp. Phys. 25 (1984) 59. 23. F. Wooten and D. Weaire, J. Non-Crystalline Solids 64 (1984) 325. 24. F.H. Stillinger and T.A. Weber, Phys. Rev. B31 (1985) 5262. 25. K.C. Pandey, Phys. Rev. Letters 57 (1986) 2287. 26. F. Diederich, R.L. Whetten, C. Thilgen, R. Ettl, I. Chao and M.M. Alvarez, Science 254 (1991) 1768. 27. P.N. Keating, Phys. Rev. 145 (1966) 637. 28. R. Alben, J.E. Smith, Jr., M.H. Brodsky and D. Weaire, Phys. Rev. Letters 30 (1973) 1141. 29. R.M. Martin, Phys. Rev. B: Solid State 1 (1970) 4005. 30. W. Weber Phys. Rev. B: Solid State 15 (1977) 4789. 31. G.J. Dienes and D.O. Welch. Phys. Rev. Letters 59 (1987) 843. 32. P.E. BlOchl, E. Smargiassi, R. Car, D.B. Laks, W. Andreoni and S.T. Pantelides, Phys. Rev. Letters 70 (1993) 2435.
349
33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. 50.
51. 52. 53. 54. 55. 56. 57.
P. Steinhardt, R. Alben and D. Weaire, J. Non-Crystalline Solids 15 (1974) 199. F.Wooten, G.A. Fuller, K. Winer and D. Weaire, J. Non-Crystalline Solids 75 (1985) 45. L. Guttman, AIP Conf. Proc. 20 (1974) 224. R.H. Wentorf, Jr. and J.S. Kasper, Science 139 (1963) 338. J.S. Kasper and S.M. Richards, Acta Crystallogr. 17 (1964) 752. D. Weaire and P.C. Taylor, in Dynamical Properties of Solids, edited by G.K. Horton and A.A. Maradudin, North-Holland, Amsterdam, 1980, p 1. A.C. Wright, J. Non-Crystalline Solids 159 (1993) 264. F. Wooten and D. Weaire, J. Non-Crystalline Solids 114 (1989) 681. S.M. Lee, C.G. Hoover, J.S. Kallman, W.G. Hoover, A.J. DeGroot and F. Wooten, Mat. Res. Soc. Symp. Proc., 291 (1993) 613. J.S. Kallman, W.G. Hoover, C.G. Hoover, A.J. DeGroot, S.M. Lee and F. Wooten, Phys. Rev. B47 (1993) 7705. K. Minowa and K Sumino, Phys. Rev. Letters 69 (1992) 320. F. Spaepen, Acta Metallurgica 26 (1978) 1167. D.R. McKenzie, D.A. Muller and B.A. Pailthorpe, Phys. Rev. Letters 67 (1991) 773. P.A. Gaskell, A. Saeed, P. Chieux and D.R. McKenzie, Philos. Mag. B 66 (1992) 155. V.S. Veerasamy, G.A.J. Amaratunga, C.A. Davis, A.E. Timbs, W.I. Milne and D.R. McKenzie, J. Phys. Condens. Matter 5 (1993) L169. G. Jungnickel, M. Kuhn, F. Richter, U. Stephan, P. Blaudeck and Th. Frauenheim, Diamond and Related Materials 3 (1994) 1056. G. Jungnickel, M. Kuhn, F. Richter, U. Stephan, P. Blaudeck and Th. Frauenheim, J. Non-Crystalline Solids, in press. Th. Frauenheim, G. Jungnickel, U. Stephan, P. Blaudeck, S. Deutschmann, M. Weiler, S. Sattel, K. Jung and H. Erhardt, J. Non-Crystalline Solids, in press. J. Robertson, Adv. Phys. 35 (1986) 317. J. Robertson, in R.E. Clausing (ed.), Diamond and Diamond-like Films, NATO ASI 266B, Plenum, New York, 1991, p. 331. N. Rivier, D. Weaire and R. de Romer, J. Non-Crystalline Solids 105 (1988) 287. F. Wooten and D. Weaire, in R. Vichnevetsky and J.J.H. Miller (eds.), Proc. 13th World Cong. on Comp. and Appl. Math., 1991, p.1698. E.L. Altschuler, T.J. Williams, E.R. Ratner, F. Dowla and F. Wooten, Phys. Rev. Letters 72 (1994) 2676. N. Mousseau and L.J. Lewis, Phys. Rev. B43 (1991) 9810. S.V. King, Nature 213 (1967) 1112.
This Page Intentionally Left Blank
Adaption of Simulated Annealing to Chemical Optimization Problems, Ed. by J.H. Kalivas 1995 Elsevier Science B.V. All rights reserved.
351
Chapter 16
Conformational analysis of flexible molecules Stephen R. Wilson a, Weill Cuib and Frank Guarnieric aDepartment of Chemistry, New York University, New York, NY 10003 bSynaptic Pharmaceuticals, Paramus, NJ 07652 cMt Sinai Medical Center, New York, NY 10029 1. INTRODUCTION In his 1983 paper in Science, Kirkpatrick described simulated annealing and formulated a practical algorithm based on the concept [1]. Due to its appealing concept, simple implementation and broad generality, the method of simulated annealing has been widely applied. We were first to apply the process in the search for the global minimum in conformational analysis [2] and protein folding [3-5]. We will begin the discussion with a brief introduction of the multiple minimum problem as it exists for conformationally flexible molecules, and the problems one faces in the search for the global minimum. In the simplest formulation, the number of conformational possibilities in a molecule
# Conformers = 3
N
may be expressed as 3 N , where N is the number of degrees of freedom (usually rotatable bonds). Therefore the simplest hydrocarbon n-butane shown above (one bond) has 3 conformers. The most basic conformational search method, called grid search or tree search, is a straightforward solution to the multiple minimum problem. One samples geometries systematically along the torsional coordinate by a defined increment. For example, at a rotational resolution of 30 degrees, for the simple molecule n-butane (one degree of freedom), twelve starting conformations would be generated. When these geometries are minimized, four of them will lead to the global minimum. The others will lead to two other known local minima, since, from a given starting structure, energy minimization always move the molecule downhill toward the bottom of the nearest energy well. For very small molecules, this approach works very well and is still the method of choice. The number of conformers (and therefore the CPU cost for their geometry optimization) scales as the power
352 of the number of angles to be rotated. Therefore, when more than 10 torsion angles must be sampled, this method is prohibitively expensive. A somewhat better approach uses Monte Carlo methods. The energy surface is randomly sampled. The conformations generated are subsequently optimized with local minimization. If one is lucky enough to sample the right part of the energy surface, the global minimum can be found. Simulated annealing provides a more efficient application of Monte Carlo methods. To apply simulated annealing to conformation sampling, we implemented the following process [2]. A random walk in the conformation space of a molecule is implemented. pick bond pick angle
*
E
.00.00,...........,.
start
E new
Figure 1. Anneal-Conformer generates new molecular geometries randomly. First, the starting molecular energy Estn, is computed (Figure 1). The one of the rotatable bonds of the starting geometry is picked randomly and is rotated a random number of degrees. The new molecular energy Enew, is computed and the new geometry is accepted or rejected depending on an energetic criterion related to temperature. The probability of accepting this new conformation is calculated in two steps. First, a probability density function (pdf) is calculated as pdf= e-AEIRT a Boltzman factor. Then, this pdf value is compare with a random number RM (0 < RM < 1). If pdf > = RM, this new conformation is accepted otherwise, this new conformation is rejected. Downhill moves on the energy surface (E new < Estart) is always accepted. Only in the case of the uphill move (En, > ESA), are the moves accepted or rejected based on the pdf at that temperature. Thus we can characterize simulated annealing as energy and temperature directed Monte Carlo sampling. An extensive study of hydrocarbons established that simulated annealing is an excellent method for unfolding chains to produce the global minimum (Figure 2). In the course of optimization of the geometry of n-octane the energy change in the random walk is noted (Figure 3).
353
Anneal-Conformer
Starting Conformer E = 35.62 kJ/mol
Global Minimum E = 19.49 kJ/mol
Figure 2. Simulated annealing of n-octane showing starting and ending structure.
0
1000
2000
3000
steps Figure 3. Simulated annealing of n-octane showing the curve of energy vs steps of the random walk.
354 1.1. Simulated annealing of met-enkephalin and other peptides We extended the use of our program Anneal-Conformer and the Amber force field for conformation searching of peptides [5]. We were also able to efficiently locate several new families of active conformations of Met-enkephalin [6]. Similar studies of Met-enkephalin by Kawai [7] on a simplified ECEPP energy surface also demonstrated the efficiency of simulated annealing . The simulated annealing strategy for conformational searching was challenged in a study by Nayeem et al [8]. Nayeem concluded that, while simulated annealing converges to low energy conformations significantly faster than his own MCM technique, it does not converge to a unique minimum whereas MCM does. This conclusion was refuted in a recent publication by Kawai [9]. The efficiency of simulated annealing for conformational searching was supported by additional studies of Met-enkephalin by several other groups [10,11]. Simulated annealing has also been applied to the conformational study of dipeptide models of Gly, Ala, and Asp; pentaglycine and Leu-enkephalin [11]; an analog of vasopressin [10], (3S, 4S)-statin [12], analogs of thyrotropin releasing hormone [13] and the C-peptide of Ribonuclease A [13]. 1.2. New simulated annealing methodology Most simulated annealing studies have used the energy surfaces of molecular mechanics force fields for conformation searching. Some reports have also appeared that extend the use of simulated annealing to quantum mechanical potentials [15-17]. We have recently reported an algorithm for simulated annealing rings with using a program called Anneal-Ring [18]. We have located the global minimum for ring systems up to 17-membered [ 9]. Other new simulated annealing methodology has also been introduced. Lelj applied molecular dynamics simulated annealing, to the global conformational search of a molecule of biological interest, (3S,4S)-statine [12]. He proposed a new criterion for monitoring conformational transitions called fractional energy fluctuation, instead of the specific heat used in the original algorithm of Kirkpatrick [1]. A hybrid algorithm reported by Morely, combines aspects of molecular dynamics, Metropolis Monte Carlo sampling, and simulated annealing [20]. In this method, trial conformations are generated by short bursts of concerted molecular dynamics, with the kinetic energy concentrated into one randomly selected bond rotation. The dynamics trajectories are allowed to continue as long as the energies of the new structures satisfy the Metropolis test. By gradually decreasing the temperature, a simulated annealing protocol can be built upon this combined sampling technique. The effective use of temperature as a control factor in the Metropolis Monte Carlo process is the most important factor for the success of simulated annealing method. This has inspired an attempt to exploit temperature in a new way. Von Freyberg [21] has proposed a "Simulated Shocking" protocol for the efficient search for conformations of polypeptides. In this protocol, the temperature jumps between a very low (T = 5 K) and very high temperature
355 (T = 2000 K). Von Freyberg claims this variable temperature schedule works better than a continuously decreasing temperature. 2. THE SIMULATED ANNEALING PATHWAY Simulated annealing as a general optimization technique takes an arbitrary configuration as input, and outputs the global minimum. The method is usually employed with no interest in the intrinsic mechanism of the algorithm or the random pathway to the optimal configuration. The use of the algorithm as a "black box" is common because of the endpoint nature of simulated annealing. The user focuses only on the input and output configuration. In this formulation, the investigator gains confidence that the output is actually the best configuration by repeatedly converging to this same configuration starting from different arbitrarily chosen inputs. While the power of simulated annealing in finding the global energy minimum has allowed conformational analysis of highly flexible systems, location of the global energy minimum, is often only one aspect of a problem. The chemical and physical properties of flexible molecules are a reflection of the Boltzmann weighted average of all low energy conformers. In addition, important topics of great current interest, such as protein folding, involve a study of the pathway to the global minimum. Perhaps information about this folding process can be attained by monitoring the simulated annealing algorithm. 2.1. Monitoring structural changes To study the folding pathway of a simulated annealing run, our Anneal-Conformer program [22] was equipped with monitoring options for energy, theta, angle and distance. The first parameter is the total molecular mechanics energy of the molecule, the second the accepted change in torsion angle, the third the torsion angle, and the last the relative distance between two preselected atoms. After every accepted step, each parameter can be recorded. To see if monitoring these variables actually yields any useful information on the folding pathway, multiple simulated annealing runs with all parameter monitoring facilities implemented were carried out on a family of vitamin D molecules (Figure 4) [23]. The classical biological responses for 1,25(OH) 2-vitamin D3 are now well established. The vitamin D hormone is known to mediate intestinal calcium absorption and bone calcium metabolism via its receptor. More recently, "non-traditional activities" have been discovered including cell differentiation and proliferation. Several synthetic vitamin D analogs have been prepared and their activities investigated. Of particular importance for potential anti-cancer and immune supression drugs, are vitamin D compounds whose bone calcium activity is low and cell differentiation activity is high. Compound 1 and 2 are traditional vitamin D structures with high bone calcemic activity (calcium homeostasis) and only modest cell differentiation activity.
356
0
2 (vitamin D2 ketone)
1 (vitamin D3 ketone)
0
3 (vitamin D-yne ketone)
4 (vitamin D-oxy ketone)
Figure 4. The Vitamin D Family. Ketone models for side chain simulated annealing (cf. ref 32.) The natural hormones have 0 = 1,25(OH) 2-Ring A. The pathway monitoring data, gives some interesting qualitative flexibility data. For example, theta (the accepted change in torsion angle) vs. steps for the vitamin D3 ketone 1 and the vitamin D2-yne ketone 3 (Figures 5a and 5b), clearly show a striking difference in relative flexibility. Considerable large-amplitude angle fluctuations in the vitamin D2-yne ketone 3 occur down to much lower temperatures that in the former molecule. A discussion of some of the implications of differences in other variables during the simulated annealing run may be found in references 5 and 24.
357
Figure 5a. Theta vs steps plot for vitamin D3 ketone model. 5b.Theta vs steps plot for vitamin D-yne ketone model.
358 2.2. Monitoring bond rotation frequency While some of this pathway monitoring data is interesting and informative, it is qualitative. In an attempt to quantitate intramolecular flexibility, a program called Flex was devised. Flex obtains intramolecular flexibility data by monitoring the relative acceptance rates of each rotatable dihedral in a flexible molecule over the course of an entire simulated annealing run. Even though all flexible torsions are randomly selected for random rotation the same number of times, the acceptance rate of the new conformation is strongly dependent upon which dihedral has been selected. As a test, the program was run on the molecules in the vitamin D family shown in Figure 4. The results are collected in Table 1. Examining the data for the vitamin D2-ketone 2, for example, indicates that bond 4 is about 50% more flexible than bond 1, bond 3 is twice as flexible as bond 1. Similar observations are easily made on the intrinsic flexibility of the other analogs and have been described in reference 24. Table 1. Flex Acceptance Percentages for the Vitamin D Family (Figure 4) Vitamin D3 (1) Vitamin D2 (2) Vitamin D-yne (3) Vitamin D-Oxy (4) Bond (%) (%) (%) (%) 1 2 3 4 5
12.42 19.93 22.71 23.38 21.53
14.77 33.46 30.66 21.09 -
11.33 36.48 36.87 15.29 -
17.42 21.83 19.30 18.89 22.54
Two specific aspects of this data are especially noteworthy. The central bonds of vitamin D2-yne 3 are 2-3 times more flexible than its other rotational torsions. This flexibility can also be seen in Figure 5b--large amplitude rotations. The vitamin D2-yne 3 data shows how Flex quantitates a well known phenomena--the lack of steric interactions around a triple bond. The acceptance rate for the third bond of vitamin D3-oxy 4 is only 19%, relatively low compared to the others and suggestive of conformational restriction. 2.3. Bond statistics and flexibility monitoring Detailed chemical and physical insights on the nature of flexible molecules requires structural information. Since bond angles and bond lengths are approximately constant, structural information means knowledge of the dihedral space of the rotatable torsions. Obtaining this knowledge requires two things: a good model to generate accurate data, and a concise means of presenting the data. There are two broad classes of dihedral space models for studying and building flexible chain molecules. The first, known as Rotational Isomeric State Theory (RIS) [25], was developed to study industrial polymers. The second, known as Chain Buildup (CB) [26], was
359 developed to study biopolymers. Both of these models begin with the smallest possible molecularly distinct fragment of the system under study. In RIS, this fragment is generally one monomer unit of a polymer. In CB, this fragment is usually an amino acid (if a peptide is the system under study). Chemical and physical relevance is the reason for using a molecularly distinct fragment. Computational ease and necessity generally demand using the smallest possible system. Utilizing the smallest entity is especially required for these methods because both start by exhaustively examining the entire dihedral space to create every possible conformation of the fragment. Then, each conformer is energy minimized with duplicates and high energy structures discarded. In CB, these low energy fragments are stored as a database. If a peptide is to be studied, twenty such databases need to be created. Conformations of a peptide are created by combining all possible combinations of conformers of the appropriate amino acid sequence. Since RIS was developed to study industrial polymers, there is no need to form multiple databases because the chain is built by repetitively linking the same monomer unit. From the collection of minima of the monomer fragment, rotational state probabilities proportional to the energy of the state and conditional on the rotational state of the previous nearest neighbor are developed. Since only these conditional probabilities [27] are needed for chain building in the RIS model, only these quantities are retained. Rotational Isomeric State Theory and Chain Buildup are both local models. In constructing conformations, both methods only take into account local interactions. Long range interactions are totally neglected in the building process. While ignoring non-bonded interactions greatly simplifies computational complexities, it substantially removes the model from physical reality. The other commonality of these methods is that final results are displayed as individual structures. If, however, as is often the case, an large collection of conformations are present at the end of the process, how can this multitude be dealt with? 2.4. Dihedral distribution functions Our Flex program was improved to deal with both problems: incorporation of all interactions, and conversion of all relevant data into a convenient form. The total package consists of a modified simulated annealing algorithm called Anneal-Flex. The final data is visualized using the Macintosh program Deltagraph' [28]. Not only are all intramolecular interactions efficiently studied as a function of temperature, but all the data is compacted into one graph [29]. 2.5. Anneal-Flex As previously described, the method of simulated annealing only generates one (hopefully optimal) configuration. Monitoring the entire simulated folding process therefore requires modification of the basic algorithm. To follow the detailed course of the process at each temperature block, the simulated annealing algorithm has been modified to output at every accepted step the accepted step number, rotated bond, rotational increment, the energy of the new conformation, and the new dihedral value after rotation.
360 A typical Anneal-Flex run on a molecule such as the vitamin D3 ketone 1 consists of 20 runs of 1000 steps per temperature at 30 temperatures. Since the acceptance rate is usually around 30%, there are about 180,000 accepted steps or 9,000 lines of data for each 20-run file. In classical statistical mechanics, one Anneal-Flex run can be considered as one member of an ensemble [30]. The collection of twenty runs is the ensemble. In this type of formulation, the numerical value of the quantity of interest is obtained by calculating averages over this ensemble. While the quantities that we are interested in are too complicated to be represented by a single number, the same statistical mechanical principles can be used to create the distribution functions which accurately represent dihedral space. To create the dihedral distribution functions Anneal-Flex data and performs the averaging over thirty-six 10 degree intervals for each temperature. Details of the process may be found in reference 24. The final data is output in Deltagraph thi format and plotted. We initially called these plots Flex-Maps [29] but more recently have settled on the term "conformational memories" [31]. 2.6. Conformational memories and bioactivity Several uses and applications of conformational memories have been investigated: calculation of rotational states [29], study of stereochemistry [29], bioactivity prediction [32], and importance sampling [31]. Conformational memories for the vitamin D family in Figure 4 have been created and are shown in Figures 6-9. These diagrams contain information on the rotational states of all angles in the molecules. For example, bonds with typical 3 rotational states can be observed, i.e. compound 6 bond 5 (Figure 6). Moreover, these plots must contain all information about the "shapes" of the compounds and indeed are a representation of the conformation space of each analog. Thus, they may reveal subtle information about structure activity relationships. The vitamin D hormone has two major biochemical regulatory functions -- calcium homeostasis, and cell differentiation [23]. While both vitamin D3 1 and vitamin D2 2 show a high level of both activities, the vitamin D-23-oxy analog 4 has a high level of activity only for cell differentiation and not for calcium homeostasis. Comparing the conformational memories for vitamin D3 ketone 1 with vitamin D-23-oxy ketone 4 , there is a major difference in bond 2: vitamin D3 1 has a large population in the 60 degree range which does not exist for vitamin D-23-oxy 4. Thus it may be that the conformation which corresponds to this region is crucial for cell differentiation. Apparently, the specific conformation affected by the decreased flexibility caused by the replacement of a carbon by an oxygen in the side chain can be pinpointed using Anneal-Flex. If the conformational memories can be used to understand the differences in the conformations and bioactivity of known analogs, perhaps it could also be used to predict the activity of unknown analogs. To test this hypothesis, we ran Anneal-Flex on a new compound vitamin D-23-oxy ketone 5. The conformational memories of this compound are
DIHEDRAL
DIHEDRAL 4
DIHEDRAL 2
DIHEDRAL 5
Figure 6. Conformational memories for vitamin D3 ketone (1)
180 160 140 120 100 80
160 140 120 loo 5 so
0.6 60 40
40
g
140 120 100 iS BO E 60 40
DIHEDRAL I
DIHEDRAL 2
DIHEDRAL 3
• IMO- 1111
.t ..n• gralta.Twir NINIP.011? ft I T. 'fa / es.S.ON 1 / g
a -IA g
DIHEDRAL 4
Figure 7. Conformational memories for vitamin D2 ketone (2)
I tiDIHEDRAL1 1 * —
g
ig
DIHEDRAL 4
g t tr
0 3 ''' a t a
11,11-fEDRAL,2
i is DIHEDRAL
3
140
160 140 Z 120
120 100
1
80
8 6 4 111,41141,
.11•111, ID
• 11IP
Alt 111, 4
11114.111, 7
11.1.111
1[50.•
51110
=1 7.74
517.74
esa.5111
ecas_eses
122.01 g
At
C=1,
DIHEDRAL 1 180 140
g
122.01
DIHEDRA L 2
DIHEDRAL 3 140 120
,ss
100 OH
5 ?. 40 ADM
ItI.II UM- 7411 11,00.essr
—14
51 7. 7.• 105.0• 122.01
IN
•
DIHEDRAL. 4
w
a DIHEDRAL 5
Figure 9. Conformational memories for vitamin D-22-oxy ketone (4)
ig
160 140 120 p 100
250 200 50
80 Q. 60
4
ski
411.1 '2*
n
*
DIHEDRA Lt_
DINIIEDIRA L. I
E
lig re 2
*
DIHEDRAL 3
OH
f2i DIHEDRALL 4
—14
DIHEDRAL S
Figure 10. Conformational memories for vitamin D-23-oxy ketone (5)
rtt
366
Y 5 (vitamin D-23-oxy) shown in Figure 10. A careful examination of bond 2 shows significant population in the 60 degree region similar to bond 2 for vitamin D3-ketone 1 (Figure 6) and quite unlike bond 2 for the vitamin D-22-oxy ketone (Figure 9). This observation predicts that vitamin D-23-oxy 5 should possess conformation space similar to vitamin D3 1 , and thus high activity for both calcium homeostasis and cell differentiation. Vitamin D-23-oxy 5 was not a known compound at the time when these calculations were done [33]. Subsequent to our prediction [32], vitamin D-23-oxy 5 was synthesized and tested [34,35]. Bioactivity results were consistent with our prediction, i.e., this new analog possesses a high degree of activity for both calcium homeostasis and cell differentiation just like vitamin D3 1. 3. CONCLUSION Simulated annealing has be demonstrated to be an excellent method for examining conformations of molecules. Optimization of the global geometry of a compound using this method gives the global minimum. In addition our recent work has shown that the entire conformational information of a flexible molecule contained within a complex force field can be converted to a small set of rotational distribution graphs called "conformational memories". Studies on the rotational states of vitamin D illustrate the visual power of the method and suggest that the method could be used for struccture/activity studies of biologically active molecules. REFERENCES 1. 2. 3. 4.
S. Kirkpatrick, C.D. Gelatt Jr, M. P. Vecchi, Science, 220 (1983) 671. S.R. Wilson, W. Cui, J. Moskowitz, K. Schmidt, Tetrahedron Lett., (1988) 4343. S.R. Wilson, W. Cui, J. Moskowitz, K. Schmidt, Int. Jour. of Quant. Chem., 22 (1988) 611. S.R.Wilson, W. Cui, J. Moskowitz, K. Schmidt., J. Comput. Chem., 12 (1991) 342.
367 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27.
S.R.Wilson, W. Cui, Biopolymers, 29 (1990) 255. T. Metcamp, W. Cui, S. R. Wilson, J. Mol. Struc. (Theochem.), 308 (1994) 37. H. Kawai, T. Kikuchi, Y. Okamoto, Protein Eng., 3 (1989) 85. A. Nayeem, J. Villa, H. Scheraga, J. Comput. Chem., 12 (1991) 594. Y. Okamoto, T. Kikuchi, H. Kawai, Chem. Lett., 7 (1992) 1275. Q. Deng, Y. Han, L. Lai, X. Xu , Y. Tang, M. Hao, Chin. Chem. Lett., 2 (1991) 809. L.B. Morales, R. Garduno-Juarez D. , Romero, J. Biomol. Struct. Dyn., 8 (1991) 721. F. Lelj, P.L. Cristinziano, Biopolymers, 31 (1991) 663. R. Garduno-Juarez, F. Perez-Neri, J. Biomol. Struct. Dyn., 8 (1991) 737. Y. Okamoto, M. Fukugita, T. Nakataka, H. Kawai, Protein Eng., 4 (1991) 639. P. Dutta, D. Majumdar, S.P. Bhattacharyya, Chem. Phys. Lett., 181 (1991) 293. P. Dutta, S. P. Bhattacharyya, Phys. Lett. A, 148 (1990) 331. M.J. Field, Chem. Phys. Lett., 172 (1990) 83. F. Guarnieri, W. Cui, S. R. Wilson, J. Chem. Soc. Chem. Commun., (1991) 1542. F. Guarnieri, S. R. Wilson, Tetrahedron., 48 (1992) 4271. S.D. Morely, D. E. Jackson, M.R. Saunders, J.G. Vinter, J. Comput. Chem., 13 (1992) 693. B. Von Freyberg, W. Braun, J. Comp. Chem., 12 (1991) 1065. W. Cui, PhD Thesis, New York University, (1988). For a survey of current vitamin D research see: Proceedings of the Ninth Workshop on Vitamin D, A. Norman, De Gruyter, Amsterdam 1994. F. Guarnieri, PhD Thesis, New York University, (1991). P.J. Flory, Macromolecules, 7 (1974) 381. L.G. Dunfield, H.A. Scheraga, Macromolecules, 13 (1980) 1415. For a discussion of conditional probabilities see: Freund and Walpole, Mathematical Statistics, Prentice-Hall Inc. Englewood Cliffs, N.J. (1980)
28. DeltagraphmCopyright Deltapoint, Inc., 200 Heritage Harbor, Suite , Monterey, CA. 93940 29. S.R. Wilson, F. Guarnieri, Tetrahedron Lett., 32 (1991) 3601. 30. R. Tolman, The Principles of Statistical Mechanics, Dover Press, New York (1971). 31. F. Guarnieri, S.R. Wilson, J. Comp Chem., (1994) in press. 32. For a preliminary report on vitamin D side-chain flexibility see: S. R. Wilson, F. Guarnieri, Proceedings of the Eighth Workshop on Vitamin D, pg. 208, De Gruyter, Amsterdam 1994. 33. A. Yasmin, PhD Thesis, New York University, (1991). 34. G. Neef, A.Steinmeyer, Tetrahedron Lett., 5073 (1991). 35. G. Neef, A. Steinmeyer, G. Kirsch, K. Schwartz, M. Haberey, R. Thieroff-Ekerdt, P. Rach, Ger, Patent [Chem. Abs. 117. 192157 (1991).]
Adaption of Simulated Annealing to Chemical Optimization Problems, Ed. by J.H. Kalivas 1995 Elsevier Science B.V. All rights reserved.
369
Chapter 17
Simulated annealing-optimal histogram applications to the protein folding problem D. M. Fergusona and D. G. Garrettb aDepartment of Medicinal Chemistry and Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN 55455 b Unisys Corporation, P.O. Box 64525, St. Paul, MN 55164
1. INTRODUCTION Of all the potential applications of simulated annealing methods in chemistry, the protein folding problem may be the most provocative. [1-4] From the development of new biopolymeric materials to the design of novel drugs, the a priori knowledge of protein structures is a prerequisite for success. Unfortunately, little is known about the precise rules that govern protein folding; and what is known is derived from a small set of solved protein structures. All is not lost, however, since a great deal of information is contained in the primary sequence of proteins (see figure 1). In fact, as Anfinsen initially demonstrated,[5] this half of the genetic code contains all the information necessary to fold the protein to the compact, native state. The basic idea that proteins could be reconstituted, or refolded under the appropriate conditions was further validated by Privalov and forms the basis to the thermodynamic hypothesis of protein folding.[6] Although exceptions to this generality exist, it is fairly well accepted that the majority of small globular proteins do reach the global minimum free energy conformation, or the corresponding structural domain on the energy surface, at equilibrium.[1] The ramifications to theoretical treatments of protein folding are, of course, great. First, given a primary sequence of a globular protein, it should be possible to predict the 3-dimensional structure from first principles. The problem essentially becomes one of global optimization. And second, the problem should be tractable thermodynamically. That is, by adjusting the conditions (specifically temperature) it should be possible to explore the folding process and driving forces that lead to the native state. The suitability of simulated annealing protocols to problems of this type is well established (as the other chapters in this book indicate). In theory, it should be possible to follow the folding process from start to finish using fairly straightforward adaptations of established annealing schemes and potential energy functions. In practice, however, our applications are severely limited by the sheer size of the problem. Popular atom-based force field calculations, which are commonly used in conformational studies, are still several orders of magnitude away from reaching time-scales relavent to fold proteins. Even small protein fragments, such as the myoglobin fragment shown in figure 1, are far beyond
370 the reach of conventional simulation methods and techniques at atomic resolution. The numerical explosion encountered as system sizes increase beyond tens to hundreds and thousands of atoms is simply overwhelming. To address this problem, simplified models have emerged.[1-3] For the most part, these models capture the physical properties of the amino acids on an "average" basis, allowing the protein to be modeled by some minimal reduced-atom representation. Although this approach has produced many interesting and insightful results over the years,[1-3] the general applicability of these models to simulating real proteins and folding processes has not been fully established.
H P
H
Figure 1. Schematic diagram of a short protein sequence taken from myoglobin folding from the random coil to the native conformation. The remainder of the protein is omitted for simplicity. (The sequence is given in standard one letter amino acid codes.)
To solve the greatest questions of protein folding, however, this gap must be bridged. Obviously, predicted folding pathways and mechanisms are only meaningful if the physical significance of the model can be established as well. This is also true for the continued development and/or refinement of simplified models for reliable structure predictions. In
371 this chapter, we present an adaptation of simulated annealing that allows such issues to be explored computationally. The main question we set out to answer here is precisely how do simplified protein models fold?. If successful, it should then be possible to make model comparisons and experimental evaluations to determine the scope and limitations of specific reduced-atom representations. Although simulated annealing has been applied in a number of protein folding studies, the majority have focused on structure predictions using fairly standard protocols.[3,713] Very few have attempted to characterize the folding process both structurally and thermodynamically from start to finish. This is due, in part, to difficulties that arise in determining the physical properties of molecular systems given the limited nature of Monte Carlo or molecular dynamics sampling.[14] Convergence in thermodynamic averages can be quite slow in simulations of this type greatly effecting our ability to reliably calculate key parameters, such as specific heat, for the folding process. This is true of many applications of simulated annealing in chemistry and physics. In the remaining sections of this chapter, we address this problem through a variation of the optimal histogram methodology originally reported by Ferrenberg and Swendsen.[15] The approach developed applies variance optimization techniques and energy histograms to determine a "best estimate" of the thermodynamic functions for the folding process. In addition, we outline the development of several order parameters that allow the structural properties of the protein model to be followed as well. The functions are based on those described by Parisi[16,17] and Edwards and Anderson[18] for use in spin glass studies. These methods are then applied to study the physical behavior of a protein model (similar in structure to that given in figure 1) using two reduced-atom force fields representations. In particular, the folding process is explored for potential phase transitions and structural changes that occur as the system freezes, as well as force field dependencies on the predicted pathways and low energy structures. Multiple annealing simulations will also be analyzed using order parameters to further investigate the thermodynamic reversibility of the process, allowing any glassy behavior in the system to be identified if present.[19] Finally, the chapter closes with a discussion of model predictions and comparisons to experimental protein folding results. 2. PROTEIN MODEL Although a wide variety of models have been described and applied to study protein folding, most, if not all rely on fairly drastic structural simplifications to bring the numerical size of the problem in reach.[1-3] Typically, explicit atoms and bonds are replaced by virtual representations as shown in figure 2. In the approach followed here, each amino acid residue of the primary sequence is reduced to a single bead, centered on the C, carbon.[7,20,21] These beads are then connected by virtual bonds that replace the rigid peptide linkage, producing a simple string of beads model of the protein structure. Explicit interactions are also replaced by average interactions in this scheme, further reducing the unique bead types in the sequence to a small collection of "averaged" amino acid types. At this limited resolution, three are essential: hydrophobic (PHB), hydrophilic (PHL), and neutral (NEU). 
These main classes of amino acids are, of course, found in all proteins, but not at random.[1-4] Structurally, hydrophobic groups are located in the core
372 of the protein while hydrophilic groups are located on the periphery. Compositionally, the distribution is even more exact. Common structural motifs are often built from sequences that contain periodic patterns of amino acids, pre-arranged to fold uniquely to specific secondary, super-secondary, and tertiary structures. In this work, an example sequence is implemented that mimics those found in a-helical hairpin structures.[22] Although no direct correspondence to a real protein sequence exists, the model most closely resembles a de novo designed protein reported by Regan and Degrado.[23] The protein model is constructed in figure 3. ca
H / N \
H N C II
H N
c C 0H / 0 0,TT H- N H 1-12 —I H— °
0
Figure 2. Transformation of the all atom representation of a protein fragment to the residue-based, or bead model employed to simulate protein folding.
Energetically, the system is modeled in a similar fashion to its all-atom counterparts. Unlike standard force field calculations, however, simplified versions must capture the bulk effect of solvation and other dominating physical forces through averaged interaction potentials and parameters to fold proteins; albeit on a limited basis.[1-3] The energy expression implemented here was first developed by Rey and Skolnick to examine the folding pathways of a-helical hairpin structures. [8] The function applied,
Etot = 3 8
4c [( ) rii
8 ( a2 6
rii
1+
ie helix
) r i,i+4
(
) 61 ri,i-F4
373
Figure 3. Stereoview of the 22-residue a-helical hairpin protein mode in an ideal conformation. Dashed lines represent hydrophobic residues (PHB), hydrophilic and neutral residues are solid lines.
+
(1
2
Wi,j,k,l)
27r
2
27
6
u
()
A=0
includes contributions from both local and non-local interaction potentials that have been appropriately parameterized to simulate the model described in figure 3. Short range forces (local) are captured as C c, bond stretching, angle bending, and torsional energies that estimate the average properties of the folded state through geometric preferences.[1] Long range forces (non-local), such as solvation, electrostatics, and dispersion, are accounted for by net attractive or repulsive Lennard-Jones interactions.[1] The model also includes a Lennard-Jones term to mimic helical cooperativity, or hydrogen bond formation, between all i to i+4 residues in the helix. Although the terms of the energy expression and parameters are described in great detail in ref. 8, several parameters corresponding to the torsional polynomial function are missing from that report. To complete the force field, we have fit these to the data provided in ref. 8 and report them, as well as all other parameters employed in table 1. Although previous calculations have determined the global minimum conformation of this model to be the a-helical hairpin structure, some anomalous behavior was noted.[7] For comparison, a second force field has been implemented here as well to study the same, primitive C c, bead model and example sequence given in figure 3420] The energy expression captures the residue-based interactions in a similar fashion to Eq. 1, but is less complex, sharing common origins with several popular all-atom force fields. The following energy function Et,
=
E
K
Vn A 2 [1 + cos ( nw i,j,k,/ ')()]
K
i„k
i,j,k,
2,1
B r •z,
(2)
has been parameterized to model the folded states of both 13-barrel structures[21] and
374
Table 1 Force field parameters derived from the work of Rey and Skolnick for use with Eq. 1.a
bonds
X - Xb
angles X-X-X nonbonds PHB - PHB PHL - X intrahelix (i to i+4)
K1/2 600.0
10
1.0
Kt9 /2 Oo 60.0 109.45 f 24.0 9.0 12.0
Q1 0.87 0.87 2.0
o-2 0.87 0.0 2.0
torsion Ucis Up U1 U2 U3 U5 U4 gauche- 163.8 -636.3 -12.4 215.1 128.5 -5640.3 -91.3 gauche + 203.4 -796.2 -13.2 119.1 195.3 -6670.1 -625.6 trans 172.9 -691.4 -14.2 301.6 180.6 -6068.5 -406.6
U6
6039.9 7817.7 6698.7
'See ref. 8 for a complete description of all force field parameters of Eq. 1 and energy units. b X stands for any bead type. cTorsional potential is sequence specific (see ref. 8).
a-helical-type structures.[20] This force field is of particular interest since the low energy regions of conformational space have already been explored for the 22-residue a-helical hairpin of interest here. [20] In previous calculations, we not only verified the global minimum to be in the desired conformation given in figure 3, but collected a great deal of information regarding the local minima that populate the energy surface. The force field parameters are listed in table 2 and are further discussed in greater detail below and in ref. 20. Although both models predict similar ground states, the choices of potential function and parameter development are quite different. The first function above uses a complicated polynomial for torsional energies and forces that is parameterized to bias angles towards the desired C c, secondary structure. The second applies a simple cosine function phase-shifted in the helical regions to produce similar effects. Although the phase shift employed corresponds to a minimum of 60 degrees (gauche + ), the actual C c, torsion is effectively modeled by this potential to be 57 degrees (when all force field interactions are included);[20] in-line with the values derived from a-helices in proteins. The phase shift, however, is not applied to model the C G, torsions of the bend region, allowing each state, gauche+ , gauche-, and trans to be populated with equal probability. The first torsional function includes a bias in this region as well, further weighting the conformational states towards the a-helical hairpin structure. [8] The slight compression noted in the torsions above is primarily due to the attractive hydrophobic forces that favor compaction. In both models, this force is balanced by the torsional potential that favors extended conformations. Significant differences exist,
Table 2
Force field parameters of Eq. 2.(a)

bonds      X-X(b)          K_l/2 = 100.0     l_0 = 1.0

angles                     K_θ/2     θ_0
           X-X-X           20.0       91.0
           NEU-NEU-X       10.0      104.0

nonbonds(c)                A          B
           PHB-PHB         4.0000     4.0000
           PHB-PHL         2.6666    -2.6666
           PHL-PHL         2.6666    -2.6666
           PHB-NEU         4.0000     0.0000
           PHL-NEU         4.0000     0.0000
           NEU-NEU         4.0000     0.0000

torsions                   V_n/2      γ        n
           X-PHB-PHB-X     1.0        0        3
                           1.2        240      1
           X-PHL-PHL-X     1.0        0        3
                           1.2        240      1
           X-PHB-PHL-X     1.0        0        3
                           1.2        240      1
           X-NEU-PHB-X     0.2        0        3
           X-NEU-PHL-X     0.2        0        3
           X-NEU-NEU-X     0.2        0        3

(a) See ref. 20 for a complete description of the force field. Energy in kcal/mol. (b) X stands for any bead type. (c) Corresponds to a σ value of 1.0.
however, in the handling of this energetic partitioning that ultimately defines the behavior of the model. When these force fields are scaled relative to the torsional component, the hydrophobic potential of the first force field above is considerably stronger than that of the second. In addition, non-bonded interactions are further strengthened in the first by a helical cooperativity term that is applied to all residues in the i to i+4 position. The balance is more equally weighted in the second force field. Although no special hydrogen-bond terms are added to further stabilize the final helical motifs, a 1,4 scale factor of 1/2 is applied to all non-bonds to promote turn formation.[20] In effect, the scale compensates for the enhanced torsional barriers that result from the addition of the 1,4 non-bonded term to the energy during bond rotation through the cis configuration. The scaling actually reduces the i to i+3 interactions, but has no effect on the classic i to i+4 interaction in α-helical structures.
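To make the functional form of Eq. 2 concrete, the following minimal sketch evaluates the individual bead-model energy terms. It is an illustration only: the helper names are invented, dihedral angles are assumed to be supplied directly rather than computed from coordinates, and the parameter values quoted from table 2 are used simply as examples, not as a reimplementation of the authors' code.

```python
import numpy as np

# Illustrative SFF/m-style parameters following table 2 (the tabulated values are
# already K/2 and V_n/2; energies in kcal/mol, distances in reduced units).
K_L, L0 = 100.0, 1.0
K_TH, TH0 = 20.0, np.deg2rad(91.0)
LJ_AB = {("PHB", "PHB"): (4.0, 4.0),          # (A, B) of A/r^12 - B/r^6
         ("PHB", "PHL"): (2.6666, -2.6666),
         ("PHL", "PHL"): (2.6666, -2.6666)}
TORSION_TERMS = [(1.0, 3, 0.0), (1.2, 1, 240.0)]   # (V_n/2, n, gamma in degrees)

def bond_energy(ri, rj):
    l = np.linalg.norm(rj - ri)
    return K_L * (l - L0) ** 2

def angle_energy(ri, rj, rk):
    a, b = ri - rj, rk - rj
    cos_t = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return K_TH * (np.arccos(np.clip(cos_t, -1.0, 1.0)) - TH0) ** 2

def torsion_energy(omega, terms=TORSION_TERMS):
    # sum of (V_n/2)[1 + cos(n*omega - gamma)] contributions for one dihedral
    return sum(v * (1.0 + np.cos(n * omega - np.deg2rad(g))) for v, n, g in terms)

def nonbond_energy(ri, rj, ti, tj):
    A, B = LJ_AB.get((ti, tj), LJ_AB.get((tj, ti), (4.0, 0.0)))
    r = np.linalg.norm(rj - ri)
    return A / r**12 - B / r**6
```

A full E_tot would sum these contributions over all bonds, angles, torsions, and non-adjacent bead pairs of the chain, with the 1,4 pairs scaled by 1/2 as described above.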
Although these differences are significant, it is important to emphasize that both force fields produce very similar ground state structures (similar to that given in figure 3). What is not known, however, is whether these functions produce similar folding pathways and thermodynamic properties. Furthermore, the anomalous behavior noted in past studies may suggest that the folding process, at least for one of the protein models, may not be thermodynamically reversible.[7,21] This may also provide evidence for classic spin glass-type behavior in these systems,[15-19] which, if true, may have implications for the kinetic mechanisms of protein folding. The determination and exploration of these properties is the main focus of the remainder of this chapter. (To simplify terminology in subsequent sections, the force field derived from the work of Rey and Skolnick will be referred to as RS, while the second will be given a more generic acronym, SFF/m, for simplified force field/minimal, due to the "minimal-model" approach employed in development.[20,21])

3. COMPUTATIONAL ALGORITHMS

The above potential functions have been implemented as part of a simulated annealing algorithm to study the thermodynamics and ground state structure of the α-helical hairpin protein model given in figure 3. The algorithm applies Metropolis sampling to generate a Boltzmann distribution of states at a number of gradually decreasing temperatures.[24] In this scheme, acceptances are generated with probability
P(\Delta E) = \begin{cases} e^{-\beta \Delta E}, & \Delta E > 0 \\ 1, & \Delta E \le 0 \end{cases}
where ΔE is the change in configurational energy produced by the trial move and β is the inverse temperature factor (1/kT). To determine the thermodynamic behavior of the protein model, between five and ten random initial configurations were generated. Each random configuration was then run at infinite temperature for 5,000 acceptances and subsequently annealed from an initial temperature of 10^5 K down to a temperature of 5 K. Once equilibrium is reached at each temperature (see discussion below), energy histograms and configurational data were collected for a minimum of 5,000 acceptances (roughly 6,000 to 30,000 trials) and stored for further processing. Convergence was checked by doubling the number of acceptances and recalculating average system properties. To maximize the efficiency of this method, a unique move scheme has been developed for this system that randomly chooses between eight move types. These include random changes in bond length, bond angle, and torsion angle, arbitrary bending at a random bead, displacement of a random bead, rotation of a random section within the structure, complementary bending at two residues (kinking), and compression or elongation of a random section within the structure. These move types effectively expand the size of the neighborhood about any given configuration that can be reached in one move. Substantial reductions in equilibration times after temperature changes, as well as reductions in relaxation times (i.e. energy autocorrelation times), were achieved using this protocol.

3.1. Optimal histogram methodology

One of the great advantages of using simulated annealing for this type of problem is that it is based on statistical mechanics and therefore can be used to extract a great
deal of chemically relevant information about the system in addition to its global minimum conformation. In previous work,[7] we briefly described an approach, based on the optimal histogram method of Ferrenberg and Swendsen,[15] to obtaining more accurate estimates of the thermodynamic properties for molecular processes, such as protein folding. This scheme estimates the density of states from a set of histograms generated at each temperature during an annealing run. Here a more detailed outline of the methodology developed to perform these calculations is presented, including the practical considerations of implementation. To begin, consider a single annealing run consisting of a series of sweeps at inverse temperatures β_i = 1/kT_i, i = 1, ..., S. For each sweep, a histogram H_i(E_j), j = 1, ..., B, is collected in bins of width ΔE_j. The probability of finding the system in a bin of width ΔE_j about energy E_j at inverse temperature β_i is then

P_{\beta_i}(E_j) = \frac{\langle H_i(E_j) \rangle}{N_i} = Z(\beta_i)^{-1}\, e^{-\beta_i E_j}\, \Omega(E_j)\, \Delta E_j    (3)
where N_i is the total number of samples in the i-th temperature sweep and ⟨H_i(E_j)⟩ is the expected value of the histogram recorded on that sweep. Ω(E_j) is the density of states and Z(β_i) is the partition function

Z(\beta_i) = \sum_{j=1}^{B} e^{-\beta_i E_j}\, \Omega(E_j)\, \Delta E_j    (4)

with the normalization constraint

\sum_{j=1}^{B} \Omega(E_j)\, \Delta E_j = 1.    (5)
Then for a discrete system with 𝒩 states, the full density of states is given by Ω̃(E_j) = 𝒩 Ω(E_j). For a system of N identical particles confined to a volume V, Ω̃(E) = V^N Ω(E). Using Eq. (3), each histogram (from a sweep at β_i) results in an estimate of the density of states

\Omega_i(E_j)\, \Delta E_j = Z(\beta_i)\, e^{\beta_i E_j}\, \frac{H_i(E_j)}{N_i}.    (6)

Since the bin hits in H_i are a Poisson process with autocorrelation time τ_i, the variance in H_i(E_j) is

\delta^2 H_i(E_j) = g_i \langle H_i(E_j) \rangle    (7)

where[25]

g_i = 1 + 2\tau_i.    (8)

Then the variance σ_i² in the estimate Ω_i(E_j)ΔE_j can be computed using the fact that the expression

Z(\beta_i)\, e^{\beta_i E_j}\, \frac{\langle H_i(E_j) \rangle}{N_i} = \Omega(E_j)\, \Delta E_j    (9)

is independent of i, resulting in the following variance estimate:

\delta^2\!\left(\Omega_i(E_j)\,\Delta E_j\right) = \frac{g_i}{N_i}\, Z(\beta_i)\, e^{\beta_i E_j}\, \Omega(E_j)\, \Delta E_j.    (10)
Now we can combine the several measurements of Ω(E_j) to form the variance-weighted least-squares estimate

\Omega(E_j)\, \Delta E_j = \frac{\sum_{i=1}^{S} g_i^{-1}\, H_i(E_j)}{\sum_{i=1}^{S} N_i\, g_i^{-1}\, e^{-\beta_i E_j}\, Z(\beta_i)^{-1}}.    (11)
Defining

W_j \equiv \Omega(E_j)\, \Delta E_j,    (12)

S_j \equiv \sum_{i=1}^{S} g_i^{-1}\, H_i(E_j),    (13)

we get the histogram equations

\sum_{j=1}^{B} W_j = 1,    (14)

Z_i = \sum_{j=1}^{B} e^{-\beta_i E_j}\, W_j,    (15)

W_j = S_j \left( \sum_{i=1}^{S} N_i\, g_i^{-1}\, e^{-\beta_i E_j}\, Z_i^{-1} \right)^{-1}.    (16)
Given a set of histograms H_i(E_j) from multiple temperature sweeps, Eqs. (14-16) can be solved for the W_j self-consistently. We initialize the W_j and subsequently iterate these equations sequentially until the total change in W is less than a predetermined limit. Once solved, the static thermodynamic properties of the system can be determined from the W_j. The partition function, internal energy, specific heat, and entropy can be estimated by

Z(\beta) = \sum_j e^{-\beta E_j}\, W_j,    (17)

U(\beta) = \frac{1}{Z(\beta)} \sum_j E_j\, e^{-\beta E_j}\, W_j,    (18)

C_V(\beta) = \frac{\beta^2}{Z(\beta)} \sum_j \left(E_j - U(\beta)\right)^2 e^{-\beta E_j}\, W_j,    (19)

S(\beta) = \beta U(\beta) + \ln Z(\beta) + N \ln V,    (20)
where the N ln V term in the entropy arises from the normalization condition for N particles confined to a volume V. To satisfactorily implement this methodology, several practical considerations must also be addressed. First, the bin widths ΔE_j must be chosen to yield the desired resolution in Z(β) while maintaining computational feasibility. In a Gaussian approximation, P_β(E_j) has a variance of C_V(β)/β² and a mean value of U(β). Therefore, in order for a histogram collected at inverse temperature β to resolve the peak in P_β(E_j), its bins can not be wider than ΔE_j ≈ √(C_V(β))/β in the region E_j ≈ U(β). To accomplish this, we initially sample the data at a very fine resolution, with bin widths of approximately 0.3 kcal. This
requires 20,000 bins in the histogram. To reduce the computational load in solving the histogram equations, we resample the histogram using non-uniform bin widths. We retain the lowest 1,000 energy bins of the original histograms and then condense the remaining 1,001-20,000 bins as a function of j. This allows us to resolve the narrow peaks that occur at small E_j as β becomes large (at small T), while avoiding the computational cost of using very small bins in the region of the histogram that is only populated at small β. Second, unbounded energy terms must be adequately handled at high temperatures. This is accomplished by imposing a maximum energy cutoff on each term in the potential energy expression. Although this distorts the density of states at high energies, calculations at moderate temperatures are not affected if the cutoff is chosen properly. Thus, the maximum energy is determined by the upper temperature limit of interest. In the models considered here, it is found that limiting each term to a maximum of 200 kcal affected only those configurations that are important in Ω(E) for temperatures of approximately 2,000 K and greater, well beyond the range of temperatures of interest here. Third, the equations above are only meaningful if the system is at equilibrium during the annealing run. To address this issue we developed a heuristic criterion to determine the equilibration time. Prior to collecting the histogram data, several preliminary annealing runs were made. At each temperature step the mean energy in a moving window was computed. We used a predetermined temperature schedule given by T_{i+1} = γT_i with γ = 0.8. This choice of cooling schedule results in smaller steps in T at low temperatures to compensate for the expected increase in equilibration times. It was observed that after 5,000 to 8,000 acceptances, the change in mean energy was approximately zero and much smaller than the variance in the energy. This was nearly independent of the choice of force field and temperature. Thus, we established the criterion of waiting a minimum of 12,000 acceptances. In principle, it can be argued that since the equilibration process is exponential in decay, the system "never" reaches equilibrium. This, coupled with the fact that the decay time is not known, makes the development of an equilibration criterion problematic. Note that by basing the equilibration time estimate on acceptances rather than the number of trials, it automatically tends to increase as T decreases. While this is a somewhat ad hoc method of estimating equilibration, it is reasonably conservative. And fourth, the autocorrelation time τ_i in Eq. 8 must be determined to solve the equations given above. For each inverse temperature β_i, τ_i is the characteristic decay time of the energy autocorrelation function
a(t) = \frac{\langle E_{t_0} E_{t_0+t} \rangle - \langle E_{t_0} \rangle^2}{\langle E_{t_0} E_{t_0} \rangle - \langle E_{t_0} \rangle^2} = C\, e^{-t/\tau_i}    (21)
for the system once equilibrium has been reached.[14] To understand the role of the autocorrelation time and its contribution to the variance of H_i(E_j), consider the simpler problem of estimating the variance in an estimate of the average energy, δ²Ē, at fixed temperature T from a sequence of n energy samples (E_1, E_2, ..., E_n). Now we can compute an estimate of E from the sample as Ē = (1/n) Σ_{j=1}^{n} E_j. We assume that the sample is large enough so that Ē is a good estimate of the statistically expected value of E. The variance is then given by

\delta^2 \bar{E} = \frac{\hat{\sigma}^2}{n}\,(1 + 2\tau)

where τ is the autocorrelation time given by Eq. (21) and σ̂² is the sample variance

\hat{\sigma}^2 = \frac{1}{n} \sum_{j=1}^{n} \left(E_j - \bar{E}\right)^2.
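In practice, τ and the statistical inefficiency g = 1 + 2τ can be estimated directly from the stored energy sequence of a sweep. The sketch below fits the logarithm of the normalized autocorrelation function to a straight line, as in Eq. (22) further on; the window length and the fallback value for essentially uncorrelated data are arbitrary choices made for this illustration.

```python
import numpy as np

def autocorr_time(energies, t_max=200):
    """Estimate the exponential autocorrelation time tau by fitting
    ln a(t) = -t/tau to the normalized autocorrelation of Eq. (21)."""
    e = np.asarray(energies, dtype=float) - np.mean(energies)
    var = np.mean(e * e)
    ts, log_a = [], []
    for t in range(1, min(t_max, len(e) // 2)):
        a_t = np.mean(e[:-t] * e[t:]) / var      # a(t) of Eq. (21)
        if a_t <= 0.0:                           # stop once noise dominates
            break
        ts.append(t)
        log_a.append(np.log(a_t))
    if len(ts) < 2:                              # essentially uncorrelated samples
        return 0.0
    slope = np.polyfit(ts, log_a, 1)[0]          # slope ~ -1/tau
    return -1.0 / slope if slope < 0.0 else 0.0

def statistical_inefficiency(energies):
    """g = 1 + 2*tau of Eq. (8), used to inflate the histogram variances."""
    return 1.0 + 2.0 * autocorr_time(energies)
```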
Clearly, neglect of the statistical inefficiency in our equations would result in an underestimate of the actual sample variance, especially near phase transitions where temporal correlations become quite long. The inclusion of this contribution is therefore a necessity if the estimate of δ²H_i(E_j) is to be truly meaningful in determining the density of states and resulting thermodynamic functions from Eqs. 14-16. The autocorrelation time in question here is not directly related to a physically observable quantity. τ_i depends not only on the potential function E_tot but also on the simulated annealing move scheme used. The move scheme defines a neighborhood in configuration space that is reachable in a single move. The "size" of the neighborhood and the shape of the energy surface over the neighborhood determine the statistical properties of the energy sequence observed on each annealing sweep. Consider a simple move scheme based on small random moves of a single bead at a time. Assuming E_tot is reasonably smooth, we would expect the energy of successive configurations to be correlated, resulting in a large τ. With a more complex move scheme, the neighborhood in configuration space reachable in a single step is increased. This will generally result in less correlation in the energy sequence generated on the annealing sweep, and hence a smaller value of τ. In our application, τ is therefore an estimate of the statistical inefficiency in the measurements, a quantity that has routine use in numerical analysis. Later it will be shown that τ has a dramatic effect on the results found using Eqs. 14-16, and hence the choice of estimators is an important consideration. To estimate τ_i, two different techniques have been employed. A crude estimator can be constructed from the acceptance ratio,

\tau_{acc} = \frac{\text{number of trial configurations}}{\text{number of accepted configurations}},

which has roughly the correct scaling properties. This estimator was tested on some simple trial systems where Ω(E) was known (not related to the protein folding system) and was found to produce the correct results for Ω(E). Although τ_acc does not correctly estimate the overall scale of the correlation time, inspection of Eqs. 14-16 reveals that Ω(E) is insensitive to the scale of this parameter for τ_i ≳ 1, suggesting that this simple heuristic may be applicable in our calculations. However, it should be noted that this ratio is at best an indirect estimate of the correlation time in the system energy over the annealing run. A much more robust method of estimating τ_i is provided by Eq. 21. One can estimate the autocorrelation function a(t) from the observed sequence of energies and fit the results to an exponential using the derived expression

\ln a(t) = -\frac{t}{\tau_i}    (22)

to obtain an estimate of τ_i. As a further check of self-consistency we have also computed

\ln\!\left[-\ln a(t)\right] = p \ln t - p \ln \tau_i,    (23)

which assumes a(t) is a stretched exponential.[26]
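The self-consistent solution of Eqs. (14)-(16) and the property estimates of Eqs. (17)-(19) can be organized as a short fixed-point iteration. The sketch below is an illustration under stated assumptions: the histograms, sample counts, inefficiencies g_i = 1 + 2τ_i, and bin energies are given; energies are shifted/reduced so that the Boltzmann factors do not overflow; and the initialization and tolerance are arbitrary choices rather than those used in the chapter.

```python
import numpy as np

def solve_histogram_equations(H, N, g, beta, E, tol=1e-8, max_iter=10000):
    """Fixed-point iteration of Eqs. (14)-(16) for W_j = Omega(E_j) dE_j.
    H: (S, B) histogram counts, N: (S,) samples per sweep, g: (S,) 1 + 2*tau_i,
    beta: (S,) inverse temperatures, E: (B,) bin energies (reduced units assumed)."""
    H, N, g = np.asarray(H, float), np.asarray(N, float), np.asarray(g, float)
    beta, E = np.asarray(beta, float), np.asarray(E, float)
    boltz = np.exp(-np.outer(beta, E))          # e^{-beta_i E_j}, shape (S, B)
    S_j = (H / g[:, None]).sum(axis=0)          # Eq. (13)
    W = np.full(E.shape, 1.0 / len(E))          # arbitrary normalized start
    for _ in range(max_iter):
        Z = boltz @ W                           # Eq. (15)
        denom = ((N / (g * Z))[:, None] * boltz).sum(axis=0)
        W_new = S_j / denom                     # Eq. (16)
        W_new /= W_new.sum()                    # Eq. (14)
        if np.abs(W_new - W).sum() < tol:
            break
        W = W_new
    return W_new

def thermodynamics(W, E, beta):
    """Eqs. (17)-(19): Z, U and C_V (in units where k = 1) at one inverse temperature."""
    w = np.exp(-beta * E) * W
    Z = w.sum()
    U = (E * w).sum() / Z
    Cv = beta ** 2 * ((E - U) ** 2 * w).sum() / Z
    return Z, U, Cv
```

In practice a log-domain (log-sum-exp) formulation of the same equations is usually preferred to avoid numerical under- and overflow at low temperatures.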
3.2. Order parameters

The methodology thus far outlined is general, but relies on the basic assumption that configuration space is sampled ergodically over the temperatures of interest. Although often accepted without proof in molecular dynamics and Monte Carlo simulations, this may not necessarily be the case in systems that contain significant frustration, especially when glassy behavior may be suspected below the phase transition.[19,21] To explore this possibility, a structural analysis of the configurations visited above and below the phase transition for several independent annealing runs must be performed. While measures like the RMS dihedral deviations, the number of phobic contacts, and the radius of gyration have been used for this purpose,[8,21] order parameters or overlap functions derived from spin-glass theory provide a more powerful method of determining the system behavior in configuration space.[16-19] What follows is an adaptation of concepts (and terminology) taken from spin glass studies for use in the analysis of the annealing runs. The basic approach adopted is derived from lattice model studies of protein folding. For bead sequences confined to a lattice, the following order parameter can be defined[27,28]

q_{\alpha\beta} = \frac{1}{N} \sum_{i=1}^{N} \delta\!\left(r_i^{\alpha} - r_i^{\beta}\right),    (24)

where δ(x) is the usual Kronecker delta, N is the total number of beads, and r_i^α and r_i^β denote the spatial position of the i-th bead in the chain for two annealing runs, α and β. q_αβ is the number of beads in each chain that occupy the same lattice position, normalized by the total number of beads. Therefore q_αβ = 1 if α and β have the same conformation, and q_αβ < 1 if α and β have different conformations. The important feature of q_αβ is that when a chain possesses one thermodynamically stable fold (or global minimum conformation), q_αβ = 1 for all α and β, while for a homopolymer-like collapsed globule, q_αβ ≈ 1/N (indicating no stable folds or conformers). A third possibility can exist as well, which typifies the spin-glass phase: the chain has numerous stable folds (or many low energy conformers). In this case, there exists a set of thermodynamically definite folds with coordinates r_i^γ, and q_γγ' = 1 when γ = γ' while q_γγ' ≈ 1/N when γ ≠ γ'. Eq. (24) is applicable only to lattice models where the beads have discrete positions. To extend this methodology to the models considered in this work, continuous space must be discretized. Fortunately, this can be done using the internal coordinates of the potential function as a reference. First, bit vectors e(α) and d(α) are introduced with components of 0 or 1. To each component e_ij(α) of e(α) we assign a pair of non-adjacent phobic beads i and j. We then define the value of e_ij(α) as

e_{ij}(\alpha) = \begin{cases} 1 & \text{if } E_{nb}(r_{ij}) < E_{contact} \\ 0 & \text{otherwise} \end{cases}    (25)

where E_contact = ½ min{E_nb}. Thus, the elements of bit vector e(α) are 1 for every phobic contact and 0 otherwise. Although the definition of E_contact is somewhat arbitrary here, our results are not sensitive to it.
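As an illustration of Eq. (25), the following sketch marks a phobic pair as a contact when its nonbonded pair energy falls below E_contact = ½ min{E_nb}. The pair-energy function, the bead coordinates, and the distance grid used to locate the potential minimum are placeholders, not part of the original implementation.

```python
import numpy as np

def pair_energy_minimum(pair_energy, r_grid=np.linspace(0.8, 3.0, 500)):
    """Crude numerical minimum of the phobic pair potential over an assumed distance range."""
    return min(pair_energy(r) for r in r_grid)

def contact_vector(coords, phobic_pairs, pair_energy):
    """Eq. (25): e_ij(alpha) = 1 if E_nb(r_ij) < E_contact, else 0.
    coords: (N, 3) bead positions; phobic_pairs: non-adjacent phobic (i, j) pairs;
    pair_energy(r): nonbonded energy of a phobic pair at separation r."""
    energies = np.array([pair_energy(np.linalg.norm(coords[i] - coords[j]))
                         for i, j in phobic_pairs])
    e_contact = 0.5 * pair_energy_minimum(pair_energy)   # E_contact = (1/2) min{E_nb}
    return (energies < e_contact).astype(int)
```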
Similarly, for each torsion angle ω_i a two-bit element d_i(α) of the vector d(α) is assigned, with elements defined by

d_i(\alpha) = \begin{cases} 00 & \text{if } \omega_i \text{ is cis,} \\ 01 & \text{if } \omega_i \text{ is gauche}^-, \\ 10 & \text{if } \omega_i \text{ is trans,} \\ 11 & \text{if } \omega_i \text{ is gauche}^+. \end{cases}    (26)

The local maxima in the torsion potential are used to define the boundaries in torsion angle between gauche− and trans, and between trans and gauche+. The boundary between cis and gauche± is chosen so that the corresponding energy is the same as the local maximum between gauche± and trans. Thus, the elements of d(α) represent a discretized torsion space. Since the configurations are effectively determined by the torsion states (cis, gauche±, trans) and the phobic contacts, the vectors e(α) and d(α) effectively determine the structures and discretize the continuous spatial coordinates. Given two conformations, α and β, a distance function can then be defined using Eqs. (25) and (26):

X(\alpha, \beta) = \lVert e(\alpha) - e(\beta) \rVert + \lVert d(\alpha) - d(\beta) \rVert,    (27)
where ||·|| denotes the Hamming distance between the two bit vectors (i.e. the number of bits in the arguments that are different).[29] The distance measure X(α, β) has the property that X(α, α) = 0 and X(α, β) ≥ 1 for conformations α and β that differ in either the number of phobic contacts or the torsions. Generally, we would like to normalize X(α, β) and then define q_αβ = 1 − X(α, β)/X_max; however, it is not at all obvious which structures maximize X(α, β). Since X(α, β) as defined in Eq. (27) contains all of the information of the more usual order parameter,[13] it is more practical in this case to neglect the normalization and use the overlap function directly. To study the physical behavior of the system during an annealing run, the thermodynamic average of the overlap function X(α, β) is also introduced below. After reaching equilibrium at some temperature T, we generate a set of states A = {α} during an annealing run. Using Eq. (27) we define the average overlap and its variance as

\langle X_{A,A} \rangle_T = \frac{1}{N_A^2} \sum_{\alpha, \beta \in A} X(\alpha, \beta),    (28)

\delta^2 \langle X_{A,A} \rangle_T = \frac{1}{N_A^2} \sum_{\alpha, \beta \in A} \left( X(\alpha, \beta) - \langle X_{A,A} \rangle_T \right)^2,    (29)

where N_A is the number of elements in the set A and the summation is taken over all possible pairs of configurations. ⟨X_{A,A}⟩_T is analogous to the Edwards/Anderson order parameter.[18] A phase transition in the system would be indicated by ⟨X_{A,A}⟩_T → 0 as T → T_m, with both ⟨X_{A,A}⟩_T and δ²⟨X_{A,A}⟩_T small below T_m. On the other hand, if the system remains in a homopolymer-like globule then ⟨X_{A,A}⟩_T will be greater than zero for all finite T.
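A minimal sketch of the discretized overlap of Eqs. (26)-(29) follows, assuming each stored configuration has already been reduced to a pair (e, d): a 0/1 contact vector over the phobic bead pairs and an integer code 0-3 (cis, gauche−, trans, gauche+) per torsion. All names are illustrative; the same routine, called with two different run sets, gives the inter-run average introduced next.

```python
import numpy as np

def hamming(u, v):
    """Number of differing bits between two equal-length 0/1 vectors."""
    return int(np.sum(np.asarray(u) != np.asarray(v)))

def torsion_bits(codes):
    """Expand torsion-state codes 0..3 into the two-bit-per-torsion vector d of Eq. (26)."""
    codes = np.asarray(codes, dtype=int)
    return np.column_stack(((codes >> 1) & 1, codes & 1)).ravel()

def distance_X(conf_a, conf_b):
    """Eq. (27): Hamming distance over phobic contacts plus torsion bits."""
    (e_a, d_a), (e_b, d_b) = conf_a, conf_b
    return hamming(e_a, e_b) + hamming(torsion_bits(d_a), torsion_bits(d_b))

def overlap_average(confs_a, confs_b=None):
    """Mean and variance of X over all pairs: intra-run (Edwards/Anderson-like) with one
    argument, inter-run (Parisi-like) with two, following the 1/(N_A N_B) normalization."""
    confs_b = confs_a if confs_b is None else confs_b
    xs = np.array([distance_X(a, b) for a in confs_a for b in confs_b], dtype=float)
    return xs.mean(), xs.var()
```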
Similarly, we can compute the overlap and its variance between separate annealing runs with

\langle X_{A,B} \rangle_T = \frac{1}{N_A N_B} \sum_{\alpha \in A,\, \beta \in B} X(\alpha, \beta),    (30)

\delta^2 \langle X_{A,B} \rangle_T = \frac{1}{N_A N_B} \sum_{\alpha \in A,\, \beta \in B} \left( X(\alpha, \beta) - \langle X_{A,B} \rangle_T \right)^2.    (31)

⟨X_{A,B}⟩_T is analogous to the Parisi overlap function.[16,17] This form of the overlap function reveals the nature of the low energy frozen states. Assuming that there is a transition to a frozen state at T_m, two possibilities exist. If there is a pure state (i.e. a single frozen state), then ⟨X_{A,B}⟩_T ≈ ⟨X_{A,A}⟩_T → 0 as T → T_m from above. This implies that the various annealing runs are indistinguishable, and the system is ergodic. The other possibility is that for some (possibly large) set of annealing runs we have

\langle X_{A,B} \rangle_T \begin{cases} = 0 & A = B \\ > 0 & A \neq B \end{cases} \quad \text{for all } A, B \text{ in the set.}
This would indicate a number of frozen states, implying that the system is becoming trapped in local minima. The interconversion times between states would also be expected to be large and to grow exponentially with system size.[30-32] In simulated annealing runs, we would expect ⟨X_{A,B}⟩_T to become fixed for T < T_m, and large with respect to δ²⟨X_{A,B}⟩_T, for this case.

4. RESULTS

The optimal histogram methodology and overlap functions have been applied to examine the folding process for potential phase transitions and conformational changes that occur in the protein system during thermal cooling. The specific heat for 5 independent simulated annealing runs using the SFF/m force field is plotted in figure 4 as a function of T. A single phase transition in the range of 250 K to 400 K is apparent in each run. The Edwards/Anderson-like overlap function, ⟨X_{A,A}⟩_T, and its variance are plotted for these runs in figure 5, with both reaching zero over this same temperature range as well. These results provide fairly strong evidence that the system is freezing from the random coil to one or more low energy conformations as the system cools below T_m. Previous calculations using the RS force field also support the existence of a phase transition during the folding process for this protein model.[7,8] Although not reported here, we did analyze several annealing runs in detail with the methodology described above and noted nearly identical behavior in C_V and ⟨X_{A,A}⟩_T, except that the temperature scale is shifted due to differences in the force field energy scaling. The results also provide a good indication that the optimal histogram methodology is functioning properly. The general agreement in the independent estimates of T_m from the C_V and ⟨X_{A,A}⟩_T plots suggests that these independent measurements are capturing the same "event" in the folding process. (Disparities in the T_m values predicted from C_V are discussed further below.) The average number of hydrophobic contacts present in the protein model during the annealing runs is plotted in figure 6 as a function of T and undergoes nearly a step
Figure 4. C_V versus T for 5 independent simulated annealing runs using the SFF/m force field.
change at approximately 250 K. The variances in the dihedral angle deviations are plotted in figure 6 as well and reach a minimum at the same temperature. These results are also consistent with the existence of a phase transition. The steepness of the change in the number of hydrophobic contacts at T_m, as compared to the dihedral deviations, suggests the phase transition is the result of a freezing of the hydrophobic contacts, or hydrophobic collapse. This interpretation is not unexpected, since the bond, angle, and torsional potentials are predominantly a short range effect and hence can not support a phase transition by themselves. Although a second folding transition at a temperature T_f < T_m could also be expected in this type of system,[33] there is no evidence of this here. The absence may be due in part to the small size of the system under investigation and possibly to the small size of the effects at this transition. Again we find the behavior of the average number of hydrophobic contacts and the dihedral angle deviations for the RS model nearly identical to the SFF/m model, indicating that the thermodynamic behavior of these models is essentially the same. Figure 4 also shows that the specific heat, and by implication the density of states, for each run are not the same. Differences in T_m between runs are also suggested by the plots of the average structural properties (figures 5 and 6). To explore this further, we have plotted the autocorrelation times in the measurements for the same 5 runs. As can be seen in figure 7, the estimates of τ all have peaks at temperatures somewhat below the T_m observed in the corresponding C_V plot. This correlation is not surprising since the specific heat is derived from Ω(E), which has a direct dependence on τ in the optimal histogram methodology. It would appear that τ is therefore determining much of the thermodynamics of the phase transition. To help elucidate the effect of the τ estimator, the calculations were redone using the heuristic-based τ_acc. The correlation times determined in this manner are plotted in figure 8 and show a linear increase
Figure 5. The overlap function ⟨X_{A,A}⟩ (left graph) and the variance (right graph) averaged over 5 independent runs.
as the system cools, reflecting the decreasing probability of accepting configurations as T is lowered. τ_acc therefore overestimates the correlation times below the phase transition and, furthermore, is apparently insensitive to the local structural dependencies of the correlation time at the phase transition. The use of τ_acc in the density of states calculation also leads to a substantial shift in the position of T_m, as shown for a single annealing run in figure 9. Noting that τ_acc is essentially the same for all runs, the disparities in T_m naturally disappear as well. By comparing figures 7 and 8 it is clear that τ_acc fails to capture differences in the temporal correlations near the phase transition. Although τ_acc may provide a crude estimate of the correlation time, the physical significance of the derived properties (i.e. C_V, T_m, etc.) is questionable, and they are simply not comparable to those derived using τ from the autocorrelation function. This analytical approach, while not trivial numerically,[34] is certainly more robust. This result clearly demonstrates the need to carefully estimate τ when using the optimal histogram methodology. Although the disparities in T_m are substantially diminished when the temporal correlations are ignored, this is not a satisfactory approach to addressing the problems associated with multiple transition temperatures in the calculations; nor does it provide an explanation of the behavior. To gain insight into this result, a more detailed analysis must be performed on the physical properties of the system between annealing runs. Figure 10 is a graph of the Parisi-like overlap function, ⟨X_{A,B}⟩_T, and its variance for all possible pairs of five independent runs using the SFF/m force field. Figure 11 shows the same function for seven independent runs using the RS force field. The temperature scale for the latter was arrived at by first scaling the total energy by a constant, chosen so that the hydrophobic potential has a minimum at -1.0 kcal, consistent with the SFF/m force field. The behavior of ⟨X_{A,B}⟩_T and δ²⟨X_{A,B}⟩_T at high temperatures is consistent with a random coil state. However, at the transition temperature of approximately 250 K the individual ⟨X_{A,B}⟩_T diverge from each other and remain non-zero while δ²⟨X_{A,B}⟩_T ap-
Figure 6. Average number of hydrophobic contacts (left graph) and torsional variance (right graph) as a function of temperature during several annealing runs.
proaches zero. This is clear evidence for the existence of multiple frozen or folded states, indicating possible spin glass behavior below T_m.[16-19] This implies that the system is trapped in a local energy well in configuration space, known as a "pure state" in spin-glass theory. The sampling of the local minima is therefore not ergodic, indicating that the system may not reach equilibrium over the time scales used to generate the histograms. The differences in the correlation times in figure 7 represent the characteristics of the transition from the random coil state at high temperature to a particular energy well, but do not represent an ensemble average. One normally expects the time behavior of the energy autocorrelation function to vary as e^{-t/τ}. The multiple relaxation times imply that

a(t) \propto e^{-(t/\tau)^{p}}.    (32)

This stretched exponential behavior is known from spin-glass theory.[16] Each run produces an estimate of τ_i that depends on the energy minimum visited as the system freezes (or perhaps on two, where the double peak in τ is observed).[35] The "ensemble" average autocorrelation function is then a sum over many terms of the form e^{-t/τ_i}, which yields the stretched exponential behavior. The non-equilibrium behavior at and below the phase transition boundary violates the assumptions of Eq. (3). This means that the Ω(E) and C_V as computed from a single run do not represent the thermodynamics of the system in the limiting case of non-finite sample sizes, and as a result we observe a spread of transition temperatures in the C_V plots. However, we may interpret each Ω(E) and C_V as the density of states and heat capacity for a specific annealing process, representative of a particular local minimum conformation. To further the study of the nature of the local minima populated below the phase transition, the lowest energy structures from several annealing runs were evaluated using interactive graphics and conformational analysis. We should point out that although
Figure 7. The correlation times (τ) derived from the autocorrelation function as a function of temperature for 5 independent runs.
Figure 8. The correlation times (τ_acc) derived from the acceptance ratio approximation.
these are not actual minima, they are reasonable approximations of optimal structures since the annealing process is conducted extremely slowly at low temperature. The lowest and highest energy conformers taken from 10 SA runs using both force fields are given in figure 12. SFF/m energies range from -8.2 kcal to -7.4 kcal, with the predicted global minimum consistent with previous results of a global optimization study.[20] An analysis of the dihedral angles of all structures located further indicates that all frozen states share the basic α-helical hairpin conformation, with minor deviations mainly localized to the turn region and terminal residue positions. No, what we shall term, "mis-folded" states were identified in any annealing runs. In fact, all structures located using the SFF/m force field are easily interconverted during short molecular dynamics simulations at low temperature (200 K).[36] The majority of structures located using the RS force field, on the other hand, deviate significantly from the ideal α-helical hairpin in both the turn and helical domains at energies slightly above the suspected global minimum. Energies ranged from -11.4 to -7.6 kcal using the energy scaling described above. Although previous results from a Brownian dynamics study predict a similar global minimum, the mis-folded states were not reported.[8] (However, a propensity to mis-fold was suggested in a previous study.[7]) Although we can not directly address this difference in the results, the most likely explanation stems from the fact that the structures were not completely quenched in that study. In addition, we can not rule out the possibility that the entropy of these states is small at the finite temperatures at which the Brownian trajectory was run. An interpretation of the results from that study in light of our work is difficult, however, since possible phase transitions were not rigorously defined or characterized during the dynamics simulations.
Figure 9. C_V versus temperature computed for a single annealing run using relaxation time estimates from the autocorrelation time (solid) and the approximate acceptance ratio (dotted).
5. CONCLUDING DISCUSSION

This chapter has described an adaptation of simulated annealing for use in studying the physical properties of a simplified protein model. To track the folding process thermodynamically, we have developed and applied an optimal histogram method that provides a "best estimate" of the density of states for the system. Although this provides a route to calculate all thermodynamic quantities, here we have focused on the specific heat due to the common utility such a parameter has in determining potential phase transitions in molecular systems. The results indicate, as others have before, that the simple bead model shown in figure 3 undergoes a phase transition during thermal cooling. However, the calculated specific heats from independent runs showed peaks at multiple values. A closer examination of the system behavior at the phase transition revealed that differences in the autocorrelation times in the measurements produced much of the effect responsible for the disparities. This brings up two points. First, the result demonstrates the critical dependence the optimal histogram methodology has on the estimate of the correlation lengths or statistical inefficiencies.[14] Inaccuracies in the correlation time will undoubtedly lead to errors in the density of states calculations and therefore in the thermodynamic functions as well. Second, and possibly most intriguing, the multiple values reported suggest that the relaxation process is stretched exponential in nature. That is, a number of processes, in this case folding pathways, contribute to the overall physical behavior of the system. This is typical behavior associated with spin glass-type systems and may imply that the simple protein model studied here is also glassy below the transition temperature.[19,31,37,38] To further investigate the folding process, we have also developed and applied two over-
Figure 10. Average overlap function ⟨X_{A,B}⟩ (left) and variance (right) versus temperature for all pairs of five independent runs using the SFF/m force field.
lap functions (or order parameters) adapted from spin-glass studies to track the structural properties of the system. The first (Edwards/Anderson-like) analyzes single run characteristics, while the second (Parisi-like) analyzes multiple runs. Both functions essentially discretize the continuous space of the force field through the internal degrees of freedom defining the protein model. As the system cools, the value and variance of the overlap function provide a measure of the uniqueness of the conformer, as well as evidence to support the existence of potential phase transitions. Interestingly, the results not only supported the basic behavior of the system as predicted by the specific heat, but also provided further insight into the multiple transition temperatures noted from the calculations. The values of the Parisi-like overlap function plotted in figures 10 and 11 indicate that the system is freezing at the phase transition into multiple conformers. It would logically follow that the system must also possess multiple folding pathways that may or may not lead to the same structure, indicating that the process is not in fact thermodynamically reversible. This is certainly consistent with the existence of multiple correlation times and transition temperatures. Although it is tempting to conclude that the protein system is therefore spin glass-like, such an interpretation may be premature. It is important to point out that, unlike typical spin glass systems, the frozen states or conformers located here are easily interconverted below the transition temperature using dynamics calculations. The kinetic barriers separating minima are apparently small. For a system of this size and structural complexity (compositional and topographical), however, this is not surprising. The real "acid test" of spin glass theory, of course, is the scaling behavior of the system:[19] in other words, how the physical properties change upon increases in N, the number of beads. Insight into this effect can be found in the work of Honeycutt and Thirumalai, in which a similar force field and protein model were applied to a larger β-barrel structure.[21] The results from that study indicate that a manifold of folded and mis-folded states are populated
Figure 11. Average overlap function (right) and variance (left) versus temperature for all pairs of five independent runs using the RS force field.
below the phase transition temperature that are postulated to be meta-stable intermediates in the folding pathway. These states share similar characteristics with the conformers described here, except that the structural excursions in conformational space are more diverse. It is speculated that interconversion times are in fact quite long between some states. Interestingly, this implies that the simple bead model may in fact demonstrate scaling behavior consistent with that found in spin glasses, but a more detailed analysis of the protein model is required before a definitive conclusion can be reached. The results also indicate that the conformers located below the transition temperature vary significantly between force fields. While not apparent from the overlap functions or thermodynamics, the folding pathways for the RS and SFF/m models are not equivalent. The latter freezes into a set of conformationally similar structures (all α-helical hairpin-like) while the former produces a multitude of mis-folded structures that are clearly non-physical (see figure 12). Given that both force fields share similar thermodynamics and global minimum structures, as well as commonalities in parameterization, the result is particularly interesting. First, it suggests that the glassy behavior implied by the Parisi-like overlap function may after all be an overinterpretation of the result for at least one of the force field models. Chemically, the structures may in fact be indistinguishable. Although the function is not designed to measure similarities in conformations, this nevertheless implies that some caution must be used in applying order parameters (adopted from spin glass studies) to chemical problems of this type. The bulk of lattice model studies of molecular conformation may also be affected by these findings. And second, it demonstrates the critical dependence folding pathways and possible mechanisms have on the details of the parameterization. Although the results indicate that the phase transition is dominated by a hydrophobic collapse of the beads, the graph of the torsional variance given in figure 6 shows that substantial local interactions have first been established. In these models, the formation of the secondary structural elements is therefore
Figure 12. Stereoview of the lowest (dotted lines) and highest energy (dashed lines) frozen states located using the RS (upper) and SFF/m (lower) force fields. The structures were least squares fit and showed an RMS difference in cartesian space of 1.6 in the former, versus 0.5 for the latter.
key in predisposing the beads to the correct fold or general conformation during the cooling process. It follows that differences in the scaling of the forces favoring compaction (phobic-phobic) relative to those favoring elongation (torsional) are most likely responsible for the variation noted in the structural results. A direct examination of the scaling in these terms indicates that the latter are in fact stronger in the RS force field, which not only supports this conclusion, but also explains the more compact nature of the anomalous structures (see figure 12). Unfortunately, it is not possible to determine which force field contains a more realistic scaling in these terms since the energetics of this effect are difficult to quantitate, both experimentally and theoretically. Obviously, the development of more accurate models of protein structure greatly depends on our ability to correctly estimate the balance in these local and non-local interactions. This challenge remains.

5.1. Comparison to experiment

Finally, the physical properties calculated in this work can also be related and contrasted to experimental properties derived from real protein systems. Although quantitative comparisons are not possible due to the overwhelming simplifications made to the protein structure, some general aspects of the model predictions can be examined. First, the spin glass-type behavior implied by this work is not strongly supported by the bulk of experimental studies of protein folding. The majority of small, globular proteins do not show a manifold of compact states trapped below the phase transition temperature under standard conditions (i.e. T ≈ 300 K).[6,39] The folding process is thermodynamically reversible. Although many conformational substates do exist, the kinetic barriers are accessible on finite time scales. However, as the system is cooled, these barriers become insurmountable and the system freezes out into a collection of substates.[39] The results
presented here are remarkably similar but are only representative of a system with limited frustration or complexity. The question of what happens to the interconversion times between substates as the size and resolution of the α-helical hairpin model approaches that of real proteins remains open. Although some hints are offered by the work of Honeycutt and Thirumalai, further studies are required to fully understand the experimental significance of the results. Second, hydrophobic collapse is a strongly supported mechanism of protein folding.[34] The theoretical results have demonstrated that the freezing of the system is dominated by this non-local interaction, consistent with previous Brownian dynamics simulations of this model.[8] However, questions still remain regarding the role secondary structure plays in predisposing the system to reliably fold to the native conformation. The gradual decrease in the torsional variance noted in figure 6 suggests that substantial secondary structure is formed before collapse. Unfortunately, this prediction can neither be strongly supported nor refuted by available experimental data. Answers to this provocative question may come from simulations similar to those described here using more accurately parameterized energy terms and protein models that account for the proper scaling in the long and short range interactions (as alluded to above). And third, in terms of structure prediction, the force fields are clearly not equivalent. Although little structural data is available for small peptide fragments, we can find no strong support for the existence of the non-physical states produced by the RS force field for systems of this size. Intermediates in the folding pathway of real proteins are also thought to be native-like,[40] so it is doubtful that the system is capturing structural properties that have experimental significance. These arguments support the conclusion that the balance in the long and short range forces in the RS force field may in fact be skewed too far in favor of the latter. Given the resolution of the models, however, the structural results using both force fields are certainly reasonable and offer satisfactory starting points for further refinements.

REFERENCES
1. H. S. Chan and K. A. Dill, The Protein Folding Problem, Physics Today, 46 (1993) 24.
2. J. M. Troyer and F. E. Cohen, Simplified Models for Understanding and Predicting Protein Structure, in Reviews in Computational Chemistry, Vol. 2, ed. by K. B. Lipkowitz and D. B. Boyd, VCH Publishers, NY, 1991.
3. J. Skolnick and A. Kolinski, Computer Simulations of Globular Protein Folding and Tertiary Structure, Annu. Rev. Phys. Chem., 40 (1989) 207.
4. F. M. Richards, The Protein Folding Problem, Scientific American, 264 (1991) 54.
5. C. B. Anfinsen, Principles that Govern the Folding of Protein Chains, Science, 181 (1973) 223.
6. P. L. Privalov, Thermodynamic Bases of the Stability of Protein Structure, Thermochimica Acta, 163 (1990) 33; P. L. Privalov, Stability of Proteins, Adv. Protein Chem., 33 (1979) 167.
7. D.G. Garrett, K. Kastella, and D.M. Ferguson, New Results on Protein Folding from Simulated Annealing, J. Chem. Soc., 114 (1992) 6555.
8. A. Rey and J. Skolnick, Comparison of Lattice Monte Carlo Dynamics and Brownian
Dynamics Folding Pathways of α-Helical Hairpins, Chem. Phys., 158 (1991) 199.
9. C. A. Laughton, A Study of Simulated Annealing Protocols for Use with Molecular Dynamics in Protein Structure Prediction, Prot. Eng., 7 (1994) 235.
10. T. Karasawa, K. Tabuchi, M. Fumoto, and T. Yasukawa, Development of Simulation Models for Protein Folding in a Thermal Annealing Process, Comp. App. Biosciences, 9 (1993) 243.
11. M. E. Snow, Powerful Simulated Annealing Algorithm Locates Global Minimum of Protein Potentials from Multiple Starting Conformations, J. Comput. Chem., 13 (1992) 579.
12. K. C. Chou and L. Carlacci, Simulated Annealing Applications to the Study of Protein Structures, Prot. Eng., 4 (1991) 661.
13. C. Wilson and S. Doniach, A Computer Model to Dynamically Simulate Protein Folding: Studies with Crambin, Proteins: Struct. Func. Gen., 6 (1989) 193.
14. M. P. Allen and D. J. Tildesley, Computer Simulation of Liquids, Clarendon Press, Oxford (1987).
15. A.M. Ferrenberg and R.H. Swendsen, Optimized Monte Carlo Data Analysis, Comp. Phys., Sept/Oct. (1989) 101.
16. M. Mezard, G. Parisi, and M.A. Virasoro, Spin Glass Theory and Beyond, World Scientific Press, NY (1984).
17. M. Mezard and G. Parisi, Replicas and Optimization, J. Physique Lett., 46 (1985) L771.
18. S.F. Edwards and P.W. Anderson, Theory of Spin Glasses, J. Phys. F: Metal Phys., 5 (1975) 965.
19. Spin Glasses and Biology, ed. by D. L. Stein, World Scientific Press, NY (1992).
20. D.M. Ferguson, A. Marsh, T. Metzger, D.G. Garrett, and K. Kastella, Conformational Searches for the Global Minimum of Protein Models, J. Glob. Opt., 4 (1993) 209.
21. J.D. Honeycutt and D. Thirumalai, The Nature of Folded States of Globular Proteins, Biopolymers, 32 (1992) 695; J.D. Honeycutt and D. Thirumalai, Metastability of the Folded States of Globular Proteins, Proc. Natl. Acad. Sci. USA, 87 (1990) 3526.
22. A. M. Lesk, Protein Architecture, Oxford University Press, NY (1991).
23. L. Regan and W. F. DeGrado, Characterization of a Helical Protein Designed from First Principles, Science, 240 (1988) 976.
24. S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi, Science, 220 (1983) 671.
25. A.M. Ferrenberg, D.P. Landau, and K. Binder, Statistical and Systematic Error in Monte Carlo Sampling, J. Stat. Phys., 63 (1991) 867.
26. J. Klafter and M. F. Shlesinger, On the Relationship Among Three Theories of Relaxation in Disordered Systems, Proc. Natl. Acad. Sci. USA, 83 (1986) 848.
27. E.I. Shakhnovich and A.M. Gutin, Enumeration of All Compact Conformations of Copolymers with Random Sequence Links, J. Chem. Phys., 93 (1990) 5967.
28. E.I. Shakhnovich and A.M. Gutin, Formation of Unique Structure in Polypeptide Chains: Theoretical Investigation with the Aid of a Replica Approach, Biophysical Chemistry, 34 (1989) 187.
29. R. Rammal and G. Toulouse, Ultrametricity for Physicists, Rev. Mod. Phys., 58 (1986) 765.
30. R. H. Austin and C. M. Chen, The Spin Glass Analogy in Protein Dynamics, in Spin
Glasses and Biology, ed. by D. L. Stein, World Scientific Press, NY (1992).
31. H. Frauenfelder, K. Chu, and R. Philipp, Physics from Proteins, in Biologically Inspired Physics, ed. by L. Peliti, Plenum Press, NY (1991).
32. B. Derrida, Random-Energy Model: An Exactly Solvable Model of Disordered Systems, Phys. Rev. B, 24 (1981) 2613.
33. K. A. Dill, Dominant Forces in Protein Folding, Biochemistry, 29 (1990) 7133; K. A. Dill, Theory of the Folding and Stability of Globular Proteins, Biochemistry, 24 (1985) 1501.
34. G.E.P. Box and G.M. Jenkins, Time Series Analysis: Forecasting and Control, Holden-Day (1970).
35. A detailed analysis of the conformers and correlation times for the runs that produced a double peak in C_V has revealed that the system is effectively "tunneling" from one low energy well to another. To ensure that this was not an effect of the annealing time, several more runs were made with longer annealing times. No further transitions were observed for annealing times up to 40,000 acceptances (8 times our normal time).
36. The frozen states populated were simulated at low temperature (200 K) with molecular dynamics using the force field given by Eq. 2 and table 2. The trajectories were then analyzed for conformational interconversions.
37. E.I. Shakhnovich, G. Farztdinov, A.M. Gutin, and M. Karplus, Protein Folding Bottlenecks: A Lattice Monte Carlo Simulation, Phys. Rev. Lett., 67 (1991) 1665.
38. J.D. Bryngelson and P.G. Wolynes, Spin Glasses and the Statistical Mechanics of Protein Folding, Proc. Natl. Acad. Sci. USA, 84 (1987) 7524.
39. H. Frauenfelder, F. Parak, and R. D. Young, Conformational Substates in Proteins, Ann. Rev. Biophys. Biophys. Chem., 17 (1988) 451.
40. J. S. Weissman and P. S. Kim, Reexamination of the Folding of BPTI: Predominance of Native Intermediates, Science, 253 (1990) 1386.

Acknowledgements: The authors would like to thank John Troyer and Keith Kastella for helpful discussions regarding the development of this chapter. We are also grateful to Joel Neisen and the Minnesota Supercomputer Center for help preparing figure 1.
Chapter 18
Optimization of linear and non-linear parameters in a trial wavefunction by the method of simulated annealing*
P. Dutta and S.P. Bhattacharyya, Department of Physical Chemistry, Indian Association for the Cultivation of Science, Jadavpur, Calcutta-32, INDIA
1. INTRODUCTION

The variational method has been the cornerstone on which the very edifice of electronic structure theory has been largely built. The bounding properties of variational solutions and the scope of systematic improvement of an approximate wavefunction by progressively enlarging the dimension of the trial space are features that are particularly appealing and useful [1-3]. It is not surprising, therefore, that a vast body of literature dealing with different aspects of the variational approximation and its applications to atomic and molecular electronic structure theory exists. The present paper is focused around the variational approximation and a stochastic method of handling the mathematical problem of optimization that it very naturally leads to. The problem of minimizing functions is ubiquitous in many different branches of science. It arises very naturally and rather directly in electronic structure theory when the strategy adopted is variational; for, the basic task in the variational approximation boils down to finding the values of a set of parameters present in the trial wavefunction (assuming expansion in terms of finite dimensional analytic
* Present Address: S.N. Bose National Centre for Basic Sciences, DB-17, Salt Lake, Calcutta-64, India
basis sets) for which the appropriately constrained energy function has stationary points. Recast in the language of optimization theory, the problem is to find the extremum of a function on a multidimensional surface. For the ground state of a Hamiltonian that is bounded from below, the extremal point is the absolute or global minimum of the function. For the excited states, the problem is less straightforward and constraints become very important for preventing variational collapse onto the ground state or the lowest state of a particular symmetry. But constraints inevitably bring in Lagrangian multipliers that are to be handled carefully. The conventional approach has been to adopt either the gradient driven techniques [4-5] or the non-gradient ones like the pattern search method, to solve the resulting optimization problem. But to do so successfully one must be able to surmount the problems of (i) distinguishing saddle points, maxima and minima, (ii) choosing the direction of search and appropriate step lengths, and (iii) discriminating between a local and a global minimum. These problems are not insurmountable, but are certainly vexing. Alternative techniques that are not only gradient free but also free from the burden of establishing a search direction have naturally been explored, and a global stochastic minimizer has emerged as a very general and powerful technique for handling the problem. The method has been known as the method of simulated annealing [6-7]. Its adoption in solving the basic problems of electronic structure theory is of rather recent vintage. Two distinctly different kinds of approaches have been advocated. One of these works by making a fictitious Lagrangian stationary (δ∫L dt = 0) and generating classical equations of motion for the parameters in ψ, which reach their optimal values as t → ∞ and T → 0 [8-10]. The fictitious Lagrangian has kinetic energy terms for the nuclear as well as the electronic degrees of freedom. The latter are the variational parameters in the trial wavefunction. The potential energy term in the Lagrangian is the expectation value of H (= ⟨ψ(a)|H|ψ(a)⟩), where a represents a collection of variational parameters in ψ. The other version relies on the cost or penalty function recipe that recasts the variational problem so that it transcribes into a problem of finding the global minimum of a constrained energy function through a stochastic search at different parameter temperatures, such that the sequence a(T_1), a(T_2), ..., a(T_n) → a_0 as T_n → 0, where a_0 is the set of values of the parameters a for which the
function has a global minimum [11-16]. The construction of the appropriate function, as opposed to the Lagrangian, is therefore the first step in this strategy. We will therefore turn our attention first to this aspect.

2. THE CONSTRUCTION OF COST FUNCTION
For the ground state, the variational problem is to make ⟨ψ|H|ψ⟩ stationary with respect to all possible variations in ψ, subject to the condition that ⟨ψ|ψ⟩ = 1. The constrained variational problem can be succinctly posed as

\delta J(\psi) = 0, \quad \text{for arbitrary } \delta\psi    (1)

where

J(\psi) = \langle \psi | H | \psi \rangle + \lambda \left[ 1 - \langle \psi | \psi \rangle \right],    (2)

λ being a Lagrangian multiplier. We can cast the same problem in the mould of a direct minimization problem by constructing the cost function J̃(ψ) and looking for its absolute minimum, where

\tilde{J}(\psi) = \left( \langle \psi | H | \psi \rangle - E_L \right)^2 + \mu \left[ 1 - \langle \psi | \psi \rangle \right]^2.    (3)

E_L is the lower bound to the constrained energy and μ is a penalty weight factor of appropriate magnitude and dimension. Our problem is then to find the absolute minimum of J̃(ψ) by adjusting the parameters, linear or non-linear, present in ψ [17-18]. For the first excited state, which has a different symmetry from the ground state, the same procedure can be adopted. But for an excited state having the same symmetry as some of the lower states, the construction of an appropriate cost function is less simple. Here we must discriminate between two categories of variations, viz. linear and non-linear. If the variation is restricted to be linear, then the constraints to be imposed on ψ for an excited state of the same symmetry as some of the lower states are those of normalization (⟨ψ|ψ⟩ = 1) and orthogonality to the relevant lower states (⟨ψ|φ_k⟩ = ⟨φ_k|ψ⟩ = 0, for k = 1, 2, ..., n), where the φ_k's are the approximate lower eigenstates of H that are of the same symmetry as ψ [2-3]. The appropriate cost function is then given by
n
E1)2 J 1 (k1)
=
(
1 11 1 17/ > –
L
+
2 [1 - ‹tii I t >]2
1
+ Ok k=1
I
1 2
(4)
where the β_k and μ are penalty weight factors. The variational problem now boils down to finding the absolute minimum of J₁(ψ) with respect to all the linear parameters in ψ. If we wish to allow much more variational freedom to ψ and carry out a perfectly general variational calculation on the excited state, the orthogonality constraints together with the norm conservation constraint are not enough to prevent variational collapse. We must enforce on ψ the decoupling constraints as well [1,19-20]. That is, we must demand that

⟨ψ|ψ⟩ = 1,  ⟨ψ|φ_k⟩ = ⟨φ_k|ψ⟩ = 0,  and  ⟨φ_k|H|ψ⟩ = 0, for all k   (5)

The set of functions {φ_k, k = 1, ..., n} representing the approximate lower eigenstates of H must be an orthonormal set and mutually decoupled. The appropriate cost function for this case is then given by

J₂(ψ) = (⟨ψ|H|ψ⟩ − E_L)² + μ [1 − ⟨ψ|ψ⟩]² + Σ_{k=1}^{n} β_k |⟨ψ|φ_k⟩|² + Σ_{k=1}^{n} γ_k |⟨ψ|H|φ_k⟩|²   (6)
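For concreteness, the following Python fragment (our own sketch, with assumed inputs) extends the previous illustration to the cost function of equation (6): the approximate lower eigenvectors φ_k are supplied as columns of an array, and beta and gamma are arrays of penalty weights.

import numpy as np

def cost_excited_state(C, H, S, lower, E_lower, mu, beta, gamma):
    """Cost of Eq. (6) for a linear trial vector C:
    (<psi|H|psi> - E_L)^2 + mu*(1 - <psi|psi>)^2
      + sum_k beta_k |<psi|phi_k>|^2 + sum_k gamma_k |<psi|H|phi_k>|^2,
    where the approximate lower eigenvectors phi_k are the columns of `lower`."""
    energy = C @ H @ C
    norm = C @ S @ C
    overlaps = C @ S @ lower        # <psi|phi_k> for every k
    couplings = C @ H @ lower       # <psi|H|phi_k> for every k
    return ((energy - E_lower) ** 2
            + mu * (1.0 - norm) ** 2
            + np.sum(beta * overlaps ** 2)
            + np.sum(gamma * couplings ** 2))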
The variational optimization of ψ for the (n+1)th state of H can therefore be reduced to the problem of global minimization of J₂(ψ) with respect to all the parameters in ψ, linear or non-linear, without encountering the problem of variational collapse. Of course, we could have also set δJ₂(ψ) = 0 and proceeded to solve the resulting equations. However, the equations would be quite difficult to handle even for the simplest choice of ψ and the φ_k's [21]. Moreover, the resulting equations would only characterize a stationary point, i.e. an extremum, and there is no guarantee that the solution obtained is associated with a minimum, even a local one. A direct search for the absolute minimum of a cost function like J₁(ψ) or J₂(ψ) could therefore be a worthwhile proposition, provided that a reliable and efficient global optimizer is available. Before we take up this question in the section to follow, we digress a little here. So far we have considered the problem very generally. It would perhaps be prudent to make contact with the more specific problem of electronic structure calculations of atoms and molecules. In quantum theoretical or quantum chemical calculations of atomic and
molecular electronic structure, the ground and the lowest electronic state of each symmetry are commonly represented by approximate wavefunctions determined variationally by applying the method of the self-consistent field (SCF). Such single-determinant-representable approximate wavefunctions (SCF wavefunctions) provide what may be called the parent configuration. Further refinement comes by way of supplementing the parent configuration by invoking the method of configuration interaction (CI). This essentially amounts to applying the linear variation theorem in the manifold of electronic states generated by the SCF orbitals. The convergence of the CI wavefunctions may, however, be tediously slow, forcing one to explore other avenues. The same two-step hierarchy, i.e. SCF followed by CI, can in principle be extended to the problem of representing an excited state as well. But such a simple-minded extension may not be feasible in many cases, because the parent excited state configuration may not be amenable to SCF calculation, owing to variational collapse onto a lower state of the same symmetry. On the other hand, expansion around a non-parent configuration is, expectedly, rather inefficient. A way out of this impasse is to invoke the method of multi-configuration SCF, in which the base (SCF) orbitals used for generating configurations for the linear variational calculation are not held fixed but are allowed to vary simultaneously while the linear expansion coefficients are optimized (quadratic MC-SCF) [22-24]. A simpler version of the method adopts a kind of linearization approximation and optimizes the linear parameters and the MC-SCF orbitals in tandem [25]. The point we wish to emphasize here is that the non-linear orbital optimization step can often be beset with convergence difficulties, originating from variational collapse, neglect of coupling between the orbital and configuration (CI) spaces, inadequate handling of orthogonality constraints, etc. Adoption of the so-called state-averaged MC-SCF strategy can be useful for handling problems arising from variational collapse [26]. It is not difficult to see that the problem of simultaneous optimization of linear and non-linear parameters addressed by us earlier in this section has a direct bearing on the MC-SCF method. The only difference lies in the fact that in the conventional MC-SCF scheme one expands the MC-SCF orbitals in terms of a finite basis set and optimizes the orbital expansion coefficients, and not the exponents, to get at the optimal orbital
forms, while the optimization of the CI coefficients takes place just as in our scheme of obtaining the best linear parameters in a trial wavefunction by minimizing a suitably constrained energy function or cost function (see later). Naturally, the strategy to be described in the following sections can be invoked with equal facility for handling the MC-SCF problem as well [16].

3. MINIMIZATION OF THE COST FUNCTION
Once the construction of an appropriately constrained cost function is complete, its minimization in the absolute sense becomes the target. Gradient search methods are patently unsuitable for the minimization of cost functions of the form proposed by us (e.g. equation 3), as the corresponding Hessian may be singular at the true constrained minimum, where ideally ⟨ψ|H|ψ⟩ = E_L. On the other hand, the method of simulated annealing provides an ideal gradient-free pathway leading to the global minimum of the constrained function and completely bypasses the Hessian singularity problem. Since the exact form of the cost function will depend on the particular problem and on the nature and number of constraints imposed on ψ, we shall represent it generally by F, a function of the parameters (p₁, p₂, ..., p_n) in ψ with respect to which F must be minimized. Let the parameter temperature at the mth step be T_m. Any one of the parameters (p₁, say) is chosen randomly and updated in our scheme as follows:

p₁^m → p₁^{m+1} = p₁^m + (2r^m − 1.0) s^m   (7)

where r^m is a random number between 0 and 1 generated at the mth step and s^m is an adjustable step length parameter. The new function value F(p₁^{m+1}, p₂^m, ..., p_n^m) is then calculated.
At this step, however, we introduce a departure from the standard simulated annealing recipe, in that we recognize three different types of reconfiguring moves as opposed to the two types considered in the standard version [27]. We categorize them as feasible moves of the first and second kinds (FM1 and FM2) and unfeasible moves (UFM). In an FM1 move, randomly reconfiguring the system reduces the value of F (i.e. F_{m+1} < F_m), while in an FM2 move F_{m+1} > F_m but ΔF_m = {|F_m − F_{m+1}|}^{1/2} lies within the scale of thermal fluctuations at the prevailing parameter temperature. That leaves the moves of the unfeasible type, in which ΔF_m ≫ kT_m (i.e. ΔF_m > r_m, r_m being a random number between 0 and 1). For a move belonging to FM1, the temperature T_m is reduced immediately, without waiting for the system to equilibrate at that temperature (i.e. no further sampling is carried out at that temperature), the amount of reduction being proportional to the reduction of cost at that step (T_{m+1} = C ΔF_m, C being the constant of proportionality). Simultaneously, the step length s_{m+1} is reduced similarly. For an FM2 type of reconfiguring move, however, the temperature is held constant until either the maximum number of moves allowed at a particular temperature (preset) is exceeded, the limiting number of successful moves (also preset) has been achieved, or a move of FM1 type is encountered. That is, the simulated thermodynamic system is allowed to equilibrate as long as the energy fluctuations are of the order of kT or higher. Reconfiguring moves of the third kind are rejected, as in the standard SAM. Along with this minor change in the sampling device, another simple modification is suggested to make the annealing more efficacious. To appreciate this change in strategy, let us take up the cost function of equation (3), viz.

Y_m(p₁, p₂, ..., p_n) = (⟨ψ|H|ψ⟩_m − E_L^m)² + μ [1 − ⟨ψ|ψ⟩_m]²   (8)

Now, the current estimate of the lower bound, E_L^m, may either be held fixed for all the temperature steps (E_L^m = E_L for all m) or dynamically updated. We prefer to modify the value of E_L^m as the system evolves from the temperature T_m to T_{m+1} by using the Morrison function technique [18], through the relation

E_L^{m+1} = E_L^m + β R_m,  0 ≤ β ≤ 1   (9)

where

R_m = (ΔF_m)^{1/2}   (10)

This updating of E_L^m is recommended only if there is no reduction in the value of the function Y as the annealing evolves from the mth temperature step to the (m+1)th temperature step. That is, if F_{m+1} ≅ F_m even after the complete optimization at a particular temperature (T_m), the lower bound estimate is updated using (9). We shall examine the effects of these changes in strategy on the performance characteristics of our SAM-based technique for optimizing linear and non-linear parameters in a trial wavefunction later in this article.
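The move classification and cooling rules described above can be collected into a compact loop. The Python sketch below is our own paraphrase of the scheme under stated assumptions, not the authors' code: the cost function, the initial temperature and all control constants are user-supplied placeholders, a Metropolis-style comparison with a random number stands in for the "thermal fluctuation" test of the FM2 moves, and the dynamic E_L updating of equations (9)-(10) is assumed to be handled inside the supplied cost function.

import math
import random

def anneal(cost, p, s, T, c_prop=1.0, max_moves=200, max_success=50, n_temps=100):
    """Sketch of the modified annealing scheme: FM1 moves trigger immediate
    cooling, FM2 moves equilibrate at fixed temperature, UFM moves are rejected.
    `cost` maps the parameter list p (linear and non-linear together) to F."""
    F = cost(p)
    for _ in range(n_temps):
        moves = successes = 0
        while moves < max_moves and successes < max_success:
            moves += 1
            i = random.randrange(len(p))                  # pick one parameter at random
            trial = list(p)
            trial[i] += (2.0 * random.random() - 1.0) * s     # Eq. (7)
            F_new = cost(trial)
            dF = math.sqrt(abs(F - F_new))
            if F_new < F:                                 # FM1: downhill move
                p, F = trial, F_new
                T = max(c_prop * dF, 1e-12)               # cool at once, in proportion to the gain
                s = max(c_prop * dF, 1e-12)               # shrink the step length similarly
                break                                     # no further sampling at this temperature
            elif random.random() < math.exp(-dF / T):     # FM2: uphill but thermally acceptable
                p, F = trial, F_new                       # (Metropolis-style test as a stand-in)
                successes += 1
            # otherwise: unfeasible move (UFM), rejected outright
    return p, F

A call such as anneal(lambda p: sum(x * x for x in p), [1.0, -2.0], s=0.5, T=1.0) illustrates the calling convention on a trivial quadratic cost.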
Before focusing on applications of the proposed method, we would like to stress that no segregation of the parameters into linear and non-linear types is advisable, as it lowers the overall efficacy of the process. That essentially means that the use of a cyclic optimization route is not advisable when the SAM is invoked.

It is possible to discard the energy-based approach and invoke the SAM for minimizing the variance of the energy, V = ⟨H²⟩ − ⟨H⟩², for a trial wavefunction and so determine the optimal set of parameters [28]. However, the calculation of V involves the computation of complicated integrals involving H², which is primarily responsible for the relative unpopularity of the least squares method. Some integration-free techniques are available for circumventing the problem [29-30]. Recently Dutta et al. [28] have shown that the use of a linear space admirably simplifies the problem of variance minimization and offers a facile route to the ground state energy and wavefunction. By choosing a linear trial space we have

ψ = Σ_{i=1}^{n} C_i φ_i   (11)

It is now easy to show [28] that the condition f_k = ∂V/∂C_k = 0 leads to the following set of coupled non-linear equations:

f_k = Σ_{i=1}^{n} C_i [H²_{ik} − 2 ε̄ H_{ik} + (ε̄² − V) S_{ik}] = 0, for k = 1, 2, ..., n   (12)

In (12), H²_{ik} = ⟨φ_i|H²|φ_k⟩ and

ε̄ = Σ_i Σ_j C_i C_j H_{ij} / Σ_i Σ_j C_i C_j S_{ij},  S_{ij} = ⟨φ_i|φ_j⟩   (13)

An effective optimization of the linear parameters in ψ can now be performed by globally minimizing the cost function

Y = Σ_{k=1}^{n} f_k²   (14)

But instead of solving the n coupled non-linear equations given in (12) by minimizing Y, which merely represents satisfaction of the variance stationarity condition as opposed to the absolute minimization of V, direct minimization of |V| by the SAM appears to be a perfectly feasible proposition [28].
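For a linear trial space, the quantities entering equations (12)-(14), as reconstructed above, are simple matrix contractions. The Python sketch below is our own illustration; the matrices H, H2 (the matrix of H² in the basis) and S are assumed to be available.

import numpy as np

def variance_quantities(C, H, H2, S):
    """Mean energy, variance and the stationarity residuals f_k (Eqs. 12-14)
    for psi = sum_i C_i phi_i, given H_ij = <phi_i|H|phi_j>,
    H2_ij = <phi_i|H^2|phi_j> and S_ij = <phi_i|phi_j>."""
    norm = C @ S @ C
    e_bar = (C @ H @ C) / norm                    # Eq. (13)
    variance = (C @ H2 @ C) / norm - e_bar ** 2   # V = <H^2> - <H>^2
    f = (H2 - 2.0 * e_bar * H + (e_bar ** 2 - variance) * S) @ C   # Eq. (12)
    cost = np.sum(f ** 2)                         # Eq. (14)
    return e_bar, variance, f, cost

Either the residual cost or |variance| returned here could be handed to the annealer sketched earlier.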
Having thus made the advocated methodologies transparent, we now turn to some simple-minded applications of the techniques to the problem of optimization of both linear and non-linear parameters in a trial wavefunction, be it for the ground or the excited states of an atomic or molecular system.

4. RESULTS AND DISCUSSION
The methodologies described in the previous section can be invoked to handle the problem of optimization of linear and non-linear parameters in a trial ψ irrespective of the details of the trial function, provided the Hamiltonian is bounded from below. However, we will focus on rather simple systems and simple trial functions to illustrate subtle features of the proposed methods. The presentation is designed with more weight on educative principles and less on providing evidence in support of practicability in large scale structure calculations. The applications that we will describe in the following pages broadly fall into two categories: (a) electronic structure of two-electron atoms or ions, (b) electronic structure of two-electron molecules or ions. The atomic problems have been studied with essentially two different types of basis sets, viz. Davis' and Slater-type basis sets. While Davis' basis set [31] has been used for calculating only the ground state wavefunctions and energies of some members of the He sequence, Slater-type basis functions have been used for both ground and excited state calculations involving simultaneous optimization of linear and non-linear parameters. In some cases the energy minimization based calculation has been supplemented by the variance minimization based approach. With this preamble we now turn our attention to the details of the applications.

4.1. Ground states of two-electron atoms and ions
4.1.1. Use of Davis' basis set: Optimization of linear parameters only
The normalized spatial wavefunction for the ground state of the two-electron atomic species is taken in the form
ψ(r₁, r₂) = Σ_l F_l(r₁, r₂) P_l(cos θ)   (15)

where

F_l(r₁, r₂) = Σ_{n=l+1}^{n_max} Σ_{n'=n}^{n_max} C(n, n', l) Φ(n, n', l)   (16)

and

Φ(n, n', l) = [2(1 + δ_{nn'})]^{-1/2} [R(n, l, r₁) R(n', l, r₂) + R(n', l, r₁) R(n, l, r₂)]   (17)

Since terms with l = 0 are expected to make by far the largest contribution to the total energy of the ¹S₀ states of these species, and our basic purpose in this article is to demonstrate the workability of the SAM-based strategy, we have restricted the expansion in (16) to l = 0 terms only. The radial basis functions have been taken in the following form:

R(n, l, r) = (2η)^{3/2} [(n+l)!]^{-3/2} [(n−l−1)!]^{1/2} exp(−ηr) L_{n+l}^{2l+1}(2ηr)   (18)

The L_{n+l}^{2l+1}(2ηr) are the associated Laguerre polynomials, and η can be regarded as a parameter representing the effective screened nuclear charge of the atom or ion that an electron in an ns (l = 0) orbital is likely to feel. We have used n_max = 7, thus generating a total of 28 configurations [31]. The optimization therefore involves 28 linear parameters (the C's) and one non-linear parameter (η). However, we have used the η reported by Davis and optimized the linear parameters only to get the ground state energy. We have adopted both the energy and the variance minimization based approaches. The optimal energy values obtained for a few selected ions are summarized in Table 1. For comparison, the best available estimates of the s-limit energies [31,32] are also reported.

Table 1
The ground state energies (in a.u.) of a few members of the He-isoelectronic sequence obtained by SAM-based variance and energy minimization approaches using Davis' basis set (28 configurations)
Z     −E (by SAM)        −E (s-limit) [Ref. 31]
2      2.878 997 483      2.879 028 0
3      7.252 406 011      7.252 431 2
4     13.626 826 241     13.626 859 0
5     22.001 479 593     22.001 513 0
6     32.376 260 774     32.376 295 0
7     44.751 109 986     44.751 145 0
8     59.125 999 729     59.126 036 0
Figure 1(a) shows the cost profile of the energy-based minimization carried out on the ground state of the O^{+6} ion with Davis' basis set. Figure 1(b), on the other hand, shows the behaviour of the energy variance during an annealing run on the same system. Although both approaches lead to an identical point on the energy hypersurface, the average rate of descent appears to be different in the two cases.
Figure 1. Cost profiles during ground state energy optimization of O^{+6} based on (a) the energy and (b) the variance minimization approach.
At low parameter temperatures the fluctuations in the cost profile are drastically reduced and the extrapolation to T = 0 is practically linear. The overall efficiency of the procedures does not depend strongly on the starting point of the minimization: whatever the starting point, a unit vector or a good approximation to ψ in the linear space concerned, the performance index of the method remains virtually the same.
4.1.2. Use of STOs: Optimization (cyclic or simultaneous) of many linear and a single non-linear parameter
The calculations reported in this subsection are designed to establish that simultaneous optimization of linear and non-linear parameters is more advantageous when done by the SAM. Again, we choose two-electron atoms and write down the ground state wavefunction ψ(r₁, r₂) as

ψ(r₁, r₂) = Σ_{n,n'} C_{nn'} Det|φ_{ns} φ̄_{n's}|   (19)

where

φ_{ns} = N_n e^{−ξr} r^{n−1},  φ_{n's} = N_{n'} e^{−ξr} r^{n'−1}   (20)

n and n' have been allowed to take values from 1 to 10, so that a total of 55 configurations (spin-singlet) are generated. ξ is the common scale factor and has been optimized either simultaneously or cyclically with the linear variational parameters (C_{nn'}). Figure 2(a) shows the conventional minimization of the energy as a function of ξ, where ξ is changed following a pattern search procedure. For each ξ, the H-matrix is constructed and diagonalized, and the lowest eigenvalue is tracked until it passes through a well-defined minimum as a function of ξ.
Figure 2(a). Profile of the ground state energy E(ξ) of O^{+6} obtained by the CI pathway as a function of the exponent (ξ) present in the 55-configuration ψ.

Figures 2(b) and 2(c) display the SAM-based profile of ξ as a function of annealing steps for the cyclic and simultaneous optimization of both
linear and non-linear parameters.

Figure 2(b),(c). Profiles of ξ in SAM-based (b) simultaneous and (c) cyclic optimization of both linear and non-linear (ξ) parameters for the same system as in Figure 2(a).

In each case, the same optimal value of ξ is obtained, although the paths leading to the minimum are different. Our experience so far is positively in favour of the simultaneous optimization of both linear and non-linear parameters.

4.1.3. The case of many non-linear parameters
The calculations reported in this sub-section use a trial function of the type described in Section 4.1.2, with the additional flexibility that the exponents of the STOs are each allowed to take an optimum value. Concomitantly, the length of the CI expansion was cut down to 10 configurations only. The computed energy values, along with the s-limit energies for each Z, are reproduced in Table 2. To assess the effect that the additional optimization brings in, we have also reported in the same table the energy values computed with the same 10-configuration ψ using a single optimized value of ξ for each Z. Appreciable improvement is seen to have been achieved through simultaneous optimization of all the linear and non-linear parameters.

Table 2
The ground state energies (in a.u.) of a few members of the He-isoelectronic sequence obtained by SAM-based optimization of all the linear parameters and (a) a single exponent, (b) all the exponents, in a 10-term ψ using STOs as basis functions
Z     −Energy (a)      −Energy (b)      s-limit
1      0.513 095 3      0.513 866 4
2      2.878 597 3      2.878 978 4      2.879 028 0
3      7.252 066 9      7.252 348 0      7.252 431 2
4     13.626 424 3     13.626 805 4     13.626 859 0
5     22.001 067 1     22.001 448 2     22.001 513 0
6     32.375 838 3     32.376 219 3     32.376 295 0
7     44.750 678 3     44.751 059 4     44.751 145 0
8     59.125 559 7     59.125 940 8     59.126 036 0
4.2. Excited states of two-electron atoms and ions
Once the ground state is determined, we can construct the appropriate objective or cost function incorporating the orthonormality and decoupling
constraints on the trial wavefunction for the ¹S (1s2s) state of He or a He-like atom and carry out the annealing. The evolution of the cost as the annealing approaches termination (T → 0) is displayed in Figure 3(a), while Figure 3(b) shows the corresponding energy profile [E(T = 0) = −2.129 361 a.u.]. The energy compares well with the best available value of the s-limit energy of the ¹S (1s2s) state of He.
Figure 3. Profiles of (a) cost and (b) energy during SAM-based orthogonality- and decoupling-constrained optimization of the ¹S (1s2s) state of He.

4.3. Applications to two-electron diatomics
4.3.1. Simultaneous optimization of R and α
The diatomic molecular systems provide an additional degree of freedom in that the internuclear distance (R) is also a parameter (in addition to the explicit parameters in ψ) with respect to which the molecular energy must be minimized to arrive at the equilibrium molecular electronic structure when working in the adiabatic Born-Oppenheimer approximation. In the usual, i.e. adiabatic, approach one fixes the internuclear distance R at a specific value (R_i, say) and optimizes the energy with respect to the explicit variational parameters, linear or non-linear, to get an energy E(R_i) that is optimal for the given internuclear distance. Once a large number of E(R_i) values have been computed, hopefully bracketing the global minimum on the potential energy surface, it is straightforward to find the minimum energy (E₀) and the equilibrium internuclear distance R₀ associated with it. But it is possible, at least in principle, to do away with this two-step procedure and go for the optimization of R and the explicit
variational parameters in ψ(C, α, R) (C = linear parameters, α = non-linear parameters) simultaneously. If the coupling between R and (C, α) is strong, simultaneous optimization of R, C and α would certainly be preferable. We propose to illustrate this point by taking the Heitler-London ground state wavefunction [32-33] for H₂ as an example:

ψ_HL(α, R) = [2(1 + S²)]^{-1/2} [K_a(1) K_b(2) + K_a(2) K_b(1)] (1/√2) [α(1)β(2) − β(1)α(2)]   (21)
Figure 4(a) displays the profile of the SAM-based α optimization at a fixed R (R = 1.4), while Figure 4(b) shows the corresponding energy profile.
Figure 4. Profiles of (a) α and (b) energy obtained by fixed-R (= 1.4) optimization of the ground state energy of the H₂ molecule using the Heitler-London wavefunction.

The α_opt-R profile constructed in this manner is displayed in Figure 4(c), while the E_opt-R curve is exhibited in Figure 4(d), in which the equilibrium value R₀ is also indicated.
Figure 4. Profiles of (c) α_opt and (d) E_opt as functions of R, obtained from different fixed-R annealing runs as in Figures 4(a)-(b).
Figures 5(a)-(c) show the evolution of α, R and E, respectively, when both α and R are minimized simultaneously by the SAM. The same values of α_opt, R_opt and E_opt are obtained as found in the previous case, but the latter (simultaneous) procedure seems to be a much easier and more efficient route.
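Operationally, the simultaneous optimization simply treats α and R as two entries of the same parameter vector handed to the annealer. The Python sketch below is our own illustration: hl_energy(alpha, R) is a hypothetical user-supplied routine returning the Heitler-London energy of equation (21) (not implemented here), and a bare-bones Metropolis loop with geometric cooling is used in place of the modified scheme of Section 3.

import math
import random

def anneal_alpha_R(hl_energy, alpha=1.0, R=1.4, T=0.5, cooling=0.95, n_steps=2000):
    """Anneal the orbital exponent alpha and the internuclear distance R
    simultaneously, treating both as entries of one parameter vector."""
    E = hl_energy(alpha, R)
    for _ in range(n_steps):
        if random.random() < 0.5:                         # perturb alpha or R with equal probability
            a_new, R_new = alpha + 0.05 * (2.0 * random.random() - 1.0), R
        else:
            a_new, R_new = alpha, max(R + 0.05 * (2.0 * random.random() - 1.0), 0.1)
        E_new = hl_energy(a_new, R_new)
        if E_new < E or random.random() < math.exp(-(E_new - E) / T):
            alpha, R, E = a_new, R_new, E_new             # accept the move
        T *= cooling                                      # simple geometric cooling schedule
    return alpha, R, E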
Figure 5. Variations of (a) α, (b) R and (c) energy during simultaneous optimization of α and R by the SAM for the same system as in Figure 4.

4.3.2. Simultaneous optimization of R and α with a modified choice of ψ_trial
We have also studied the problem of simultaneous optimization of α and R for the H₂ molecule by making the following choice for ψ_trial (a and b label the two H atoms) [34]:
ψ_trial = (1/√2) {φ(1) φ̄(2) − φ(2) φ̄(1)}   (22)

φ(i) = [2(1 + S)]^{-1/2} [K_a(i) + K_b(i)]   (23)

K_a(i) = e^{−α r_{ai}},  K_b(i) = e^{−α r_{bi}}   (24)
Figure 6(a) shows the profile along which α is optimized during a typical annealing run designed to carry out simultaneous optimization of α and R. Figure 6(b) shows the evolution of R along the path that leads to stochastic minimization of the parameters of the simulated thermodynamic system defined in the parameter space of the H₂ molecular wavefunction. Figure 6(c) depicts the minimization of the electronic energy of the H₂ molecule as α and R are both allowed to vary simultaneously during a typical annealing run commencing at a high parameter temperature (T = 10) and subsequently annealed to T ≈ 0.
Figure 6. Variations of (a) α, (b) R and (c) energy of the H₂ molecule obtained by SAM using the modified ψ_trial (see equations 22-24).
4.3.3. Simultaneous optimization of the orbital exponent (α), internuclear distance (R) and CI coefficient (C₀)
To demonstrate the simultaneous optimization of three different types of parameters contained in ψ_trial (both implicit and explicit), we have chosen a Weinbaum-type wavefunction for the H₂ molecule (ψ_WB) [35]:

ψ_WB = C₀ ψ₀ + C₂ ψ₂   (25)

where

ψ₀ = (1/√2) {φ₊(1) φ̄₊(2) − φ₊(2) φ̄₊(1)}   (26)

and

ψ₂ = (1/√2) {φ₋(1) φ̄₋(2) − φ₋(2) φ̄₋(1)}   (27)

Furthermore, we have used

φ₊(i) = [2(1 + S)]^{-1/2} [K_a(i) + K_b(i)]   (28)

φ₋(i) = [2(1 − S)]^{-1/2} [K_a(i) − K_b(i)]   (29)

with K_a and K_b defined just as in Section 4.3.2, i.e. K_a(i) = e^{−α r_{ai}}, K_b(j) = e^{−α r_{bj}}. The normalization condition on ψ_WB demands C₀² + C₂² = 1, i.e. C₂ = ±(1 − C₀²)^{1/2} (the phase is to be fixed by the sign of ⟨ψ₀|H|ψ₂⟩). The energy function for ψ_WB is therefore a function of α, C₀ and R,
all of which can be varied simultaneously to reach the global minimum of the energy function. Figures 7(a)-(c) present the E-R, α-R and C₀-R profiles, respectively. The figures have been drawn from the {α_i, C₀^i, R_i} data sets generated during a simulated annealing run in which all the parameters were allowed to vary simultaneously. The optimal values tally nicely with the known values. The R values generated by the stochastic minimizer have been rearranged into an ordered data set, and the corresponding α, C₀ and E values have been noted for constructing the relevant figures, which are shown below.
Figure 7. (a) α-R and (b) E-R profiles during annealing of the ground state energy of the H₂ molecule using the Weinbaum-type wavefunction.
Figure 7(c). Profile of the linear coefficient (C₀) as a function of R during the same annealing run as in Figures 7(a)-(b).
5. FURTHER GENERALIZATION OF THE SCHEME
Almost as a corollary to the discussion of the strategy explicated in the previous sections, we may note that essentially the same procedure can be invoked for calculating SCF or MC-SCF wavefunctions. Let the closed-shell ground state of a 2n-electron system be represented by a single determinant wavefunction Ψ₀ constructed from 2n spin orbitals,

Ψ₀ = det{φ₁ φ̄₁ ... φ_n φ̄_n},  with  φ_i = Σ_p T_{pi} K_p   (30)

the K_p being the basis (AO) functions. In matrix notation, Φ = K T, with T⁺S T = 1. The variational problem that the SCF method epitomizes can then be reduced to a global minimization of J(Ψ₀), where [16]

J(Ψ₀) = (ε_SCF − ε⁰)² + β Tr{(T⁺S T − 1)⁺ (T⁺S T − 1)}   (31)

The minimization variables are the elements of the linear expansion coefficient matrix T. If the situation demands the use of the MC-SCF strategy, i.e. if Ψ₀ is a multideterminant wavefunction,

Ψ₀ = Σ_i C_i Φ_i,

the construction of the cost function J(Ψ₀) is cumbersome but straightforward. Thus we may now define J(Ψ₀) as follows:

J(Ψ₀) = [Tr{h T P₁ T⁺ + (1/2) Z T⁺} − ε⁰]² + [1 − Σ_i C_i²]² + β Tr{(T⁺S T − 1)⁺ (T⁺S T − 1)}

where Z is constructed by contracting the two-electron integrals ⟨··|1/r₁₂|··⟩ over the AO indices (q, r, s) with the MO coefficient matrix T and the two-particle density matrix P₂, the summations running over the MO indices j, k, l and the AO indices q, r, s.
P₁ and P₂ are the one- and two-electron density matrices defined by Ψ₀ in the MO basis. One notes that the linear and the non-linear parts of the variations have not been decoupled in this approach. One interesting consequence of this has been the absence of the convergence problems that often haunt first-order MC-SCF theory, arising from the neglect of coupling between the orbital and configuration spaces. This has been demonstrated elsewhere [16]. Indeed, it would be worthwhile to explore the idea further, as the alternative route involving quadratic MC-SCF is much more complicated and difficult to organize.
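The only non-standard ingredient in equation (31) is the orthonormality penalty on the MO coefficient matrix T. The Python fragment below is our own illustration of how such a penalized cost could be assembled; scf_energy, the reference energy E_ref and the weight beta are placeholders supplied by the user, not quantities taken from the text.

import numpy as np

def orthonormality_penalty(T, S):
    """Tr{(T^+ S T - 1)^+ (T^+ S T - 1)}: zero when the MOs in the columns of T
    are orthonormal with respect to the AO overlap matrix S."""
    D = T.conj().T @ S @ T - np.eye(T.shape[1])
    return np.real(np.trace(D.conj().T @ D))

def scf_cost(T, S, scf_energy, E_ref, beta):
    """Cost of the type of Eq. (31): squared deviation of the energy from a
    reference value plus the orthonormality penalty.  `scf_energy(T)` is a
    user-supplied routine returning the energy for MO coefficients T."""
    return (scf_energy(T) - E_ref) ** 2 + beta * orthonormality_penalty(T, S)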
6. POSTSCRIPT
Before concluding the paper, we would like to turn back and recall that the Morrison function technique involves a dynamic updating of the lower bound to the constrained energy. When the global minimum is known, or a very good lower bound estimate is available, E_L may be held fixed without much difficulty. It is more common, however, to encounter problems for which neither the global minimum nor a very good estimate of E_L is known. In that case, dynamic updating of E_L becomes imperative. Even otherwise, a carefully designed updating schedule for E_L may improve the performance of the method, as the following example shows. Figure 8(a) shows the energy-step profile for the minimization of the ground state energy of the He atom using the 10-term wavefunction (STOs) when a fixed E_L = −2.90 is used. Figure 8(b) displays the corresponding energy profile when E_L is updated by using

E_L^{i+1} = E_L^i + a R_i^{1/2}   (32)

with a = 0.001 and E_L^1 = −2.90 a.u.

Figure 8. Profiles of the ground state energy of He obtained by SAM using the 10-term wavefunction for (a) fixed-E_L and (b) updated-E_L annealing runs.
A quicker descent to the global minimum can be noticed in the second case. If E_L^{i+1} exceeds the current value of the energy following the updating at any particular step, one must reset its value by subtracting a 'shifter' of appropriate magnitude.
Acknowledgements
P.D. wishes to thank the C.S.I.R., Govt. of India, New Delhi, for the award of a Research Associateship. We express our sincere appreciation for the encouragement and constructive criticism of the work by our colleagues, Prof. M. Chowdhury, Prof. D. Mukherjee and Dr. D.S. Ray of I.A.C.S., Calcutta. We wish to thank Prof. C.K. Mazumdar, S.N. Bose National Centre for Basic Sciences, Calcutta, for his kind interest.
REFERENCES
1.
S.T. Epstein, Variational Methods in Quantum Chemistry, Academic Press, New York, 1974.
2.
E.A. Hylleraas and B. Undheim, Z. Phys., 65 (1930) 759.
3.
J.K.L. Macdonald, Phys. Rev., 43 (1933) 830.
4.
H.B. Schlegel, Adv. Chem. Phys., Ed. K.P. Lawley, Vol. LXVII, 1986, pp. 249-286.
5.
D. Garton and B.T. Sutcliffe, Specialist Periodical Reports, Chem. Soc. London, Theoretical Chemistry, Vol. 1 - Quantum Chemistry.
6.
S. Kirkpatrick, C.D. Gelatt and M.P. Vecchi, Science, 220 (1983) 671.
7.
S. Kirkpatrick, J. Stat. Phys., 34 (1984) 975.
8.
R. Car and M. Parrinello, Phys. Rev. Lett., 55 (1985) 2471; Chem. Phys. Lett., 139 (1987) 540; Phys. Rev. Lett., 60 (1988) 204.
9.
D.A. Estrin, C. Tsoo and S.J. Singer, J. Chem. Phys., 93 (1990) 7201.
10.
H. Chacham and J.R. Mohallem, Mol. Phys., 70 (1990) 391.
11.
P. Dutta and S.P. Bhattacharyya, Chem. Phys. Lett., 162 (1989) 67.
12.
P. Dutta and S.P. Bhattacharyya, Chem. Phys. Lett., 167 (1990) 309.
13.
P. Dutta and S.P. Bhattacharyya, Phys. Lett. A, 148 (1990) 331.
14.
P. Dutta, D. Mazumdar and S.P. Bhattacharyya, Chem. Phys. Lett., 181 (1991) 67.
15.
P. Dutta and S.P. Bhattacharyya, Chem. Phys. Lett., 184 (1991) 330.
16.
P. Dutta and S.P. Bhattacharyya, Chem. Phys. Lett., 199 (1992) 169.
17.
A.V. Fiacco and G.P. McCormick, Non-Linear Programming: Sequential Unconstrained Optimization Techniques, Wiley, New York.
18.
D.D. Morrison, SIAM J. Numer. Anal., 5 (1968) 83.
19.
J. Hendeković, Chem. Phys. Lett., 90 (1982) 198.
20.
S.H. Gould, Variational Methods for Eigenvalue Problems, University of Toronto Press, Toronto, 1966.
21.
H.G. Miller and R.M. Dreizler, Nucl. Phys. A, 316 (1979) 32.
22.
H.-J. Werner, Adv. Chem. Phys., 69 (1987) 1-62; R. Shepard, ibid., pp. 63-200.
23.
R. McWeeny and B.T. Sutcliffe, Methods of Molecular Quantum Mechanics, Academic Press, New York, 1969.
24.
H.J. Werner and P.J. Knowles, J. Chem. Phys., 82 (1985) 5053.
25.
D.L. Yeager and P.J. Jorgensen, J. Chem. Phys., 74 (1981) 7549.
26.
P.J. Knowles and H.J. Werner, Chem. Phys. Lett., 115 (1985) 259.
27.
P. Dutta and S.P. Bhattacharyya, J. Comput. Phys. (submitted, 1994).
28.
P. Dutta and S.P. Bhattacharyya, Chem. Phys. Lett., 226 (1994) 78.
29.
A.A. Frost, J. Chem. Phys., 10 (1942) 240.
30.
F.H. Read, Chem. Phys. Lett., 12 (1972) 549; J. Phys. B, 5 (1972) 1359.
31.
H.L. Davis, J. Chem. Phys., 39 (1963) 1827.
32.
W. Heitler and F. London, Z. Physik, 44 (1927) 455.
33.
S.C. Wang, Phys. Rev., 31 (1928) 579.
34.
S. Weinbaum, J. Chem. Phys., 1 (1933) 593.
35.
J.C. Slater, J. Chem. Phys., 19 (1951) 220.
Chapter 19
Annealing to a moving target: first-principles molecular dynamics
S. J. Singer*
Department of Chemistry, Ohio State University, 120 W. 18th Ave., Columbus, Ohio 43210, USA

Overwhelmingly, most of the chemical transformations performed in the laboratory take place in a condensed phase environment - in solvent, in bulk crystalline or amorphous solids, in porous media, or on surfaces. The same is not true of chemical transformations modelled on computers, which are almost always studied apart from the condensed phase environment. There are obvious intellectual and practical reasons for this disparity. Intellectually, it is attractive to examine parts of the problem before assembling the whole, and this applies to studying chemical reactants as isolated molecules without solvent, or alternatively studying the collective properties of the condensed phase without the reactants. Practically, both electronic structure calculations and condensed phase simulations are computationally demanding. The cost of electronic structure calculations scales quite rapidly with the number of electrons, so one is forced to accept less accurate methods with increasing system size. Condensed phase simulations scale as the square of the number of particles, a scaling generally much kinder than that of electronic structure calculations, but this is achieved by using highly simplified interaction potentials. This review covers the emerging area where electronic structure and condensed phase simulation techniques are combined with the help of simulated annealing. This is essential in situations where the roles of reactant electronic structure and the solvent medium in determining the course of a chemical reaction cannot be disentangled, or where the energy of condensed phase arrangements cannot be broken into two-, three-, or other few-body contributions. In these cases, the potential energy surface governing condensed phase behavior cannot be pre-calculated as input for a statistical simulation. Instead the Born-Oppenheimer energy must be calculated as the particles explore a multitude of thermally accessible configurations. A true marriage of electronic structure and many-body simulation is needed before laboratory experiments can be replaced with computer experiments. The best present-day efforts, which do include significant scientific accomplishments, fall short of this lofty goal, but do indicate what continued progress can achieve. The progress reviewed in this article is largely concerned with classical motion on Born-Oppenheimer
*S.J.S. gratefully acknowledges the support of NSF grant CHE-9115615, and the contributions of Chiachin Tsoo, Dario Estrin and Li Liu to my understanding of this subject. The assistance of Kathryn J. Degray with bibliographic matters is greatly appreciated.
surfaces, thereby ignoring quantum aspects of nuclear motion as well as electronically non-adiabatic effects, but still captures a wealth of important chemical phenomena. Even with important simplifications, the cost of computing the Born-Oppenheimer electronic energy, and its derivatives, in concert with a molecular dynamics simulation is staggering. Typically n_config ≈ 10⁴-10⁷ configurations are generated in a condensed phase simulation. The cost of an ab initio statistical simulation would be the cost of an electronic structure calculation at a fixed nuclear configuration multiplied by n_config if no other savings could be realized. Indeed, dynamical simulations in which the Born-Oppenheimer potential energy surface was generated at each step by standard means have been performed for semi-empirical [1-5] and ab initio [6-9] electronic structure models. A fundamental insight of Car and Parrinello [10] was that subsequent configurations generated in a molecular dynamics simulation are quite similar, differing only by an amount that vanishes linearly with the time step δt. The electronic wave functions should therefore also be similar, suggesting an iterative method to solve the electronic structure problem at each time step with the wave function at the previous step as an initial guess. Iterative methods generally reduce computational cost by a factor that scales with the number of parameters in the problem. For example, full matrix diagonalization scales as n³, where n is the order of the matrix. Iterative diagonalization techniques scale as n². The savings upon using an iterative technique hardly compensate for the cost of incorporating an ab initio component into the statistical simulation, n_config × (cost of an electronic structure calculation at a fixed nuclear configuration), but iterative methods do make the difference between "highly demanding" and "computationally intractable". The particular iterative technique chosen by Car and Parrinello to iteratively solve the electronic structure problem in concert with nuclear motion was simulated annealing [11]. Specifically, variational parameters for the electronic wave function, in addition to nuclear positions, were treated like dynamical variables in a molecular dynamics simulation. When the electronic parameters are kept near absolute zero in temperature, they describe the Born-Oppenheimer electronic wave function. One advantage of the Car-Parrinello procedure is rather subtle. Taking the parameters as dynamical variables leads to robust prediction of values at a new time step from previous values, and to cancellation of errors in the value of the nuclear forces. Another advantage is that the procedure, as is generally true of simulated annealing techniques, is equally suited to both linear and non-linear optimization. If desired, both linear coefficients of basis functions and non-linear functional parameters can be optimized, and arbitrary electronic models employed, so long as derivatives with respect to the electronic wave function parameters can be calculated. There is also a drawback to treating electronic parameters as dynamical variables. Energy flow between the physically meaningful dynamical variables, the nuclear positions, and the auxiliary dynamical variables introduced for computational reasons, the electronic wave function parameters, must be kept to a minimum.
The arbitrary masses assigned to the wave function parameters as additional dynamical variables are adjusted so that the characteristic frequency of their motion is sufficiently high in comparison to nuclear motion that an adiabatic separation between physical and auxiliary variables is effected, and energy transfer between the two sets of dynamical variables is minimized. Unfortunately, the time step required for accurate integration of the equations of motion in a molecular dynamics simulation is inversely related to the highest frequency characteristic of the system. Hence the Car-Parrinello method requires time steps which are small in comparison to the time step needed for nuclear motion alone, forcing a partial retreat from the savings introduced by iterative optimization by simulated annealing. Alternatives to simulated annealing as an iterative method have been proposed and will be discussed below.
1. THE BASIC SCHEME
The use of simulated annealing in conjunction with dynamical simulations relies on the variational approach for electronic structure. In Schrödinger equation-based methods [12], parameters of the electronic wave function |Ψ⟩ are obtained by minimization of

E[Ψ] = ⟨Ψ|H|Ψ⟩ / ⟨Ψ|Ψ⟩ .   (1)

The total electronic wave function is almost always represented in terms of one or many Slater determinants of 1-electron orbitals, φ₁, φ₂, ..., although multi-electron orbitals have been considered from time to time in the ab initio community [13-20] and have even been calculated by simulated annealing techniques [21]. Using an orbital description, the function to be minimized takes the form

E[Ψ] = E[φ₁, φ₂, ...] .   (2)

Density functional techniques are, in principle, based on minimization of the energy as a functional of the electron density ρ(r). In practice the density is represented in terms of Kohn-Sham orbitals, and therefore the implementation takes the same broad form [Equation (2)] as Schrödinger-based methods. We will forgo extensive discussion of specific forms of E[Ψ] until Section 2 below, and consider simple examples to illustrate the use of simulated annealing here. A practical representation of the full wave function, or of the constituent orbitals, involves basis functions. The electronic wave function (or electron density) is parameterized by linear basis function coefficients or nonlinear parameters, such as the positions or widths of Gaussian basis functions. The array c will denote the collection of all wave function parameters, both linear and nonlinear, unless otherwise specified. The ground electronic Born-Oppenheimer surface is given by

E₀(R) ≈ min_c E(c) .   (3)

The Born-Oppenheimer energy depends on the nuclear coordinates R because the electronic Hamiltonian H or density functional depends on R through Coulombic electron-nuclear interactions. An approximate equality "≈" is used in Equation (3) as a reminder that truncation of the basis set expansion yields only approximate electronic energies. Excited electronic energy levels can also be calculated variationally as long as orthogonality to the ground state is enforced. In the simulated annealing method, minimization of E(c) is mapped onto a statistical mechanical problem by considering E(c) to be the potential energy governing a fictitious thermal system. If this fictitious thermal system can be brought to thermal equilibrium at sufficiently low temperature, then the variables c will be brought to the minimum of E(c), i.e. to the Born-Oppenheimer surface.
Figure 1. Statistical sampling of the space of wave function parameters c when the energy function E(c) is treated like the potential energy governing a statistical system. A sequence of decreasing temperatures T3 < T2 < T1 is shown.
The Born-Oppenheimer parameters which minimize E(c) can be found using any statistical mechanical simulation method by which the system governed by E(c) can be brought to equilibrium as a function of steadily decreasing temperature. For example, either Monte Carlo [22-26] or molecular dynamics [10] has been used. Sufficiently slow cooling schedules will avoid trapping the fictitious system in a local minimum of the potential E(c).
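As a minimal illustration of treating E(c) as the potential energy of a fictitious thermal system, the Python sketch below anneals the linear coefficients of an arbitrary 3x3 model Hamiltonian by the Metropolis method; the matrix, step size and cooling schedule are our own choices, not taken from the chapter.

import math
import random
import numpy as np

H = np.array([[-1.0, 0.3, 0.0],      # arbitrary model Hamiltonian matrix
              [ 0.3, 0.5, 0.2],
              [ 0.0, 0.2, 1.0]])

def energy(c):
    """Rayleigh quotient E(c) = <psi|H|psi>/<psi|psi> for psi = sum_k c_k phi_k
    in an orthonormal basis."""
    return (c @ H @ c) / (c @ c)

c = np.array([1.0, 0.0, 0.0])
E, T = energy(c), 1.0
for step in range(5000):
    trial = c + 0.1 * (np.random.rand(3) - 0.5)      # random reconfiguration of the parameters
    E_trial = energy(trial)
    if E_trial < E or random.random() < math.exp(-(E_trial - E) / T):
        c, E = trial, E_trial
    T *= 0.999                                       # slow cooling toward the ground state

print(E, np.linalg.eigvalsh(H)[0])   # annealed energy vs. exact lowest eigenvalue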
While simulated annealing may hardly be the method of choice for solving the electronic structure problem at a single nuclear configuration, we will see that it does become a viable method for determining the Born-Oppenheimer surface in an ab initio molecular dynamics simulation. Classical nuclear dynamics is governed by the Lagrangian

L = (1/2) Σ_{i=1}^{N} m_i Ṙ_i² − min_c E(c) .

Derivatives of min_c E(c) with respect to the nuclear positions are required at a sequence of incrementally changing nuclear configurations in a molecular dynamics simulation. The energy function E(c) depicted in Figure 2 varies smoothly with time t as the nuclear positions evolve.
Figure 2. The electronic energy E is shown schematically as a function of the set of wave function parameters c. E(c) changes with time t as the nuclei change position.
The global minimum of E (c) as a function of time is indicated as the Born-Oppenheimer trajectory in Figure 2.
Car and Parrinello proposed simultaneous molecular dynamics simulation of the physical variables, the nuclear coordinates R, and simulated annealing of the electronic parameters c, also by molecular dynamics simulation. The system is now governed by an extended Lagrangian,

L = (1/2) Σ_{i=1}^{N} m_i Ṙ_i² + (1/2) Σ_{k=1}^{M} m_k ċ_k² − E(c) .   (4)
Later, we shall find that this form may be usefully modified in several ways. The electronic parameters can be expected to follow a trajectory qualitatively like the one that oscillates around the global minimum in Figure 2. It is essential to assign a mass to the electronic parameters such that their oscillation around the Born-Oppenheimer trajectory is rapid in comparison to changes in E(c) brought about by evolution of the nuclear positions. The rapid motion of the electronic parameters is then adiabatically decoupled from nuclear motion, rendering energy flow from the physical (nuclear) coordinates to the electronic parameters negligible. Relaxation to the Born-Oppenheimer surface is instantaneous on the time scale for physical motion. We have optimistically shown that the simulated annealing trajectory has avoided a local minimum in Figures 1 and 2, and at low temperatures settled near the global minimum. Also shown rather optimistically in Figure 2 is a stable ordering of the minima of E(c) with time. In special situations, a local minimum can drop below the previous global minimum, resulting in a discontinuous change in the location of the global minimum with time [21]. In practice these are not serious difficulties.
1.1. An example simulated annealing calculation
To illustrate the above considerations, let us consider an artificial example. We will pretend that the electron in a hydrogen atom feels a harmonic force field. The electronic Hamiltonian (atomic units) is

H = −(1/2)∇² − 1/|r − R| + (1/2) r² ,   (5)

where upper case R indicates nuclear coordinates and lower case r electron coordinates. If forced to give a physical interpretation, one may picture the harmonic interaction as a pseudopotential generated by another particle at the origin. A basis of the 1s, 2s and 2p hydrogen atom eigenfunctions is chosen to solve the problem. We will consider classical motion of the hydrogen atom on the ground surface for purposes of illustration, even though quantum mechanics is, of course, more appropriate for a particle with so small a mass. Just to keep things simple, the trajectory is begun with the hydrogen atom displaced from the origin in the z-direction at Z = −5. The atom will oscillate around the origin along the z-axis. Only the 2p_z state, and not the 2p_x and 2p_y levels, will couple to the 1s and 2s states. Therefore, we can retain only three states in our calculation, the 1s, 2s, and 2p_z levels. The Born-Oppenheimer surfaces generated by this truncated basis
set expansion are shown in Figure 3 for displacements of the hydrogen atom away from the origin in the z-direction. The ground and excited states are distinctly anharmonic in the range of Z-motion considered below.
Figure 3. Born-Oppenheimer surfaces generated by the model electronic Hamiltonian in Equation (5) as the hydrogen atom is displaced from the origin in the z-direction. The inset at the right schematically shows the model, in which the electron is harmonically bound to a point at the origin of coordinates while the electron and proton interact via a Coulomb potential. The wave function is expanded as a linear combination of three basis functions, the hydrogen 1s, 2s and 2p_z eigenstates.
First let us use the electronic energy as written in Equation (1), ⟨Ψ|H|Ψ⟩/⟨Ψ|Ψ⟩, in the Lagrangian of Equation (4). The masses assigned to the electronic parameters, the m_k in (4), are all chosen to be equal. Solutions to the equations of motion with electronic parameter mass equal to 0.1 or 0.01 times the proton mass are shown in Figure 4. Clearly, convergence with respect to decreasing parameter mass, and hence increasing adiabatic separation of nuclear and wave function parameter motion, is rather slow. In both cases the proton trajectory diverges from the Born-Oppenheimer trajectory after a single oscillation, as energy rapidly flows from the nuclear degree of freedom to the electronic parameters. It can easily be shown [27] that (d²/dt²) Σ_k c_k² = 2 Σ_k ċ_k² > 0 in this simple example. Hence, once the coefficients start to acquire any appreciable velocity, the norm Σ_k c_k² rapidly increases, leading to still larger parameter magnitudes and velocities, draining energy from the physical system.
Figure 4. Nuclear trajectories generated by the simulated annealing molecular dynamics defined by the Lagrangian in Equation (4), using masses for all parameters equal to either 0.1 or 0.01 times the proton mass. For comparison, the exact nuclear trajectory on the Born-Oppenheimer surface is shown.
1.2. Holonomic constraints
Our first attempt to generate Born-Oppenheimer dynamics by simulated annealing, shown in Figure 4, can be improved in several ways. Car and Parrinello enforced orthonormality of the electronic orbitals as a holonomic constraint on the parameter dynamics. For N orbitals this gives N(N + 1)/2 holonomic constraints. For our single-electron model problem (5) there is only one constraint,

1 − c²_{1s} − c²_{2s} − c²_{2p_z} = 1 − Σ_{k=1}^{3} c_k² = 0 .   (6)

The classical trajectory is now determined by minimization of the action subject to the constraint, which is practically implemented using the technique of Lagrange undetermined multipliers. After the constraint times a multiplier, λ, is added to the Lagrangian,

L' = L + λ (1 − Σ_{k=1}^{3} c_k²)
   = (1/2) Σ_{i=1}^{N} m_i Ṙ_i² + (1/2) Σ_{k=1}^{3} m_k ċ_k² − E(c) + λ (1 − Σ_{k=1}^{3} c_k²) ,   (7)

the Euler equation for minimizing the action yields the following equations of motion for the parameters:

m_k c̈_k + ∂E/∂c_k + 2λ c_k = 0 .   (8)

When all the parameter masses are equal (m_k = m, k = 1, 2, ...) in this simple problem (and generally when linear wave function parameters are optimized), the Lagrange multiplier λ can be obtained explicitly. For linear wave function coefficients c_k, it is easily seen that

Σ_k c_k (∂/∂c_k) [⟨Ψ|H|Ψ⟩ / ⟨Ψ|Ψ⟩] = 0 .   (9)

Therefore, after multiplying the equation of motion (8) by c_k and summing over k (recalling that all the m_k are now assumed equal), we obtain

Σ_{k=1}^{3} c_k [m c̈_k + ∂E/∂c_k + 2λ c_k]   (10)
   = m Σ_{k=1}^{3} c_k c̈_k + 2λ   (11)
   = −m Σ_{k=1}^{3} ċ_k² + 2λ = 0 .   (12)

To obtain the last equality, we used

Σ_{k=1}^{3} c_k c̈_k + Σ_{k=1}^{3} ċ_k² = 0 ,   (13)

which follows from differentiating the constraint (6) twice with respect to time. Equation (12) is trivially solved for λ. Solution of the constrained equations of motion generated by (7) yields greatly improved results compared to Figure 4, as shown in Figure 5. Of course, the denominator ⟨Ψ|Ψ⟩ can now be dropped from E(c) under the holonomic constraint (6).
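A sketch of how the constrained equations of motion (8), with the multiplier of equation (12), can be integrated is given below in Python. It is our own illustration under stated assumptions: h_matrix(Z) is a crude stand-in for the 3x3 electronic Hamiltonian of the model (the true matrix elements of Equation (5) are not reproduced), a simple semi-implicit Euler step is used, and the wave function parameters are renormalized each step as a pragmatic substitute for an exact constraint algorithm.

import numpy as np

def h_matrix(Z):
    """Crude stand-in for the 3x3 electronic Hamiltonian at nuclear displacement Z
    (NOT the true matrix of Equation (5)): hydrogenic diagonal energies, a harmonic
    shift, and a Z-linear coupling involving the 2p_z state."""
    coupling = np.array([[0.0, 0.0, 0.7],
                         [0.0, 0.0, 0.3],
                         [0.7, 0.3, 0.0]])
    return np.diag([-0.5, -0.125, -0.125]) + 0.5 * Z**2 * np.eye(3) + Z * coupling

def energy_and_gradients(c, Z, dZ=1.0e-4):
    H = h_matrix(Z)
    norm = c @ c
    E = (c @ H @ c) / norm
    grad_c = 2.0 * (H @ c - E * c) / norm                 # dE/dc_k of the Rayleigh quotient
    dHdZ = (h_matrix(Z + dZ) - h_matrix(Z - dZ)) / (2.0 * dZ)
    grad_Z = (c @ dHdZ @ c) / norm                        # nuclear gradient of E(c; Z)
    return E, grad_c, grad_Z

def run(M=1836.0, m=10.0, dt=0.05, n_steps=4000):
    Z, Vz = -5.0, 0.0                                     # nucleus displaced along z, at rest
    c = np.linalg.eigh(h_matrix(Z))[1][:, 0]              # start on the Born-Oppenheimer surface
    vc = np.zeros(3)
    for _ in range(n_steps):
        E, grad_c, grad_Z = energy_and_gradients(c, Z)
        lam = 0.5 * m * (vc @ vc)                         # Lagrange multiplier, Eq. (12)
        vc += dt * (-(grad_c + 2.0 * lam * c) / m)        # parameter equation of motion, Eq. (8)
        Vz += dt * (-grad_Z / M)                          # Newton's equation for the nucleus
        c += dt * vc
        Z += dt * Vz
        c /= np.linalg.norm(c)                            # pragmatic re-imposition of constraint (6)
        vc -= (vc @ c) * c                                # keep parameter velocity tangent to it
    return Z, E

if __name__ == "__main__":
    print(run())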
Figure 5. Simulated annealing nuclear dynamics under the holonomic constraint (6). The parameter masses are set to half the proton mass, chosen, for artistic reasons, so that some small deviation from exact Born-Oppenheimer dynamics would be visible to the reader. Further reduction of the parameter mass would lead to nuclear dynamics indistinguishable from the exact Born-Oppenheimer trajectory, also shown in the figure.
Figure 6. Trajectories of the wave function parameters corresponding to the simulated annealing nuclear trajectory shown in Figure 5.
Figure 6 displays the trajectories of the wave function parameters corresponding to the nuclear trajectory shown in Figure 5. The high frequency component of the parameter trajectories is quite apparent in Figure 6. Even though the parameter trajectories in Figure 6 depart noticeably from the exact Born-Oppenheimer values, the corresponding nuclear trajectory in Figure 5 differs from the exact one to a much smaller extent. The rapidly oscillating error in the forces effectively cancels [28], and the effective force felt by the nuclei is an average force which is much closer to the exact Born-Oppenheimer forces [29]. The high frequency motion, essential for adiabatic separation of nuclear and parameter dynamics, extracts a computational price. Comparing Figures 5 and 6, it is clear that the time step for numerical solution of the equations of motion by finite difference methods must be substantially smaller than if the Born-Oppenheimer surface could be pre-computed by some other means. Tuckerman and Parrinello have recently proposed a multiple time-scale method to cope with the high frequency components of simulated annealing trajectories [30, 31].

1.3. Damped motion
Since the deviation from the Born-Oppenheimer trajectory exhibited by the unmodified dynamics in Figure 4 is caused by rapid increase in the kinetic energy of the wave function parameters, another means of improving on those results is to damp the parameter motion. Damped dynamics can be incorporated within the Lagrangian framework, leading to equations of motion for the parameters of the form
m_k c̈_k + ∂E/∂c_k + γ_k ċ_k = 0 .   (14)
Figure 7 demonstrates how damping the parameter dynamics improves the accuracy of the nuclear trajectory. The parameter trajectories under damped dynamics, shown in Figure 8, are quite smooth in comparison to the constrained dynamics (Figure 6). Fewer time steps are required to discretize the damped parameter trajectories, although small parameter masses cause numerical instability even though the trajectories are smooth. The characteristics of the damped equations of motion are revealed by a simple, coupled oscillator system. The variable z moves in the model potential,
min_c [ (1/2) z² + a c z + (1/2) c² ] = (1/2)(1 − a²) z² .   (15)
Hence z is considered a "nuclear" or "physical" coordinate while c represents an "electronic" coordinate.
Figure 7. Nuclear dynamics of the simplified model defined in Equation (5), in which the wave function parameters are determined by solution of the damped equation of motion (14). The parameters are chosen so that, for artistic reasons, the simulated annealing and exact Born-Oppenheimer trajectories are distinguishable to the reader. Further diminution of the parameter mass m and damping coefficient γ would bring the simulated annealing trajectory arbitrarily close to the exact result.
Figure 8. Electronic parameter trajectories generated by the damped equations of motion (14), corresponding to the nuclear trajectory shown in Figure 7.
Motion along the z coordinate will be generated by coupled classical equations for both z and c, with damping of the motion of c:

z̈ + z + a c = 0   (16)

m c̈ + c + a z = −γ ċ   (17)
Freedom to rescale the coordinates and the time variable has been used to set the force constants and one of the masses equal to unity (with no loss of generality). These equations can be converted to purely algebraic equations by Laplace transformation. Inverse Laplace transformation yields trajectories in the form
z(t) = Σ_{i=1}^{4} Z_i e^{s_i t}   (18)

c(t) = Σ_{i=1}^{4} C_i e^{s_i t}   (19)

where the s_i are solutions of the quartic equation

0 = 1 − a² + γ s + (1 + m) s² + γ s³ + m s⁴ .   (20)
The constants Z_i and C_i in Equations (18)-(19) are residues of rational expressions in s at the poles given by the quartic equation (20). Detailed expressions for Z_i and C_i, which depend on the initial conditions, will not be given here, although they can be derived without difficulty. The behavior of the roots of (20) as a function of the parameters a, m, and γ is quite rich. While a, the coupling between the "nuclear" coordinate z and the "electronic" coordinate c, may be large, the region of interest is that of small m and γ. The leading behavior of the four roots of (20) for small m and γ is given by [32]
s₁,₂ = ±iω ∓ i m ω a²/2 − γ a²/2 + ... ,

s₃,₄ = −γ/(2m) ± i/√m + ... ,   (21)

where ω ≡ √(1 − a²) is the exact frequency expected for the z motion when the parameter c follows the minimum specified in (15). The first two roots are the "physical" ones, exhibiting oscillatory behavior whose frequency differs from the exact frequency ω by quantities which vanish linearly with m, and with damping that vanishes linearly with γ. The other two roots are high-frequency transients with large decay constants. The coefficients Z_i associated with the transient roots in (18) also grow small as m and γ decrease. Energy loss from the "physical" coordinate z can be made arbitrarily small by appropriate choice of γ and m, although some finite error is inevitably associated with the damped motion (just as some small but finite energy exchange between nuclear coordinates and electronic parameters must inevitably be part of the original Car-Parrinello
scheme). Arbitrarily small m or γ will lead to numerical instabilities but, as shown in Figure 7, workable combinations of parameters can be found. A practical finite-difference scheme for propagating the damped equations of motion in realistic situations has been given by Tsoo et al. [33].
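The coupled-oscillator model of equations (15)-(17) is easy to integrate directly, which makes the roles of m and γ concrete. The Python sketch below is our own illustration; it integrates equations (16)-(17) with a simple semi-implicit Euler scheme and compares the observed period of z with the exact value 2π/ω, ω = √(1 − a²).

import numpy as np

DT = 0.01

def damped_model(a=0.5, m=0.01, gamma=0.05, dt=DT, n_steps=20000):
    """Integrate  z'' + z + a*c = 0  and  m*c'' + c + a*z = -gamma*c'
    (Eqs. 16-17) with a simple semi-implicit Euler scheme."""
    z, vz = 1.0, 0.0
    c, vc = -a, 0.0                  # start c at the minimum of Eq. (15) for z = 1
    traj = []
    for _ in range(n_steps):
        vz += dt * (-z - a * c)
        vc += dt * (-c - a * z - gamma * vc) / m
        z += dt * vz
        c += dt * vc
        traj.append(z)
    return np.array(traj)

z_t = damped_model()
crossings = np.where(np.diff(np.sign(z_t)) != 0)[0]       # zero crossings of z(t)
period = 2.0 * DT * np.mean(np.diff(crossings))           # two crossings per period
print(period, 2.0 * np.pi / np.sqrt(1.0 - 0.5**2))        # compare with 2*pi/omega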
1.4. Further refinements of simulated annealing dynamics
Successful first-principles molecular dynamics simulations in the Car-Parrinello framework require a low temperature for the annealed electronic parameters while maintaining approximate energy conservation of the nuclear motion, all without resorting to unduly small time steps. The most desirable situation is a finite gap between the frequency spectrum of the nuclear coordinates, as measured, say, by the velocity-velocity autocorrelation function,

C_{\mathrm{nuclear}}(\omega) = \int_0^{\infty} dt \, \cos(\omega t) \sum_{i=1}^{N} \langle \dot{\mathbf{R}}_i(t) \cdot \dot{\mathbf{R}}_i(0) \rangle ,    (22)
and the frequency spectrum of the electronic parameters, measured by the corresponding autocorrelation function,

C_{\mathrm{elec}}(\omega) = \int_0^{\infty} dt \, \cos(\omega t) \sum_{k=1}^{M} \langle \dot{a}_k(t) \, \dot{a}_k(0) \rangle .    (23)
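For readers who wish to examine such spectra from their own simulation output, a rough MATLAB sketch of Equation (23) is given below. It is not taken from the chapter; vdot (a matrix of sampled parameter velocities, one column per parameter) and the time step dt are assumed inputs, and the simple lag-sum estimator used here is only one of several possible choices.

% Sketch: estimate C(omega) of Eq. (23) from sampled velocities vdot (nstep x npar), step dt.
% vdot and dt are assumed inputs; this code is illustrative, not from the chapter.
nstep  = size(vdot,1);
maxlag = round(nstep/4);
acf = zeros(1, maxlag+1);
for lag = 0:maxlag
    prods = vdot(1:nstep-lag,:) .* vdot(1+lag:nstep,:);
    acf(lag+1) = mean(sum(prods,2));      % estimate of <sum_k adot_k(t+lag) adot_k(t)>
end
omega = linspace(0, pi/dt, 200);
C = zeros(size(omega));
for j = 1:length(omega)
    C(j) = dt * sum(cos(omega(j)*(0:maxlag)*dt) .* acf);   % discrete cosine transform
end
plot(omega, C), xlabel('omega'), ylabel('C(omega)')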
Overlap of C_nuclear(ω) and C_elec(ω), as shown in Figure 9a, leads to rapid energy exchange between nuclear coordinates and electronic parameters in the Car-Parrinello scheme. If C_elec(ω) extends to very high frequencies, as shown in Figure 9b, the required time step becomes prohibitively small. The ideal situation is shown in Figure 9c. The situation depicted in Figure 9a arises in simulations of metallic systems, and indeed energy conservation is poor without special precautions. In the first attempts to simulate metallic systems [34, 35] the electronic coordinates, to which energy flows as the simulation progresses, were periodically quenched to the Born-Oppenheimer surface. The nuclear coordinates were maintained at constant temperature using the method of Nosé [36, 37]. By introducing an auxiliary variable which represents coupling of the system to a heat bath, the Nosé molecular dynamics algorithm yields time averages representative of a canonical ensemble. Here the Nosé thermostat supplies energy which is constantly drained to the electronic coordinates. Later, Blöchl and Parrinello [38] introduced a more satisfactory procedure for metallic systems by simultaneously thermostatting the nuclear coordinates to a physical temperature while maintaining the electronic parameters at low temperature with a separate thermostat. In this they were following the lead of Sprik [39], who earlier had separately thermostatted nuclear degrees of freedom and coordinates which represented fluctuating electronic polarization degrees of freedom. Previously, Nosé had performed classical simulations with two separate thermostats [40]. Fois et al. have considered the use of thermostats in spin density functional theory calculations [41].
Figure 9. Schematic frequency spectra for nuclear and electronic parameter motion.
As long as a frequency gap between nuclear and electronic motion exists, decreasing the electronic parameter masses can increase the gap so as to decouple nuclear and electronic motion. However, this may well cause the frequency spectrum of the electronic parameters to spread to unreasonably high values, as in Figure 9b, necessitating very small time steps. Parameters associated with certain types of basis functions, such as the exponential coefficients of Gaussian basis functions, are especially prone to high frequency behavior. The electronic parameter masses m_k can be individually adjusted to decrease the width of C_elec(ω). Assuming that the Hessian h_{kk'} = ∂²E/∂a_k∂a_{k'} is diagonally dominant, Pederson et al. adjusted parameter masses to approximately equalize the time scale for motion of all parameters [42]. They demonstrated an improvement in the efficiency with which a floating Gaussian basis set could be annealed. To anneal a Hartree-Fock wave function at fixed atomic positions, Chacham and Mohallem chose parameter masses based on diagonal Hessian elements in a similar fashion [43].
Sometimes the effect of off-diagonal elements of the Hessian is significant. This occurs, for example, when pairs of floating spherical Gaussians are used to represent p-orbitals [33]. In this case, in-phase and out-of-phase motion of parameters associated with each lobe of the p-orbital have very different frequencies. When the effect of the full Hessian matrix must be incorporated to decrease the width of the electronic parameter frequency spectrum, the parameter kinetic energy can be generalized to include a mass matrix [33],
\frac{1}{2} \sum_{k=1}^{M} m_k \dot{a}_k^2 \;\longrightarrow\; \frac{1}{2} \sum_{k,k'=1}^{M} \dot{a}_k \, m_{k,k'} \, \dot{a}_{k'} .    (24)
To the extent that the motion of the parameters is dominated by harmonic oscillations, choosing m_{k,k'} proportional to the Hessian h_{k,k'} brings all electronic parameter frequencies into alignment. Some of the eigenvalues of h_{k,k'} may be close to zero, indicating coordinates to which the electronic energy is insensitive. To avoid instabilities associated with these otherwise unimportant modes, Tsoo et al. used as the mass matrix a modified Hessian formed by imposing a lower bound on its eigenvalues [33]. The mass matrix procedure can alternatively be viewed as assigning ordinary "diagonal" masses to approximate normal modes of the electronic parameter motion.
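As an illustration of the eigenvalue-floor idea, the modified Hessian could be assembled as sketched below. This is not the actual implementation of Ref. [33]; H stands for the electronic-parameter Hessian and hmin for the chosen lower bound, both assumed inputs.

% Sketch: modified Hessian used as a mass matrix, with a lower bound on its eigenvalues.
% H (symmetric Hessian of the electronic energy) and hmin are assumed inputs.
[V, D] = eig((H + H')/2);        % symmetrize and diagonalize
d = max(diag(D), hmin);          % impose the eigenvalue floor
M = V * diag(d) * V';            % mass matrix for the kinetic energy of Eq. (24)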
2. MODELS OF ELECTRONIC STRUCTURE AND APPLICATIONS
There is currently no satisfactory general method for electronic structure calculations. Except for the most trivial situations, every electronic structure method entails significant error. Based on accumulated experience with different levels of theory, one can usually place "theoretical error bars" on the results of a calculation and hopefully avoid over-interpreting those results. Each class of calculation discussed below entails different characteristic difficulties. The review of models and applications in this section proceeds from few- to many-electron systems. This ordering is not historical, since the first first-principles molecular dynamics simulations were performed on larger systems. However, applications to solids, amorphous materials and liquids have been stressed in other reviews [44-47], so we will take the opportunity to place more emphasis on systems of chemical interest.
2.1. Solvated electrons and other few-electron systems
First-principles simulations of one- or few-electron systems involve quantum systems of the lowest dimensionality we will consider in this section. These might entail the smallest error intrinsic in the calculation of the wave function. However, the blessing of small dimensionality is offset by the tendency of systems in this class to involve electrons in highly irregular potentials. Furthermore, the potential is usually a pseudopotential which describes the effect of atomic cores and/or solvent molecules on the quantum system of interest. Unfortunately, pseudopotentials introduce errors which are difficult to calibrate.
Selloni et al. [48] were the first to simulate adiabatic ground state quantum dynamics of a solvated electron. The system consisted of the electron, 32 K⁺ ions, and 31 Cl⁻ ions, with electron-ion interactions given by a pseudopotential. These simulations were unusual in that what has become the standard simulated annealing molecular dynamics scheme, described in the previous section, was not used. Rather, the wave function of the solvated electron was propagated forward in time with the time-dependent Schrödinger equation,

i\hbar \, \frac{\partial \Psi}{\partial t} = \hat{H} \Psi ,    (25)
relying on the intrinsic quantum adiabaticity of the electronic motion to keep the electron in its ground state. The diffusion of the electron solvated in the liquid metal halide solution was estimated. Subsequent work in this vein [49-51] showed that singlet electron pairs at low concentration tended to merge into a localized complex known as a bipolaron, while electron pairs in a triplet state were unbound. A spin density functional theory (SDFT) Hamiltonian [52-54] in the local spin density approximation (LSDA) governed the dynamics for more than one solvated electron. Further increasing the electron concentration induced a transition to metallic behavior. (Since explicit orthonormalization of orbitals via holonomic constraints is not required under unitary time propagation according to Equation (25), time-dependent density functional theory has been suggested as an alternative to the Car-Parrinello method for many-electron systems [55].) Deng et al. treated solvated electrons and bipolarons in ammonia with SDFT-LSDA [56, 57]. They used the Car-Parrinello procedure with holonomic constraints described in Section 1. Deng et al. found that the electrons of the bipolaron complex in ammonia were spatially separated by a small distance, making that complex peanut-shaped in contrast to the more spherical bipolaron observed in molten KCl [49-51]. In accordance with the KCl studies, increasing the concentration of solvated electrons induced a transition to metallic behavior. Sprik and Klein studied diffusion and energy fluctuations characteristic of adiabatic electronic motion in ammonia by annealing the positions and widths of floating Gaussian basis functions [58-60]. They kept the linear coefficients of all the floating Gaussian basis functions fixed and equal, to prevent the numerical instabilities which would occur if the coefficient of one of the Gaussians became very small, thereby making the energy insensitive to the other parameters of that Gaussian. Only 16 floating Gaussians [59, 60] were needed to converge the ground state properties of the solvated electron wave function. Tsoo et al. [33, 61] simulated NaArₙ clusters using floating Gaussian basis functions for the sodium valence electron and incorporating the effects of the Na⁺ core and rare gas solvent through pseudopotentials [62]. The adiabatic ground electronic state and the first three excited states, i.e. the sodium 3S and 3P levels distorted by the cluster environment, were generated over lengthy simulations to gauge the effect of cluster isomerization and phase transitions on the sodium absorption spectrum. Excited states were prevented from variationally collapsing to the lower-lying levels without holonomic constraints. Instead,
an extra term added to the electronic energy of Born-Oppenheimer level β, of the form

\frac{1}{2} \sum_{\alpha < \beta} \lambda_{\beta\alpha} \, \frac{|\langle \psi_\alpha | \psi_\beta \rangle|^2}{\langle \psi_\beta | \psi_\beta \rangle \, \langle \psi_\alpha | \psi_\alpha \rangle} ,    (26)
converted a saddle point in Hilbert space for excited state β into a global minimum [33]. In the above equation, λ_βα/2 must be chosen greater than the splitting between Born-Oppenheimer levels β and α. Martyna et al. calculated adiabatic electronic states for metal-ammonia [63] and alkali-rare gas [64] clusters by solving the Schrödinger equation on a grid for configurations selected from thermal path integral simulations. They found that the metal valence electron orbital in M(NH₃)ₙ, M = Li, Cs, Sr⁺, clusters expands from its characteristic atomic size to a Rydberg-like orbital as the first solvent shell around the metal is completed. Martyna et al. obtained the absorption spectra for LiXeₙ and CsXeₙ clusters by similar techniques [64].

2.2. Ab initio Schrödinger equation based methods
Enormous progress has been made in the ab initio calculation of electronic properties of small molecules, thanks to algorithmic developments, exponential improvement of hardware performance with time, and an accretion of shared experience with widely used techniques. Electronic energy changes are often a dominant factor determining reaction dynamics, and therefore a substantial fraction of all high performance computational resources goes to the calculation of Born-Oppenheimer energies at selected molecular configurations. For systems on the order of 10⁰ atoms the situation is favorable for ground electronic states and even excited states. It is practical to use Schrödinger equation based methods [65] - Hartree-Fock (HF), nth-order Møller-Plesset perturbation theory (MPn), configuration interaction (CI), coupled cluster (CC) theory [12], and others - to obtain accurate Born-Oppenheimer potential energy surfaces for small molecules containing first row atoms. The situation deteriorates quickly as atoms beyond the first row of the periodic table are included. There one must turn to effective core potential techniques [66]. For systems on the order of 10¹ atoms, the options become limited to HF or low order perturbation theory. Schrödinger-based calculations are rare for systems of 10² atoms. The assessment given in the previous paragraph applies to ab initio calculations for a single or small number of nuclear configurations. Even taking advantage of iterative techniques, the level of treatment and system size have to be scaled back for practical reasons in a molecular dynamics simulation. To date, all Schrödinger-based ab initio molecular dynamics simulations for many-electron systems have been very limited in the number of electrons explicitly treated and the degree to which nuclear motion is allowed, if at all. The first ab initio calculations using simulated annealing to solve for the electronic wave function appeared in 1990. Chacham and Mohallem compared molecular dynamics simulated annealing and steepest descent quenches with traditional iterative diagonalization for the HF helium atom ground state [43]. A later contribution from this group considered simultaneous HF electronic structure calculation and geometry optimization via simulated
annealing dynamics for H₂, LiH and Li₂ [67]. Employing molecular dynamics simulated annealing, Field simultaneously optimized the electronic wave function and geometry for small peptides using an AM1 semiempirical Hamiltonian [68], and for the water and formamide molecules using HF theory [69]. Both of these studies involved optimization of the linear expansion coefficients of Gaussian basis functions. The work of Dutta and Bhattacharyya employed Metropolis Monte Carlo to anneal electronic parameters for ground and excited states of 2-electron atoms. They calculated non-linear exponential parameters within a HF framework [23], and, for a two-electron CI wave function, both linear [24] and non-linear [25] parameters. Simultaneous geometry optimization [70] and Multiconfigurational Self-Consistent Field (MC-SCF) wave functions [26] were treated by Monte Carlo simulated annealing with semi-empirical electronic structure Hamiltonians. Estrin et al. obtained CI wave functions for 2-electron atoms and H₂ by molecular dynamics simulated annealing [21]. They optimized both linear and non-linear parameters of explicitly correlated multi-dimensional Gaussian basis functions. Estrin et al. used a Hartree-Fock treatment for the ground and excited states of the 3 valence electrons of an aluminum atom in the AlAr₁₂ cluster [71]. The aluminum core and the rare gas atoms were modelled with non-local pseudopotentials [62]. Simulated annealing Hartree-Fock dynamics revealed that the AlAr₁₂ cluster with a central aluminum atom broke icosahedral symmetry, as expected for this Jahn-Teller active system, preferring configurations with little or no symmetry lying roughly 600 cm⁻¹ in energy below the icosahedral configuration. Treating the 4 valence electrons of Na₄ explicitly, Hartke and Carter used Hellmann-Feynman forces derived from a Hartree-Fock wave function [72], and later full forces derived from a Generalized Valence Bond (GVB) wave function [73], to propagate test trajectories for Na₄ on singlet and triplet Born-Oppenheimer surfaces. Hartke and Carter also used a GVB wave function for the 5 valence electrons of Na₅ in a similar fashion [74]. Hammes-Schiffer and Andersen have developed a Generalized Hartree-Fock (GHF) method in which each orbital is an arbitrary linear combination of up- and down-spin components [75]. They implemented Car-Parrinello molecular dynamics for Liₙ, n < 5, clusters using a frozen core approximation, a re-casting of the GHF energy expression to exploit the spatial localization of the cores, and fits to the electron repulsion integrals. We also note that ab initio techniques implemented in a standard fashion, that is, without simulated annealing for iterative determination of the electronic wave function, have been combined with geometry optimization or molecular dynamics simulation. Applications include molecular dynamics of dense helium [7] and Siₙ, n = 4, 6, 8, 10 [8] using gradients calculated at the HF level, and simulated annealing geometry optimization of Li₅H at the HF and MP2 levels [9].
2.3. Density functional techniques
Many-particle simulations typically involve on the order of 10²-10³ particles. At present, density functional theory (DFT) [52-54] is the only practical electronic structure
method to treat the electrons associated with this many particles in a molecular dynamics simulation. The wealth of first-principles simulation results using DFT has confirmed the utility of Car and Parrinello's melding of molecular dynamics and simulated annealing. Long confined to solid state total energy calculations, the application of DFT to molecular electronic structure has recently attracted great attention [76-80]. In principle DFT yields the Born-Oppenheimer energy by minimization of a functional of ρ(r), the electron density. In practice, one is forced to work with more complex quantities, a series of one-electron orbitals {ψᵢ} [81]. Even though orbitals are the practical basis of both DFT and Schrödinger equation-based techniques, two-electron, 4-index electron repulsion integrals are not required in DFT, and consequently DFT calculations are generally much less taxing than Schrödinger equation-based techniques. Several groups are developing orbital-free versions of density functional theory [82-84]. This would bring a major reduction in the computational cost of DFT, but requires a reliable kinetic energy functional of the electron density. The original Car-Parrinello implementation of simulated annealing molecular dynamics [10] employed the familiar local density approximation (LDA) for the exchange and correlation part of the electronic density functional. Typically LDA-DFT is implemented with plane wave basis functions, allowing transformation between position- and Fourier-space representations of the orbitals by Fast Fourier Transform (FFT). Matrix elements can be computed in the most convenient representation. Recent developments particularly relevant for molecular systems are the introduction of gradient corrections to LDA-DFT and methods to cope with the short-range nature of first-row atomic pseudopotentials. Gradient-corrected density functional theory [85, 86] has been shown to have accuracy surpassing Hartree-Fock and rivaling MP2 for molecular calculations on first and second row atoms. Core pseudopotentials for first row atoms can be rather short-ranged, necessitating large numbers of plane waves for convergence of the wave function. The introduction of ultra-soft pseudopotentials [87-89] alleviates this problem, and has been exploited in simulated annealing molecular dynamics of systems containing first row atoms [90-94]. Very recently, other methods for coping with first-row atom pseudopotentials have been developed. Gygi has generalized the plane wave method to arbitrary curvilinear coordinates [95, 96]. In Gygi's scheme the transformation between Cartesian and curvilinear coordinates can be optimized, placing more grid points near atomic centers and opening the possibility of dynamically adapting the grid during a simulation. Devenyi et al. have explored the feasibility of dropping pseudopotentials altogether and performing all-electron density functional calculations using Gygi's technique [97]. Blöchl has recently proposed an all-electron density functional technique, the Projector Augmented Wave (PAW) method, in which a plane wave basis is augmented by a partial wave expansion in regions strictly localized near the atomic cores [98]. Most current implementations of DFT within simulated annealing molecular dynamics calculations follow the basic outline given in Section 1.2. Ample details particular to DFT have been given in other reviews of Car-Parrinello techniques [44-47, 99]
and will not be repeated here. Many applications have been reported for crystalline solids [100-103], amorphous materials and liquids [35, 45, 104-121], crystal defects and impurities [122-131], phase transitions [106, 132], plasmas [133], surfaces [134-147] and mineral structures [127, 148-153]. In addition to the mainly solid state applications listed here, there have been many applications of Car-Parrinello molecular dynamics methods to chemical problems. A wide variety of atomic clusters have been studied, including clusters composed of selenium [154], silicon [155-157], sulfur [158], alkali metals [159], aluminum [160-164], phosphorus [165, 166], magnesium [167], sodium [168, 169], gallium [162, 163, 170], and atomic mixtures [171-174]. There have been many calculations on carbon clusters and fullerenes [175-184]. The water dimer [91] and small water clusters [93], phase transformations in ice [90, 94], and liquid water [92] have been simulated by simulated annealing molecular dynamics techniques. Recently, polymerization on a semiconductor surface [185, 186] and hydrolysis on the surface of MgO [152] have been studied. Using Blöchl's PAW method, vibrational modes for the ferrocene molecule were determined from an all-electron DFT molecular dynamics simulation [98].
2.4. Other electronic structure approaches
The induced dipole moment of a species in a polar or ionic material depends on the instantaneous configuration of charges and dipoles surrounding the polarizable center. The induced electronic polarization on N polarizable centers can be calculated by solving 3N coupled linear equations (if the polarization model is the standard linear response model), by iteratively solving N equations of dimension 3, or by minimizing a function of the particle positions and the instantaneous dipole moments of the polarizable centers. Cast in the latter form, Sprik [39] used simulated annealing to obtain the self-consistent polarization of a polarizable water model [187]. This work was novel in that it was the original implementation of separate Nosé thermostats [36, 37, 40] for electronic parameters and nuclear coordinates. Simulated annealing molecular dynamics has been used in conjunction with tight-binding Hamiltonians to model the properties of silicon [188-191], potassium [190] and carbon [190] clusters. Car-Parrinello techniques have also been used to describe classical variables whose behavior, like that of quantum electrons in the Born-Oppenheimer approximation, is nearly adiabatic with respect to other variables. In simulations of a colloidal system consisting of macroions of charge Ze, each associated with Z counterions of charge -e, Löwen et al. [192] eliminated explicit treatment of the many counterions using classical density functional theory. Assuming that the counterions relax instantaneously on the time-scale of macroion motion, simulations of the macroions were performed by optimizing the counterion density at each time step by simulated annealing.
3. NEW DIRECTIONS
In less than a decade, first-principles molecular dynamics through simulated annealing has produced an impressive body of results with density functional theory, and some important exploratory studies with Schrödinger equation-based ab initio methods. There has been recent interest in more efficient algorithms [30, 31, 193-195], implementation on parallel architectures [196, 197], or alternatives to simulated annealing molecular dynamics. Small time steps are inherent to Car-Parrinello dynamics, so it is natural to ask whether it is more efficient to calculate electronic energies and forces by a method that does not rely on adiabatic decoupling, perhaps at greater computational expense, but at time intervals characteristic of an all-classical simulation. Alternative iterative methods with this goal have been reviewed by Payne et al. [198]. Even with current algorithms and computational capabilities, there is no shortage of important applications awaiting study. Of course, it is difficult to imagine a situation in which ever more challenging applications did not await our ability to simulate larger systems with more accuracy.
REFERENCES
1. I. S. Y. Wang and M. Karplus, J. Amer. Chem. Soc., 95 (1973), 8160.
2. P. A. Bash, M. J. Field, R. Davenport, and M. Karplus, J. Amer. Chem. Soc., 109 (1987), 8092.
3. M. J. Field, P. A. Bash, and M. Karplus, J. Comput. Chem., 11 (1990), 700.
4. F. Zerbetto, Chem. Phys., 150 (1991), 39.
5. Y. T. Wong, B. Schubert, and R. Hoffmann, J. Amer. Chem. Soc., 114 (1992), 2367.
6. U. C. Singh and P. A. Kollman, J. Comput. Chem., 7 (1986), 718.
7. S. M. Younger, A. K. Harrison, and G. Sugiyama, Phys. Rev., A40 (1989), 5256.
8. S. A. Maluendes and M. Dupuis, Int. J. Quantum Chem., 42 (1992), 1327.
9. V. Keshari and Y. Ishikawa, Chem. Phys. Lett., 218 (1994), 406.
10. R. Car and M. Parrinello, Phys. Rev. Lett., 55 (1985), 2471.
11. S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi, Science, 220 (1983), 671.
12. Certain Schrödinger equation based methods, such as coupled cluster theory, are not based on a variational principle. They fall outside schemes that use the energy expectation value as an optimization function for simulated annealing, although these methods could be implemented within a simulated annealing molecular dynamics scheme with an alternative optimization function.
13. S. F. Boys, Proc. Roy. Soc. (London), A258 (1960), 402.
14. K. Singer, Proc. Roy. Soc. (London), A258 (1960), 412.
15. J. V. L. Longstaff and K. Singer, Proc. Roy. Soc. (London), A258 (1960), 421.
16. J. V. L. Longstaff and K. Singer, Theor. Chim. Acta, 2 (1964), 265.
17. J. V. L. Longstaff and K. Singer, J. Chem. Phys., 42 (1965), 801.
18. W. A. Lester, Jr. and M. Krauss, J. Chem. Phys., 41 (1964), 1407.
19. N. C. Handy, Mol. Phys., 23 (1972), 1.
20. L. Salmon and R. D. Poshusta, J. Chem. Phys., 59 (1973), 3497.
21. D. A. Estrin, C. Tsoo, and S. J. Singer, J. Chem. Phys., 93 (1990), 7201.
22. P. Dutta and S. P. Bhattacharyya, Chem. Phys. Lett., 162 (1989), 67.
23. P. Dutta and S. P. Bhattacharyya, Chem. Phys. Lett., 167 (1990), 309.
24. P. Dutta and S. P. Bhattacharyya, Phys. Lett., A148 (1990), 331.
25. P. Dutta and S. P. Bhattacharyya, Chem. Phys. Lett., 184 (1991), 330.
26. P. Dutta and S. P. Bhattacharyya, Chem. Phys. Lett., 199 (1992), 169.
27. cf. Equations (9-13).
28. R. Car, M. Parrinello, and M. Payne, J. Phys.: Condens. Matter, 3 (1991), 9539.
29. G. Pastore, E. Smargiassi, and F. Buda, Phys. Rev., A44 (1991), 6334.
30. M. E. Tuckerman and M. Parrinello, J. Chem. Phys., 101 (1994), 1302.
31. M. E. Tuckerman and M. Parrinello, J. Chem. Phys., 101 (1994), 1316.
32. The four roots of Equation (20) behave in discontinuous fashion with respect to the mass m and damping coefficient γ. In particular, the transient roots become purely real for certain combinations of parameters. Therefore, the leading behavior for small γ and m given in Equations (21) has a limited range of validity. The separation of the roots of Equation (20) into two pairs, one a transient and the other pair describing the physically desirable solution, remains valid over a much wider range of parameters.
33. C. Tsoo, D. A. Estrin, and S. J. Singer, J. Chem. Phys., 96 (1992), 7977.
34. G. Galli, R. M. Martin, R. Car, and M. Parrinello, Phys. Rev. Lett., 63 (1989), 988.
35. I. Stich, R. Car, and M. Parrinello, Phys. Rev. Lett., 63 (1989), 2240.
36. S. Nosé, Mol. Phys., 52 (1984), 255.
37. S. Nosé, J. Chem. Phys., 81 (1984), 511.
38. P. E. Blöchl and M. Parrinello, Phys. Rev., B45 (1992), 9413.
39. M. Sprik, J. Phys. Chem., 95 (1991), 2283.
40. S. Nosé, Mol. Phys., 57 (1986), 187.
41. E. S. Fois, J. I. Penman, and P. A. Madden, J. Chem. Phys., 98 (1993), 6361.
42. M. R. Pederson, B. M. Klein, and J. Q. Broughton, Phys. Rev., B38 (1988), 3825.
43. H. Chacham and J. R. Mohallem, Mol. Phys., 70 (1990), 391.
44. R. Car and M. Parrinello, in Simple Molecular Systems at Very High Density, edited by A. Polian, P. Loubeyre, and N. Boccara, volume 186 of NATO ASI Ser. B, page 455, Plenum, New York, 1989.
45. G. Galli and M. Parrinello, in Computer Simulation in Materials Science, edited by M. Meyer and V. Pontikis, page 283, Kluwer, Boston, 1991.
46. G. Galli and A. Pasquarello, in Computer Simulation in Chemical Physics, edited by M. P. Allen and D. J. Tildesley, page 261, Kluwer, Boston, 1993.
47. D. K. Remler and P. A. Madden, Mol. Phys., 70 (1990), 921.
48. A. Selloni, P. Carnevali, R. Car, and M. Parrinello, Phys. Rev. Lett., 59 (1987), 823.
49. A. Selloni, R. Car, M. Parrinello, and P. Carnevali, J. Phys. Chem., 91 (1987), 4947.
50. E. Fois, A. Selloni, M. Parrinello, and R. Car, J. Phys. Chem., 92 (1988), 3268.
51. E. Fois, A. Selloni, and M. Parrinello, Phys. Rev., B39 (1989), 4812.
52. Theory of the inhomogeneous electron gas, edited by S. Lundqvist and N. March, Plenum, New York, 1983.
53. R. G. Parr and W. Yang, Density-functional theory of atoms and molecules, Oxford, New York, 1989.
54. E. Gross and R. M. Dreizler, Density functional theory: an approach to the quantum many-body problem, Springer, New York, 1990.
55. J. Theilhaber, Phys. Rev., B46 (1992), 12990.
56. Z. Deng, G. J. Martyna, and M. L. Klein, Phys. Rev. Lett., 68 (1992), 2496.
57. G. J. Martyna, Z. Deng, and M. L. Klein, J. Chem. Phys., 98 (1992), 555.
58. M. Sprik and M. L. Klein, J. Chem. Phys., 87 (1988), 5987.
59. M. Sprik and M. L. Klein, J. Chem. Phys., 89 (1988), 1592, [Erratum: ibid., 90, 7614 (1989)].
60. M. Sprik and M. L. Klein, J. Chem. Phys., 91 (1989), 5665.
61. C. Tsoo, D. A. Estrin, and S. J. Singer, J. Chem. Phys., 93 (1990), 7187.
62. D. A. Estrin, C. Tsoo, and S. J. Singer, Chem. Phys. Lett., 184 (1991), 571.
63. G. J. Martyna and M. L. Klein, J. Phys. Chem., 95 (1991), 515.
64. G. Martyna, C. Cheng, and M. L. Klein, J. Chem. Phys., 95 (1991), 1318.
65. A. Szabo and N. S. Ostlund, Modern Quantum Chemistry, McGraw-Hill, New York, 1989.
66. P. A. Christiansen, Y. S. Lee and K. S. Pitzer, J. Chem. Phys., 71, 4445 (1979); L. F. Pacios and P. A. Christiansen, J. Chem. Phys., 82, 2664 (1985); M. Krauss and W. J. Stevens, Ann. Rev. Phys. Chem., 35, 357 (1984).
67. R. O. Vianna, H. Chacham, and J. R. Mohallem, J. Chem. Phys., 98 (1993), 6395.
68. M. J. Field, Chem. Phys. Lett., 172 (1990), 83.
69. M. J. Field, J. Phys. Chem., 95 (1991), 5104.
70. P. Dutta and S. P. Bhattacharyya, Chem. Phys. Lett., 181 (1991), 293.
71. D. A. Estrin, L. Liu, and S. J. Singer, J. Phys. Chem., 96 (1992), 5325.
72. B. Hartke and E. A. Carter, Chem. Phys. Lett., 189 (1992), 358.
73. B. Hartke and E. A. Carter, J. Chem. Phys., 97 (1992), 6569.
74. B. Hartke and E. A. Carter, Chem. Phys. Lett., 216 (1993), 324.
75. S. Hammes-Schiffer and H. C. Andersen, J. Chem. Phys., 99 (1993), 523.
76. T. Ziegler, Chem. Rev., 91 (1991), 651.
77. Density functional methods in chemistry, edited by J. K. Labanowski and J. W. Andzelm, Springer, New York, 1991.
78. A. M. Rappe, J. D. Joannopoulos, and P. A. Bash, J. Amer. Chem. Soc., 114 (1992), 6466.
79. R. O. Jones, Theochem, 114 (1994), 219.
80. A. Ghosh, J. Almlöf, and L. Que, Jr., J. Phys. Chem., 98 (1994), 5576.
81. W. Kohn and L. Sham, Phys. Rev., A140 (1965), 1133.
82. L.-W. Wang and M. P. Teter, Phys. Rev., B45 (1992), 13197.
83. M. Pearson, E. Smargiassi, and P. A. Madden, J. Phys.: Condens. Matter, 5 (1993), 3221.
84. E. Smargiassi and P. A. Madden, Phys. Rev., B49 (1994), 5220.
85. A. D. Becke, Phys. Rev., A38 (1988), 3098.
86. C. Lee, W. Yang, and R. G. Parr, Phys. Rev., B37 (1988), 785.
87. D. Vanderbilt, Phys. Rev., B41 (1990), 7892.
88. K. Laasonen, R. Car, C. Lee, and D. Vanderbilt, Phys. Rev., B43 (1991), 6796.
89. K. Laasonen, A. Pasquarello, R. Car, C. Lee, and D. Vanderbilt, Phys. Rev., B47 (1993), 10142.
90. C. Lee, D. Vanderbilt, K. Laasonen, R. Car, and M. Parrinello, Phys. Rev. Lett., 69 (1992), 462.
91. K. Laasonen, F. Csajka, and M. Parrinello, Chem. Phys. Lett., 194 (1992), 172.
92. K. Laasonen, M. Sprik, and M. Parrinello, J. Chem. Phys., 99 (1993), 9080.
93. K. Laasonen, M. Parrinello, R. Car, C. Lee, and D. Vanderbilt, Chem. Phys. Lett., 207 (1993), 208.
94. C. Lee, D. Vanderbilt, K. Laasonen, R. Car, and M. Parrinello, Phys. Rev., B47 (1993), 4863.
95. F. Gygi, Europhys. Lett., 19 (1992), 617.
96. F. Gygi, Phys. Rev., B48 (1993), 11692.
97. A. Devenyi, K. Cho, T. A. Arias, and J. D. Joannopoulos, Phys. Rev., B49 (1994), 13373.
98. P. Margl, K. Schwarz, and P. E. Blöchl, J. Chem. Phys., 100 (1994), 8194.
99. M. J. Gillan, in Computer Simulation in Materials Science, edited by M. Meyer and V. Pontikis, page 257, Kluwer, Boston, 1991.
100. F. Buda, G. L. Chiarotti, R. Car, and M. Parrinello, Phys. Rev. Lett., 63 (1989), 294.
101. F. Buda, R. Car, and M. Parrinello, Phys. Rev., B41 (1990), 1680.
102. X. G. Gong, G. L. Chiarotti, M. Parrinello, and E. Tosatti, Phys. Rev., B43 (1991), 14277.
103. F. Ancilotto, A. Selloni, and R. Car, Phys. Rev. Lett., 71 (1993), 3685.
104. R. Car and M. Parrinello, Phys. Rev. Lett., 60 (1988), 204.
105. G. Galli, R. M. Martin, R. Car, and M. Parrinello, Phys. Rev., B42 (1990), 7470.
106. G. Galli, R. M. Martin, R. Car, and M. Parrinello, Science, 250 (1990), 1547.
107. L. F. Xu, A. Selloni, and M. Parrinello, J. Non-Cryst. Solids, 117-118 (1990), 926.
108. Q. M. Zhang, G. Chiarotti, A. Selloni, R. Car, and M. Parrinello, Phys. Rev., B42 (1990), 5071.
109. G. Galli and M. Parrinello, J. Phys.: Condens. Matter, 2 (1990), SA227.
110. I. Stich, R. Car, and M. Parrinello, Phys. Rev., B44 (1991), 4262.
111. F. Buda, G. L. Chiarotti, R. Car, and M. Parrinello, Physica, B170 (1991), 98.
112. F. Buda, G. L. Chiarotti, R. Car, and M. Parrinello, Phys. Rev., B44 (1991), 5908.
113. D. Hohl and R. O. Jones, Phys. Rev., B43 (1991), 3856.
114. F. Buda, J. Kohanoff, and M. Parrinello, Phys. Rev. Lett., 69 (1992), 1272.
115. F. Finocchi, G. Galli, M. Parrinello, and C. M. Bertoni, Phys. Rev. Lett., 68 (1992), 3044.
116. A. Pasquarello, K. Laasonen, R. Car, C. Lee, and D. Vanderbilt, Phys. Rev. Lett., 69 (1992), 1982.
117. E. Fois, A. Selloni, G. Pastore, Q. M. Zhang, and R. Car, Phys. Rev., B45 (1992), 13378.
118. G. Seifert, G. Pastore, and R. Car, J. Phys.: Condens. Matter, 4 (1992), L179.
119. X. G. Gong, G. L. Chiarotti, M. Parrinello, and E. Tosatti, Europhys. Lett., 21 (1993), 469.
120. F. Finocchi, G. Galli, M. Parrinello, and C. M. Bertoni, Physica, B185 (1993), 379.
121. P. E. Blöchl, E. Smargiassi, R. Car, D. B. Laks, W. Andreoni, and S. T. Pantelides, Phys. Rev. Lett., 70 (1993), 2435.
122. M. C. Payne, P. D. Bristowe, and J. D. Joannopoulos, Phys. Rev. Lett., 58 (1987), 1348.
123. E. Tarnow, P. D. Bristowe, J. D. Joannopoulos, and M. C. Payne, J. Phys.: Condens. Matter, 1 (1989), 327.
124. S. A. Kajihara, A. Antonelli, J. Bernholc, and R. Car, Phys. Rev. Lett., 66 (1990), 2010.
125. T. Oguchi and T. Sasaki, in Molecular Dynamics Simulations, edited by F. Yonezawa, page 157, Springer, New York, 1990.
126. T. A. Arias and J. D. Joannopoulos, Phys. Rev. Lett., 69 (1992), 3330.
127. A. DeVita, M. J. Gillan, J. S. Lin, M. C. Payne, I. Stich, and L. C. Clarke, Phys. Rev., B46 (1992), 12964.
128. V. Milman, M. C. Payne, V. Heine, R. J. Needs, J. S. Lin, and M. H. Lee, Phys. Rev. Lett., 70 (1993), 2928.
129. J. M. Jin, L. J. Lewis, V. Milman, I. Stich, and M. C. Payne, Phys. Rev., B48 (1993), 11465.
130. L. Gilgien, G. Galli, F. Gygi, and R. Car, Phys. Rev. Lett., 72 (1994), 3214.
131. T. A. Arias and J. D. Joannopoulos, Phys. Rev., B49 (1994), 4525.
132. P. Focher, G. L. Chiarotti, M. Bernasconi, E. Tosatti, and M. Parrinello, Europhys. Lett., 26 (1994), 345.
133. J. Clerouin, G. Zerah, D. Benisti, and J. P. Hansen, Europhys. Lett., 13 (1990), 685.
134. M. Needels, M. C. Payne, and J. D. Joannopoulos, Phys. Rev. Lett., 58 (1987), 1765.
135. C. Z. Wang, M. Parrinello, E. Tosatti, and A. Fasolino, Europhys. Lett., 6 (1988), 43.
136. F. Ancilotto, W. Andreoni, A. Selloni, R. Car, and M. Parrinello, Phys. Rev. Lett., 65 (1990), 3148.
137. F. Ancilotto, A. Selloni, W. Andreoni, S. Baroni, R. Car, and M. Parrinello, Phys. Rev., B43 (1991), 8930.
138. I. Moullet, W. Andreoni, and M. Parrinello, Surf. Sci., 269-270 (1992), 1000.
139. I. Moullet, W. Andreoni, and M. Parrinello, Phys. Rev., 46 (1992), 1842.
140. G. Brocks, P. J. Kelly, and R. Car, Surf. Sci., 269-270 (1992), 860.
141. F. Ancilotto and A. Selloni, Phys. Rev. Lett., 68 (1992), 2640.
142. I. Simonetta, G. Galli, F. Gygi, M. Parrinello, and E. Tosatti, Phys. Rev. Lett., 69 (1992), 2947.
143. A. Vittadini, A. Selloni, R. Car, and M. Casarin, Phys. Rev., B46 (1992), 4348.
144. I. Simonetta, G. Galli, F. Gygi, M. Parrinello, and E. Tosatti, Physica, B185 (1993), 539.
145. J. Wang, T. A. Arias, and J. D. Joannopoulos, Phys. Rev., B47 (1993), 10497.
146. H. Gai and G. A. Voth, J. Chem. Phys., 101 (1994), 1734.
147. C. Lee, G. T. Barkema, M. Breeman, A. Pasquarello, and R. Car, Surf. Sci., 306 (1994), L575.
148. D. C. Allan and M. P. Teter, Phys. Rev. Lett., 59 (1987), 1136.
149. H. M. Lu and J. R. Hardy, Phys. Rev. Lett., 64 (1990), 661.
150. J. D. Kubicki and A. C. Lasaga, Am. J. Sci., 292 (1992), 153.
151. R. M. Wentzcovitch, J. L. Martins, and G. D. Price, Phys. Rev. Lett., 70 (1993), 3947.
152. W. Langel and M. Parrinello, Phys. Rev. Lett., 73 (1994), 504.
153. B. Winkler, V. Milman, and M. C. Payne, Am. Mineral., 79 (1994), 200.
154. D. Hohl, R. O. Jones, R. Car, and M. Parrinello, Chem. Phys. Lett., 139 (1987), 540.
155. P. Ballone, W. Andreoni, R. Car, and M. Parrinello, Phys. Rev. Lett., 60 (1988), 271.
156. D. Hohl, R. O. Jones, R. Car, and M. Parrinello, J. Amer. Chem. Soc., 111 (1989), 825.
157. U. Röthlisberger, W. Andreoni, and M. Parrinello, Phys. Rev. Lett., 72 (1994), 665.
158. D. Hohl, R. O. Jones, R. Car, and M. Parrinello, J. Chem. Phys., 89 (1988), 6823.
159. P. Ballone, W. Andreoni, R. Car, and M. Parrinello, Europhys. Lett., 8 (1989), 73.
160. J. Y. Yi, D. J. Oh, J. Bernholc, and R. Car, Chem. Phys. Lett., 174 (1990), 461.
161. R. O. Jones, Phys. Rev. Lett., 67 (1991), 224.
162. R. O. Jones, Z. Phys. D: At. Mol. Clusters, 26 (1993), 23.
163. R. O. Jones, J. Chem. Phys., 99 (1993), 1194.
164. R. O. Jones, Z. Phys. D: At. Mol. Clusters, 26 (1993), 349.
165. R. O. Jones and D. Hohl, J. Chem. Phys., 92 (1990), 6710.
166. R. O. Jones and G. Seifert, J. Chem. Phys., 96 (1992), 7564.
167. V. Kumar and R. Car, Phys. Rev., B44 (1991), 8243.
168. U. Röthlisberger, J. Chem. Phys., 94 (1991), 8129.
169. E. S. Fois, J. I. Penman, and P. A. Madden, J. Chem. Phys., 98 (1993), 6352.
170. X. G. Gong and E. Tosatti, Phys. Lett., A166 (1992), 369.
171. R. O. Jones and D. Hohl, Int. J. Quantum Chem., 24 (1990), 141.
172. R. O. Jones and G. Seifert, J. Chem. Phys., 96 (1992), 2942.
173. H. P. Cheng, R. N. Barnett, and U. Landman, Phys. Rev., B48 (1993), 1820.
174. R. O. Jones, Inorg. Chem., 33 (1994), 1340.
175. B. P. Feuston, W. Andreoni, M. Parrinello, and E. Clementi, Phys. Rev., B44 (1991), 4056.
176. W. Andreoni, F. Gygi, and M. Parrinello, Chem. Phys. Lett., 189 (1992), 241.
177. W. Andreoni, F. Gygi, and M. Parrinello, Phys. Rev. Lett., 68 (1992), 823.
178. W. Andreoni, F. Gygi, and M. Parrinello, Chem. Phys. Lett., 190 (1992), 159.
179. K. Laasonen, W. Andreoni, and M. Parrinello, Science, 258 (1992), 1916.
180. J. Kohanoff, W. Andreoni, and M. Parrinello, Phys. Rev., 46 (1992), 4371.
181. J. Kohanoff, W. Andreoni, and M. Parrinello, Chem. Phys. Lett., 198 (1992), 472.
182. R. Jones, C. D. Latham, M. I. Heggie, V. J. B. Torres, S. Oeberg, and S. K. Estreicher, Philos. Mag. Lett., 65 (1992), 291.
183. W. Andreoni, P. Giannozzi, and M. Parrinello, Phys. Rev. Lett., 72 (1994), 848.
184. G. Onida, W. Andreoni, J. Kohanoff, and M. Parrinello, Chem. Phys. Lett., 219 (1994), 1.
185. G. Brocks, P. J. Kelly, and R. Car, Synth. Met., 57 (1993), 4243.
186. G. Brocks, P. J. Kelly, and R. Car, Phys. Rev. Lett., 70 (1993), 2786.
187. M. Sprik and M. L. Klein, J. Chem. Phys., 89 (1988), 7556.
188. F. S. Khan and J. Q. Broughton, Phys. Rev., B39 (1989), 3688.
189. J. Broughton and F. Khan, Phys. Rev., B40 (1989), 12098.
190. C. Satoko, in Molecular Dynamics Simulations, edited by F. Yonezawa, page 186, Springer, New York, 1990.
191. F. S. Khan and J. Q. Broughton, Phys. Rev., B43 (1991), 11754.
192. H. Löwen, P. A. Madden, and J. P. Hansen, Phys. Rev. Lett., 68 (1992), 1081.
193. W. Yang, Phys. Rev. Lett., 66 (1991), 1438.
194. S. Baroni and P. Giannozzi, Europhys. Lett., 17 (1992), 547.
195. G. Galli and M. Parrinello, Phys. Rev. Lett., 69 (1992), 3547.
196. L. J. Clarke, I. Stich, and M. C. Payne, Comput. Phys. Commun., 72 (1992), 14.
197. M. C. Payne, I. Stich, R. D. King-Smith, J. S. Lin, A. De Vita, M. J. Gillan, and L. J. Clarke, in Computer Aided Innovation in New Materials, edited by M. Doyama, volume 2, page 101, North-Holland, Amsterdam, 1993.
198. M. C. Payne, M. P. Teter, D. C. Allan, T. A. Arias, and J. D. Joannopoulos, Rev. Mod. Phys., 64 (1992), 1045.
Chapter 20
A MATLAB algorithm for optimization of an arbitrary multivariate function
Michael A. Curtis
Pharmacokinetics and Drug Metabolism Unit, Alcon Laboratories, Inc., 6201 South Freeway, Fort Worth, TX 76134-2099, USA
1. INTRODUCTION
Simulated annealing (SA), in both its original and generalized (GSA) forms, is an appealing optimization technique due to its robust nature and the inherent simplicity of the basic algorithm. The concepts involved are especially familiar to chemists due to their origins in the field of statistical mechanics. While creating a simple SA or GSA program in any of the commonly used higher level languages should be relatively straightforward, care is needed to develop an implementation of the algorithm which is efficient, user-friendly, well documented, and sufficiently flexible to be easily modified for specific applications. MATLAB, an interactive matrix-based numeric computation and graphics programming environment developed by The MathWorks, has become highly popular with chemometricians in recent years. While powerful and computationally efficient, the MATLAB language is surprisingly easy to use. Individual MATLAB routines, called m-files, may be invoked from within other m-files, allowing great flexibility and extensibility of applications. A general overview of MATLAB is given in the User Guide [1]. In this chapter, a MATLAB implementation of the GSA algorithm, gsaopt.m, listed in Section 6.1, is presented and its operation explained. Special features, including an option to allow decreases in step size during the search to locate the global optimum more precisely, are explained. Three worked examples are included to demonstrate its use and to give readers a starting point for experimenting with GSA on practical problems.
2. PROGRAM REQUIREMENTS AND OPERATION
The various user parameters required when invoking gsaopt.m, along with the variables and counters generated internally in the routine, are explained in detail in the comments, designated by a percent sign at the beginning of a line, at the beginning of the program listing. The operation of the program is outlined in the flowchart shown in Figure 1. In addition to the m-file defining the function to be optimized, the function call statement must include various vector and scalar quantities. All vectors used as input to gsaopt must be row
vectors or an error message will result.
Figure 1. Flowchart for gsaopt.m.
The first user supplied parameter, optfunc, is the name of the m-file for the function to be optimized. This function can have multiple input variables, which allows response surfaces of high dimensionality to be searched. However, the output of optfunc must be single valued. If multiple attributes need to be optimized for a particular application, a weighted sum or other composite quantity may be generated within optfunc to provide a single valued response at each point in the search. The second user supplied parameter, minmxflg, determines whether minimization or maximization of the function designated in optfunc is performed. Maximization simply involves minimization of -f(x), where x is the vector of coordinates for a given point in the search space. The next three input parameters, xstart, xmax, and xmin, are vectors of equal
length designating the starting coordinates and the upper and lower limits of the search domain, respectively. The delta parameter designates the initial step length. The next two parameters, b and g, correspond to the terms β and g in the equation

P(\text{acceptance} \mid \text{detrimental step}) = \exp\!\left[ -\beta \, \Delta\phi \, (\phi_{\mathrm{current}} - \phi_0)^{g} \right]    (1)
as described by Bohachevsky et al. [2] for the conditional acceptance probability of a detrimental step. The value of g in the above equation is a negative number, -1 or -2 often being used. When invoking gsaopt, the absolute value of the desired g factor is entered. As Bohachevsky has noted, a g value of zero simplifies Equation 1 to give standard simulated annealing. The estopt parameter in the gsaopt call statement corresponds to the estimated global optimum φ₀. The user defined maxiter value determines the maximum number of attempted steps before run termination, to limit runtime should a given search prove futile in approaching the desired optimum. One limitation of conventional simulated annealing routines is the fixed step size (Δr), which limits the ultimate precision with which the global optimum can be located. Kalivas [3] reported an alternative approach using discrete step size changes, with increases or decreases throughout the search based upon the acceptance rate of the previous 20 attempted steps. The approach used in gsaopt to overcome fixed step size limitations is as follows: following a specified number of unaccepted step attempts (quitreps), the step size is reduced via multiplication by a fixed decimal factor (shrinkfac), and the number of reduction cycles is limited by the maxshrinks parameter, provided the overall iteration limit is not reached first. Note that the step size reduction feature can be disabled by setting maxshrinks equal to zero, but a value for shrinkfac must still be entered to give proper assignment of arguments in the call function. An alternative way to disable step size shrinkage is to set shrinkfac equal to one. The remaining user supplied parameter, monitor, provides a useful diagnostic tool to fine tune the value of β during successive searches of a given response surface, and its use is explained later in the text. The simple approach to variable step size described above can be highly effective, allowing large areas of the search region to be explored in the early phase, when the response is far from the estimated optimum, and providing high resolution searching when the region containing the optimum is found. However, the user must take care to avoid setting the β value (actually the initial value, vide infra) excessively high, which restricts acceptance of detrimental steps early in the search. In such cases, the stepwise reduction can restrict the search to a local optimum until the iteration maximum is exhausted. Referring to the flowchart in Figure 1, the operation of gsaopt is described below. After the input parameters are checked to ensure they are within acceptable ranges, the first random step is generated, scaled to length Δr, and its response compared to that of the starting coordinates. Throughout the search, favorable steps are accepted unconditionally and unfavorable steps are evaluated by the acceptance probability criterion of Equation 1. At each step, up to the limit defined by maxiter, the appropriate counters are incremented depending on the acceptance or rejection of the step in question. These include failcount, the number of successive rejected step attempts since the last accepted step, which is used to trigger the next stepwise reduction in Δr, or search termination should no additional step shrinkages be allowed, and detaccept, the number of accepted detrimental steps within a user defined window (monitor) of both accepted and rejected detrimental step attempts. When the sum
of detaccept and detfail, a counter similar to failcount but reinitialized as described below, equals the user selected monitor value (m), the acceptance percentage of the previous m detrimental step attempts is calculated and displayed. If the failcount value reaches the limit defined by quitreps, Δr is reduced by shrinkfac, provided the maximum number of step shrinkages has not occurred. As Kalivas [3] has noted, β should increase with decreasing Δr. Otherwise, the smaller changes in mean response variation between adjacent steps with reduced Δr will cause excessive wandering of the algorithm. In gsaopt, a new β value is calculated after each step shrinkage as follows:
\beta_{\mathrm{new}} = \beta_{\mathrm{old}} \, \frac{\Delta r_{\mathrm{old}}}{\Delta r_{\mathrm{new}}}    (2)
This assumes that the magnitude of the response change between adjacent points on the surface is, on average, proportional to Δr, which is reasonable in many cases. If the estimated optimal response, φ₀, is met or exceeded at any point in the search, execution is paused and the user is given the option of continuing with a new estopt value or terminating the search by entering -99 at the prompt. An additional program feature worthy of note is the range vector (see Source Code Listings in Section 6). This performs a preliminary scaling of each random step to conform to the anisotropy in the search volume defined by xmax and xmin. For example, in the case of a two dimensional response surface f(x1,x2) where the allowed ranges of x1 and x2 are 0 to 100 and 0 to 1, respectively, the range normalization ensures that all random step attempts generated will be within the user defined search limits.
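To make the preceding description concrete, a heavily condensed sketch of the accept/reject and step-shrinkage cycle is given below. It is not the gsaopt.m listing of Section 6.1; the variable names, the random-step construction and the boundary handling are illustrative only, and the pause on reaching estopt is omitted.

% Condensed, illustrative sketch of the GSA cycle described above (not gsaopt.m itself).
range = xmax - xmin;                          % anisotropy scaling of the search volume
x = xstart;  phi = feval(optfunc, x);
beta = b;  dr = delta;  failcount = 0;  shrinks = 0;
for iter = 1:maxiter
    step = (rand(size(x)) - 0.5) .* range;    % random direction, scaled by range
    step = dr * step / norm(step);            % set the attempted step length to dr
    xnew = min(max(x + step, xmin), xmax);    % keep the attempt inside the search limits
    phinew = feval(optfunc, xnew);
    dphi = phinew - phi;
    if minmxflg > 0, dphi = -dphi; end        % maximization treated as minimizing -f
    if dphi <= 0 | rand < exp(-beta*dphi*abs(phi - estopt)^(-g))
        x = xnew;  phi = phinew;  failcount = 0;    % accept step (Equation 1)
    else
        failcount = failcount + 1;                  % reject step
    end
    if failcount >= quitreps & shrinks < maxshrinks
        beta = beta/shrinkfac;                % Equation 2, with dr_new = dr*shrinkfac
        dr = dr*shrinkfac;
        failcount = 0;  shrinks = shrinks + 1;
    end
end

With g entered as a positive number, the exponent -g in the sketch reproduces the negative g of Equation 1, and g = 0 reduces the acceptance test to standard simulated annealing.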
3. APPLICATIONS

3.1. Multioptimum function in two variables
The first example of the use of gsaopt involves optimization of the following function in two variables:

f(x,y) = x² + 2y² - 0.3 cos(3πx) - 0.4 cos(4πy) + 0.7    (3)
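A minimal m-file implementing Equation 3 might look like the sketch below; the chapter's own cosmaze.m listing appears in the Source Code Listings of Section 6, and the assumption here is that the coordinates arrive as a two-element row vector.

function f = cosmaze_sketch(x)
% Sketch of an m-file for Equation 3 (illustrative; see Section 6 for the chapter's listing).
% x is assumed to be a two-element row vector [x1 x2].
f = x(1)^2 + 2*x(2)^2 - 0.3*cos(3*pi*x(1)) - 0.4*cos(4*pi*x(2)) + 0.7;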
This function has a number of local minima over the range -1 ≤ x,y ≤ 1, with the global minimum at x = y = 0. Bohachevsky et al. [2] used Equation 3 to demonstrate the utility of GSA. The two variable function is defined in the m-file cosmaze.m, listed in Section 6.1. Minimization of cosmaze was performed using the following parameters:
minmxflg = -1
xmax = [1 1]
xmin = [-1 -1]
xstart = [0.85 0.85]
delta (Δr) = 0.15
b (β) = 3.5
g = 1 (equivalent to -1 in Equation 1)
quitreps = 50
maxshrinks = 5
shrinkfac = 0.5
maxiter = 1000
estopt = 0
monitor = 25

the values for β, Δr, and g being the same as those used by Bohachevsky et al. [2]. The MATLAB command to execute the search was:

[phisave,xsave] = gsaopt('cosmaze',-1,xstart,xmax,xmin,0.15,3.5,1,50,5,0.5,1000,0,25);

Figure 2 shows a contour plot of the cosmaze function over the region -1 to 1 for both x and y, with the search path over this same region shown in Figure 3. The latter shows the 99 accepted (out of 561 attempted) steps. Convergence to the global optimum was extremely close, with final coordinates x = -2.73 × 10⁻³, y = -1.48 × 10⁻³. While some time was spent exploring two regions containing local optima, the algorithm successfully escaped these. Rapid convergence occurred in the region surrounding the global optimum. Figure 4 shows a plot of the response (phisave) over the 99 accepted steps in the above search. Note the sharp drop in the cost function response between 85 and 90 accepted steps, as the region containing the optimum was located and decreases in Δr implemented. The response at search termination was 1.80 × 10⁻⁴. The monitor option was useful for following the progress of the search. The initial detrimental step acceptance rate (over 25 detrimental step attempts) was 48 percent. This rate decreased to zero as the search progressed. If the detrimental step acceptance rate remains high throughout a search, the initial β value may be too low, or the estimated response at the global optimum, φ₀, may be too low in the case of minimization or too high in the case of maximization. The above search was conducted with the seed value of the MATLAB random number generator set to zero, its default value when MATLAB is invoked. To repeat the search with a different random step sequence, enter the command rand('seed',z), where z is the desired seed value. Although the length of the search will vary somewhat with different random number sequences, convergence to a region very close to the global optimum should occur provided the initial parameters, particularly β and φ₀, are appropriate for the problem at hand.
Figure 2. Contour plot of cosmaze function over the area searched.
Figure 3. GSA search path for response surface in Figure 2.
Figure 4. Cost function values for accepted steps in cosmaze search.
3.2. Non-linear regression
The second application of gsaopt demonstrates non-linear regression of a set of pharmacokinetic data representing plasma drug concentrations (y), in µg/mL, as a function of time (x), in minutes, following administration of an intravenous dose. Such data frequently exhibit a biphasic exponential decay response, the initial rapid phase primarily due to distribution of the injected drug to various tissues and the second, slower decay reflecting elimination of the drug by metabolism and excretion. This curve fit uses weighted regression, the weights being 1/√y. The data are shown in Table 1.

Table 1
Pharmacokinetic profile data for GSA non-linear regression demonstration

x (minutes)    y (µg/mL)    weight
5              0.134         2.7318
10             0.0826        3.4794
20             0.0390        5.0637
40             0.0161        7.8811
60             0.0118        9.2057
90             0.0040       15.8114
120            0.0047       14.5865
150            0.0031       17.9605
180            0.0021       21.8218
240            0.0017       24.2536
The concentration and time data were fitted to the following four parameter model:

y = A exp(-αx) + B exp(-βx)    (4)
The regression involved minimization of the weighted sum of squared deviations between the predicted and actual y values as a function of the four model parameters, and was consequently an exploration of a four dimensional response surface. The source code listing for the sum of squares calculation routine, wexpred.m, is given in Section 6.3. The search range for the four parameter fit was 0 to 1 for the A and B parameters and 0 to 0.1 for the α and β parameters. The search was initiated at the center of the search range for each parameter (xinit = [0.5 0.5 0.05 0.05]). These were considered reasonable starting values based on previous work with similar sets of pharmacokinetic data. Since GSA requires an estimate of the magnitude of the cost function response at the global optimum, an estimate of the residual sum of squares for the best fit was needed. Based on work with similar data and some trial and error, a value of 1e-4 was found to give good results. A typical trial run used the following command (see Section 6.1) with the random number generator initialized at its default seed value of zero:
[phisave,xsave] = gsaopt('wexpred',-1,xinit,xmax,xmin,0.1,10,1,50,10,0.5,3500,1e-4,25);

One problem with gsaopt as applied to this non-linear regression problem was that, after the general region of the response surface containing the global optimum was found, further progress in reducing the cost function response by small adjustments in the fit parameters was extremely slow. In the above example, the GSA search terminated at the designated maximum of 3500 attempted steps. The termination point did not represent the minimum response encountered in the search. One possible solution to this problem would be to repeat the search with gradually increasing values of estopt, which will force convergence as this value approaches the true response of the global optimum. Another approach is to modify the GSA program to terminate the search when the magnitude of the difference in cost function response between the current and most recent previous accepted steps falls below a very small selected tolerance. When this condition is reached, further improvement in locating the global optimum is unlikely. The GSA non-linear regression was repeated using this modification with a response tolerance limit of 5 × 10⁻¹¹, a reasonable cutoff given the responses seen in the previous searches, and all other search parameters, including the random number generator seed value, identical to those used for the unmodified gsaopt code. The parameter estimates of both GSA approaches, to the first five digits, are shown in Table 2 along with those obtained using the Nelder-Mead sequential simplex algorithm [4] (MATLAB routine fmins), the same starting coordinates being used for all three searches. The final cost function estimates (weighted residual sums of squares) were 1.6784 × 10⁻⁴ and 1.6276 × 10⁻⁴ for the unmodified GSA and simplex regressions, respectively. The corresponding value for the modified GSA search was 1.6282 × 10⁻⁴. The results from the unmodified GSA for the A and B parameters showed reasonable agreement with the corresponding simplex estimates. However, agreement was somewhat poor for the α parameter and particularly poor for the β parameter. In contrast, the modified GSA gave good agreement with simplex for all four fit parameters. A plot of the pharmacokinetic data and the fitted curve obtained with the modified GSA described above is shown in Figure 5.

Table 2
Results of non-linear regression of pharmacokinetic data to the biexponential model
Parameter    GSA (unmodified)    Simplex      GSA (modified*)
A            0.19521             0.19519      0.19528
α            0.11874             0.11431      0.11478
B            0.028108            0.024981     0.025306
β            0.016374            0.014922     0.015086
* gsaopt.m code in Section 6.1 modified to terminate search using response function change tolerance limit as described in text.
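The exact modified code used to generate the Table 2 results is not reproduced in Section 6.1. The following is a minimal sketch of how such a termination check might be added near the end of the "if accept == 1" block of gsaopt.m, before xcurr and phicurr are updated to the newly accepted point; the variable name phitol and the placement shown are assumptions for illustration only.

% Sketch only (assumed variable name phitol): terminate when the change in
% cost function response between the current and most recent previous
% accepted steps falls below a small selected tolerance.
phitol = 5e-11;                        % response change tolerance limit
if abs(phitest - phicurr) < phitol     % phitest: response at newly accepted step
   disp('Response change below tolerance - search terminated')
   break                               % exit the main "while iter <= maxiter" loop
end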
[Figure 5 appears here: data points and fitted curve plotted against Time (minutes), 0-250 min.]
Figure 5. Plot of pharmacokinetic data from Table 1 showing fitted curve for biexponential model using the parameters obtained with the modified GSA algorithm.
3.3. Binary classification using linear discriminant analysis
The third application of GSA shows a simple classification problem between two groups of objects. The two-variable data set is shown in Figure 6. The "o" and "x" symbols represent classes 0 and 1, respectively. The two variables, x1 and x2, could represent actual measurements of the objects to be classified. In real-life classification problems, they would more likely be composite variables such as the first two principal components of a multivariate data set containing several measurements (e.g. pH, concentrations of various trace elements, near infra-red reflectance signals at multiple wavelengths, etc.) on each object in the set. For this demonstration, linear discriminant analysis [5] was used. The objective was to find a linear combination of x1 and x2 generating a new axis such that the projections of the objects onto this composite axis would yield all negative values for class 0 and all positive values for class 1. To ensure that the zero point on the discriminant axis would fall between the two classes regardless of the position of the data with respect to the origin of x1 and x2, a bias term was added to the transformation, expressed as equation 5:
z = w1*x1 + w2*x2 + b    (5)
where z is the projection of a given object onto the new vector defined by w1, w2 and b. Finding a suitable classification vector by GSA involved generating random values within specified limits for w1, w2 and b and, at each search point, calculating z values for all objects in the set and assigning a classification of zero or one to each object for negative and positive z values, respectively. The latter were then compared with the known class values. The total number of misclassified objects for a given discriminant vector defined the cost function for that vector. The above process was repeated using the GSA algorithm to drive
the cost function to zero, i.e., perfect classification. The model created with the objects of known class (the training set) could then be used to classify new objects of unknown class given measurements of their x1 and x2 values.
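For illustration, the misclassification count described above can also be computed in a compact vectorized form; the sketch below is only an equivalent of the per-object loop used in clsdemo.m (Section 6.4), and the variable names xmat, classknown and w are assumptions.

% Vectorized sketch of the misclassification cost function (illustration only).
% xmat:       n by 2 matrix of object measurements (columns x1 and x2)
% classknown: n by 1 vector of known classes (0 or 1)
% w:          candidate discriminant vector [w1 w2 b]
z = xmat*[w(1); w(2)] + w(3);             % projections onto the discriminant axis
classpred = (z > 0);                      % class 1 for positive z, class 0 otherwise
misclass = sum(classpred ~= classknown);  % cost: number of misclassified objects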
[Figure 6 scatter plot appears here; x-axis: x1.]
Figure 6. Data for linear discriminant classification. The "o" and "x" symbols represent classes 0 and 1, respectively.

The MATLAB code to evaluate the classification ability of a given w1, w2 and b combination is contained in the m-file clsdemo.m (see Section 6.4). This code includes the 40 training set vectors shown in Figure 6, along with their assigned classifications (0 or 1), in the 40 by 3 matrix clsdat contained within the m-file. Note that while the lowest possible cost function value for this problem is zero, inspection of Figure 6 shows that there is no unique solution to this classification problem, i.e., a large number of fairly similar vectors will give complete separation of the two classes. The search ranges for the three variables in this optimization were -2 to 2, -2 to 2 and -1 to 1 for w1, w2 and b, respectively. The search of the three-dimensional response space was started at xinit = [-0.8 -0.8 0.5] with the command:

[phisave,xsave] = gsaopt('clsdemo', -1, xinit, xmax, xmin, 0.5, 3, 1, 25, 0, 1, 100, -1e-7, 25);

Note that although the known cost function response at the global optimum is zero, an estopt value slightly below the theoretical value (-1e-7) was used. This tends to prevent the algorithm from halting on the first encounter of perfect classification of the training set, which may represent an unstable (i.e. barely acceptable) solution to the problem, and it encourages exploration of the region where multiple solutions abound. Such an approach is a useful strategy for applications where the cost function is not continuous; in this case the cost function is constrained to integer values by the nature of the problem. The search was completed in 78 attempted steps, with final values of w1 = 0.73610, w2 = -1.25685 and b = 0.26472. Figure 7 shows the search progress over the 20 accepted steps, with the misclassified object cost function decreasing from 19 to 0. Figure 8 shows the lines corresponding to the points whose projection onto the initial and final classification vectors equals zero (class boundary lines). The actual discriminant vectors defined by the
initial and final w1, w2 and b values are orthogonal to their corresponding class boundary lines shown in the figure. It is interesting to note that when the above search was repeated using sequential simplex optimization from the same starting coordinates used for GSA, the simplex algorithm converged to a local optimum of w1 = -0.800, w2 = -0.800, b = 0.550, which was very close to the starting point. This local optimum gave extremely poor classification, with 18 misclassified objects. The fact that the magnitude of the desired optimum (zero) is exactly known makes the GSA approach very robust for this application in spite of the presence of local optima.
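The exact simplex call used for this comparison is not given in the text; a basic call to the MATLAB fmins routine of the following form, from the same starting coordinates, is a sketch of the type of search described (any options actually used are omitted).

% Sketch of the sequential simplex comparison (illustrative call only).
xinit = [-0.8 -0.8 0.5];              % same starting w1, w2, b as the GSA run
wsimplex = fmins('clsdemo', xinit);   % Nelder-Mead simplex search
misclass = clsdemo(wsimplex)          % misclassified objects at the simplex result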
[Figure 7 appears here: cost function (number of misclassified objects) versus Accepted Step Number.]
Figure 7. Cost function for accepted steps in GSA discriminant search.
[Figure 8 appears here; x-axis: x1.]
Figure 8. Data from Figure 6 showing initial and final class boundaries for GSA optimization.
4. MODIFICATIONS FOR MATLAB 4.0
The most recent revision of MATLAB, Version 4.0 for Windows, contains modifications to the random number generator options. In MATLAB 4.0, the rand command generates numbers from a uniform distribution between 0 and 1, and the randn command is used to generate random values from the unit normal distribution (N(0,1)). Previous versions used the rand command for both distributions, with the switches rand('normal') and rand('uniform') designating the distribution to use in subsequent rand commands. While the
gsaopt source code listed in Section 6.1 will execute in MATLAB 4.0 in the same manner as in previous versions in spite of the old convention, a warning message discouraging use of the latter will be generated at each random step iteration, significantly slowing runtime. Future versions of MATLAB will not allow use of the normal and uniform switches. Comments are included in the source code of gsaopt to instruct the user on the trivial modifications needed to overcome this problem. However, since the rand and randn commands in MATLAB invoke separate random number generators, each with its own seed value, the exact search path for a given problem with the same search parameters will differ between the code demonstrated here and the modified version. Also, in order to replicate a search with the modified code, both random number generators need to be reinitialized with the corresponding seed values from the initial search.
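As a concrete illustration, the substitutions described in the gsaopt comments, and one way the two MATLAB 4.0 generators might be reinitialized before a replicate run, are sketched below; the seed value of zero is used only as an example.

% MATLAB 4.0 substitutions described above (sketch only):
%   pre-4.0:  rand('normal');  U = rand(1,n);   ->  4.0:  U = randn(1,n);   % N(0,1) step direction
%   pre-4.0:  rand('uniform'); p = rand;        ->  4.0:  p = rand;         % uniform(0,1) acceptance test
%
% To replicate a search with the modified code, reinitialize both generators
% before the run (the seed value 0 is only an example):
rand('seed', 0);
randn('seed', 0);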
5. CONCLUSIONS
A practical, well documented implementation of the GSA algorithm in the MATLAB programming environment has been developed, and its capabilities for solving a diverse range of optimization problems have been demonstrated. Fundamental limitations of the technique include the importance of a proper choice of the initial β (beta factor) and Δr (step size) values and the need for an accurate estimate of the cost function response at the desired optimum for the problem at hand. These limitations must be kept in mind when considering whether or not GSA is the technique of choice for a given application. The power and computational speed of MATLAB, and the flexibility of invoking multiple m-files within a given application, make this programming language particularly appropriate for a technique such as GSA. To use GSA to its fullest potential, the user must experiment on various problems with varying search parameters to gain an understanding of how best to fine tune the search for a particular application. Finally, imagination is often required in expressing a given optimization problem as a single-valued cost function appropriate to the algorithm.
6. SOURCE CODE LISTINGS

6.1. Program gsaopt.m

function [phisave, xsave] = gsaopt(optfunc, minmxflg, xstart, xmax,...
xmin, delta, b, g, quitreps, maxshrinks, shrinkfac, maxiter, estopt,...
monitor)
%
% function [phisave, xsave] = gsaopt(optfunc, minmxflg, xstart, xmax,...
%     xmin, delta, b, g, quitreps, maxshrinks, shrinkfac, maxiter, estopt,...
%     monitor)
%
% GSAOPT is a global response surface optimization program
% using the Generalized Simulated Annealing (GSA) algorithm. It can
% seek either the global maximum or minimum based on a flag set during the call.
% The n-dimensional response surface is searched over a user specified
% region. * indicates variable passed in function call
%
% * optfunc:    m-file, function for optimization - must have a single
%               dependent variable
% * minmxflg:   scalar: set at 1 for maximization, -1 for minimization
%   n:          scalar, # independent variables in optfunc
% * xstart:     row vector, starting coordinates for search
%   xcurr:      row vector, current coordinates in search region
% * xmax/xmin:  row vectors, define search region limits
% * delta:      scalar, scaling factor for random step size
% * b,g:        scalars, beta factor and denominator exponent g, respectively,
%               for detrimental step acceptance probability calc. (See
%               Reference) If delta is shrunk to find the exact location of
%               the optimum, b is increased proportionately to compensate for
%               smaller average changes in cost function between current and
%               next attempted steps.
% * quitreps:   scalar, number of consecutive unsuccessful step attempts
%               before step shrink or search termination
% * maxshrinks: scalar, number of step shrinkages permitted in search
% * shrinkfac:  scalar, factor by which delta is decreased at each shrink
%               (must be <= 1 & > 0, a value must be entered even if
%               maxshrinks = 0)
% * maxiter:    scalar, maximum number of attempted steps in search
% * estopt:     scalar, estimated response at optimum, for GSA acceptance
%               probability determination (See Reference)
% * monitor:    scalar, number of unaccepted steps evaluated before percent
%               detrimental steps accepted is updated on screen, e.g. if
%               monitor = 50 the percentage of accepted detrimental steps is
%               displayed for the previous 50 detrimental steps evaluated.
%               This is useful in selecting the beta factor (initial
%               temperature) for the optimization.
%   iter:       scalar, counter for attempted search steps
%   phicurr:    scalar, value of optfunc at xcurr
%   failcount:  scalar, number of consecutive step attempts since last
%               accepted step
%   detfail/detaccept: scalars, counters used to calculate detperacc
%   detperacc:  scalar, percentage of accepted detrimental steps for
%               previous interval (See monitor)
%   shrinkreps: scalar, counter for step shrinkages
% * phisave:    column vector, optfunc values for all accepted search steps
% * xsave:      matrix, coordinates of all accepted search steps
%   ksave:      scalar, counter for accepted steps
%   U:          row vector, random vect. (N(0,1) dist.) sets direction of
%               next step
%   range:      row vector xmax-xmin, used for preliminary scaling of random
%               steps so as to minimize generation of random steps outside
%               acceptable search space if latter is highly anisotropic,
%               e.g. xmax = [100 2], xmin = [0 0]
%
% Reference: Bohachevsky et al., Technometrics 1986, 28 (3), 209-217
%
% Version 28 February, 1994 by M.A. Curtis
%
if abs(minmxflg) ~= 1
   error('minmxflg must = 1 or -1')
end
n = length(xstart);
if length(xmax) ~= n | length(xmin) ~= n
   error('dimensions of starting point and search limits do not agree')
end
for i = 1:n
   if xstart(i) > xmax(i) | xstart(i) < xmin(i)
      error('Starting point is beyond defined range')
   end
end
if b <= 0
   error('Beta must be greater than zero')
end
if g < 0
   error('g must be >= 0')
end
if quitreps ~= round(quitreps) | quitreps <= 0
   error('quitreps must be an integer greater than zero')
end
if maxshrinks ~= round(maxshrinks) | maxshrinks < 0
   error('maxshrinks must be a non-negative integer')
end
if shrinkfac > 1 | shrinkfac <= 0
   error('Factor must be > 0 and <= 1')
end
if maxiter ~= round(maxiter) | maxiter <= 0
   error('maxiter must be an integer greater than zero')
end
% Begin search
xcurr = xstart;
phicurr = feval(optfunc,xcurr);
failcount = 0;
shrinkreps = 0;
detaccept = 0;
iter = 0;
ksave = 1;
range = xmax-xmin;
range = range/max(range);
detfail = 0;
while iter <= maxiter
   accept = 0;
   gdenom = ((minmxflg * (estopt-phicurr))^g);
   if (minmxflg * (estopt-phicurr)) <= 0
      disp('Estimated optimum met or exceeded.')
      estopt = input('Enter new estimate or -99 to quit: ')
      gdenom = ((minmxflg * (estopt-phicurr))^g);
      if estopt == -99
         break
      end
   end
   while accept == 0 & failcount < quitreps
      % FOR USE WITH MATLAB 4.0 OR SUBSEQUENT VERSIONS,
      % DELETE THE NEXT LINE (rand('normal')) AND REPLACE
      % THE LINE BELOW IT WITH: U = randn(1,n)
      rand('normal')
      U = rand(1,n);
      U = (diag(range'*U))';
      root = (U*U')^0.5;
      step = (U * delta)/root;
      iter = iter + 1;
      if min(xmax-(xcurr+step)) >= 0 & max(xmin-(xcurr+step)) <= 0
         xtest = xcurr + step;
         phitest = feval(optfunc,xtest);
         delphi = minmxflg * (phitest-phicurr);
         % FOR USE WITH MATLAB 4.0 OR SUBSEQUENT
         % VERSIONS, DELETE THE NEXT LINE (rand('uniform')) AND
         % REPLACE WITH: unirand = rand;
         rand('uniform');
         if delphi >= 0
            accept = 1;
         elseif exp((b*delphi)/(gdenom)) > rand % accept detrimental step
            % FOR MATLAB 4.0 OR SUBSEQUENT VERSIONS, REPLACE THE
            % ELSEIF STATEMENT IMMEDIATELY ABOVE WITH THE
            % FOLLOWING LINE:
            % elseif exp((b*delphi)/(gdenom)) > unirand % accept detrimental step
            detaccept = detaccept + 1;
            accept = 1;
         else
            failcount = failcount + 1;
            detfail = detfail + 1;
         end
      end % of "if min(xmax-(xcurr+step))"
      if (detfail+detaccept) == monitor
         detperacc = (detaccept/(detaccept+detfail))*100;
         disp('Percent detrimental steps accepted in monitor window = ')
         disp(detperacc)
         detaccept = 0;
         detfail = 0;
      end
   end % of "while accept == 0"
   if failcount == quitreps & shrinkreps == maxshrinks
      disp('Search completed')
      disp('If increased confidence in result desired, repeat search')
      disp('using new starting coordinates or final coordinates of this search')
      disp(' ')
      disp('Total number of attempted steps = ')
      disp(iter)
      break
   elseif failcount == quitreps & shrinkreps < maxshrinks
      olddelta = delta;
      delta = delta * shrinkfac;
      shrinkreps = shrinkreps + 1;
      b = b * (olddelta/delta);
      failcount = 0;
   end
   if accept == 1
      % save tested point, adjust counters and attempt next step
      xsave(ksave,:) = xcurr;
      phisave(ksave,1) = phicurr;
      xcurr = xtest;
      phicurr = phitest;
      ksave = ksave + 1;
      failcount = 0;
   end
end % of "while iter <= maxiter"
if iter >= maxiter
   disp('Iteration maximum reached')
   disp('Last values in output may not be in vicinity of optimum')
   disp('with the largest magnitude for the area searched')
end

6.2. Program cosmaze.m (Application 1)

function z = cosmaze(invec)
% function z = cosmaze(invec)
%
% COSMAZE is a 2 variable multioptimum function used by Bohachevsky et al.
% for Generalized Simulated Annealing (GSA) demonstration.
%
% Reference: I.O. Bohachevsky, M.E. Johnson, M.L. Stein, Technometrics,
% 1986, 28, 209-217.
%
x = invec(1);
y = invec(2);
z = (x^2) + (2*(y^2)) - 0.3*cos(3*pi*x) - 0.4*cos(4*pi*y) + 0.7;

6.3. Program wexpred.m (Application 2)

function wressq = wexpred(estpar)
% function wressq = wexpred(estpar)
%
% WEXPRED calculates the weighted (1/square root y) sum of squared deviations
% for fitting pharmacokinetic data (biexp) to a four parameter, biexponential
% decay model. This allows demonstration of non-linear regression by simplex,
% simulated annealing, or other optimization techniques.
%
A = estpar(1);
alpha = estpar(2);
B = estpar(3);
beta = estpar(4);
biexp = [5 0.134
10 0.0826
20 0.0390
40 0.0161
60 0.0118
90 0.0040
120 0.0047
150 0.0031
180 0.0021
240 0.0017];
expwts = [2.73179
3.47944
5.06370
7.88110
9.20575
15.81139
14.58650
17.96053
21.82179
24.25356];
for i = 1:10
   bexpfit(i,1) = A*exp(-alpha*biexp(i,1)) + B*exp(-beta*biexp(i,1));
end
wresvec = bexpfit - biexp(:,2);
wressq = (wresvec'*diag(expwts)*wresvec);

6.4. Program clsdemo.m (Application 3)

function misclass = clsdemo(vec3in)
% function misclass = clsdemo(vec3in)
%
% CLSDEMO evaluates the cost function for demonstration of binary
% classification in two variables. Object patterns for the training set
% are defined by columns 1 and 2 of clsdat. Column 3 of clsdat
% defines the assigned classification of the training set objects. Output is
% the total number of misclassified objects for a given discriminant vector.
%
w1 = vec3in(1);
w2 = vec3in(2);
b = vec3in(3);
clsdat = [-0.6965 1.6961 0
1.8079 1.0282 0
0.0591 1.7971 0
1.0982 1.1226 0
1.2704 0.9845 0
0.0020 1.6065 0
0.7652 0.8617 0
0.5077 0.8853 0
1.8141 0.0350 0
0.3750 1.1252 0
-1.0639 0.3516 0
1.0215 0.3177 0
0.2641 0.8717 0
0.2091 0.5621 0
0.3367 0.4152 0
0.3967 0.7562 0
0.0562 0.5135 0
0.9235 -0.0705 0
0.0751 0.3516 0
1.4462 -0.7012 1
1.1650 0.6268 1
1.5161 0.7494 1
2.0185 0.9241 1
0.8476 0.2681 1
0.2738 -0.3229 1
0.4450 -0.6129 1
0.7031 -0.0524 1
1.1330 0.1500 1
0.5817 -0.2714 1
0.2481 -0.7262 1
0.1479 -0.5571 1
0.5774 -0.3600 1
0.3180 -0.5112 1
-0.0449 -0.7989 1
0.4142 -0.9778 1
1.2460 -0.6390 1
0.1356 -1.3493 1
0.4005 -1.3413 1
0.7286 -2.3774 1
1.5578 -2.4443 1];
clscurr = [w1 w2]*clsdat(:,1:2)' + ones(1,40)*b;
clscurr = clscurr';
misclass = 0;
for i = 1:length(clscurr)
   if clscurr(i) <= 0
      classign(i,1) = 0;
   else
      classign(i,1) = 1;
   end % of "if clscurr(i) <= 0"
   if classign(i) ~= clsdat(i,3)
      misclass = misclass + 1;
   end % of "if classign(i) ~= clsdat(i,3)"
end % of "for i = 1:length(clscurr)"
REFERENCES
1. PC-MATLAB for MS-DOS Computer User Guide, The MathWorks, Natick, MA, October 15, 1990.
2. I.O. Bohachevsky, M.E. Johnson and M.L. Stein, Technometrics, 28 (1986) 209.
3. J.M. Sutter and J.H. Kalivas, Anal. Chem., 63 (1991) 2383.
4. J.A. Nelder and R. Mead, Comput. J., 7 (1965) 308.
5. D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte and L. Kaufman, Chemometrics: A Textbook, Chapter 20, Elsevier, Amsterdam, 1988.
Epilogue
John H. Kalivas
Department of Chemistry, Idaho State University, Pocatello, Idaho 83209, USA
1. INTRODUCTION
From the chapters presented in this book, it is apparent that simulated annealing (SA) type algorithms are adaptable to a wide range of chemical problems. The SA approach has been the object of detailed studies in other disciplines as well. The fact that investigators in diverse fields have contributed to the study of SA speaks well of the ability of SA type algorithms to solve complex optimization problems. Listed in Table 1 are results from a literature search using the CARL UnCover data base. The keywords "simulated annealing" were used for the search, producing 437 citations. Table 1 arranges these references by year from 1988 to 1994. As can be seen, the number of publications involving SA has steadily grown in both chemistry and other areas. There appears to be a decrease in 1994, but this is more than likely because the search was performed on December 22, 1994 and many of the 1994 publications had not yet been entered into the data base.

Table 1
Number of publications cited from 1988 to 1994 using the CARL UnCover system on December 22, 1994

Year    Chemistry related    Other disciplines    Total
1988    4                    8                    12
1989    9                    32                   41
1990    15                   45                   60
1991    21                   47                   68
1992    25                   55                   80
1993    27                   69                   96
1994    18                   62                   80
Listed in Section 2 are some selected references from the 437 citations. Most of these references emphasize solutions to chemical problems. However, some of the listed references describe applications of SA to nonchemical problems that parallel actual chemical problems. A list of 292 additional references prior to 1988 is
available in N.E. Collins, R.W. Eglese, B.L. Golden, Simulated annealing - an annotated bibliography, American Journal of Mathematical and Management Sciences, 8 (1988) 209.
2. BIBLIOGRAPHY
A. Gronenborn, The solution conformation of the antibacterial peptide cecropin A: a nuclear magnetic resonance and dynamical simulated annealing study, Biochemistry, 27 (1988) 7620.
D. Hohl, R.O. Jones, R. Car, Structure of sulfur clusters using simulated annealing: S2 to S13, Journal of Chemical Physics, 89 (1988) 6823.
R. Navarro, F.J. Fuentes, M. Nieto-Vesperinas, Simulated annealing image reconstruction in photolimited stellar speckle interferometry, Astronomy and Astrophysics, 208 (1989) 374.
S. Webb, SPECT reconstruction by simulated annealing, Physics in Medicine and Biology, 34 (1989) 259.
P.J.M. Folkers, G.M. Clore, P.C. Driscoll, Solution structure of recombinant hirudin and the Lys-47 Glu mutant: a nuclear magnetic resonance and hybrid distance geometry-dynamical simulated annealing study, Biochemistry, 28 (1989) 2601.
W.B. Dolan, P.T. Cummings, M.D. LeVan, Process optimizing via simulated annealing: application to network design, AIChE Journal, 35 (1989) 725.
D. Young, E. Corey, Optimization of physical data tables by simulated annealing, Computers in Physics, 3 (1989) 33.
R.E. Hoffman, G.C. Levy, Spectral deconvolution by simulated annealing, Journal of Magnetic Resonance, 83 (1989) 411.
J. Hafner, M.C. Payne, A dynamical simulated annealing approach to the electronic structure of liquid metals, Journal of Physics, 2 (1990) 221.
S.R. Wilson, Applications of simulated annealing to peptides, Biopolymers, 29 (1990) 225.
M.M. Doria, J.E. Gubernatis, D. Rainer, Solving the Ginzburg-Landau equations by simulated annealing, Physical Review B: Condensed Matter, 41 (1990) 6335.
J. Kelly, B. Golden, A. Assad, Using simulated annealing to solve controlled rounding problems, ORSA Journal on Computing, 2 (1990) 174.
R.O. Jones, D. Hohl, Structure of phosphorus clusters using simulated annealing: P2 to P8, Journal of Chemical Physics, 92 (1990) 6710.
M.H. Browdy, Simulated annealing: an improved computer model for political redistricting, Yale Law & Policy Review, 8 (1990) 163.
C.P. Chang, Y.H. Lee, S.Y. Wu, Optimization of a thin-film multilayer design by use of the generalized simulated-annealing method, Optics Letters, 15 (1990) 595.
R.W. Eglese, Simulated annealing: a tool for operational research, European Journal of Operational Research, 46 (1990) 271.
J. Pannetier, J. Bassas-Alsina, J. Rodriguez-Carvajal, Prediction of crystal structures from crystal chemistry rules by simulated annealing, Nature, 346 (1990) 434.
J. Habazettl, C. Cieslar, H. Oschkinat, 1H NMR assignments of sidechain conformations in proteins using a high-dimensional potential in the simulated annealing calculations, FEBS Letters, 268 (1990) 141.
M.J. Field, Simulated annealing, classical molecular dynamics and the Hartree-Fock method: the NDDO approximation, Chemical Physics Letters, 172 (1990) 83.
D.E. Goldberg, A note on Boltzmann tournament selection for genetic algorithms and population-oriented simulated annealing, Complex Systems, 4 (1990) 445.
T.P.L. Roberts, T.A. Carpenter, L.D. Hall, Design and application of prefocused pulses by simulated annealing, Journal of Magnetic Resonance, 89 (1990) 595.
C. Tsoo, D.A. Estrin, S.J. Singer, Electronic energy shifts of a sodium atom in argon clusters by simulated annealing, Journal of Chemical Physics, 93 (1990) 7187.
H. Raittinen, K. Kaski, Image deconvolution with simulated annealing method, Physica Scripta, 33 (1990) 126.
S.M. Morrill, R.G. Lane, I.I. Rosen, Constrained simulated annealing for optimized radiation therapy treatment planning, Computer Methods and Programs in Biomedicine, 33 (1990) 135.
H. Das, P.T. Cummings, M.D. LeVan, Scheduling of serial multiproduct batch processes via simulated annealing, Computers & Chemical Engineering, 14 (1990) 1351.
P.E. Correa, The building of protein structures from alpha-carbon coordinates, PROTEINS: Structure, Function, and Genetics, 7 (1990) 366.
I.M. Navon, F.B. Brown, D.H. Robertson, A combined simulated annealing and quasi-Newton-like conjugate-gradient method for determining the structure of mixed argon-xenon clusters, Computers Chem., 14 (1990) 305.
H. Ku, An evaluation of simulated annealing for batch process scheduling, Industrial & Engineering Chemistry Research, 30 (1991) 163.
D. Abramson, Constructing school timetables using simulated annealing: sequential and parallel algorithms, Management Science, 37 (1991) 98.
H. Kawai, Y. Okamoto, M. Fukugita, Prediction of alpha-helix folding of isolated C-peptide of ribonuclease A by Monte Carlo simulated annealing, Chemistry Letters, (1991) 213.
T. Satoh, K. Nara, Maintenance scheduling by using simulated annealing method, IEEE Transactions on Power Systems, 6 (1991) 850.
P. Ballone, P. Milani, Simulated annealing and collision properties of carbon clusters, Zeitschrift für Physik D, 19 (1991) 439.
S.R. Wilson, F. Guarnieri, Calculation of rotational states of flexible molecules using simulated annealing, Tetrahedron Letters, 32 (1991) 3601.
M.A. This, N.K. Reddy, Calculation of rotational states of flexible molecules using simulated annealing, Tetrahedron Letters, 32 (1991) 3605.
S.H. Nilar, Applications of the simulated annealing method to intermolecular interactions, Journal of Computational Chemistry, 12 (1991) 1008.
M.J. Field, Constrained optimization of ab initio and semiempirical Hartree-Fock wave functions using direct minimization or simulated annealing, Journal of Physical Chemistry, 95 (1991) 5104.
D.J. Chartrand, J.C. Shelley, R.J. LeRoy, Pulling, packing, and stacking: structural proclivities of SF6-(rare gas)n van der Waals clusters, Journal of Physical Chemistry, 95 (1991) 8310.
S.Z. Selim, K. Alsultan, A simulated annealing algorithm for the clustering problem, Pattern Recognition, 24 (1991) 1003.
J. Higo, V. Collura, J. Garnier, Development of an extended simulated annealing method: application to the modeling of complementarity determining regions of immunoglobulins, Biopolymers, 32 (1992) 33.
S.G. Shi, W. Taam, Non-linear canonical correlation analysis with a simulated annealing solution, Journal of Applied Statistics, 19 (1992) 155.
V. Smith, J. Kurhanewicz, T.L. James, Solvent-suppression pulses. III. Design using simulated-annealing optimization with in vitro and in vivo testing, Journal of Magnetic Resonance, 96 (1992) 345.
R.L. Asher, D.A. Micha, Brucat, Equilibrium properties of transition-metal ion-argon clusters via simulated annealing, Journal of Chemical Physics, 96 (1992) 7683.
J.M. Hjorthoj, L.A. Philips, Data analysis for rotationally resolved spectra: a simulated annealing approach, Journal of Molecular Spectroscopy, 154 (1992) 288.
S. Webb, Optimization by simulated annealing of three-dimensional, conformal treatment planning for radiation fields defined by a multileaf collimator, Physics in Medicine & Biology, 37 (1992) 1689.
R.S. Sloboda, Optimization of brachytherapy dose distributions by simulated annealing, Medical Physics, 19 (1992) 955.
J.A. Cuticchia, J. Arnold, W.E. Timberlake, The use of simulated annealing in chromosome reconstruction experiments based on binary scoring, Genetics, 132 (1992) 591.
F.S. DiGennaro, D. Cowburn, Parametric estimation of time-domain NMR signals using simulated annealing, Journal of Magnetic Resonance, 96 (1992) 582.
K. Wakamatsu, D. Kohda, H. Hatanaka, Structure-activity relationships of mu-conotoxin GIIIA: structure determination of active and inactive sodium channel blocker peptides by NMR and simulated annealing calculations, Biochemistry, 31 (1992) 12577.
M.W. Deem, J.M. Newsam, Framework crystal structure solution by simulated annealing: test application to known zeolite structures, Journal of the American Chemical Society, 114 (1992) 7189.
M.E. Snow, S.B. Crary, The use of simulated annealing in the I-optimal design of experiments, Michigan Academician, 24 (1992) 343.
S. Crozier, D. Doddrell, Gradient-coil design by simulated annealing, Journal of Magnetic Resonance, 103A (1993) 354.
E. Peyrol, P. Floquet, L. Pibouleau, Scheduling and simulated annealing: application to a semiconductor circuit fabrication plant, Computers & Chemical Engineering, 17 (1993) S39.
D. Zhao, Sequential simulated annealing: an efficient procedure for structural refinement based on NMR constraints, Journal of Physical Chemistry, 97 (1993) 3007.
A.M. Weiner, S. Oudin, D.E. Leaird, Shaping of femtosecond pulses using phase-only filters designed by simulated annealing, Journal of the Optical Society of America, 10 (1993) 1112.
J. Pospichal, V. Kvasnička, Fast evaluation of chemical distance by simulated-annealing algorithm, Journal of Chem. Inf. Comput. Sci., 33 (1993) 879.
B. Tidor, Simulated annealing on free energy surfaces by a combined molecular dynamics and Monte Carlo approach, Journal of Physical Chemistry, 97 (1993) 1069.
F.H. Epstein, J.P. Mugler III, J.R. Brookeman, Optimization of parameter values for complex pulse sequences by simulated annealing: application to 3D MP-RAGE imaging of the brain, Magnetic Resonance in Medicine, 31 (1994) 164.
C. Koulamas, K.R. Davis, F. Turner III, A survey of simulated annealing applications to operations research problems, Omega, 22 (1994) 41.
Y. Okamoto, Dependence on the dielectric model and pH in a synthetic helical peptide studied by Monte Carlo simulated annealing, Biopolymers, 34 (1994) 529.
H.L. Huber, Structural optimization of vapor pressure correlations using simulated annealing and threshold accepting: application to R134a, Computers & Chemical Engineering, 18 (1994) 929.
Index
A
acid rain, 225
ADAPT
-, automated data analysis and pattern recognition toolkit, 112
algorithm
-, back-propagation, 118
-, branch-and-bound, 187
-, combinatorial, 3
-, genetic, 24
-, heuristic, 181, 184
-, K-means, 139, 160, 178
-, Metropolis, 1, 228, 265
-, optimal histogram, 376
-, partial clustering, 136
-, Powell, 73, 77, 79
-, probabilistic, 3
-, quasi-Newton BFGS, 118
-, sequence building, 187
-, Ward's method, 140, 151
-, WWW, 331
ammonium aerosols, 224
B
back-propagation, 118
batch processing, 181
Beer's law, 28
biased random walk, 7
Boltzmann distribution, 5, 30, 156
branch-and-bound, 187
break-down point, 59
C
charged particle surface area, 113
cluster analysis, 133, 155, 453
combinatorial, 34, 156, 181
constrained background bilinearization, 57, 73
continuous process, 181
convergence, 31, 315
cooling schedule, 4, 314
covariance matrix, 61
criterion
-, external clustering, 135
-, internal clustering, 134
-, selectivity, 36
-, SEP, 36
-, Shannon's, 293
-, PRESS, 36
-, weighted tardiness, 187
cross validation, 36, 116
crystallographic refinement, 260, 283
crystallography, X-ray, 259, 281, 303
D
descriptors
-, geometric, 113
-, electronic, 113
-, topological, 113
distribution
-, Boltzmann, 5, 30, 156
-, uniform, 64
E
eigenvalues, 61
eigenvectors, 61
electron density map, 282
electron spectroscopy for chemical analysis, 86, 103
electronic wavefunction, 419
emission control, 223
error function, 91, 97
environmental
-, deterioration, 224
-, impact, 225
-, protection, 225
Euclidian distance, 166
F
fluorescence
-, decay, 243, 247
-, lifetime, 239, 242
-, spectrum, 73
G
genetic algorithm, 24
goodness of fit, 250
H
hierarchical clustering algorithm, 136, 137, 160
HPLC, 98
I
internal clustering criterion, 134
ISODATA, 139, 149
K
Kalman filter, 59, 85, 87, 325
K-means, 139, 149, 160, 178
L
Leaps-and-bounds regression, 113, 114
M
M-estimator, 59
Manhattan distance, 37
Marquardt fitting, 248
MATLAB, 445
Metropolis algorithm, 1, 228, 265
minimum energy, 19
molecular conformation, 19
molecular dynamics, 266, 417
multi product plant, 182
multi purpose plant, 182
multicomponent analysis, 86
multiple linear regression, 114
N
neighborhood, 4, 37
neural network, 115
-, annealed, 119
neutronics model, 207
nitrate aerosols, 224
NMR, 303
nodal expansion model, 207
noncontinuous processing, 181
nonlinear mapping, 175
nonlinear regression, 451
NP-complete, 183
nuclear fuel management, 205
O
optimization
-, combinatorial, 3, 4
-, molecular dynamics, 417
-, nuclear fuel management, 205
-, probabilistic, 3
-, solvated electrons, 431
-, stochastic, 3, 63
-, wavefunction, 395
-, wavelengths, 27
orthogonal space, 62
P
partial clustering algorithm, 136
partial least squares, 44, 57
performance index, 200
perturbation theory, generalized, 207
pollutants, 208, 223
Powell algorithm, 73, 77, 79
prediction set, 113
principal component
-, analysis, 57, 59, 160
-, regression, 58
production sequence, 193
production path, 191
projection pursuit, 57, 59, 60, 173
protein folding, 304, 369, 371
Q
quality, 120
quantitative structure activity relationships, QSAR, 111
quasi-Newton BFGS algorithm, 118
R
random walk, 4, 7
rank, 74
resolution, 85, 86, 93
robust regression, 59
Rosenbrock's function, 6
S
scheduling, 181, 212, 269
search space, 213
simplex, 94, 105, 250, 452
spectral density function, 306
steepest descent, 94, 105
stepwise discrimination analysis, 176
storage tanks, 195
supervised pattern recognition, 155
T
temperature control
-, velocity scaling, 268
-, Langevin, 268, 314
training set, 112
training, 116
trajectory model, 227
U
uniform distribution, 64
unit group, 193
W
Ward's method, 140, 151
wavelength selection, 27