COMPUTER AIDED MOLECULAR DESIGN: THEORY AND PRACTICE
COMPUTER-AIDED CHEMICAL ENGINEERING
Advisory Editor: R. Gani

Volume 1: Distillation Design in Practice (L.M. Rose)
Volume 2: The Art of Chemical Process Design (G.L. Wells and L.M. Rose)
Volume 3: Computer Programming Examples for Chemical Engineers (G. Ross)
Volume 4: Analysis and Synthesis of Chemical Process Systems (K. Hartmann and K. Kaplick)
Volume 5: Studies in Computer-Aided Modelling, Design and Operation
          Part A: Unit Operations (I. Pallai and Z. Fonyó, Editors)
          Part B: Systems (I. Pallai and G.E. Veress, Editors)
Volume 6: Neural Networks for Chemical Engineers (A.B. Bulsari, Editor)
Volume 7: Material and Energy Balancing in the Process Industries - From Microscopic Balances to Large Plants (V.V. Veverka and F. Madron)
Volume 8: European Symposium on Computer Aided Process Engineering-10 (S. Pierucci, Editor)
Volume 9: European Symposium on Computer Aided Process Engineering-11 (R. Gani and S.B. Jorgensen, Editors)
Volume 10: European Symposium on Computer Aided Process Engineering-12 (J. Grievink and J. van Schijndel, Editors)
Volume 11: Software Architectures and Tools for Computer Aided Process Engineering (B. Braunschweig and R. Gani, Editors)
Volume 12: Computer Aided Molecular Design: Theory and Practice (L.E.K. Achenie, R. Gani and V. Venkatasubramanian, Editors)
COMPUTER-AIDED CHEMICAL ENGINEERING, 12
COMPUTER AIDED MOLECULAR DESIGN: THEORY AND PRACTICE
Edited by
Luke E.K. Achenie
Computer Aided Process and Product Design Lab, Department of Chemical Engineering, University of Connecticut, 191 Auditorium Road, Storrs, CT 06269, USA
Rafiqul Gani
CAPEC, Technical University of Denmark Department of Chemical Engineering Building 229, DK-2800 Lyngby, Denmark
Venkat Venkatasubramanian
Laboratory of Intelligent Process Systems, School of Chemical Engineering, Purdue University, West Lafayette, IN 47907-1283, USA
2003
ELSEVIER
Amsterdam - Boston - London - New York - Oxford - Paris - San Diego - San Francisco - Singapore - Sydney - Tokyo
ELSEVIER SCIENCE B.V.
Sara Burgerhartstraat 25, P.O. Box 211, 1000 AE Amsterdam, The Netherlands

© 2003 Elsevier Science B.V. All rights reserved.

This work is protected under copyright by Elsevier Science, and the following terms and conditions apply to its use:

Photocopying
Single photocopies of single chapters may be made for personal use as allowed by national copyright laws. Permission of the Publisher and payment of a fee is required for all other photocopying, including multiple or systematic copying, copying for advertising or promotional purposes, resale, and all forms of document delivery. Special rates are available for educational institutions that wish to make photocopies for non-profit educational classroom use. Permissions may be sought directly from Elsevier Science via their homepage (http://www.elsevier.com) by selecting 'Customer support' and then 'Permissions'. Alternatively you can send an e-mail to: permissions@elsevier.com, or fax to: (+44) 1865 853333. In the USA, users may clear permissions and make payments through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA; phone: (+1) (978) 7508400, fax: (+1) (978) 7504744, and in the UK through the Copyright Licensing Agency Rapid Clearance Service (CLARCS), 90 Tottenham Court Road, London W1P 0LP, UK; phone: (+44) 207 631 5555; fax: (+44) 207 631 5500. Other countries may have a local reprographic rights agency for payments.

Derivative Works
Tables of contents may be reproduced for internal circulation, but permission of Elsevier Science is required for external resale or distribution of such material. Permission of the Publisher is required for all other derivative works, including compilations and translations.

Electronic Storage or Usage
Permission of the Publisher is required to store or use electronically any material contained in this work, including any chapter or part of a chapter. Except as outlined above, no part of this work may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission of the Publisher. Address permissions requests to: Elsevier Science Global Rights Department, at the fax and e-mail addresses noted above.

Notice
No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.
First edition 2003 Library of Congress Cataloging in Publication Data A catalog record from the Library of Congress has been applied for.
British Library Cataloguing in Publication Data A catalogue record from the British Library has been applied for.
ISBN: 0-444-51283-7
ISSN: 1570-7946 (Series)
The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper). Printed in The Netherlands.
Preface

CAMD, or Computer Aided Molecular Design, refers to the design of molecules with desirable properties. That is, through CAMD, one determines molecules that match a specified set of (target) properties. CAMD as a technique has very large potential since, in principle, all kinds of chemical, bio-chemical and material products can be designed through it. It has become a mature technique that is attracting more and more researchers and finding increasing industrial applications. The limitation, at this moment, is the ability to estimate the target properties of the desired molecule. The book mainly deals with macroscopic properties and therefore does not cover the molecular design of large, complex chemicals such as drugs. The methodology presented, however, would be applicable to such problems provided the higher level molecular structural representation is integrated with appropriate molecular structure-property relationships.

While books have been written on computer aided molecular design related to drugs and large complex chemicals, a book on the systematic formulation of CAMD problems and solutions, with an emphasis on theory and practice, which would help one to learn, understand and apply the technique, is currently unavailable. With this book, we have tried to put together the theoretical aspects related to CAMD, the different techniques that have been developed and the different applications that have been reported. We have highlighted the applications through case studies.

We have grouped the chapters of this book into 3 parts - Part I: Theory, Methods & Tools; Part II: Applications & Practice of CAMD; and Part III: New Frontiers. Problem formulation and solution techniques are covered in Part I by chapters 1-7. Applications and practice of CAMD in different types of problems are highlighted in chapters 8-15 of Part II, together with descriptions of case study problems and their solution.
Each case study highlights the application of specific CAMD techniques. Part III contains a single chapter (16) where we highlight the new frontiers (in our view) and the future of CAMD.

We have targeted a mixed audience in this book. Specifically, we have designed the book for scientists and engineers from industry who would like to apply CAMD to solve their specific problems of interest. It is also designed for educators from academia who would like to use it for teaching as part of process/product design courses (including such courses as separation processes). The book would be of interest to scientists and engineers who would like to learn more about CAMD in addition to CAMD problem solutions. Finally, this book is intended for those who would like to use it as the starting point to further develop and extend the state of the art in CAMD.

We would like to thank all the contributing authors for their manuscripts and for agreeing to make the necessary changes to accommodate the content, format and style of this book. The contributing authors to the various chapters of this book come from academia as well as industry. They are among the leading researchers, developers and users of CAMD. We hope the book will serve to promote further development of CAMD and further interest from industry in applying CAMD.

We thank the reviewers for their valuable comments and suggestions. We thank Elsevier for their interest in this subject and for publishing this book. We acknowledge the support, help and contribution of Prasanjeet Ghosh, Santhoji Katare, Mette Dinsen and all our previous students and coworkers who have contributed to the development of CAMD in general and the preparation of this book in particular. We also thank all the companies who have shown interest in CAMD and supported our research in this area.

We hope the readers of this book will find it an invaluable resource in their research, development and educational activities. We also hope that the book will generate enough interest and valuable feedback for future editions.
Luke E. K. Achenie, Rafiqul Gani & Venkat Venkatasubramanian
List of contributors

Author: Address

L. E. K. Achenie: University of Connecticut, Department of Chemical Engineering, 191 Auditorium Road, Storrs, CT 06269, USA.
C. S. Adjiman: Department of Chemical Engineering and Chemical Technology, Centre for Process Systems Engineering, Imperial College of Science, Technology and Medicine, Prince Consort Road, London SW7 2BY, UK.
A. Apostolakou: Department of Chemical Engineering and Chemical Technology, Centre for Process Systems Engineering, Imperial College of Science, Technology and Medicine, Prince Consort Road, London SW7 2BY, UK.
E. A. Brignole: Planta Piloto de Ingenieria Quimica-PLAPIQUI (UNS-CONICET), Camino La Carrindanga Km 7, 8000, Bahia Blanca, Argentina.
A. Buxton: Department of Chemical Engineering and Chemical Technology, Centre for Process Systems Engineering, Imperial College of Science, Technology and Medicine, Prince Consort Road, London SW7 2BY, UK.
J. M. Caruthers: Laboratory for Intelligent Process Systems, School of Chemical Engineering, Purdue University, West Lafayette, IN 47907, USA.
M. Cismondi: Planta Piloto de Ingenieria Quimica-PLAPIQUI (UNS-CONICET), Camino La Carrindanga Km 7, 8000, Bahia Blanca, Argentina.
J. L. Cordiner: Syngenta, Global Specialist Technology, Grangemouth Manufacturing Centre, Earls Road, Grangemouth, Stirlingshire, FK3 8XG, United Kingdom.
R. Gani: CAPEC, Technical University of Denmark, Department of Chemical Engineering, Building 229, DK-2800 Lyngby, Denmark.
P. M. Harper: Integrated Process Solutions ApS, Solvgade 14B, 1307 Copenhagen K, Denmark.
M. Hostrup: Integrated Process Solutions ApS, Solvgade 14B, 1307 Copenhagen K, Denmark.
A. Hugo: Department of Chemical Engineering and Chemical Technology, Centre for Process Systems Engineering, Imperial College of Science, Technology and Medicine, Prince Consort Road, London SW7 2BY, UK.
A. G. Livingston: Department of Chemical Engineering and Chemical Technology, Centre for Process Systems Engineering, Imperial College of Science, Technology and Medicine, Prince Consort Road, London SW7 2BY, UK.
G. M. Ostrovski: University of Connecticut, Department of Chemical Engineering, 191 Auditorium Road, Storrs, CT 06269, USA.
P. Patkar: Laboratory for Intelligent Process Systems, School of Chemical Engineering, Purdue University, West Lafayette, IN 47907, USA.
E. N. Pistikopoulos: Department of Chemical Engineering and Chemical Technology, Centre for Process Systems Engineering, Imperial College of Science, Technology and Medicine, Prince Consort Road, London SW7 2BY, UK.
M. Sinha: Global Alternative Propulsion Center, General Motors, Honeoye Falls, NY 14472, USA.
A. Sundaram: ExxonMobil Process Research, Paulsboro, NJ 08066, USA.
V. Venkatasubramanian: Laboratory for Intelligent Process Systems, School of Chemical Engineering, Purdue University, West Lafayette, IN 47907, USA.
J. M. Vinson: Pharmacia Corporation, 5200 Old Orchard Rd., Skokie, IL 60077, USA.
Contents

Preface
List of contributors

PART I: Theory, Methods & Tools
1. Introduction to CAMD (R. Gani, L. E. K. Achenie and V. Venkatasubramanian)
2. Molecular Design - Generation & Test Methods (E. A. Brignole and M. Cismondi)
3. Optimization Methods in CAMD - I (M. Sinha, L. E. K. Achenie and G. M. Ostrovski)
4. Optimization Methods in CAMD - II (A. Apostolakou and C. S. Adjiman)
5. Genetic Algorithms Based CAMD (P. R. Patkar and V. Venkatasubramanian)
6. A Hybrid CAMD Method (P. M. Harper, M. Hostrup and R. Gani)
7. Identification of Multistep Reaction Stoichiometries: CAMD Problem Formulation (A. Buxton, A. Hugo, A. G. Livingston and E. N. Pistikopoulos)

PART II: Applications of CAMD
8. CAMD for Solvent Selection in Industry - I (J. M. Vinson)
9. CAMD for Solvent Selection in Industry - II (J. L. Cordiner)
10. Case Study in Optimal Solvent Design (M. Sinha, L. E. K. Achenie and G. M. Ostrovski)
11. CAMD in Solvent Mixture Design (M. Sinha and L. E. K. Achenie)
12. Refrigerant Design Case Study (A. Apostolakou and C. S. Adjiman)
13. Polymer Design Case Study (P. R. Patkar and V. Venkatasubramanian)
14. Case Study in Identification of Multistep Reaction Stoichiometries (A. Buxton, A. Hugo, A. G. Livingston and E. N. Pistikopoulos)
15. Molecular Design of Fuel Additives (A. Sundaram, V. Venkatasubramanian and J. M. Caruthers)

PART III: Computer Aided Product Design
16. Challenges and Opportunities for CAMD (R. Gani, L. E. K. Achenie and V. Venkatasubramanian)

Glossary of Terms
Subject Index
Author Index
Part I: Theory, Methods & Tools

This part of the book covers problem formulation and solution techniques. The first chapter introduces the computer aided molecular design (CAMD) problem and discusses its important issues. Chapters 2 to 7 then deal with some of the common techniques used to tackle various types of CAMD problems. Specifically, the second chapter discusses methods based on a generate-and-test approach, followed by two chapters on optimization methods involving mathematical programming. Evolutionary techniques based on genetic algorithms are presented next in chapter 5, while chapter 6 describes a hybrid CAMD method. Finally, the first part of the book concludes with chapter 7, where CAMD in the identification of multistep reaction stoichiometries is presented.
Chapter 1: Introduction to CAMD
R. Gani, L.E.K. Achenie & V. Venkatasubramanian
In (chemical) product design, we try to find a (chemical) product that exhibits certain desirable or specified behaviour. In another type of (chemical) product design, we try to find an additive that, when added to another chemical or non-chemical product, enhances its (desirable) functional properties. This type of product is commonly known as a formulation. That is, in (chemical) product design, we do not know the identity of the final product but we have some idea of how we want it to behave, and the problem is to find the most appropriate chemical(s) that will exhibit and/or cause the desired behaviour. Once we have identified the product and tested it, we need to determine whether it can also be manufactured. That is, we need to design a (chemical) process through which we can manufacture the desired product with profit, increased operational efficiency and positive environmental, health and safety impact. Before we can do this, however, we also need to determine the likely raw materials (which could also be other chemical products) that can be processed in order to manufacture the desired product. That is, we extend the problem boundary of process design at the start, by determining the product that we would like to manufacture, and at the end, by analysing the effect of the product and its manufacture on the environment.
1.1 WHAT IS CAMD?

The design process for a chemical product involves a number of steps through which scientific principles may be applied for the solution of the specified design problem. Cussler and Moggridge (2001) suggest four principal steps in their design process:

1. Define needs;
2. Generate ideas to meet needs;
3. Select among ideas;
4. Manufacture product.
As illustrated in Fig. 1, the 2nd and 3rd steps, considered together, represent two types of design problems, namely Molecular Design and Mixture/Blend Design. The 1st step may be considered as a pre-design or problem formulation step, while the last step may be considered as part of a process design problem. The molecular and mixture/blend design problems can be solved independently of the process design problem or as an integrated product-process design problem.
[Figure 1 (diagram): Process-Product Design, comprising Pre-Design ("define needs & goals"), Product Design by CAMD and CAMbD ("generate & select alternatives") and Process Design ("manufacture & test product").]

Figure 1: Steps of the design process related to product design.

For the solution of the molecular and mixture/blend design problems, various approaches, ranging from empirical trial and error approaches to mathematical programming to hybrid methods, can be applied as the solution technique. The applicability of a particular solution technique depends, to a large extent, on the approach used to determine the target behaviour (properties) of the desired products. If appropriate property models do not exist, an empirical trial and error approach based on experimentation, although not the most efficient, is usually the only option. If property models are available, computer aided methodologies become viable alternatives. That is, the molecular design problem is transformed into a computer aided molecular design (CAMD) problem, while the mixture/blend design problem is transformed into a computer aided mixture/blend design (CAMbD) problem, through the use of property models as part of a computer aided methodology. CAMD and CAMbD together may be called computer aided product design (CAPD). Unless specifically mentioned, in this book the term CAMD will be used for molecular design as well as mixture/blend design. Likewise, the term product will be used to include single molecules as well as mixtures.

1.1.1 Problem Definition

Computer aided molecular design problems are defined as
Given a set of building blocks and a specified set of target properties, determine the molecule or molecular structure that matches these properties.
In this respect, it is the reverse of the property prediction problem, where, given the identity of the molecule and/or the molecular structure, a set of target properties is calculated. CAMD may be performed at various levels of size and complexity of molecular structure representation. For example, the design of solvents, refrigerants, etc., is usually based on properties estimated from macroscopic structural information. In the design of structured products such as polymers, drugs, pesticides, food additives, etc., the structural differences are observed by employing meso- and/or microscopic representations of the molecular structure. Therefore, the property models and the molecular structural representation differ according to the type of molecules being designed.

Computer aided mixture/blend design problems can be defined as

Given a set of chemicals and a specified set of property constraints, determine the optimal mixture and/or blend.
Here, we do not know which chemicals to use in the product and in what amounts they should be present, but we know the molecular structures of the candidate chemicals. The design of formulated products and blends are typical examples of mixture design. Here, a formulation (representing a mixture or blend) is added to a product in order to enhance one or more specified properties of the original product. For example, a specified property (such as the viscosity of a product) needs to increase by an order of magnitude when the formulation (also known as ingredient or additive) is added. In other cases, a mixture or blend having a specified set of target properties is the desired product, as in polymer blends, petroleum blends, solvent blends, edible oil blends and many more.

The fundamental objective of CAMD, therefore, is to identify a compound or a collection of compounds having specific (desired) properties. The structures of the compounds (molecules) are represented using appropriate descriptors together with an algorithm that identifies these descriptors. This means that the property evaluation methods should be based on these descriptors as well. The most common approach in CAMD is to generate chemically feasible molecular structures from a set of descriptors (represented by fragments or building blocks) and to test them by estimating their desired (specified) properties. The properties are estimated by using some kind of fragment-based methodology, where the contributions to a specific property of each fragment present in the molecule are added to determine the property value of the compound. The set of feasible compounds is identified as those that match the property specifications, given as a series of property constraints. The optimal compound is identified from the set of feasible compounds through a problem specific selection criterion or objective function. The principal differences between the various CAMD methodologies are how the various steps are performed, the type of descriptors used and how the necessary property values are obtained.
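The generate-and-test procedure described above can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration only: the three building blocks, their contribution values and the constant `BASE` are hypothetical numbers (not fitted parameters), and the valence rule used for acyclic structures is one commonly used chemical feasibility check rather than a rule stated in this chapter.

```python
from itertools import product

# Hypothetical building blocks: name -> (free valence, illustrative contribution).
# The contribution values are invented for this sketch, not fitted parameters.
GROUPS = {"CH3": (1, 20.0), "CH2": (2, 25.0), "OH": (1, 90.0)}
BASE = 200.0  # hypothetical constant term of the additive property model


def is_acyclic_feasible(counts):
    """Assumed valence rule for acyclic structures: sum n_i * (2 - v_i) == 2."""
    total = sum(n * (2 - GROUPS[g][0]) for g, n in counts.items())
    return sum(counts.values()) >= 2 and total == 2


def estimate_property(counts):
    """Additive (fragment-based) estimate: constant plus summed contributions."""
    return BASE + sum(n * GROUPS[g][1] for g, n in counts.items())


def generate_and_test(max_per_group, lower, upper, target):
    """Generate group combinations, keep feasible ones, rank by distance to target."""
    names = list(GROUPS)
    feasible = []
    for combo in product(range(max_per_group + 1), repeat=len(names)):
        counts = dict(zip(names, combo))
        if not is_acyclic_feasible(counts):
            continue  # structurally infeasible: discard before property testing
        value = estimate_property(counts)
        if lower <= value <= upper:  # property constraint with bounds
            feasible.append((counts, value))
    # Selection criterion: closeness of the estimated property to the target.
    feasible.sort(key=lambda item: abs(item[1] - target))
    return feasible


if __name__ == "__main__":
    for counts, value in generate_and_test(2, 300.0, 370.0, 340.0):
        print(counts, value)
```

The sketch works with group counts only, so it cannot distinguish isomers; a structure-aware generator would be needed for that.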
1.1.2 Formulation of Property Constraints

The formulation of the property constraints is a prerequisite for solving any CAMD problem. A set of properties is selected as constraints with some combination of specified goal values, lower bounds and upper bounds. These represent explicit property constraints because their values can be determined directly through a model or measured experimentally. There are, however, desired properties involving products such as food, fragrances, health & safety, etc., that may need to be formulated implicitly. That is, they cannot be measured or predicted by a model directly but may be inferred through databases, past knowledge, other measured or predicted properties and so on. For example, the taste of a food product, the aroma of fragrances, the health hazards of chemicals, etc., fall under implicit property constraints. Environmental considerations can be formulated implicitly or explicitly. Explicit considerations relate physical properties to environmental considerations (e.g. ozone depletion potential), while implicit considerations are realized in the selection of the types of compounds considered in the search/design phase (e.g. the exclusion of aromatic compounds). The following questions help to define the constraints; note that these are not the only questions that will help to define the problem completely.
What function is the desired product supposed to perform?

These functions could be related only to the use of the product on a standalone basis or they could be included as part of some greater functionality that the product may be asked to provide in conjunction with other materials. Examples of the former are a solvent, a refrigerant and a polymer, while examples of the latter are a solvent blend added to a paint, an ingredient added to a food product to make it fat-free, and an ingredient added to a drug to inhibit a specific biological function.
Is the product a replacement for another product?

If yes, the designed product should do some combination of the following: (a) match a set of properties, (b) match or surpass a set of properties of the original product and (c) avoid a third set of properties. This can be the replacement of one synthesized chemical product with another as well as the replacement of a natural product with a synthesized one (for example, synthetic rubber).
Are there any operational limits (temperature, pressure and phase) for the desired product? If yes, what are these?

The operational limits help define the upper and lower limits of the constraints on the phase and the phase transition related properties.

What criteria should be used to evaluate the performance of the desired product?

The performance criteria are related to the function of the desired product in the process operation for which it is designed, which helps to define the objective function for optimization based CAMD. For example, for a solvent in solvent based separations, these criteria often degenerate into bound constraints: usually lower bounds on selectivity, lower bounds on distribution coefficient, upper bounds on solvent loss and many more. In the case of formulations, the ingredient needs to be tested for the enhanced performance of the original product, such as controlled release, improved inhibition, etc., of drugs. Models for the evaluation of performance, however, may not be easy to develop and are likely to be very complex.

Are there any downstream processing considerations?
The role of the designed product in downstream processing, such as solvent recovery, wastewater treatment and disposal, needs to be considered. These considerations may be included as direct property constraints, if feasible. However, since they depend on the process, the product and process design problems may alternatively be integrated to handle these constraints together with other process design issues.

The following provides a generic mathematical programming representation of most CAMD problems.

\begin{align}
F_{OBJ} &= \max \{ C^{T} y + f(x) \} \tag{1} \\
\text{s.t.} \quad h_1(x) &= 0 && \text{(process design specs)} \tag{2} \\
h_2(x) &= 0 && \text{(process model equations)} \tag{3} \\
h_3(x) &= 0 && \text{(CAMD specifications)} \tag{4} \\
l_1 \le g_1(x) &\le u_1 && \text{(process design constraints)} \tag{5} \\
l_2 \le g_2(x) &\le u_2 && \text{(CAMD constraints)} \tag{6} \\
l_3 \le B y + C x &\le u_3 && \text{(logical constraints)} \tag{7}
\end{align}
In the above equations, x represents the vector of continuous variables (such as flowrates, mixture compositions, conditions of operation, design variables, etc.), y represents the vector of binary integer variables (such as unit operation identity, descriptor identity, compound identity, etc.), h1(x) represents the set of equality constraints related to process design specifications (such as reflux ratio, operation pressure, heat addition, etc.), h2(x) represents the set of equality constraints related to the process model equations (i.e., mass and energy balance equations), h3(x) represents the set of equality constraints related to CAMD (such as chemical feasibility rules, mixing rules for properties, etc.), g1(x) represents the set of inequality constraints related to process design specifications and g2(x) represents the set of inequality constraints with respect to environmental constraints and property constraints related to CAMD. The binary variables typically appear linearly, as they are included in the objective function term and in the constraints (Eq. 7) to enforce logical conditions. The term f(x) represents a vector of objective functions that may be linear or non-linear depending on the definition of the optimization problem. For process optimisation, f(x) is usually a non-linear function, while for integrated approaches f(x) usually consists of more than one non-linear function. Many variations of the above mathematical formulation may be derived to represent different CAMD problems and methodologies. Some examples are given below.
i) Satisfy only constraint 6. This represents a CAMD problem for which a database search is adequate as a solution methodology.
ii) Ignore the objective function and the constraints represented by Eqs. 2, 3, 5 and 7 and only satisfy constraints 4 and 6. This is a CAMD problem that generates a feasible set of candidates.
iii) Solve a mathematical programming problem that includes Eqs. 1, 4 and 6. This is optimal design of the molecule and/or mixture.
iv) Only satisfy the constraints 2-7. This generates a feasible set of candidates (products and their corresponding process).
v) Solve all the equations. This represents an integrated process-product design problem.
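As a rough illustration of the optimisation-based formulations, the following Python sketch solves a tiny, purely binary instance of Eq. 1 subject to logical constraints of the form of Eq. 7 (with no continuous variables x) by exhaustive enumeration. The function name `solve_binary_program` and all numerical data are assumptions made for this example; problems of realistic size would normally be handed to an MILP or MINLP solver rather than enumerated.

```python
from itertools import product


def solve_binary_program(c, B, lower, upper):
    """Maximise c^T y over binary y subject to lower <= B y <= upper.

    Exhaustive enumeration: only sensible for a handful of binary variables.
    Returns (best_y, best_value), or (None, -inf) if nothing is feasible.
    """
    best_y, best_value = None, float("-inf")
    for y in product((0, 1), repeat=len(c)):
        # Evaluate each row of B y (the logical constraints).
        rows = [sum(b * yi for b, yi in zip(row, y)) for row in B]
        if not all(lo <= r <= hi for lo, r, hi in zip(lower, rows, upper)):
            continue  # violates a logical constraint
        value = sum(ci * yi for ci, yi in zip(c, y))
        if value > best_value:
            best_y, best_value = y, value
    return best_y, best_value


if __name__ == "__main__":
    # Hypothetical data: three candidate compounds with objective coefficients c.
    c = [3, 5, 4]
    # Logical condition "select exactly one compound": 1 <= y1 + y2 + y3 <= 1.
    print(solve_binary_program(c, [[1, 1, 1]], [1], [1]))
```

Here each binary variable stands for the selection of one candidate, which is how compound identity enters the generic formulation as a binary decision.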
Note that for all problem formulations, properties need to be supplied (measured or retrieved from a database) and/or predicted through models. Problems that include Eq. 3 also have property models included as a set of constitutive models that relate the properties to the intensive variables (pressure, temperature and composition). All problem formulations may use property models and therefore the application range of a CAMD methodology depends on the application range of the property models used. Note that in problem formulations i-ii, an optimal design may be obtained by ordering all the feasible candidates according to the objective function (Eq. 1) value. Global optimality, however, can be guaranteed only if all possible compounds were considered in the generation of the feasible set of candidates. On the other hand, problem formulations iii-v may become too complex to solve if the property model is highly non-linear and discontinuous. Also, the solution approach may not be able to accommodate multiple property models for the same property. In this way, while these problem formulations can determine the optimal design, their application range is usually quite small.

Having formulated the property constraints and a version of the generic problem formulation, the next step is to select the property models and/or means to provide the necessary property values.
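Problem formulation i (database search) followed by ordering of the feasible candidates can be sketched in a few lines of Python. The class and function names are hypothetical, the boiling points are approximate, rounded literature values included only to make the example concrete, and the objective function is an arbitrary choice for illustration. The aromatic flag shows how an implicit constraint, such as the exclusion of aromatic compounds, might be applied during the search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    boiling_point_k: float  # approximate normal boiling point in K (illustrative)
    aromatic: bool


# Small illustrative "database"; values are rounded and approximate.
DATABASE = [
    Candidate("water", 373.15, False),
    Candidate("ethanol", 351.4, False),
    Candidate("acetone", 329.2, False),
    Candidate("benzene", 353.2, True),
    Candidate("n-hexane", 341.9, False),
]


def screen(database, lower, upper, exclude_aromatics=True):
    """Keep entries whose property satisfies lower <= value <= upper (as in Eq. 6).

    The optional aromatic exclusion plays the role of an implicit constraint.
    """
    return [
        c for c in database
        if lower <= c.boiling_point_k <= upper
        and not (exclude_aromatics and c.aromatic)
    ]


def rank(candidates, objective):
    """Order feasible candidates by decreasing objective value (as in Eq. 1)."""
    return sorted(candidates, key=objective, reverse=True)


if __name__ == "__main__":
    feasible = screen(DATABASE, 335.0, 360.0)
    # Hypothetical objective: prefer boiling points close to 350 K.
    best_first = rank(feasible, lambda c: -abs(c.boiling_point_k - 350.0))
    print([c.name for c in best_first])
```

The ranking is optimal only with respect to the entries actually present in the database, which mirrors the remark above about global optimality.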
1.1.3 Prediction of Properties

The success of CAMD methodologies depends, to a large extent, on the ability to predict and/or obtain the necessary pure component and mixture properties or, more generally, performance characteristics, included in the property constraints and in the process model. Even if the CAMD problem involves the design of a single molecule, mixture properties may need to be calculated. For example, in solvent design, the property constraints may include pure component properties such as boiling point and heat of vaporization, and mixture properties such as solubility of solute and solvent loss. In CAMbD problems, the property constraints are all mixture property based; however, the models for these mixture properties may require pure component properties. Consequently, the pure component properties may be used to screen out some of the candidate molecules to be considered in the mixture design problem.

A wide range of property models can be found (Poling et al. 2000). The main question is which model has the largest reliable application range for the descriptors used to represent the molecular structures. For instance, if the descriptors employed for molecular structural representation are able to identify differences in isomer structures, then the property model must also be able to predict the property differences (if any) of these isomers. Otherwise, all isomers would be selected as feasible.

Gani and Constantinou (1995) proposed a classification of properties as primary (pure component properties that can be determined only from the molecular structural variables; examples are critical properties, normal boiling point, normal melting point, heat of vaporization at 298 K, heat of fusion at 298 K, etc.), secondary (pure component properties that are dependent on other properties; examples are surface tension, viscosity, solubility parameter, vapor pressure at a given temperature, density at a given temperature, etc.) and functional (pure component properties dependent on temperature and/or pressure; examples are density, vapor pressure, enthalpy, heat of vaporization, etc., as a function of temperature; and mixture properties that are dependent on composition and/or temperature & pressure; examples are liquid phase activities, vapor phase fugacities, phase density, mixture viscosity, mixture saturation temperature, etc.).

For several material design applications of interest, the desired properties are even more complex, high-level performance characteristics that are to be satisfied by the material during its active service life. These performance measures are usually very difficult to predict using standard property-prediction models. Sophisticated models, usually hybrids of different approaches, need to be constructed. Examples of such systems or properties include reaction systems (i.e. where the final desired performance may come into play only at the end of chemical or biological reactions), long-term mechanical properties, biological functionalities, etc. Further, several of these performance measures are dynamic, i.e., time-evolving. In such cases, not only is the value of a particular high-level property at the start of the active service life of the material important, but also, and usually more critically, its evolution profile throughout the period of service.

Gani and Constantinou (1995) also propose a classification of property models that may be employed for each class of properties. Figure 2 highlights this classification.
Figure 2: Classification of property estimation methods
[Figure labels: Classification of Estimation Methods; Reference; Mechanical models (Quantum Mechanics, Molecular Mechanics, Molecular Simulation); Semi-empirical models (Corresponding States Theory, Topology / Geometry, Group / Atom / Bond additivity); Empirical models (Chemometrics, Pattern matching, Factor analysis, QSAR)]

Estimation of primary pure component properties
While there are numerous property estimation methods for primary pure component properties, not all of them are applicable in CAMD. Most property estimation methods used in CAMD methodologies are based on the Group Contribution Approach (GCA) (Franklin, 1949), where the properties of a compound are expressed in terms of functions of the number of occurrences of predefined fragments (groups) in the molecule. The GCA-based methods belong to a class known as additive methods.

$$ F(p) = w_1 \sum_i N_i C_i + w_2 \sum_j M_j D_j + w_3 \sum_k O_k E_k + \cdots \qquad (8) $$
In the above equation, C_i is the contribution of atom, bond or first-order group i; N_i is the number of occurrences of atom, bond or first-order group i; D_j is the contribution of atom, bond or second-order group j; M_j is the number of occurrences of atom, bond or second-order group j; E_k is the contribution of atom, bond or third-order group k; O_k is the number of occurrences of atom, bond or third-order group k; and w_1, w_2 and w_3 are weights that may be imposed on each of the additive terms. With this method, if the fragments (atoms, bonds, groups, etc.) representing each molecule are identified and their contributions to a needed property are available, then the corresponding property of the molecule can be estimated by simply summing all the contributions. Since the same fragments can be used to represent different molecules, these property estimation methods, although semi-empirical in nature, are also truly predictive. Note that methods of this type consider only the number of occurrences of the atoms and bonds, not their placement. The limitations of these methods are their accuracy and their ability to handle complex molecular structures. In principle, these methods can be made highly accurate, with a large application range, by simply adding more additive terms of higher order. From a practical point of view, however, this is not feasible, and the highest order used in this type of method is three (Marrero and Gani, 2001). Second- and third-order additive methods are able to distinguish some isomeric molecular structures.

Methods based on topological or geometric information provide a higher level of molecular representation. The methods based on topological information related to the molecular structure commonly employ the well-known connectivity index (Kier and Hall, 1986; Bicerano, 1993), while methods based on geometric information employ conjugates (Constantinou et al. 1994).
Connectivity indices specify the spatial arrangement of the atoms in the molecule, while conjugation (with respect to molecular structures) refers to an idealized arrangement of atoms connected by bonds (Constantinou et al. 1994). Any property p is estimated through Eq. 9 (connectivity index) or Eq. 10 (conjugation).

$$ F(p) = a \chi^{0} + b \chi^{1} + c \chi^{2} + d \chi^{3} + \cdots \qquad (9) $$

$$ F(p) = \sum_i N_i B_i + \sum_j M_j E_j \qquad (10) $$

In Eq. 9, \chi^{n} is the connectivity index of order n, and a, b, c and d are the adjustable parameters. In Eq. 10, B_i is the contribution of bond i; N_i is the number of occurrences of bond i; E_j is the contribution of bond j; M_j is the number of occurrences of bond j. The main computational effort is spent on generating the connectivity indices or conjugates representing a molecular structure. Once these are known, the property estimation phase is simple and computationally inexpensive. Like the additive methods, these methods are also predictive. Another advantage of these methods is that the indices and/or conjugates may be used to generate the fragments for
the additive methods. In this way, they use more structural information than the additive methods and are therefore able to distinguish more isomeric structures. The main difficulty is to know how many indices should be used and how to estimate their property contributions.

The topological information based methods are also classified under QSPR (Quantitative Structure Property Relationship) or QSAR (Quantitative Structure Activity Relationship) methods. Many QSPR and QSAR methods base the prediction of properties on the structure of the molecule using complex descriptors obtained from molecular modeling. CAMD methodologies dealing with meso- and microscopic representation of the molecular structures employ such descriptors to identify the differences in the molecular structures as well as to estimate the needed properties. While these property models are able to employ complex descriptors and to distinguish between isomeric structures, their application range outside the training set of molecules may be questionable. Therefore, they are more suitable for use in CAMD problem formulations of types i & ii but are able to handle large, complex molecules. More details on QSPR and QSAR methods can be found in Kier and Hall (1986) and Livingstone (2001).

Estimation of secondary pure component properties

The best source of methods for this type of property is the book by Poling et al. (2000), which gives a comprehensive overview of the properties and the corresponding property models that may be used. Therefore, these methods are not covered in this book. It should be noted, however, that many of the secondary properties that are calculated from primary properties might also be converted to primary properties. For example, the Hansen solubility parameters are estimated from known values of molar volumes and heats of vaporization at 298 K. The solubility parameter data can therefore also be correlated through a set of groups or topological indices to generate a primary property model. In a similar way, properties such as octanol-water partition coefficients and water solubilities may also be converted to primary pure component properties. Since the primary pure component properties are functions only of the molecular structural variables, they are very useful in CAMD problem solution.
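To make the additive form of Eq. 8 concrete, the following Python fragment is a minimal sketch that sums group contributions weighted by their occurrence counts, one dictionary per order of approximation. The function name, the data layout and all contribution values are hypothetical placeholders chosen for illustration; they are not parameters from any published group contribution table.

```python
from typing import Dict, Sequence


def additive_property(counts_by_order: Sequence[Dict[str, int]],
                      contribs_by_order: Sequence[Dict[str, float]],
                      weights: Sequence[float]) -> float:
    """Evaluate F(p) = sum over orders of w * sum(N_g * C_g), as in Eq. 8."""
    total = 0.0
    for counts, contribs, w in zip(counts_by_order, contribs_by_order, weights):
        # Inner sum: occurrences times contribution for every fragment of this order.
        total += w * sum(n * contribs[g] for g, n in counts.items())
    return total


# Hypothetical first-order contributions (illustrative numbers only).
first_order = {"CH3": 1.0, "CH2": 0.5, "CH2COO": 3.0}
# An ester-like group vector: 2 CH3, 1 CH2, 1 CH2COO.
groups = {"CH3": 2, "CH2": 1, "CH2COO": 1}

f_p = additive_property([groups], [first_order], [1.0])
print(f_p)  # 2*1.0 + 1*0.5 + 1*3.0 = 5.5
```

Because only occurrence counts enter a first-order sum, every isomer sharing this group vector receives the same estimate; distinguishing them would require second- or third-order terms supplied as further entries in the three lists.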
Estimation of mixture properties

The simplest and easiest, but usually the least accurate, way is to assume mixture ideality and employ a simple linear mixing rule.

$$ F(\theta) = \sum_i x_i p_i \qquad (11) $$

In the above equation, F(\theta) is a property function for mixture property \theta; x_i is the composition of component i and p_i is the corresponding pure component property of \theta for component i. If the assumption of mixture ideality is valid, this method is fast, easy and very convenient for use in CAMD problem formulations of types iii-v. Most practical problems, however, do not behave ideally and therefore more rigorous models are needed. Since CAMD methodologies generate molecular structures, and therefore work with molecular structural parameters, models that do not employ such parameters are not suitable. Examples of these models are NRTL (Renon and Prausnitz, 1968) and Wilson (Wilson, 1964), which need compound-specific, predetermined molecular interaction parameters for estimation of liquid phase activity coefficients.

The most widely used mixture properties in many CAMD applications are the liquid phase activity coefficients, because they may be used for estimating solubility (solid, liquid or gas), phase equilibrium (considering the other phase in equilibrium with the liquid to be ideal), liquid surface tension, liquid viscosity, bulk properties such as saturation temperatures and pressures, and many more. GCA-based methods are the only practical choices in this case, since the topological information based methods have not been developed for general purpose use and molecular modeling based methods are too complex for use in CAMD problem formulations of types ii-v. The GCA-based method for prediction of liquid phase activity coefficients that is most widely used in CAMD methodologies is the UNIFAC method (Fredenslund et al., 1977), in its original form or in its various modifications. A major limitation of the UNIFAC method with its original set of first-order groups is that it cannot handle complex mixture nonideality (such as proximity effects) and it cannot distinguish between isomers.
Some of these limitations have been addressed recently through the introduction of second-order groups (Kang et al. 2002). Another important limitation of UNIFAC and all other GCA-based mixture property models is that the necessary group interaction parameters may not be available for the generated feasible candidate molecules. Molecular modeling in this respect can help to predict the necessary group interactions (Jonsdottir et al. 1994). For CAMD involving large, complex molecules and mixture properties, problem formulations of type i-ii are feasible options as they allow the use of sequential generation of feasible candidate molecules and testing of candidates. In this case, any number of property models may be used. While this is not a computationally efficient procedure, it is able to provide a means to identify promising candidates, at least, as a first step of the search.
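The ideal mixing rule of Eq. 11 is simple enough to sketch in a few lines of Python. The function name and the numerical values below are illustrative assumptions only; the sketch applies solely where mixture ideality can be assumed, and it does not replace an activity coefficient model such as UNIFAC.

```python
from typing import Sequence


def ideal_mixture_property(x: Sequence[float], p: Sequence[float]) -> float:
    """Linear mixing rule of Eq. 11: sum of x_i * p_i over all components."""
    if len(x) != len(p):
        raise ValueError("one pure component value is needed per composition")
    if abs(sum(x) - 1.0) > 1e-9:
        raise ValueError("compositions must sum to one")
    return sum(xi * pi for xi, pi in zip(x, p))


# Hypothetical binary mixture: compositions and pure component property values.
theta = ideal_mixture_property([0.25, 0.75], [80.0, 120.0])
print(theta)  # 0.25*80 + 0.75*120 = 110.0
```

The composition check is a design choice of this sketch: it rejects un-normalized input early rather than silently scaling the result.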
Estimation of environmental, implicit and high-level properties

Environmental and other implicit properties need special attention since they do not usually belong to the standard databases for properties of chemical compounds. For the estimation of environmental properties, such as toxicity, biodegradability, ozone depletion potential, biological oxygen demand, global warming potential and soil adsorption potential, very few general methods covering a wide range of compounds have been developed, although new methods are continuously being developed (Martin and Young 2001). However, a number of methods valid for specific molecular types such as alcohols, acids and benzene derivatives are available (Lyman et al., 1990). These methods are capable of predicting many of the environmental properties listed above. Often, methods for environmental properties rely on the octanol/water partition coefficient (log P) as a known property value. Databases such as CHRIS (Silver Platter Information Inc., 1998a), HSDB (Silver Platter Information Inc., 1998b) and RTECS (Silver Platter Information Inc., 1998c) store environmental data and properties for a large number of substances.

The more difficult properties are high-level performance characteristics desired of the material. Examples of these include properties related to the taste of food products, the aroma of fragrances, long-term mechanical properties of polymers and polymer blends, and many more. What often makes the modeling process even more challenging is that several of these properties of interest are dynamic, and the design objectives are specified in terms of the time-evolution profile of the property in question throughout the service time of the material. Some of these may be estimated through a combination of higher-level modeling and theory, such as molecular modeling combined with kinetic phenomena (in the case of polymer blends with desired properties), while others may be implied through QSAR types of investigations.
Typically, highly sophisticated hybrid approaches that make use of a variety of modeling techniques need to be employed to model the high-level properties to desired levels of prediction accuracy (Ghosh et al., 2000). Having the necessary property models available brings us to the next topic - the actual CAMD algorithm.
1.1.4 CAMD algorithms

The CAMD algorithm basically solves the CAMD problem formulations of types i-v and other variations of the generic problem defined by Eqs. 1-7. The main solution step involves finding the molecules of the desired type having the desired properties. Here, a difference is made between those problems that involve only selection (type i and some variations of type ii) and those that involve selection plus design (types ii-v).

If the problem is of the selection type (i.e. finding candidates from a database of known compounds), the solution step involves one or more database lookup operations in order to identify the subset (if any) satisfying the property and molecule type constraints. For selection based on pure component properties, the search engine is commonly known as pattern matching (Nielsen et al., 1991), that is, finding the specified pattern in a database. If mixture properties are also considered, the search is more difficult. Cabezas (2000) has developed tools for efficiently solving these problems.

If the CAMD problem formulation is of types ii-v, an algorithm is needed to identify (design) the molecules of the specified types having the desired properties as specified through the property constraints. Even though different algorithms have been proposed for the design of molecules, nearly all of them rely, to some degree, on the creation of chemically feasible molecules from fragments. The most widely used feasibility criterion is the valency rule proposed by Macchietto et al. (1990), where the goal is to guarantee the fulfillment of the octet rule. Different approaches have been proposed for solving CAMD problems, and these approaches can be grouped into three categories:

1. Mathematical programming (a mathematical representation of the problem is solved with a numerical optimization method) - problem types iii-v. Chapters 3, 4 and 11 describe these types of solution approaches.
2. Stochastic optimization (a mathematical representation of the problem is solved by numerical stochastic methods) - problem types ii-iii. Chapter 5 describes a genetic algorithm based solution approach of this type.
3. Enumeration techniques (a combined mathematical and qualitative representation of the problem is solved by hybrid solution approaches) - problem types ii-v, but using a decomposed problem formulation (also called hybrid methods). Chapters 2, 6 and 7 describe solution approaches of this type.
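For the selection-type problem described above, the database lookup can be sketched as a filter over tabulated properties. The Python fragment below is only an illustration under stated assumptions: the record layout, the property keys ("Tb", "Tm"), the compound labels and all numerical values are hypothetical, and a compound with a missing property value is rejected because the constraint cannot be confirmed.

```python
from typing import Dict, List, Optional, Tuple

# For each property: (lower bound, upper bound); None means "unbounded".
Bounds = Dict[str, Tuple[Optional[float], Optional[float]]]


def screen(database: Dict[str, Dict[str, float]], bounds: Bounds) -> List[str]:
    """Return the names of compounds whose properties satisfy every bound."""
    hits = []
    for name, props in database.items():
        feasible = True
        for prop, (lo, hi) in bounds.items():
            value = props.get(prop)
            if (value is None
                    or (lo is not None and value < lo)
                    or (hi is not None and value > hi)):
                feasible = False
                break
        if feasible:
            hits.append(name)
    return hits


# Hypothetical records; the numbers are placeholders, not measured data.
database = {
    "A": {"Tb": 340.0, "Tm": 250.0},
    "B": {"Tb": 420.0, "Tm": 300.0},
    "C": {"Tb": 360.0},  # Tm missing, so the Tm constraint cannot be checked
}
bounds = {"Tb": (330.0, 400.0), "Tm": (None, 280.0)}

print(screen(database, bounds))  # ['A']
```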
Common to all the solution approaches is that the objective is to find a compound or compounds fulfilling the requirements set forth in the constraints and goals.

1.1.5 Molecular Structure Representation
All CAMD methodologies need to employ some form of representation of the molecular structure information for use in property estimation. In general, the estimation methods used for predicting properties of the designed molecule(s) decide the level of detail needed for the molecular structural information and the representation method to use. Other considerations are compatibility with external programs and databases.

The simplest representation of a compound is an atomic representation based on the chemical formula. Here, a compound is simply represented by the types of atoms it contains and the number of occurrences of each atom type (Fig. 3a). A single representation can describe a large number of compounds of very different types. No direct information regarding the bonds in the compound can be extracted from the representation, although, if assumptions about the valency of the different atom types are made, it is possible to calculate bond configurations.

A related representation form is the representation of a compound as a collection (or vector) of groups. A group is a molecular fragment or substructure defined by the number and types of atoms in the fragment, how the atoms are connected, how many free connections the group has and where (on which atom) they are located. Figure 3b shows an example of a fragment and Fig. 3c an example of a group vector. A group vector contains some information about the connectivity of the structure of the molecule but does not define it completely. As a result, a group vector can represent more than one possible molecule (isomers) - Figure 3d illustrates the different compounds that can be constructed using the group vector in Fig. 3c. The compounds depicted in Fig. 3d have the connectivity defined.

One of the most versatile and manageable methods is the adjacency matrix. An adjacency matrix is a square symmetrical matrix whose rows and columns represent the atoms (or fragments) in the molecule, and in which non-zero entries indicate bonds and zeros indicate the absence of bonds. An adjacency matrix can be on fragment level or on atomic level. Conversion from a fragment-based matrix to an atom-based matrix is achieved by substituting the entry for each fragment with the atomic adjacency matrix representing the fragment. Figures 3e and 3f are the fragment-based and atom-based adjacency matrices, respectively, for the first compound in Fig. 3d.
While the adjacency matrix defines the 2-dimensional relations between atoms in a compound, it does not contain the steric information needed in order to distinguish R/S, L/D and cis/trans isomers. In order to distinguish between such isomers, it is necessary to have 3-dimensional information about the placement of the atoms. For 3-dimensional representation, two methods are widely used. The first is the combination of an adjacency matrix with a list of x, y, z Cartesian coordinates for the atoms. The second is the so-called internal coordinate system, where an atom's position is defined by a length, a bond angle and a torsion angle (Maranas and Floudas, 1994). The choice of the type of representation depends on the computations that are to be performed with the 3-dimensional representation.

Chapter 2 describes methods for generating molecular structures using group information only. Chapters 3 and 4 give examples of how the generation of molecular structures can be incorporated into mathematical programming formulations through the feasibility rules. Chapter 4 also gives a detailed description of the generation of molecular structures from higher-level groups (Marrero and Gani, 2001). Chapter 5 describes how
employing groups and topological indices can generate molecular structures through genetic algorithms. Finally, Chapter 7 describes groups-based combination rules to generate molecules that also satisfy reaction stoichiometry.
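The relation between the group-vector level and the atomic (chemical formula) level can be sketched in Python as follows. The group vector 2 CH3, 1 CH2, 1 CH2COO is chosen here for illustration, and the function and table names are hypothetical. The conversion runs in one direction only: the formula can be computed from the groups, but neither representation fixes the connectivity.

```python
from collections import Counter
from typing import Dict

# Atomic composition of each group, read directly from the group's formula.
GROUP_ATOMS: Dict[str, Dict[str, int]] = {
    "CH3": {"C": 1, "H": 3},
    "CH2": {"C": 1, "H": 2},
    "CH2COO": {"C": 2, "H": 2, "O": 2},
}


def formula_from_groups(group_vector: Dict[str, int]) -> str:
    """Collapse a group vector into an atomic formula (C, then H, then others)."""
    atoms: Counter = Counter()
    for group, n in group_vector.items():
        for element, k in GROUP_ATOMS[group].items():
            atoms[element] += n * k
    order = [e for e in ("C", "H") if e in atoms]
    order += sorted(e for e in atoms if e not in ("C", "H"))
    return "".join(f"{e}{atoms[e] if atoms[e] > 1 else ''}" for e in order)


print(formula_from_groups({"CH3": 2, "CH2": 1, "CH2COO": 1}))  # C5H10O2
```

Ethyl propanoate and methyl butanoate both correspond to this group vector, which is the kind of ambiguity that the adjacency matrix removes.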
Figure 3: Different levels of molecular structure representation (Harper, 2000)
[Figure panels: (a) atomic representation, C5H10O2; (b) example of a fragment (group); (c) group vector: 2 CH3, 1 CH2, 1 CH2COO; (d) compounds that can be constructed from the group vector; (e) fragment-based adjacency matrix; (f) atom-based adjacency matrix]
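As an illustration of what an atom-level adjacency matrix makes possible, the Python sketch below computes a first-order connectivity index of the kind used in Eq. 9 from the matrix alone. It assumes the simple (non-valence) Randić form on a hydrogen-suppressed graph, i.e. the sum over bonds of 1/sqrt(d_i d_j), where d_i is the number of neighbours of atom i; this choice, the function names and the two test molecules are assumptions made for the example and are not taken from the text.

```python
import math
from typing import List

Matrix = List[List[int]]


def degrees(a: Matrix) -> List[int]:
    """Number of bonded neighbours of each atom (count of non-zero entries per row)."""
    return [sum(1 for v in row if v != 0) for row in a]


def chi1(a: Matrix) -> float:
    """First-order connectivity index: sum over bonds of 1/sqrt(d_i * d_j)."""
    d = degrees(a)
    n = len(a)
    return sum(1.0 / math.sqrt(d[i] * d[j])
               for i in range(n) for j in range(i + 1, n) if a[i][j] != 0)


# Hydrogen-suppressed carbon skeletons of two C4H10 isomers.
n_butane = [[0, 1, 0, 0],
            [1, 0, 1, 0],
            [0, 1, 0, 1],
            [0, 0, 1, 0]]
isobutane = [[0, 1, 1, 1],
             [1, 0, 0, 0],
             [1, 0, 0, 0],
             [1, 0, 0, 0]]

print(round(chi1(n_butane), 3), round(chi1(isobutane), 3))  # 1.914 1.732
```

The two isomers have the same chemical formula but different index values, which is the extra discriminating power that the topological methods gain over a purely atomic representation.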
1.2 KEY ISSUES & THEIR RELATIONSHIPS
Some of the key issues, and their relationships, associated with the generation of molecular structures and the prediction of the properties of the generated compounds are highlighted here (from Harper 2000).

• Computational Load - This is related to the amount of calculation required to solve any CAMD problem.
• Generation Level - This is related to the steps employed to generate molecular structures (compounds). With increasing levels of molecular structural information, the degree of detail and information also increases.
• Property Range - The Property Range is the total number of properties to be calculated for a generated molecule in order to evaluate whether it matches the specified requirements. Each of the properties in the Property Range may have an associated constraint value indicating a lower and/or upper bound that must be fulfilled if the generated molecule is to be retained for further screening.
• Property Level - This is related to the level of "complexity" involved in the estimation of a needed property. This is a theoretical measure of the amount of information needed in order to calculate the property, based on:
  o The type of molecular information needed in order to use the selected property estimation method.
  o Whether or not the property requires other properties in order to be calculated (that is, whether it is a secondary property).
  o The complexity of the calculation, that is, whether the calculation is iterative, involves the solution of a system of equations or is otherwise calculation intensive.
  o If a property p depends on other properties, the level (with respect to calculation order) of property p must be higher than the levels of the other properties. Therefore, if the level of property p is determined on the basis of the levels of other properties, it is not a fixed value for all calculations involving property p, but a variable.
  o Whether the property p is a dynamic, i.e. time-evolving, property. Certain high-level, complex performance measures may involve not only the value p(0) of the property at the start of the material's active service life, but also the profile p(t) of its evolution with time over the service period.
• Property Trust - The level of "confidence" one can assign to a property. This depends on:
  o Estimation accuracy.
  o The dependence on other calculated properties, for example, error propagation.
  o Applicability of the method(s) to the compound(s) in question.
For any CAMD problem, it is necessary to identify the Generation Levels needed. The entire property range (of the target properties) must be covered within the generation levels. The number of levels needed is determined by the available property estimation methods. As a consequence, the property range and the available property estimation methods control the minimum generation level.
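The calculation-order rule stated under Property Level (a property must sit at a higher level than every property it depends on) can be sketched as a small recursive computation. In the Python fragment below, the rule is read as level(p) = 1 + the highest level among its dependencies, which is one possible interpretation rather than a definition from the text. The dependency of the solubility parameter on molar volume and heat of vaporization follows the example given earlier in this chapter; the "solvent_loss" entry and all names are hypothetical.

```python
from typing import Dict, List


def property_levels(depends_on: Dict[str, List[str]]) -> Dict[str, int]:
    """Assign each property a level one above the highest level it depends on."""
    levels: Dict[str, int] = {}
    in_progress = set()

    def level(prop: str) -> int:
        if prop in levels:
            return levels[prop]
        if prop in in_progress:
            raise ValueError(f"cyclic dependency involving {prop}")
        in_progress.add(prop)
        deps = depends_on.get(prop, [])
        # Properties with no dependencies (primary properties) get level 1.
        result = 1 + max((level(d) for d in deps), default=0)
        in_progress.discard(prop)
        levels[prop] = result
        return result

    for prop in depends_on:
        level(prop)
    return levels


levels = property_levels({
    "molar_volume": [],
    "heat_of_vaporization": [],
    "solubility_parameter": ["molar_volume", "heat_of_vaporization"],
    "solvent_loss": ["solubility_parameter"],  # hypothetical dependency
})
print(levels["molar_volume"], levels["solubility_parameter"], levels["solvent_loss"])  # 1 2 3
```

Because the level is computed from the dependency table, changing the estimation method for a property (and hence its dependencies) changes its level, which reflects the remark that the level is a variable and not a fixed value.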
1.3 TARGETS FOR A CAMD FRAMEWORK
From the above discussion, it is clear that any CAMD methodology requires a number of methods and tools that need to work in an integrated manner. An architecture that glues the various methods and tools together into a CAMD framework could therefore be very useful for the further development of CAMD methodologies in a systematic manner, as well as for increasing the solution range of any CAMD methodology. The targets for the development of a CAMD framework could be (Harper 2000):

• The correct formulation of the Property Range is critical to the success of a CAMD method. Failure to identify the important properties will lead to the generation of the wrong products. It is therefore necessary to include a methodology for the formulation of the target property constraints within a CAMD framework.
• The ability to predict a wide range of properties using different methods would broaden the application range of CAMD. Therefore, a CAMD framework must be able to use other prediction methods in addition to the traditionally used GCA methods. This requires the generation and integration of detailed molecular models.
• While the design of highly detailed molecular structures improves the ability to predict properties accurately, there can be a significant associated computational cost. If highly detailed molecules (in terms of structural information) are to be generated, it is necessary that the computational efficiency of the CAMD algorithm be taken into account in the development of the CAMD framework.
• The minimization of uncertainty is important when performing complex calculations. Consequently, the use of correlations should be minimized and the use of experimental data and accurate prediction methods (using all available information) should be maximized.

With the background presented in this chapter, we now move on to some of the tools and methods used to tackle the CAMD problem.
Acknowledgement
The PhD-thesis of Peter M. Harper (2000) has provided material in the form of text and figures for parts of this chapter.
1.4 REFERENCES

1. J. Bicerano, "Prediction of Polymer Properties", Marcel Dekker Inc. (1993).
2. H. Cabezas, "Designing green solvents", Chemical Engineering, 107 (3), March (2000) 107-109.
3. Chem-Bank, Chemical Hazards Response Information System (CHRIS) Database, Silver Platter Information Inc, MA, USA, November (1998a).
4. Chem-Bank, The Hazardous Substances Data Bank (HSDB), Silver Platter Information Inc, MA, USA, November (1998b).
5. Chem-Bank, The Registry of Toxic Effects of Chemical Substances (RTECS), Silver Platter Information Inc, MA, USA, November (1998c).
6. L. Constantinou, S.E. Prickett and M.L. Mavrovouniotis, "Estimation of thermodynamic and physical properties of acyclic hydrocarbons using the ABC approach and conjugation operators", Ind. Eng. Chem. Res., 32 (1993) 1734.
7. L. Constantinou and R. Gani, "New group contribution method for estimating properties of pure compounds", AIChE J., 40 (1994) 1697.
8. E.L. Cussler and G.D. Moggridge, "Chemical Product Design", Cambridge University Press, USA (2001).
9. Aa. Fredenslund, J. Gmehling and P. Rasmussen, "Vapor liquid equilibria using UNIFAC", Elsevier Scientific, Amsterdam, The Netherlands (1977).
10. J.L. Franklin, "Prediction of Heat and Free Energies of Organic Compounds", Industrial & Engineering Chemistry, 41 (1949) 1070.
11. R. Gani, B. Nielsen and A. Fredenslund, "A group contribution approach to computer-aided molecular design", AIChE J., 37 (1991) 1318.
12. R. Gani and L. Constantinou, "Molecular Structure Based Estimation of Properties for Process Design", Fluid Phase Equilibria, 116 (1996) 75-86.
13. P. Ghosh, A. Sundaram, V. Venkatasubramanian and J. Caruthers, "Integrated Product Engineering: A Hybrid Evolutionary Framework", Computers and Chemical Engineering, 24 (2000) 685-691.
14. P.M. Harper, "A Multi-Phase, Multi-Level Framework for Computer Aided Molecular Design", PhD-thesis, Technical University of Denmark, Lyngby, Denmark (2000).
15. S.O. Jonsdottir, Kj. Rasmussen and Aa. Fredenslund, Fluid Phase Equilibria, 100 (1994) 121-138.
16. J.W. Kang, J. Abildskov, R. Gani and J. Cobas, "Estimation of Mixture Properties from First- and Second-Order Group Contributions with the UNIFAC Model", I&EC Research, 41 (2002) 3260-3273.
17. L. Kier and L.H. Hall, "Molecular Connectivity in Structural-Activity Analysis", Wiley, New York, USA (1986).
18. D. Livingstone, "Data analysis for chemists: Application to QSAR and chemical product design", Oxford University Press, Oxford, UK (1995).
19. L.J. Lyman, W.F. Reehl and D.H. Rosenblatt, "Handbook of Chemical Property Estimation Methods, Environmental Behavior of Organic Compounds", American Chemical Society, Washington DC, USA (1990).
20. C.D. Maranas and C.A. Floudas, "A Deterministic Global Optimization Approach for Molecular Structure Determination", J. Chem. Phys., 100 (1994) 1247-1261.
21. J. Marrero and R. Gani, "Group-contribution based estimation of pure component properties", Fluid Phase Equilibria, 183-184 (2001) 183.
22. S. Macchietto, O. Odele and O. Omatsone, "Design of optimal solvents for liquid-liquid extraction and gas absorption processes", Chem. Eng. Res. Des., 68 (1990) 429.
23. J.M. Nielsen, R. Gani and J.P. O'Connell, "TMS: A Knowledge Based Expert System for Thermodynamic Model Selection and Application", in "Computer-Oriented Process Engineering", ed. L. Puigjaner and A. Espuna, Elsevier, 10 (1991) 29-34.
24. B.E. Poling, J.M. Prausnitz and J.P. O'Connell, "The properties of gases and liquids", 5th edition, McGraw-Hill, New York, USA (2000).
25. H. Renon and J.M. Prausnitz, AIChE J., 14 (1968) 135.
26. G.M. Wilson, J. Am. Chem. Soc., 86 (1964) 127.
27. T.D. Martin and D.M. Young, "Prediction of the Acute Toxicity (96-h LC50) of Organic Compounds to the Fathead Minnow Using a Group Contribution Method", Chem. Res. Toxicol., 14 (2001) 1378-1385.
Computer Aided Molecular Design: Theory and Practice
L.E.K. Achenie, R. Gani and V. Venkatasubramanian (Editors)
© 2003 Elsevier Science B.V. All rights reserved.
Chapter 2: Molecular Design - Generation & Test Methods
E.A. Brignole & M. Cismondi
2.1 INTRODUCTION

Traditionally, the search for solvents or products for specific applications has been carried out by examining several compounds and families of compounds and selecting those with the desired properties. A more systematic approach to the solution of these problems is based on CAMD of solvents or products. In both cases, an experimental validation of the component properties is recommended. The CAMD approach was introduced in the early eighties for the selection of solvents for separation processes [1,2]. At that time the problem was formulated as follows: "Given a mixture and certain separation goals, synthesize, from the set of UNIFAC groups, molecular structures with the desired solvent properties. The groups are the building blocks for the synthesis process and the UNIFAC thermodynamic model is used for the evaluation of the primary solvent properties". UNIFAC is a group contribution based model [3] used for predicting the liquid phase activity coefficients of the compounds present in the mixture, and the UNIFAC groups are the functional groups needed to represent the molecular structures of the compounds. These two stages, synthesis and evaluation, are still the main components of the various types of CAMD techniques that have been developed.

The extensive development of group contribution methods for the prediction of pure component and mixture properties has been a fertile ground for the generalized use of product molecular design techniques. The original CAMD approach can be defined as the backward product design problem: "given a set of property constraints and certain performance indexes, generate chemical structures with the desired physico-chemical and/or environmental properties". Applications have been reported for the design of polymers [4], refrigerants [5,6], product substitution [7], solvents [8,9,10] and many more.
The first solvent design studies were based on solution properties derived from the UNIFAC group contribution method for computing activity coefficients [3]. Several revisions of the original UNIFAC predictive package, and extensions to electrolytes, polymers and equations of state, have been presented [11]; a group contribution equation of state (GC-EOS), based on similar but more detailed group definitions, has been extended to new groups and gases [12-14]. For the prediction of pure component properties, such as heat capacities, solubility parameters, formation energies, critical properties, etc., different group definitions have been proposed [15]. However, correlation of pure component properties has also been proposed in terms of the original UNIFAC groups [16,17], which are also called first-order groups [17,18]. In this chapter the original UNIFAC group definitions will be used throughout.

This chapter presents the class of CAMD methods that is characterized as generate & test methods. At the macroscopic properties level, these types of methods were first developed for solvent selection and design. For the design of large complex molecules involving a higher level of molecular structural representation than functional groups, most of the procedures also employ generate and test types of CAMD methods. In this chapter, however, only the method based on groups as building blocks is discussed in detail.
2.2 THE EVOLUTION OF CAMD
The elements of a CAMD technique can be divided into algorithmic stages dealing with the generation of molecules and the testing of generated molecules, that is, i) the "generate" or molecular synthesis stage and ii) the "test" or molecular evaluation stage. The main features of the molecular synthesis stage are: group selection, group characterization and molecular feasibility rules. The result of the molecular synthesis stage is a number of feasible molecular structures. The main features of the molecular evaluation stage are: group contribution methods for property estimation, calculated properties, property constraints and evaluation (performance indexes). The final result is a ranked set of product candidates.
2.2.1 Molecular Synthesis
Molecules are synthesized by joining groups with free-attachments until no free-attachments remain in the generated structure. This means that the search (or design) for suitable molecules is not limited to a given set of molecules. Although this is an attractive feature of CAMD, it also has its drawback - the number of structures that may be generated can be very large. Another important feature with respect to properties prediction (forward problem) and CAMD (reverse problem) is that while in the forward problem the groups representing a molecule are given, in the reverse problem, the group's free-attachment properties are also important [1,2] and need to be analysed. The free-attachments of a group are the number of chemical bonds available to neighbouring groups for attachment (or combination). The characterisation of the group's combination properties is needed mainly to satisfy two criteria:
i) To obtain chemically feasible structures.
ii) To avoid proximity effects that could lead to unreliable UNIFAC predictions.
Therefore, the generation of feasible molecular structures from the groups is subject to several restrictions and is based on the free-attachments of the groups. Some of the restrictions are the result of the way the groups in the UNIFAC table are defined, while other restrictions are made to prevent the formation of unstable compounds or the generation of new functional groups such as acetals (for which the property predictions will be uncertain). In an earlier publication on molecular design using UNIFAC groups [1], a set of combination rules was formulated:
a) Groups with two attachments cannot be combined to obtain a double bond.
b) Aromatic groups with two attachments (such as "ACCH2", see Table 3) must always have one attachment to the aromatic ring.
c) All non-hydrocarbon groups can only combine with a carbon attachment.
d) Only one bond of the carbon atom can be used for attachments with bonds other than those of carbon or hydrogen atoms.
In later works [2,8] a more detailed group characterisation was introduced, allowing a more general formulation of feasibility rules for aliphatic and aromatic compounds. The main chemical property used for the generation of combination rules was the electronegativity of the group bonds [2,8,9]. Other authors have proposed feasibility rules that satisfy the molecule neutrality conditions. However, the chemical stability of the components is, in many cases, not guaranteed [5,6] with such feasibility rules. This is partly due to the way groups are defined in different group contribution methods and/or the lack of proper combination rules for the groups.
Classification of Groups
The UNIFAC groups with free-attachments (or bonds) have one or more attachments for combination among themselves. Groups with only one free attachment are defined as "terminal" groups. All other groups, with more than one free attachment, are defined as "intermediate" groups. There are three types of intermediate groups (i.e., groups with multiple attachments): radial, linear and mixed. In the groups of the UNIFAC parameter tables, there are no more than two atoms with "free" attachments. The "free" attachments of a group may be characterised by two properties: i) attachment status, which takes into account the combination properties, and ii) valence, the number of attachments. Four types of attachments for paraffinic groups have been defined on the basis of their electronegativity:
- K: severely restricted attachment, e.g., "-OH", "CH3O-"
- L: partially restricted attachment, e.g., "-CH2Cl"
- M: unrestricted carbon attachment in single valence or linear dual valence groups
- J: unrestricted carbon attachments in radial paraffinic groups, e.g., "-CH2-", "-CH<"
Three basic types of group attachment have been identified in aliphatic compounds: M and J, which are classified as neutral, and L and K, with increasing degrees of electronegativity. The methyl group (CH3), even though it is a J type group, is identified as type M because it plays a different role in the feasibility criteria analysis with respect to the other J groups. The synthesis of aromatic compounds requires the introduction of additional attachments:
- I: aromatic carbon ring attachment, such as ACH
- H: substituted aromatic carbon ring attachment, such as ACCl
Types M and J attachments are extended to aromatic groups as follows:
- M: unrestricted attachment in a carbon linked to an aromatic carbon, such as ACCH2-
- J: unrestricted attachments in a "radial" carbon linked to an aromatic carbon, such as ACCH<
The valence of an aromatic carbon attachment has been defined to be one. For example, the characterization of the group (ACH) is (I,1), where the letter indicates the attachment type and the number indicates the valence. Modifying its attachment type can change the combination properties of a given group. For instance, to avoid proximity effects between polar groups a type L attachment may be changed to a type K. The keto group (CH2CO) and the amino group (CH2NH) both have the same combination properties, (L,1)(K,1); therefore the combination (NHCH2)-(CH2CO) is feasible, as is a keto-amino compound like (CH3)-(NHCH2)-(CH2CO)-(CH3). However, if proximity effects between the amino and the keto group are to be avoided, the keto group characterization can be modified to (K,2). In this case both attachments of this group are highly restricted, and thus an additional (CH2) is needed to link both functional groups: (CH3)-(NHCH2)-(CH2)-(CH2CO)-(CH3).
Combination & Feasibility Rules
The attachment combination properties for the synthesis of paraffinic, aromatic and cyclic solvents are:
R1: Type K attachments can only be combined with unrestricted carbon attachments.
R2: Type L attachments can be combined with L, M or J attachments.
R3: The combination of a J attachment (radial paraffinic group) with a type K attachment changes the status of the remaining free attachments of the group to L.
R4: Aromatic rings are built only with type I and H attachments.
Very simple feasibility criteria were formulated when the above combination rules were applied to the synthesis of linear paraffinic or linear mixed paraffinic-aromatic compounds, using only dual valence intermediate groups and single valence terminal groups. For example, the set of groups that makes up a molecular structure should have a number of unrestricted carbon attachments equal to or greater than the number of K (severely restricted) attachments. If this condition is satisfied, no further restrictions are imposed on the combination of the remaining attachments for the case of linear molecular structures. The feasibility criteria for intermediate molecular structures (IMSs) and final molecular structures (FMSs) are given in Table 1a. Examples of application of the feasibility criteria are given in Table 1b.
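As an illustration of rules R1 and R2, the following minimal Python sketch checks whether two aliphatic attachments may be joined, and reproduces the keto-amino example given earlier. The function names `can_bond` and `can_link` are hypothetical, and the treatment of two unrestricted attachments (always allowed) is an assumption, since the text does not state it as a rule. Rules R3 and R4 are not covered.

```python
# Attachment statuses for aliphatic groups, as defined in the text:
# K = severely restricted, L = partially restricted,
# M and J = unrestricted carbon attachments.
UNRESTRICTED = {"M", "J"}


def can_bond(a: str, b: str) -> bool:
    """Return True if attachments with statuses a and b may be joined.

    R1: K combines only with unrestricted carbon attachments (M or J).
    R2: L combines with L, M or J.
    Assumption (not a stated rule): two unrestricted attachments
    can always be joined.
    """
    if a == "K" or b == "K":
        other = b if a == "K" else a
        return other in UNRESTRICTED
    if a == "L" or b == "L":
        other = b if a == "L" else a
        return other == "L" or other in UNRESTRICTED
    return a in UNRESTRICTED and b in UNRESTRICTED


def can_link(group_a, group_b) -> bool:
    """True if any free attachment of group_a may bond to one of group_b."""
    return any(can_bond(a, b) for a in group_a for b in group_b)


# Groups written as the list of their free-attachment statuses.
amino = ["L", "K"]            # (CH2NH): (L,1)(K,1)
keto = ["L", "K"]             # (CH2CO): (L,1)(K,1)
keto_restricted = ["K", "K"]  # (CH2CO) re-characterised as (K,2)
ch2 = ["J", "J"]              # (CH2): (J,2)

print(can_link(amino, keto))             # direct amino-keto link allowed
print(can_link(amino, keto_restricted))  # blocked once keto is (K,2)
print(can_link(amino, ch2), can_link(ch2, keto_restricted))  # CH2 spacer works
```

With the (K,2) characterisation the direct link is rejected and a (CH2) spacer is accepted, which matches the structure (CH3)-(NHCH2)-(CH2)-(CH2CO)-(CH3) in the text.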
Table 1a. Feasibility criteria for IMSs and FMSs

Type of compound      IMSs                        FMSs
Aliphatic             K ≤ M + J/2 + 2             K ≤ M + J/2
Aromatic (a)          I + H = 6                   I + H = 6
Aliphatic-aromatic    K ≤ M + J/2; I + H = 6      K ≤ M + J/2; I + H = 6
Cyclic                -                           K ≤ M + J/2

(a) Single ring aromatic structures
Generation of Feasible Molecular Structures
The basic technique for the molecular synthesis stage in the generate & test methods follows a combinatorial approach: the possible combinations (in this case FMSs) of the building blocks (in this case groups) are enumerated, and each FMS is tested against its structural and property constraints. Brignole et al. [2] proposed a combinatorial-partition strategy where a selected set of molecular groups is combined, considering all possible chemical structures, which are then screened by checking the feasibility conditions. In principle, the size of the combinatorial problem considering all the UNIFAC groups is of insurmountable magnitude. However, a realistic implementation of many product or solvent design problems can be handled efficiently by a combinatorial molecular generation approach, as will be discussed later. Pretel et al. [8] proposed molecular synthesis techniques based on intermediate and terminal groups for the generation of linear (not branched) molecules.

Table 1b. Examples of feasibility criteria for IMSs and FMSs
Paraffinic
  Group characterisation: (CH2): (J,2); (CHCl): (L,2); (CH=CH): (K,2); (OH): (K,1); (CH3): (M,1)
  IMS: (CHCl)(CH=CH)(CH2); M=0; J=2; L=2; K=2; M+J/2+2 = 3 (feasible IMS)
  FMS: (CH3)2(CH=CH)(CHCl)(CH2); M=2; J=2; L=2; K=2; M+J/2 = 3 (feasible FMS)
  FMS: (CH3)(CH=CH)(CHCl)(CH2)(OH); M=1; J=2; L=2; K=3; M+J/2 = 2 (unfeasible FMS)

Aromatic (a)
  Group characterisation: (ACCH3): (H,1); (ACH): (I,1); (ACOH): (H,1)
  FMS: (ACH)3(ACCH3)2(ACOH); I=3; H=3; I+H = 6 (feasible molecule)

Aliphatic-aromatic
  Group characterisation: (ACCH2): (H,1)(M,1); (AC): (H,1)(M,1)
  IMS: (AC)(ACH)3(ACCH2)2; M=3; J=0; I=3; H=3; K=0; M+J/2 = 3; I+H = 6 (feasible IMS)
  FMS: (OH)2(AC)(ACH)3(ACCH2)2(CH3O); M=3; J=0; I=3; H=3; K=3; M+J/2 = 3; I+H = 6 (feasible FMS)

Cyclic
  Group characterisation: (CH2CO): (L,1)(K,1)
  FMS: (CH2)3(CH2CO); M=0; J=6; L=1; K=1; M+J/2 = 3 (feasible FMS)

(a) Single ring aromatic structures
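The criteria of Table 1a can be checked mechanically. The sketch below is one possible way to do so in Python, using the group characterisations and the paraffinic and aromatic examples of Table 1b. The dictionary `GROUPS` and the function names are illustrative assumptions, not part of the original method description.

```python
from collections import Counter

# Group characterisations from Table 1b: each group is a list of
# (attachment status, number of attachments with that status).
GROUPS = {
    "CH3": [("M", 1)],
    "CH2": [("J", 2)],
    "CHCl": [("L", 2)],
    "CH=CH": [("K", 2)],
    "OH": [("K", 1)],
    "ACH": [("I", 1)],
    "ACCH3": [("H", 1)],
    "ACOH": [("H", 1)],
}


def attachment_counts(composition):
    """Total attachments of each status for a {group: count} composition."""
    totals = Counter()
    for group, n in composition.items():
        for status, valence in GROUPS[group]:
            totals[status] += n * valence
    return totals


def aliphatic_feasible(composition, final=True):
    """Table 1a: K <= M + J/2 for an FMS, K <= M + J/2 + 2 for an IMS."""
    c = attachment_counts(composition)
    slack = 0 if final else 2
    return c["K"] <= c["M"] + c["J"] / 2 + slack


def aromatic_feasible(composition):
    """Table 1a, single ring aromatic structures: I + H = 6."""
    c = attachment_counts(composition)
    return c["I"] + c["H"] == 6


# Examples taken from Table 1b.
ims = {"CHCl": 1, "CH=CH": 1, "CH2": 1}
fms_ok = {"CH3": 2, "CH=CH": 1, "CHCl": 1, "CH2": 1}
fms_bad = {"CH3": 1, "CH=CH": 1, "CHCl": 1, "CH2": 1, "OH": 1}
aromatic = {"ACH": 3, "ACCH3": 2, "ACOH": 1}

print(aliphatic_feasible(ims, final=False))  # feasible IMS
print(aliphatic_feasible(fms_ok))            # feasible FMS
print(aliphatic_feasible(fms_bad))           # unfeasible FMS
print(aromatic_feasible(aromatic))           # feasible molecule
```

Note that M, J, L and K count attachments rather than groups, so a single (CH2) group contributes J = 2, as in the table.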
A direct extension of the feasibility criteria of Pretel et al. [8] to branched structures is the following:

$$\sum_i i\,K_i \le M + J \quad \text{(final structure)}, \qquad J = J_2 + J_3 + J_4 \qquad (1)$$

where Ki and Ji are the number of K or J groups with "i" attachments in the structure and M is the number of methyl groups. However, this criterion leads, in many cases, to structures not described by UNIFAC groups. For instance, consider the application of the above feasibility rule to the final structure FMSa: (HCOO)(CH)(CH3)(OH):
K groups: (OH): (K,1); (HCOO): (K,1), so that Σi i Ki = 1·2 = 2
M group: (CH3): (M,1), so that M = 1
J3 group: (CH): (J,3), so that J = 1
By application of equation (1), M + J = 2 and FMSa is feasible. However, in this structure the tertiary carbon group -CH< is attached to two oxygen bonds, generating a combination of atoms (functional group) not available in the UNIFAC table. A feasible structure can be obtained by the addition of a (CH2) group, leading to FMSb: (HCOO)(CH)(CH2)(CH3)(OH).

New Group Combination Property Characterization
The failure of equation (1) to deal with branched structures can be explained as follows: after the combination of a J group with valence greater than two, there are residual free attachments whose combination properties may be modified when linked to K groups. Therefore, the formulation of robust feasibility criteria for the synthesis of branched structures requires not only the characterisation of the group free attachments but also of the group internal bonds. This is particularly the case for groups having L attachments. Therefore, a more detailed characterisation of group combination properties for aliphatic compounds was introduced by Cismondi and Brignole [18]. Considering the internal and free bonds, only two bond statuses, K (electronegative) and J (neutral), are required to characterise the combination properties. For example, groups with L attachments are formed by a combination of two "pure" K and J subgroups (see Table 2). A revised set of combination properties of UNIFAC groups is presented in Table 4. The methyl group (CH3) is still characterised as a neutral M group and it is not counted as a J group. With the new group characterisation, more general feasibility criteria can be implemented.

Table 2: Redefinition of group combination properties in terms of J and K bond status

UNIFAC group    Valence    Previous            Decomposed    New group combination
example                    characterisation    sub-groups    properties
(CH2Cl)         1          (L,1)               J2 + K1       (K,1)(J,2)
(CHCl)          2          (L,2)               J3 + K1       (K,1)(J,3)
(CCl)           3          (L,3)               J4 + K1       (K,1)(J,4)
(CH2CO)         2          (K,1)(L,1)          J2 + K2       (K,2)(J,2)
(CHNH)          3          (K,1)(L,2)          J3 + K2       (K,2)(J,3)
(CH2N)          3          (K,2)(L,1)          J2 + K3       (K,3)(J,2)
Feasibility Criteria for the Synthesis of Linear or Branched Structures
Considering the molecular structure as a combination of pure "K" and "J" groups or subgroups, the new synthesis concept is: each pure J group cannot be attached to more than one K group. In other words, the building of feasible molecules requires the existence of a J-J type bond for each K group incorporated into the molecule (after the first one, for non-cyclic structures). For example, consider the following sequence of feasible final structures, where only the J-J bonds introduced for feasibility reasons are shown:

(CH3CO)(CH3) --> (CH3CO)(CH2)-(CH2)(COCH3) --> (CH3CO)(CH2)-(CH)-(CH2)(COCH3), with (OH) attached to the (CH) group

The last is a branched structure with a tertiary carbon linked to an (OH) group. This example shows how the addition of each K group requires the introduction of J-J bonds in the final structures. This synthesis concept can be formulated as follows:

$$K \le \mathrm{NJJ} \quad \text{(cyclic)} \qquad (2)$$

$$K - 1 \le \mathrm{NJJ} \quad \text{(non-cyclic)} \qquad (3)$$

where NJJ is the number of J-J bonds. These conditions are valid for both intermediate and final structures. Therefore the new feasibility criteria consist of determining NJJ by counting the number of type J attachments available. A "J attachments balance" can be obtained as follows:

$$\sum_i i\,J_i = 2\,\mathrm{NJJ} + \mathrm{NJF} \quad \text{when } K \le \mathrm{NJF} \qquad (4)$$

or

$$\sum_i i\,J_i = 2\,\mathrm{NJJ} + \mathrm{NJF} + 2\,(K - \mathrm{NJF}) \quad \text{when } K > \mathrm{NJF} \qquad (5)$$

where the number of J free attachments is given by:

$$\mathrm{NJF} = J_3 + 2 J_4 + 2 \quad \text{(non-cyclic and } J \ge 1\text{)} \qquad (6)$$

or

$$\mathrm{NJF} = J_3 + 2 J_4 \quad \text{(cyclic)} \qquad (7)$$
In the final (non-cyclic) structure of the previous example, (CH3CO)(CH2)-(CH)-(CH2)(COCH3) with (OH) attached to the (CH) group:
J2 = 2; J3 = 1; NJF = 3; NJJ = 2; K = 3; Σi i Ji = 7; J = 3
Therefore the structure satisfies the feasibility criterion given by equation (3). However, if this criterion is applied to FMSa: (HCOO)(CH)(CH3)(OH), discussed in the previous section:
J3 = 1; NJF = 3; NJJ = 0; K = 2; Σi i Ji = 3; J = 1
The structure is unfeasible because it does not satisfy equation (3). When K > NJF, (K - NJF) K groups have to be inserted in the intermediate structure, requiring twice as many additional J bonds (equation 5) to obtain a feasible structure. For example, the following final structure is unfeasible: (CH3CO)(CH2CO)(CH2)-(CH)-(CH2)(COCH3), with (OH) attached to the (CH) group:
J3 = 1; NJF = 3; NJJ = 2; K = 4; Σi i Ji = 9; J = 4
On the basis of the previous definitions (equation 1) and equations 2 to 7, the general feasibility criteria derived for linear or branched structures are shown in Table 3, where J is the number of J subgroups given by equation 1. From Table 3 it can be seen that, for the case where K > NJF, an additional (CH2) is required in the previous example in order to obtain a feasible molecule. When NJF = 0 then J = 0; in this case, for K = 1, the final molecule is obtained only by combining the K group with an M group (CH3). This is the case, for example, of methanol (CH3)(OH), where M = 1; J = 0; K = 1. In the application of the feasibility criteria of Table 3, K and J are the total number of groups or subgroups of each kind that participate in the molecule, irrespective of their valence. The criteria for the aromatic parts of the structures are those indicated in Table 1 and should be combined with the ones of Table 3 in the synthesis of mixed (aromatic-paraffinic) structures. Considering that the new group characterisation gives more detailed properties of the functional group, the feasibility criteria of Table 3 can be extended to different group definitions.
Table 3: Feasibility criteria for linear and cyclic branched structures

Condition    Non-cyclic structures    Cyclic structures    J = 0
K ≤ NJF      K ≤ J                    K ≤ J                -
K > NJF      2K ≤ J + NJF             2K ≤ J + NJF         K = 1
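The J attachments balance of equations (2) to (7) can be followed numerically with a short Python sketch such as the one below. It is an illustration only: the function names are hypothetical, the inputs are the numbers of pure J subgroups with two, three and four attachments (J2, J3, J4) and the number of K groups, and the special case J = 0 (for example methanol) is deliberately excluded. Equation (1) is included for comparison.

```python
def njf(j2, j3, j4, cyclic=False):
    """Number of J free attachments, equations (6) and (7)."""
    if j2 + j3 + j4 == 0:
        raise ValueError("J = 0 is a special case not covered by this sketch")
    return j3 + 2 * j4 + (0 if cyclic else 2)


def njj(j2, j3, j4, k, cyclic=False):
    """Number of J-J bonds from the balance of equations (4) and (5)."""
    total = 2 * j2 + 3 * j3 + 4 * j4   # sum over i of i * J_i
    free = njf(j2, j3, j4, cyclic)
    extra = 2 * max(0, k - free)       # non-zero only when K > NJF
    return (total - free - extra) / 2


def feasible(j2, j3, j4, k, cyclic=False):
    """Equation (2) for cyclic and equation (3) for non-cyclic structures."""
    required = k if cyclic else k - 1
    return required <= njj(j2, j3, j4, k, cyclic)


def feasible_eq1(k_by_valence, m, j):
    """Earlier criterion, equation (1): sum of i * K_i <= M + J."""
    return sum(i * n for i, n in k_by_valence.items()) <= m + j


# (CH3CO)(CH2)-(CH)-(CH2)(COCH3) with (OH) on the (CH): J2=2, J3=1, K=3
print(njj(2, 1, 0, 3), feasible(2, 1, 0, 3))
# FMSa (HCOO)(CH)(CH3)(OH): J3=1, K=2; accepted by eq. (1), rejected by eq. (3)
print(feasible_eq1({1: 2}, m=1, j=1), feasible(0, 1, 0, 2))
# (CH3CO)(CH2CO)(CH2)-(CH)-(CH2)(COCH3) with (OH): J2=3, J3=1, K=4
print(njj(3, 1, 0, 4), feasible(3, 1, 0, 4))
# Same structure with one additional (CH2): J2=4
print(njj(4, 1, 0, 4), feasible(4, 1, 0, 4))
```

The values of NJJ obtained for the three structures discussed in the text are 2, 0 and 2, and adding one (CH2) to the last structure raises NJJ to 3, which satisfies equation (3).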
Table 4. Revision of the combination properties of UNIFAC groups

Combination properties    Groups with the same combination properties (metha groups)
(J,2)                     (CH2)
(J,3)                     (CH)
(J,4)                     (C)
(J,2)(K,1)                (CH2Cl), (CH2CN), (CH2SH)
(J,3)(K,1)                (CHCl)
(J,4)(K,1)                (CCl)
(J,2)(K,2)                (CH2CO), (CH2COO), (CONHCH2), (CONCH3CH2), (FCH2O), (CH2S), (C2H4O2)
(J,3)(K,2)                (CH-O), (CHNH), (CHS)
(J,2)(K,3)                (CH2N)
(K,1)                     (CH2=CH), (OH), (CH3COO), (HCOO), (C5H4N), (COOH), (CH2NO2), (I), (CH3CO), (CHO), (CH3O), (Br), (CH3S), (CONH2)
(K,2)                     (CH=CH), (CCl2), (SiH2), (CH2=C), (CHNO2), (SiH2O), (CH3N), (C4H2S), (C5H3N), (COO)
(K,4)                     (C=C), (Si), (SiO)
(I,1)                     (ACH), (ACF)
(H,1)                     (ACOH), (ACCH3), (ACNO2)

Only the entries that are legible in the source are listed. The source table also includes [HCON(CH2)2], (CCl2F), (CH=C), (ACCl) and (AC), whose combination properties cannot be read.
Reducing the Combinatorial Size of the Problem
The group characterisation given in Table 4 indicates that there are only 19 different combination properties of the UNIFAC groups. Therefore, using the feasibility criteria of Table 3, an efficient combinatorial synthesis of branched molecules is implemented on the basis of metha groups, i.e. groups with the same combination properties, as indicated in the first column of Table 4. In the synthesis of linear molecules the intermediate structures have two free attachments. However, the number of free attachments (NFA) in branched intermediate structures is always larger than two:

$$\mathrm{NFA} = 2 + NV_3 + 2\,NV_4 \quad \text{(non-cyclic)} \qquad (10)$$

or

$$\mathrm{NFA} = NV_3 + 2\,NV_4 \quad \text{(cyclic)} \qquad (11)$$
where NV3 and NV4 are the number of groups of valence three and four. Computer programs based on the above combination rules and group classification can easily be developed [18] and consist of the following steps:
1. Definition by the user of the desired product or solvent property constraints and performance index.
2. Selection of the intermediate and terminal groups in an interactive way.
3. Generation of metha-Intermediate Molecular Structures with NFAs from 2 to 8, using the available metha-groups (intermediate) and satisfying the feasibility criteria. A maximum number of 12 groups in the Final Molecular Structures (FMS) is allowed. Then, each metha-IMS is replaced by all different possible combinations of the selected groups to form "real" IMSs.
4. In a similar way, pre-FMSs are obtained by adding (NFA-2) terminal groups to each IMS.
5. Screening of the pre-FMSs according to the physical property constraints.
6. Termination of Solvent Molecular Structures (SMSs) by adding to each accepted IMS different combinations of two terminal groups that conserve the molecule feasibility.
7. Screening of the synthesized SMSs according to the physical property constraints.
8. Ranking the selected products in accordance with molecular complexity and specific performance index, indicating their predicted physico-chemical or environmental properties.
The size of the combinatorial synthesis problem increases when considering branched structures because of the large number of free attachments of the intermediate structures (equations 10 and 11) and the larger number of groups available (see Table 4). Usually, UNIFAC or other group contribution methods for the computation of activity coefficients or component fugacities are used in the case of solvent design. The application of these methods requires the availability of binary parameters between the groups participating in the molecule synthesis stage.
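The free-attachment count of equations (10) and (11), which fixes how many terminal groups are added in step 4, can be sketched in Python as follows. The second function is a cross-check that is not in the original text: it assumes that a non-cyclic structure of n groups is a tree with n - 1 bonds and that a cyclic structure contains a single ring (n bonds). Both function names are hypothetical.

```python
def nfa(nv3, nv4, cyclic=False):
    """Number of free attachments of an IMS, equations (10) and (11)."""
    return nv3 + 2 * nv4 + (0 if cyclic else 2)


def nfa_from_valences(valences, cyclic=False):
    """Cross-check: total valence minus two attachments per internal bond.

    Assumption: n - 1 bonds for a non-cyclic structure (tree),
    n bonds for a structure with a single ring.
    """
    n = len(valences)
    bonds = n if cyclic else n - 1
    return sum(valences) - 2 * bonds


# Non-cyclic IMS built from two valence-2, one valence-3 and one valence-4 group.
valences = [2, 2, 3, 4]
print(nfa(nv3=1, nv4=1), nfa_from_valences(valences))

# Step 4 adds (NFA - 2) terminal groups; step 6 adds the last two.
print(nfa(nv3=1, nv4=1) - 2)
```

For a linear IMS (no groups of valence three or four) the function returns two free attachments, in agreement with the statement made for linear molecules.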
Therefore, between steps 3-4, 4-5 and 6-7, the molecular synthesis method eliminates all intermediate and final structures that contain one or more pairs of groups with unknown binary interaction parameters, limiting in this way the size of the combinatorial problem and reducing the computing time. The results of the synthesis procedure are illustrated with an example of solvent design for the separation of benzene from hexane by liquid extraction. For this example the following groups were chosen: (C), (CH-O), (CHNH), (CH), (CH3N), (CHNO2), (COO), (CH2CO), (CH2NH), (DMF-2), (CH2), (OH), (CH3COO), (HCOO), (CH2NH2), (CH3). The only physical property constraint for an intermediate structure is a maximum solvent loss of 10%. For the final solvents the main physical constraints are: selectivity greater than five and molecular weight less than 240. In this example, 16 groups (10 intermediate and 6 terminal groups) are selected for the synthesis of solvents with a minimum of two groups and a maximum of 12 in the final structure. In this case 10 metha groups can be identified within the selected set of groups. An example of the number of structures that are generated in the different steps of the molecular synthesis process is given in Table 5. The direct combination of these groups to form structures of 2 to 12 groups results in the generation of 646635 structures. The results of Table 5 show that the use of feasibility rules, physical constraints and the lack of binary interaction coefficients between groups leads to a significant reduction in the size of the synthesis problem. However, when pure component properties are dominant in the product design, the size of the combinatorial problem is not limited by the availability of binary parameters. A sound strategy to handle this problem is to make a preliminary search of product candidates using only single and dual valence groups. Thereafter, it is convenient to select the main group families that lead to the most promising branched structures. Note that in this case a database search method may also be employed, provided that a large database is available.

2.2.2 Test or Molecule Evaluation Stage
The test stage of generate & test methods is closely related to the type of product design problem being solved. In this chapter, only solvent-based separation problems are considered. A separation operation requires specific values or ranges of solvent properties for each particular application. These properties determine the physical property constraints that limit the search space of solvent structures. The solvent property constraints may have lower bounds, upper bounds or both. Even though it is difficult to define the conditions for an optimum solvent, the solvents synthesised by molecular design can be ranked according to a performance index and molecular complexity. The development of molecular design applications for different separation problems therefore requires the identification of these physical constraints and the formulation of predictions based on group contribution methods.
Table 5: Solvent design for separation of benzene from hexane by liquid extraction

Number of groups selected                                           16
Number of metha-intermediate structures generated                   2344
Number of metha-pre-final solvents                                  10552
Number of pre-final solvents                                        101934
  - Pre-final solvents rejected by MW restriction                   81303
  - Pre-final solvents rejected by lack of binary parameters        14475
  - Pre-final solvents rejected by solvent loss constraints         4120
Number of final solvents generated                                  8823
Number of final solvents that satisfy all physical constraints      277

Potential solvents                  Selectivity    Distribution coefficient
(CH3)(CH2)2(CH2COO)2(HCOO)          8.8            0.85
(CH3)3(CH2)2(C)(HCOO)               7.5            0.76
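The successive screening summarised in Table 5 (molecular weight restriction, availability of binary parameters, property constraints) can be pictured as a chain of filters that records how many structures each filter rejects. The Python sketch below is a minimal illustration under stated assumptions: the set `KNOWN_PAIRS` of group pairs with available interaction parameters is hypothetical and does not reproduce any real UNIFAC table, the three candidate structures are invented for the example, and only the molecular weight limit of 240 comes from the text.

```python
from itertools import combinations

# Approximate group molecular weights (g/mol).
GROUP_MW = {"CH3": 15.03, "CH2": 14.03, "OH": 17.01, "CH2CO": 42.04}

# HYPOTHETICAL availability of binary interaction parameters between groups.
KNOWN_PAIRS = {
    frozenset(p)
    for p in [("CH3", "CH2"), ("CH3", "OH"), ("CH2", "OH"),
              ("CH3", "CH2CO"), ("CH2", "CH2CO")]
}


def molecular_weight(groups):
    """Molecular weight of a structure given as a tuple of group names."""
    return sum(GROUP_MW[g] for g in groups)


def has_all_binary_parameters(groups, known_pairs=KNOWN_PAIRS):
    """True if every pair of distinct groups in the structure is available."""
    distinct = sorted(set(groups))
    return all(frozenset(p) in known_pairs for p in combinations(distinct, 2))


def screen(structures, filters):
    """Apply named filters in order; return survivors and rejection counts."""
    survivors = list(structures)
    rejected = {}
    for name, keep in filters:
        kept = [s for s in survivors if keep(s)]
        rejected[name] = len(survivors) - len(kept)
        survivors = kept
    return survivors, rejected


candidates = [
    ("CH3", "CH2", "CH2", "OH"),
    ("CH3", "CH2CO", "CH2", "OH"),           # needs the missing CH2CO-OH pair
    ("CH3",) + ("CH2",) * 20 + ("OH",),      # molecular weight above 240
]
filters = [
    ("MW", lambda s: molecular_weight(s) < 240.0),
    ("binary parameters", has_all_binary_parameters),
]
survivors, rejected = screen(candidates, filters)
print(survivors)
print(rejected)
```

Applying the cheapest filters first, as in this ordering, keeps the number of structures that reach the more expensive property predictions small.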
Liquid Extraction
When selecting a solvent for liquid extraction, it is important to consider all the separating operations involved in the liquid extraction process:
i) solvent extractor,
ii) raffinate removal from extract,
iii) solute purification,
iv) solvent recovery column.
The scheme shown in Fig. 1 is typical for the extraction of a dilute component. If the solute is recovered by extraction from a dilute solution, the solute/solvent relative volatility should be much greater than one, and the solvent solubility in the raffinate should be very low. Otherwise, economic considerations screen out liquid extraction as infeasible for the separation under consideration. Cockrem et al. [20] indicated that the solute distribution coefficient and the solvent solubility in the raffinate (solvent loss) are usually the dominant properties for solvent selection in liquid extraction. Low solvent loss in the raffinate also determines raffinate-extract immiscibility. High solvent selectivity is also required to reduce the cost of solute recovery and purification from the extract. The absence of solute-solvent azeotrope formation and a high relative volatility for the solute-solvent pair can be assured if a minimum boiling point difference is required. In general, the evaluation of potential solvents for liquid extraction is based on primary solvent properties and pure component properties (boiling points, heats of vaporisation, densities and molecular weights). The primary solvent properties (selectivity, distribution coefficient, solvent loss and solvent power) can be obtained from UNIFAC group contribution predictions of infinite dilution activity coefficients. The pure component properties of the solvent structures generated with UNIFAC groups can be estimated by group contribution methods (Pretel et al. [11], Gani and Constantinou [17]). The primary solvent properties can be estimated through the expressions given in Table 6. Pretel et al. [8] evaluated the performance of the UNIFAC method with respect to its liquid-liquid and vapour-liquid group interaction parameter tables. Their conclusion is that the vapour-liquid parameter table renders more reliable predictions at infinite dilution conditions than the liquid-liquid parameter table. In addition, there are a greater number of groups and parameters available in the vapour-liquid parameter table and its revisions (Gmehling et al., 1982; Macedo et al., 1983; Tiegs et al., 1987; and Hansen et al., 1991) than in the liquid-liquid parameter table (Magnussen et al., 1981).
Table 6. UNIFAC evaluation of primary solvent properties for liquid extraction

Property                           Estimate (mass basis)
Solvent selectivity                β
Solvent power                      Sp
Solute distribution coefficient    m
Solvent loss                       Sl

Each estimate is expressed in terms of infinite dilution activity coefficients and the molecular weights (MW) of the components.
[Figure 1. Typical cycle for the extraction of a dilute solute. Units: extractor, raffinate removal column, solvent and solute separation column. Streams: feed (A+B), solvent, extract (A+B+S), raffinate (B), solute (A).]

Extractive Distillation
The standard extractive distillation process works in two steps: the extractive distillation column and the solvent recovery column. The primary solvent properties are the degree to which the solvent increases the relative volatility between the mixture components, the normal boiling point difference between the solute and the solvent, and the amount of solvent required to break the azeotrope in the case of an azeotropic feed mixture. Another important constraint is that the solvent should be miscible in the mixture over the desired concentration range. This constraint is assessed with the phase stability criterion proposed by Michelsen [21]. Furthermore, the feed concentration should be considered in selecting the component that should be removed from the top of the extractive distillation column. This choice determines the nature of the solvents to be generated, with the purpose of increasing or decreasing the relative volatility of the feed mixture. The CAMD procedure estimates the solvent properties from activity coefficients and pure component properties, both obtained by group contribution methods based on UNIFAC groups. The computation of the desired properties on the basis of these estimates is given in Table 7.
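As a simple illustration of how a solvent changes the relative volatility, the sketch below evaluates the relative volatility of A with respect to B from activity coefficients and pure component vapour pressures. It assumes modified Raoult's law with an ideal vapour phase, which is a common simplification and not necessarily the expression used in Table 7. The numerical values are invented for the example.

```python
def relative_volatility(gamma_a, gamma_b, psat_a, psat_b):
    """Relative volatility of A with respect to B.

    Assumption: modified Raoult's law with an ideal vapour phase,
    alpha_AB = (gamma_A * Psat_A) / (gamma_B * Psat_B).
    Vapour pressures must be given in the same units.
    """
    return (gamma_a * psat_a) / (gamma_b * psat_b)


# HYPOTHETICAL values: a close-boiling pair without and with solvent.
alpha_without = relative_volatility(1.0, 1.0, psat_a=100.0, psat_b=95.0)
alpha_with = relative_volatility(2.0, 1.0, psat_a=100.0, psat_b=95.0)

print(round(alpha_without, 3))
print(round(alpha_with, 3))
```

In this invented case the solvent raises the activity coefficient of A relative to B, so A is the component removed at the top of the extractive distillation column.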
Table 7. MOLDES property estimates for solvent evaluation for extractive distillation

Property                                                               Symbol
Relative volatility                                                    β
Solvent power (mass basis)                                             Sp
Minimum amount of solvent to break the azeotrope (molar fraction)      xms
Phase stability criterion
Performance index

The estimates are expressed in terms of activity coefficients, relative volatilities and the molecular weights (MW) of the components.
2.3 APPLICATION EXAMPLES
Application of a CAMD method based on the generate & test approach is highlighted through two examples involving solvent-based separations.
2.3.1 Solvent for ethanol recovery
The recovery of ethanol from aqueous solutions is a problem of great industrial interest. Ethanol recovery and dehydration by distillation and azeotropic distillation are very energy intensive. The potential of liquid extraction for this application can be readily explored by CAMD. The search for a potential solvent illustrates the effect of physical property constraints on solvent selection. The solvent properties desired for this application are:

$\beta > 7.0$ (wt./wt.) (selectivity)
$T_{bS} - T_{bA} > 50$ K (boiling point difference)
$m > 1.0$ (wt./wt.) (distribution coefficient)
$S_l < 0.1$ wt.% (solvent loss)
Molecular design results for several homologous families of organic solvents are shown in Table 8. The low selectivities of alkyl amines and diols exclude all members of these families as potential solvents. Although all families satisfy the boiling point difference, the requirement of a distribution coefficient greater than one rejects all solvents with MW greater than 100. However, the solvent loss restriction requires higher molecular weights (>140, i.e. more CH2 groups) for the alcohol and carboxylic acid families; therefore no solvent that meets all the specifications can be found. Molecular design thus excludes liquid extraction as a feasible operation for this particular problem.
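The test half of the generate & test approach can be sketched as a simple screening function. This is only an illustration under stated assumptions: the `Candidate` record, the `failed_specs` helper and the two candidate property sets are hypothetical and are not data from the text; only the four threshold values come from the specifications above.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    beta: float       # selectivity, wt./wt.
    delta_tb: float   # T_bS - T_bA, K
    m: float          # distribution coefficient, wt./wt.
    loss: float       # solvent loss, wt.%

def failed_specs(c):
    """Return the list of specifications that candidate c violates."""
    checks = {
        "beta > 7.0": c.beta > 7.0,
        "TbS - TbA > 50 K": c.delta_tb > 50.0,
        "m > 1.0": c.m > 1.0,
        "Sl < 0.1 wt.%": c.loss < 0.1,
    }
    return [spec for spec, ok in checks.items() if not ok]

# Hypothetical property values, for illustration only (not from the text).
light = Candidate("light alcohol (hypothetical)", 9.0, 60.0, 1.2, 0.8)
heavy = Candidate("heavy alcohol (hypothetical)", 9.0, 120.0, 0.6, 0.05)
for c in (light, heavy):
    print(c.name, "->", failed_specs(c) or "meets all specifications")
```

The two hypothetical candidates mimic the conflict described in the text: the lighter one fails only the solvent loss limit, while the heavier one fails only the distribution coefficient requirement.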
Table 8. Effect of solvent property constraints on the molecular design of solvents for ethanol extraction from aqueous solution

Solvent family      $\beta$ > 7.0 (wt./wt.)   $T_{bS} - T_{bA}$ > 50 K   $m$ > 1.0 (wt./wt.)              $S_l$ < 0.1 wt.%
Phenyl acids        (+)                       (+)                        (-)
Alcohols            (+)                       (+)                        (+) if MW<100, (-) if MW>100     (-) if MW<140
Carboxylic acids    (+)                       (+)                        (+) if MW<100, (-) if MW>100     (-) if MW<140
Diols               (-)
Alkyl amines        (-)
In the synthesis of linear paraffinic solvents for the extraction of ethanol from water, the following 13 groups were used: (C5H3N), (CH2CO), (CH2COO), (CH2O), (CH2NH), (CH2), (OH), (CH3COO), (CH3O), (C5H4N), (COOH), (CH2NH2), (CH3). The molecular design program selects 1050 intermediate structures and 213 final structures, and generates 99 final solvents for which information on binary coefficients is available. However, as mentioned before, no liquid solvent met all the required primary properties. The design of solvents for the recovery of other oxychemicals from aqueous solutions, such as furfural, butanol, and propanoic and acrylic acids, is successfully accomplished by molecular design, and the results agree with experimental results for these systems [8].
2.3.2 Solvent for separation of n-propyl acetate from n-propyl alcohol by extractive distillation

The separation of n-propyl acetate from n-propyl alcohol serves to illustrate the application of MOLDES to the synthesis of potential solvents. The solvent should exhibit the following properties:

\[ \alpha_{B,A} \ge 3.0, \qquad S_p \ge 30.0\ \text{wt\%}, \qquad T_{bS} - T_{bA} > 50\ \text{K} \tag{7} \]
The best solvents found by MOLDES are shown in Table 9, together with experimental relative volatility values obtained by Cepeda and Resa (1984).

Table 9. CAMD solvent selection for the extractive distillation of n-propyl acetate from n-propyl alcohol at atmospheric pressure

Solvent            $\alpha_{B,A}$   $S_p$    $x_{ms}$   PI%
Ethylbenzene       5.4              80.6     35.9       14.2
Nonene             4.64             46.45    35.7       10.31
n-Decane           5.26             35.7     37.6       9.84
Chlorobenzene      3.71             86.0     33.7       9.78
Decalin            4.64             42.6     34.1       9.71
Chlorooctane       4.95             47.11    34.8       9.57
Xylene             3.95             67.1     40.9       9.1
Dichlorobenzene    3.1              60.93    30.6       6.89
Mesitylene         2.32             35.7     47.8       4.05

$\alpha_{B,A,exp}$ (seven values are reported for the nine solvents; their row assignment is not recoverable from the source): 4.23, 4.63, 4.7, 4.63, 4.37, 4.79, 4.24
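The PI% column of Table 9 can be cross-checked against the other columns. The sketch below assumes that the performance index is $\alpha_{B,A} / (MW_S\, x_{ms})$ expressed in percent, with $x_{ms}$ tabulated in mol %. Both the definition and the unit are assumptions inferred from the tabulated numbers. The molecular weights are standard values added here, not taken from the text, and the function name is illustrative.

```python
# Rows: (solvent, alpha_BA, x_ms in mol %, tabulated PI %, molecular weight in g/mol).
# Molecular weights are standard values added for this check, not from the text.
ROWS = [
    ("Ethylbenzene", 5.40, 35.9, 14.2, 106.17),
    ("Nonene", 4.64, 35.7, 10.31, 126.24),
    ("n-Decane", 5.26, 37.6, 9.84, 142.28),
    ("Mesitylene", 2.32, 47.8, 4.05, 120.19),
]

def performance_index(alpha, mw_solvent, x_ms_percent):
    """Assumed definition: PI% = 100 * alpha / (MW_S * x_ms), x_ms as a mole fraction."""
    return 100.0 * alpha / (mw_solvent * x_ms_percent / 100.0)

for name, alpha, x_ms, pi_table, mw in ROWS:
    pi = performance_index(alpha, mw, x_ms)
    print(f"{name:13s} computed PI% = {pi:6.2f}   tabulated PI% = {pi_table}")
```

Under these assumptions the computed values agree with the tabulated ones to within rounding, which also supports the row alignment of the $\alpha_{B,A}$, $x_{ms}$ and PI% columns.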
For this separation problem, ethylbenzene, nonene, n-decane and xylenes are the most attractive solvents. From their experimental study, Cepeda and Resa recommended the use of xylenes and saturated hydrocarbons with more than 9 atoms. If the reverse problem is studied, that is, if propyl alcohol is the solute of the extractive distillation column and is removed from the bottom together with the solvent, the selection changes drastically and the best solvents are now ethylene glycol or propylene glycol (Pretel et al. [8]).
2.4 REFERENCES

1. R. Gani and E.A. Brignole, Fluid Phase Equilibria 13 (1983) 331.
2. E.A. Brignole, S. Bottini, R. Gani, Fluid Phase Equilibria 29 (1986) 125.
3. Aa. Fredenslund, J. Gmehling and P. Rasmussen, "Vapor liquid equilibria using UNIFAC", Elsevier Scientific, Amsterdam, 1977.
4. V. Venkatasubramanian, K. Chan, J.M. Caruthers, Computers Chem. Eng. 18 (1994) 833.
5. K.G. Joback and G. Stephanopoulos, "Designing molecules possessing desired physical property values", Proceedings FOCAPD'89, Snowmass, CO, 1989.
6. N. Churi, L.E.K. Achenie, Ind. Eng. Chem. Res. 35 (1996) 3788.
7. P.M. Harper, R. Gani, P. Kolar, T. Ishikawa, Fluid Phase Equilibria 158-160 (1999) 337.
8. E.J. Pretel, P. Araya López, S.B. Bottini, E.A. Brignole, AIChE Journal 40 (1994) 1349.
9. R. Gani, B. Nielsen, Aa. Fredenslund, AIChE J. 37 (1991) 1318.
10. O. Odele, S. Machietto, Fluid Phase Equilibria 82 (1993) 47.
11. Aa. Fredenslund, J. Sorensen, Ch. 4, "Group Contribution Methods", in "Models for Thermodynamic and Phase Equilibria Calculations", editor S.I. Sandler, Marcel Dekker, Inc., New York, 1994.
12. S. Skjold-Jorgensen, Ind. Eng. Chem. Res. 27 (1988) 110.
13. H.P. Gros, S. Bottini, E.A. Brignole, Fluid Phase Equilibria 116 (1996) 537.
14. S. Espinosa, G. Foco, A. Bermúdez, T. Fornari, Fluid Phase Equilibria 172 (2000) 129.
15. R.C. Reid, J.M. Prausnitz, B.E. Poling, "The properties of gases and liquids", 4th Ed., McGraw-Hill Inc., New York, 1987.
16. E. Pretel, P. Lopez, A. Mengarelli, E. Brignole, Latin American Applied Res. 22 (1992) 187.
17. L. Constantinou, R. Gani, AIChE J. 40 (1994) 1697.
18. M. Cismondi, E.A. Brignole, Proceedings of the 11th European Symposium on Computer Aided Process Engineering, Denmark, May 2001, edited by R. Gani and S. Bay Jorgensen, Elsevier, ISBN 0-444-50709-4, pp. 375-380.
19. M. Cockrem, J. Flatt and E. Lightfoot, Sep. Sci. and Technol. 24 (1989) 769.
20. E. Cepeda, J.M. Resa, An. Quim. 80 (1984) 755.
Computer Aided Molecular Design: Theory and Practice. L.E.K. Achenie, R. Gani and V. Venkatasubramanian (Editors). © 2003 Elsevier Science B.V. All rights reserved.
Chapter 3: Optimization Methods in CAMD - I
M. Sinha, L. E. K. Achenie & G. M. Ostrovsky
3.1 INTRODUCTION

Chemical product design addresses the design of single component chemical compounds and/or mixtures (blends) of compounds with prespecified thermo-physical properties. In recent years, traditional chemical product design based on wet chemistry has been supplemented with computer-aided approaches. The latter is formally designated computer-aided product design. To be consistent with this book, we will employ the more conventional name, computer-aided molecular design (CAMD), in this chapter. The CAMD problem can often be posed as a mathematical program in which a number of binary and continuous variables define the search space (Duvedi and Achenie, 1996; Churi and Achenie, 1996; Maranas, 1997; Odele and Machietto, 1993; Pistikopoulos and Stefanis, 1998). A binary variable is an integer variable that can take one of two possible values, for example 0 and 1. This chapter discusses a branch and bound approach to solving the resulting mathematical program.
3.2 PROBLEM DEFINITION

A typical molecular design problem may be modeled as a single objective minimization or maximization subject to structural and performance constraints. Thus a CAMD problem for single component molecular design in which thermo-physical property matching is sought may be modeled as

\[ \min_{x, v, \theta} f(x, v, \theta) \tag{1} \]
\[ \varphi_j(x, v, \theta) \le 0, \quad j = 1, \ldots, m_1 \tag{2} \]
\[ h_i(x, v, \theta) = 0, \quad i = 1, \ldots, m_2 \tag{3} \]
where $v$ is a vector of binary variables that define the molecular structure, $x$ is a vector of continuous variables such as process variables (pressure, temperature, etc.) and $\theta$ is a vector of group contribution parameters. Note that additional binary variables may be included in $v$ to indicate additional constraints on the kind of molecular structures that can be generated. $f(x,v,\theta)$ is the performance objective function (for example some undesirable property such as a compound's ozone depletion potential). The group contribution model is a structure-property correlation that has found wide use in the chemical process industry.

The constraints involve (a) structural feasibility, (b) physical property targets, and (c) process constraints. The constraints associated with structural feasibility are usually linear. Physical property targets often have the form

\[ p_k^{L} \le \frac{\sum_j n_j \theta_{j1}}{\sum_j n_j \theta_{j2}} \le p_k^{U} \]

where $\theta_{j1}$ and $\theta_{j2}$ are elements in $\theta$ and $n_j$ is the number of $\theta_{j1}$ or $\theta_{j2}$ present in the molecule. Transformation of such constraints into a linear form is straightforward. The function $p_k(x,v,\theta)$ can also have the form

\[ p_k = \frac{f_1^{NL}\left(\sum_j n_j \theta_{ja}\right)}{f_2^{NL}\left(\sum_j n_j \theta_{jb}\right)} \]

where $f_1^{NL}$ and $f_2^{NL}$ are nonlinear functions, and $\theta_{ja}$ and $\theta_{jb}$ are parameters. Property constraints that employ this form include the solubility parameter based models often used in solvent design. It is not always possible to reformulate these constraints into linear or convex forms.

The nonlinear mathematical programming model for the CAMD problem (PMD) has the following features: (a) it is a nonconvex mixed integer nonlinear programming (MINLP) problem involving a large number of binary variables, (b) the number of linear constraints is larger than the number of nonlinear constraints, and (c) most of the components of the design vector ($u$) participate in the nonlinear terms. Previous attempts using global optimization are either geared to small problems or use soft computing approaches (such as simulated annealing and genetic algorithms). The approach discussed here is based on the branch and bound (BB) algorithm. The basic BB algorithm may encounter a large number of branching variables for product design problems. To address this, the branch-and-bound global optimization algorithm presented here exploits the problem structure and allows a significant reduction in branching expressions. The discussion of the algorithm is based on the papers (Sinha, Achenie and Ostrovsky, 1999) and (Ostrovsky, Achenie and Sinha, 2000).

In group contribution based computer aided single component product design, solvents are formed from certain combinations of a set of structural groups. The pre-specified set of $m$ structural groups is called the basis set. The size and composition of the basis set depend on the intended application, the availability of accurate property prediction models and the computational resources available.
First, we define a set of variables based on an initial set of structural groups as

\[ u_{ik} = \begin{cases} 1 & \text{if the $i$-th group in the molecule is the $k$-th structural group in the basis set} \\ 0 & \text{otherwise} \end{cases} \qquad \text{(Churi-Achenie model)} \]
\[ u_i = \begin{cases} 1 & \text{if the $i$-th structural group in the basis set is in the molecule} \\ 0 & \text{otherwise} \end{cases} \qquad \text{(Odele-Machietto model)} \tag{4} \]

Odele and Machietto (1993) proposed a formulation that ensured that the valence of each structural group was satisfied. This formulation only accounts for the presence or absence of structural groups in the molecule; it does not consider how the groups are connected to each other. To overcome this limitation, Churi and Achenie (1996) proposed a model that gives complete information on how the groups are connected to each other. Presently there is no known group contribution method that takes advantage of the connectivity information of the Churi-Achenie model. In the latter model, the following variables were introduced
\[ z_{ijp} = \begin{cases} 1 & \text{if the $i$-th group's $j$-th site is attached to the $p$-th group} \\ 0 & \text{otherwise} \end{cases} \]
\[ w_i = \begin{cases} 1 & \text{if the $i$-th group in a molecule does not have a group attached} \\ 0 & \text{otherwise} \end{cases} \tag{5} \]

For single component solvents, structural constraints are imposed for (a) limiting the number of structural groups in a molecule; (b) ensuring that the number of bonds attached to a group equals the valence of the group; and (c) ensuring that each group in a molecule is attached to at least one other group. The formulation is effective in specifying whether the molecule is acyclic or cyclic. Moreover, the maximum number of cycles can also be controlled. This representation is also effective in distinguishing between isomers.

If the chemical process is not accounted for, then the pure component molecular design problem involves only binary variables. Let the maximum number of groups in a molecule be $n_{max}$, the number of groups in the basis set be $m$, and the maximum valence be $s_{max}$. The search dimension is then $n_{max} \times m + n_{max} \times s_{max} \times n_{max} + n_{max}$; that is, the number of binary variables equals the sum of the dimensions of $u$, $z$ and $w$, respectively (assuming the Churi-Achenie model is used). The number of linear structural constraints employed is $n_{max}^2 + n_{max} \times m + 3 n_{max} + s_{max} + 1$. For example, a CAMD problem with $n_{max} = 5$, $m = 10$, and $s_{max} = 2$ results in 93 linear constraints. The number of nonlinear constraints is generally small compared to the number of linear constraints.

Let all the binary variables in the problem be assembled in the vector $v$ ($q$-dimensional). If the Odele-Machietto model is employed then $v = u$; if the Churi-Achenie model is employed then $v = [u, z, w]$. Then the solvent design problem (see (1), (2), (3)) can be expressed compactly as a mixed integer nonlinear program in the general form P:

\[ f^{*} = \min_{(x,v) \in D} f(x,v) \tag{6} \]

such that

\[ D = \{x, v : c \le x \le d,\ \varphi_i(x,v) \le 0,\ i = 1, \ldots, m,\ h(x,v) = 0,\ x \in X \subseteq \mathbb{R}^n,\ v \in \{0,1\}^q\} \tag{7} \]
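The variable and constraint counts quoted for the Churi-Achenie representation are easy to check numerically. The short sketch below simply evaluates the two expressions from the text; the function name and the split into `n_u`, `n_z`, `n_w` are illustrative choices, not part of the original formulation.

```python
def churi_achenie_size(n_max, m, s_max):
    """Evaluate the counts quoted in the text for the Churi-Achenie representation."""
    n_u = n_max * m                # u_ik: group i in the molecule is basis group k
    n_z = n_max * s_max * n_max    # z_ijp: site j of group i attached to group p
    n_w = n_max                    # w_i: one flag per group position
    n_binary = n_u + n_z + n_w
    n_linear = n_max**2 + n_max * m + 3 * n_max + s_max + 1
    return n_binary, n_linear

n_binary, n_linear = churi_achenie_size(n_max=5, m=10, s_max=2)
print(n_binary, n_linear)  # 105 binary variables, 93 linear constraints
```

For the example in the text ($n_{max} = 5$, $m = 10$, $s_{max} = 2$) this gives 105 binary variables alongside the 93 linear constraints, which shows how quickly the search space grows even for a small basis set.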
3.3. DESCRIPTION OF THE PROPOSED METHOD OF SOLUTION

3.3.1 Branch-and-Bound Algorithm Preliminaries
The branch and bound (BB) method (Horst and Tuy, 1990) has been used for solving several problems in chemical engineering (Ostrovsky et al., 1990; Friedler et al., 1998; Quesada and Grossmann, 1995; Ryoo and Sahinidis, 1996; Maranas and Floudas, 1997; Adjiman et al., 1998). The generic BB method looks for a minimum of the objective function $f(x,v)$ by partitioning the region $D$ into subregions $D_i$ with respect to the search variables. At each iteration, a subregion $D_i$ is further partitioned into $D_{ip}$ and $D_{iq}$ ($D_i = D_{ip} \cup D_{iq}$). The generic BB method consists of the following:

(i) An algorithm for estimating a lower bound (LB) $\mu_i$ on the objective function $f(x,v)$ in any subregion $D_i \subset D$ such that $\mu_i \le f(x,v)\ \ \forall (x,v) \in D_i$
(ii) An algorithm for estimating an upper bound (UB) $\eta_j$ on $f(x,v)$ in any $D_j \subset D$ such that $\eta_j \ge f(x,v)\ \ \forall (x,v) \in D_j$
(iii) An algorithm for partitioning $D_i$
Designate the set of subregions at the $k$-th iteration of the BB method as $L^{(k)} = (D_i,\ i = 1, \ldots, N_k)$. Let $I^{(k)}$ be the index set of the subregions belonging to $L^{(k)}$. Then the algorithm for the BB method is as follows.

Step 1. Set $k = 1$. Give an initial set $L^{(0)}$ of the subregions $D_i$ ($i = 1, \ldots, N_0$; usually $N_0 = 1$).
Step 2. Calculate an LB for each $D_i \in L^{(k)}$.
Step 3. Determine the subregion with the least LB. Let it be the $l_m$-th region; then
\[ \mu_{l_m} = \min_{l \in I^{(k)}} \mu_l \tag{8} \]
Step 4. Split $D_{l_m}$ into two subregions $D_p$ and $D_q$ ($D_{l_m} = D_p \cup D_q$) such that
\[ D_p = \{x : x \in D_{l_m},\ x_s \le c_s\}, \qquad D_q = \{x : x \in D_{l_m},\ x_s \ge c_s\} \]
The variable $x_s$ is the branching variable and $c_s$ is the branching point.
Step 5. Determine LB and UB for subregions $p$ and $q$.
Step 6. Determine the least upper bound $\eta^{(k)}$ at the $k$-th iteration:
\[ \eta^{(k)} = \min\left(\eta^{(k-1)}, \eta_p, \eta_q\right) \]
For the first iteration $\eta^{(0)} = \infty$.
Step 7. If $\eta^{(k)} - \mu_{l_m} < \varepsilon$ then STOP.
Step 8. If
\[ \mu_j > \eta^{(k)} \tag{9} \]
is met for $j = p$ or $j = q$, then the corresponding subregion is eliminated from consideration.
Step 9. Form a new set $L^{(k)}$ of the remaining subregions as follows
\[ L^{(k)} = \left(D_1, \ldots, D_{l_m - 1}, D_{l_m + 1}, \ldots, D_{N_k}, \tilde{L}\right) \]
where
\[ \tilde{L} = \begin{cases} D_p, D_q & \text{if } \mu_j < \eta^{(k)},\ j = p, q \\ D_q & \text{if } \mu_q < \eta^{(k)} < \mu_p \\ D_p & \text{if } \mu_p < \eta^{(k)} < \mu_q \end{cases} \]
Step 10. Set $k = k + 1$, and go to Step 3.
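The steps above can be sketched for a one-dimensional problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the test function `f`, the Lipschitz-based lower bound `lipschitz_lb`, the midpoint upper bound, bisection as the partitioning rule, and all names are introduced here for illustration only.

```python
import heapq

def branch_and_bound(f, lower_bound, lo, hi, eps=1e-3, max_iter=200_000):
    """Best-first BB for min f(x) on [lo, hi]; returns (x_best, eta)."""
    mid = 0.5 * (lo + hi)
    x_best, eta = mid, f(mid)                    # incumbent (least upper bound)
    regions = [(lower_bound(lo, hi), lo, hi)]    # the set L, keyed by lower bound
    for _ in range(max_iter):
        if not regions:
            break
        mu, a, b = heapq.heappop(regions)        # Step 3: region with least LB
        if eta - mu < eps:                       # Step 7: stopping test
            break
        c = 0.5 * (a + b)                        # Step 4: branching point
        for p, q in ((a, c), (c, b)):
            m = 0.5 * (p + q)
            if f(m) < eta:                       # Steps 5-6: UB from a feasible point
                x_best, eta = m, f(m)
            mu_child = lower_bound(p, q)         # Step 5: LB of the child region
            if mu_child <= eta:                  # Step 8: drop regions with LB > UB
                heapq.heappush(regions, (mu_child, p, q))
    return x_best, eta

def f(x):
    return x**4 - 3.0 * x**2 + x                 # nonconvex test function (assumed)

def lipschitz_lb(a, b, lip=45.0):
    # |f'(x)| <= 45 on [-2, 2], so f >= f(mid) - lip * (half width) on [a, b].
    return f(0.5 * (a + b)) - lip * 0.5 * (b - a)

x_best, eta = branch_and_bound(f, lipschitz_lb, -2.0, 2.0)
print(x_best, eta)
```

The priority queue plays the role of the set $L^{(k)}$: popping its smallest element selects the subregion with the least lower bound, and children whose lower bound exceeds the incumbent are never stored. The quality of `lower_bound` determines how many regions must be examined, which is why the chapter devotes most of its attention to constructing tight estimators.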
Each BB method needs algorithms for partitioning and for estimating lower and upper bounds. We therefore describe the algorithms we have developed for estimating lower and upper bounds for the mixed integer nonlinear program arising from our formulation of the computer aided molecular design problem.

Let us consider the partitioning algorithm. At each iteration in a standard BB method, the "optimal" subregion $D_{l_m}$ is partitioned into two subregions $D_p$ and $D_q$ using the constraints $x_i \le x_i^{*}$ and $x_i \ge x_i^{*}$ (or the corresponding constraints on $v_i$) as follows

\[ D_p = \{x : x \in D_{l_m},\ x_s \le c_s\}, \qquad D_q = \{x : x \in D_{l_m},\ x_s \ge c_s\} \]

The variable $x_s$ is the branching variable and $c_s$ is the branching point. Different BB methods have different ways of selecting these. Thus in this case $n + q$ branching variables are used. In a realistic product design problem, the number of branching variables can be several hundred. It is known that the number of branching nodes grows exponentially with the number of branching variables. To alleviate this problem, we will use the following new partitioning algorithm. Instead of branching on the variables $(x,v)$, we will use appropriate functions $\psi_j(x,v)$, $j = 1, \ldots, p$, of the search variables for branching. Subsequently, $D_i$ will be determined by the set of inequalities $a_j^i \le \psi_j(x,v) \le b_j^i$, $j = 1, \ldots, p$, where the lower and upper bounds $a_j^i$ and $b_j^i$, which give the dimensions of the multidimensional box (subregion) $D_i$, are determined by the branch-and-bound strategy. Thus $D_i$ has the form

\[ D_i = \{x, v : (x,v) \in D,\ a_j^i \le \psi_j(x,v) \le b_j^i,\ j = 1, \ldots, p\} \]

Problem P for subregion $D_i$ is written as $P_i$:

\[ f_i = \min_{(x,v) \in D_i} f(x,v) \tag{10} \]

A direct solution of the above problem is very difficult. Instead, the approach to be described finds the solution indirectly by successively estimating lower and upper bounds for the performance objective function $f_i$. In the limit, these bounds should collapse into one to give a solution to the above problem. Thus it is appropriate to discuss how these bounds are obtained.

3.3.2 Lower Bound Algorithm

A lower bound $f_i^{L}$ for $f_i$ on $D_i$ is obtained by solving the following problem
\[ P_i^{L}: \qquad f_i^{L} = \min_{(x,v) \in \bar{D}_i} L[f(x,v); D_i] \]

where

\[ \bar{D}_i = \{x, v : L[\varphi_k; D_i] \le 0,\ k = 1, \ldots, m;\ L[\psi_j; D_i] \le b_j^i;\ L[-\psi_j; D_i] \le -a_j^i;\ v \in \{0,1\}^q\} \]

and $L[g(x,v); D_i]$ is a convex underestimator for the generic function $g(x,v)$. Then it is easy to verify that $D_i \subseteq \bar{D}_i$. Some alternatives for estimating lower bounds are: (a) the use of linear or convex nonlinear underestimators; (b) enforcing the integrality of all the binary variables $v$ at each iteration (Pantelides, 1996); (c) treating the variables $v$ as continuous variables such that $0 \le v \le 1$. In the latter, the variables become binary only at termination of the algorithm. We will construct linear underestimators and we will enforce integrality of $v$ at each iteration as in (b). The resulting problem ($P_i^{L}$) is a mixed integer linear program (MILP).
3.3.3 Upper Bound Algorithm

The upper bound $f_i^{U}$ for $f_i$ on $D_i$ can be found by computing $f_i^{U} = f(\bar{x}, \bar{v})$, where $[\bar{x}, \bar{v}]$ is a feasible point for problem (10). The latter can be obtained by solving

\[ y^{*} = \min_{x, v, y} y \quad \text{subject to} \quad \bar{\varphi}_j(x,v) \le y, \quad j = 1, \ldots, (m + 2p) \tag{11} \]

where

\[ \bar{\varphi}_j = \begin{cases} \varphi_j, & j = 1, \ldots, m \\ -\psi_{j-m} + a^{i}_{j-m}, & j = (m+1), \ldots, (m+p) \\ -b^{i}_{j-(m+p)} + \psi_{j-(m+p)}, & j = (m+1+p), \ldots, (m+2p) \end{cases} \]

This is a nonconvex problem and therefore computationally intensive to solve at each iteration. To circumvent this, we obtain an upper estimate $\bar{y}$ of the value $y^{*}$ by solving the problem

\[ P^{U}: \qquad \bar{y} = \min_{x, v, y} y \quad \text{subject to} \quad U[\bar{\varphi}_j(x,v); D_i] \le y, \quad j = 1, \ldots, (m + 2p) \]
where $U[\bar{\varphi}_j(x,v); D_i]$ is a linear overestimator of $\bar{\varphi}_j(x,v)$ on $D_i$ such that

\[ U[\bar{\varphi}_j(x,v); D_i] \ge \bar{\varphi}_j(x,v), \quad \forall (x,v) \in D_i. \]

Let

\[ \tilde{D}_i = \{x, v : U[\bar{\varphi}_j(x,v); D_i] \le 0,\ j = 1, \ldots, (m + 2p)\}; \]

then $\tilde{D}_i \subseteq D_i$ and $\bar{y} \ge y^{*}$. Problem $P^{U}$ is an MILP. It should be noted that we could terminate the solution of $P^{U}$ whenever $y \le 0$.

During evaluation of the lower and upper bounds for subregion $D_i$, the following situations may arise at the $k$-th step of the branch-and-bound algorithm: (i) $\bar{D}_i \ne \emptyset$, $\bar{y} \le 0$; (ii) $\bar{D}_i \ne \emptyset$, $\bar{y} > 0$; and (iii) $\bar{D}_i = \emptyset$. In (i), we can calculate both the lower and upper bounds, while in (ii) we can only calculate the lower bound, since we cannot ensure that the point obtained by solving problem $P^{U}$ is feasible for problem (10). Finally, in (iii), $D_i$ does not contain solution points and consequently it can be excluded from consideration. The branching point $\psi^{*} = \psi(x^{*}, v^{*})$ is determined at the solution point of the lower bound problem $P^{L}$.
3.3.4 Linear Estimators and Branching Functions

The main challenge in a BB based method is the construction of underestimators and overestimators. McCormick (1976) suggested the factorable programming technique for constructing a convex underestimator for a function represented in factorable form. Sherali and Alameddine (1992) suggested a general approach for constructing underestimators for arbitrary polynomial functions. A method for constructing underestimators for more general functions is proposed in the $\alpha$BB global optimization method (Adjiman et al., 1998). In all the above approaches, the dimension of the lower bound problem can be much larger than the dimension of the original problem. Here we present an alternative approach in which the lower bound problem has dimension not greater than that of the original problem.

Let us consider a class of functions $\varphi_i$ that can be represented as a tree graph (Fig. 1). Denote the root node of the graph as $A_1^{N}$. The nodes $A_j^{N-k}$, which are $k$ branches apart from the root node, are at the $(N-k)$-th level. Let the $k$-th level of the tree graph have $p_k$ nodes. Each node $A_j^{N-k}$ has $q_j^{N-k}$ descendants. Assign a differentiable function $\varphi_j^{(N-k)}$ and $q_j^{(N-k)}$ continuously differentiable functions $f_{ji}^{(N-k)}(y)$ of one variable $y$ to each node $A_j^{N-k}$ of the $(N-k)$-th level ($k = 1, \ldots, N-1$). The original function $\varphi_i$ corresponds to the root node. Thus the following relations hold

\[ \varphi_i^{(N-k)} = \sum_{j \in Q_i^{(N-k)}} c_{ji}^{(N-k)} f_{ji}^{(N-k)}\left(\varphi_j^{(N-k-1)}\right) \tag{12} \]
where $Q_j^{(N-k)}$ is the set of descendant nodes of $A_j^{(N-k)}$. The variable $x_i$ corresponds to a leaf node. Without loss of generality, we will assume that the leaf nodes are associated with the first level. Otherwise we employ identity transformations to relate the variable $x_i$ to the first level. Suppose, for example, that the variable $x_i$ is associated with the second level. Then we can introduce the transformation $\varphi_i^{(2)} = x_i$. In so doing we have related $x_i$ to the first level as well.

Figure 1: A multilevel representation of a tree
A function $f(x)$ is defined as a special tree function (STF) if each node of the computational graph corresponding to it is characterized by the relation in Eq. (12). Thus the STF is a superposition of univariate concave or convex functions connected by simple arithmetic operations, namely addition, subtraction, multiplication by a constant coefficient, and the operations corresponding to the univariate functions $f_{ji}^{(N-k)}(y)$ at intermediate nodes. There exist different ways to transform a tree function into an STF. The simplest is to use the following transformation to remove the multiplication operation:

\[ f(x)\, g(x) = \tfrac{1}{4}\left(f(x) + g(x)\right)^2 - \tfrac{1}{4}\left(f(x) - g(x)\right)^2 \]
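The transformation that removes the multiplication operation can be checked numerically. The helper name `product_as_stf` is illustrative; the sketch only confirms that a product equals the difference of a convex and a concave univariate term in the sum and the difference of the factors.

```python
def product_as_stf(f_val, g_val):
    """Rewrite f*g as a difference of two univariate squares (no product term)."""
    s = f_val + g_val    # argument of the convex term 0.25*s**2
    d = f_val - g_val    # argument of the concave term -0.25*d**2
    return 0.25 * s**2 - 0.25 * d**2

# The identity holds for any real values, in particular for binary ones.
for f_val, g_val in [(0, 0), (0, 1), (1, 1), (3.0, -2.5)]:
    assert abs(product_as_stf(f_val, g_val) - f_val * g_val) < 1e-12
```

After this rewriting, the only nonlinearities are the univariate squares, whose arguments (the sum and the difference) are the natural candidates for branching.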
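We propose a strategy for constructing a linear underestimator for the function $\varphi_1^{(N)}$ corresponding to the root node $A_1^{N}$. Note that $\varphi_1^{(N)}$ is a complex multilevel function of the variables $x_1, \ldots, x_n$ at the first level. We will assume that all the coefficients $c_{ji}^{(N-k)}$ are positive. If a coefficient $c_{ji}^{(N-k)}$ is negative, we can introduce the new notation $\bar{c}_{ji}^{(N-k)} = -c_{ji}^{(N-k)}$ and $\bar{f}_{ji}^{(N-k)} = -f_{ji}^{(N-k)}$, and replace $c_{ji}^{(N-k)} f_{ji}^{(N-k)}$ by $\bar{c}_{ji}^{(N-k)} \bar{f}_{ji}^{(N-k)}$. Here $\bar{c}_{ji}^{(N-k)} > 0$. Let $\varphi_j^{(N-k-1)} \in S_j^{(N-k-1)}$, where

\[ S_j^{(N-k-1)} = \left\{\varphi_j^{(N-k-1)} : \underline{\varphi}_j^{(N-k-1)} \le \varphi_j^{(N-k-1)} \le \overline{\varphi}_j^{(N-k-1)}\right\} \tag{13} \]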
If we know the bounds for the variables $x_i$ at the first level, we can estimate bounds for all functions $\varphi$ (at all levels) by using interval arithmetic (Moore, 1966). A linear underestimator of the function $\varphi_i^{(N-k)}$ in the region $S_i^{(N-k)}$ with respect to the functions $\varphi_j^{(N-k-1)}$, $j \in Q_i^{N-k}$, will be designated as $L[\varphi_i^{(N-k)}; S_i^{(N-k)}]$.

One can find a linear relation between $L[\varphi_i^{(N-k)}; S_i^{(N-k)}]$ and the linear underestimators of $\varphi_j^{(N-k-1)}$ at the descendant nodes $A_j^{(N-k-1)}$ as

\[ L[\varphi_i^{(N-k)}; S_i^{(N-k)}] = \sum_{j \in Q_i^{N-k}} c_{ji}^{(N-k)} L\left[f_{ji}^{(N-k)}\left(\varphi_j^{(N-k-1)}\right); S_j^{(N-k-1)}\right] \le \sum_{j \in Q_i^{N-k}} c_{ji}^{(N-k)} f_{ji}^{(N-k)}\left(\varphi_j^{(N-k-1)}\right) \tag{14} \]

Now we will construct a linear underestimator for the function $f_{ji}^{(N-k)}(\varphi_j^{(N-k-1)})$ at the $(N-k)$-th level with respect to $\varphi_j^{(N-k-1)}$. Let the latter satisfy Eq. (13). To simplify the notation for subsequent developments, let $y = \varphi_j^{(N-k-1)}$ and consider the function $f(y)$ in the region $S_y = \{y : \underline{y} \le y \le \overline{y}\}$. If $f(y)$ is concave, then in $S_y$ the linear underestimator has the form

\[ L[f(y); S_y] = f(\underline{y}) + \frac{f(\overline{y}) - f(\underline{y})}{\overline{y} - \underline{y}}\,(y - \underline{y}) \tag{15} \]

If instead $f(y)$ is convex, then a linear underestimator is given by the tangent to $f(y)$ at the point $y_m = (\underline{y} + \overline{y})/2$. In this case the underestimator is given by the following formula

\[ L[f(y); S_y] = f'(y_m)(y - y_m) + f(y_m) \tag{16} \]

Here $f'(y_m)$ is the derivative of the function $f(y)$ at the point $y_m$.
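Equations (15) and (16) translate directly into two small helper functions. The sketch below is illustrative: the function names and the two test functions (the square root as a concave example, the exponential as a convex one) are assumptions made here and do not come from the text.

```python
import math

def secant_underestimator(f, y_lo, y_hi):
    """Eq. (15): chord through the interval end points; valid when f is concave."""
    slope = (f(y_hi) - f(y_lo)) / (y_hi - y_lo)
    return lambda y: f(y_lo) + slope * (y - y_lo)

def tangent_underestimator(f, dfdy, y_lo, y_hi):
    """Eq. (16): tangent at the interval midpoint; valid when f is convex."""
    y_m = 0.5 * (y_lo + y_hi)
    return lambda y: dfdy(y_m) * (y - y_m) + f(y_m)

# Illustrative functions (not from the text): sqrt is concave, exp is convex.
L_sqrt = secant_underestimator(math.sqrt, 1.0, 4.0)
L_exp = tangent_underestimator(math.exp, math.exp, 0.0, 2.0)

grid = [i / 100.0 for i in range(0, 401)]
assert all(L_sqrt(y) <= math.sqrt(y) + 1e-12 for y in grid if 1.0 <= y <= 4.0)
assert all(L_exp(y) <= math.exp(y) + 1e-12 for y in grid if 0.0 <= y <= 2.0)
```

Shrinking the interval tightens the secant, since the chord approaches the curve; this is the mechanism by which branching on the argument $y$ improves the lower bound.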
Substituting the expressions for the linear underestimators of $f_{ji}^{(N-k)}$ into Eq. (14), we obtain

\[ L[\varphi_i^{(N-k)}; S_i^{(N-k)}] = \sum_{j \in Q_i^{N-k}} d_j\, \varphi_j^{(N-k-1)} \tag{17} \]

Again we will assume that $d_j > 0$; otherwise we can employ the transformation discussed earlier. Hence we finally obtain the following expression for the linear underestimator of $\varphi_i^{(N-k)}$:

\[ L[\varphi_i^{(N-k)}; S_i^{(N-k)}] = \sum_{j \in Q_i^{N-k}} d_j\, L[\varphi_j^{(N-k-1)}; S_j^{(N-k-1)}] \tag{18} \]

At the $(N-k-1)$-th level, we need to know the sign of $d_j$, which is determined at the upper level $(N-k)$. Therefore, starting from the $N$-th level and moving down to the 2nd level, we obtain all the relations expressed in Eq. (18) for $k = 0, 1, \ldots, N-1$. A linear underestimator for the function $\varphi_1^{(N)}$ can be represented as a linear function of the variables $x_j$ (associated with the first level) as follows

\[ L[\varphi_1^{(N)}; S_1^{(N)}] = \sum_{j=1}^{n} c_j x_j \tag{19} \]

In summary, the construction of a linear underestimator for a tree function involves:

1. A bottom to top sweep to obtain all bounds $[\underline{\varphi}_i^{k}, \overline{\varphi}_i^{k}]$ for all $k = 1, \ldots, N$ and $i = 1, \ldots, p_k$
2. A top to bottom sweep to obtain the relations (Eq. (18)) for all levels
3. A bottom to top sweep to obtain $L[\varphi_i^{(N-k)}; S_i^{(N-k)}]$ as linear functions of $x$ and $v$.

We will refer to this method as the sweep method. A similar procedure can be used for the construction of linear overestimators. It is important that the underestimator is a linear function of the variables $x$ and $v$. We note that the dimension of the lower bound problem $P^{L}$ is the same as the dimension of the original problem P.
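The first sweep (bottom to top, bounds at every node) relies on interval arithmetic. A minimal sketch is given below for a three-level tree representing the product of two variables in STF form. The `Interval` class, its method names and the chosen leaf bounds are illustrative assumptions, and only the operations needed for this small tree are implemented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def scale(self, c):
        """Multiply by a constant c of either sign."""
        a, b = c * self.lo, c * self.hi
        return Interval(min(a, b), max(a, b))

    def square(self):
        """Range of y**2 for y in the interval."""
        a, b = self.lo**2, self.hi**2
        lo = 0.0 if self.lo <= 0.0 <= self.hi else min(a, b)
        return Interval(lo, max(a, b))

# Level 1: leaf variables (illustrative bounds for two relaxed binary variables)
u1, u2 = Interval(0.0, 1.0), Interval(0.0, 1.0)
# Level 2: arguments of the univariate functions
s, d = u1 + u2, u1 - u2
# Level 3 (root): 0.25*s**2 - 0.25*d**2, an STF form of the product u1*u2
root = s.square().scale(0.25) - d.square().scale(0.25)
print(s, d, root)
```

The bounds obtained at level 2 are exactly the intervals on which the secant and tangent estimators of the next sweep are built. The root interval is wider than the true range of the product, which is the usual overestimation of interval arithmetic when a variable appears more than once.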
3.3.5. Selection of Branching Function

In a conventional BB method, the branching variables are the search variables $x_i$. However, the large dimensionality of $x_i$ ($i = 1, \ldots, n$) can result in a rapid growth in the number of branches in the BB tree. To address this problem, we consider an alternative selection of the branching expressions: we employ the arguments $\varphi_j^{(N-k-1)}$ of all the functions $f_{ji}^{(N-k)}$ as branching variables. Branching on $\varphi_j^{(N-k-1)}$ will decrease the intervals described by (13). Therefore, tighter linear underestimators of $f_{ji}^{(N-k)}$ will be obtained, since the maximum gap between $f_{ji}^{(N-k)}$ and its linear underestimator tends to zero as the size of the interval $S_j^{(N-k-1)}$ tends to zero. Only independent functions $\varphi$ can be used as branching functions. The suggested approach to the selection of branching expressions will be advantageous if the number of independent functions among the functions $\varphi_j^{(N-k-1)}$ is less than the number of variables $x_i$ ($i = 1, \ldots, n$). In our formulation of the molecular design problem, this is indeed the case.
3.4 STEP BY STEP ALGORITHM FOR SOLUTION TECHNIQUE

Step 1: Decide on the set of groups to be used to form compounds. Identify the design variables. The first set of design variables is v. The second set of design variables is x.
Step 2: Develop the performance objective f (such that it has the structure of Eq. (1)). Ensure that the performance objective can be calculated directly or indirectly using a group contribution property model. Also make sure that one or more of the design variables affect the performance objective directly or indirectly.
Step 3: Develop the property constraints (such that they have the structure of Eq. (2)). Ensure that these constraints can be calculated directly or indirectly using a group contribution model. Also make sure that one or more of the design variables affect each constraint directly or indirectly.
Step 4: Develop structural feasibility constraints (i.e. an Octet Rule model such that it has the structure of Eq. (2)). Examples are the Odele-Machietto Octet Rule model (1993) and the Churi-Achenie Octet Rule model (1996).
Step 5: Using information from the previous steps, assemble the mathematical program, i.e. the performance objective, constraints, design variables and the Octet Rule model.
Step 6: Construct linear estimators of the performance objective and the constraints by the sweep method (see Section 3.3.4 and the illustration in Section 3.6).
Step 7: Find the optimal structure of the molecule by using the BB method from Section 3.3.1.
3.5 METHODS AND TOOLS

To use the solution technique, you will need the following:
(a) Group contribution based property estimation methods to calculate the needed physical properties. More details on these methods are given in Chapter 1 of this book.
(b) An MILP (mixed integer linear program) solver to be used in Step 7. MILP solvers are available in commercial software such as CPLEX (www.cplex.com) and OSL (www.research.ibm.com/osl). The public domain code lp_solve by Hartmut Schwab (ftp.es.ele.tue.nl/pub/lp_solve) can also be used.
(c) An implementation of the branch and bound procedure from Section 3.3.1.
3.6 APPLICATION EXAMPLE

To illustrate the CAMD design procedure using the branch and bound (BB) method, we use a simple example. Note that this is not a CAMD problem, yet it has the structure of one. A CAMD example is found in Chapter 10.

Step 1 of Section 3.4

Assume that the structural groups in the basis set have been given and that they have been labeled 1 through 4, where 4 is the number of structural groups. Assume that the Odele-Machietto model is employed; then $v = u$. Therefore the first set of design variables is $u = [u_1, u_2, u_3, u_4]$. For example, if $u_1 = 1$, then structural group 1 is present in the molecule; otherwise $u_1 = 0$ and structural group 1 is absent. The second set of design variables is $x = [x_1, x_2]$, representing, for example, two properties.
Step 2 of Section 3.4

Suppose that the performance objective function is given by

\[ f(u,x) = a_1 u_1 + a_2 u_2 + a_3 u_3 + a_4 u_4 + a_5 x_1 + a_6 x_2 \tag{20} \]

Steps 3 and 4 of Section 3.4

Suppose also that the molecular structural constraints and the property constraints are given by

\[ \varphi_0(u) \equiv u_1 + \ldots + u_4 - a \le 0 \tag{21} \]
\[ \varphi_1(u,x) \equiv a_{11} u_1 + \ldots + a_{14} u_4 + a_{15} u_1 u_2 + a_{16} x_1 + c_1 \le 0 \tag{22} \]
\[ \varphi_2(u,x) \equiv a_{21} u_1 + \ldots + a_{24} u_4 + a_{25} u_3 u_4 + a_{26} x_1 + c_2 \le 0 \tag{23} \]

Here $u_i$ are binary variables (for a real CAMD problem these would represent the presence or absence of structural group number $i$ from the basis set, see Step 1 of Section 3.4). In addition, $x_i$ are continuous variables (for a real CAMD problem these would represent properties of interest) and $a_{ij}$ are known constants that appear in the model.
Step 5 of Section 3.4
The resulting CAMD model is

$$\begin{aligned} \min_{x,u}\ & f(u,x) \\ \text{s.t.}\ & \varphi_i(u,x) \le 0, \quad i = 0,1,2 \\ & u_j \in \{0,1\}, \quad j = 1,\dots,4 \\ & x^L \le x \le x^U \end{aligned} \tag{24}$$
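Because model (24) has only four binary variables, a reference solution can be obtained by enumeration, which is useful for checking a BB implementation. The sketch below is illustrative only: every coefficient value is hypothetical (none comes from the text), and the closed-form choice of $x$ assumes $a_5 > 0$, $a_6 > 0$ and $a_{16} = a_{26} = -1$, so that (22) and (23) reduce to lower bounds on $x_1$.

```python
from itertools import product

# Hypothetical coefficients (assumptions for illustration, not from the text).
A_OBJ = (-3.0, -2.0, -4.0, -1.0)          # a1..a4 in (20)
A5, A6 = 1.0, 0.5                         # a5, a6 in (20), both positive
A_MAX = 2                                 # "a" in (21)
ROW1 = (1.0, 1.0, 1.0, 1.0, 2.0, -1.0)    # a11..a14, a15, c1 (a16 = -1 assumed)
ROW2 = (1.0, 1.0, 1.0, 1.0, 3.0, -0.5)    # a21..a24, a25, c2 (a26 = -1 assumed)
X_LO, X_UP = 0.0, 10.0                    # bounds x^L, x^U on both x1 and x2


def solve_by_enumeration():
    """Return (f, u, (x1, x2)) minimising (20) subject to (21)-(23)."""
    best = None
    for u in product((0, 1), repeat=4):
        if sum(u) - A_MAX > 0:            # structural constraint (21)
            continue
        # With a16 = a26 = -1, (22) and (23) read x1 >= g1 and x1 >= g2.
        g1 = sum(a * ui for a, ui in zip(ROW1[:4], u)) + ROW1[4] * u[0] * u[1] + ROW1[5]
        g2 = sum(a * ui for a, ui in zip(ROW2[:4], u)) + ROW2[4] * u[2] * u[3] + ROW2[5]
        x1 = max(X_LO, g1, g2)            # smallest feasible x1, optimal since a5 > 0
        if x1 > X_UP:
            continue
        x2 = X_LO                         # x2 appears only in f, and a6 > 0
        f = sum(a * ui for a, ui in zip(A_OBJ, u)) + A5 * x1 + A6 * x2
        if best is None or f < best[0]:
            best = (f, u, (x1, x2))
    return best
```

For these assumed values, the enumeration selects groups 1 and 3. Enumeration scales as $2^n$ in the number of groups, which is why the BB procedure is used for problems of realistic size.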
Step 6 of Section 3.4
In the BB method we need to construct linear underestimators for nonlinear constraints and find branching functions. For this we must construct the special tree functions (STF) for nonlinear constraints (22) and (23) (since they contain bilinear terms). Using the transformation in (12) we obtain the STF for the constraints as follows

$$\varphi_1(u,x) \equiv a_{11}u_1 + \dots + a_{14}u_4 + 0.25a_{15}(u_1+u_2)^2 - 0.25a_{15}(u_1-u_2)^2 + a_{16}x_1 + c_1 \le 0 \tag{25}$$

$$\varphi_2(u,x) \equiv a_{21}u_1 + \dots + a_{24}u_4 + 0.25a_{25}(u_3+u_4)^2 - 0.25a_{25}(u_3-u_4)^2 + a_{26}x_1 + c_2 \le 0 \tag{26}$$

From Section 3.3.5, the arguments in the nonlinear functions in the STF are the branching functions. Consequently, the four functions $u_1+u_2$, $u_1-u_2$, $u_3+u_4$ and $u_3-u_4$ are the branching functions. The partitioning of $D$ into subregions will be accomplished with the help of the branching functions. Thus, the $i$-th subregion $D_i$ will have the form

$$D_i = \{u : a_1^i \le u_1+u_2 \le b_1^i,\ a_2^i \le u_1-u_2 \le b_2^i,\ a_3^i \le u_3+u_4 \le b_3^i,\ a_4^i \le u_3-u_4 \le b_4^i,\ 0 \le u_j \le 1\} \tag{27}$$

where the bounds $a_j^i$, $b_j^i$ ($j = 1,\dots,4$) are determined by the BB procedure. If there is no initial partition of $D$ then

$$a_1^1 = a_3^1 = 0,\quad b_1^1 = b_3^1 = 2,\quad a_2^1 = a_4^1 = -1,\quad b_2^1 = b_4^1 = 1 \tag{28}$$
In this problem, constraints (22) and (23) are nonlinear and nonconvex since they contain the bilinear terms (i.e. terms that are linear with respect to each variable) $u_1u_2$ and $u_3u_4$, respectively. In order to find the globally optimal solution of problem (24), the BB method solves an approximate (i.e. relaxed) version of problem (24) in which each $u_i$ is allowed to take any value (including fractional values) between 0 and 1.

Step 7 of Section 3.4
Thus, BB solves

$$\begin{aligned} \min_{x,u}\ & f(u,x) \\ \text{s.t.}\ & \varphi_i(u,x) \le 0, \quad i = 0,1,2 \\ & u \in D = \{u : 0 \le u \le 1\} \\ & x^L \le x \le x^U \end{aligned} \tag{29}$$
Thus for the $i$-th subregion, (29) becomes

$$\begin{aligned} f_i = \min_{x,\,u \in D_i}\ & f(u,x) \\ \text{s.t.}\ & \varphi_i(u,x) \le 0, \quad i = 0,1,2 \\ & x^L \le x \le x^U \end{aligned} \tag{30}$$
The BB method obtains a solution indirectly by generating a series of lower and upper bounds. In order to obtain a lower bound of $f_i$ we must solve the following problem (designated as $P_i^L$ in Section 3.3.2)

$$\begin{aligned} \min_{x,\,u \in D_i}\ & f(u,x) \\ \text{s.t.}\ & \varphi_0(u,x) \le 0 \\ & L[\varphi_i(u,x); D_i] \le 0, \quad i = 1,2 \\ & x^L \le x \le x^U \end{aligned} \tag{31}$$
Each $L[\varphi_i(u,x); D_i]$ is of the form

$$L[\varphi_1; D_i] = a_{11}u_1 + \dots + a_{14}u_4 + 0.25a_{15}L[(u_1+u_2)^2; S^i_{1,2}] + 0.25a_{15}L[-(u_1-u_2)^2; \bar{S}^i_{1,2}] + a_{16}x_1 + c_1$$

$$L[\varphi_2; D_i] = a_{21}u_1 + \dots + a_{24}u_4 + 0.25a_{25}L[(u_3+u_4)^2; S^i_{3,4}] + 0.25a_{25}L[-(u_3-u_4)^2; \bar{S}^i_{3,4}] + a_{26}x_1 + c_2$$

where

$$S^i_{1,2} = \{a_1^i \le u_1+u_2 \le b_1^i\}, \quad \bar{S}^i_{1,2} = \{a_2^i \le u_1-u_2 \le b_2^i\},$$

$$S^i_{3,4} = \{a_3^i \le u_3+u_4 \le b_3^i\}, \quad \bar{S}^i_{3,4} = \{a_4^i \le u_3-u_4 \le b_4^i\}$$
Using (15) and (16) we obtain

$$L[(u_1+u_2)^2; S^i_{1,2}] = (a_1^i+b_1^i)\,[u_1+u_2-0.5(a_1^i+b_1^i)] + 0.25(a_1^i+b_1^i)^2$$

$$L[-(u_1-u_2)^2; \bar{S}^i_{1,2}] = -(a_2^i)^2 - (a_2^i+b_2^i)(u_1-u_2-a_2^i)$$

$$L[(u_3+u_4)^2; S^i_{3,4}] = (a_3^i+b_3^i)\,[u_3+u_4-0.5(a_3^i+b_3^i)] + 0.25(a_3^i+b_3^i)^2$$

$$L[-(u_3-u_4)^2; \bar{S}^i_{3,4}] = -(a_4^i)^2 - (a_4^i+b_4^i)(u_3-u_4-a_4^i)$$
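The four expressions above have a simple geometric reading: the underestimator of a convex square $z^2$ is its tangent at the midpoint of $[a, b]$, and the underestimator of a concave term $-z^2$ is the secant through the end points. The sketch below checks this numerically on a grid; the function names and the grid size are illustrative assumptions.

```python
def under_square(z, a, b):
    """Tangent to z**2 at the midpoint of [a, b]; lies below z**2 everywhere."""
    return (a + b) * (z - 0.5 * (a + b)) + 0.25 * (a + b) ** 2


def under_neg_square(z, a, b):
    """Secant of -z**2 through z = a and z = b; lies below -z**2 on [a, b]."""
    return -a ** 2 - (a + b) * (z - a)


def max_violation(a, b, n=200):
    """Largest amount by which either line exceeds its function on a grid."""
    worst = 0.0
    for k in range(n + 1):
        z = a + (b - a) * k / n
        worst = max(worst, under_square(z, a, b) - z ** 2)
        worst = max(worst, under_neg_square(z, a, b) - (-z ** 2))
    return worst
```

Algebraically, $z^2$ minus the tangent equals $(z - 0.5(a+b))^2$ and $-z^2$ minus the secant equals $(z-a)(b-z)$, both non-negative on $[a, b]$, so `max_violation` should return zero up to rounding. Both gaps shrink as the interval narrows, which is what makes the lower bounds tighten as BB partitions the region.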
In order to obtain an upper bound of $f_i$ in the $i$-th subregion we must solve problem $P_i^U$ (see Section 3.3.3), which in this case is of the form

$$\begin{aligned} \min_{x,\,u \in D_i}\ & f(u,x) \\ \text{s.t.}\ & \varphi_0(u,x) \le 0 \\ & U[\varphi_i(u,x); D_i] \le 0, \quad i = 1,2 \\ & x^L \le x \le x^U \end{aligned} \tag{32}$$
Each $U[\varphi_i(u,x); D_i]$, $i = 1,2$, is of the form

$$U[\varphi_1; D_i] = a_{11}u_1 + \dots + a_{14}u_4 + 0.25a_{15}U[(u_1+u_2)^2; S^i_{1,2}] + 0.25a_{15}U[-(u_1-u_2)^2; \bar{S}^i_{1,2}] + a_{16}x_1 + c_1$$

$$U[\varphi_2; D_i] = a_{21}u_1 + \dots + a_{24}u_4 + 0.25a_{25}U[(u_3+u_4)^2; S^i_{3,4}] + 0.25a_{25}U[-(u_3-u_4)^2; \bar{S}^i_{3,4}] + a_{26}x_1 + c_2$$

where

$$U[(u_1+u_2)^2; S^i_{1,2}] = (a_1^i)^2 + (a_1^i+b_1^i)(u_1+u_2-a_1^i)$$

$$U[-(u_1-u_2)^2; \bar{S}^i_{1,2}] = -(a_2^i+b_2^i)\,[u_1-u_2-0.5(a_2^i+b_2^i)] - 0.25(a_2^i+b_2^i)^2$$

$$U[(u_3+u_4)^2; S^i_{3,4}] = (a_3^i)^2 + (a_3^i+b_3^i)(u_3+u_4-a_3^i)$$

$$U[-(u_3-u_4)^2; \bar{S}^i_{3,4}] = -(a_4^i+b_4^i)\,[u_3-u_4-0.5(a_4^i+b_4^i)] - 0.25(a_4^i+b_4^i)^2$$
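Combining the under- and overestimators of the two squares gives linear functions that bracket the bilinear term $u_1u_2$ itself. The sketch below does this on the initial region of (28), where $0 \le u_1+u_2 \le 2$ and $-1 \le u_1-u_2 \le 1$. The function names are illustrative; the direction of the bounds in the full constraint is preserved only when the coefficient $a_{15}$ multiplying the term is positive, which is assumed here.

```python
def L_sq(z, a, b):
    """Underestimator of z**2 on [a, b]: tangent at the midpoint."""
    return (a + b) * (z - 0.5 * (a + b)) + 0.25 * (a + b) ** 2


def L_neg_sq(z, a, b):
    """Underestimator of -z**2 on [a, b]: secant through the end points."""
    return -a ** 2 - (a + b) * (z - a)


def U_sq(z, a, b):
    """Overestimator of z**2 on [a, b]: secant through the end points."""
    return a ** 2 + (a + b) * (z - a)


def U_neg_sq(z, a, b):
    """Overestimator of -z**2 on [a, b]: tangent at the midpoint."""
    return -(a + b) * (z - 0.5 * (a + b)) - 0.25 * (a + b) ** 2


def bilinear_bounds(u1, u2, s=(0.0, 2.0), d=(-1.0, 1.0)):
    """Linear lower and upper bounds on u1 * u2 for the given ranges."""
    lower = 0.25 * L_sq(u1 + u2, *s) + 0.25 * L_neg_sq(u1 - u2, *d)
    upper = 0.25 * U_sq(u1 + u2, *s) + 0.25 * U_neg_sq(u1 - u2, *d)
    return lower, upper
```

On the initial region the bounds simplify to $0.5(u_1+u_2) - 0.5 \le u_1u_2 \le 0.5(u_1+u_2)$. Problem (31) uses the lower set to enlarge the feasible region (giving a lower bound on $f_i$), while problem (32) uses the upper set to shrink it (giving an upper bound).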
The lower and upper bounds (calculated as described above) will be used in the BB method (Section 3.3.1) for solving the CAMD model in (24). For this simple example, consider the first iteration of the BB method (Section 3.3.1) as follows.

Step 1. Set $k = 1$. Give an initial set $L^{(0)}$ of the subregions $D_i$ ($i = 1, \dots, N_0$; often $N_0 = 1$). Let $N_0 = 1$. This means there is only one subregion $D_1$, which coincides with $D$.

Step 2. Calculate the lower bound (LB) for each subregion. Since there is only one subregion, problem (31) is solved for the case when the values of $a_j^1$, $b_j^1$ ($j = 1, \dots, 4$) are given by (28). Let $u^*, x^*$ be the solution of the problem and $f_1^*$ be the optimal value of the objective function. Then $\mu_1 = f_1^*$.

Step 3. Determine an "optimal" region with the smallest LB. Let it be the $l_m$-th region; then

$$\mu_{l_m} = \min_{l \in L^{(k)}} \mu_l$$

Now there is only one subregion; therefore $l_m = 1$.

Step 4. Split the subregion $D_{l_m}$ into two subregions $D_p$ and $D_q$ ($D_{l_m} = D_p \cup D_q$). Suppose that we start branching with the help of the branching function $(u_1 + u_2)$. Then $D_p$ and $D_q$ have the form (27) and differ from $D_{l_m}$ only in their bounds on $u_1 + u_2$.

Step 5. Determine the LB and UB (upper bound) for $D_p$ and $D_q$. We must solve problems (31) and (32) for both subregions.

Step 6. Determine the smallest UB $\eta^{(k)}$ at the $k$-th iteration:

$$\eta^{(k)} = \min(\eta^{(k-1)}, \eta_p, \eta_q)$$

Let $\eta_p > \eta_q$. Then $\eta^{(k)} = \eta_q$.

Step 7. If $\eta^{(k)} - \mu_{l_m} < \varepsilon$ then STOP with the solution.

Step 8. If $\mu_j > \eta^{(k)}$ for $j = p$ or $j = q$ then the corresponding region is removed. Suppose $\mu_j < \eta^{(k)}$, $j = p, q$.

Step 9. Form the new set $L^{(k)}$ of subregions: $L^{(1)} = (D_p, D_q)$.

Step 10. Set $k = k + 1$, and go to Step 3 for the next iteration.

Note that the algorithm stops in Step 7; the values in the vector of variables $u$ are used to determine which structural groups make up the molecule. For example, if $u_1 = 1$, then structural group 1 is present in the molecule; otherwise it is absent. With this simple example we have shown how underestimators are constructed (Step 6 of Section 3.4) and described one iteration of the BB procedure (Step 7 of Section 3.4).
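The loop in Steps 1 to 10 can be sketched on a problem small enough to follow by hand. The example below is not the CAMD model: it minimizes a hypothetical one-dimensional function $f(z) = z^4 - 3z^2$ on $[-2, 2]$, chosen because it has the same "convex part plus concave square" structure. The lower bound replaces $-z^2$ by its secant, as in the underestimators above; the upper bound is simply $f$ evaluated at the relaxed solution, and regions are split at their midpoint. All of these choices are assumptions made for illustration.

```python
import math


def f(z):
    """Hypothetical test function: convex part z**4 plus concave part -3 z**2."""
    return z ** 4 - 3.0 * z ** 2


def bounds_on(a, b):
    """Lower bound, upper bound and trial point for f on [a, b]."""
    # Replace -z**2 by its secant -a**2 - (a + b)(z - a), which lies below it
    # on [a, b]. The relaxed function g is convex, so its minimum over [a, b]
    # is the stationary point of g clipped to the interval.
    s = 0.75 * (a + b)                      # root of g'(z) = 4 z**3 - 3 (a + b)
    z = math.copysign(abs(s) ** (1.0 / 3.0), s)
    z = min(max(z, a), b)
    lower = z ** 4 + 3.0 * (-a ** 2 - (a + b) * (z - a))
    return lower, f(z), z


def branch_and_bound(a, b, eps=1e-6, max_iter=10000):
    """Return (best value, best point) for f on [a, b]."""
    lower, best_val, best_z = bounds_on(a, b)
    regions = [(lower, a, b)]               # Steps 1-2: one initial region
    for _ in range(max_iter):
        lower, a, b = min(regions)          # Step 3: region with smallest LB
        if best_val - lower < eps:          # Step 7: bounds have met
            break
        regions.remove((lower, a, b))
        mid = 0.5 * (a + b)                 # Step 4: split the region
        for lo, hi in ((a, mid), (mid, b)):
            lb, ub, z = bounds_on(lo, hi)   # Step 5: bounds on each part
            if ub < best_val:               # Step 6: smallest UB so far
                best_val, best_z = ub, z
            regions.append((lb, lo, hi))    # Step 9: new set of regions
        # Step 8: drop regions that cannot contain a better solution
        regions = [r for r in regions if r[0] <= best_val]
    return best_val, best_z
```

Calling `branch_and_bound(-2.0, 2.0)` should converge to one of the two global minima of this test function, at $z = \pm\sqrt{1.5}$ with $f = -2.25$. In the CAMD setting the two `bounds_on` values would instead come from solving problems (31) and (32) with an MILP solver.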
3.7 REFERENCES
Adjiman, C. S., Dallwig, S., Floudas, C. A., and Neumaier, A. (1998). A global optimization method, alpha-BB, for general twice-differentiable NLPs - I. Theoretical advances. Computers and Chemical Engineering, 22(9), 1137-1158.
Archer, W. L. (1996). Industrial Solvent Handbook, Marcel Dekker Inc.
Barton, A. F. (1985). CRC Handbook of Solubility Parameters and Other Cohesion Parameters, CRC Press, Inc., Boca Raton, Florida.
Brooke, A. (1996). GAMS - A User's Guide, Scientific Press, San Francisco, CA.
Churi, N., and Achenie, L. E. K. (1996). Novel Mathematical Programming Model for Computer Aided Molecular Design. Industrial and Engineering Chemistry Research, 35(10), 3788-3794.
Constantinou, L., and Gani, R. (1994). New Group Contribution Method for Estimating Properties of Pure Compounds. AIChE Journal, 40, 1697-1710.
Duvedi, A. P., and Achenie, L. E. K. (1996). Designing Environmentally Safe Refrigerants Using Mathematical Programming. Chemical Engineering Science, 51, 3727-3739.
Friedler, F., Fan, L. T., Kalotai, L., and Dallos, A. (1998). A combinatorial approach for generating candidate compounds with desired properties based on group contribution. Computers and Chemical Engineering, 22(6), 809-817.
Hansen, C. M., and Beerbower, A. (1971). Solubility Parameters. Kirk-Othmer Encyclopedia of Chemical Technology, A. Standen, ed., Interscience, New York.
Horst, R., and Tuy, H. (1990). Global Optimization: Deterministic Approaches, Springer-Verlag, Heidelberg.
Lyman, W. J., Reehl, W. F., and Rosenblatt, D. H. (1981). Handbook of Chemical Property Estimation Methods, McGraw-Hill Book Company.
Maranas, C. D. (1997). Optimal Molecular Design under Property Prediction Uncertainty. AIChE Journal, 43(5), 1250-1263.
McCormick, G. P. (1976). Computability of global solutions to factorable nonconvex programs. Part I - convex underestimating problems. Math. Program., 10, 147-175.
Moore, R. E. (1966). Interval Analysis, Prentice-Hall, Englewood Cliffs, New Jersey.
Odele, O., and Macchietto, S. (1993). Computer Aided Molecular Design: A Novel Method for Optimal Solvent Selection. Fluid Phase Equilibria, 82, 47-54.
Ostrovsky, G. M., Ostrovsky, M. G., and Mikhailow, G. W. (1990). Discrete optimization of chemical processes. Computers and Chemical Engineering, 14(1), 111.
Ostrovsky, G., Achenie, L. E. K., and Sinha, M. "A Reduced Dimension Branch-and-Bound Algorithm for Molecular Design" (to appear in Journal of Global Optimization, circa 2000).
Pantelides (1996). Global Optimization of General Process Models. In I. E. Grossmann, ed., Global Optimization in Engineering Design, Kluwer Academic Publishers.
Pistikopoulos, E. N., and Stefanis, S. K. (1998). Optimal solvent design for environmental impact minimization. Computers and Chemical Engineering, 22(6), 717-733.
Quesada, I., and Grossmann, I. E. (1995). A Global Optimization Algorithm for Linear Fractional and Bilinear Programs. Journal of Global Optimization, 6, 39-76.
Ryoo, H. S., and Sahinidis, N. V. (1996). A Branch-and-Reduce Approach to Global Optimization. Journal of Global Optimization, 8, 107-138.
Sherali, H. D., and Alameddine, A. (1992). A new reformulation-linearization technique for bilinear programming problems. Journal of Global Optimization, 2, 379-410.
Sinha, M. (1999). A Systems Engineering Framework for Solvent Design. Ph.D. Thesis, University of Connecticut.
Sinha, M., Achenie, L. E. K., and Ostrovsky, G. M. (1999). Design of Environmentally Benign Solvents via Global Optimization. Comp. Chem. Eng., 23, 1381-1394.
Tamiz, M. (1996). Multi-Objective Programming and Goal Programming Theories and Applications, Springer, York.
Vaidyanathan, R., and El-Halwagi, M. (1994). Computer-Aided Design of High Performance Polymers. J. Elastom. Plast., 26(3), 277.
Venkatasubramanian, V., and Chan, K. (1989). A neural network methodology for process fault diagnosis. AIChE Journal, 35, 1993.
Computer Aided Molecular Design: Theory and Practice
L.E.K. Achenie, R. Gani and V. Venkatasubramanian (Editors)
© 2003 Elsevier Science B.V. All rights reserved.
Chapter 4: Optimization Methods in CAMD - II
A. Apostolakou & C. S. Adjiman
4.1 INTRODUCTION
Computer-aided molecular design (CAMD) is a synthesis activity, which aims to identify a list of candidate molecules that perform a set of tasks most effectively. Its application to a specific problem should always be followed by a verification stage, which relies on experimental data.

The availability of property prediction techniques that describe broad classes of compounds is a central issue in CAMD. The accuracy of the techniques used must be sufficient to enable the final candidate list to be meaningful. Several strategies can be followed to deal with the uncertainty inherent in property prediction: some of the property requirements can be relaxed (e.g., Duvedi and Achenie, 1996) or uncertainty can be accounted for explicitly in the formulation of the CAMD optimization problem (Maranas, 1997; Dua and Pistikopoulos, 1998). In many cases, however, it is necessary to develop more reliable property prediction techniques. The advent of connectivity-based group contribution methods is a particularly promising development in this area.

Group contribution methods, based on the principles of transferability and additivity, are widely used in CAMD. In principle, only the mass of a chemical compound is exactly equal to the sum of the masses of its constituents. However, there are many properties which are approximately additive, provided the building blocks are appropriately chosen. Among the potential building blocks (atoms, bonds or groups), the bond and group additivity methods have received most attention. When the property under study depends on the shape of the molecule and on intermolecular forces, the additivity rule becomes less reliable. For example, by substituting fluorine and/or chlorine into a hydrocarbon molecule, polarity effects lead to a failure of the simple additivity principle. The change caused by fluorine cannot be attributed to the group alone but also depends on the environment in which the substituent is placed.
Consequently, both structural and environmental or proximity effects must be accounted for in property prediction. In view of these issues and of the inability of simple group contribution techniques to distinguish between isomers, there has been a drive to develop property prediction techniques, which use the connectivity of molecules as a basis. For pure component properties such as critical constants, for instance, several approaches have been proposed: Needham et al. (1988) have used
Kier's shape indices (Kier and Hall, 1976), Constantinou et al. (1993) have developed a technique based on the concept of conjugate forms, Constantinou and Gani (1994) and Marrero and Gani (2001) have presented a "high-order" group contribution method, and Marrero-Morejón and Pardillo-Fontdevila (1999) have proposed the use of group-interaction contributions. Other recent connectivity-based methods are discussed in Poling et al. (2000). Early CAMD methodologies were unable to take advantage of the availability of connectivity-based structure-property relationships. The number of atom groups of different types in the candidate molecule has been used as the key decision variable in "generate-and-test" methods (Gani and Brignole, 1983; Brignole et al., 1986; Joback and Stephanopoulos, 1989; Gani et al., 1991), in mixed-integer optimization-based approaches (Macchietto et al., 1990; Odele and Macchietto, 1993; Duvedi and Achenie, 1996; Maranas, 1996; Pistikopoulos and Stefanis, 1998; Buxton et al., 1999), and in stochastic-based optimization methods (Venkatasubramanian et al., 1995; Marcoulaki and Kokossis, 1998, 2000a,b; Ourique and Telles, 1998). With the advent of connectivity-based prediction methods, several researchers have developed new strategies for embedding this information within the CAMD methodology. Constantinou et al. (1996) have proposed a systematic strategy for generating isomers from a set of groups. Harper et al. (1999) have used this capability to integrate additional property prediction techniques based on molecular modeling within their CAMD framework. Churi and Achenie (1996) have developed a mixed integer formulation for mathematical programming approaches on the basis of the graph-theoretic representation of molecules.
The integer decision variables are the number of groups of a given type in the molecule, binary variables denoting whether a bonding site j on a vertex i is bonded to another vertex p, and binary variables denoting whether a group of type k is at vertex i. Raman and Maranas (1998) have also used graph theory to derive a convex MINLP formulation for use with topological indices. In this case, the binary decision variables denote whether two vertices in the molecular graph are connected. Camarda and Maranas (1999) extended the formulation to identify the specific groups in the molecule. A similar formulation was recently used for the design of pharmaceutical products by Siddhaye et al. (2000). In their high-order group contribution approach for pure component properties, Marrero and Gani (2001) proposed (i) to enhance the group contribution approach with a larger set of functional groups that allows a more detailed representation of chemical structures, and (ii) to use large data sets to estimate the contributions of these groups. This method has led to significant improvements in accuracy and applicability of group contribution techniques. As a result, we seek to develop a formulation of the optimization-based molecular design problem, which makes use of this new property prediction technique and accounts for the full connectivity of the molecule. We have chosen to use an optimization-based approach
because it enables an implicit search of the space of solutions, which is extremely large due to the combinatorial nature of the problem (Joback and Stephanopoulos, 1989; Maranas, 1996). Thus, a large number of molecular structures can be eliminated without fully evaluating their performance. This characteristic is especially valuable when the property estimation techniques and/or performance criteria used require expensive computations, or when the simultaneous design of material and process (e.g., Buxton et al., 1999) is addressed. In the next section, we present the problem definition, highlighting the features of the group contribution method used, and we propose the formulation of a general mixed-integer nonlinear program (MINLP) accounting for connectivity. The formulation is applicable to molecules containing arbitrary numbers of rings, aromatic or otherwise. It also allows the distinction between isomers of aromatic compounds. We note that some molecules can be multiply defined in terms of the groups used by Marrero and Gani (2001) and that some rules must be applied to allow the unique identification of suitable molecular descriptions. As a result, in Section 4.3, we develop a systematic strategy for ensuring that molecules are correctly represented according to the Marrero and Gani rules. In Section 4.4, we apply the proposed approach to the design of aromatic compounds and in Chapter 12, to a simple refrigerant design problem.
4.2 PROBLEM DEFINITION
4.2.1 General problem formulation
Most CAMD problems can be stated as "given a desired range for a set of properties and performance criteria, design the compound that performs best, while possessing properties within the acceptable range" (Vaidyanathan and El-Halwagi, 1996). In order to write the general formulation for a CAMD problem, we introduce the following variables:
• $\pi$ is the vector of properties of the compound,
• $y$ is the vector of integer variables that determine the molecular structure,
• $x$ is the vector of relevant process variables, if applicable.
A typical CAMD problem may take the form

$$\begin{aligned} \min_{y,x}\ & F(\pi(y,x)) \\ \text{s.t.}\ & \pi^L \le \pi(y,x) \le \pi^U \\ & g(y) \le 0 \\ & h(y) = 0 \\ & y \in \{0,1\}^q \\ & x \in \mathbb{R}^n \end{aligned} \tag{1}$$

where $\pi^U$ and $\pi^L$ are upper and lower bounds on the property values, $F$ is the performance criterion to be optimized, and $g$ and $h$ are vectors of inequality and equality constraints generally associated with structural feasibility requirements as well as preferences imposed by the designer.
4.2.2 Description of group contribution method
The group contribution method recently proposed by Marrero and Gani (2001) allows the estimation of nine important physical properties of pure organic compounds (normal boiling point, critical temperature, critical pressure, critical volume, standard enthalpy of formation, standard enthalpy of vaporization, standard Gibbs energy, normal melting point, standard enthalpy of fusion). One of the distinguishing features of the method is its accuracy for a varied and large set of compounds. Parameter tables have been developed from regression using a data set of about 2000 compounds with 3 to 60 carbons, including large, polyfunctional and complex heterocyclic compounds. The properties of a compound are calculated from the contributions of three types of groups: first order groups, second order groups and third order groups. The first order groups are intended to describe a wide variety of organic compounds and are larger and more numerous than groups in the commonly used method of Joback (1987). First order groups allow some level of distinction between isomers. The role of the second and third order groups is to provide further structural information about molecular fragments of compounds in order to distinguish between more isomers and to account for proximity effects arising from polyfunctionality. Thus, the estimation is performed at three levels. The overall property estimation model has the following form

$$f(\pi) = \sum_{k \in G_1} n_{1k} C_k + c_2 \sum_{k \in G_2} n_{2k} D_k + c_3 \sum_{k \in G_3} n_{3k} E_k \tag{2}$$

where $C_k$ is the contribution of the first order group of type $k$ that occurs $n_{1k}$ times in the molecule, $D_k$ is the contribution of the second order group of type $k$ that occurs $n_{2k}$ times and $E_k$ is the contribution of the third order group of type $k$ that occurs $n_{3k}$ times. $G_1$, $G_2$ and $G_3$ are the sets of first, second and third order groups respectively; $c_2$ and $c_3$ are weights equal to 1 or 0 which allow second- and third-order corrections to be turned on or off respectively. The left-hand side of (2) is a simple function $f(\pi)$ of the target property $\pi$ as listed in Marrero and Gani (2001). In this work, we develop a formulation of the general CAMD problem (1) which allows the use of this more versatile group contribution method. We focus exclusively on first order predictions as they already allow the representation of a wide variety of chemical classes including simple aromatics and cyclic compounds (see Table 1 for a list of first order groups). The rules which must be applied when deciding which first-order groups make up a given molecule (Marrero and Gani, 2001) are:

Rule 1. The molecule must be described entirely by first-order groups.
Rule 2. There must be no overlap between first-order groups.

If alternative first order representations of a molecule are possible:

Rule 3. In general, the heaviest first order groups are used. Thus, while CH3CH2COO can in principle be represented as (CH3, CH2, COO) or (CH3, CH2COO), the latter description should always be used because CH2COO is heavier than CH2 or COO.
Rule 4. For an aromatic substituent, an aC-R group should be used instead of an aC group.
Rule 5. For amides and ureas, the amide and urea groups should be used.

4.2.3 Molecular representation
To formulate the molecular design problem using the groups in the method of Marrero and Gani (2001), the number of each first order group in the compound must be determined. This requires the definition of a set of basic groups and the specification of the connectivity of the molecule. To develop a mathematical framework for this problem, a graph representation of molecules has been adopted. In particular, a molecule is represented by a graph where basic groups and their bonds correspond to graph vertices and edges, respectively (Horvath, 1992; Mavrovouniotis, 1996). The vertex adjacency matrix or any other matrix used for representing a graph can be used to completely determine the molecular graph. In general, it suffices to describe each basic group by a number and a valency (number of bonds formed). First order groups (FOGs) are prime candidates to be used as basic groups. However, a number of the first order groups proposed by Marrero and Gani (2001) have two different atoms with free bonds: CH2CO can be connected to another group via the CH2 carbon or the CO carbon. In this case, information on the type and number of FOGs occurring in a molecule, and on the connectivity between these groups or vertices, is not always sufficient to unambiguously determine the molecular structure. For instance, the set of groups (CH3, CH3, CH2, CH2CO) with vertex adjacency matrix

          CH3   CH3   CH2   CH2CO
CH3        0     0     1     0
CH3        0     0     0     1
CH2        1     0     0     1
CH2CO      0     1     1     0

describes both diethyl ketone (CH3CH2COCH2CH3) and 2-pentanone (CH3CH2CH2COCH3). This can be attributed to the asymmetry of the CH2CO group. However, 2-pentanone could also be constructed from the groups (CH2, CH2, CH3, CH3CO). According to Rule 3, this second set of groups should be preferred since CH3CO is heavier than CH2CO. Thus the bond CH3 - CH2CO is allowed if CH3 and CH2 are bonded, but forbidden if CH3 and CO are bonded. To use first order groups as basic groups, we must thus be able to provide a unique identification of the connectivity of the molecule. This issue is addressed by assigning to each first order group a valency for each bond type. For instance, group CH2CO has two bond types, a 'CH2' bond and a 'CO' bond, with a valency of 1 for each bond type. Its overall valency is 2. In order to keep the number of parameters and variables to a minimum, the bond types are labeled 'a', 'b', 'c'. Only three bond types are then needed to describe all the first-order groups proposed by Marrero and Gani (2001). The assignment of group type and valency information $v_{k,t}$ for group $k$ and bond type $t$ is listed in Table 1. The vertex adjacency matrix for diethyl ketone is then

            CH3   CH3   CH2   CH2CO a   CH2CO b
CH3          0     0     1       0         0
CH3          0     0     0       1         0
CH2          1     0     0       0         1
CH2CO a      0     1     0       0         0
CH2CO b      0     0     1       0         0

and that for 2-pentanone,

            CH3   CH3   CH2   CH2CO a   CH2CO b
CH3          0     0     1       0         0
CH3          0     0     0       0         1
CH2          1     0     0       1         0
CH2CO a      0     0     1       0         0
CH2CO b      0     1     0       0         0

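The effect of splitting CH2CO into bond types can be checked directly. The sketch below stores the two 5 x 5 matrices above, merges the 'a' and 'b' rows and columns back into a single CH2CO vertex, and confirms that both molecules then collapse to the same 4 x 4 matrix. It also flags the bond excluded by Rule 3. The function names are illustrative, and the check assumes that 'b' denotes the CO-side bond of CH2CO, which is how the text uses the labels when it forbids a bond between CH2CO b and CH3.

```python
LABELS = ["CH3", "CH3", "CH2", "CH2CO a", "CH2CO b"]

DIETHYL_KETONE = [
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
]

PENTANONE_2 = [
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
]


def merge_bond_types(m):
    """Collapse the 'a' and 'b' rows and columns of CH2CO into one vertex."""
    rows = [r[:3] + [r[3] + r[4]] for r in m]          # merge the two columns
    merged_last = [x + y for x, y in zip(rows[3], rows[4])]  # merge the two rows
    return rows[:3] + [merged_last]


def has_forbidden_bond(m):
    """True if a CH3 vertex is attached to the 'b' bond of CH2CO."""
    b = LABELS.index("CH2CO b")
    return any(m[i][b] == 1 for i, lab in enumerate(LABELS) if lab == "CH3")
```

With this representation, the two typed matrices are distinct even though their merged forms coincide, and only the 2-pentanone matrix contains the forbidden bond.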
The matrix for 2-pentanone does not belong to the set of allowable vertex adjacency matrices. The different bond types are also useful when dealing with aromatic and cyclic compounds. By convention, aromatic bonds in aromatic groups are systematically assigned to type 'a'. Types 'b' and 'c' are then used for bonds which connect the aromatic group to non-aromatic groups. For instance, group 21, aC-CH2, has an aromatic (aC) bond of valency 2 assigned to type 'a' and a non-aromatic (CH2) bond of valency 1 assigned to type 'b'. For all aromatics except group 16, the valency of the aromatic 'a' bond is 2 since each aromatic carbon in a ring must be bonded to two other aromatic carbons. For group 16, which is used for fused aromatics, $v_{16,a} = 3$. Similarly, cyclic bonds in cyclic compounds are systematically assigned to types 'a' and 'b'. The bond type 'b' is needed since some cyclic groups such as CH=C are asymmetric and require two cyclic bond types. Bonds on cyclic groups that can be made with non-cyclic groups are assigned to type 'c'. All cyclic groups have exactly two cyclic bonds, so that $v_{k,a} + v_{k,b} = 2$ for any cyclic group $k$. The first order groups, their bond types and valencies are listed in Table 1.
Table 1: First order groups and their bonds

                  Bond type a (1)      Bond type b (2)   Bond type c
No.  Group        Group         v_k,a  Group      v_k,b  Group    v_k,c  Class (3)
1    CH3          CH3           1      -          0      -        0      S
2    CH2          CH2           2      -          0      -        0      S
3    CH           CH            3      -          0      -        0      S
4    C            C             4      -          0      -        0      S
5    CH2=CH       CH2=CH        1      -          0      -        0      S
6    CH=CH        CH=CH         2      -          0      -        0      S
7    CH2=C        CH2=C         2      -          0      -        0      S
8    CH=C         CH=           1      C=         2      -        0      S
9    C=C          C=C           4      -          0      -        0      S
10   CH2=C=CH     CH2=C=CH      1      -          0      -        0      S
11   CH2=C=C      CH2=C=C       2      -          0      -        0      S
12   CH=C=CH      CH=C=CH       2      -          0      -        0      S
13   CH≡C         CH≡C          1      -          0      -        0      S
14   C≡C          C≡C           2      -          0      -        0      S
15   aCH          aCH           2      -          0      -        0      A
16   aC (4)       aC            3      -          0      -        0      A
17   aC (5)       aC            2      aC         1      -        0      A
18   aC (6)       aC            2      aC         1      -        0      A
19   aN           aN            2      -          0      -        0      A
20   aC-CH3       aC-CH3        2      -          0      -        0      A
21   aC-CH2       aC            2      CH2        1      -        0      A
22   aC-CH        aC            2      CH         2      -        0      A
23   aC-C         aC            2      C          3      -        0      A
24   aC-CH=CH2    aC-CH=CH2     2      -          0      -        0      A
25   aC-CH=CH     aC            2      CH=CH      1      -        0      A
26   aC-C=CH2     aC            2      C=CH2      1      -        0      A
27   aC-C≡CH      aC-C≡CH       2      -          0      -        0      A
28   aC-C≡C       aC            2      C≡C        1      -        0      A
29   OH           OH            1      -          0      -        0      S
30   aC-OH        aC-OH         2      -          0      -        0      A

(1) For aromatics, the free bonds must be linked to other aromatic C or N. For cyclics, they must be linked to other cyclic atoms.
(2) For cyclics, the free bonds must be linked to other cyclic atoms.
(3) See Section 4.3. A: aromatic group; UA: urea or amide group; UAS: urea or amide subgroup; S: standard group.
(4) Fused with aromatic ring.
(5) Fused with non-aromatic subring. This group belongs to the set G1a.
(6) Except as groups 16 and 17.

Table 1 (continued)

               Bond type a  Bond type b
No.  Group     Group        Group      Class
31   COOH      COOH         -          S
32   aC-COOH   aC-COOH      -          A
33   CH3CO     CH3CO        -          UAS
34   CH2CO     CO           CH2        UAS
35   CHCO      CO           CH         UAS
36   CCO       CO           C          UAS
37   aC-CO     aC           CO         A
38   CHO       CHO          -          S
39   aC-CHO    aC-CHO       -          A
40   CH3COO    CH3COO       -          S
41   CH2COO    COO          CH2        S
42   CHCOO     COO          CH         S
43   CCOO      COO          C          S
44   HCOO      HCOO         -          S
45   aC-COO    aC           COO        A
46   aC-OOCH   aC-OOCH      -          A
47   aC-OOC    aC           OOC        A
48   COO       CO           O          S
49   CH3O      CH3O         -          S
50   CH2O      O            CH2        S
51   CH-O      O            CH         S
52   C-O       O            C          S
53   aC-O      aC           O          A
54   CH2NH2    CH2NH2       -          S
55   CHNH2     CHNH2        -          S
56   CNH2      CNH2         -          S
57   CH3NH     CH3NH        -          UAS
58   CH2NH     NH           CH2        UAS
59   CHNH      NH           CH         UAS
60   CH3N      CH3N         -          UAS
61   CH2N      N            CH2        UAS
62   aC-NH2    aC-NH2       -          A
63   aC-NH     aC           NH         A
64   aC-N      aC           N          A
65   NH2       NH2          -          UAS
66   CH=N      CH=          N=         S
67   C=N       C=           N=         S
68   CH2CN     CH2CN        -          S
69   CHCN      CHCN         -          S
70   CCN       CCN          -          S
71   aC-CN     aC-CN        -          A
72   CN        CN           -          S
73   CH2NCO    CH2NCO       -          S
74   CHNCO     CHNCO        -          S
75   CNCO      CNCO         -          S
76   aC-NCO    aC-NCO       -          A

Table 1 (continued)

                   Bond type a          Bond type b  Bond type c
No.  Group         Group         v_k,a  Group        Group    v_k,c  Class
77   CH2NO2        CH2NO2        1      -            -        0      S
78   CHNO2         CHNO2         2      -            -        0      S
79   CNO2          CNO2          3      -            -        0      S
80   aC-NO2        aC-NO2        2      -            -        0      A
81   NO2           NO2           1      -            -        0      S
82   ONO           ONO           1      -            -        0      S
83   ONO2          ONO2          1      -            -        0      S
84   HCON(CH2)2    HCON(CH2)2    2      -            -        0      UA
85   HCONHCH2      HCONHCH2      1      -            -        0      UA
86   CONH2         CONH2         1      -            -        0      UA
87   CONHCH3       CONHCH3       1      -            -        0      UA
88   CONHCH2       CONH          1      CH2          -        0      UA
89   CON(CH3)2     CON(CH3)2     1      -            -        0      UA
90   CONCH3CH2     CONCH3        1      CH2          -        0      UA
91   CON(CH2)2     CON           1      CH2          -        0      UA
92   CONHCO        CONHCO        2      -            -        0      UA
93   CONCO         CO            2      N            -        0      UA
94   aC-CONH2      aC-CONH2      2      -            -        0      A
95   aC-NH(CO)H    aC-NH(CO)H    2      -            -        0      A
96   aC-N(CO)H     aC            2      N(CO)H       -        0      A
97   aC-CONH       aC            2      CONH         -        0      A
98   aC-NHCO       aC            2      NHCO         -        0      A
99   aC-NCO        aC            2      N            CO       1      A
100  NHCONH        NHCONH        2      -            -        0      UA
101  NH2CONH       NH2CONH       1      -            -        0      UA
102  NH2CON        NH2CON        2      -            -        0      UA
103  NHCON         NHCO          1      N            -        0      UA
104  NCON          NCON          4      -            -        0      UA
105  aC-NHCONH2    aC-NHCONH2    2      -            -        0      A
106  aC-NHCONH     aC            2      NHCONH       -        0      A
107  NHCO          NH            1      CO           -        0      UA
108  CH2Cl         CH2Cl         1      -            -        0      S
109  CHCl          CHCl          2      -            -        0      S
110  CCl           CCl           3      -            -        0      S
111  CHCl2         CHCl2         1      -            -        0      S
112  CCl2          CCl2          2      -            -        0      S
113  CCl3          CCl3          1      -            -        0      S
114  CH2F          CH2F          1      -            -        0      S
115  CHF           CHF           2      -            -        0      S
116  CF            CF            3      -            -        0      S
117  CHF2          CHF2          1      -            -        0      S
118  CF2           CF2           2      -            -        0      S
119  CF3           CF3           1      -            -        0      S
120  CCl2F         CCl2F         1      -            -        0      S
121  CHClF         CHClF         1      -            -        0      S

Table 1 (continued)

                        Bond type a        Bond type b  Bond type c
No.  Group              Group       v_k,a  Group        Group    Class
122  CClF2              CClF2       1      -            -        S
123  aC-Cl              aC-Cl       2      -            -        A
124  aC-F               aC-F        2      -            -        A
125  aC-I               aC-I        2      -            -        A
126  aC-Br              aC-Br       2      -            -        A
127  I                  I           1      -            -        S
128  Br                 Br          1      -            -        S
129  F                  F           1      -            -        S
130  Cl                 Cl          1      -            -        S
131  CHNOH              CH          2      NOH          -        S
132  CNOH               C           3      NOH          -        S
133  aC-CHNOH           aC          2      CH           NOH      A
134  OCH2CH2OH          OCH2CH2OH   1      -            -        S
135  OCHCH2OH           O           1      CHCH2OH      -        S
136  OCH2CHOH           OCH2        1      CHOH         -        S
137  -O-OH              OOH         1      -            -        S
138  CH2SH              CH2SH       1      -            -        S
139  CHSH               CHSH        2      -            -        S
140  CSH                CSH         3      -            -        S
141  aC-SH              aC-SH       2      -            -        A
142  -SH                SH          1      -            -        S
143  CH3S               CH3S        1      -            -        S
144  CH2S               S           1      CH2          -        S
145  CHS                S           1      CH           -        S
146  CS                 S           1      C            -        S
147  aC-S-              aC          2      S            -        A
148  SO                 SO          2      -            -        S
149  SO2                SO2         2      -            -        S
150  SO3 (sulfite)      SO3         2      -            -        S
151  SO3 (sulfonate)    SO          1      O            -        S
152  SO4                SO4         2      -            -        S
153  aC-SO              aC          2      SO           -        A
154  aC-SO2             aC          2      SO2          -        A
155  PH                 PH          2      -            -        S
156  P                  P           3      -            -        S
157  PO3 (phosphite)    PO3         3      -            -        S
158  PHO3               PHO3        2      -            -        S
159  PO3 (phosphonate)  PO          1      O            -        S
160  PHO4               PHO4        3      -            -        S
161  PO4                PO4         3      -            -        S
162  aC-PO4             aC          2      PO4          -        A
163  aC-P               aC          2      P            -        A
164  CO3                CO3         2      -            -        S
165  C2H3O              C2H3O       1      -            -        S

Table 1 (continued)

                    Bond type a          Bond type b       Bond type c
No.  Group          Group         v_k,a  Group      v_k,b  Group    v_k,c  Class
166  C2H2O          C2H2O         2      -          0      -        0      S
167  C2O            C2O           4      -          0      -        0      S
168  CH2 (cyc)      CH2(cyc)      2      -          0      -        0      S
169  CH (cyc)       CH(cyc)       2      -          0      CH       1      S
170  C (cyc)        C(cyc)        2      -          0      C        2      S
171  CH=CH (cyc)    CH=CH(cyc)    2      -          0      -        0      S
172  CH=C (cyc)     CH=(cyc)      1      C=(cyc)    1      C=       1      S
173  C=C (cyc)      C=(cyc)       2      -          0      C=       2      S
174  CH2=C (cyc)    CH2=C(cyc)    2      -          0      -        0      S
175  NH (cyc)       NH(cyc)       2      -          0      -        0      S
176  N (cyc)        N(cyc)        2      -          0      N        1      S
177  CH=N (cyc)     CH=(cyc)      1      N=(cyc)    1      -        0      S
178  C=N (cyc)      C=(cyc)       1      N=(cyc)    1      C        1      S
179  O (cyc)        O(cyc)        2      -          0      -        0      S
180  CO (cyc)       CO(cyc)       2      -          0      -        0      S
181  S (cyc)        S(cyc)        2      -          0      -        0      S
182  SO2 (cyc)                                                             S

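The valency data of Table 1 map naturally onto a small lookup structure keyed by group name. The sketch below transcribes a handful of entries as $(v_{k,a}, v_{k,b}, v_{k,c})$ triples and checks two statements from Section 4.2.3: the overall valency of a group is the sum over its bond types, and every cyclic group satisfies $v_{k,a} + v_{k,b} = 2$. The dictionary layout and function names are illustrative assumptions; the CH2CO entry is taken from the text of Section 4.2.3 (valency 1 for each of its two bond types) rather than from the table.

```python
# (v_a, v_b, v_c) for a few first order groups.
VALENCY = {
    "CH3": (1, 0, 0),          # group 1
    "CH2": (2, 0, 0),          # group 2
    "aCH": (2, 0, 0),          # group 15
    "aC-CH2": (2, 1, 0),       # group 21: aromatic 'a' bonds, CH2 'b' bond
    "CH2CO": (1, 1, 0),        # from the text: one bond of each type
    "CH (cyc)": (2, 0, 1),     # group 169
    "C (cyc)": (2, 0, 2),      # group 170
    "CH=C (cyc)": (1, 1, 1),   # group 172: asymmetric, needs 'a' and 'b'
}

CYCLIC = {"CH (cyc)", "C (cyc)", "CH=C (cyc)"}


def overall_valency(group):
    """Total number of bonds the group forms, summed over bond types."""
    return sum(VALENCY[group])


def cyclic_bonds_ok(group):
    """Check that a cyclic group has exactly two cyclic ('a' + 'b') bonds."""
    v_a, v_b, _ = VALENCY[group]
    return v_a + v_b == 2
```

For example, `overall_valency("aC-CH2")` gives 3: two aromatic bonds and one bond to a non-aromatic neighbour.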
A graph representing a single molecule has the following properties:
1. It is a connected single graph, that is, every pair of vertices is joined by a path.
2. It has no loops, that is, no edges joining a vertex to itself.
3. It is a labeled graph.

[Figure 1: Graphical representation of 2-propanol, with vertices (1,CH3,a), (2,CH,a), (3,CH3,a) and (4,OH,a).]

Throughout this work, a vertex in a molecular graph is distinguished from the other vertices in the graph by a unique number, the enumeration number, and is also labeled after the basic group which occurs at this particular vertex. For example, in Fig. 1, the graph representation of 2-propanol is shown as a tree with 4 vertices (1,CH3,a), (2,CH,a), (3,CH3,a) and (4,OH,a), and 3 edges ((1,CH3,a), (2,CH,a)), ((2,CH,a), (3,CH3,a)) and ((2,CH,a), (4,OH,a)). The set of edges completely determines the molecular graph of the compound. Since the elements of the vertex adjacency matrix of the molecular graph define the set of its edges, the matrix gives complete information on the molecule's make-up and connectivity.
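The 2-propanol example can be written down directly as a vertex dictionary and an edge list, from which the vertex adjacency matrix and the graph properties listed above follow. This is a minimal sketch; the data layout and function names are illustrative and are not part of the formulation.

```python
from collections import deque

# Enumeration number -> basic group, and the three edges of 2-propanol.
VERTICES = {1: "CH3", 2: "CH", 3: "CH3", 4: "OH"}
EDGES = [(1, 2), (2, 3), (2, 4)]


def adjacency_matrix(vertices, edges):
    """Symmetric 0/1 matrix with rows ordered by enumeration number."""
    order = sorted(vertices)
    idx = {v: n for n, v in enumerate(order)}
    m = [[0] * len(order) for _ in order]
    for i, j in edges:
        m[idx[i]][idx[j]] = 1
        m[idx[j]][idx[i]] = 1
    return m


def is_connected(vertices, edges):
    """Breadth-first search: every vertex must be reachable from the first."""
    neighbours = {v: set() for v in vertices}
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    start = next(iter(vertices))
    seen, queue = {start}, deque([start])
    while queue:
        for w in neighbours[queue.popleft()]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == len(vertices)


def has_loops(edges):
    """A loop is an edge joining a vertex to itself."""
    return any(i == j for i, j in edges)
```

The row sums of the matrix give the number of bonds at each vertex (1, 3, 1, 1), which match the valencies of CH3, CH, CH3 and OH in Table 1.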
We now introduce variables for the structural description of a molecule, and then derive structural feasibility constraints in terms of these variables to ensure that the molecules are physically meaningful and that they are described according to the rules of the high-order group contribution method of Marrero and Gani (2001).

4.2.4 Definition of Structural Variables

Variables representing the existence of a bond between each pair of basic groups are used as key structural variables. In terms of the graphical representation, these variables are the edges of the graph. We first define the following sets and parameters:

$G_1 = \{\, k \mid k \text{ is a basic first order group} \,\}$
$G_{1n} = \{\, k \mid k \text{ is a first order group with non-cyclic, non-aromatic bonds only} \,\}$
$G_{1c} = \{\, k \mid k \text{ is a first order group with cyclic bonds} \,\}$
$G_{1a} = \{\, k \mid k \text{ is a first order group with aromatic bonds} \,\}$
$T = \{\, t \mid \text{bond type} \,\} = \{a, b, c\}$
$V = \{\, i \mid \text{enumeration number of a vertex in the molecular graph} \,\}$
$N_{1,max}$ = maximum number of basic groups in the final compound.
Several binary variable representations can be adopted. Churi and Achenie (1996), Raman and Maranas (1998) and Camarda and Maranas (1999) used a binary variable $z_{i,t,j}$ denoting whether vertex $i$ is linked via a type-$t$ bond to vertex $j$:

$$z_{i,t,j} = \begin{cases} 1, & \text{if vertex } i \text{ is linked via bond type } t \text{ to vertex } j \\ 0, & \text{otherwise} \end{cases}$$

In addition, they defined a variable $u_{i,k}$ to describe the existence of group $k$ at vertex $i$:

$$u_{i,k} = \begin{cases} 1, & \text{if group } k \text{ exists at vertex } i \\ 0, & \text{otherwise} \end{cases}$$

where $k \in G_1$ and $i \in V$. While this notation is very compact in terms of the number of variables, it leads to a large number of constraints. For instance, to prevent a bond between CH2CO^b and CH3^a, as discussed in Section 4.2.3, $|V|^2$ constraints must be imposed, where $|V|$ is the number of vertices in the set $V$. The constraints are of the form

$$u_{i,\mathrm{CH_2CO}} + u_{j,\mathrm{CH_3}} + z_{i,b,j} \le 2, \quad \forall i, j \in V. \tag{3}$$
In this work, we develop an alternative formulation, which involves a larger number of variables, but fewer constraints. In many cases, such formulations can eliminate much of the symmetry of the problem and result in problems that can be solved more easily (Barnhart et al., 1998). We use the variable $u_{i,k}$ defined previously, and we model the existence of an edge between bond type $t$ of group $k$ at vertex $i$ and bond type $tt$ of group $kk$ at vertex $j$ via the binary variable $y_{(i,k,t),(j,kk,tt)}$:

$$y_{(i,k,t),(j,kk,tt)} = \begin{cases} 1, & \text{if bond type } t \text{ of group } k \text{ at vertex } i \text{ is connected to bond type } tt \text{ of group } kk \text{ at vertex } j \\ 0, & \text{otherwise} \end{cases}$$

where $k, kk \in G_1$, $t, tt \in T$ and $i, j \in V$. Variable $y$ defines the vertex adjacency matrix for the molecule. In this case, a bond between CH2CO^b and CH3^a is prevented by imposing a single constraint

$$\sum_{i \in V} \sum_{j \in V} y_{(i,\mathrm{CH_2CO},b),(j,\mathrm{CH_3},a)} = 0. \tag{4}$$
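The trade-off between the two representations can be made concrete by counting variables and constraints. The sketch below is only an illustration: the function names are hypothetical, and the sizes used (20 vertices, 15 groups, 2 bond types) are borrowed from the application example of Section 4.4.

```python
def z_formulation_size(n_vertices, n_groups, n_bond_types):
    """Binary variables (z and u) and constraints of type (3) per forbidden bond."""
    n_z = n_vertices * n_bond_types * n_vertices
    n_u = n_vertices * n_groups
    constraints_per_forbidden_bond = n_vertices ** 2  # one constraint per (i, j)
    return n_z + n_u, constraints_per_forbidden_bond

def y_formulation_size(n_vertices, n_groups, n_bond_types):
    """Binary variables (y and u) and constraints of type (4) per forbidden bond."""
    n_y = (n_vertices * n_groups * n_bond_types) ** 2
    n_u = n_vertices * n_groups
    return n_y + n_u, 1  # a single summed constraint forbids the bond

print(z_formulation_size(20, 15, 2))
print(y_formulation_size(20, 15, 2))
```

The y-based formulation carries far more binary variables, but each forbidden bond costs one constraint instead of $|V|^2$.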
The number of occurrences of group $k$ in the molecule is modeled through the introduction of the integer variable $n_{1k}$. To simplify the notation, we will use the following symbols:

$$\sum_{k \in G_1} \equiv \sum_{k}, \quad \sum_{kk \in G_1} \equiv \sum_{kk}, \quad \sum_{i \in V} \equiv \sum_{i}, \quad \sum_{j \in V} \equiv \sum_{j}, \quad \sum_{t \in T} \equiv \sum_{t}, \quad \sum_{tt \in T} \equiv \sum_{tt}$$

Furthermore, multiple summations will be expressed as

$$\sum_{i} \sum_{j} \equiv \sum_{i,j}$$
The number of edges $q$ in the molecular graph is expressed as

$$q = \frac{1}{2} \sum_{i,j,k,kk,t,tt} y_{(i,k,t),(j,kk,tt)}. \tag{5}$$

The factor of 1/2 takes into account the fact that each edge is counted twice. The number of vertices $p_v$ is given by

$$p_v = \sum_{k \in G_1} \sum_{i} u_{i,k} = \sum_{k \in G_1} n_{1k}. \tag{6}$$
The sum of the valencies of all the vertices in the graph is given by

$$\sum_{k \in G_1} \sum_{t,i} \nu_{k,t} \, u_{i,k}. \tag{7}$$

In order to calculate the number of occurrences $n_{1k}$ of group $k$ in the structure, we set

$$n_{1k} = \sum_{i} u_{i,k}, \quad \forall k \in G_1. \tag{8}$$
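Expressions (5) to (8) can be evaluated on the 2-propanol graph of Fig. 1. This is a minimal sketch under stated assumptions: the non-zero entries of $u$ and $y$ are stored sparsely (a dictionary and a set), and the valency table `VALENCY` is written out by hand for the three groups involved.

```python
from collections import Counter

# Valency of each (group, bond type) pair used in 2-propanol.
VALENCY = {("CH3", "a"): 1, ("CH", "a"): 3, ("OH", "a"): 1}

# u[i] = k stands for u_{i,k} = 1.
u = {1: "CH3", 2: "CH", 3: "CH3", 4: "OH"}

# Each bond contributes two symmetric non-zero y entries.
bonds = [(1, 2), (2, 3), (2, 4)]
y = set()
for i, j in bonds:
    y.add(((i, u[i], "a"), (j, u[j], "a")))
    y.add(((j, u[j], "a"), (i, u[i], "a")))

q = len(y) // 2                      # Eq. (5): each edge is counted twice
p_v = len(u)                         # Eq. (6): number of occupied vertices
valency_sum = sum(v for (k, t), v in VALENCY.items()
                  for i in u if u[i] == k)   # Eq. (7)
n1 = Counter(u.values())             # Eq. (8): occurrences of each group

print(q, p_v, valency_sum, dict(n1))
```

The valency sum equals twice the number of edges, which is the condition that constraint (9) enforces bond type by bond type.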
4.2.5 Structural Feasibility Constraints

The main feasibility constraints stem from the fact that the representation of a molecule must correspond to a connected, single graph with no loops. The first constraint ensures that each bond type on each group at each vertex is connected to the appropriate number of other groups

$$\nu_{k,t} \, u_{i,k} = \sum_{j,kk,tt} y_{(i,k,t),(j,kk,tt)}, \quad \forall k \in G_1, \ \forall t \in T, \ \forall i \in V. \tag{9}$$

The following constraint ensures that the graph obtained is connected (Churi and Achenie, 1996). It implies that group $k$ can occur at vertex $i$ only if this vertex is connected to at least one of the vertices 1 to $i-1$.

$$\sum_{kk,t,tt} \sum_{j=1}^{i-1} y_{(i,k,t),(j,kk,tt)} \ge u_{i,k}, \quad \forall k \in G_1, \ \forall i \in V \setminus \{1\}. \tag{10}$$
To prevent the existence of loops, through a vertex connected to itself or two vertices connected together through two single bonds, we specify

$$\sum_{k,kk,t,tt} y_{(i,k,t),(i,kk,tt)} = 0, \quad \forall i \in V. \tag{11}$$

$$\sum_{k,kk,t,tt} y_{(i,k,t),(j,kk,tt)} \le 1, \quad \forall i, j \in V. \tag{12}$$

The symmetry of the vertex adjacency matrix is enforced by

$$y_{(i,k,t),(j,kk,tt)} = y_{(j,kk,tt),(i,k,t)}, \quad \forall k, kk \in G_1, \ \forall t, tt \in T, \ \forall i, j \in V. \tag{13}$$
The requirement that at most one group can occur at each vertex is expressed as

$$\sum_{k \in G_1} u_{i,k} \le 1, \quad \forall i \in V. \tag{14}$$
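To make constraints (9) to (13) concrete, the sketch below tests them on a candidate structure stored sparsely. It is an illustration rather than the authors' implementation: the names `build_y` and `violated_constraints` are hypothetical, ethanol is used as an assumed test molecule, and constraint (14) holds by construction because the dictionary `u` stores one group per vertex.

```python
def build_y(u, bonds):
    """Symmetric set of non-zero y entries for bonds given as (i, t, j, tt)."""
    y = set()
    for i, t, j, tt in bonds:
        y.add(((i, u[i], t), (j, u[j], tt)))
        y.add(((j, u[j], tt), (i, u[i], t)))
    return y

def violated_constraints(u, y, valency, bond_types=("a", "b", "c")):
    """Return the numbers of the constraints among (9)-(13) that are violated."""
    bad = set()
    for i, k in u.items():
        for t in bond_types:
            n_bonds = sum(1 for head, _ in y if head == (i, k, t))
            if n_bonds != valency.get((k, t), 0):
                bad.add(9)   # bond type t of group k is over- or under-used
        if i > 1 and not any(h[0] == i and tl[0] < i for h, tl in y):
            bad.add(10)      # vertex i is not linked to any lower-numbered vertex
    pair_counts = {}
    for head, tail in y:
        if head[0] == tail[0]:
            bad.add(11)      # a vertex bonded to itself
        if (tail, head) not in y:
            bad.add(13)      # adjacency matrix is not symmetric
        key = (head[0], tail[0])
        pair_counts[key] = pair_counts.get(key, 0) + 1
    if any(count > 1 for count in pair_counts.values()):
        bad.add(12)          # two vertices joined by more than one bond
    return bad

VALENCY = {("CH3", "a"): 1, ("CH2", "a"): 2, ("OH", "a"): 1}

ethanol_u = {1: "CH3", 2: "CH2", 3: "OH"}
ethanol_y = build_y(ethanol_u, [(1, "a", 2, "a"), (2, "a", 3, "a")])

# Same molecule, but vertex 2 (OH) is not bonded to vertex 1.
renumbered_u = {1: "CH3", 2: "OH", 3: "CH2"}
renumbered_y = build_y(renumbered_u, [(1, "a", 3, "a"), (2, "a", 3, "a")])

print(violated_constraints(ethanol_u, ethanol_y, VALENCY))
print(violated_constraints(renumbered_u, renumbered_y, VALENCY))
```

The second structure is chemically the same molecule, yet it is rejected by (10): the constraint restricts the vertex numbering as well as enforcing connectivity.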
Additional constraints are used to describe the allowable connections between groups, in particular for cyclic and aromatic groups. The number of rings in the molecule, $R_{tot}$, is determined by

$$\frac{1}{2} \sum_{k,kk,t,tt,i,j} y_{(i,k,t),(j,kk,tt)} - \sum_{k \in G_1} \sum_{i} u_{i,k} = -1 + R_{tot}. \tag{15}$$
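Constraint (15) states that the number of rings equals the number of edges minus the number of vertices plus one. A short sketch with hand-counted group-level graphs (the function name `ring_count` is an assumption) illustrates this:

```python
def ring_count(n_edges, n_vertices):
    """R_tot from Eq. (15): q - p_v = -1 + R_tot for a connected graph."""
    return n_edges - n_vertices + 1

# (edges, vertices) counted on group-level molecular graphs.
examples = {
    "2-propanol": (3, 4),      # CH3, CH, CH3, OH
    "benzene": (6, 6),         # six aCH groups in one ring
    "naphthalene": (11, 10),   # eight aCH and two fused aC groups
    "biphenyl": (13, 12),      # ten aCH and two aC groups
}
rings = {name: ring_count(q, p_v) for name, (q, p_v) in examples.items()}
print(rings)
```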
We note that the set of non-cyclic groups is $G_{1n} \cup G_{1a} \setminus \{17\}$. Group 17 is unique in that it is both aromatic and cyclic and is included in the set of aromatic groups only. It has only one cyclic bond, which is assigned to bond type 'b'. Cyclic bonds in cyclic groups must only be connected to cyclic bonds in other cyclic groups. This is expressed by preventing a cyclic group $k$ at vertex $i$ from being bonded to a non-cyclic group $kk$ at vertex $j$ through a bond type 'a' or 'b'

$$\sum_{k \in G_{1c}} \ \sum_{kk \in G_{1n} \cup G_{1a} \setminus \{17\}} \ \sum_{i,j,t} \left( y_{(i,k,a),(j,kk,t)} + y_{(i,k,b),(j,kk,t)} \right) = 0. \tag{16}$$
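and by preventing bond type 'a' or 'b' in a cyclic group from being bonded to bond type 'c' in another cyclic group and to bond type 'a' in group 17

$$\sum_{k \in G_{1c}} \sum_{kk \in G_{1c}} \sum_{i,j} \left( y_{(i,k,a),(j,kk,c)} + y_{(i,k,b),(j,kk,c)} \right) + \sum_{k \in G_{1c}} \sum_{i,j} \left( y_{(i,k,a),(j,17,a)} + y_{(i,k,b),(j,17,a)} \right) = 0. \tag{17}$$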
Group 17, which possesses a cyclic bond ('b') but is not included in the set of cyclic groups, must also be constrained in a similar way

$$\sum_{k \in G_{1n} \cup G_{1a}} \sum_{i,j,t} y_{(i,17,b),(j,k,t)} = 0. \tag{18}$$

$$\sum_{k \in G_{1c}} \sum_{i,j} y_{(i,17,b),(j,k,c)} = 0. \tag{19}$$
Similarly, aromatic bonds (type 'a') in aromatic groups must not be linked to non-aromatic groups

$$\sum_{k \in G_{1a}} \ \sum_{kk \in G_{1n} \cup G_{1c}} \ \sum_{i,j,t} y_{(i,k,a),(j,kk,t)} = 0. \tag{20}$$

Further, aromatic bonds in aromatic groups must not be linked to non-aromatic bonds 'b' or 'c' in other aromatic groups

$$\sum_{k \in G_{1a}} \sum_{kk \in G_{1a}} \sum_{i,j} \left( y_{(i,k,a),(j,kk,b)} + y_{(i,k,a),(j,kk,c)} \right) = 0. \tag{21}$$
Additional constraints must be imposed on aromatic rings to ensure that each such ring contains exactly six aromatic groups. For this purpose, we define the variable $R_{a,tot}$ as the number of aromatic rings. We label the rings by defining two types of binary variables

$$R_{aw} = \begin{cases} 1, & \text{if aromatic ring } w \text{ exists} \\ 0, & \text{otherwise} \end{cases}$$

and

$$r_{i,w} = \begin{cases} 1, & \text{if vertex } i \text{ is in ring } w \\ 0, & \text{otherwise} \end{cases}$$

Possible configurations of aromatic rings are shown in Figure 2.

Figure 2: Possible configurations of aromatic rings, panels (a) to (e).
In order to assign values to the $r_{i,w}$ variables, we note that if there is no aromatic group at vertex $i$, the vertex cannot be in any ring. If there is an aromatic group other than a fused aromatic (group 16), the vertex must be in exactly one ring (Fig. 2a). If there is a fused aromatic, the vertex must be in two to three rings (Fig. 2b-2e). This is ensured through the following constraints

$$\sum_{k \in G_{1a} \setminus \{16\}} u_{i,k} + 2 u_{i,16} \ \le \ \sum_{w} r_{i,w} \ \le \sum_{k \in G_{1a} \setminus \{16\}} u_{i,k} + 3 u_{i,16}, \quad \forall i \in V, \tag{22}$$

where $w \in W$, the set of aromatic rings. To ensure that each ring has exactly six aromatic groups, we set

$$\sum_{i} r_{i,w} = 6 R_{aw}, \quad \forall w \in W. \tag{23}$$
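Constraints (22) and (23) can be checked on naphthalene, where the two fused carbons (group 16) belong to both rings. The sketch below is illustrative only: the function name and data layout are assumptions, and the set of aromatic group numbers is taken from the application example of Section 4.4.

```python
# Aromatic first order groups used in the application example (Section 4.4).
AROMATIC = {15, 16, 18, 20, 21, 22, 23, 30, 32}
FUSED = 16

def ring_constraints_hold(groups, rings, ring_exists):
    """Check (22) for every vertex and (23) for every ring label.

    groups:      {vertex: group number}
    rings:       {ring label w: set of vertices with r_{i,w} = 1}
    ring_exists: {ring label w: R_aw (0 or 1)}
    """
    for i, k in groups.items():
        membership = sum(1 for w in rings if i in rings[w])
        fused = 1 if k == FUSED else 0
        other_aromatic = 1 if (k in AROMATIC and k != FUSED) else 0
        lower = other_aromatic + 2 * fused
        upper = other_aromatic + 3 * fused
        if not lower <= membership <= upper:
            return False
    return all(len(rings[w]) == 6 * ring_exists[w] for w in rings)

# Naphthalene: eight aCH groups (15) and two fused aC groups (16) at vertices 5 and 6.
naphthalene = {i: 15 for i in range(1, 11)}
naphthalene[5] = naphthalene[6] = FUSED
good_rings = {1: {1, 2, 3, 4, 5, 6}, 2: {5, 6, 7, 8, 9, 10}, 3: set(), 4: set()}
bad_rings = {1: {1, 2, 3, 4, 5, 6}, 2: {6, 7, 8, 9, 10}, 3: set(), 4: set()}
exists = {1: 1, 2: 1, 3: 0, 4: 0}

print(ring_constraints_hold(naphthalene, good_rings, exists))
print(ring_constraints_hold(naphthalene, bad_rings, exists))
```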
Finally, we need to ensure that each ring consisting of six aromatic groups forms a closed loop. This is enforced by forbidding certain types of bonds as follows. In case (a) in Figure 2, only non-fused aromatics are present. Vertex $i$ in ring $w_1$ must be in the same ring as the vertices it is connected to via its aromatic bonds (type 'a'). Thus, it is in the same ring as vertex $j_1$, but in a different ring from vertex $j_2$. In other words, two non-fused aromatic groups at vertices $i$ and $j$ cannot be connected through their bond type 'a' if the two vertices are in different rings $w_1$ and $w_2$, as imposed by the constraints

$$\sum_{k \in G_{1a} \setminus \{16\}} \ \sum_{kk \in G_{1a} \setminus \{16\}} y_{(i,k,a),(j,kk,a)} + r_{i,w_1} + r_{j,w_2} \le 2, \quad \forall i, j \in V, \ \forall w_1, w_2 \in W, \ w_1 \ne w_2. \tag{24}$$

In case (b) in Figure 2, the fused aromatic at vertex $i$ is linked to a non-fused aromatic at vertex $j$. This can only happen if vertex $i$ is shared by two aromatic rings, $w_1$ and $w_2$, and vertex $j$ is in one of these rings. This is expressed mathematically as

$$\sum_{k \in G_{1a} \setminus \{16\}} y_{(i,16,a),(j,k,a)} + r_{i,w_1} + r_{i,w_2} + r_{j,w_3} \le 3, \quad \forall i, j \in V, \ \forall w_1, w_2, w_3 \in W, \ w_2 \ne w_1, \ w_3 \ne w_1, w_2. \tag{25}$$
Cases (c), (d) and (e) of Figure 2 represent instances where two fused aromatic groups are bonded. In case (c), each fused aromatic group belongs to exactly two rings. Vertices $i$ and $j_1$ share one common ring. Vertices $i$ and $j_2$ share two common rings. Thus, two linked fused aromatics belonging to two rings each must belong to at least one common ring. In case (d), the fused aromatic group at $i$ belongs to two rings and that at $j$ belongs to three rings. Vertex $i$ must share two rings with vertex $j$. Similarly in case (e), the fused aromatic groups at vertices $i$ and $j$ belong to three rings each, and they must share two rings. These conditions can be expressed mathematically through the constraints

$$y_{(i,16,a),(j,16,a)} + r_{i,w_1} + r_{i,w_2} + r_{i,w_3} + r_{j,w_4} + r_{j,w_5} + r_{j,w_6} \le 2 + \sum_{w \in W} r_{i,w}, \quad \forall i, j \in V, \tag{26}$$
$$\forall w_1, w_2, w_3, w_4, w_5, w_6 \in W, \ w_2 \ne w_1; \ w_3 \ne w_1, w_2; \ w_4 \ne w_1, w_2, w_3; \ w_5 \ne w_1, w_2, w_3, w_4; \ w_6 \ne w_1, w_2, w_3, w_4, w_5.$$

Constraint (26) applies as follows. In cases (c) and (d) in Figure 2, the fused aromatic at vertex $i$ is in two rings so that the right-hand side of (26) is 4. If the fused aromatic groups at $i$ and $j$ are bonded, $y_{(i,16,a),(j,16,a)} = 1$. If the two rings of $i$ are $w_1$ and $w_2$, $r_{i,w_1} + r_{i,w_2} + r_{i,w_3} = 2$. As a result, we must have $r_{j,w_4} + r_{j,w_5} + r_{j,w_6} \le 1$. Thus, only one of the rings of $j$ can be different from the rings of $i$, regardless of whether $j$ is in two or three rings. In case (e) in Figure 2, the fused aromatic at vertex $i$ is in three rings and the right-hand side of (26) is 5. If the fused aromatic groups at $i$ and $j$ are bonded, $y_{(i,16,a),(j,16,a)} = 1$. If the three rings of $i$ are $w_1$, $w_2$, $w_3$, $r_{i,w_1} + r_{i,w_2} + r_{i,w_3} = 3$. As a result, we must have $r_{j,w_4} + r_{j,w_5} + r_{j,w_6} \le 1$. In other words, only one of the rings of $j$ can be different from the rings of $i$, regardless of whether $j$ is in two or three rings.

Finally, a few constraints are added to obtain a tighter formulation. In particular, an aromatic group can be found at a given vertex only if there are aromatic rings in the molecule

$$u_{i,k} \le \sum_{w \in W} R_{aw}, \quad \forall k \in G_{1a}, \ \forall i \in V. \tag{27}$$
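The effect of constraint (26) on two bonded fused aromatics can be explored by brute force. The sketch below assumes one reading of (26), in which the first three ring indices refer to vertex $i$ and the last three to vertex $j$; the function name and the six-ring label set are hypothetical choices made for illustration.

```python
from itertools import permutations

def constraint_26_holds(bonded, rings_i, rings_j, all_rings):
    """Check (26) for every ordered choice of six distinct ring labels.

    bonded:   value of y_{(i,16,a),(j,16,a)} (0 or 1)
    rings_i:  set of rings containing vertex i
    rings_j:  set of rings containing vertex j
    """
    rhs = 2 + len(rings_i)
    for ws in permutations(all_rings, 6):
        lhs = (bonded
               + sum(w in rings_i for w in ws[:3])    # r_{i,w1..w3}
               + sum(w in rings_j for w in ws[3:]))   # r_{j,w4..w6}
        if lhs > rhs:
            return False
    return True

W = range(1, 7)  # six ring labels, so that six distinct indices exist

# Two rings each, one ring in common: allowed.
print(constraint_26_holds(1, {1, 2}, {2, 3}, W))
# Two rings each, no ring in common: the bond is rejected.
print(constraint_26_holds(1, {1, 2}, {3, 4}, W))
# Three rings each, two rings in common: allowed.
print(constraint_26_holds(1, {1, 2, 3}, {2, 3, 4}, W))
# Three rings each, only one ring in common: rejected.
print(constraint_26_holds(1, {1, 2, 3}, {3, 4, 5}, W))
```

Under this reading, a bond is accepted only when at most one ring of $j$ lies outside the rings of $i$, which matches the case analysis given in the text.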
Similarly, there can only be a cyclic group at a given vertex if there are non-aromatic cycles in the molecule

$$u_{i,k} \le R_{tot} - \sum_{w} R_{aw}, \quad \forall k \in G_{1c}, \ \forall i \in V. \tag{28}$$

Further, the number of aromatic rings cannot be greater than the number of rings

$$\sum_{w \in W} R_{aw} - R_{tot} \le 0. \tag{29}$$

The issue of redundant numbering of aromatic rings is partially avoided by imposing ordering of the rings, that is

$$R_{aw} - R_{a,w-1} \le 0, \quad \forall w \in W \setminus \{1\}. \tag{30}$$
We now consider additional constraints to make sure that Rules 1 to 5 (Section 4.2.2) for the assignment of first order groups are satisfied. Rules 1 and 2 are met de facto because of the choice of first order groups as basic groups. Rules 3 to 5 require the identification of forbidden matches in the set of first order groups. For instance, the combination of groups 1a and 48a (CH3^a and COO^a) violates rule 3 and the heavier group 40 (CH3COO) should be used. This can be enforced by including the constraints

$$\sum_{i,j} y_{(i,1,a),(j,48,a)} = 0. \tag{31}$$

A methodology to identify forbidden matches is described in Section 4.3.
Finally, bound constraints on the number of groups present in the structure are used. Since a minimum of two groups is required to form a molecule, we set

$$\sum_{k \in G_1} n_{1k} \ge 2. \tag{32}$$

The upper bound on the total number of first-order groups is set as a designer specification through the following constraint

$$\sum_{k \in G_1} n_{1k} \le N_{1,max}. \tag{33}$$

The set of constraints (8)-(30), (32) and (33) along with forbidden matches of type (31) provides the structural feasibility constraints for the design of a compound based on the set of first-order groups proposed by Marrero and Gani (2001). One drawback of this formulation is the lack of a constraint that directs the numbering of the vertices in the molecular graph. As a result, a molecule can be represented in a number of alternative ways, which differ from each other only in the vertex numbering. A partial solution to this problem is achieved by specifying consistent numbering (Churi and Achenie, 1996)

$$\sum_{k \in G_1} u_{i,k} - \sum_{k \in G_1} u_{i-1,k} \le 0, \quad \forall i \in V \setminus \{1\}. \tag{34}$$
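Constraint (34) only allows vertex $i$ to be occupied if vertex $i-1$ is occupied, so that used vertices are numbered consecutively from 1. A minimal sketch, with a hypothetical helper name and occupancy vectors holding $\sum_k u_{i,k}$ for each vertex:

```python
def numbering_is_consistent(occupancy):
    """Check (34): occupancy[i] <= occupancy[i - 1] for every vertex after the first."""
    return all(occupancy[i] <= occupancy[i - 1] for i in range(1, len(occupancy)))

print(numbering_is_consistent([1, 1, 1, 0, 0]))  # vertices 1-3 used, 4-5 empty
print(numbering_is_consistent([1, 0, 1, 1, 0]))  # gap at vertex 2
```

This removes representations that differ only by leaving gaps in the numbering. Permutations of the occupied vertices are not removed, which is why the text describes the solution as partial.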
The unambiguous description of molecules through their full connectivity allows the use of group contribution methods based on different sets of groups within the same optimization problem. For instance, the Chueh and Swanson (1973) method for the prediction of molar liquid heat capacities is based on a set of groups which differs from the first order groups of Marrero and Gani (2001). In addition, their methodology requires some information on connectivity: Cl groups have a different contribution depending on the total number of Cl groups bonded to a single carbon atom. Constraints that enable the use of the Marrero and Gani (2001) and Chueh and Swanson (1973) methods simultaneously can be formulated using standard methods of logic modeling with 0-1 variables. This is demonstrated on the case study presented in Section 4.4. The proposed formulation also allows the distinction between para, ortho and meta isomers of aromatic compounds.

4.2.6 Problem Type and Solution

The overall molecular design problem as formulated belongs to the class of mixed-integer problems. All the binary variables participate linearly in the problem. Depending on the form of the property expressions used (Eq. (2)), the optimization problem is a Mixed-Integer Linear or Nonlinear Program (MILP or MINLP). It can be solved using standard methods for MILPs (Nemhauser and Wolsey, 1988) or MINLPs (Floudas, 1995; Grossmann, 1996). In the case of nonconvex MINLPs, global optimization algorithms can also be used (e.g. Adjiman et al., 2000). Commercially available software to solve such problems includes GAMS/CPLEX for MILPs and GAMS/DICOPT for MINLPs. Branch-and-price algorithms that can solve problems with large numbers of binary variables efficiently are currently an area of active research and are likely to have a significant impact on our ability to solve larger molecular design problems (Barnhart et al., 1998).
4.3 IDENTIFICATION OF FORBIDDEN BONDS BETWEEN GROUPS

Due to the presence of a large number of first order groups in the set proposed by Marrero and Gani (2001), some molecules can be multiply described. Rules 3 to 5 enable the identification of correct descriptions and give rise to forbidden bonds. These are incorporated within the mathematical formulation through logical constraints of type (31). In this section, we present a systematic methodology to identify forbidden bonds and the corresponding constraints. The full list of such constraints is not given here because of its size. We split the first order groups into four distinct classes, as listed in the rightmost column of Table 1.

Class A - Aromatic groups. There are 47 such groups.
Class UA - Ureas and amides. There are 16 such groups. Aromatics are excluded.
Class UAS - Urea/amide subgroups. These are non-aromatic groups that can be paired to give ureas and amides, i.e. groups 33 to 36 and groups 57 to 61 and 65.
Class S - Standard groups. These are the remaining 109 first order groups.

Several cases can be distinguished.

1. One aromatic group and a non-aromatic group are combined.
If the structure formed by these two groups is unique, that is, it cannot be found in the list of first order groups and it cannot be formed by linking a different pair of aromatic and non-aromatic groups, the bond is allowed. This is the case of the combination of groups 21b (aC-CH2^b) and 1a (CH3^a) to give aC-CH2-CH3. If the structure is not unique, then the only allowed combination is the one containing the heaviest aromatic/cyclic group. All other combinations must be forbidden. This is the case for instance of the structure aC-COO-CH2, which can be formed by combining groups 18b (aC^b) - a48b (COO^{a,b}) - 2a (CH2^a), or 45b (aC-COO^b) - 2a (CH2^a). The combination (18b-a48b-2a) violates rule 4 and is therefore disallowed as follows

$$\sum_{j_1 \in V} \sum_{j_2 \in V} \left( y_{(i,48,a),(j_1,18,b)} + y_{(i,48,b),(j_2,2,a)} \right) \le 1, \quad \forall i \in V. \tag{35}$$

2. One urea/amide group and a standard group or urea/amide subgroup are combined.

If the structure can only be obtained from this combination of groups, the combination is allowed. This is the case for instance of CH3CONH2 obtained from groups 86a (CONH2^a) and 1a (CH3^a). If the structure can be obtained through another combination involving a urea/amide group, the combination containing the heaviest urea/amide group is allowed. For instance, group 101 (NH2CONH) can also be generated by combining groups 65a (NH2^a) and 107b (NHCO^b). However, this is not allowed as group 101 is heavier than group 107. This is enforced through

$$\sum_{i,j} y_{(i,65,a),(j,107,b)} = 0. \tag{36}$$
3. Two urea/amide subgroups are combined.

If the two groups do not form a urea or amide, the bond is allowed. This is the case for instance if groups 34a (CH2CO^a) and 58b (CH2NH^b) are bonded to form CH2CO-CH2NH. If the two groups are bonded through a CO-N bond, the bond is forbidden. This is the case for instance if groups 34a and 58a are bonded to form CH2CO-NHCH2. The bond is prevented through the constraints

$$\sum_{i,j} y_{(i,34,a),(j,58,a)} = 0. \tag{37}$$
4. Two or more standard groups are combined or one urea/amide subgroup and a standard group are combined.

• If the structure can only be obtained from this combination of first order groups, it is unique and thus allowed. This is the case for instance of CH3COOH, which can only be built by linking group 1a (CH3^a) and group 31a (COOH^a).

• If the structure is itself a standard first order group or a urea/amide subgroup, the combination is not allowed. This is the case for instance of CH2Cl, which can be built from groups 2a (CH2^a) and 130a (Cl^a), but is also found in the first order group list (group 108). The bond is prevented through the constraint

$$\sum_{i,j} y_{(i,2,a),(j,130,a)} = 0. \tag{38}$$

Such a structure can also be obtained by combining more than two groups. For instance, the combination of groups 29a (OH^a), 2a (CH2^a) and 50b (CH2O^b) results in the formation of first order group 134 (OCH2CH2OH). The combination of the three groups is thus excluded through the set of constraints

$$\sum_{j_1 \in V} \sum_{j_2 \in V} \left( y_{(i,2,a),(j_1,29,a)} + y_{(i,2,a),(j_2,50,b)} \right) \le 1, \quad \forall i \in V. \tag{39}$$

• If the structure can be generated by a different combination of first order groups, only the combination containing the heaviest first order group is allowed. This is the case for instance of NO2CH2COO, which can be built from groups 41b (CH2COO^b) and 81a (NO2^a) or from groups 77a (CH2NO2^a) and 48a (COO^a). Since group 77 is the heaviest of the four groups involved, the combination (77a,48a) is favoured and the combination (41b,81a) is excluded through the constraints

$$\sum_{i,j} y_{(i,41,b),(j,81,a)} = 0. \tag{40}$$
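The pairwise forbidden bonds quoted in this chapter, constraints (31), (36), (37), (38) and (40), can be collected in a lookup table. The sketch below is a simple illustration of how such a list might be stored and queried; the names `FORBIDDEN_BONDS` and `bond_is_forbidden` are assumptions, and only the pairs given in the text are included.

```python
# Each forbidden bond is an unordered pair of (group number, bond type) ends.
FORBIDDEN_BONDS = {
    frozenset({(1, "a"), (48, "a")}),     # CH3 + COO, use group 40 (Eq. 31)
    frozenset({(65, "a"), (107, "b")}),   # NH2 + NHCO, use group 101 (Eq. 36)
    frozenset({(34, "a"), (58, "a")}),    # CH2CO-NHCH2, a CO-N bond (Eq. 37)
    frozenset({(2, "a"), (130, "a")}),    # CH2 + Cl, use group 108 (Eq. 38)
    frozenset({(41, "b"), (81, "a")}),    # CH2COO + NO2, use 77a + 48a (Eq. 40)
}

def bond_is_forbidden(end_1, end_2):
    """True if a bond between the two (group, bond type) ends is excluded."""
    return frozenset({end_1, end_2}) in FORBIDDEN_BONDS

print(bond_is_forbidden((48, "a"), (1, "a")))    # order of the ends does not matter
print(bond_is_forbidden((34, "a"), (58, "b")))   # CH2CO-CH2NH is allowed
```

Constraints such as (35) and (39), which involve three groups at once, are not pairwise and would need a separate check on the neighbours of the central vertex.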
4.4 APPLICATION EXAMPLE

4.4.1 Problem description

A small example is used to illustrate the application of the proposed formulation. The design of an aromatic compound with up to two rings is considered. The design specifications are the minimization of the standard heat of fusion, given a maximum value for the melting point ($T_{m,max}$) and a minimum value for the boiling point ($T_{b,min}$) of the compound. A set of fifteen first order groups is considered, as listed in Table 2. The maximum compound size is $N_{1,max} = 20$ groups.

Table 2: First order groups (group number) used in application example

CH3 (1)     CH2 (2)     CH (3)        C (4)         aCH (15)
AC (16)     aC (18)     aCCH3 (20)    aCCH2 (21)    aCCH (22)
aCC (23)    OH (29)     aCOH (30)     COOH (31)     aCOOH (32)
4.4.2 Problem formulation

Sets

In order to formulate the problem, the following sets are defined:

Set of first order groups: $G_1$ = {1, 2, 3, 4, 15, 16, 18, 20, 21, 22, 23, 29, 30, 31, 32}.
Set of non-aromatic first order groups: $G_{1n}$ = {1, 2, 3, 4, 29, 31}.
Set of aromatic first order groups: $G_{1a}$ = {15, 16, 18, 20, 21, 22, 23, 30, 32}.
Set of vertices: $V$ = {1, .., 20}.
Set of bond types: $T$ = {a, b}.
Set of aromatic rings: $W$ = {1, 2, 3, 4}.

Variables

The binary variables needed are
• $u_{i,k}$, $\forall i \in V$, $\forall k \in G_1$, denoting whether group $k$ is present at vertex $i$,
• $y_{(i,k,t),(j,kk,tt)}$, $\forall i, j \in V$, $\forall t, tt \in T$, $\forall k, kk \in G_1$, denoting whether group $k$ at vertex $i$ is linked via a type-$t$ bond to a type-$tt$ bond of group $kk$ at vertex $j$,
• $R_{aw}$, $\forall w \in W$, denoting whether aromatic ring $w$ exists,
• $r_{i,w}$, $\forall i \in V$, $\forall w \in W$, denoting whether vertex $i$ belongs to aromatic ring $w$.

The following (non-negative) continuous variables are defined:
• $T_m$, the melting point of the compound (in K),
• $T_b$, the boiling point of the compound (in K),
• $H_{fus}$, the standard heat of fusion of the compound (in kJ/mol),
• $n_{1k}$, $\forall k \in G_1$, the number of groups of type $k$ in the compound,
• $R_{tot}$, the number of aromatic rings in the compound.

Although $n_{1k}$ and $R_{tot}$ are defined as continuous variables, they are both forced to take on integer values via constraints (8) and (15) respectively.
Data

The data needed consist of the valency of each group in $G_1$, and the contributions to melting point, boiling point and heat of fusion for all these groups. These are listed in Table 3. The contributions for each group are taken from Table 6 of Marrero and Gani (2001). Three constants are defined for use in the property prediction equations: $T_{m0}$ = 147.50 K, $T_{b0}$ = 222.543 K and $H_{fus,0}$ = 5.549 kJ/mol.

Table 3: Group data for the application example

Group k   Number   ν_k,a   ν_k,b   Tm,k (K)   Tb,k (K)   Hfus,k (kJ/mol)
CH3       1        1       0       0.6953     0.8491     1.660
CH2       2        2       0       0.2515     0.7141     2.639
CH        3        3       0       -0.3730    0.2925     0.134
C         4        4       0       0.0256     -0.0671    -1.232
aCH       15       2       0       0.5860     0.8365     -1.037
AC        16       3       0       1.8955     1.7324     0.845
aC        18       2       1       0.9176     1.5468     -0.531
aCCH3     20       2       0       1.0068     1.5653     2.969
aCCH2     21       2       1       0.1065     1.4925     0.948
aCCH      22       2       2       -0.5197    0.8665     -1.037
aCC       23       2       3       -0.1041    0.5229     -0.2856
OH        29       1       0       2.7888     2.5670     4.786
aCOH      30       2       0       5.1473     3.3205     8.427
COOH      31       1       0       7.4042     5.1108     10.692
aCOOH     32       2       0       12.4296    6.0677     14.649
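Objective function

The objective function is

$$\min H_{fus} \tag{41}$$

Constraints

The design specifications are

$$T_{b,min} \le T_b \tag{42}$$

$$T_m \le T_{m,max} \tag{43}$$

To obtain a mixed-integer linear problem, these are reformulated in terms of the exponentials $T_{be} = \exp(T_b / T_{b0})$ and $T_{me} = \exp(T_m / T_{m0})$ as

$$\exp\left( \frac{T_{b,min}}{T_{b0}} \right) \le T_{be} \tag{44}$$

$$T_{me} \le \exp\left( \frac{T_{m,max}}{T_{m0}} \right) \tag{45}$$

The property prediction constraints are given by

$$T_{me} = \sum_{k \in G_1} T_{m,k} \, n_{1k} \tag{46}$$

$$T_{be} = \sum_{k \in G_1} T_{b,k} \, n_{1k} \tag{47}$$

$$H_{fus} - H_{fus,0} = \sum_{k \in G_1} H_{fus,k} \, n_{1k} \tag{48}$$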
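The linearised specifications (44) to (47) can be evaluated for a given vector of group counts. The sketch below uses the melting and boiling point contributions of Table 3 for three aromatic groups and the bounds $T_{m,max}$ = 410 K and $T_{b,min}$ = 500 K used for the runs in this section. The function names are assumptions, and recovering $T_m$ and $T_b$ through the logarithm follows from the definition of the exponential variables.

```python
import math

TM0, TB0 = 147.50, 222.543  # constants T_m0 and T_b0 (K)

# Contributions from Table 3, keyed by group number: 15 aCH, 16 AC (fused), 18 aC.
TM_K = {15: 0.5860, 16: 1.8955, 18: 0.9176}
TB_K = {15: 0.8365, 16: 1.7324, 18: 1.5468}

def exponential_sums(n1):
    """Right-hand sides of (46) and (47) for group counts n1 = {group: count}."""
    t_me = sum(TM_K[k] * n for k, n in n1.items())
    t_be = sum(TB_K[k] * n for k, n in n1.items())
    return t_me, t_be

def meets_specifications(n1, tm_max=410.0, tb_min=500.0):
    """Linear constraints (44) and (45) in terms of T_me and T_be."""
    t_me, t_be = exponential_sums(n1)
    return math.exp(tb_min / TB0) <= t_be and t_me <= math.exp(tm_max / TM0)

def temperatures(n1):
    """Recover T_m and T_b (K) from the exponential variables."""
    t_me, t_be = exponential_sums(n1)
    return TM0 * math.log(t_me), TB0 * math.log(t_be)

naphthalene = {15: 8, 16: 2}   # eight aCH and two fused aC groups
biphenyl = {15: 10, 18: 2}     # ten aCH and two aC groups
benzene = {15: 6}

for name, n1 in [("naphthalene", naphthalene), ("biphenyl", biphenyl), ("benzene", benzene)]:
    tm, tb = temperatures(n1)
    print(name, round(tm), round(tb), meets_specifications(n1))
```

Because the group counts enter (46) and (47) linearly and the bounds in (44) and (45) are constants, the specifications stay linear in the decision variables.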
To reflect the fact that no non-aromatic rings are allowed in the molecule, constraint (29) is re-expressed as

$$R_{tot} = \sum_{w \in W} R_{aw} \tag{49}$$
Applying the methodology of Section 4.3 to the set of groups in $G_1$, the following set of forbidden bonds is identified: (18b, 1a), (18b, 2a), (18b, 3a), (18b, 4a), (18b, 29a), (18b, 31a). Thus 18b cannot be bonded to any of the non-aromatic groups. This is expressed as

$$\sum_{k \in G_{1n}} \sum_{i,j,t} y_{(i,18,b),(j,k,t)} = 0 \tag{50}$$
Constraint (26) is reduced to account for the fact that a maximum of four rings is allowed. It is given by

$$y_{(i,16,a),(j,16,a)} + r_{i,w_1} + r_{i,w_2} + r_{j,w_3} + r_{j,w_4} \le 2 + \sum_{w \in W} r_{i,w}, \quad \forall i, j \in V, \tag{51}$$
$$\forall w_1, w_2, w_3, w_4 \in W, \ w_2 \ne w_1; \ w_3 \ne w_1, w_2; \ w_4 \ne w_1, w_2, w_3.$$
Constraints (8)-(15), (20)-(25), (27)-(28), (30), (33)-(34), (44)-(51) are included in the formulation. Constraints (16) to (19) are not included because they deal with cyclic compounds.

Results

The problem is solved with $T_{m,max}$ = 410 K and $T_{b,min}$ = 500 K. The three runs described in Table 4 were solved using GAMS/CPLEX. The runs are designed to test the formulation by generating different types of aromatic compounds. All runs were successfully completed and the results are presented in Table 5. The problem was also attempted using a formulation based on the variable type $z_{i,t,j}$ instead of the $y_{(i,k,t),(j,kk,tt)}$ variables. However, due to the large number of constraints that must be introduced to describe the restrictions on aromatic groups, convergence was not achieved.
Table 4: Description of runs performed for application example

Run number   Run description
1            Design an aromatic compound
2            Design an aromatic compound with at least two rings
3            Design an aromatic compound with at least two non-fused rings
Table 5: Results for the application example. Numbers in parentheses are experimental values from Afeefy et al. (2001).

Run   Compound name                Tm (K)      Tb (K)          Hfus (kJ mol^-1)
1     1,3,5-triisopropylbenzene    218 (266)   517 (507-511)   18.2
2     Naphthalene                  315 (353)   516 (491)       22.8
3     Biphenyl                     301 (343)   543 (527)       24.0
4.5 CONCLUSIONS

Connectivity-based property prediction techniques are becoming increasingly important in the context of computer-aided molecular design tools. In this chapter, we have proposed an MINLP formulation which enables the use of a group contribution method based on a large set of functional groups. While a large number of binary variables are used to represent the vertex adjacency matrix, the model results in a comparatively small number of constraints, which reduces the symmetry of the problem and the computational effort required. The fact that the functional groups may possess more than one bond type is reflected in the formulation. This is used as a basis for the design of cyclic and aromatic compounds. General constraints are derived for molecules with an arbitrary number of rings, whether cyclic or aromatic. The proposed approach also enables a number of rules to be applied to the designed molecules, either to ensure more accurate property prediction, or to impose design requirements on the type of candidate molecules that are identified. It is also possible to use simultaneously several property prediction methods which are based on different sets of groups and which require connectivity information. By varying the type of compound to be designed, or by introducing integer cuts, the methodology yields a list of candidate molecules ranked on the basis of optimality. The approach was demonstrated on an illustrative example for the design of aromatic compounds.

The use of optimization techniques in computer-aided molecular design enables the implicit evaluation of a large number of alternative structures. This is especially valuable when the evaluation step is computationally demanding, as is the case when process performance is a selection criterion for the molecule. Furthermore, the ability to carry out an implicit search in the space of feasible compounds and to represent the full connectivity of the molecule is likely to be an important requirement in using the more accurate property prediction techniques that are currently becoming available, in particular through advances in computational chemistry.
4.6 NOMENCLATURE

Greek letters
z           Vector of physical properties for a compound
ν_k,t       Valency of bond type t in group k

Roman letters
C2          Weight in property estimation equation
C3          Weight in property estimation equation
Ci          Contribution of first order group i
Di          Contribution of second order group i
Ei          Contribution of third order group i
f           R.H.S. of property prediction equation
F           Performance index
g           Vector of inequality constraints
G1          Set of first order groups
G1n         Set of non-aromatic, non-cyclic first order groups
G1a         Set of aromatic first order groups
G1c         Set of cyclic first order groups
G2          Set of second order groups
G3          Set of third order groups
h           Vector of equality constraints
Hfus        Standard heat of fusion (kJ/mol)
Hfus,0      Constant used to calculate heat of fusion (kJ/mol)
Hfus,k      Contribution of group k to the heat of fusion
n1i         Number of first order groups of type i in compound
n2i         Number of second order groups of type i in compound
n3i         Number of third order groups of type i in compound
N1,max      Maximum number of first order groups in compound
pv          Number of vertices in molecular graph
q           Number of edges in molecular graph
ri,w        Vector of binary variables denoting whether vertex i is in aromatic ring w
Raw         Vector of binary variables denoting whether aromatic ring w exists
Rtot        Variable denoting the total number of rings
T           Set of bond types
Tb          Normal boiling temperature (K)
Tb0         Constant used to calculate boiling point (K)
Tb,k        Contribution of group k to the boiling point
Tb,min      Minimum value of the boiling point (K)
Tm          Normal melting point (K)
Tm0         Constant used to calculate melting point (K)
Tm,k        Contribution of group k to the melting point
Tm,max      Maximum value of the melting point (K)
ui,k        Vector of binary variables denoting whether group k exists at vertex i
V           Set of vertices
W           Set of aromatic rings
x           Vector of continuous process variables
y           Vector of integer variables describing the structure of a compound

Subscripts
a, b, c                       Bond type
i                             Vertex number
j, j1, j2                     Vertex number
k, kk                         Group type
t, tt                         Bond type
w, w1, w2, w3, w4, w5, w6     Ring number

Superscripts
a, b, c     Bond type
L           Lower bound
U           Upper bound
4.7 REFERENCES

[1] C.S. Adjiman, I.P. Androulakis, C.A. Floudas, Global optimization of mixed-integer nonlinear problems, AIChE J., 46 (2000) 1769.
[2] H.Y. Afeefy, J.F. Liebman, and S.E. Stein, Neutral Thermochemical Data in NIST Chemistry WebBook, NIST Standard Reference Database Number 69, Eds. P.J. Linstrom and W.G. Mallard, National Institute of Standards and Technology, Gaithersburg MD, (http://webbook.nist.gov) (2001).
[3] C. Barnhart, E.L. Johnson, G.L. Nemhauser, M.W.P. Savelsbergh, P.H. Vance, Branch-and-price: column generation for solving huge integer programs, Oper. Res., 46 (1998) 316.
[4] E.A. Brignole, S. Bottini and R. Gani, A strategy for the design and selection of solvents for separation processes, Fluid Phase Eq., 29 (1986) 125.
[5] A. Buxton, A.G. Livingston and E.N. Pistikopoulos, Optimal design of solvent blends for environmental impact minimization, AIChE J., 45 (1999) 817.
[6] K.V. Camarda and C.D. Maranas, Optimization in polymer design using connectivity indices, Ind. Eng. Chem. Res., 38 (1999) 1884.
[7] C.F. Chueh and A.C. Swanson, Estimation of liquid heat capacity, Can. J. Chem. Eng., 51 (1973) 596.
[8] N. Churi and L.E.K. Achenie, Novel mathematical programming model for computer aided molecular design, Ind. Eng. Chem. Res., 35 (1996) 3788.
[9] L. Constantinou, K. Bagherpour, R. Gani, J.A. Klein and D.T. Wu, Computer aided product design: Problem formulations, methodology and applications, Comp. Chem. Eng., 20 (1996) 685.
[10] L. Constantinou and R. Gani, New group contribution method for estimating properties of pure compounds, AIChE J., 40 (1994) 1697.
[11] L. Constantinou, S.E. Prickett and M.L. Mavrovouniotis, Estimation of thermodynamic and physical properties of acyclic hydrocarbons using the ABC approach and conjugation operators, Ind. Eng. Chem. Res., 32 (1993) 1734.
[12] V. Dua and E.N. Pistikopoulos, Optimization techniques for process synthesis and material design under uncertainty, Chem. Eng. Res. Des., 76 (1998) 408.
[13] A. Duvedi and L. Achenie, Designing environmentally safe refrigerants using mathematical programming, Chem. Eng. Sci., 15 (1996) 3727.
[14] C.A. Floudas, Nonlinear and mixed-integer optimization: Fundamentals and applications, Oxford University Press, Oxford (1995).
[15] GAMS, Generalized Algebraic Modeling System, www.gams.com.
[16] R. Gani and E.A. Brignole, Molecular design of solvents for liquid extraction based on UNIFAC, Fluid Phase Eq., 13 (1983) 331.
[17] R. Gani, B. Nielsen and A. Fredenslund, A group contribution approach to computer-aided molecular design, AIChE J., 37 (1991) 1318.
[18] I.E. Grossmann, Mixed-integer optimization techniques for algorithmic process synthesis, Advances in Chemical Engineering, 23 (1996) 171.
[19] F. Harary, Graph Theory, Addison-Wesley, Reading, 1969.
[20] P. Harper, R. Gani, P. Kolar and T. Ishikawa, Computer-aided molecular design with combined molecular modeling and group contribution, Fluid Phase Eq., 160 (1999) 337.
[21] A.L. Horvath, Molecular design: Chemical structure generation from the properties of pure organic compounds, Elsevier, Amsterdam (1992).
[22] K.G. Joback and G. Stephanopoulos, Designing molecules possessing desired physical property values, Foundations of Computer Aided Process Design, (1989) 363.
[23] K.G. Joback, Designing molecules possessing desired physical property values, Ph.D. thesis, MIT, Cambridge (1987).
[24] L.B. Kier and L.H. Hall, Molecular connectivity in chemistry and drug research, Academic Press, New York (1976).
[25] S. Macchietto, O. Odele and O. Omatsone, Design of optimal solvents for liquid-liquid extraction and gas absorption processes, Chem. Eng. Res. Des., 68 (1990) 429.
[26] C.D. Maranas, Optimal computer-aided molecular design: A polymer design case study, Ind. Eng. Chem. Res., 35 (1996) 3403.
[27] C.D. Maranas, Optimal molecular design under property prediction uncertainty, AIChE J., 43 (1997) 1250.
[28] E.C. Marcoulaki and A.C. Kokossis, On the development of novel chemicals using a systematic synthesis approach Part I. Optimisation framework, Chem. Eng. Sci., 55 (2000a) 2529.
[29] E.C. Marcoulaki and A.C. Kokossis, On the development of novel chemicals using a systematic synthesis approach Part II. Solvent design, Chem. Eng. Sci., 55 (2000b) 2547.
[30] J. Marrero and R. Gani, Group-contribution based estimation of pure component properties, Fluid Phase Eq., 183-184 (2001) 183.
[31] J. Marrero-Morejón and E. Pardillo-Fontdevila, Estimation of pure compound properties using group-interaction contributions, AIChE J., 45 (1999) 615.
[32] M. Mavrovouniotis, Product and process design with molecular-level knowledge, in First International Conference on Intelligent Systems in Process Engineering, Eds. J.F. Davis, G. Stephanopoulos, V.
Venkatasubramanian, AIChE Symp. Ser. 312, 92 (1996) 133. [33] G.L. Nemhauser and L. Wolsey, Integer and Combinatorial Optimization, Wiley, New York (1988). [34] D.E. Needham, I.C. Wei and P.G. Seybold, Molecular modeling of the physical properties of alkanes, J. Am. Chem. Soc., 110 (1988) 4186. [35] O. Odele and S. Macchietto, Computer aided molecular design: A novel method for optimal solvent selection, Fluid Phase Eq., 82 (1993) 47. [36] E.N. Pistikopoulos and S.K Stefanis, Optimal solvent design for environmental impact minimization, Comp. Chem. Eng., 22 (1998) 717. [37] B.E. Poling, J.M. Prausnitz and J.P. O'Connell, The properties of gases and liquids, McGraw-Hill, New York, 5th edition (2000).
93 [38] S. Raman and C.D. Maranas, Optimization in product design with properties correlated with topological indices, Comp. Chem. Eng., 22 (1998) 747. [39] N.V. Sahinidis and M. Tawarmalani, Applications of global optimization to process and molecular design, Comp. Chem. Eng., 24 (2000) 2157. [40] S. Siddhaye, K. Camarda, E. Topp and M. Southard, Design of novel pharmaceutical products via combinatorial optimization, Comp. Chem. Eng., 24 (2000) 701. [41] R. Vaidyanathan and M. E1-Halwagi, Computer-aided synthesis of polymers and blends with target properties, Ind. Eng. Chem. Res., 35 (1996) 627.
Computer Aided Molecular Design: Theory and Practice L.E.K. Achenie, R. Gani and V. Venkatasubramanian (Editors) © 2003 Elsevier Science B.V. All rights reserved.
Chapter 5: Genetic Algorithms Based CAMD
P. R. Patkar & V. Venkatasubramanian
Designing new molecules possessing desired properties is an important activity in the chemical, material and pharmaceutical industries. Much of this design involves an elaborate and expensive trial-and-error process that is difficult to automate. A CAMD approach using genetic algorithms (GAs) or genetic programming is presented. Unlike traditional search and optimization techniques, genetic algorithms perform a guided stochastic search where improved solutions are achieved by sampling areas of the search space that have a higher probability for good solutions. Moreover, GAs allow for the direct incorporation of higher-level chemical knowledge and reasoning strategies to make the search more efficient. A background of GAs and the implementation of GA-based search are presented followed by a discussion on the theory behind genetic search. Two polymer design case studies are discussed and an evolutionary design framework based on genetic algorithms is presented for the problems. Results from the studies are presented and some general conclusions are offered.
5.1 INTRODUCTION
The very first chapter of the book highlighted the importance of product design. Clearly, the design of new materials possessing desired properties is a very important activity in the chemical, material and pharmaceutical industries. Computer-aided molecular design (CAMD) offers a very attractive alternative to the traditional trial-and-error experimental approach, particularly since the latter often turns out to be highly protracted and expensive. The application areas of material design are diverse and encompass polymers, polymeric composites, blends, paints and varnishes, refrigerants, solvents, drugs, pesticides, and so on. The focus of this book is primarily molecular design, a special case of the broader material design problem. Several examples exist of successful design applications of CAMD, namely solvents [1,2], refrigerants [3,4], pharmaceutical products [5] and polymers [6-10]. Part II of the book presents some of these applications. In general, the overall task of CAMD requires the solution of two subproblems: the forward problem, which involves the computation of some performance measures or physical, chemical and/or biological properties from the product structure and composition; and the inverse problem,
which entails the identification of the appropriate molecular structure or composition given the desired macroscopic properties. This is illustrated in Fig. 1. Various methods can be employed for the estimation of properties from the structure. Approaches typically used include those based on group contribution [11-14], topological indices [15-18], molecular modeling [19] or their combination [20]. Depending on the problem, the prediction method could be more general and include highly nonlinear neural networks or other black-box models as well [21].
Fig. 1. Components of the molecular design problem
The solution to the inverse problem, which involves the systematic identification of viable structures, is a non-trivial task. A variety of techniques have been employed for the inverse problem, including knowledge-based systems [22, 23], machine-learning techniques [24], graph reconstruction methods [25, 26] and enumeration-based algorithms [27-29]. A number of rigorous mathematical formulations have also been proposed [1-10] and solved for several design applications. Some of these methods, both for the forward and inverse problems, have been discussed in detail in other chapters of the book. In general, the desirable features of any inverse solution method are:
• Generality of application,
• Ability to handle nonlinear objective functions and the resulting local optima,
• Ease of implementation and adaptability,
• Computational ease in handling large search-spaces,
• Robustness to approximations/uncertainties in the property predictors.
In spite of their advantages in certain specific problem domains, all the methods mentioned above typically lack one or more of these features. An inverse solution strategy based on genetic algorithms [30, 31], which forms the focus of this chapter, is able to overcome most of these difficulties for several design applications.
5.2 GENETIC ALGORITHMS & GENETIC PROGRAMMING
5.2.1 Background
Genetic algorithms are a method for stochastic, evolutionary search. The underlying idea of the genetic algorithm is drawn from the Darwinian model of natural selection and evolution. Pioneering work on GAs was done by Holland [30]. Detailed discussions on GA fundamentals and applications can be found in Goldberg [31], Davis [32], Rawlins [33] and Man et al. [34]. The original idea presented by Holland is as follows: consider that every candidate solution to a given search problem can be represented in a 'genetic' form called the chromosome, with a one-to-one mapping between the solutions in the state space and their corresponding genetic forms. It is assumed that a solution to the forward problem already exists so that any point in the state space can be evaluated in terms of the objective of the search. Since the mapping between the state space and the chromosomes is one-to-one, the quality of any candidate solution is completely determined by its genetic information, i.e. its chromosome. Therefore the terms solution and chromosome are often used interchangeably in a genetic search. The best feasible solution to a given search problem must have a corresponding chromosome under a given one-to-one, invertible mapping from the state space, Ω, to the genetic space, W. Let this chromosome be called the target chromosome. Then the basic assumption in a genetic search is that a given gene pool (collection of chromosomes) can potentially lead to the target chromosome by the process of evolution, which is the creation of new chromosomes from existing ones via exchange of genetic information. The process of evolution is carried out in a manner similar to that in living systems based on natural selection and the law of 'the survival of the fittest'.
Natural selection in living systems implies that stronger individuals are more likely to survive and win in a competing environment. In other words, fitter individuals are more likely to produce better offspring. In a GA, the fitter chromosomes are those that are closer to the target. The algorithm is implemented such that the fitter chromosomes have better chances of passing their genetic information to the subsequent generations of chromosomes. The process of evolution is carried out for a pre-decided number of generations or until the target chromosome is obtained.
Figure 2: Framework for implementation of a genetic algorithm
5.2.2 Implementation
The framework for the implementation of a genetic algorithm or a genetic program is shown in Figure 2. The process starts with a collection of chromosomes. Each chromosome is assigned a fitness depending upon its proximity to the target. The fitter chromosomes are selected as 'parents' and they are allowed to exchange or alter their genetic information to create offspring. This is achieved by means of operators called genetic operators. A new population or generation of offspring is created to replace the existing population. This is the process of evolution, which is repeated for a pre-decided number of generations or until the target is located. There are five main aspects to the overall procedure, namely (1) genetic encoding, (2) assignment of fitness, (3) selection of parents for reproduction, (4) genetic operations and (5) replacement of existing chromosomes with newly evolved ones. These are discussed in detail below.
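The five aspects listed above can be tied together in a short program. The following is only a minimal sketch on a toy problem, not the framework of Figure 2 itself: the target bit string, the matching-fraction fitness, the function names `fitness` and `evolve`, and all parameter values are assumptions chosen for illustration.

```python
import random

def fitness(chrom, target):
    """Fraction of units that match the target; equals 1.0 only at the target."""
    return sum(c == t for c, t in zip(chrom, target)) / len(target)

def evolve(target, pop_size=20, generations=200, p_cross=0.8, p_mut=0.02, seed=0):
    rng = random.Random(seed)
    n = len(target)
    # (1) genetic encoding: each chromosome is a list of binary units
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for gen in range(generations):
        # (2) assignment of fitness
        scores = [fitness(c, target) for c in pop]
        best = pop[scores.index(max(scores))]
        if max(scores) == 1.0:
            return best, gen  # target located
        new_pop = [best[:]]  # (5) the best chromosome is carried over unchanged
        while len(new_pop) < pop_size:
            # (3) fitness-proportionate selection of two parents
            #     (tiny offset keeps the weights valid if all scores are zero)
            weights = [s + 1e-9 for s in scores]
            a, b = rng.choices(pop, weights=weights, k=2)
            child = a[:]
            # (4) genetic operations: single-point crossover, then bit mutation
            if rng.random() < p_cross:
                cut = rng.randint(1, n - 1)
                child = a[:cut] + b[cut:]
            child = [1 - u if rng.random() < p_mut else u for u in child]
            new_pop.append(child)
        pop = new_pop  # (5) the offspring replace the rest of the old population
    scores = [fitness(c, target) for c in pop]
    return pop[scores.index(max(scores))], generations

best, generations_used = evolve([1, 0, 1, 1, 0, 0, 1, 0])
print(best, generations_used)
```

The loop stops either when a chromosome of fitness 1 appears or when the generation budget is spent, which mirrors the two stopping conditions stated in the text.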
Genetic Encoding
Genetic encoding is the process of devising a one-to-one, invertible map, φ, that represents every point in the original state space, Ω, of the problem by a corresponding point in the genetic space, W. The state space representation of a candidate solution is called the 'phenotype' and the genetic information, i.e. the corresponding chromosome, the 'genotype'. The terms genotype and phenotype have been derived from living systems where the phenotype is what is obtained when the genetic information is decoded or 'expressed'. Using φ⁻¹, the genetic information of a chromosome can be decoded to get the original point in the state space. Depending upon the problem, there could be more than one way of mapping the two spaces. However, in most cases, a convenient genetic representation of the state space variables arises naturally from the problem description itself. This is particularly true when some or all the variables are symbolic. The chromosome consists of one or more 'genes'. Each gene is simply a sequence of one or more units on the chromosome. In a classical genetic algorithm, the units are binary, i.e. they have value 1 or 0. One may view a value of 0 as indicating a 'recessive' part of a gene and 1 an 'active' part. More generally, a gene could consist of units that are not restricted to binary values but can take on symbolic or numeric values. When the values are not restricted to binary, the overall procedure is called a genetic program instead of a genetic algorithm. However, the latter term is used loosely to refer to both evolutionary procedures. All possible combinations of the values of the units of a gene determine the different values, called 'alleles', possible for the gene as a whole. Different genes in a chromosome can have different numbers of units. Moreover, there can be units of more than one data-type (binary, numeric or symbolic) in a given chromosome and sometimes even in the same gene. Further, certain problems could require an encoding wherein different chromosomes in a population have different numbers of genes. The hierarchical structure of genetic encoding is shown in Fig. 3 for binary units.
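The relation between units, genes and chromosomes can be illustrated with a small encoder and decoder. This is a sketch under assumed conventions: the gene names, the gene lengths in `GENE_LENGTHS`, and the choice of reading each gene as an unsigned integer are hypothetical and are not taken from the chapter.

```python
# Hypothetical layout: gene name -> number of binary units in that gene.
GENE_LENGTHS = {"a": 3, "b": 2, "c": 4}

def decode(chromosome, gene_lengths=GENE_LENGTHS):
    """Genotype -> phenotype: read each gene's units as an unsigned integer."""
    assert len(chromosome) == sum(gene_lengths.values())
    phenotype, pos = {}, 0
    for name, length in gene_lengths.items():
        units = chromosome[pos:pos + length]
        phenotype[name] = int("".join(map(str, units)), 2)
        pos += length
    return phenotype

def encode(phenotype, gene_lengths=GENE_LENGTHS):
    """Phenotype -> genotype: the inverse map, so decode(encode(x)) == x."""
    chromosome = []
    for name, length in gene_lengths.items():
        value = phenotype[name]
        assert 0 <= value < 2 ** length, "value is not an allele of this gene"
        chromosome.extend(int(bit) for bit in format(value, f"0{length}b"))
    return chromosome

point = {"a": 5, "b": 2, "c": 9}
chromosome = encode(point)
print(chromosome)          # nine binary units: 3 for "a", 2 for "b", 4 for "c"
print(decode(chromosome))  # recovers the original point
```

With this layout a gene of three binary units has 2^3 = 8 alleles, and because `encode` and `decode` are inverses of each other the map is one-to-one, as the encoding requires.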
Figure 3: Bit-string genetic encoding
It should be noted that the gene forms the fundamental unit of information in a chromosome and an individual unit has little meaning unless considered in concert with the other units making up the gene. This is similar to the DNA of living systems where the individual units are nitrogenous bases but it is the sequences of these bases that determine the genes and the genetic make-up of an individual. The process of genetic encoding establishes a relation between the individual variables in state space and the genes. Therefore the encoding is essentially deciding the
number and types of genes used to make up the chromosomes, and the mapping between them and the state variables.
Fitness Function
The fitness of a chromosome is a positive value indicating its quality or degree of 'goodness'. Therefore it is obviously related to the objective of the search problem at hand. A given chromosome must have a unique fitness value, which implies that fitness must be a function of the genotype. Since there is a one-to-one correspondence between the genotype and phenotype, the function can also be expressed as a function of the phenotype. This is called the fitness function. This is usually very closely related (if not identical) to the original objective function of the search problem defined over the state space. In several cases, particularly those in which the optimal objective function value is known beforehand, it is convenient to devise a fitness function whose range is the interval [0,1]. The fitness value of the target chromosome is 1 and the extent of departure from 1 is a measure of the 'distance' from the target for other chromosomes.
Selection of Parents
It is important to devise a selection mechanism that will simulate the phenomenon of natural selection in evolution. Thus, the fitter the chromosome, the greater should be its chances of being selected as a parent. There are several schemes for selection of parents, some of which are given below:
Random selection: Parent chromosomes are randomly picked for reproduction from the current population. Such a scheme does not incorporate the idea of natural selection. Random selection is rarely used.
Roulette Wheel selection: This is the most commonly used selection policy. Here the probability of selection of a chromosome is directly proportional to its fitness (hence the selection is also called fitness-proportionate selection). It is given by
$$P(i) = \frac{F(i)}{\sum_{j=1}^{N} F(j)} \qquad (1)$$
where P(i) is the probability of selection of chromosome i, F(i) is the fitness of chromosome i and N is the total number of individuals in the population. Consider a very simple example where the population size, N, is 5. Let the chromosomes have fitness values 0.8, 0.5, 0.3, 0.25, 0.15. Then the probabilities of selection of the different chromosomes are given by the areas of the 'roulette wheel' as shown in Fig. 4.
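For the five fitness values of this example, equation (1) can be evaluated directly. The sketch below computes the selection probabilities and also shows one common way of spinning the wheel; the function names and the use of a running sum are illustrative choices rather than anything prescribed by the chapter.

```python
import random

def selection_probabilities(fitness_values):
    """Equation (1): P(i) = F(i) / sum_j F(j)."""
    total = sum(fitness_values)
    return [f / total for f in fitness_values]

def roulette_select(fitness_values, rng=random):
    """Return the index of one chromosome chosen with probability P(i)."""
    spin = rng.random() * sum(fitness_values)
    running = 0.0
    for i, f in enumerate(fitness_values):
        running += f
        if spin < running:
            return i
    return len(fitness_values) - 1  # guard against floating-point round-off

fitness_values = [0.8, 0.5, 0.3, 0.25, 0.15]
for i, p in enumerate(selection_probabilities(fitness_values), start=1):
    print(f"chromosome {i}: P = {p:.3f}")
```

The fitness values sum to 2.0, so the probabilities are 0.4, 0.25, 0.15, 0.125 and 0.075: the fittest chromosome occupies 40% of the wheel.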
Figure 4: Probabilities of selection in Roulette-Wheel selection
Commonly, the probability is determined by using cumulative fitness values instead of the actual or raw fitness. The raw fitness values are first scaled with that of the highest in the population to get the scaled fitness
$$\mathrm{sf}(i) = \frac{F(i)}{F_{\max}} \qquad (2)$$
Then the cumulative fitness of chromosome i is given by
$$\mathrm{cumf}(i) = \sum_{j=1}^{i} \mathrm{sf}(j) \qquad (3)$$
where sf(j) is the scaled fitness of chromosome j. Now the probability of selection of a chromosome is given by
$$P(i) = \frac{\sum_{j=1}^{i} \mathrm{sf}(j)}{\sum_{j=1}^{N} \mathrm{sf}(j)} \qquad (4)$$
Using equation (4) instead of (1) favors the highly fit individuals even more during selection. The effectiveness of the roulette wheel policy strongly depends on the actual definition of the fitness function.
Rank Selection: The above drawback of roulette wheel selection is avoided by using rank selection, where the selection of individuals is only on the basis of their rank in the population. However, roulette wheel is usually preferred over rank selection since the former is better able to simulate the law of natural selection.
Tournament Selection: This type of selection is often used in optimizing game-playing strategies where two players adopting different strategies
are made to play against each other, upon which the strategy of the winner is chosen. Similarly, here two chromosomes are picked and the one with the higher fitness is chosen as a parent for reproduction.
Genetic Operators: The process of creation of offspring chromosomes from parents is achieved by means of genetic operators. A genetic operator modifies the genetic information of one or more parent chromosomes according to some probability; otherwise it leaves the parent chromosome(s) unchanged. This probability is called the operation rate of the genetic operator. Two genetic operators are primarily used in a classical GA, namely crossover and mutation.
Crossover: Crossover involves the creation of two offspring chromosomes by exchange of contiguous chunks of units from two parent chromosomes. In a single-point crossover, one point is chosen as the crossover point. Each parent chromosome is cut at the crossover point and the parts are exchanged. This is shown in Figure 5a. Some studies have shown two-point crossover to be superior to single-point in certain cases [35]. Recently other types of crossover such as multiple-point, i.e. n-point crossover [36] with n > 2, and uniform crossover [37, 38] have been proposed.
Mutation: Mutation operates on one chromosome at a time. It involves the modification of one or more units of the chromosome. A binary unit when mutated becomes 1 if it was originally 0 or becomes 0 if originally 1. In general, a unit of a chromosome, upon mutation, changes from its current value to some other allowable value. In single-bit mutation (the term bit originates from the case of binary representations), a unit is randomly picked on the chromosome and mutated. In a string-wise mutation, every bit on the chromosome is mutated with some probability. This is shown in Figure 5b where four of the total ten bits of the chromosome have been mutated.
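The two classical operators described above are short enough to write out for binary chromosomes. The sketch below assumes chromosomes stored as Python lists of 0/1 units; the function names are illustrative, and the crossover point is passed in explicitly so that the effect of the cut is easy to follow.

```python
import random

def single_point_crossover(parent_a, parent_b, point):
    """Cut both parents at `point` and swap the tails, giving two offspring."""
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def single_bit_mutation(chromosome, rng=random):
    """Flip one randomly chosen binary unit."""
    i = rng.randrange(len(chromosome))
    return chromosome[:i] + [1 - chromosome[i]] + chromosome[i + 1:]

def string_wise_mutation(chromosome, rate, rng=random):
    """Flip every binary unit independently with probability `rate`."""
    return [1 - u if rng.random() < rate else u for u in chromosome]

a, b = [1, 1, 1, 1, 1], [0, 0, 0, 0, 0]
print(single_point_crossover(a, b, 2))  # ([1, 1, 0, 0, 0], [0, 0, 1, 1, 1])
print(single_bit_mutation(a))           # differs from `a` in exactly one unit
print(string_wise_mutation(a, 0.4))     # each unit flipped with probability 0.4
```

In a full algorithm each operator would be applied only with its operation rate, leaving the parents unchanged otherwise.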
Figure 5a: Single-point crossover
Figure 5b: Length-wise mutation
The crossover operator results in drastic changes in the genotype, which results in huge leaps from one point to another in the state space. Thus by means of crossover the algorithm is able to rapidly navigate through several regions of the search space spread far apart from one another. Mutation, on the other hand, results in small changes in a chromosome. This translates to small movements in a given region of the search space. Thus mutation is a local optimization operator. This combined exploratory and local-search ability of the GA is its most significant feature. The algorithm is able to quickly recognize the promising areas of the search space and then closely investigate each of them by local search. In addition to crossover and mutation, one could devise other operators for creation of offspring. Examples of such operators will be discussed in the polymer design case study presented in later sections. In the case of constrained optimization problems, it is often desirable to eliminate or at least discourage infeasible solutions in the population. The formation of infeasible solutions can be prevented upfront via suitable modifications to the operators. Such modified operators that only produce feasible solutions are called constrained genetic operators. An alternative way of tackling infeasibility is by providing a penalty on infeasible solutions so that their fitness values become very low. Then such poorly fit chromosomes will most likely get eliminated during the course of evolution as a result of natural selection. The argument in favor of allowing infeasible chromosomes at all is that they could contain some good genes despite being infeasible on the whole. Hence, by means of contributing the good genes, such chromosomes could eventually lead to fitter, feasible offspring during evolution. Either policy may be adopted to tackle the problem of infeasibility.
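The penalty approach to infeasibility can be sketched in a few lines. Everything specific here is an assumption made for illustration: the multiplicative form of the penalty, the factor 0.1, the raw fitness (fraction of units equal to 1) and the single constraint (at most three units equal to 1) are hypothetical.

```python
def penalized_fitness(chromosome, raw_fitness, constraints, penalty_factor=0.1):
    """Scale the raw fitness down once for every violated constraint."""
    violations = sum(not is_satisfied(chromosome) for is_satisfied in constraints)
    return raw_fitness(chromosome) * penalty_factor ** violations

# Hypothetical problem: fitness is the fraction of 1-units,
# but a feasible chromosome may contain at most three of them.
raw = lambda c: sum(c) / len(c)
constraints = [lambda c: sum(c) <= 3]

feasible = [1, 1, 1, 0, 0]
infeasible = [1, 1, 1, 1, 1]
print(penalized_fitness(feasible, raw, constraints))
print(penalized_fitness(infeasible, raw, constraints))
```

The infeasible chromosome keeps a small positive fitness, so it can still be selected occasionally and pass on its genes, but it is very unlikely to dominate the population.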
Replacement Policy
By repeated selection of parents followed by application of the genetic operators, a number of offspring can be created. Then, one of several strategies may be adopted to replace the current generation with the offspring generation. One such scheme is called generational replacement, where all the chromosomes in the existing population are replaced by the offspring [39]. In such a case, to maintain a constant population size of N, each generation will involve N offspring chromosomes. A drawback of the above policy is that all the chromosomes, including the best, of the current population are discarded. Then, if some of the fitter chromosomes fail to produce offspring in the succeeding generation, their good genes could be lost permanently. To avoid such an occurrence, the generational replacement policy is usually combined with a policy called elitism. In an elitist strategy, one or a few of the best chromosomes are directly passed on to the next generation. This conserves a fixed number of best solutions produced till that point of evolution. Though elitism can lead to a faster domination of a certain chromosome in the population, in general it has been found to improve the overall performance of the genetic procedure. Some schemes replace only the worst chromosomes when new chromosomes are inserted into the population. Such schemes generate a small number of offspring. In other words, these are heavily elitist policies. Another policy is to replace parents by the offspring produced by them. A problem with this is that highly fit parents need not always lead to good offspring chromosomes. Thus, used standalone, the policy can lead to the loss of good chromosomes. Another method involves replacement of the eldest chromosomes, which are the ones that have been in the population for more than a certain number of generations. Once again, the best chromosomes could eventually get discarded during evolution if such a policy were implemented. The most commonly used scheme is generational replacement with elitism.
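Generational replacement with elitism amounts to a small bookkeeping step. The sketch below is one possible reading of the policy: the function name, the `n_elite` parameter and the convention that surplus offspring are simply dropped are assumptions.

```python
def replace_with_elitism(population, offspring, fitness, n_elite=1):
    """Build the next generation: the n_elite fittest current chromosomes
    survive unchanged and offspring fill the remaining slots, so the
    population size stays constant."""
    size = len(population)
    elite = [c[:] for c in sorted(population, key=fitness, reverse=True)[:n_elite]]
    return elite + offspring[:size - n_elite]

population = [[1, 1, 1], [0, 0, 0], [1, 0, 0]]
offspring = [[0, 1, 0], [0, 0, 1], [0, 1, 1]]
# Fitness taken as the number of 1-units, purely for illustration.
print(replace_with_elitism(population, offspring, fitness=sum, n_elite=1))
```

Setting `n_elite=0` gives plain generational replacement, in which the best chromosome of the current population can be lost.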
5.3 THE ALGEBRA OF GENETIC ALGORITHMS
There are broadly two schools of thought as to why genetic algorithms really work: Schema Theory and Building Block Hypothesis. Recently, a generalization of the schema theory called Forma Analysis has been proposed [40]. The important features of each theory are presented.
5.3.1 Schema Theory
The standard theory is based on Holland's schema analysis presented in his pioneering work on GAs [30]. The schema theory or schema analysis operates in the genotype space. It is applicable when the chromosomes are linear strings of a fixed length and when the units take on a well-defined set of values. Schema analysis provides valuable insight into the operation of a genetic algorithm. There have been extensions to the analysis that enable tracing the evolution of individual strings for infinite and finite populations [41, 42, 43].
Figure 6: Three-dimensional cube as the genotype space
We present here a discussion on the schema theory from Man et al. (1999). In order to understand the meaning of a 'schema', let us consider an example where the genetic representation involves chromosomes consisting of three bits. Thus, the genetic space is the three-dimensional cube shown in Fig. 6. The origin corresponds to the chromosome 000. In any genetic representation, each vertex of the genetic search space corresponds to a chromosome. Here, the total number of possible chromosomes is eight. The bit-strings of adjacent vertices differ by exactly one bit; in other words they are separated by a Hamming distance of 1. The shaded face of the cube consists of vertices represented by the string 0**, where '*' is a 'wild card' symbol that represents '0 or 1'. Binary strings containing one or more '*' are called 'schemata' (singular: 'schema'). The number of fixed bit values that appear in a schema is called the order, o, of the schema. For instance, in Fig. 6, the schema 0*0 corresponds to the left edge (shown by a thick line) of the shaded face of the cube. The order of this plane is 2. It matches the chromosomes 000 and 010, which make up the edge. The shaded front face of the cube is an order 1 plane and is represented by the schema 0**. Here, schema 0** corresponds to the four chromosomes that make up the vertices of the face.
In general, every schema represents exactly 2^r chromosomes, where r is the number of 'wild card' symbols, *, in the schema template. In binary encoding when the length of each chromosome is L, every chromosome is a corner of the L-dimensional hypercube and belongs to 2^L - 1 different planes. Apart from the order of a schema, another important attribute is the distance between its outermost fixed positions, which is called its Defining Length, δ. For instance, the Defining Length of the schema *0101* is 3 whereas that in the case of *010*1 is 4. The defining length is a measure of the compactness of the information contained in a schema. The way a genetic algorithm utilizes schemata is as follows. The genetic search samples several chromosomes in each generation. Each such population of sample solutions provides information about numerous planes. In particular, planes of low order would likely be sampled by several solutions in the population. Every chromosome being sampled effectively results in 2^L - 1 planes being sampled. This is called the 'implicit parallelism' in genetic algorithms. The competition for survival between different chromosomes can be viewed at a higher level as the competition between the corresponding planes or schemata. Thus implicit parallelism means that many such schema competitions are being simultaneously evaluated and solved in parallel. Holland derived the Implicit Parallelism Lower Bound that states that the number of schemata processed in a single generation is O(N^3), where N is the size of the population. Fitzpatrick et al. [44] argued that for L > 64 and 2^6 < N < 2^20 the number of schemata processed was greater than N^3. The membership of a schema at a given stage of evolution is defined as the number of chromosomes in the current population belonging to that schema. As will be shown later, during the course of evolution, a fitter schema has greater chances of survival and correspondingly its membership grows, i.e. its representation in the population increases. By the same token, as a result of natural selection, the representation of all poorly fit schemata would decrease in the highly competitive environment.
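The schema attributes defined above (matching, order, defining length and membership) are easy to compute for bit strings. The following sketch represents schemata and chromosomes as Python strings over '0', '1' and '*'; the function names are illustrative and not taken from the chapter.

```python
from itertools import product

def matches(schema, chromosome):
    """True if the chromosome agrees with the schema at every fixed position."""
    return len(schema) == len(chromosome) and all(
        s in ("*", c) for s, c in zip(schema, chromosome))

def order(schema):
    """Number of fixed (non-wild-card) positions."""
    return sum(s != "*" for s in schema)

def defining_length(schema):
    """Distance between the outermost fixed positions."""
    fixed = [i for i, s in enumerate(schema) if s != "*"]
    return fixed[-1] - fixed[0] if fixed else 0

def membership(schema, population):
    """Number of chromosomes in the population that belong to the schema."""
    return sum(matches(schema, c) for c in population)

def schemata_containing(chromosome):
    """Every schema with at least one fixed position that matches the chromosome."""
    options = [(c, "*") for c in chromosome]
    return ["".join(s) for s in product(*options) if set(s) != {"*"}]

print(order("0*0"), matches("0*0", "010"))                 # 2 True
print(defining_length("*0101*"), defining_length("*010*1"))  # 3 4
print(len(schemata_containing("010")))                     # 7, i.e. 2**3 - 1
```

The last line checks the counting argument behind implicit parallelism for L = 3: a single chromosome is a member of 2^L - 1 = 7 schemata.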
The schema theory suggests that such increase or decrease in the representation of competing schemata in the population is the outcome of genetic operations acting according to the relative fitness of the chromosomes belonging to the schemata. The Schema Theorem [30] gives a lower bound for the sampling rate of a given schema, that is, the rate of change of membership of the schema during evolution. It is derived as follows. Since a schema is a collection of strings, we can associate an average fitness value with every schema at time (generation) t. Let $\mu_\xi(t)$ be the average observed fitness of a given schema $\xi$ at time t, i.e. the average fitness of all the members of the population at time t that are members of schema $\xi$. Let $N_\xi(t)$ be the membership of schema $\xi$ at time t. If fitness proportionate selection is adopted during reproduction, we can estimate the number of members of schema $\xi$ in the next generation. If $\bar{\mu}(t)$ is the average fitness of the entire population at time t then the probability of selection for reproduction of a member of schema $\xi$ (in a single string selection) is, on average, equal to
$$\frac{\mu_\xi(t)}{N \bar{\mu}(t)}$$
where N is the population size. Then, over the N selections that make up a generation, the expected number of members in schema $\xi$ in the next generation is
$$E(N_\xi(t+1)) = N_\xi(t)\, \frac{\mu_\xi(t)}{\bar{\mu}(t)} \qquad (5)$$
Let
$$\varepsilon = \frac{\mu_\xi(t) - \bar{\mu}(t)}{\bar{\mu}(t)} \qquad (6)$$
A value of $\varepsilon > 0$ implies that the schema has an above average fitness and vice versa. Substituting equation (6) into (5) and assuming that $\varepsilon$ stays constant over the generations, it can be seen that an 'above average' schema receives an exponentially increasing number of members in the subsequent generations:
$$E(N_\xi(t)) = N_\xi(0)\,(1 + \varepsilon)^t \qquad (7)$$
The above equation shows that the growth of an above-average schema is highly favored as a consequence of the fitness proportionate selection policy. However the above equation does not accurately reflect the sampling rate. The disruptive action of the evolutionary operators tends to decrease the membership of such schemata and needs to be incorporated in the sampling rate. Consider single-point crossover being applied over chromosomes of length L. The crossover point would, in general, be selected uniformly among L-1 possible positions along the chromosome. Then the probability of destruction of a schema $\xi$ as a result of the crossover is
$$P_d(\xi) = \frac{\delta(\xi)}{L-1} \qquad (8)$$
where $\delta(\xi)$ is the defining length of schema $\xi$. The probability that the schema would survive the crossover is given by
$$P_s(\xi) = 1 - \frac{\delta(\xi)}{L-1} \qquad (9)$$
If the operation rate of crossover is $P_c$ then the probability of survival of schema $\xi$ is
$$P_s(\xi) = 1 - P_c\, \frac{\delta(\xi)}{L-1} \qquad (10)$$
It should be noted that even if the crossover point occurs between fixed positions, schema $\xi$ might still survive the operation. Therefore equation (10) has to be modified as
$$P_s(\xi) \geq 1 - P_c\, \frac{\delta(\xi)}{L-1} \qquad (11)$$
The effect of mutation can be similarly incorporated. Suppose that the probability of bit mutation is $P_m$. Then the probability of a single bit surviving is $1 - P_m$. Therefore the probability of survival of schema $\xi$ after a sequence of one-bit mutations is

$$P_s(\xi) = (1 - P_m)^{o(\xi)} \tag{12}$$

where $o(\xi)$ is the order of schema $\xi$. Since $P_m \ll 1$, the above equation can be approximated as

$$P_s(\xi) \approx 1 - P_m\,o(\xi) \tag{13}$$

Incorporating the disruptive effects of crossover and mutation into equation (5), an equation for the reproductive growth of the schema is obtained as

$$E(N_\xi(t+1)) \ge N_\xi(t)\,\frac{\hat{\mu}_\xi(t)}{\bar{\mu}(t)}\left[1 - P_c\,\frac{\delta(\xi)}{L-1} - P_m\,o(\xi)\right] \tag{14}$$

In general, in addition to crossover and mutation, several other operators may be applied. If $\Omega$ is the set of all genetic operators being used, then the above equation can be stated as

$$E(N_\xi(t+1)) \ge N_\xi(t)\,\frac{\hat{\mu}_\xi(t)}{\bar{\mu}(t)}\left[1 - \sum_{\omega \in \Omega} p_\omega\,p_\omega(\xi)\right] \tag{15}$$

where the term $p_\omega\,p_\omega(\xi)$ quantifies the potential disruptive effect of the application of a genetic operator $\omega \in \Omega$.
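The bound in equation (14) can be evaluated numerically to see why short, low-order schemata are favored. The following is a minimal sketch, assuming schemata are written as strings over {0, 1, *} with '*' as the don't-care symbol; the function names and the example parameter values are illustrative and not taken from the chapter.

```python
def defining_length(schema: str) -> int:
    """Distance between the first and last fixed (non-'*') positions."""
    fixed = [i for i, c in enumerate(schema) if c != "*"]
    return fixed[-1] - fixed[0] if fixed else 0


def order(schema: str) -> int:
    """Number of fixed (non-'*') positions in the schema."""
    return sum(c != "*" for c in schema)


def schema_growth_bound(n_members, schema_fitness, mean_fitness,
                        schema, p_c, p_m):
    """Lower bound on the expected membership in the next generation,
    following the form of equation (14)."""
    length = len(schema)
    disruption = (p_c * defining_length(schema) / (length - 1)
                  + p_m * order(schema))
    return n_members * (schema_fitness / mean_fitness) * (1.0 - disruption)


if __name__ == "__main__":
    # Two schemata of the same order and relative fitness; only the
    # defining length differs (illustrative values).
    for s in ("1*0*****", "1******0"):
        bound = schema_growth_bound(10, 1.2, 1.0, s, p_c=0.6, p_m=0.01)
        print(s, defining_length(s), order(s), round(bound, 3))
```

With equal order and equal relative fitness, the schema with the longer defining length receives the smaller bound, because single-point crossover is more likely to cut between its fixed positions.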
The generalized form of the schema growth equation derived above is the mathematical statement of the Schema Theorem or the Fundamental Theorem of Genetic Algorithms. The implication of the theorem is that short, low-order, above-average schemata receive exponentially increasing trials in subsequent generations. Bridges and Goldberg [45] extended the schema theorem for binary schemata to replace the inequality with an equality by including terms for string gains as well as disruption terms. It is important to note that the schema theorem applies equally for a given phenotype space, $\Phi$, and the corresponding genotype space, $W$, regardless of the mapping between the two spaces. Assuming that $\Phi$ and $W$ have the same size, there are as many as $|\Phi|!$ such mappings possible, yet the schema theorem applies equally to each of them. This makes the theorem powerful and widely applicable.

The schema theory does, however, have some limitations, the most obvious of which is that it is applicable only to binary representations having one-to-one correspondence between the chromosomes and the solutions in the phenotype space. In several problems it is convenient and natural to use non-binary units. For instance, under real-valued or symbolic encoding, the theory cannot explain the mechanics of the genetic program. The schema theory also assumes the use of standard genetic operators such as crossover or mutation. A number of problems require the use of special problem-specific operators, which may often be constrained operators. All the above instances lack the theoretical backing of the schema theory. As a result, until recently, they were mostly considered to be heuristic approaches. Forma Theory is a recent generalization of the schema theory, which can provide theoretical support for such approaches to the same extent as schema theory does for classical GAs. A detailed discussion of Forma Theory is beyond the scope of this chapter.
However the key features of the theory are briefly presented in the following section.
5.3.2 Forma Theory

The Forma Theory was developed by Radcliffe in a series of papers [40, 46-49]. The theory does not require any specific representation and therefore applies equally to non-binary encoding. The representation is considered purely as a matter of implementation and does not affect the analysis. Thus the theory is a generalization of the schema theory of classical GAs and is therefore more flexible. The theory defines 'formae' as sets of solutions sharing a certain property assumed to be relevant to the solutions' fitness. Formae are simply extensions of the idea of the schema, where the latter refers to a set of solutions sharing specific binary units. The theory presents some guidelines as to the properties required of operators with respect to such formae that enable a genetic program to actually work. In his analysis, Radcliffe suggested some standard operators for a given set of formae. The specifics of a given problem are incorporated by means of defining the appropriate set of formae. Then the effect of standard operators is analyzed on the abstract search space. The theory is also able to examine the effect of non-standard, 'heuristic' operators.
5.3.3 Building Block Hypothesis

Holland introduced the idea that for a GA to work efficiently, the string-based representation should be able to effectively reflect the structure of the search space. Ideally, certain bits or groups of bits (genes) in a chromosome should represent certain properties of the corresponding phenotype that have significant bearing on the fitness. The assumption, then, is that the combination of such 'good' genes would lead to highly fit solutions. Chromosomes having one or more good genes are simply short, low-order schemata whose fixed-value bits contribute significantly towards high overall fitness. Such high-performance schemata are called 'building blocks'. The building block hypothesis suggests that a genetic algorithm seeks near-optimal performance through the juxtaposition of such blocks. The agents responsible for the juxtaposition of building blocks are the genetic operators such as crossover and mutation. These operators have the ability to generate, promote and juxtapose building blocks to form the optimal or nearly optimal strings. Crossover tends to conserve the genetic information present in the parent chromosomes. Therefore, when the chromosomes chosen for crossover are similar, their capacity to generate new building blocks diminishes. On the other hand, mutation does not conserve genetic information and can generate radically new building blocks. The building block hypothesis suggests that the encoding can critically determine the performance of the GA, since the coding should be such that short, high-performance building blocks are not only possible but also easy enough for the algorithm to locate quickly. It should be noted that the above theories only offer possible explanations as to why GAs work. In general, because of the heuristic nature of the search, no guarantees can be offered about convergence.
However this very aspect of the algorithm enables the search to overcome problems presented by local minima traps or discontinuous spaces. Thus the heuristic nature of the GA is in a way both its strength and weakness.
5.4 GA-BASED CAMD: THE POLYMER DESIGN PROBLEM

The adaptation [50] and application of GAs [51-54] as a solution framework for CAMD are described in this section. It is illustrated via the polymer design problem, a common design problem in polymer engineering: the determination of a polymer structure that meets a number of physical property constraints. Stated more specifically, the polymer design problem is to determine the repeat unit structure of a polymer, say -[-x1-x2-...-xL-]n-, satisfying a set of desired macroscopic physical properties, where the xi are functional groups.

Figure 7: GA framework for the polymer design problem

5.4.1 Proposed GA Framework

The proposed framework for the polymer design problem uses (i) the standard group contribution methods discussed earlier for the forward problem and (ii) an adaptation of the standard genetic algorithm for the inverse problem. Figure 7 shows the GA framework for polymer design. The standard GA is modified in three aspects: representation of molecules (polymer repeat units), creation of new operators in order to exploit chemical knowledge of molecular interactions and rearrangements, and fitness function in order to handle property constraints. The selection policy is the commonly used fitness-proportionate selection. Elitism at 10% of the population size is incorporated into the replacement policy. A detailed discussion is presented in the following sections.
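The selection and replacement policy described above can be sketched as follows. This is an illustrative outline only, assuming a fixed population size, a user-supplied fitness function returning non-negative values (not all zero), and a user-supplied `make_offspring` callable standing in for the genetic operators; none of these names come from the chapter.

```python
import random


def next_generation(population, fitness, make_offspring,
                    elite_fraction=0.10, rng=random):
    """Build the next generation with fitness-proportionate (roulette
    wheel) selection and elitist retention of the fittest members."""
    # Elitism: copy the fittest members unchanged.
    ranked = sorted(population, key=fitness, reverse=True)
    n_elite = max(1, int(round(elite_fraction * len(population))))
    new_population = ranked[:n_elite]

    # Roulette wheel: selection probability proportional to fitness.
    # random.choices raises ValueError if all weights are zero.
    weights = [fitness(member) for member in population]
    while len(new_population) < len(population):
        parent1, parent2 = rng.choices(population, weights=weights, k=2)
        new_population.append(make_offspring(parent1, parent2))
    return new_population


if __name__ == "__main__":
    # Toy example: individuals are numbers and fitness is their value.
    pop = list(range(1, 21))
    print(next_generation(pop, float, lambda a, b: max(a, b),
                          rng=random.Random(0)))
```

The population size stays fixed from one generation to the next, and with `elite_fraction=0.10` a population of 100 retains its ten fittest members.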
5.4.2 Molecule Representation

A standard GA employs the bit-string encoding scheme as discussed earlier. However, for the polymer design problem, if bit strings were used to represent molecular structures then one would need binary matrices to represent the groups present in the structure and their connectivity. Such a representation would not only make the overall scheme more complicated as a result of extensive bookkeeping, but also render the representation difficult to follow and interpret. A more suitable and natural representation would be to represent chemical structure as a string of symbols or functional groups. Under such an encoding, the string is composed of one or more genes, each of which represents an elemental, sub-structural or monomer unit. The units are functional groups on the main backbone chain and the side-chains.
Figure 8: Molecular structure representation (example groups, polymer repeat units and monomers with their symbolic encodings, including ((C C) ((H H) (H Cl))) and ((C BZ C) ((H H) NIL (H H))))

Since the encoding is symbolic, the method is not a classical genetic algorithm but a genetic program. It is important to bear in mind that the problem involves a search over polymer repeat units that may be of different lengths. Consequently, the encoding does not require chromosomes to have a fixed number of genes. It will be seen later that the operators can in fact modify the length of a parent chromosome to result in offspring of different length. Figure 8 presents examples of the symbolic coding scheme representing molecular structure as nested lists in Lisp [55]. For the example illustrated in the figure, ((C C) ((H H) (H Cl))), the first list of two Cs stands for two carbon backbone units. The subsequent lists contain elements that are side-chain substituents for each backbone unit in the order of the lists. It is necessary to emphasize once again that the adopted genetic encoding based on functional groups is a natural representation of the problem, which enables easy expression of the rich and complex chemistry of molecules. Further, it facilitates the integration of any heuristic chemical knowledge that one might have about the problem into the genetic framework so as to speed up the search process. For instance, instead of starting the initial GA population at random, a designer using the GA system can start with structures that he or she believes to be good guesses based on his or her experience.
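The nested-list encoding can be mirrored in other languages. The sketch below uses Python tuples in place of Lisp lists; it is an illustration of the data layout only, and the helper names, as well as the use of `None` for Lisp's NIL, are assumptions made for this example.

```python
# ((C C) ((H H) (H Cl))): two carbon backbone units, followed by the
# side-chain substituents of each backbone unit in order.
example_unit = (("C", "C"), (("H", "H"), ("H", "Cl")))

# ((C BZ C) ((H H) NIL (H H))): the middle backbone unit carries no
# side-chain list, represented here by None.
ring_unit = (("C", "BZ", "C"), (("H", "H"), None, ("H", "H")))


def to_lisp(expr) -> str:
    """Render a nested tuple in the Lisp-style notation of Figure 8."""
    if expr is None:
        return "NIL"
    if isinstance(expr, str):
        return expr
    return "(" + " ".join(to_lisp(item) for item in expr) + ")"


def backbone_length(chromosome) -> int:
    """Number of genes (backbone units) in the repeat unit."""
    return len(chromosome[0])


def side_groups(chromosome, position):
    """Side-chain substituents attached to the backbone unit at position."""
    return chromosome[1][position]


if __name__ == "__main__":
    print(to_lisp(example_unit))
    print(backbone_length(ring_unit), side_groups(example_unit, 1))
```

Because the chromosome is a sequence of backbone units rather than a fixed-width bit string, repeat units of different lengths are represented without any change to the data structure.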
5.4.3 Fitness function

For the polymer design problem, two kinds of fitness functions are used depending on the nature of the property constraints. When one is designing for a target property value with some bounds (i.e. both upper and lower bounds on the desired value), the following Gaussian-like function is employed:

$$F = \exp\left[-\alpha \sum_i \left(\frac{P_i - \bar{P}_i}{P_{i,\max} - P_{i,\min}}\right)^2\right]$$

where $P_i$ is the $i$th property value, $P_{i,\max}$ and $P_{i,\min}$ are respectively the maximum and minimum acceptable property values, which are used to normalize the property values, and $\bar{P}_i$ is the average of the maximum and minimum acceptable property values. The index $i$ ranges over all the property constraints that are applied. For example, consider designing for a glass transition temperature of 400 K ($\bar{P}_i$ = 400 K), with $P_{i,\max}$ = 402 K and $P_{i,\min}$ = 398 K. Then, if for a particular molecular candidate $P_i$ is 420 K, the candidate is somewhat far from the desired value, as indicated by its fitness of 0.29 (for $\alpha$ = 0.001). The function $F$ ranges from 0 to 1, with 1 being the target molecule's fitness. The parameter $\alpha$ is the fitness decay rate that determines how the fitness values fall off as the solutions move away from the center of the target. The Gaussian fitness function is shown in Fig. 9.

The second type of fitness function used is a sigmoidal function. This is preferred when the design involves property constraints that have only a lower bound or an upper bound, but not both:

$$F_i = \frac{1}{1 + \exp\left[-\beta\,\dfrac{P_i - P_{F=0.5,i}}{P_{\mathrm{Range},i}}\right]}$$

where $P_{F=0.5,i}$ is the property value for which the evaluated fitness is 0.5. It is taken to be the lower or the upper limit of the acceptable property constraints. $P_{\mathrm{Range},i}$ normalizes the property values so as to remove any bias of a single property on the overall fitness. The total fitness is taken as the mean of all individual property fitness values. The parameter $\beta$ controls the slope of the sigmoid. Figure 10 displays the fitness function for $\beta$ = 10.
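The two fitness functions can be sketched in a few lines. The code below is a hedged illustration, assuming the Gaussian-like form is the exponential of a negative, range-normalized sum of squared deviations and that the sigmoid is the standard logistic function; the `lower_bound` flag for handling upper-bound constraints by a sign change is an assumption of this sketch, as are all function and argument names.

```python
import math


def gaussian_fitness(values, targets, ranges, alpha=0.001):
    """exp(-alpha * sum(((P_i - Pbar_i) / (Pmax_i - Pmin_i))**2))."""
    total = sum(((p - t) / r) ** 2
                for p, t, r in zip(values, targets, ranges))
    return math.exp(-alpha * total)


def sigmoid_fitness(value, p_half, p_range, beta=10.0, lower_bound=True):
    """Logistic fitness that equals 0.5 at the constraint limit p_half.
    For an upper-bound constraint the argument is negated (assumption)."""
    z = beta * (value - p_half) / p_range
    if not lower_bound:
        z = -z
    # Numerically stable evaluation of 1 / (1 + exp(-z)).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mean_sigmoid_fitness(values, p_halfs, p_ranges, lower_bounds, beta=10.0):
    """Total fitness taken as the mean of the per-property fitnesses."""
    scores = [sigmoid_fitness(v, h, r, beta, lb)
              for v, h, r, lb in zip(values, p_halfs, p_ranges, lower_bounds)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for alpha in (0.001, 0.05):
        print(alpha, round(gaussian_fitness([420.0], [400.0], [4.0], alpha), 3))
    print(sigmoid_fitness(400.0, 400.0, 100.0))
```

Under this assumed form, the 420 K example evaluates to exp(-25α): about 0.975 for α = 0.001 and about 0.29 for α = 0.05, so the computed fitness depends strongly on the decay rate used.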
Figure 9: Gaussian fitness function
Figure 10: Sigmoidal fitness function

5.4.4 Adaptation of Genetic Operators

The molecular string representation offers an excellent platform to fully exploit the richness and variety of the chemistry of molecular evolution. Towards this end, new genetic operators (in addition to the crossover and mutation operators), previously not found in the standard genetic algorithm literature, have been developed [52]:
Single-point Crossover

Figure 11 shows the single-point crossover operator. In this example, crossover occurs after position three of parent #1 and position two of parent #2 (as shown by the dotted lines). The offspring are created by crossing over the genes of the parents as shown. When the parents are chromosomes of different lengths, as in the case of Fig. 11, the cut-off point is chosen by counting the genes from the left or the right in each parent. Obviously, the crossover operator can lead to offspring with chromosomes of lengths different from either parent.
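A variable-length single-point crossover can be sketched on chromosomes stored as lists of genes. This is a simplified illustration that treats each gene as an opaque label and performs no chemical feasibility check; the function names and the placeholder gene labels are hypothetical.

```python
import random


def crossover_at(parent1, parent2, cut1, cut2):
    """Exchange tails after cut1 genes of parent1 and cut2 genes of parent2."""
    child1 = parent1[:cut1] + parent2[cut2:]
    child2 = parent2[:cut2] + parent1[cut1:]
    return child1, child2


def random_crossover(parent1, parent2, rng=random):
    """Pick an independent cut point in each parent (each parent must
    have at least two genes) and cross over there."""
    cut1 = rng.randint(1, len(parent1) - 1)
    cut2 = rng.randint(1, len(parent2) - 1)
    return crossover_at(parent1, parent2, cut1, cut2)


if __name__ == "__main__":
    p1 = ["A1", "A2", "A3", "A4", "A5"]   # placeholder gene labels
    p2 = ["B1", "B2", "B3"]
    # Cut after position three of parent 1 and position two of parent 2.
    print(crossover_at(p1, p2, 3, 2))
```

With independent cut points, a five-gene and a three-gene parent cut after positions three and two yield two four-gene offspring, so the offspring lengths can differ from both parents while the total number of genes is conserved.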
Figure 11: Operator for single-point crossover (parents #1 and #2 and the resulting offspring #1 and #2)

Main-chain Mutation and Side-chain Mutation

These operators are analogous to the standard bit mutations. Main-chain and side-chain mutations involve the replacement of a randomly selected main- or side-chain group, respectively, by another chemically feasible group. The mutation operators conserve chemical consistency, i.e. the valency considerations of each atom are properly satisfied after each operation. For instance, when a group on the main-chain is mutated to another group, the side-chain groups are correspondingly retained or removed depending on whether the valency of the new group is equal to or less than that of the group that was mutated. Fig. 12 illustrates the main-chain and side-chain mutation operators.
Figure 12: Main- and side-chain mutation operators
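The valency bookkeeping in the mutation operators can be illustrated as follows. This sketch assumes each gene is a pair (main-chain group, tuple of side groups) and uses a small, hypothetical table of how many side-chain positions each main-chain group offers; padding with hydrogen when the new group offers more positions is also an assumption of the example.

```python
# Hypothetical number of side-chain positions per main-chain group.
SIDE_SLOTS = {"C": 2, "N": 1, "O": 0}


def mutate_main_chain(chromosome, position, new_group):
    """Replace one main-chain group, keeping as many of its side groups
    as the new group's valency allows."""
    _, sides = chromosome[position]
    n_slots = SIDE_SLOTS[new_group]
    kept = list(sides[:n_slots])
    kept += ["H"] * (n_slots - len(kept))  # pad with hydrogen (assumption)
    mutated = list(chromosome)
    mutated[position] = (new_group, tuple(kept))
    return mutated


def mutate_side_chain(chromosome, position, slot, new_side):
    """Replace a single side group on the chosen main-chain group."""
    group, sides = chromosome[position]
    new_sides = list(sides)
    new_sides[slot] = new_side
    mutated = list(chromosome)
    mutated[position] = (group, tuple(new_sides))
    return mutated


if __name__ == "__main__":
    unit = [("C", ("H", "Cl")), ("C", ("H", "H"))]
    print(mutate_main_chain(unit, 0, "O"))      # side groups removed
    print(mutate_side_chain(unit, 1, 0, "F"))   # one H replaced by F
```

Both functions return a new chromosome and leave the parent untouched, which keeps elitist copies of the parent intact.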
Insertion and Deletion

The insertion operator randomly inserts a group at a single main-chain or side-chain location. Similarly, the deletion operator randomly removes a small number of main-chain or side-chain groups. Removal of a side-chain group is equivalent to replacing the group with hydrogen. Insertion and deletion operators always lead to a modification in the number of genes of the chromosome being operated on. Examples of these operators are shown in Figs. 13 and 14.
Figure 13: The insertion operator
Figure 14: The deletion operator
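Insertion and deletion can be sketched on the same list-of-genes layout, with each gene a pair (main-chain group, tuple of side groups). The function names and the `min_length` guard are assumptions made for this illustration.

```python
def insert_group(chromosome, position, gene):
    """Insert one gene before the given position; length grows by one."""
    return chromosome[:position] + [gene] + chromosome[position:]


def delete_groups(chromosome, start, count=1, min_length=1):
    """Remove `count` consecutive genes starting at `start`."""
    shortened = chromosome[:start] + chromosome[start + count:]
    if len(shortened) < min_length:
        raise ValueError("deletion would leave too few genes")
    return shortened


def delete_side_group(chromosome, position, slot):
    """Removing a side group is the same as replacing it with hydrogen."""
    group, sides = chromosome[position]
    new_sides = list(sides)
    new_sides[slot] = "H"
    result = list(chromosome)
    result[position] = (group, tuple(new_sides))
    return result


if __name__ == "__main__":
    unit = [("C", ("H", "Cl")), ("C", ("H", "H"))]
    print(insert_group(unit, 1, ("O", ())))
    print(delete_groups(unit, 0))
    print(delete_side_group(unit, 0, 1))
```

Main-chain insertion and deletion change the number of genes, whereas side-group removal keeps the backbone length and only substitutes hydrogen.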
Figure 15: The blending operator (parents 1 and 2 joined end-to-end into a single offspring)

The Blending Operator
The blending operator produces one offspring from the end-to-end connection of two parents. This essentially combines the attributes from both parents. Figure 15 shows the blending of two parent chromosomes. The blending operator radically increases the molecular length.

The Hop-Mutation Operator
When this operator is applied, a randomly selected gene of the molecule exchanges position with another randomly selected gene. Thus, the selected genes 'hop' into the positions occupied by each other. An example of the process is illustrated in Fig. 16. This facilitates small rearrangements in the ordering of the units in a molecule, thus causing a local search for the appropriate isomeric form that increases the fitness. The operation is equivalent to the mutation of two genes of the chromosome to two pre-decided values. Hence the operator is known as hop-mutation.
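Blending and hop-mutation are both simple list operations when a chromosome is stored as a list of genes. The following is a minimal sketch under that assumption; the function names and placeholder gene labels are illustrative.

```python
import random


def blend(parent1, parent2):
    """End-to-end connection of two parents into a single offspring."""
    return list(parent1) + list(parent2)


def hop_mutation(chromosome, i, j):
    """Swap the genes at positions i and j."""
    mutated = list(chromosome)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated


def random_hop(chromosome, rng=random):
    """Hop-mutation at two distinct, randomly chosen positions
    (requires at least two genes)."""
    i, j = rng.sample(range(len(chromosome)), 2)
    return hop_mutation(chromosome, i, j)


if __name__ == "__main__":
    a = ["G1", "G2", "G3"]   # placeholder gene labels
    b = ["G4", "G5"]
    print(blend(a, b))
    print(hop_mutation(a, 0, 2))
```

Hop-mutation preserves the set of genes and only changes their order, while blending yields an offspring whose length is the sum of the parent lengths, so a maximum-length check may be needed after blending.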
Figure 16: The hop-mutation operator

5.5 CASE STUDIES: RESULTS AND DISCUSSION

In this section, two short examples of the polymer design problem, taken from work done by Venkatasubramanian and co-workers [52], are presented. The first case study is based on design cases that had been investigated by Joback and Stephanopoulos [56] using their heuristic-guided enumeration approach. The performance of the genetic search framework is demonstrated for polymers considered by Joback and Stephanopoulos in their study. The problem was to design polymers that satisfy the following property constraints:

Glass transition temperature: Tg > 400 K
Volume resistivity: R > 1 × 10^16 ohm-cm
Thermal conductivity: λ > 1.6 × 10^-7 W/(m K)
Permeability to oxygen: P(O2) < 1.0 cc-mil/100 in^2/day/atm

Note that the property constraints had only one bound, lower or upper, but not both. Such constraints are easier to design for than those with both bounds and tighter tolerances. The latter situation is discussed in the second case study. Given the open-ended nature of the constraints, the sigmoidal fitness function was chosen. The polymer groups considered for the search are the same as Joback's and are listed in Table 1. Appropriate values for the genetic algorithm parameters such as the population size, operator probabilities, etc. are important for an efficient search. The various parameter values used in the case studies are shown in Table 2. Polymer molecules of length 2 to 10 groups were considered. A population of 100 members was used. Steady state reproduction was employed, whereby the population size remained fixed at all times. An elitist policy was used in which the ten fittest members of the population from the parent generation are passed unchanged to the next. These parameter values were chosen by Venkatasubramanian and co-workers after limited experimentation. It should be noted that these might not be the optimal parameter values for the problem. A number of parameters can have a major impact on the design outcome and, in fact, a sub-optimal set of parameters can possibly lead to failure in discovering the target solution. Parametric sensitivity and robustness analyses for the polymer design problem are briefly discussed in the longer case study presented in chapter 13.
Table 1. Palette of groups for the first case study

-CH(C6H5)-        -CH2-           -C(CH3)2-
-C(CH3)(C6H5)-    -CH2-  -CH2-    -O-
-O-C(=O)-O-       -O-C(=O)-       -C(=O)-NH-
-CF2-             -C(=O)-         -CHCl-
-CCl2-
Table 2: GA parameters for the polymer design problem

Parameter                                            Value
Steady state population                              100
Gaussian fitness decay rate (α)                      0.001
Sigmoid slope parameter (β)                          10
Maximum polymer length                               10
Elitist retention with respect to population size    10%

Genetic operator probabilities:
Crossover                                            0.2
Backbone mutation                                    0.2
Sidechain mutation                                   0.2
Hop                                                  0.2
Deletion                                             0.1
Blending                                             0.1
Insertion                                            0.0
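One way to apply the operator probabilities of Table 2 is to draw a single operator per reproduction event with those probabilities as weights. This is a sketch of that reading, not necessarily how the original system scheduled its operators; the dictionary keys and the function name are illustrative.

```python
import random

# Operator probabilities as listed in Table 2 (they sum to 1.0).
OPERATOR_PROBABILITIES = {
    "crossover": 0.2,
    "backbone_mutation": 0.2,
    "sidechain_mutation": 0.2,
    "hop": 0.2,
    "deletion": 0.1,
    "blending": 0.1,
    "insertion": 0.0,
}


def pick_operator(rng=random):
    """Draw one operator name with probability given by the table."""
    names = list(OPERATOR_PROBABILITIES)
    weights = list(OPERATOR_PROBABILITIES.values())
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    counts = {name: 0 for name in OPERATOR_PROBABILITIES}
    for _ in range(10000):
        counts[pick_operator(rng)] += 1
    print(counts)
```

An operator with probability 0.0, such as insertion in Table 2, is never drawn under this scheme.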
Joback reports that there are about 18,000 feasible molecules for this set of constraints and he lists fifty of them. The results of the genetic algorithm are summarized in Table 3. Each independent run of the GA consisted of evolution up to a maximum of 100 generations, with a steady state population of 100 molecules. The table shows the number of distinct polymers found as well as the total number of molecules. The total number typically includes several copies of the same polymer. One can see that each run was successful and hundreds of solutions were identified. The first solution was often found within the first 5-10 generations. When the runs were allowed to evolve for more generations (say, 500 or so), many more solutions were found. As mentioned before, this is a relatively easy design problem since the constraints are open-ended and not tight.

Table 3. Results for case study 1. Initial population size = 100. Total generations = 100

Run #      No. Distinct Solutions Found    Total No. Solutions Found
1          1042                            5278
2          1063                            5274
3          1058                            5204
4          1099                            5434
5          1058                            5161
6          1083                            5530
7          1040                            5215
8          999                             5381
9          1032                            5124
10         1049                            5118
Average    1052.30                         5271.90
In the second case study, certain features were introduced to make the design problem more complex than the first case study. First of all, independent changes of the groups in the side chain as well as in the backbone of the polymers were allowed. In the first case study, the side chain groups could not be changed independently. Next, the constraints were tightened so that the design problem was to identify a molecule whose property values were within ±0.5% of the target properties. It is important to note here that this tolerance was very tight and made the search more difficult. Previous efforts in molecular design had not considered such tight constraints. Lastly, the number of constraints was increased from four to five. The property constraints that were considered in this case study were density, glass transition temperature, linear thermal expansion coefficient, dielectric constant, and specific heat capacity. The properties were calculated by using van Krevelen's group contribution methods.
Figure 17: Base groups used in the second case study (panels: main-chain groups; side-chain groups -H, -CH3, -F, -Cl)
The main-chain and side-chain base groups chosen for this case study are given in Fig. 17. These groups were chosen such that group contribution parameters were available for all the properties considered and that the molecules constructed by the genetic operators satisfied normal chemical bonding constraints. Feasibility constraints were programmed into the genetic algorithm in order to avoid chemically infeasible group combinations. This is another illustration of the powerful ability of the GA-based approach to allow easy incorporation of complex chemical interactions or arrangement constraints. Three target polymers were selected that offered different levels of difficulty in design:

- Polyethylene terephthalate (PET),
- Poly(vinylidene propylene) copolymer (PVP),
- Polycarbonate of bisphenol-A (PC)

Polyethylene terephthalate is the simplest, and the polycarbonate is the most difficult of the three. This is so because PC has nonlinear group interactions where the ordering of the groups matters in determining the properties, and hence the search space is more complex. The properties of these target molecules, computed using group contribution, are listed in Table 4. These were submitted, one molecule at a time, to the genetic design system as the target properties with a tolerance of ±0.5% in the property values.
Table 4: Target polymers and their properties

Target Polymer                          Density      Glass transition    Thermal expansion    Specific heat     Dielectric
                                        ρ, g/cm3     temperature         coefficient          capacity          constant
                                                     Tg, K               α, K^-1              Cp, J/kg·K        ε
Polyethylene terephthalate              1.342        340                 2.96 × 10^-4         1153              3.44
Poly(vinylidene propylene) copolymer    1.175        249                 2.77 × 10^-4         1378              2.14
Polycarbonate of bisphenol-A            1.184        437                 2.85 × 10^-4         1134              3.00
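The ±0.5% acceptance test on the five target properties can be sketched as below. The target values are those of Table 4; the dictionary keys, the function name and the idea of checking every property against a relative tolerance are assumptions of this illustration, and the predicted values would come from a group contribution calculation that is not shown.

```python
# Target property values from Table 4.
TARGETS = {
    "PET": {"density": 1.342, "Tg": 340.0, "alpha": 2.96e-4,
            "Cp": 1153.0, "dielectric": 3.44},
    "PVP": {"density": 1.175, "Tg": 249.0, "alpha": 2.77e-4,
            "Cp": 1378.0, "dielectric": 2.14},
    "PC":  {"density": 1.184, "Tg": 437.0, "alpha": 2.85e-4,
            "Cp": 1134.0, "dielectric": 3.00},
}


def within_tolerance(predicted, target, tolerance=0.005):
    """True if every predicted property lies within the relative
    tolerance (0.5% by default) of its target value."""
    return all(abs(predicted[name] - value) <= tolerance * abs(value)
               for name, value in target.items())


if __name__ == "__main__":
    candidate = dict(TARGETS["PET"], Tg=341.0)   # 0.29% off in Tg
    print(within_tolerance(candidate, TARGETS["PET"]))
```

For a glass transition temperature target of 340 K, the ±0.5% window is only 1.7 K wide on either side, which indicates how tight the constraints are compared with the open-ended bounds of the first case study.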
Tables 5 and 6 summarize the performance of the algorithm averaged over fifty runs for different design scenarios. Two different design scenarios were considered. In the first, the program was asked to design monomers that varied in length from 2 to 7 units on the backbone, even though the target polymer's length was less than 7 (Tables 5a and 5b). This made the search more difficult as there were more possibilities with increased length. In the second test case, the permitted monomer length was from 2 to 10 units (Tables 6a and 6b). In each case, two different initializations of the starting population were considered: (i) random monomer lengths with random backbone and sidechain groups selected from Fig. 17 (Tables 5a and 6a) and (ii) random carbon backbone of varying lengths with H sidechain (Tables 5b and 6b). The stochastic nature of the genetic algorithm necessitated the results to be averaged over several runs for each case in order to get statistically meaningful results. Each run terminated at the 200th generation. All the runs employed the identical set of parameters given in Table 2. The Gaussian fitness function was used this time since such a fitness function is more appropriate for bounded constraints.
Table 5a: Results for random groups in the backbone and side-chain. Monomer length = 2-7

Target Polymer                          Avg. generation #      Avg. # of solutions at      Percentage of
                                        when first solution    the end of GA search        runs successful
                                        was found
Polyethylene terephthalate              28.2                   10.5                        100%
Poly(vinylidene propylene) copolymer    11.3                   14.0                        100%
Polycarbonate of bisphenol-A            41.0                   3.9                         100%

Table 5b: Results for random -CH2- groups

Target Polymer                          Avg. generation #      Avg. # of solutions at      Percentage of
                                        when first solution    the end of GA search        runs successful
                                        was found
Polyethylene terephthalate              13.6                   11.3                        100%
Poly(vinylidene propylene) copolymer    11.3                   14.3                        100%
Polycarbonate of bisphenol-A            58.0                   3.8                         100%
The results for the success rate indicate that the genetic search did very well in general. For the polymers PET and PVP, the search discovered these polymers in every run. Furthermore, it discovered multiple instances of these polymers with exactly the same structure and also found them fairly quickly, as seen from the low average generation count. In addition, it also found several other structures, which had very high fitness values (typically, 0.90 or better). It took longer to find the solution for L=10 in comparison with L=7, as the search space was larger for the former. It is interesting to note that for L=10, the genetic search discovered dimers as well as monomers. With respect to computational effort, the longest run (for polycarbonate in Table 6b) took about 5 minutes in real-time (about 2 cpu secs) on a Sparc 10 workstation.

Table 6a: Results for random groups in the backbone and side-chain. Monomer length = 2-10

Target Polymer                          Avg. generation #      Avg. # of monomers     Avg. # of dimers       Percentage of
                                        when first solution    found at the end of    found at the end of    runs successful
                                        was found              the GA search          the GA search
Polyethylene terephthalate              28.4                   9.1                    7.8                    100%
Poly(vinylidene propylene) copolymer    12.1                   6.7                    14.8                   100%
Polycarbonate of bisphenol-A            60.6                   2.9                    5.8                    100%

Table 6b: Results for random -CH2- groups

Target Polymer                          Avg. generation #      Avg. # of monomers     Avg. # of dimers       Percentage of
                                        when first solution    found at the end of    found at the end of    runs successful
                                        was found              the GA search          the GA search
Polyethylene terephthalate              14.7                   8.5                    8.1                    100%
Poly(vinylidene propylene) copolymer    12.4                   6.9                    13.9                   100%
Polycarbonate of bisphenol-A            73.1                   1.7                    3.5                    76%
The polycarbonate was the most difficult structure to identify, as mentioned earlier. Consequently, it took more generations on average to discover this polymer. However, the genetic search did discover this polymer as well, with a 100% success rate for the L=7 case. For the L=10 case, it was less successful (76%) when the initial population was a random collection of -CH2- chains. This was so because the members of the initial population were very different in their structure from the target and hence it took longer to discover the correct groups and structure. It was observed that if the evolution were allowed to continue for 300 generations in the failed runs, the genetic search was able to discover the target in most cases. In the case of random groups initialization, some of the right groups (like benzene or OCO) were already present in the initial population. This gave a better start and hence a quicker search.
5.6 CONCLUSIONS

This chapter has illustrated the use of genetic algorithms or genetic programming for computer-aided molecular design. A background of GAs, their theory and implementation has been provided. Though the two test problems discussed are relatively small, they are sufficient to present a flavor of the utility of a genetic search method for CAMD. As demonstrated by the case studies, the genetic algorithm framework offers a number of advantages. First of all, it is a multiple-point search technique that examines a set of solutions and not just one solution; this and the stochastic nature of the algorithm help the search to escape local minima traps. Secondly, it is not derivative-based and is therefore able to avoid the difficulties faced by mathematical programming techniques in that respect. Furthermore, the framework allows relatively easy expression of the rich and complex chemistry of molecules, thus allowing easy integration of whatever heuristic knowledge one might have about the problem into the genetic framework to speed up the design process. This is illustrated in the larger polymer design case study discussed in chapter 13. One can appreciate the significant advantage of having a multi-point search in that, regardless of whether the true target solution is located, a number of near-optimal solutions are presented to the designer. This becomes particularly significant for the design of molecules that are too complex for the forward predictions to be completely reliable. In such cases, one would like a range of design candidates that could be subjected to further testing with actual synthesis or experimentation in a laboratory.

A GA search strategy also suffers from some drawbacks. Mainly, the heuristic nature of the search means that no guarantee can be offered of finding the target solution. Secondly, the selection of good parameter values for a given problem requires some degree of experimentation. But then, these shortcomings are true of other heuristic approaches as well. And for a general nonlinear optimization search problem, the target, i.e. the global optimum solution, cannot be guaranteed in any case. Notwithstanding these drawbacks, the advantages of using a GA-based inverse strategy more than warrant its use as a design system. The appendix presents a bigger, more complex version of the polymer design problem wherein the merits of the algorithm become even more apparent. The study also briefly addresses issues related to parametric sensitivity and the robustness of the GA, which are of vital importance as far as the practical utility and application of the design system is concerned.
5.7 NOMENCLATURE AND ABBREVIATIONS

N          population size
F          fitness
cumf       cumulative fitness
sf         scaled fitness
P(·)       probability
E(·)       expected value
δ          defining length of a given schema
o          order of a given schema
L          (maximum) length of chromosome
f_ξ(t)     average observed fitness of schema ξ at time t
f̄(t)       average fitness of the population at time t
N_ξ(t)     number of members in schema ξ at time t
Pc         probability of crossover
Pm         probability of bit-mutation
O          state or phenotype space
W          genetic or genotype space
α          decay rate for Gaussian fitness function
[symbol illegible in source]   slope parameter for sigmoidal fitness function

CAMD       Computer-Aided Molecular Design
GA(s)      Genetic Algorithm(s)
PET        Polyethylene terephthalate
PVP        Poly(vinylidene propylene) copolymer
PC         Polycarbonate of bisphenol-A
Computer Aided Molecular Design: Theory and Practice
L.E.K. Achenie, R. Gani and V. Venkatasubramanian (Editors)
© 2003 Elsevier Science B.V. All rights reserved.
Chapter 6: A Hybrid CAMD Method
P. M. Harper, M. Hostrup & R. Gani
6.1 INTRODUCTION

As in any design problem, the design process in CAMD also needs to generate and evaluate alternatives in order to find the desired chemical product. In the case of CAMD, the alternatives are chemically feasible molecules (or mixtures of molecules) and the feasible candidate molecules (or mixtures) are those that satisfy the design specifications represented by a set of property constraints. This chapter describes a framework for a hybrid CAMD method. The design process, according to this framework, is divided into three phases:

• The pre-design phase: definition phase of the CAMD problem.
• The design phase: solution phase of the CAMD problem in terms of generation of feasible candidates.
• The post-design phase: analysis phase of the CAMD problem, where the final selection is made.

Figure 1 illustrates the principal ideas behind this framework through a simple CAMD problem where functional groups are used as the building blocks for generating feasible molecular structures.
[Figure 1: Illustration of the CAMD framework. Pre-design: the input "I want acyclic alcohols, ketones, aldehydes and ethers with solvent properties similar to Benzene" is interpreted as a set of building blocks (CH3, CH2, CH, C, OH, CH3CO, CH2CO, CHO, CH3O, CH2O, CH-O) plus a set of numerical constraints. Design (start): a collection of group vectors such as 3 CH3, 1 CH2, 1 CH, 1 CH2O; all group vectors satisfy the constraints. Design (higher levels): second-order groups and groups from other GCA methods; refined property estimation; ability to estimate additional properties or use alternative methods; rescreening against constraints. Start of post-design: the generated molecular structures.]
The application of the framework illustrated in Figure 1 requires a number of methods and tools that need to be integrated in order to provide a flexible, reliable and robust solution to a large range of CAMD problems. Figure 2 highlights the architecture of such a hybrid CAMD method. In this chapter, the term "product" will be used to mean molecules as well as mixtures.
[Figure 2: The hybrid CAMD method and framework for integration. Block labels: Problem Specification; Constraint Selection (pre-design phase); Compound Identification through a Database Approach or a multi-level Approach (design phase); Result Analysis and Verification; Candidate Selection (post-design phase); Molecular Modelling; Databases; External Tools.]

6.2 PRE-DESIGN PHASE
The CAMD formulation in terms of design specifications is performed in a pre-design phase where the CAMD problem is described in terms of identified design goals, desired molecule type(s) and properties. As shown in Figure 2, this pre-design phase consists of a problem specification step and a method & constraint selection step, which includes an algorithm for problem formulation.
6.2.1 CAMD Problem Specification

The design process starts with a definition of the basic needs (or ultimate goals). The type of goal may influence many of the design decisions that will need to be made during the later phases of the CAMD problem solution. The goal should describe the function of the desired chemical product, the environment/equipment where the function should be performed, as well as the capabilities that are desirable/undesirable. For example, in the case of design of solvents, the desired solvent must dissolve a specified solute(s), it must be selective if other soluble solutes are also present, it must not cause a negative environmental impact and it should be easy to recover. The description of the goals of CAMD can be of different types; a few examples are given below.

"Find a solvent suitable for removing phenol from a waste water stream by liquid-liquid extraction. The solvent should pose a low health risk for the users, should be environmentally friendly and could be a single molecule or a mixture." This is an example of a well-defined problem, as almost all necessary details are given. From the specified details, the properties that are needed (such as solubility, EH&S properties, liquid immiscibility, etc.) can be identified. The goal values for the properties are not given, but if the objective is to find the best solvent, then it must have the highest solubility and the least environmental impact.

"Identify a molecule(s) with the same pure component properties as benzene, such as normal boiling point, normal melting point, octanol-water partition coefficient, solubility parameter as that of benzene but with a much lower environmental impact in the work place." Again, this is a well-defined problem, with even the goal values given because the property values of benzene are already known.

"Find a solvent to be used for washing off an equipment (for example a printing press) which is environmentally friendly and cheap." Here, the problem is not very well defined because, while some of the constraints are defined, one piece of important information is missing: what should be dissolved by the solvent? Furthermore, the definition of 'cheap' depends on the process involved as well as the current solution used.

"Find an additive (molecule or mixture) for a tape so that the tape will stick to a painted surface for a year and then can be removed without pulling off the paint." This is another example of a not very well-defined problem, because we need more information on the glue that will be added to the tape as well as the various compositions of paints where the tape will stick. The main question here is: which properties are we looking for and what are their goal values?

"Find a molecule that will have inhibition activity against Alzheimer's disease." Problems of this type, although very well defined in terms of property, are difficult to solve because of the size of the potential search space. If, however, we add the restriction "search only among the isomers of XX", where "XX" is a particular molecular type, then we have a well-defined problem, even though the number of possible isomers may be quite large and prediction of the inhibition activity as a function of the molecular structure may be quite difficult.

"Find all molecules that form an azeotrope with ethanol at a pressure of 1 atm." This is not a typical product design problem. CAMD, however, can also solve problems like this. It is not well defined because the search space is potentially very large. However, if we select a molecule type (for example, acyclic hydrocarbons of molecular weight less than 100), then the problem becomes well defined.

The above examples highlight the need for a knowledge-based system that can identify the needed properties from the general problem specifications presented above. Once the properties have been identified, their goal values need to be specified and methods for obtaining the necessary property values need to be selected. That is, the qualitative problem specification needs to be transformed into a quantitative problem specification.
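To make the idea of a quantitative problem specification more concrete, the following is a minimal sketch of how the benzene-replacement example might be written down as data. The class names, the method labels and the numerical windows are illustrative assumptions, not part of the chapter; only the approximate boiling and melting points of benzene (about 353 K and 279 K) are taken as known values.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PropertyTarget:
    """One quantitative design constraint: lower <= property <= upper."""
    name: str
    lower: float
    upper: float
    method: str  # label of the estimation method (hypothetical label)

    def satisfied_by(self, value: float) -> bool:
        return self.lower <= value <= self.upper


@dataclass
class ProblemSpec:
    """Quantitative CAMD problem: building blocks plus property targets."""
    building_blocks: List[str]
    targets: List[PropertyTarget]

    def violations(self, estimates: Dict[str, float]) -> List[str]:
        """Names of targets that the estimated property values do not meet."""
        return [t.name for t in self.targets
                if not t.satisfied_by(estimates[t.name])]


# Windows around the benzene values are assumptions chosen for illustration.
spec = ProblemSpec(
    building_blocks=["CH3", "CH2", "CH", "C", "OH", "CH3CO", "CH2CO"],
    targets=[
        PropertyTarget("Tb", 340.0, 365.0, "group contribution"),
        PropertyTarget("Tm", 265.0, 290.0, "group contribution"),
    ],
)

# Hypothetical property estimates (in K) for one candidate molecule.
candidate = {"Tb": 352.0, "Tm": 240.0}
failed = spec.violations(candidate)  # the melting point window is missed
```

Keeping the estimation method next to each target mirrors the requirement that goal values and the methods for obtaining property values are selected together.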
6.2.2 Method & Constraint Selection

The objective of this step is to transform the qualitative problem specification from the previous step into a quantitative form that is suitable for CAMD problem solution during the design phase. The quantitative problem specification consists of the following:

• Identify the needed properties: this matches the qualitative specification with the behavior (properties) of the chemical product.
• Identify the goal values of the needed properties: this matches the actual goal of the product with respect to its function and behavior.
• Identify the methods for obtaining the property values: this determines how the property (behavior) of the product will be obtained.
• Identify the building blocks for generation of molecular structures or candidate chemicals for mixture design: this determines the search space and the scale of the molecular structural model.

A knowledge base can be very useful in assisting the transformation of the qualitative problem specification into a quantitative one. A knowledge base particularly suitable for applications involving solvent-based separation processes is highlighted below.
Knowledge base

The objective of this knowledge base is to assist in the transformation of general qualitative solvent design problem specifications into quantitative ones that are suitable for CAMD problem solution. The information contained in the knowledge base is ordered as a hierarchical system, with the application types of the solvent-based process at the top and the properties and property values at intervals of specified conditions of temperature, pressure and/or composition at the bottom. Figure 3 illustrates a section of the information tree belonging to this knowledge base. It can be noted that the property entries in the information tree in Figure 3 have three branches:

Essential Properties: The properties in this branch are essential for the function of the desired product and are most often related either to the phase behavior of the molecule or to the driving forces for the separation operation the molecule is intended for. For example, the constraint that the molecule must be in the liquid state at the operational temperatures of the process creates an essential requirement that the boiling point of the molecule is above the operational temperature while the melting point is below it. Also, if the molecule is to be used as a solvent for liquid-liquid extraction, it must cause a phase split and have a density different from that of the solutes.
[Figure 3: Partial information tree of the knowledge base]

Desirable Properties: Desirable properties are related to the performance or efficiency of a product in a specified application. The product may still be acceptable if these properties are not matched. They become important during the selection of the feasible candidates and during performance evaluation in order to determine the optimal design. As a rule of thumb, fixed lower or upper limits cannot usually be set for these desirable properties. Generally, the aim is to have the highest or lowest possible value for the identified desirable properties. An example of a desirable property is the selectivity towards a specified solute that must be extracted from a mixture with other solutes through a solvent-based extraction process such as liquid-liquid extraction. For convenience, the undesirable properties are also included in this class of properties.

EH&S and Special Properties: These properties are associated with the performance of the product in a specific operation or function and its effect on the surroundings (or environment) as a result of its use and emission. These properties may be specified as essential, desirable and/or undesirable. However, they are placed in a separate class because methods for their direct estimation are usually not available. Consequently, they may be considered in the post-design phase through database search or even through direct or indirect experiments. In this way, this type of potentially expensive analysis is reserved only for those candidates that satisfy all other product criteria. Note that some of the essential and desired products may implicitly also satisfy the EH&S and special property constraints. Examples of the special properties are those related to smell, color and taste.

Each property branch is divided into a pure properties leaf and a mixture properties leaf. The pure properties are further divided into primary properties, secondary properties and functional properties (this is not shown in Figure 3), while mixture properties belong to the class of functional properties (see Chapter 1).
Note that some mixture (functional) properties such as solubility may be calculated as a function of primary properties, while some other functional properties and secondary properties may be calculated as a primary property. For example, if a rigorous model for estimation of solute solubility is not available, the necessary property values may be estimated through solubility parameters. Although the solubility parameter is, by definition, both a functional property (temperature dependence) and a secondary property (function of molar volume and heat of vaporization), it becomes a primary property if the temperature is fixed at 298 K and if it is directly correlated as a function of molecular structural parameters. The knowledge base contains this information and is useful when a needed model of one type is not available.

In the case of functional properties, the CAMD problem specification needs to specify the range of conditions where these properties must be matched, that is, the intervals of conditions of operation as a function of temperature, pressure and/or composition. In addition to the information contained in the partial information tree of Figure 3, the knowledge base may also include tabular data linking a particular CAMD problem type with corresponding properties, linking properties to EH&S analyses, as well as data related to the CAMD problem type and the phenomena involved. Three examples of such tabular data are given in Tables 1, 2 & 3.
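The remark on the solubility parameter can be illustrated with a short calculation. The sketch below assumes the usual Hildebrand definition, delta = sqrt((dHvap - R T) / Vm), which is consistent with the description of the property as a function of molar volume and heat of vaporization but is not spelled out in the chapter. The function name and the benzene input values (roughly 33.9 kJ/mol and 89.4 cm3/mol near 298 K) are illustrative assumptions.

```python
import math

R = 8.314  # gas constant, J/(mol K)


def hildebrand_delta(dh_vap: float, v_m: float, temperature: float = 298.0) -> float:
    """Solubility parameter in MPa^0.5.

    dh_vap: heat of vaporization in J/mol
    v_m: liquid molar volume in m3/mol
    temperature: temperature in K (fixed at 298 K by default)
    """
    cohesive_energy_density = (dh_vap - R * temperature) / v_m  # in Pa
    return math.sqrt(cohesive_energy_density) / 1000.0  # Pa^0.5 -> MPa^0.5


# Approximate literature-style inputs for benzene (assumed for illustration).
delta_benzene = hildebrand_delta(dh_vap=33900.0, v_m=89.4e-6)
```

With the temperature argument left at its default, the function depends only on two pure-component quantities, which is the sense in which the property can be treated as fixed-temperature data and correlated directly with molecular structure.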
Table 1: List of separation techniques and their corresponding separation phenomena (defined by class and phases involved)

Separation technique          Class                 Phases involved
Crystallization               Property difference   Solid-Liquid
Distillation                  Property difference   Vapor-Liquid
Distillation plus decanter    Property difference   Vapor-Liquid, Liquid-Liquid
Extractive distillation       Solvent-based         Vapor-Liquid
Azeotropic distillation       Solvent-based         Vapor-Liquid-Liquid
Liquid-Liquid extraction      Solvent-based         Liquid-Liquid
Super-critical extraction     Solvent-based         Fluid-Vapor-Liquid
In the knowledge base (Table 1), the properties important to the function in a particular application are listed along with the relative property differences needed to perform the function (column 2 of Table 1) and the associated phases involved in the particular application (column 3 of Table 1). In the knowledge base for essential and desirable properties as a function of application type (Table 2), the listed properties should only be used as a starting point. Other properties may need to be added and some of the listed properties may need to be removed, depending on the particular CAMD problem specifications. The EH&S properties listed in Table 3 are given as general guidelines based on the phases involved in the applications listed in Table 2. Note that the consideration of EH&S properties is often dependent on the entire process (how the solvent/product is handled and the possible routes of discharge to the environment). Nevertheless, the consideration of EH&S related properties on a unit operation level can address work place health and safety issues associated with non-routine releases, as well as make it possible to use more rigorous approaches to environmental impact minimization (see also section 6.4.2).
Table 2: List of important properties for some separation techniques

[Table 2 body not reproduced. Columns (solvent design applications, each split into E and D): L-L Extraction, Extractive Distillation, Azeotropic Distillation, Solid Separation, Gas Absorption. Rows, pure properties: T, Tm, P, Pv, Hvap. Rows, mixture properties: Selectivity, SL, SP, DC, Phase-split, Azeotrope, Pm, gm, H.]

Note: E is Essential; D is Desirable; L-L is liquid-liquid; the definitions of property variables in column 1 are given in Nomenclature.
Table 3: List of properties for addressing EH&S considerations

[Table 3 body not reproduced. Columns: Properties; Environmental Concern (Health, Safety, Environment). Rows, grouped as implicit and explicit properties: Toxicity, Biological persistence, Chemical stability, Reactivity, Biodegradability, Pv, H (in water), Log P, Log Ws, Flash point, BOD, p (vapor), Evaporation rate, LD50, ODP.]
Problem Formulation Algorithm

The objective of the problem formulation algorithm is to transform the qualitative problem specification into a quantitative one through a combination of the use of the knowledge base, insights and experience. It is a multi-step process requiring different levels of information. A step-by-step algorithm that may be useful for CAMD problem formulation is given below. The corresponding representation of the algorithm as a block diagram is shown in Figure 4.

• List the unit operations to be considered.
• For each unit operation:
  o Retrieve the known properties of the compounds the designed compound is to be used with.
  o Obtain the operational ranges of temperature and pressure along with the composition ranges for the compounds in the system.
  o Identify the property models available for estimation of the needed pure and mixture properties.
  o Extract the list of relevant pure and mixture properties from the knowledge base for the unit operation. If the selected property models from the previous step are unable to estimate the needed properties, consider either adding a new model or estimating a similar property that can be estimated reliably.
  o If any of the design properties require information about the other compounds in the system in order to set up the target values, compare the requirements with the list of known properties obtained from above. If some requirements cannot be fulfilled, the properties are removed from the set of design criteria.
• Create a superset of criteria by combining the sets of identified properties for each of the unit operations.
• For each of the properties in the superset, create the target ranges (the design constraints) by combining the property intervals identified for each of the unit operations and uses. The identified property intervals represent the design criteria satisfying the requirements of all the operations examined.
• List the methods available for predicting the required properties.
• List the molecule types that can be handled by the property prediction methods and the predictive thermodynamic models.
• From the list of compound types for which property prediction methods exist, create the list of building blocks used to create/assemble the molecules in the design phase.
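The step that combines property intervals from several unit operations can be sketched as an interval intersection. This is only one plausible reading of "combining" (the range that satisfies every operation at once); the function name, property labels and numerical ranges below are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def combine_targets(per_operation: Iterable[Dict[str, Interval]]) -> Dict[str, Interval]:
    """Build the superset of properties and intersect their intervals."""
    combined: Dict[str, Interval] = {}
    for constraints in per_operation:
        for prop, (lo, hi) in constraints.items():
            old_lo, old_hi = combined.get(prop, (float("-inf"), float("inf")))
            # Keep only the part of the range acceptable to every operation.
            combined[prop] = (max(old_lo, lo), min(old_hi, hi))
    infeasible = sorted(p for p, (lo, hi) in combined.items() if lo > hi)
    if infeasible:
        raise ValueError("empty target range for: " + ", ".join(infeasible))
    return combined


# Hypothetical requirements (temperatures in K) from two unit operations.
extraction = {"Tb": (350.0, 500.0), "Tm": (float("-inf"), 280.0)}
recovery = {"Tb": (380.0, 450.0)}
targets = combine_targets([extraction, recovery])
```

Raising an error on an empty range makes conflicting requirements visible at the formulation stage instead of producing a design problem with no feasible candidates.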
[Figure 4: Block diagram of the problem formulation algorithm]

The result of the problem formulation algorithm is:

• A list of building blocks to use (e.g. CH3, CH2, CH, OH, COOH).
• A set of inequality constraints based on pure component properties.
• A set of inequality constraints based on mixture properties (along with information regarding the conditions at which to evaluate the properties).
• Information regarding the methods (pure and mixture) available for the evaluation of the constraints.

A database containing information on type of molecules versus building blocks (for example, groups) and type of molecules versus specific EH&S properties helps in the problem formulation. For example, functional groups (building blocks) such as "OH" and "COOH" must exist in alcohols and acids, respectively. Therefore, selection of molecular types such as alcohols and acids could be linked to automatic selection of the "OH" and "COOH" functional groups in the set of building blocks. Similarly, aromatic compounds are likely to be carcinogenic, while chlorides may cause corrosion and have a negative impact on environmental indicators. Therefore, choice of these EH&S properties as constraints means automatic exclusion of the corresponding compounds and, therefore, their corresponding building blocks. The first two steps in Fig. 1 also highlight this feature: for the specified type of desired molecules, the corresponding building blocks have been selected. A good exercise for the reader would be to consider the group tables given in chapters 2 and 4 and prepare a table of molecular type versus groups (building blocks). An example of such a table is given below in Table 4 for simple mono-functional molecules.
Table 4: Molecule type versus groups

Molecule Type            Groups (building blocks)
Acyclic hydrocarbons     CH3, CH2, CH, C
Aromatic hydrocarbons    CH3, CH2, CH, C, ACH, AC, ACCH3, ACCH2, ACCH
Alcohols                 CH3, CH2, CH, C, OH
Ketones                  CH3, CH2, CH, C, CH3CO, CH2CO
Esters                   CH3, CH2, CH, C, CH3COO, CH2COO, HCOO, COO
Acids                    CH3, CH2, CH, C, COOH
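The link between molecule types and building blocks in Table 4 lends itself to a simple lookup. The sketch below encodes the table as a dictionary and forms the building-block set for a chosen list of molecule types; the names `GROUPS_BY_TYPE` and `building_blocks` are hypothetical, and the approach is just one straightforward way to automate the selection described above.

```python
from typing import Dict, List

BACKBONE = ["CH3", "CH2", "CH", "C"]

# Table 4 encoded as data: molecule type -> groups (building blocks).
GROUPS_BY_TYPE: Dict[str, List[str]] = {
    "Acyclic hydrocarbons": BACKBONE,
    "Aromatic hydrocarbons": BACKBONE + ["ACH", "AC", "ACCH3", "ACCH2", "ACCH"],
    "Alcohols": BACKBONE + ["OH"],
    "Ketones": BACKBONE + ["CH3CO", "CH2CO"],
    "Esters": BACKBONE + ["CH3COO", "CH2COO", "HCOO", "COO"],
    "Acids": BACKBONE + ["COOH"],
}


def building_blocks(molecule_types: List[str]) -> List[str]:
    """Union of the groups for the requested molecule types, order preserved."""
    selected: List[str] = []
    for molecule_type in molecule_types:
        for group in GROUPS_BY_TYPE[molecule_type]:
            if group not in selected:
                selected.append(group)
    return selected


blocks = building_blocks(["Alcohols", "Ketones"])
```

Excluding a molecule type for EH&S reasons (for example, leaving out "Aromatic hydrocarbons") then removes its characteristic groups from the search space without any further bookkeeping.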
The information related to the quantitative CAMD problem specification is now passed to the next phase of the design process, that is, the design phase of the hybrid CAMD method.
6.3 DESIGN PHASE
Given the quantitative problem specification, the objective of the design phase is to apply a suitable method for generating the feasible candidates. Here, the feasible candidates can be a set of molecules (or mixtures) that satisfy all property constraints and/or the molecule (or mixture) that not only satisfies the constraints but also reflects the optimal performance. Whether it is a set of candidates or an optimal candidate (or a set of local optimal candidates) depends on the CAMD algorithm used in this design phase. In principle, any of the CAMD methods described in Chapters 2-5 & 7 can be used in this design phase.

The hybrid CAMD method described in this section employs successive generate & test approaches ordered in a hierarchy based on the level of molecular structural information used and the corresponding property estimation method. The properties are also ordered according to a hierarchy where the primary pure properties are estimated first, followed by secondary pure properties, followed by functional pure properties, and finally, the mixture properties. Note that the implicit EH&S properties and the implicit special properties are analyzed in the post-design phase in this hybrid CAMD method.

In the CAMD solution approach of the generate & test type, all feasible molecules are generated from a set of building blocks and subsequently tested against the design specifications to screen out the alternatives that do not fulfill the requirements. The so-called combinatorial explosion problem associated with CAMD algorithms in general, and generate & test approaches in particular, is avoided through the employed multi-level approach. That is, through successive steps of generation and screening against the design criteria, the level of molecular detail is increased only on the feasible candidates and not on all possible combinations.
6.3.1 H y b r i d G e n e r a t e & Test CAMD A l g o r i t h m The hybrid generate & test based CAMD algorithm has four levels. Each level has its own generate & test algorithm. Higher levels use additional molecular structural information compared to lower levels. The fundamental basis for the developed algorithm is the continuous refinement of the results obtained from each level. The lower levels have a low computational complexity (i.e., it is possible to generate a large number of alternatives without excessive calculations) but do not in all cases generate all the information necessary to perform the estimation of the important properties. The higher levels are more complex and cannot handle a very large number of alternatives without application of a significant computational effort. Consequently, the design strategy of the developed algorithm is a hybrid approach where the lower levels are used to "pick out" promising candidates from the search space while the higher levels use the output from the preceding level as input. The net effect of this approach is that the results are refined from level to level without spending computational resources on candidates, which are unable to fulfill the requirements. In outline form the characteristics of the levels are: 9 Level 1 generates group vectors by combining groups from a basic group-set (for example, the UNIFAC first-order g r o u p s - see the groups sets also used in chapters 2 and 4). Based on the equations and feasibility considerations given by Harper (2000), the algorithm generates all the feasible molecular representations without suffering from combinatorial explosion. The testing of the generated molecules against the design criteria is performed using methods based on the Group Contribution Approach (GCA). 9 Level 2 takes the results from level 1, that is, the molecules surviving the test step of level 1 and combines the members of each group vector to form new molecules (including isomers). 
• Level 3 brings the molecules out of the (pseudo) macroscopic group representation from level 2 into a microscopic (atom-based) representation by replacing the group information with the equivalent atomic information.
• Level 4 expands the microscopic information by adding a 3-dimensional representation to the results from level 3.

This multi-level procedure is illustrated through an example in Figure 5. Note that entry is possible at any level as long as the appropriate data is available.
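The level-to-level refinement described above can be sketched as a simple funnel. This is only an illustration under stated assumptions: `run_levels` and the toy `generate`/`test` functions are hypothetical names introduced here and are not part of the described implementation; the numbers stand in for candidate molecules.

```python
def run_levels(seeds, levels):
    """Pass candidates through successive (generate, test) levels.

    Each level expands only the survivors of the previous level, so the more
    expensive levels never see candidates that were already rejected.
    """
    survivors = list(seeds)
    history = []  # (candidates generated, candidates surviving) per level
    for generate, test in levels:
        candidates = [c for s in survivors for c in generate(s)]
        survivors = [c for c in candidates if test(c)]
        history.append((len(candidates), len(survivors)))
    return survivors, history


# Toy stand-ins: level 1 keeps even numbers, level 2 refines each survivor
# into two variants and keeps those below a bound.
levels = [
    (lambda x: [x], lambda x: x % 2 == 0),
    (lambda x: [x, x + 0.5], lambda x: x < 7),
]
survivors, history = run_levels(range(1, 11), levels)
```

In this toy run the second level evaluates 10 refined candidates instead of the 20 it would see without the first screening step, which is the effect the hybrid strategy aims for.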
Figure 5: Illustration of the 4-level CAMD hybrid method

6.3.2 Level 1: Generation of group vectors from first-order groups

Level 1 generates vectors of groups (fragments) by combining groups from the first-order group-set. Each vector (set of groups) is capable of forming at least 1 feasible molecular structure. Simultaneous calculation of related properties (that are dependent only on first-order groups) and screening of the generated structures are performed in order to control the problem size and execution time. The algorithm here is based on the group classification work of Gani et al. (1991) but uses a different and more efficient method of group vector assembly. The main features of the new algorithm are:

• Building blocks are classified according to type.
• Feasibility rules are based on the number of groups from a specific class a compound may contain.
• Valence rules are used to determine the number of groups with 1, 2, 3 & 4 connections to be used in molecule structure generation.

Generation algorithm for level 1

The main steps of the level 1 algorithm are illustrated through Figure 6. By using equation A.3 (see Appendix A of this chapter) repeatedly in conjunction with the classification system and the feasibility rules it is possible to generate only compounds (group vectors) fulfilling the feasibility requirements (i.e. no compounds are generated and subsequently discarded due to violation of the feasibility requirements). The algorithm for generation of feasible compound representations is:

1. Set $C$ (the collection of designed compounds) equal to the empty set.
2. Set $P_{c,v}$ (the collections of compound sub-blocks from different classes and categories), where $c$ = P, S, D, T, Q and $v$ = 1, 2, 3, 4, 5, equal to the empty set.
3. Give the list of building blocks (including the classifications).
4. Select the compound type (acyclic, cyclic or aromatic).
5. Give the maximum ($K_{\max}$) and minimum ($K_{\min}$) number of groups in a compound.
6. For all $K$ ($K = K_{\min}, \ldots, K_{\max}$):
   a. Find all integer solutions ($V_{i,K}$; $i = 1, \ldots, I_K$) to equations A.4 & A.5.
   b. For all solutions $V_{i,K}$; $i = 1, \ldots, I_K$:
      i. Find all integer solutions ($G_{i,j}$; $j = 1, \ldots, J_i$) to equations A.6-A.11 as given in Appendix A.
      ii. For all solutions $G_{i,j}$; $j = 1, \ldots, J_i$:
         A. For each $n_{c,v}$, where $c$ = P, S, D, T, Q and $v$ = 1, 2, 3, 4, 5, perform a lookup in $P_{c,v}$ to see if results are present for the $n_{c,v}$ key. If not, find all possible combinations when selecting $n_{c,v}$ groups from the collection of available groups $N_{c,v}$ (the number of combinations), and store the combinations in $P_{c,v}$ under the $n_{c,v}$ key.
         B. Find all combinations of the entries in $P$ under the $n_{c,v}$ keys from $G_{i,j}$. Add each solution to $C_i$ (the number of combinations can be calculated by equation A.12).
      iii. Screen $C_i$ against the property constraints that can be handled in level 1 (see the next section for details), discarding any compound not fulfilling the requirements.
      iv. Add the surviving compounds from $C_i$ to $C$.
7. Set $K = K + 1$.
8. If $K < K_{\max}$ go back to 6, else continue.
9. STOP
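The role of the valence rules in step 6a can be illustrated with a minimal sketch. It is not a reproduction of equations A.4 and A.5, which are not shown in this section; it assumes only the standard condition for an acyclic (tree-like) molecule of $K$ groups, namely that the free connections sum to $2(K-1)$, which together with $n_1 + n_2 + n_3 + n_4 = K$ gives $n_1 = n_3 + 2 n_4 + 2$. The function name `connection_counts` is hypothetical.

```python
def connection_counts(k):
    """Return all (n1, n2, n3, n4) for an acyclic molecule built from k groups.

    n_v is the number of groups with v free connections. The tuples satisfy
    n1 + n2 + n3 + n4 = k and n1 + 2*n2 + 3*n3 + 4*n4 = 2*(k - 1).
    """
    solutions = []
    for n4 in range(k + 1):
        for n3 in range(k + 1 - n4):
            n1 = n3 + 2 * n4 + 2          # follows from the two conditions above
            n2 = k - n1 - n3 - n4
            if n2 >= 0:
                solutions.append((n1, n2, n3, n4))
    return solutions


solutions_for_8 = connection_counts(8)
```

For $K = 8$ the list contains (4, 2, 2, 0), i.e. 4 groups with 1 connection, 2 groups with 2 connections and 2 groups with 3 connections, which is the case used as the example in Figure 6. Because only these counts are passed on to the group selection step, no group vector violating the valence condition is ever assembled.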
[Figure 6 (flow diagram): an unordered set of groups, the classification system, the rules relating feasibility to classification, and additional specifications (maximum number of groups; ring formation allowed or not) are used to determine how many groups with 1, 2, 3 & 4 connections are needed and to form sets of group classes. All possible combinations following the rules and specifications are then generated. Example: total number of groups 8 (2 groups with 3 connections, 4 groups with 1 connection, 2 groups with 2 connections). Result: 4 CH3, 1 CH2, 2 CH, 1 CH2COO; 3 CH3, 1 CHO, 1 CH2, 2 CH, 1 CH2COO.]
Figure 6: Illustration of level-1 generation

After a successful run of this generation and screening algorithm, the net result is a collection $C$ of vectors of groups describing a series of molecules all satisfying the property constraints that can be examined at level 1. Note two very important features of this method:

1. The screening is embedded into the generation algorithm. This is done in order to identify and remove undesirable candidates at an early stage, thereby conserving storage resources.
2. The created candidates can be represented as a vector of length 4 with each element pointing to a sub-vector in $P_{c,v}$. By using this approach information is not duplicated unnecessarily.

An example of the application of the level 1 algorithm is highlighted in Figure 6 while the block diagram of the algorithm is given in Figure 7.
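The lookup-and-store idea behind $P_{c,v}$ can be sketched as follows. This is a hedged illustration, not the ProCAMD code: the functions `sub_blocks` and `assemble`, the dictionary `POOLS`, and the assignment of the example groups to the keys ("P", 1) and ("S", 1) are all hypothetical. Selection with repetition is assumed, since a group may occur several times in one compound.

```python
from itertools import combinations_with_replacement, product
from math import comb


def sub_blocks(pool, n, cache):
    """Return all ways of selecting n groups from pool, computing them once."""
    key = (pool, n)
    if key not in cache:                      # lookup before generating
        cache[key] = list(combinations_with_replacement(pool, n))
    return cache[key]


def assemble(requirements, pools, cache):
    """Combine the cached sub-blocks of every (class, category) key."""
    parts = [sub_blocks(pools[k], n, cache) for k, n in sorted(requirements.items())]
    return [sum(choice, ()) for choice in product(*parts)]


# Hypothetical pools of available groups per (class, category) key.
POOLS = {("P", 1): ("CH3", "OH"), ("S", 1): ("CH2", "CH2O")}
cache = {}
vectors = assemble({("P", 1): 2, ("S", 1): 1}, POOLS, cache)
more = assemble({("P", 1): 2, ("S", 1): 2}, POOLS, cache)  # reuses the ("P", 1) entry
```

The second call only adds one new cache entry, because the sub-blocks for two groups from the first pool are already stored. The number of sub-blocks for a pool of size $N$ and a selection of $n$ groups is the number of combinations with repetition, `comb(N + n - 1, n)`.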
[Figure 7 (block diagram): problem-specific conditions and the classification are used to obtain the rule set; the equations are solved for each number of groups between the minimum and maximum; for each solution $j = 1, \ldots, J$ and each class $c$ = P, S, D, T, Q, all combinations when selecting $n_{c,v}$ groups from $N_{c,v}$ are found; the results are combined to form compounds and screened against the design criteria.]
Figure 7: Block diagram for algorithm of level-1 generation

Properties Handled in Level-1

The properties handled in level-1 are those estimated by group contribution methods based on the group-set (in particular the groups used in the methods of Constantinou and Gani (1994) or Marrero and Gani (2001)) as well as by correlations based on properties predicted by group contribution. Here the issue of property trust (defined in chapter 1) comes into play. By using the results from property prediction methods in correlations for other (secondary) properties in order to further expand the property range (defined in chapter 1) of the predictions, the property trust is diminished because of the risk of error propagation. At this level of the hybrid CAMD method it is not possible to improve the property trust by using experimental data as the input for the correlations. This is because the molecular structures are ambiguously defined and it is therefore not possible to perform lookups to external sources of data in a fast and easy way.

6.3.3 Level-2: Generation of Structural Isomers From Group Vectors
This level generates new molecular structures by combining elements of the individual fragment sets of the group vectors from level-1. First- and second-order groups (such as those defined by Constantinou and Gani, 1994) are considered in the calculation of properties in this level. The main feature of this algorithm is that it is pseudo recursive. That is, all allowed combinations are considered, and efficiency is maintained by continuous removal of duplicate structures. Also, the combination rules satisfy conditions of chemical feasibility.

Generation of structural isomers from group vectors
The results obtained from the level-1 generation and screening algorithm are vectors of groups. Each vector can theoretically represent a number of different structural isomers. In Figure 1, the generation of isomers from the collection of group vectors is highlighted for the case of a group vector consisting of 3 CH3, 1 CH2, 1 CH, 1 CH2O. With the help of 2nd-order groups, two isomers are highlighted. The goal of the generation in level-2 is to:

• Increase the dimensionality of the molecular model in order to bring the results closer to the end goal of 3D structures.
• Provide a foundation for improving the quality of the predicted properties as well as allowing estimation of properties that cannot be handled when considering a molecular model consisting only of first-order functional groups (groups from level-1).

The generation is performed by combining the groups from each of the results (group vectors) from level-1 into connected graphs with groups as vertices and bonds as edges. Special care must be exerted when combining non-symmetrical groups with more than one free connection (as shown in Figure 8). The method for handling such groups is to split up the group into a sub-graph as shown in Figure 9. Because of the need to be able to handle non-symmetrical groups, the generation is in fact the combination of a collection of sub-graphs (Figure 10), most of which only have one vertex, into a connected graph. When considering the generation of acyclic compounds the problem is that of generating all spanning trees in a graph with the added constraint of restrictions on the valence of each of the vertices. An example of a base-graph is shown in Figure 11.

In Figure 11, the creation of the base-graph (the graph in which the spanning tree is to be created) has not been completed. This is due to the requirement that compounds should be chemically feasible and also adhere to the rules of application of first-order groups, both in order not to generate multiple identical compounds from different group vectors and in order to ensure that the "promotion" into chemical structures is "reversible". The requirement of reversible promotion can be addressed by defining rules for how groups can be combined/connected. The rules imposed cause the base-graph to be incomplete in all but the simplest cases (all groups belong to category 1 of the group classification system). An easy additional simplification can be applied for all group vectors having more than 2 groups. Since the utilization of all groups is required, connections between groups having only 1 free connection can be disallowed: such a connection would close the molecule before the remaining groups could be attached. The result of the application of the feasibility rules and simplifications is illustrated by Figure 12. In the base-graph storing the allowable connections, the valence restriction of the groups is not fulfilled since it is a map of all molecules superimposed onto each other, creating a molecule superstructure representing all possible combinations (in the same way as a flowsheet superstructure represents a number of process options in process design formulations). If a molecule base-graph (or superstructure) meeting the valence requirement exactly is found, there is only one way of combining the groups into a molecule.
Once the molecule superstructure has been determined the task is to identify all the spanning trees in the superstructure with the constraint that the valence requirement of each group must be fulfilled for all the identified spanning trees (Figure 13 shows an example of such a spanning tree). The identification of all spanning trees in a graph is a complex problem even without considering the valence of the individual groups. While the above treatment of the isomer generation as a tree building process only covers the generation of acyclic compounds, it is a simple task to extend the concept to the generation of cyclic structures by relaxing the valence requirement for all groups with a valence greater than 1 in the generation of the spanning trees. The ring forming process is then performed after the tree identification by connecting vertices with free connections. An added requirement to the problem of identifying the spanning trees, and later the rings in cyclic molecules, is the necessity of generating unique structures only and avoiding graph isomorphism (the problems related to graph isomorphism are described in Raman and Maranas (1998)).
Figure 8: Non-symmetrical group having more than 1 free connection
Figure 9: Sub-graph created by splitting a non-symmetrical group
Figure 10: The collection of subgraphs that are to be combined into molecules
Figure 11: The base-graph in which spanning trees are to be found, not considering feasibility
Figure 12: The base-graph from Figure 11 after application of simplifications and feasibility considerations
Figure 13: Example of a valid spanning tree (a molecule)
Generation algorithm for level-2

Based on the discussion above, the methodology applied to identify the spanning trees is a recursive tree building process with repeated pruning used to remove branches leading to false solutions or duplicate structures (see Figures 14 and 15).
Figure 14: The generation tree obtained by applying the generation algorithm for level 2 (for an acyclic molecule)
Figure 15: Partial generation tree obtained by applying the generation algorithm for level 2 (for a cyclic molecule)

The algorithm is as follows:

1. Set the list of generated compounds ($C$) to the empty set. Each compound in $C$ holds a list of the free connections available ($F$) and information about which groups have been used to make a connection.
2. Create $C_0$ by selecting a starting group and marking the group "used".
3. Add the free connections of the group to $F_0$.
4. For all compounds in $C$:
   a. Select a compound $C_j$ from $C$.
   b. For all free connections in $F_j$:
      i. Select a free connection $U$ from $F_j$.
      ii. For all unused groups in $C_j$:
         A. Compare $U$ with the connections for the unused group. If the connection is allowed, create a copy of $C_j$ and add the copy to $C$ as $C_z$.
         B. In $C_z$: connect the unused group in question, mark the group as used, and delete $U$ from $F_z$.
         C. Add the free connections of the recently used group to $F_z$.
   c. Delete $C_j$ from $C$.
5. If all groups have been used, go to 9.
6. Compare all members of $C$ and remove duplicates.
7. Remove all compounds having no free connections (false solutions).
8. Go to 4.
9. If cyclic compounds are to be generated, form these by creating all possible variations by connecting the remaining free connections.
10. STOP
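The tree-building steps above can be sketched for acyclic compounds built from symmetrical groups. This is a simplified illustration and not the published implementation: it ignores the group classification rules and non-symmetrical groups, the names `canonical` and `isomers` are hypothetical, and duplicate removal is done with a canonical string (the smallest sorted rooted encoding over all roots), which is a technique swapped in here because the section does not state how duplicates are compared.

```python
def canonical(labels, edges):
    """Canonical string of a labelled tree: smallest rooted encoding over all roots."""
    adj = {i: [] for i in range(len(labels))}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)

    def encode(v, parent):
        children = sorted(encode(c, v) for c in adj[v] if c != parent)
        return labels[v] + "(" + "".join(children) + ")"

    return min(encode(root, None) for root in adj)


def isomers(groups):
    """Enumerate distinct acyclic structures for a list of (name, valence) groups."""
    labels = [name for name, _ in groups]
    valence = [v for _, v in groups]
    found = set()

    def grow(edges, used, free):
        if not free:
            # No free connections left: a molecule if every group is used,
            # otherwise a false solution that is pruned.
            if len(used) == len(groups):
                found.add(canonical(labels, edges))
            return
        u = free[0]                      # free connection taken from the head of F
        tried = set()
        for g in range(len(groups)):
            if g in used or labels[g] in tried:
                continue                 # identical unused groups are interchangeable
            tried.add(labels[g])
            grow(edges + [(u, g)], used | {g}, free[1:] + [g] * (valence[g] - 1))

    grow([], {0}, [0] * valence[0])
    return sorted(found)


hexane_isomers = isomers([("CH3", 1)] * 3 + [("CH2", 2)] * 2 + [("CH", 3)])
```

For the group vector 3 CH3, 2 CH2, 1 CH the sketch returns two structures, corresponding to 2-methylpentane and 3-methylpentane. Branches that run out of free connections before all groups are used are the "false solutions" of step 7.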
Calculation of Properties in Level-2

Since the generation algorithm creates structures larger than the individual groups selected as building blocks, it is possible to calculate properties using methods operating on structural descriptors that are assembled from the initial groups. Examples of such methods are the second-order group contribution methods of Constantinou and Gani (1994) and Marrero and Gani (2001), where the properties are predicted by summing up contributions from first-order groups as well as from larger substructures (second-order groups) that have first-order groups as their building blocks. The identification of second-order groups in a structure created in level-2 can be performed by a pattern matching algorithm in which the generated adjacency matrix is examined for the presence of a smaller adjacency matrix (representing the second-order substructure). By performing this check for all second-order substructures it is possible to obtain the second-order description (a vector of the second-order groups present in the molecule) and thereby predict the properties using methods such as the Constantinou and Gani (1994) method. It is notable that since the same first-order (or level 1) description can be regarded as the "parent" of molecules having different second-order descriptions, the methods used in level-2 not only improve the quality of the property prediction but also allow for distinction between isomers.
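The pattern matching idea can be sketched with a brute-force search that looks for a small group graph inside the group graph of a molecule. This is a minimal sketch under assumptions: `count_matches` and `estimate` are hypothetical names, occurrences are counted as distinct sets of matched vertices (the counting conventions of the published methods may differ), and the contribution values passed to `estimate` are placeholders, not published parameters. The estimate is assumed to take the additive form of first-order plus second-order contributions.

```python
from itertools import permutations


def count_matches(mol_labels, mol_edges, pat_labels, pat_edges):
    """Count occurrences of a pattern (second-order group) in a molecule graph."""
    bonded = {frozenset(e) for e in mol_edges}
    hits = set()
    for image in permutations(range(len(mol_labels)), len(pat_labels)):
        if any(mol_labels[v] != pat_labels[k] for k, v in enumerate(image)):
            continue                     # group types must agree
        if all(frozenset((image[a], image[b])) in bonded for a, b in pat_edges):
            hits.add(frozenset(image))   # same vertex set counted once
    return len(hits)


def estimate(first_counts, second_counts, c, d):
    """Additive estimate: sum of N_i*C_i plus sum of M_j*D_j (placeholder values)."""
    first = sum(n * c[g] for g, n in first_counts.items())
    second = sum(m * d[g] for g, m in second_counts.items())
    return first + second


# Second-order pattern (CH3)2CH: a CH group bonded to two CH3 groups.
PATTERN = (["CH", "CH3", "CH3"], [(0, 1), (0, 2)])

mp2 = (["CH3", "CH3", "CH", "CH2", "CH2", "CH3"], [(0, 2), (1, 2), (2, 3), (3, 4), (4, 5)])
mp3 = (["CH3", "CH2", "CH", "CH3", "CH2", "CH3"], [(0, 1), (1, 2), (2, 3), (2, 4), (4, 5)])

n_2mp = count_matches(*mp2, *PATTERN)
n_3mp = count_matches(*mp3, *PATTERN)
```

Both molecules share the first-order description 3 CH3, 2 CH2, 1 CH, but only the first contains the second-order pattern, so an estimate that includes the second-order term can separate the two isomers.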
Now that new isomers have been generated, a property estimation method employing this molecular representation may be used to re-estimate the properties as well as to estimate other new properties (as highlighted in Figure 1).
6.3.4 Level-3: Creation of Atomic Based Adjacency Descriptions

In level-3 the compound descriptions obtained from level-2 are subjected to further refinement and structural variation. The goals of level-3 are to bring the compounds closer to a 3D structure and to enable the use of higher-order estimation methods that are not based on the original group-set or combinations thereof (such as second-order groups). Note that the atomic representations also define the connectivity of the molecules. Therefore, property prediction methods based on connectivity indices can be employed to predict properties that could not be predicted earlier (due to unavailable group contributions) or to verify previously estimated values.
Generation Algorithm for Level-3

The level-3 generation algorithm transforms the group based connectivity information (the adjacency matrix from level-2) into atom-based information. This is achieved by expanding each group into its corresponding atom-based adjacency matrix and replacing the groups in the group based description with additional rows and columns to allow for group expansion. When performing the group expansion into an atomic representation, one group based description may yield more than one atomic description. This is the case with compounds containing any of the groups listed in Table 5. The additional representations appear in the cases where the original groups have a ring element with 1 or more free connections, because of the ambiguously defined distance (in the ring) between the free bonds (as in ortho/meta/para) or between hetero-atoms and bonds in aromatic rings (as in pyridine derivatives).
Table 5: Examples of first-order groups with multiple atomic representations

First-order group    Number of isomers on an atomic basis
C5H4N                3
C5H3N                6
C4H3S                2
C4H2S                4
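The counts in Table 5 can be reproduced with a small symmetry argument, sketched below under stated assumptions: the ring is taken as planar with a single hetero-atom at position 1, the free connections sit on distinct ring carbons, and the only symmetry considered is the mirror plane through the hetero-atom. The function name `count_ring_isomers` is hypothetical.

```python
from itertools import combinations


def count_ring_isomers(ring_size, n_free):
    """Count distinct placements of n_free free connections on a mono-hetero ring.

    Position 1 holds the hetero-atom; positions 2..ring_size are carbons.
    Two placements are identical if the mirror p -> ring_size + 2 - p maps
    one onto the other.
    """
    seen = set()
    for combo in combinations(range(2, ring_size + 1), n_free):
        image = tuple(sorted(ring_size + 2 - p for p in combo))
        seen.add(min(combo, image))      # one representative per mirror pair
    return len(seen)


table5 = {
    "C5H4N": count_ring_isomers(6, 1),   # pyridine ring, 1 free connection
    "C5H3N": count_ring_isomers(6, 2),   # pyridine ring, 2 free connections
    "C4H3S": count_ring_isomers(5, 1),   # thiophene ring, 1 free connection
    "C4H2S": count_ring_isomers(5, 2),   # thiophene ring, 2 free connections
}
```

Under these assumptions the sketch gives 3, 6, 2 and 4 atomic representations, in agreement with Table 5.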
The algorithm for generation of atomic adjacency matrices from group-based ones consists of the following steps:

1. Set the matrix A equal to the group based matrix from level-2.
2. List the groups in the compound.
3. For each of the groups in the compound:
   (a) Load the corresponding atom based matrix or matrices (for groups with ambiguous 2-dimensional representation).
   (b) Insert the atom based matrix in the place of the corresponding group in A. If the particular group has multiple representations, create a corresponding number of copies of A.
4. Identify the atoms taking part in the original bonds between groups.
5. Reconnect the molecule by establishing connections between the atoms identified in step 4.
6. Stop

After performing the conversion the net result is a series of compounds described using atoms and how they are interconnected. Furthermore, all 2D structural variations on the atomic level have been generated. This conversion process is illustrated in Figure 16.
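A minimal sketch of the group-to-atom expansion follows. It makes simplifying assumptions that the text does not: every group has a single atomic representation, and all free connections of a group sit on the first atom of its template, so groups from Table 5 and non-symmetrical groups are not covered. The `TEMPLATES` dictionary and the function `expand` are hypothetical.

```python
# Atom templates: (atom symbols, internal bonds); atom 0 carries the free connections.
TEMPLATES = {
    "CH3": (["C", "H", "H", "H"], [(0, 1), (0, 2), (0, 3)]),
    "CH2": (["C", "H", "H"], [(0, 1), (0, 2)]),
    "OH": (["O", "H"], [(0, 1)]),
}


def expand(groups, group_bonds):
    """Turn a group-level description into atom symbols and an adjacency matrix."""
    atoms, bonds, anchor = [], [], []
    for g in groups:
        symbols, internal = TEMPLATES[g]
        offset = len(atoms)
        anchor.append(offset)                       # atom bonded to other groups
        atoms.extend(symbols)
        bonds.extend((offset + a, offset + b) for a, b in internal)
    # Reconnect: each group-group bond becomes a bond between anchor atoms.
    bonds.extend((anchor[i], anchor[j]) for i, j in group_bonds)
    n = len(atoms)
    matrix = [[0] * n for _ in range(n)]
    for a, b in bonds:
        matrix[a][b] = matrix[b][a] = 1
    return atoms, matrix


# Ethanol written as CH3-CH2-OH on the group level.
atoms, matrix = expand(["CH3", "CH2", "OH"], [(0, 1), (1, 2)])
```

The three-group description becomes a 9 x 9 atomic adjacency matrix with 8 bonds. Handling groups with several atomic representations would require returning one matrix per representation, as in step 3(b).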
Property Prediction in Level-3

In the property prediction step of level-3 the additional structural details generated through the algorithm described above are used to further distinguish between isomers and to enable the use of higher-order methods. Whether isomer distinction and/or the use of higher-order methods is necessary depends on the CAMD problem specification and the types of molecules that are being generated. By having the designed molecules represented on an atomic level, the feasible candidates become expressed in the "common language" of chemical information (i.e. the 2-dimensional representation) and it should therefore be possible to use all property estimation methods using this representation. Having the 2D atomic structure enables the use of other sources of property information than those used in levels 1-2:

1. Directly, by calculating structural descriptors for predicting properties by correlations (such as the boiling point method using the Kier shape index as described by Horvath (1992)).
2. Indirectly, by using the detailed structural information as a starting point for the re-description of the molecule into another fragment based description different from the original source of the candidate (created in level-1).
3. By performing structural searches in databases.

Furthermore, the structural information contained in the atomic description is available for the creation of 2D drawings (structural formulas) of the candidates.
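As an example of a structural descriptor that becomes available once atomic connectivity is known, the sketch below computes the first-order Randić connectivity index on the hydrogen-suppressed graph. This descriptor is chosen here only for illustration; it is not the Kier shape index mentioned above, and the function name `randic_index` is hypothetical.

```python
import math


def randic_index(n_atoms, bonds):
    """First-order connectivity index: sum over bonds of 1/sqrt(d_a * d_b)."""
    degree = [0] * n_atoms
    for a, b in bonds:
        degree[a] += 1
        degree[b] += 1
    return sum(1.0 / math.sqrt(degree[a] * degree[b]) for a, b in bonds)


# Carbon skeletons (hydrogens suppressed) of two isomers with the same groups.
two_methylpentane = [(0, 1), (1, 2), (1, 3), (3, 4), (4, 5)]
three_methylpentane = [(0, 1), (1, 2), (2, 3), (2, 4), (4, 5)]

chi_2mp = randic_index(6, two_methylpentane)
chi_3mp = randic_index(6, three_methylpentane)
```

The two isomers have identical first-order group vectors but different index values, which is why correlations based on such descriptors can distinguish them.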
[Figure 16 (diagram): a group based adjacency matrix for a compound made of a CH3 group and an OH group is converted in three steps ("Insert CH3", "Insert OH", "Reconnect the groups") into an adjacency matrix whose rows and columns are the individual atoms.]
Figure 16: Illustration of the conversion from group to atomic representation
The re-description into alternative group-sets, thus enabling the use of other methods, serves a dual purpose. Properties already handled can be re-estimated, and additional properties can be handled using group/fragment based methods capable of predicting properties not possible to predict with the original group-set (an example is the enthalpy of fusion, which can be estimated using the group-set and method of Joback & Reid (1987)). The two options are the equivalent of increasing the property trust and increasing the property range described in Chapter 1.

However, by doing this, there will be a computational cost associated with the re-description. The ability to use prediction methods based on a higher order of structural descriptors (capturing more of the structural information of the compound) than used in the previous levels also increases the property trust, since such methods can distinguish some forms of isomers (Horvath, 1992) and the predicted value therefore is an estimate for the particular isomer rather than the best fit to match the average of all compounds having the same description.

6.3.5 Level-4: Generation of 3D structures
In this level, generation and testing of molecules enter an interactive mode. For any selected candidate from level-3 it is possible to use molecular modeling programs such as MOPAC or Chem3D (CambridgeSoft Inc., 1997). A three-dimensional graph (or molecular model) is created by applying a set of standard bond lengths and angles for the various types of connections. Consequently, a true molecular model of the molecule is obtained, which can be further analyzed in terms of conformers, stability, properties, etc. In level-4 the final step towards a highly detailed molecular description is taken by the conversion of the selected 2D structures from level-3 to 3D molecular models. The added dimensionality of a 3D representation yields the possibility for additional structural variations. The structural isomers possible to generate and distinguish in level-4 are the ones related to the relative steric placement of bonds and atoms. The isomer types theoretically possible to distinguish and generate are (following the definitions of Morrison and Boyd (1992)):

• Z/E isomers
• R/S isomers
• cis/trans isomers
• Boat/Chair isomers
• Anti/Gauche isomers

The latter two isomer types are what is known as conformational isomers while the rest are configurational isomers. Conformational isomers can be created by rotating single bonds and are controlled by the internal energies of the compound. The configurational isomers, however, cannot be transformed into each other by rotation around single bonds. The generation algorithm of level-4 considers only the distinction between configurational isomers and leaves the conformational isomer analysis and distinction to the post-design phase of the hybrid CAMD method. The reasons for this lie in the fact that the conformational isomer behavior (or simply the conformation) of a compound is dependent on the state of the compound (temperature, pressure) as well as the presence of other compounds in the immediate environment, and requires very specialized tools in order to analyze the conformational space. Furthermore, in a bulk phase of a compound no single conformer will be the only one present. Instead there will be a Boltzmann distribution of the conformers depending on the energy level of each possible conformer (Jonsdottir, 1995).
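The Boltzmann distribution mentioned above can be made concrete with a short sketch. The function name `boltzmann_populations` is hypothetical, degeneracies are handled simply by listing each conformer separately, and the energies in the usage example are illustrative placeholders rather than data from this chapter.

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ/(mol K)


def boltzmann_populations(energies_kj_mol, temperature_k):
    """Fraction of each conformer: exp(-E_i/RT) normalised over all conformers."""
    e_min = min(energies_kj_mol)              # shift energies for numerical stability
    weights = [math.exp(-(e - e_min) / (R_KJ * temperature_k)) for e in energies_kj_mol]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative case: one low-energy conformer and two equivalent higher-energy
# conformers lying 3.8 kJ/mol above it (placeholder value).
populations = boltzmann_populations([0.0, 3.8, 3.8], 298.15)
```

Even with a clear energy gap the higher-energy conformers retain a noticeable share of the population at room temperature, which illustrates why no single conformer represents the bulk phase.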
Generation Algorithm for Level-4

The basis for the generation in level-4 is the addition of

• Hybridization information (i.e. the bond configuration and standard angles between the bonds)
• Placement in an x,y,z coordinate space

for each of the atoms in each adjacency matrix description obtained from level-3. For a single compound representation the level-4 promotion algorithm is:

1. Select an atom participating in 1 bond and add the atom to $Y$ (the collection of used atoms).
2. Assign the x,y,z position of the origin (0,0,0) to the atom.
3. Set the bond direction ($D$) to (0,0,1) for the free connection.
4. Add the free connection to the tail of the list of free connections $F$.
5. Select the free connection $U$ from the head of $F$.
6. Find the atom $M$ participating in the connection $U$ and not part of $Y$.
7. Determine the hybridization of the atom based on the atom type and the number and types of bonds it participates in.
8. Determine the (x,y,z) position $P_M$ of the atom by calculating

   $$P_M = a \, D_U + P_U \qquad (1)$$

   where $a$ is the bond length (the bond length can be fixed or dependent on the bond type) and $P_U$ is the position of the other atom participating in bond $U$.
9. Remove $U$ from $F$.
10. Add the free bonds of $M$ to $F$. Each free bond obtains its bond direction information by rotating the base configuration for the atom (the hybridization) in such a way that the previously made connection is superimposed on $D_U$.
11. Add $M$ to $Y$.
12. If $Y$ does not contain all the atoms, go back to 5.
13. If $F$ is not empty (only possible for cyclic structures), create the connection pairs for the remaining free connections based on the original connections in the level-3 description.
14. For each atom:
    (a) Analyze for the existence of chiral centers (R/S isomers).
    (b) If found, duplicate the entire structure and swap the positions of any two substituents.
15. For each double bond:
    (a) Analyze for the possibility of Z/E isomerism.
    (b) If found, duplicate the compound and swap the positions of the substituents on either of the atoms participating in the double bond.
16. For each single bond in a ring between atoms participating in 4 single bonds:
    (a) Analyze for the possibility of cis/trans isomerism.
    (b) If found, duplicate the compound and swap the positions of the substituents on either of the atoms participating in the bond.
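The placement rule of equation (1) and the notion of a hybridization "base configuration" can be sketched as follows. The names `place_atom`, `TETRAHEDRAL` and `angle_deg` are hypothetical, the direction is normalized inside the function as a precaution, the bond length of 1.54 is an illustrative value, and the rotation of the base configuration onto $D_U$ (step 10) is not implemented here.

```python
import math


def place_atom(p_u, d_u, bond_length):
    """Equation (1): position of the new atom = bond length * direction + P_U."""
    norm = math.sqrt(sum(c * c for c in d_u))
    return tuple(p + bond_length * c / norm for p, c in zip(p_u, d_u))


# Base configuration for an sp3 atom: four directions towards alternating
# corners of a cube (not normalized).
TETRAHEDRAL = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]


def angle_deg(u, v):
    """Angle between two direction vectors in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(dot / (nu * nv)))


# First placement of the algorithm: start atom at the origin, direction (0, 0, 1).
second_atom = place_atom((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), 1.54)
tetra_angle = angle_deg(TETRAHEDRAL[0], TETRAHEDRAL[1])
```

Any two directions of the tetrahedral base configuration enclose the standard angle of about 109.47 degrees. Because each placement uses only one bond direction, nothing in this rule fixes the torsion angle, which is consistent with the limitation on torsion angles noted for this level.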
Figure 17 illustrates the conversion of the atomic description to a 3D model. The analysis for the presence of R/S, Z/E and cis/trans isomers is done using the extended ACMC method (see Appendix B at the end of this chapter) by calculating and comparing the codes for the atoms participating in each analyzed substituent.

[Figure 17 (diagram): starting from the atomic adjacency description of a compound, the structure is converted into a 3D model and passed on through file conversion steps (labels in the original figure include "Tripos", "Native" and "Alchemy input file") to Chem3D, where an automation module is invoked.]
Figure 17: Illustration of the process of creation of 3D models
Capabilities and Limitations of the Level-4 Generation Algorithm

It should be noted that the algorithm for generation of 3D molecular structures does not consider 2 important aspects:

1. Torsion angles in the final structures are random. This is due to the fact that the algorithm only examines and considers 1 bond at a time. If torsional information should be included in the placement calculations, the algorithm would have to examine 2 additional bonds for each atom placement.
2. For compounds containing cycles there is no guarantee that the generated 3D models contain cycles with uniform bond length. Since the ring-building process is completed by forming bonds without consideration for the length, it is possible to obtain models containing very deformed rings.

Both of the above limitations may be handled through external molecular modeling software capable of property prediction and/or generating descriptor information. Note that the analysis of the 3D structure of rings and the torsional state of a molecule need to be investigated as parts of the examination of conformational isomers. Furthermore, their "correct" values are heavily dependent on the methods used to calculate the properties of the compounds (i.e. energy minimization performed with the MM2 method (CambridgeSoft Inc., 1997) may lead to different results than those that can be obtained using the AM/3 force field of MOPAC (CambridgeSoft Inc., 1997)).
6.4 POST-DESIGN PHASE
In the post-design phase the results from the design-phase solution engine are analyzed with respect to properties and behavior that could not be part of the design considerations. Examples of such properties and behavior are price, availability, legislative restrictions, process wide performance and many more. At the end of this analysis the final selection of the product identity must be made.
6.4.1 Analysis of design solutions

The analysis involves using other sources not considered in level-1 to level-4. Examples of such sources could be:

• Property estimation and molecular modeling tools for validation of predicted properties not handled by the CAMD algorithm and/or validation of the properties estimated during the design phase.
• Databases for examination of environmental or legislative requirements as well as reaction pathways.
• Supplier catalogues for price and availability information.
• Engineering insight and simulation tools such as mixture analysis and phase behavior calculations.

Which tools and data sources to use depends on the original CAMD problem specification. Databases, process synthesis/design tools, process modeling & simulation tools, analysis tools, etc., are all useful in this phase. Also, analysis based on experiments and/or experimental data should be considered. Finally, web-based database search, if possible, could also be carried out. This is particularly useful for verification of EH&S properties.

6.4.2 Final Candidate Selection

After validation of the obtained results the final candidates must be selected. This selection must take all the available information into account, including socio-economic aspects and the out-of-process (or indeed life cycle) performance of the different compounds.

Integration of Process-Product Design

This type of selection procedure is beyond the scope of this book, but the presented CAMD framework has been used successfully in process design algorithms addressing the process-wide environmental performance with respect to energy consumption and emissions control (Hostrup et al., 1999). In the approach of Hostrup et al. (1999) the presence or absence of each result from CAMD in the process is controlled by integer variables in a superstructure formulation of the process design problem, and the results are subsequently selected using an MINLP solution algorithm. The method shows that the design/selection of compounds for a particular purpose can be performed as a subproblem of a larger process design problem. The benefit of doing this rather than including the compound design in the overall problem formulation is that of being able to use external sources of data to validate the estimations as well as enabling the use of computationally more complex models for property estimation. The benefits are achieved without sacrificing any versatility, since solving the CAMD problem as a subproblem with the proposed method identifies all compounds possessing the properties essential to the desired functionality as well as making it possible to screen out less desirable candidates by adjusting the property constraints related to performance and environment.
The developed framework is therefore well suited for use with the advanced methods for impact minimization developed by other researchers (such as the MEIM method by Pistikopoulos et al. (1994) and the WAR algorithm by Cabezas et al. (1999)). In any real application of CAMD the final testing involves experimental determination of key properties and behavior, regardless of what method is used to select the final candidates. The power and purpose of CAMD is to limit the number of candidates to those showing the maximum potential, not to replace experimental testing.
6.5 IMPLEMENTATION OF THE FRAMEWORK

The proposed framework has been partially implemented as a computer program "ProCAMD" (ICAS Documentations, 2002). The screening on the basis of the atomic representation (level 3) is done using external tools, specifically the property prediction program "ProPred" (ICAS Documentations, 2002) and the commercial drawing and property estimation program "ChemDraw Ultra 2000" (CambridgeSoft Inc., 1999). The "ProPred" package includes an implementation of the extended ACMC method (see Appendix). The treatment of 3D structures (the results from level 4) has been performed in the commercial molecular modelling program "Chem3D Pro" (CambridgeSoft Inc., 1997).
Figure 18: Link between ProCAMD and Chem3D
Figure 19 shows the modular structure of the implementation along with the dependencies of the modules and the external programs and data sources it is connected to. It has been a goal of the development to enforce a structure where each of the major parts of the algorithm is represented by a separate module of code, thereby making it easy to update and modify the code and making it possible to create custom solutions in the future by extracting selected modules and inserting them into another framework.

6.5.1 Extension of the Hybrid CAMD Method to Complex Molecules
The hybrid CAMD method has been extended (Nielsen, 2000) to include a new database of large complex molecules containing their pure component data and solubility data in known solvents. The search for solvents starts with defining the solute structure, determining the pure component properties (if not available in the database), generating the group representation, evaluating the property model parameters in terms of the sensitivity of the solubility calculations to those parameters, and generating diagrams of solubility versus the solubility parameter of the solvent. The maximum of the solubility in these plots approximately identifies the solubility parameter of the complex solute molecule and, therefore, the target properties of the desired solvent. The algorithm is shown in the form of a block diagram in Figure 20. It has been applied for solution of the CAMD problems discussed in chapter 8.
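The step of reading the solute solubility parameter off the maximum of a solubility curve can be sketched as follows. This is only a minimal illustration in Python: the function names, the half-width of the target window and the data points are hypothetical and are not taken from ProCAMD or from Nielsen (2000).

```python
# Minimal sketch: locate the maximum of a solubility versus
# solvent-solubility-parameter curve. All numbers are illustrative
# placeholders, not data from the chapter.

def estimate_solute_delta(points):
    """Return the solvent solubility parameter at which solubility peaks.

    points: iterable of (solvent_solubility_parameter, solute_solubility).
    """
    points = list(points)
    if not points:
        raise ValueError("need at least one (delta, solubility) point")
    best_delta, _ = max(points, key=lambda p: p[1])
    return best_delta


def target_window(points, half_width=1.0):
    """Hypothetical target range for the solvent solubility parameter."""
    delta = estimate_solute_delta(points)
    return (delta - half_width, delta + half_width)


# Hypothetical curve: (solubility parameter, mole-fraction solubility)
curve = [(15.0, 0.002), (17.0, 0.010), (19.0, 0.031), (21.0, 0.024), (23.0, 0.008)]

print(estimate_solute_delta(curve))
print(target_window(curve, half_width=1.0))
```

In practice the curve would come from the property models rather than from a hand-written list, and the width of the window would be a design decision.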
Figure 19: Structure of ProCAMD highlighting methods & tools employed.
Figure 20: The extended hybrid CAMD method

6.6 APPLICATION EXAMPLE

Several examples of the application of the methodology are given elsewhere in this book. See for example chapters 8 and 9, where applications of ProCAMD are highlighted for solvent design problems. Besides solvent design applications (in chapters 8 & 9), the following simple molecular design problems are suggested for the reader as tutorial exercises.

• Find all organic molecules with C, H & O atoms having normal boiling points between 300 K and 400 K that form azeotropes with ethanol at 1 atm pressure.
• Find non-aromatic organic molecules that, when added to a mixture of acetic acid-chloroform in the liquid phase, cause a phase split; assume a temperature of 300 K and a pressure of 1 atm.
• Find all cyclic organic molecules with C, H & O atoms that have the same normal boiling point (equal or lower), Hildebrand solubility parameter and melting point (equal or higher) as benzene but not its EH&S properties.
• Find how many chemically feasible molecules can be formed with the groups CH3, CH2, CH, OH, CHO, CH3CO, CH2CO considering a minimum of 2 groups and a maximum of 5 groups.
• Find all compounds that match the following property constraints:
  475 K < normal boiling point < 525 K
  325 K < normal melting point < 375 K
  -250 kJ/mol < heat of fusion at 298 K < -220 kJ/mol
  -0.75 < log octanol-water partition coefficient < -0.50
  4.0 < log water solubility (log mg/L) < 5.5

Solutions to the above problems can be obtained from R. Gani ([email protected]).
6.7 CONCLUSIONS

The hybrid CAMD method can be regarded as a general purpose methodology that provides the framework for future developments needed to solve current and future problems in the area of product and formulation design. The framework is flexible enough to provide the link between molecular structure representation and property estimation at different scales of size. It also provides a link with the databases and knowledge-based systems needed for the pre-design and post-design phases. Although most of the examples (employing this methodology) shown in this chapter and elsewhere in the book deal with selection and design of solvents, it can be and has been employed for fluid design, search for azeotropes, search for polymer repeat unit structures, search for additives and many more. The vast collection of property models integrated into the ProCAMD software makes the application range quite large. Current and future work is extending the methodology towards the design of larger molecules and isomers typically found in the design of drugs, pesticides, speciality chemicals and polymers.
6.8 REFERENCES

1. H. Cabezas, J. Bare and S. Mallick, "Pollution prevention with chemical process simulators: The generalized waste reduction (WAR) algorithm", Computers and Chemical Engineering, 23 (1999) 623-634.
2. CambridgeSoft Inc., Chem3D Pro Users Guide, CamSoft Inc., Cambridge, MA, USA, 1997.
3. CambridgeSoft Inc., ChemDraw Ultra 2000 Manual, CamSoft Inc., Cambridge, MA, USA, 1999.
4. L. Constantinou and R. Gani, "New group Contribution Method for the Estimation of Properties of Pure Compounds", AIChE J., 10 (1994) 1697-1710.
5. R. Gani, B. Nielsen, A. Fredenslund, "A Group Contribution Approach to Computer Aided Molecular Design", AIChE J., 37 (1991) 1318-1332.
6. P. M. Harper, "A Multi-Phase, Multi-Level Framework for Computer Aided Molecular Design", PhD-thesis, Technical University of Denmark, Lyngby, Denmark (2000).
7. L. Horvath, "Molecular Design. Chemical structure Generation from the Properties of Pure Organic Compounds", Studies in Physical and Theoretical Chemistry Book Series, Volume 75, Elsevier, Amsterdam, The Netherlands (1992).
8. M. Hostrup, P. M. Harper, R. Gani, "Design of Environmentally Benign Processes: Integration of Solvent Design and Process Synthesis", Computers and Chemical Engineering, 23 (1999) 1395-1414.
9. ICAS Documentations, Internal Report PEC02-14, CAPEC, Department of Chemical Engineering, DTU, Lyngby, Denmark (2002).
10. K. G. Joback, R. C. Reid, "Estimation of Pure Component Properties from Group Contributions", Chemical Engineering Communications, 57 (1987) 233-243.
11. K. G. Joback, G. Stephanopoulos, "Searching Spaces of Discrete Solutions: The Design of Molecules Possessing Desired Physical Properties", Advances in Chemical Engineering, 21 (1995) 257-311.
12. S. Jonsdottir, "Theoretical Determination of UNIQUAC Interaction Parameters", PhD-thesis, Technical University of Denmark, Lyngby, Denmark (1995).
13. J. Marrero, R. Gani, "Group-contribution based estimation of pure component properties", Fluid Phase Equilibria, 183-184 (2001) 183.
14. R. T. Morrison, R. N. Boyd, "Organic Chemistry", 6th Edition, Prentice-Hall Inc., New Jersey, USA (1992).
15. M. B. Nielsen, "Solubility Prediction of Complex Compounds with UNIFAC", MSc-thesis, Technical University of Denmark, Lyngby, Denmark (2000).
16. E. N. Pistikopoulos, S. K. Stefanis, A. G. Livingston, "A Methodology for Minimum Environmental Impact Analysis", In AIChE Symposium Series on Pollution Prevention Through Process and Product Modification, AIChE, New York, USA (1994).
17. V. S. Raman, C. D. Maranas, "Optimization in Product Design with Properties Correlated with Topological Indices", Computers and Chemical Engineering, 22 (1998) 747-763.
18. Y. Xiao, Y. Qiao, J. Zhang, S. Lin, W. Zhang, "A Method for Substructure Search by Atom-Centered Multilayer Code", Journal of Chemical Information and Computer Science, 29 (1997) 701-704.
APPENDIX A: Equations Used in Generation Algorithms

If an unrestricted exhaustive enumeration of all possible combinations of groups is performed, the number of alternatives is given by (Joback and Stephanopoulos, 1995),

$M = \sum_{K=K_{\min}}^{K_{\max}} \frac{(N+K-1)!}{K!\,(N-1)!}$    (A.1)

Eq. A.1 is derived from

$M = \sum_{K=K_{\min}}^{K_{\max}} M_{N,K}$    (A.2)

where,

$M_{N,K} = \binom{N+K-1}{K}$    (A.3)

In the above equations, K is the number of elements in a population of size N, while M is the number of combinations.

The valence constraints are expressed as,

$K = n_S + n_D + n_T + n_Q$    (A.4)

$n_S = n_T + 2 n_Q - 2(A - 1)$    (A.5)

In Eqs. A.4 and A.5, n_S, n_D, n_T and n_Q are the number of groups in a molecule with 1 free attachment, 2 free attachments, 3 free attachments and 4 free attachments, respectively, while A is the maximum number of rings in a molecule.

The classification of compounds in terms of classes and categories (see Gani et al. 1991) is used to control the chemical feasibility of the generated molecules. In addition to the constraints A.4 and A.5, the following conditions are considered.

Constraints related to category-2 groups:

[equation illegible in source]    (A.6)

Constraints related to category-3 groups:

[left-hand bound illegible in source] $\le n_{S,3} + n_{D,3} + n_{T,3} + n_{Q,3}$    (A.7)

Constraints related to category-4 groups:

[left-hand bound illegible in source] $\le n_{S,4} + n_{D,4} + n_{T,4} + n_{Q,4}$    (A.8)

Constraints related to category-5 groups:

[equation illegible in source]    (A.9)

Constraints related to categories 4+5 (combined) groups:

[left-hand bound illegible in source] $\le n_{S,4} + n_{D,4} + n_{T,4} + n_{Q,4} + n_{S,5} + n_{D,5} + n_{T,5} + n_{Q,5}$    (A.10)

Constraints related to categories 3+4+5 (combined) groups:

[left-hand bound illegible in source] $\le n_{S,3} + n_{D,3} + n_{T,3} + n_{Q,3} + n_{S,4} + n_{D,4} + n_{T,4} + n_{Q,4} + n_{S,5} + n_{D,5} + n_{T,5} + n_{Q,5}$    (A.11)

The number of combinations for the j'th solution to the category constraints for the i'th solution to the valency constraint is given by

[equation illegible in source]    (A.12)

For a solved illustrative example (Harper, 2000) and the corresponding group classification tables, contact R. Gani ([email protected]).
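As a rough numerical illustration of the counting formula and the valence balance of this appendix, the following Python sketch may be helpful. It assumes that Eq. A.1 is the sum over K of the combinations-with-repetition term (N+K-1)!/(K!(N-1)!) and that Eq. A.5 reads n_S = n_T + 2 n_Q - 2(A - 1); both are readings of the printed equations rather than confirmed forms. The function names and the example inputs are hypothetical.

```python
from math import comb


def unrestricted_count(n_groups, k_min, k_max):
    """Number of unordered group combinations (repetition allowed)
    containing between k_min and k_max groups, drawn from n_groups
    distinct groups. No feasibility rule is applied."""
    return sum(comb(n_groups + k - 1, k) for k in range(k_min, k_max + 1))


def satisfies_valence(n_s, n_t, n_q, rings=0):
    """Valence balance assumed for Eq. A.5:
    n_S = n_T + 2*n_Q - 2*(A - 1), with A the number of rings.
    Groups with two free attachments do not enter the balance."""
    return n_s == n_t + 2 * n_q - 2 * (rings - 1)


# Seven distinct groups, molecules of 2 to 5 groups.
print(unrestricted_count(7, 2, 5))

# Propane as CH3-CH2-CH3: two end groups, no branching, acyclic.
print(satisfies_valence(n_s=2, n_t=0, n_q=0, rings=0))

# Six ring CH2 groups (cyclohexane): no end groups, one ring.
print(satisfies_valence(n_s=0, n_t=0, n_q=0, rings=1))
```

The unrestricted count is only an upper bound on the search space; the valence and category constraints remove most of these combinations before any property is evaluated.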
APPENDIX B: Molecular Encoding Technique

The applied method of group and fragment identification is based on the generation and identification of molecular codes (molecular "fingerprints"). By applying an encoding method to both molecules and fragments, a set of numbers is obtained for each molecule and fragment. The presence or absence of a particular fragment in a molecule is determined by examining the code sets of the molecule and the fragment in question. If the fragment's codes are a subset of the molecule's codes, the fragment is present in the molecule. The encoding of the molecular and fragment fingerprints is done using an expanded and adapted version of the Atom-Centered Multi-Layer Code approach of Xiao et al. (1997). The method has been improved to allow for additional flexibility in the definition of fragments and better handling of bond type (cyclic/acyclic) considerations.
Figure 21: Examples of codes for a molecule and a fragment

Examples of the results of the molecular encoding technique are illustrated through Figure 21. It can be seen that the code-set for the fragment is a subset of that of the depicted molecule and it can therefore be established that the fragment is present in the molecule.
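The subset test described in this appendix can be sketched in a few lines of Python. This is only an illustration of the set logic: the integer codes below are placeholders and are not claimed to be real Atom-Centered Multi-Layer Codes, and the function name is hypothetical.

```python
def contains_fragment(molecule_codes, fragment_codes):
    """Return True when every fragment code also occurs in the molecule's
    code set, i.e. the fragment code set is a subset of the molecule's."""
    return set(fragment_codes) <= set(molecule_codes)


# Placeholder code sets (illustrative integers only).
molecule = {2813, 4713, 5006, 7502, 10001}
fragment_present = {2813, 4713}
fragment_absent = {2813, 9999}

print(contains_fragment(molecule, fragment_present))
print(contains_fragment(molecule, fragment_absent))
```

A plain set ignores how many times a code occurs. If repeated occurrences of a fragment had to be counted, a multiset such as collections.Counter would be one possible alternative.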
Computer Aided Molecular Design: Theory and Practice
L.E.K. Achenie, R. Gani and V. Venkatasubramanian (Editors)
© 2003 Elsevier Science B.V. All rights reserved.
Chapter 7: Identification of Multistep Reaction Stoichiometries: CAMD Problem Formulation

A. Buxton, A. Hugo, A.G. Livingston & E.N. Pistikopoulos
Reaction path synthesis and the selection of an optimal route for the manufacturing of a desired product provide the earliest opportunities for waste reduction when designing environmentally sound processes. In the work presented here, a systematic procedure for the rapid identification of alternative multi-step stoichiometries is described in which minimum environmental impact considerations are incorporated. Both the size and complexity of the reaction path synthesis problem are reduced by decomposing it into a series of steps. First, a new group based co-material enumeration algorithm introduces material design principles through structural and chemical feasibility constraints to rapidly generate a manageable set of raw materials and co-products. Next, stoichiometries are extracted from the co-material set using a two-step optimisation procedure, including whole number stoichiometric coefficient constraints, carbon structure constraints and case specific constraints based on chemical knowledge. Thermodynamic, economic and environmental impact criteria are employed in the evaluation of feasible stoichiometries, with aspects of the Methodology for Environmental Impact Minimisation (Pistikopoulos et al., 1994) providing the framework for the environmental evaluation of alternatives.
7.1 INTRODUCTION

In the synthesis of a facility to manufacture a given desired product, the selection of an appropriate chemistry provides the earliest opportunity to influence the environmental and economic performance of the process. However, reaction route design and selection is a large and difficult problem. The major difficulty is attaching enough information to a particular reaction route alternative to make an informed choice about the potential of the route to be developed into a promising process. When only chemistry is known, it is difficult to quantify the costs and wastes associated with the eventual process in which the chemistry will be carried out, because of the large number of sources of expenses and waste which are not directly related to chemistry, and the range of different process topologies and equipment which may be associated with alternative chemistries. This problem is compounded by the fact that the vast majority of reaction schemes of industrial interest are of a multi-step nature (Rotstein et al., 1982).

Recognising these problems, it seems more sensible to identify candidate multistep reaction routes rapidly according to some simple criteria using limited information, than to devote time and resources to developing detailed reaction schemes which may be rejected later due to poor process performance. However, the synthesis of alternative reaction paths leading to a desired product was initiated by organic chemists who were interested in synthesising large, complex molecules, albeit in more efficient ways. Consequently, their approaches tended to concentrate on the generation of chemistries, rather than on the selection of promising routes. Agnihotri and Motard (1980) categorised these tools as information based systems and logic based systems, according to the reaction representation technique employed.

Information-based systems have their roots in real chemistry. Molecules are represented in terms of their atomic or group constituents and reactions, known as transforms, are based on real, known chemical transformations (Corey et al., 1969, 1972, 1976; Gelernter et al., 1973; Wipke et al., 1976; Govind and Powers, 1981; Kaufmann, 1977; Knight, 1995; Mavrovouniotis and Bonvin, 1995). The development of appropriate transforms relies heavily on chemical knowledge, while each transform may carry with it information relating to the molecular substructures to which it can be applied and the structural alterations it brings about - requiring details of any by-products which are produced or reagents which are required as well as typical operating conditions for the reaction and kinetic information. Consequently, according to Govind and Powers (1981), information based systems offer good predictive power, in that they are able to represent specific distinct reactions in detail.
However, they suffer poor generality, since their ability to represent different reactions is limited to the available transforms. Furthermore, a large database of information or a set of predictive techniques is required to implement such approaches. Information based systems most commonly build synthesis trees, usually working backwards (retrosynthesis) from the product. Retrosynthesis is an open ended problem which may lead to the development of a large network of reaction schemes and corresponding materials, even with a small number of transforms. Accordingly, the screening requirements which go along with information based systems are typically large.

By comparison, logic based systems are much easier to handle and control. These methods employ purely mathematical representations for molecules and their reactions (Ugi and Gillespie, 1971; Hendrickson, 1976). The most widely studied logic-based approach is centred around an atom balance, a matrix equation which describes the chemistry of a particular set of predetermined species, and from which stoichiometries leading to a particular product can be extracted (Rotstein et al., 1982; Fornari et al. 1989, 1994a, 1994b; Crabtree and El-Halwagi, 1994; Holiastos and Manousiouthakis, 1998). In this approach, only chemical formula information is required to generate stoichiometries, so that this approach provides a much more direct route to alternative multi-step reaction schemes. However, in order to apply this approach, all candidate raw materials and stoichiometric co-products (which will henceforth be referred to collectively as co-materials) must be known in advance and included in the matrix. While the careful pre-selection of these materials provides an early opportunity to limit the size of the problem and to screen out poor materials, no systematic method has been proposed to generate these materials.

While Fornari et al. (1989, 1994a, 1994b) and Crabtree and El-Halwagi (1994) limited themselves to single step reactions, Rotstein et al. (1982) and Holiastos and Manousiouthakis (1998) demonstrated the potential of the approach to develop multi-step reactions by considering closed cycle sequences of reactions known as clusters. A cluster of reactions is a sequence of thermodynamically feasible reactions in which the intermediates produced by the reactions in the cluster must also be consumed by other reactions in the cluster, with the net result being an overall main reaction which is thermodynamically infeasible, and therefore not directly achievable (Rotstein et al., 1982). In cluster synthesis, this main reaction must be specified in advance. Rotstein et al.
(1982) also applied their approach to open cycle sequences of reactions, in which the intermediates produced within the sequence are not completely consumed. However, although they introduced unspecified raw materials and co-products, they limited themselves to overall reactions in which the desired product and certain of the raw materials were specified in advance.

Without careful consideration, stoichiometries generated by the matrix based approach may involve any number of apparently simultaneous reactants and co-products (so that a single stoichiometry may in fact be decomposable into several sequential steps) with stoichiometric coefficients that may take any values. Buxton et al. (1997) were the first to tackle these problems directly, introducing linear whole number stoichiometry constraints together with limitations on the numbers of reactant and product species. Recently, Holiastos and Manousiouthakis (1998) introduced non-linear integer constraints to perform the same functions in the context of reaction cluster synthesis. They defined allowable chemical reactions according to the general characteristics of elementary reactions, which depict chemical transformations as they truly happen at the atomic scale, and applied their constraints accordingly. Using a modified branch and bound solution procedure they circumvented the non-linearity of their integer constraints. Extensions have predominantly concentrated on the application of integer programming techniques to the design of simplified reaction mechanisms for improved computational efficiency (Androulakis, 2000; Edwards et al., 2000; Sirdeshpande et al., 2001).

The key advantage of such information based systems is that they can provide kinetic information for the preliminary screening of reaction routes. Knight (1995) employed computational chemistry involving statistical mechanics and probability theory to determine products, their distribution and the reaction rates, while Mavrovouniotis and Bonvin (1995) used chemometrics - the simulation of reaction systems with kinetic models and principal factor analysis - to identify the major pathways. Consequently, the information and computational requirements of these approaches are large.

Although the predictive power of the matrix based approach is poor, since it provides much less information, much simpler criteria can be applied to identify promising candidate stoichiometries, or at least to eliminate poor alternatives. Simple economic criteria, based only on the values of products and reactants, have been employed by Fornari and Stephanopoulos (1994b). Gibbs free energy of reaction has been used to provide an initial indication of the cost feasibility of a process: conversion, yield, recycle flows, difficulty of separation etc.
(Fornari and Stephanopoulos, 1994b), to indicate the directionality and reversibility of reaction steps (Mavrovouniotis and Bonvin, 1995), to determine equilibrium concentrations among reacting species (Crabtree and El-Halwagi, 1994) and to provide an upper limit for thermodynamic feasibility (Agnihotri and Motard, 1980; Fornari et al., 1994a, b, 1989; May and Rudd, 1976; Rotstein et al., 1982). A Gibbs free energy change of reaction of 10 kcal/gmol has long been accepted to provide an upper bound for the thermodynamic feasibility of reactions (Rotstein et al., 1982). Rotstein et al. (1982) use this criterion to determine the temperature range over which reactions are thermodynamically feasible.

The only documented reaction route design technique to take explicit account of environmental issues is that of Crabtree and El-Halwagi (1994). In order to select an innocuous stoichiometry, they imposed simple concentration limits on certain compounds in the reactor effluent stream. However, this approach does not provide a consistent method of assessing the environmental impact of alternative reaction routes, since only the effluent concentrations of certain compounds were considered (not the impacts of all compounds). Furthermore, it is unlikely that the reactor effluent, or even the by-products or co-products, would be emitted directly to the environment. Moreover, the input wastes associated with the raw materials and the impacts of downstream processing are not included.
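The 10 kcal/gmol thermodynamic screening criterion mentioned earlier in this section can be sketched as a simple temperature scan. The sketch below is a hedged illustration in Python: it assumes that the enthalpy and entropy of reaction are constant over the temperature range, so that ΔG(T) = ΔH - TΔS, and the function names and the numerical values are hypothetical rather than data for any reaction in this chapter.

```python
# Upper bound on the Gibbs free energy change of reaction used for
# screening, as quoted in the text (kcal/gmol).
CUTOFF_KCAL_PER_GMOL = 10.0


def delta_g(delta_h, delta_s, temperature):
    """Gibbs free energy change of reaction, assuming constant
    delta_h (kcal/gmol) and delta_s (kcal/(gmol K)); temperature in K."""
    return delta_h - temperature * delta_s


def feasible_temperatures(delta_h, delta_s, temperatures,
                          cutoff=CUTOFF_KCAL_PER_GMOL):
    """Return the temperatures at which delta_g does not exceed the cutoff."""
    return [t for t in temperatures
            if delta_g(delta_h, delta_s, t) <= cutoff]


# Hypothetical endothermic step: delta_h = 40 kcal/gmol,
# delta_s = 0.05 kcal/(gmol K), scanned at four temperatures.
print(feasible_temperatures(40.0, 0.05, [300, 500, 700, 900]))
```

For this made-up step the criterion is only met at the two higher temperatures, which is the kind of feasible temperature window the criterion is meant to expose.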
In the work presented here, a procedure for the rapid identification of alternative multi-step stoichiometries is developed. Material design principles are introduced to formalise the development of a set of co-materials, and an optimisation procedure, based around the matrix representation, is employed to extract stoichiometries from this set. Linear constraints are developed to limit the number of reactant and product species and to ensure that each stoichiometric step involves whole number stoichiometric coefficients. Thermodynamic, economic and environmental impact criteria are employed in the evaluation of the stoichiometries, while aspects of the Methodology for Environmental Impact Minimisation (MEIM) of Pistikopoulos et al. (1994) provide the framework for the environmental evaluation of alternatives. The application of the method is illustrated in this chapter through an example: the synthesis of production routes for the pesticide 1-naphthalenyl-N-methyl carbamate, also known as carbaryl. In Chapter 14, the main features of the methodology are further highlighted through a second case study: the production of acetic acid, an important aliphatic intermediate.
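The matrix representation and the whole number coefficient requirement can be illustrated with a small atom balance check. The Python sketch below is not the optimisation formulation of this chapter; it only verifies that a proposed stoichiometry balances every element and respects limits on the number of reactants and products. The species set (methane combustion), the limits and all function names are assumptions chosen for illustration.

```python
ELEMENTS = ("C", "H", "O")

# Chemical formulas of a pre-selected species set (illustrative only).
SPECIES = {
    "CH4": {"C": 1, "H": 4},
    "O2": {"O": 2},
    "CO2": {"C": 1, "O": 2},
    "H2O": {"H": 2, "O": 1},
}


def atom_matrix(species, elements):
    """Rows are elements, columns are species (in dictionary order)."""
    return [[formula.get(e, 0) for formula in species.values()]
            for e in elements]


def is_balanced(matrix, nu):
    """Atom balance: every element row times the coefficient vector is 0.
    Negative coefficients are reactants, positive ones are products."""
    return all(sum(a * n for a, n in zip(row, nu)) == 0 for row in matrix)


def is_acceptable(matrix, nu, max_reactants=2, max_products=2):
    """Whole number coefficients, balanced atoms and limited numbers of
    reactant and product species (the limits are hypothetical)."""
    if not all(isinstance(n, int) for n in nu):
        return False
    reactants = sum(1 for n in nu if n < 0)
    products = sum(1 for n in nu if n > 0)
    return (is_balanced(matrix, nu)
            and 0 < reactants <= max_reactants
            and 0 < products <= max_products)


A = atom_matrix(SPECIES, ELEMENTS)
print(is_acceptable(A, [-1, -2, 1, 2]))   # CH4 + 2 O2 -> CO2 + 2 H2O
print(is_acceptable(A, [-1, -1, 1, 2]))   # oxygen does not balance
```

In the procedure of this chapter the coefficients are decision variables of an optimisation problem rather than inputs to a check; the sketch only shows what a feasible solution has to satisfy.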
7.2 IDENTIFICATION OF ENVIRONMENTALLY BENIGN STOICHIOMETRIES
The problem addressed here may be stated as follows:

Given a desired organic product,
Identify a set of candidate multi-step organic reaction stoichiometries for the production of the desired product which are both economically and environmentally promising.

A three step procedure is applied, involving: (i) selection of co-material groups, (ii) determination of a set of candidate co-materials using group based molecular design techniques and (iii) identification of a set of promising candidate multi-step stoichiometries using the matrix based representation system and an optimisation procedure incorporating aspects of the MEIM. The use of such a structured, stepwise procedure reduces the multi-step stoichiometry identification problem to a manageable size. The key to the procedure is the introduction of co-material design (steps (i) and (ii)). With the product and stoichiometric co-materials known, the identification of feasible reaction stoichiometries is no longer an open ended problem. The steps of the procedure are described in the following sections.
7.3 CO-MATERIAL DESIGN

7.3.1 Introduction
Co-material design is based on the observation that much organic chemistry essentially consists of reorganising functional groups, through additions, substitutions and eliminations, so that co-materials are expected to contain (at least) the chemical groups present in the desired product. According to this observation, a group based computer aided design approach is adopted for co-material design. The aim of this approach is to systematically enumerate a set of alternative stoichiometric co-material candidates from a group set selected according to the groups present in the product, those present in any existing industrial co-materials, the types of chemistries to be considered (e.g. aromatic or aliphatic) and other considerations such as property constraints.

Groups are employed as the molecular building blocks rather than atoms for several reasons. First of all, this considerably reduces the combinatorial size of the molecular generation problem without much loss of generality - very many organic compounds can be constructed using only a small number of groups. Secondly, a suitable choice of groups (e.g. UNIFAC groups) gives direct access to physico-chemical, thermodynamic and environmental properties through group contribution methods. Finally, with appropriate group bonding restrictions, such a method provides a short cut to structurally and chemically feasible molecules, hence significantly reducing molecular screening requirements.

Any of the molecular design techniques reviewed by Buxton (2002) may be applied to generate sets of candidate materials. However, of the variety of available techniques, only the enumeration and knowledge based approaches are specifically designed to explicitly enumerate molecules from a pre-selected set of groups. All other approaches can be viewed as implicit enumeration strategies, in which the aim is to identify optimal structures through evolution or optimisation without explicitly constructing all alternatives.
Thus, the knowledge based and enumeration approaches represent the best candidates for use in co-material design. Of these, the most general approach is that of Gani and coworkers, as reported by Constantinou et al. (1996). This procedure is UNIFAC group based, and includes in the enumeration algorithm rules designed to ensure that only structurally and chemically feasible molecules result from the molecular design exercise. These two features make this approach the most attractive starting point for co-material design. Although structural and chemical feasibility rules feature in the other group based techniques, Derringer and Markham (1985) focussed only on polymers, Joback, Stephanopoulos and coworkers (1984, 1989, 1995) employed a generate and test paradigm, applying their rules only after generating all possible combinations of groups, and Porter et al. (1991) considered only certain homologous series.

The computer aided product design (CAPD) approach reported by Constantinou et al. (1994, 1996) is based on a system of group classification and categorisation. A total of one hundred and eight unique UNIFAC groups are featured in the technique, including nine aromatic groups. These groups are divided into nine classes and five categories. The class of the group (0 - 4) represents the number of free attachments of the group (i.e. the group valency) and the category signifies the level of restriction for bonding with other groups - the higher the category the tighter the restrictions. The aromatic groups are placed in classes 5 - 8, while class zero consists of some simple complete molecules.

The molecular design algorithm is based on a set of primary and secondary conditions. The primary conditions ensure structural and chemical feasibility, firstly by guaranteeing that the complete compound has zero valency and secondly that it obeys the principles of chemistry. These principles have been embodied in a set of rules which determine the maximum permissible number of groups from any category which can be present in a molecule and the permissible combinations of groups from the different categories. The secondary conditions are related to restrictions arising from the limited validity of the group contribution property prediction methods. The rules based on the primary conditions are divided into three sets, one each for acyclic, cyclic and aromatic molecules, and the UNIFAC groups have been divided into three sets (which share many common groups) according to the desired molecular structure. From these group sets, the rules allow for the design of cyclic and acyclic molecules of up to twelve groups and of aromatic molecules of up to eighteen groups with a maximum of three aromatic rings. The molecular design algorithm has been developed to systematically generate all molecules which satisfy these conditions. However, some feasible structures are rejected because of doubtful stability or because group parameters are not available.
Nevertheless, despite this conservatism, this technique can potentially generate thousands of molecules (Constantinou et al., 1996), which is more than adequate for co-material design. Furthermore, the approach provides very well for the inclusion of the additional structural restrictions which may be necessary in co-material design. Full details of group classification, categorisation and division, and of the primary chemical feasibility rules are provided in Constantinou et al. (1996), and a description of the enumeration algorithm is presented in Gani et al. (1991). It is these rules which form the basis of the co-material design procedure presented in the following sections. In addition to these rules, the co-material design procedure features other rules based on engineering and chemical insight, which are designed to reduce the size of the enumeration problem.
7.3.2 Co-Material Design Procedure
GROUP PRE-SELECTION
Group pre-selection is the first step towards designing co-material molecules and has the most direct effect on the number of molecules generated. To restrict the size of the enumeration problem, the following simple rules are employed to guide group pre-selection:
(i) select the groups present in the product,
(ii) select the groups present in any existing industrial raw materials, co-products or by-products,
(iii) add groups which provide the basic building blocks for the functionalities of the product or of similar functionalities,
(iv) add groups from the group sets for the desired chemistry (cyclic, acyclic or aromatic) and
(v) reject groups which violate property restrictions (e.g. chloro groups may violate environmental restrictions, Gani et al., 1991).
CO-MATERIAL ENUMERATION FORMULATION
The co-material enumeration formulation consists of four sets of equations: chemical feasibility rule equations (based on the rules provided by Constantinou et al., 1996), the octet rule for structural feasibility, additional problem specific structural restrictions and the objective function. It is assumed that this approach provides all interesting organic co-materials and that all generated molecules are chemically feasible. The existence of generated molecules may be verified from chemistry literature (Compounds, 1996), although such sources tend to include rare compounds which may be unlikely co-materials. The sets employed in the algorithm are shown in Table 1.
Table 1: Co-Material Enumeration Model Sets
J     chemical groups
CL    group class
CT    group category
R     chemical feasibility rules
Chemical Feasibility Rules
In Constantinou et al. (1996), the chemical feasibility rules are given according to the categories of groups. Category one groups have no bonding restrictions. According to Gani et al. (1991), category two groups of classes 1 - 4 are special groups which can appear more than once but cannot be connected with each other or with another group from the same or higher category. Since there are only six category two groups in classes 1 - 4, only one of which (the chloro group) is included in the example problems considered here, no general rules reflecting these restrictions were included in the co-material enumeration formulation. To avoid violation of these restrictions, integer constraints are instead included on a case by case basis.
For categories 3 - 5, the chemical feasibility rules are presented in tabular form, with a separate table for acyclic, cyclic and aromatic molecules in Constantinou et al. (1996). In each table the columns are: the total number of groups in a molecule, the largest class of group present, the number of groups from this largest class, the maximum allowable number of groups from category 3, the maximum allowable number of groups from category 4, the maximum allowable number of groups from category 5, the total number of groups allowed from categories 3, 4 and 5 together, and the total number of groups allowed from categories 4 and 5 together. Thus, each row in the tables represents a unique set of rules for the allowable numbers and combinations of groups from categories 3, 4 and 5 according to the total number of groups, the largest class of group present in the molecule and the number of groups from this largest class. Above a certain total number of groups, it is possible to construct molecules with the same total number of groups in which the largest class of group is different, and in which the number of groups from this largest class is different. Thus, there can be several rows in the table, and therefore several rule sets, for a particular total number of groups.
In order to enumerate co-materials, each table column is first written as an R × 1 vector, where R is the number of rule sets (i.e. the number of rows in the rule table). However, the treatment of classes is somewhat different than in the tables. Instead of writing a largest class vector and a number of groups from this largest class vector, two vectors are written for each class, one which gives a lower bound and a second which gives an upper bound on the allowable number of groups from each class. For classes above the maximum for the particular rule set, both lower and upper bounds are set to zero. For the largest class, both lower and upper bounds are given the appropriate value for the rule set, and for classes below the maximum, the lower bound is set to zero and the upper bound is given the value of the total number of groups minus the number of groups from the largest class. The rules can then be written as the following equations. First of all, an R × 1 vector of binary variables $d_r$ is introduced such that:
$$\sum_{r} d_r = 1 \tag{1}$$
This vector is used throughout the equations to ensure that only one rule set r is active at any one time. The total number of groups in a molecule is then given by:
$$\sum_{j} \sum_{cl} \sum_{ct} n_{j,cl,ct} = \sum_{r} d_r\, n_r^{t} \tag{2}$$
where $n_{j,cl,ct}$ is defined as a positive integer variable which represents the number of groups j which appear in a molecule, and $n_r^{t}$ is the total number of groups in the rule set r. cl and ct are the class and category of group j respectively; each group is given a unique class and category assignment by the following equation:
[Equation (3): the body of this equation is not legible in the source.]
This equation allows $n_{j,cl,ct}$ to be non-zero only for $cl = cl'$ and $ct = ct'$, while for all other combinations of cl and ct, $n_{j,cl,ct}$ must be zero. The allowable number of groups from each class is given by:
$$\sum_{j} \sum_{ct} n_{j,cl,ct} \ge \sum_{r} d_r\, n_r^{cl,min}, \quad \forall cl \in CL \tag{4}$$
$$\sum_{j} \sum_{ct} n_{j,cl,ct} \le \sum_{r} d_r\, n_r^{cl,max}, \quad \forall cl \in CL \tag{5}$$
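where $n_r^{cl,min}$ and $n_r^{cl,max}$ are the minimum and maximum numbers of groups allowed from class cl in rule set r. The numbers of groups from categories 3, 4 and 5 are limited by:
$$\sum_{j} \sum_{cl} n_{j,cl,3} \le \sum_{r} d_r\, n_r^{ct3} \tag{6}$$
$$\sum_{j} \sum_{cl} n_{j,cl,4} \le \sum_{r} d_r\, n_r^{ct4} \tag{7}$$
$$\sum_{j} \sum_{cl} n_{j,cl,5} \le \sum_{r} d_r\, n_r^{ct5} \tag{8}$$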
where $n_r^{ct3}$, $n_r^{ct4}$ and $n_r^{ct5}$ are the maximum group numbers allowed from categories 3, 4 and 5 respectively in rule set r. The numbers of groups from categories 3, 4 and 5 summed together, and from categories 4 and 5 summed together, are similarly limited:
$$\sum_{j} \sum_{cl} \left( n_{j,cl,3} + n_{j,cl,4} + n_{j,cl,5} \right) \le \sum_{r} d_r\, n_r^{ct345} \tag{9}$$
$$\sum_{j} \sum_{cl} \left( n_{j,cl,4} + n_{j,cl,5} \right) \le \sum_{r} d_r\, n_r^{ct45} \tag{10}$$
where $n_r^{ct345}$ and $n_r^{ct45}$ are the maximum total group numbers allowed from categories 3, 4 and 5 summed together, and from categories 4 and 5 summed together, respectively.
Octet Rule
In order to ensure that complete molecules have zero valency, the octet rule is introduced:
$$\sum_{j} \sum_{cl} \sum_{ct} (2 - v_j)\, n_{j,cl,ct} = 2m \tag{11}$$
where $v_j$ is the valency of group j (equal to class for classes 0 - 4) and m is 1, 0, -1 or -2 for acyclic, monocyclic, bicyclic and tricyclic compounds respectively.
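The octet rule of equation 11 can be checked directly for a candidate set of group counts. The following is a minimal sketch, not the authors' implementation: the function name and the valencies listed in `VALENCY` are assumptions made for illustration (in particular, the aromatic groups ACH and AC are assumed to have two and three free attachments), and the class and category indices are dropped because each group carries only one class and category.

```python
def satisfies_octet_rule(group_counts, valency, m):
    """Return True when sum_j (2 - v_j) * n_j equals 2 * m (equation 11)."""
    lhs = sum((2 - valency[g]) * n for g, n in group_counts.items())
    return lhs == 2 * m


# Assumed valencies (free attachments) for a few UNIFAC-style groups.
VALENCY = {"CH3": 1, "CH2": 2, "OH": 1, "ACH": 2, "AC": 3}

ethanol = {"CH3": 1, "CH2": 1, "OH": 1}  # acyclic, so m = 1
toluene = {"ACH": 5, "AC": 1, "CH3": 1}  # monocyclic, so m = 0

print(satisfies_octet_rule(ethanol, VALENCY, 1))
print(satisfies_octet_rule(toluene, VALENCY, 0))
```

A fragment such as one CH3 plus one CH2 fails the check with m = 1, because one attachment is left free.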
Additional Structural Restrictions
In addition to the above rules, other restrictions may be introduced on a case by case basis to limit the numbers of co-materials designed. To prevent chemistries in which the co-materials are much simpler or much more complicated than the product, the maximum and minimum number of groups in each co-material can be bounded:
$$\sum_{j} \sum_{cl} \sum_{ct} n_{j,cl,ct} \ge n_{min} \tag{12}$$
$$\sum_{j} \sum_{cl} \sum_{ct} n_{j,cl,ct} \le n_{max} \tag{13}$$
where $n_{min}$ and $n_{max}$ are the minimum and maximum allowable numbers of groups. These constraints indirectly restrict chain length in homologous series. More direct constraints can be written by bounding the sums of the numbers of group types in any series.
Since the formation and cleavage of carbon-carbon bonds often requires extreme operating conditions which are likely to disrupt the chemistry of interest, it may be desirable to avoid co-materials which must undergo changes in carbon skeletal structure in order to arrive at the product. In general this is difficult to achieve, since co-material design focuses on types and numbers of groups, rather than on the connections between them. However, many undesirable materials can be avoided by imposing restrictions on the allowable types and numbers of groups. The numbers of branches, substituents, substituted sites and functional groups may also be limited in this way to avoid co-materials which are significantly more or less structurally complicated than the product. For example, if only monosubstituted benzenes are required, the following equations are introduced:
$$\sum_{cl} \sum_{ct} n_{ACH,cl,ct} = 5$$
$$\sum_{cl} \sum_{ct} n_{AC,cl,ct} = 1$$
and m is set to zero in the octet rule (equation 11) to allow only monocyclic structures. Additional restrictions can be incorporated in the stoichiometry identification exercise to avoid, or at least further reduce, the generation of chemistries which alter carbon skeletal structures, if required.
Objective Function
The objective is set as the minimisation of the total number of groups in a molecule:
$$\text{Minimise} \; \sum_{j} \sum_{cl} \sum_{ct} n_{j,cl,ct} \tag{14}$$
In this way, co-materials are enumerated subject to the above rules, starting with the simplest first.
Solution Procedure
The above formulation consists entirely of binary and integer variables in linear equations and is therefore a mixed integer linear programming (MILP) problem. In order to generate a set of co-materials, the problem is solved repeatedly with an integer cut written after each iteration to exclude the current optimal group combination from future iterations. However, it is the precise combination of numbers of groups which must be eliminated, not just the combination of group types (excluding group type combinations would eliminate homologous series). In order to do this the binary variable $CUT_{j,t}$ is introduced, which is related to $n_{j,cl,ct}$ as follows:
$$\sum_{t} t \, CUT_{j,t} = \sum_{cl} \sum_{ct} n_{j,cl,ct}, \quad \forall j \in J \tag{15}$$
$$\sum_{t} CUT_{j,t} = 1, \quad \forall j \in J \tag{16}$$
According to these equations, $CUT_{j,t}$ is non-zero only for $t = t'$, where $t'$ is the number of times group j occurs in a molecule. $CUT_{j,t}$ is zero for all other values of $t \ne t'$. The integer cuts are written in terms of $CUT_{j,t}$. Note that linear group contribution property prediction equations and bounds may be included in the above formulation without affecting the solution procedure. For example, to exclude co-materials with high toxicity, the following equation could be introduced based upon the lethal concentration (mol/l) causing 50% mortality in fathead minnow (LC50):
$$\sum_{j} \alpha_j \sum_{cl} \sum_{ct} n_{j,cl,ct} \le -\log LC50_{min} \tag{17}$$
where $\alpha_j$ is the toxicity contribution of group j from Gao et al. (1992), and $LC50_{min}$ is the lowest permitted LC50. Since $LC50_{min}$ is fixed, this equation is linear.
ADDITIONAL MOLECULES
To complete any stoichiometry, it may be necessary to include some simple additional molecules, which cannot be systematically designed using the above procedure. A set of simple complete molecules appears as class zero in Constantinou et al. (1996). However, further molecules may be required on a case by case basis according to any existing industrial stoichiometries and the type of chemistries to be considered. Examples of such molecules include oxygen, hydrogen, hydrogen chloride or other hydrogen halides, chlorine or other halogen molecules, carbon monoxide and carbon dioxide. A subset of these, or a larger set, may be selected as required as the final step of co-material design.
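The repeated solve-and-cut loop described under Solution Procedure can be illustrated without an MILP solver. The sketch below is an assumption-laden stand-in rather than the published algorithm: it replaces the MILP by brute-force search, applies only the octet rule (the chemical feasibility rule sets are omitted), and uses hypothetical names throughout. Each accepted combination is stored as an exact tuple of group counts, which plays the role of the integer cut on numbers of groups rather than on group types.

```python
from itertools import product


def enumerate_comaterials(valency, m, n_max, limit):
    """Return group-count combinations in order of increasing total groups.

    Brute-force stand-in for the MILP: each pass finds the smallest feasible
    combination not yet excluded, then excludes it with a "cut".
    """
    groups = sorted(valency)
    cuts = set()  # exact count vectors already found (the integer cuts)
    found = []
    while len(found) < limit:
        best = None
        for counts in product(range(n_max + 1), repeat=len(groups)):
            total = sum(counts)
            if total == 0 or total > n_max or counts in cuts:
                continue
            octet = sum((2 - valency[g]) * n for g, n in zip(groups, counts))
            if octet != 2 * m:
                continue
            if best is None or total < sum(best):
                best = counts
        if best is None:
            break  # no feasible combination remains
        cuts.add(best)
        found.append(dict(zip(groups, best)))
    return found


# Assumed valencies for three groups; acyclic molecules (m = 1), up to 3 groups.
EXAMPLE_VALENCY = {"CH3": 1, "CH2": 2, "OH": 1}
for molecule in enumerate_comaterials(EXAMPLE_VALENCY, m=1, n_max=3, limit=10):
    print(molecule)
```

Because the cut removes only the exact count vector, two CH3 groups and two CH3 groups plus one CH2 group are both retained, so the homologous series survives.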
7.4 STOICHIOMETRY IDENTIFICATION FORMULATION
The multistep reaction stoichiometry identification problem can be defined as follows. Given (i) a desired product and desired production rate, (ii) a set of stoichiometric co-materials, (iii) cost information for each material and group contribution parameters for the corresponding group set, (iv) a set of role specification and chemistry constraints and (v) a range of reactor operating conditions, the objective is to determine a set of candidate multi-step reaction stoichiometries which are promising in terms of both economics and environmental impact. The model for the identification and economic and environmental evaluation of a single step reaction stoichiometry is presented below, followed by a description of the solution algorithm in which this model is used to develop multistep stoichiometries. The model consists of seven sets of equations: an atom balance, whole number stoichiometries constraints, role specification constraints, chemistry constraints, carbon structure constraints, pure component property prediction equations and a reactor process model. The sets employed in the model are shown in Table 2.
Table 2: Stoichiometry Identification Model Sets
E           elements
S           species
CS (⊂ S)    carbon containing species
J           chemical groups
The formulation is based on the assumption that chemical species undergo reactions either singly (e.g. thermal decomposition or isomerisation, ignoring any reagent, catalyst or solvent effects) or at most in pairs, so that the number of reactants is limited to at most two. An upper limit is applied on the total number of materials in each stoichiometry (since the number of reactants is limited this effectively limits the number of co-products) and no competing reactions are considered (stoichiometry determination can only develop stoichiometric co-products, not side products). The following additional assumptions are made in the analysis: isobaric reactor operation at known pressure $P_{tot}$, gas phase reaction and perfect gas behaviour. Only the products and the reactants are costed, no process equipment or operating costs are considered, and the inherent inaccuracies in the property prediction techniques and thermodynamic models employed are accepted. Clearly, incorporating side reactions will add to the impacts so that the present results are lower bounds in this respect. The limits and cuts employed here are practical constraints which can be tightened or relaxed as desired. In principle, the thermodynamic model permits consideration of operation at any pressure. More detailed costing depends on more sophisticated process models.
7.4.1 Atom Balance
The starting point for this work is an atom balance equation which describes the chemistry of a particular set of S species composed of E elements (Rotstein et al., 1982). The atom balance is written as follows:
$$\boldsymbol{\alpha}\, \boldsymbol{\nu} = 0 \tag{18}$$
where $\boldsymbol{\alpha}$ is the E × S atomic matrix and $\boldsymbol{\nu}$ is the S × 1 column vector of stoichiometric coefficients $\nu_s$. It is assumed that the rank of the matrix $\boldsymbol{\alpha}$ is E. In general, S = E + m, so that m represents the degrees of freedom (DOFs) in the system. These DOFs represent stoichiometric coefficients which must be specified in order for the atom balance to be solved. The remaining S - m coefficients are then determined as functions of these. Clearly when m = 0, a unique solution exists, and when m ≥ 1, there is an infinity of solutions, corresponding to an infinity of possible stoichiometries.
7.4.2 Whole Number Stoichiometries Constraints
At the atomic level, chemical species react in whole number ratios so that, in general, meaningful chemical reactions are written in terms of stoichiometric coefficients which are rational numbers (i.e. whole numbers or numbers which can be expressed as ratios of whole numbers). Through multiplication by appropriate factors, stoichiometries involving only whole number coefficients can therefore be obtained. In such stoichiometries the product coefficient is a whole number which may be greater than or equal to unity. In their atom balances, Rotstein et al. (1982), and later Crabtree and El-Halwagi (1994), assigned the value unity to the product stoichiometric coefficient with no restrictions on the co-material coefficients. While this does not lead to any loss of generality, it potentially allows the development of an infinity of meaningless solutions in which the co-material coefficients are not rational numbers. In order to ensure that only solutions involving whole number stoichiometric coefficients are obtained, the following linear equations are introduced, where $\nu_p$ is the stoichiometric coefficient of the desired product:
$$\nu_p \ge 1 \tag{19}$$
$$x_s = \sum_{n=1}^{N} 2^{(n-1)}\, b_{ns}, \quad \forall s \in S \tag{20}$$
Assigning $\nu_p \ge 1$ allows the necessary flexibility in the value of the product stoichiometric coefficient so that there is no loss of generality. $x_s$ is a dummy coefficient which is defined as a positive, continuous variable. For each species s, this variable is expressed as a linear combination of binary (i.e. 0-1) variables $b_{ns}$. In this way, the continuous coefficients $x_s$ are constrained to take whole number values in the range from zero to an upper limit determined by the value of N. The real stoichiometric coefficients $\nu_s$ are related to the dummy coefficients $x_s$ as follows:
$$\nu_s = x_s - 2\, x_s\, ii_s, \quad \forall s \in S$$
The binary variable $ii_s$ is necessary since the coefficients $\nu_s$ may take positive or negative values. The variables $ii_s$ take the value zero if species s is a product ($\nu_s$ positive) and unity if species s is a reactant ($\nu_s$ negative), so that $ii_s$ is the reactant flag. This equation may be linearised using the Glover (1975) transformation, yielding:
$$\nu_s = x_s - 2\, y_s, \quad \forall s \in S \tag{21}$$
$$y_s - \nu_{max} \cdot ii_s \le 0, \quad \forall s \in S \tag{22}$$
$$x_s + \nu_{max} (ii_s - 1) - y_s \le 0, \quad \forall s \in S \tag{23}$$
$$y_s - x_s \le 0, \quad \forall s \in S \tag{24}$$
where $y_s$ is a dummy variable for the product $x_s\, ii_s$ and $\nu_{max}$ is the maximum permitted magnitude for any stoichiometric coefficient. The variables $y_s$ are defined as positive continuous variables. To ensure that they take non-zero values only when species s is a reactant, the following additional constraint is applied:
$$y_s \ge ii_s, \quad \forall s \in S \tag{25}$$
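A short numerical check makes the effect of equations 20 and 22-24 concrete. The sketch below uses hypothetical helper names and assumes $0 \le x_s \le \nu_{max}$: it builds $x_s$ from its binary digits and confirms that, for every such $x_s$ and either value of the reactant flag, the linear constraints leave exactly one admissible value of $y_s$, namely the product $x_s\, ii_s$.

```python
def x_from_bits(bits):
    """Equation 20: x = sum over n of 2**(n - 1) * b_n, with n starting at 1."""
    return sum(2 ** (n - 1) * b for n, b in enumerate(bits, start=1))


def y_interval(x, ii, nu_max):
    """Interval of y allowed by equations 22-24 together with y >= 0."""
    lower = max(0, x + nu_max * (ii - 1))  # equation 23 and non-negativity
    upper = min(nu_max * ii, x)            # equations 22 and 24
    return lower, upper


def linearisation_is_exact(nu_max):
    """True when the interval collapses to x * ii for every x and flag value."""
    for x in range(nu_max + 1):
        for ii in (0, 1):
            lower, upper = y_interval(x, ii, nu_max)
            if not (lower == upper == x * ii):
                return False
    return True


print(x_from_bits([1, 0, 1]))       # bits b_1 = 1, b_2 = 0, b_3 = 1
print(linearisation_is_exact(7))
```

With N = 3 binary variables the largest representable coefficient is 7, which is why `nu_max = 7` is used in the example call.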
Note that for any particular stoichiometry, $x_s$ and $\nu_s$ are non-zero only for the species involved and zero for all other species, while $y_s$ is non-zero only for the reactants involved and zero for all other species (including products and co-products).
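The atom balance of equation 18 and the whole number requirement can be illustrated for the simplest case of one degree of freedom (m = 1). The following sketch is illustrative only: the function name, the choice of methane combustion as the example and the ordering of species are assumptions, and exact rational arithmetic is used so that the result can be scaled to whole numbers. It fixes the last species as the free coefficient and solves for the rest.

```python
from fractions import Fraction
from math import lcm


def null_vector(alpha):
    """Return whole-number nu with alpha . nu = 0, assuming exactly one DOF."""
    rows = [[Fraction(v) for v in row] for row in alpha]
    n_rows, n_cols = len(rows), len(rows[0])
    pivots = []
    r = 0
    for c in range(n_cols):
        pivot = next((i for i in range(r, n_rows) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        pivot_value = rows[r][c]
        rows[r] = [v / pivot_value for v in rows[r]]
        for i in range(n_rows):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
        if r == n_rows:
            break
    free = [c for c in range(n_cols) if c not in pivots]
    if len(free) != 1:
        raise ValueError("expected exactly one degree of freedom")
    nu = [Fraction(0)] * n_cols
    nu[free[0]] = Fraction(1)  # the specified coefficient
    for row_index, c in enumerate(pivots):
        nu[c] = -rows[row_index][free[0]]
    scale = lcm(*(v.denominator for v in nu))
    return [int(v * scale) for v in nu]


# Rows: C, H, O. Columns (assumed order): CH4, O2, CO2, H2O.
ALPHA = [
    [1, 0, 1, 0],
    [4, 0, 0, 2],
    [0, 2, 2, 1],
]
print(null_vector(ALPHA))
```

Negative entries correspond to reactants and positive entries to products. With more than one degree of freedom the function raises an error, reflecting the infinity of solutions noted in the text.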
7.4.3 Role Specification Constraints
Role specification constraints (Fornari et al., 1994a, 1989) are used to restrict the participation of molecules in the stoichiometries; for example, to avoid certain stoichiometric co-products or to define a species as a raw material only. In order to apply such constraints the raw materials and products in any stoichiometry must be identified. Raw material identification is taken care of by the binary reactant flag $ii_s$ from the whole number stoichiometry constraints. Products are identified using the following equations:
$$x_s - y_s - \nu_{max} \cdot i_s \le 0, \quad \forall s \in S \tag{26}$$
$$x_s - y_s - (\nu_{max} + 1) \cdot i_s + \nu_{max} \ge 0, \quad \forall s \in S \tag{27}$$
where the $i_s$ are the binary elements of the vector $I_s$. Together with equations 20 and 21-24, these relationships assign the value zero or unity to $i_s$ when the stoichiometric coefficient $\nu_s$ is negative or positive, respectively. Thus, $i_s$ is the product flag. In order to relate the raw material and product flags, a third flag $iii_s$ is introduced which takes the value zero if species s is a raw material or a product, and unity if species s is not involved in the stoichiometry. The three flags are related as follows:
$$i_s + ii_s + iii_s = 1, \quad \forall s \in S \tag{28}$$
The role specification constraints are then posed simply by specifying the values of the flags in advance. For example, to define species s as a raw material only, it is excluded from being a co-product by setting $i_s = 0$. The role specification constraints may be written differently for different stoichiometry steps. The full list of the constraints used in the example presented later in this chapter is presented in Appendix A.
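The behaviour of the three flags can be checked by evaluating equations 26-28 for each possible stoichiometric coefficient. This is a sketch with hypothetical function names; it assumes the flags take the values described in the text (product, reactant, not involved) and that $y_s = x_s\, ii_s$ as enforced by the Glover constraints.

```python
def role_flags(nu_s):
    """Return (i_s, ii_s, iii_s): product, reactant and not-involved flags."""
    i_s = 1 if nu_s > 0 else 0
    ii_s = 1 if nu_s < 0 else 0
    return i_s, ii_s, 1 - i_s - ii_s


def satisfies_role_equations(nu_s, nu_max):
    """Evaluate equations 26-28 for one species with coefficient nu_s."""
    i_s, ii_s, iii_s = role_flags(nu_s)
    x_s = abs(nu_s)    # dummy whole-number coefficient
    y_s = x_s * ii_s   # non-zero only for reactants
    eq26 = x_s - y_s - nu_max * i_s <= 0
    eq27 = x_s - y_s - (nu_max + 1) * i_s + nu_max >= 0
    eq28 = i_s + ii_s + iii_s == 1
    return eq26 and eq27 and eq28


for nu_s in (-2, 0, 3):
    print(nu_s, role_flags(nu_s), satisfies_role_equations(nu_s, nu_max=4))
```

Fixing a flag in advance, for example requiring `i_s == 0` for a species, then corresponds to the role specification "raw material only".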
7.4.4 Chemistry Constraints
In addition to the role specifications, the binary flags are employed to develop knowledge based chemistry constraints. These are used to restrict the number of reactants and products involved in any stoichiometry, and to eliminate certain chemistries. According to Holiastos and Manousiouthakis (1998) an elementary reaction can involve up to three reacting molecules and, if the reaction is to be reversible, up to three product molecules. Furthermore, since the formation or cleavage of chemical bonds which occurs during an elementary reaction requires the orbitals of reacting molecules to come sufficiently close together and be correctly oriented, elementary reactions involving two reacting molecules are more likely than those involving three purely on statistical grounds. According to the same ideas, the number of different reacting species in any stoichiometry is here limited by the following:
$$\sum_{s} ii_s \le N_r^{max} \tag{29}$$
where $N_r^{max}$ is a problem specific maximum number of reactants. In the example presented in this chapter, $N_r^{max}$ is assigned the value two, which effectively eliminates side reactions except in the unlikely event of a simultaneous isomerisation. A problem specific upper limit on the total number of species $N_{sp}^{max}$ involved in any stoichiometry is also imposed according to the number of different species involved in the most complex step of the existing industrial routes to the product of interest.
$$\sum_{s} (i_s + ii_s) \le N_{sp}^{max} \tag{30}$$
Since there must be at least one reactant, this constraint limits the number of products to at most $N_{sp}^{max} - 1$. Note that these constraints limit only the numbers of different species involved in any stoichiometry, not their stoichiometric coefficients. However, all stoichiometric coefficients are constrained to be less than $\nu_{max}$, and can be further constrained by introducing the following equation if required:
$$x_s \le \nu_s^{max}, \quad \forall s \in S \tag{31}$$
where $\nu_s^{max}$ is the maximum permitted stoichiometric coefficient of species s. Other knowledge based chemistry constraints may be imposed directly on certain species. The following examples are provided for illustration; the full list of the chemistry constraints employed in the illustrative example problem is presented in Appendix A.
• species a and b must not react together:
$$ii_a + ii_b \le 1$$
• species a may only react with species b or species c:
$$ii_a - (ii_b + ii_c) \le 0 \tag{32}$$
• species c may only be produced by reacting species a and b:
$$2\, i_c - (ii_a + ii_b) \le 0 \tag{33}$$
In order to identify a set of candidate stoichiometries, the stoichiometry selection problem must be solved iteratively, with integer cuts to exclude previous solutions. The binary flags provide the mechanism for this. Since the same combination of reactants may be involved in several stoichiometries in which the product is derived from the same underlying reaction but with redistribution of the co-products, the integer cuts are written to exclude only the combinations of raw materials observed in the solutions. In this way, such redundant solutions are avoided.
7.4.5 Carbon Structure Constraints
According to section 3.2, constraints may be needed to prevent chemistries in which carbon-carbon bonds are broken or formed. However, since the atom balance contains no structural information it is not possible to write such constraints directly. Furthermore, since the need to break or form carbon-carbon bonds depends on the set of co-materials and the nature of the chemistry to be considered, carbon structure constraints can only be developed on a case by case basis. Moreover, the development of general constraints is hampered by the fact that only the final product is known in advance (it is not known which co-materials will be reactants or co-products). Despite these difficulties, general constraints which infer certain restrictions on carbon structural changes are possible, and under certain special circumstances, carbon structure changes can be eliminated. Noting that $y_s$ is non-zero only for reactants and $x_s - y_s$ is non-zero only for products, the following constraint may be used to prevent a net change in the number of carbon-carbon bonds in a stoichiometry:
$$\sum_{s} y_s\, N_s^{cb} = \sum_{s} (x_s - y_s)\, N_s^{cb} \tag{34}$$
where $N_s^{cb}$ is the number of carbon-carbon bonds in species s. The net gain or loss of carbon-carbon bonds may be allowed by writing this constraint as an inequality. For stoichiometries involving straight chain acyclic molecules in which the carbon skeleton is uninterrupted, the following prevents any change in the carbon skeleton (and permits only one carbon containing reactant and no carbon containing co-products):
[Equation (35), written for all $cs \in CS$: the body of this equation is not legible in the source.]
where $N_p^{cb}$ is the number of carbon bonds in the product. Clearly, this is an extremely restrictive constraint. Less restrictive constraints can be written for the same type of chemistry. For example, the formation of carbon-carbon bonds for stoichiometries involving straight chain acyclic molecules with uninterrupted carbon chains can be prevented by considering the relationship between the stoichiometric coefficients of the reactants and the product, if only one carbon containing reactant is allowed. Consider the production of a product with a single carbon-carbon bond from reactants containing up to six such bonds. Allowing only a single carbon containing reactant and disallowing the formation of carbon-carbon bonds, the following reaction schemes are permitted:
C-C                  →  C-C
C-C-C                →  C-C + C
C-C-C-C              →  2(C-C)
C-C-C-C-C            →  2(C-C) + C
C-C-C-C-C-C          →  3(C-C)
C-C-C-C-C-C-C        →  3(C-C) + C
while schemes such as:
2(C)                 →  C-C
2(C-C-C)             →  3(C-C)
2(C-C-C-C-C)         →  5(C-C)
2(C-C-C-C-C-C-C)     →  7(C-C)
are not. These reaction schemes imply that in an allowable stoichiometry, the following relationships between $\nu_p$ and $\nu_{cs}$ must be obeyed if species cs is selected as a reactant, according to the ratio of the number of carbon-carbon bonds in the reactant to that in the product:
If $0 \le N_{cs}^{cb} / N_p^{cb} \le 2$ then $\nu_p = -\nu_{cs}$
If $3 \le N_{cs}^{cb} / N_p^{cb} \le 4$ then $\nu_p = -2\nu_{cs}$
If $5 \le N_{cs}^{cb} / N_p^{cb} \le 6$ then $\nu_p = -3\nu_{cs}$
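The pattern in the permitted reaction schemes can be reproduced with a small lookup on the bond ratio. The sketch below is an interpretation rather than the authors' code: it assumes that the multiplier t in $\nu_p = -t\,\nu_{cs}$ applies when the ratio lies in the interval from 2t - 1 to 2t, that the product has at least one carbon-carbon bond, and that a reactant with fewer bonds than the product has no valid multiplier. The function name is hypothetical.

```python
from fractions import Fraction


def product_multiplier(n_cb_reactant, n_cb_product):
    """Return t such that nu_p = -t * nu_cs, or None if no interval applies.

    Assumes n_cb_product >= 1 and intervals 2t - 1 <= ratio <= 2t.
    """
    ratio = Fraction(n_cb_reactant, n_cb_product)
    t = 1
    while 2 * t - 1 <= ratio:
        if ratio <= 2 * t:
            return t
        t += 1
    return None


# Product with a single carbon-carbon bond, reactants with 1 to 6 such bonds.
for bonds in range(1, 7):
    print(bonds, product_multiplier(bonds, 1))
```

For the single-bond product this gives multipliers 1, 1, 2, 2, 3, 3, matching the six permitted schemes listed above.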
Since the nature of the relationship between $\nu_p$ and $\nu_{cs}$ depends on the bond ratio, the relationships are in fact quite general and can be applied to acyclic products with uninterrupted carbon chains featuring any number of carbon-carbon bonds. In addition, they can be extended for reactants with any number of such bonds. Clearly, a problem arises if $N_p^{cb} = 0$; however, this can be overcome by introducing the binary variable p, which takes the value of zero when $N_p^{cb} \ge 1$ and unity when $N_p^{cb} = 0$ according to:
$$1 - p \le N_p^{cb} \le (1 - p)\, N_{max}^{cb} \tag{36}$$
where $N_{max}^{cb}$ is the maximum number of carbon-carbon bonds featured in the species from the set CS. Incorporating p, the $\nu_p$ to $\nu_{cs}$ relationships are embodied in the following general constraints:
$$\sum_{t} (2t - 1)\, q_{t,cs} \le \frac{N_{cs}^{cb}}{N_p^{cb} + p} \le 2 \sum_{t} t\, q_{t,cs} + 0.99\, z_{cs}, \quad \forall cs \in CS \tag{37}$$
$$\nu_p (1 - p) - (1 - ii_{cs})\, \nu_{max} \le \Big( z_{cs} + \sum_{t} t\, q_{t,cs} \Big)\, y_{cs} \le (1 - z_{cs})\, \nu_p + p\, \nu_{max}, \quad \forall cs \in CS \tag{38}$$
where $q_{t,cs}$ and $z_{cs}$ are integer variables such that:
$$z_{cs} + \sum_{t} q_{t,cs} = 1, \quad \forall cs \in CS \tag{39}$$
so that for each species cs only one of $z_{cs}$ and the vector of $q_{t,cs}$ binary variables can take the value unity. Note that $y_{cs}$ is used in equation 38 so that the constraints affect only reacting species. To understand how these constraints function, consider the following:
• When $N_p^{cb} = 0$ and $N_{cs}^{cb} = 0$, $p = 1$ from equation 36, and in order that equation 37 be obeyed $z_{cs} = 1$ and all $q_{t,cs} = 0$. Thus, since $y_{cs}$ is a positive variable, whether $ii_{cs}$ is zero or one, $0 \le y_{cs} \le \nu_{max}$ from equation 38, which imposes no additional restriction on $y_{cs}$.
• When $N_p^{cb} = 0$ and $N_{cs}^{cb} \ge 1$, $p = 1$ from equation 36, and in order that equation 37 be obeyed $z_{cs} = 0$ and $q_{t',cs} = 1$ (and all $q_{t \ne t',cs} = 0$). Thus, whether $ii_{cs}$ is zero or one, $0 \le t' y_{cs} \le \nu_p + \nu_{max}$. This potentially restricts $y_{cs}$, but since chemistries in which products with no carbon-carbon bonds are developed by breaking up reactants with such bonds are not of interest here, this limitation is acceptable.
• When $N_p^{cb} \ge 1$ and $N_{cs}^{cb} < N_p^{cb}$, $p = 0$ from equation 36, and in order that equation 37 be obeyed $z_{cs} = 1$ and all $q_{t,cs} = 0$. Thus, $y_{cs} \le 0$ from the upper bound in equation 38, so that $ii_{cs}$ must equal zero for the lower bound to be feasible. This means that all species with fewer carbon-carbon bonds than the product are not permitted as reactants.
• When $N_p^{cb} \ge 1$ and $N_{cs}^{cb} \ge N_p^{cb}$, $p = 0$ from equation 36, and in order that equation 37 be obeyed $z_{cs} = 0$ and $q_{t',cs} = 1$ (and all $q_{t \ne t',cs} = 0$). Thus if $ii_{cs} = 0$, $0 \le t' y_{cs} \le \nu_p$, or if $ii_{cs} = 1$, $\nu_p \le t' y_{cs} \le \nu_p$. This means that if species cs is not a reactant, there is no additional limitation on $y_{cs}$, but if species cs is a reactant, $y_{cs}$ and therefore $\nu_{cs}$ must be related to $\nu_p$ in one of the ways prescribed above, depending on the bond ratio.
Note that once the product is known, equation 36 can be solved for p, so that equations 37 and 39 can be solved for $z_{cs}$ and $q_{t,cs}$. p, $z_{cs}$ and $q_{t,cs}$ can then be entered as parameters in equation 38, which becomes linear as a result.
7.4.6 Thermodynamic and Environmental Property Equations
The enthalpy of formation and Gibbs Free Energy of formation of each species are required to estimate the enthalpy and Gibbs Free Energy of reaction in the process model. Pure component heat capacities are required for the energy balance, and pure component toxicity is also needed.
ENTHALPY OF FORMATION
According to Perry and Green (1984), the enthalpy of formation at 298 K, $\Delta H_{298,s}^{f}$, of species s in kJ/mol can be found using the group contribution scheme of Verma-Doraiswamy:
$$\Delta H_{298,s}^{f} = 4.1868 \sum_{j} n_{sj}\, \delta H_j^{f,298}, \quad \forall s \in S \tag{40}$$
where $\delta H_j^{f,298}$ is the contribution of group j (in kcal/mol) from Perry and Green (1984).
GIBBS FREE ENERGY OF FORMATION
Also from Perry and Green (1984), the Gibbs Free Energy of formation $\Delta G_s^{f}(T_{oper})$ of species s in kJ/mol can be estimated using the group contribution techniques of Van Krevelen and Chermin (1951) with an accuracy of ±21 kJ/mol:
$$\Delta G_s^{f}(T_{oper}) = 4.1868 \left\{ \sum_{j} n_{sj}\, \alpha_j^{F} + \left[ \sum_{j} n_{sj}\, \beta_j^{F} + R \ln\!\left( \frac{\sigma_s}{\eta_s} \right) \right] T_{oper} \right\}, \quad \forall s \in S \tag{41}$$
where $T_{oper}$ is the reactor operating temperature, $\alpha_j^{F}$ and $\beta_j^{F}$ are the group contributions of group j in kcal/mol (from Perry and Green, 1984), R is the gas constant, $\sigma_s$ is the symmetry number of the molecule (the number of independent orientations which appear identical to an observer) and $\eta_s$ is the number of optical isomers. For molecules with no symmetrical orientations or optical isomers, $\sigma_s$ and $\eta_s$ are assigned the value of unity to avoid numerical problems. $\alpha_j^{F}$ and $\beta_j^{F}$ are valid for temperatures between 300 K and 1500 K, so that $T_{oper}$ is bounded as follows:
$$300 \le T_{oper} \le 1500 \tag{42}$$
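Equation 41 and the bound of equation 42 translate directly into a few lines of code. The sketch below is illustrative: the function name is hypothetical, the group names and parameter values in the example call are invented placeholders rather than tabulated Van Krevelen and Chermin contributions, and the gas constant is assumed to be expressed in kcal/(mol K) so that the factor 4.1868 converts the whole expression to kJ/mol.

```python
import math

R_KCAL = 1.98720e-3  # gas constant in kcal/(mol K); the unit choice is an assumption


def gibbs_formation_kj(counts, alpha_f, beta_f, t_oper, sigma=1, eta=1):
    """Gibbs Free Energy of formation in kJ/mol following equation 41."""
    if not 300 <= t_oper <= 1500:
        raise ValueError("T_oper outside the 300-1500 K validity range")
    a = sum(n * alpha_f[g] for g, n in counts.items())
    b = sum(n * beta_f[g] for g, n in counts.items())
    return 4.1868 * (a + (b + R_KCAL * math.log(sigma / eta)) * t_oper)


# Placeholder contributions for two hypothetical groups G1 and G2.
ALPHA_F = {"G1": -10.0, "G2": -5.0}
BETA_F = {"G1": 0.02, "G2": 0.03}
print(gibbs_formation_kj({"G1": 2, "G2": 1}, ALPHA_F, BETA_F, t_oper=500.0))
```

Leaving `sigma` and `eta` at their default of one makes the logarithmic term vanish, as the text prescribes for molecules with no symmetry or optical isomers.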
HEAT CAPACITY AND TOXICITY
The ideal gas molar heat capacity $C_{p,s}^{vap}$ (J/mol K) of species s is estimated using the following polynomial equation:
$$C_{p,s}^{vap} = \Big( \sum_{j} n_{sj}\, \Delta_j^{a} - 37.93 \Big) + \Big( \sum_{j} n_{sj}\, \Delta_j^{b} + 0.21 \Big) T + \Big( \sum_{j} n_{sj}\, \Delta_j^{c} - 3.91 \times 10^{-4} \Big) T^2 + \Big( \sum_{j} n_{sj}\, \Delta_j^{d} + 2.06 \times 10^{-7} \Big) T^3, \quad \forall s \in S \tag{43}$$
where the coefficients $\Delta_j^{a}$, $\Delta_j^{b}$, $\Delta_j^{c}$ and $\Delta_j^{d}$ are group contribution parameters from Joback and Stephanopoulos (1989).
In order to measure the short-term environmental impact of any material released, the toxicity of each species s is estimated using the group contribution techniques of Gao et al. (1992):
$$-\log LC50_s = \sum_{j} \alpha_j\, n_{sj}, \quad \forall s \in S \tag{44}$$
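Both property estimates are simple sums over group counts, as the following sketch shows. The function names are hypothetical, the logarithm in equation 44 is assumed to be base 10, and the numerical contributions in the example are illustrative values that should be checked against the cited tables before any real use.

```python
def heat_capacity(counts, params, temperature):
    """Ideal gas heat capacity in J/(mol K) following equation 43.

    params maps each group to a tuple (a, b, c, d) of contributions.
    """
    a = sum(n * params[g][0] for g, n in counts.items()) - 37.93
    b = sum(n * params[g][1] for g, n in counts.items()) + 0.21
    c = sum(n * params[g][2] for g, n in counts.items()) - 3.91e-4
    d = sum(n * params[g][3] for g, n in counts.items()) + 2.06e-7
    t = temperature
    return a + b * t + c * t**2 + d * t**3


def lc50(counts, alpha):
    """LC50 from -log10(LC50) = sum_j alpha_j * n_sj (equation 44, base 10 assumed)."""
    return 10 ** (-sum(n * alpha[g] for g, n in counts.items()))


# Illustrative contributions for a methyl group (to be verified against the source tables).
CP_PARAMS = {"CH3": (19.5, -8.08e-3, 1.53e-4, -9.67e-8)}
print(heat_capacity({"CH3": 2}, CP_PARAMS, 298.0))

# Invented toxicity contribution for a hypothetical group X.
print(lc50({"X": 2}, {"X": 1.0}))
```

A higher summed toxicity contribution gives a lower LC50, which is why a lower bound on LC50 becomes an upper bound on the linear group sum.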
7.4.7 Reactor Process Model Equations
The process model is based on a single reactor in which chemical equilibrium is achieved. Unreacted raw materials are recycled, assuming that they are separated cleanly from the products and ignoring, for the present, the necessary separation technology. The chemical equilibrium position is located by minimising the system Gibbs Free Energy, and since this position is independent of the reactor, it is not necessary to prepostulate the reactor type (e.g. PFR, CSTR etc.).

COMPONENT MASS BALANCES
The mass balance around the reactor is written as follows for all species in the set S:

$$n_{sf} - n_{si} = \varepsilon_r \nu_s, \quad \forall s \in S \tag{45}$$
where $n_{si}$ is the total molar feed flow of component s to the reactor, $n_{sf}$ is the molar flow of component s in the reactor effluent and $\varepsilon_r$ is the extent of reaction. Recalling that $x_s$ is non-zero for all species involved in the reaction and $y_s$ is non-zero only for the reactants, and taking a flow basis of 10 kmol/hr, the following restrictions are written for $n_{si}$ and $n_{sf}$:

$$n_{si} = 10 y_s, \quad \forall s \in S \tag{46}$$

$$n_{sf} \le 10 x_s, \quad \forall s \in S \tag{47}$$

According to equation 46, $n_{si}$ is non-zero only for the reactants in a particular stoichiometry, and according to equation 47, $n_{sf}$ can be non-zero only for the components involved in a particular stoichiometry. For all other components in the set S, both $n_{si}$ and $n_{sf}$ are zero. To ensure all reactions exhibit acceptable conversion, the extent is bounded as follows:

$$\varepsilon_r \ge \varepsilon_r^{lo} \tag{48}$$
For the reactants, the product $-\varepsilon_r \nu_s$ represents the consumption rate in the reactor and therefore the fresh feed demand. For the products, $\varepsilon_r \nu_s$ represents the production rate. Thus, the fresh feed demand and production rate in kmol/hr are given by:

$$F_s = -\varepsilon_r \nu_s\, ii_s, \quad \forall s \in S \tag{49}$$

$$P_s = \varepsilon_r \nu_s\, i_s, \quad \forall s \in S \tag{50}$$
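As a quick numerical illustration of equations 45, 46, 49 and 50, the sketch below computes the effluent flows, fresh feed demand and production rate for a toy stoichiometry A + B → C. The function name, the species labels and the rule that the reactant and product flags follow the sign of the stoichiometric coefficient are assumptions made for this example only.

```python
from typing import Dict

def reactor_balance(nu: Dict[str, float], y: Dict[str, float],
                    extent: float, basis: float = 10.0) -> Dict[str, Dict[str, float]]:
    """Apply the component balances for one stoichiometry.

    nu     : stoichiometric coefficients (negative for reactants)
    y      : feed variables y_s (non-zero only for reactants)
    extent : extent of reaction in kmol/hr
    basis  : flow basis in kmol/hr (10 in the text)
    """
    flows = {}
    for s, v in nu.items():
        ii = 1 if v < 0 else 0          # reactant flag (assumed from the sign of nu)
        i = 1 if v > 0 else 0           # product flag (assumed from the sign of nu)
        n_in = basis * y[s]             # equation 46
        n_out = n_in + extent * v       # equation 45 rearranged for the effluent flow
        if n_out < 0:
            raise ValueError(f"extent too large: negative effluent flow for {s}")
        flows[s] = {
            "n_i": n_in,
            "n_f": n_out,
            "F": -extent * v * ii,      # equation 49, fresh feed demand
            "P": extent * v * i,        # equation 50, production rate
        }
    return flows

flows = reactor_balance({"A": -1, "B": -1, "C": 1},
                        {"A": 1, "B": 1, "C": 0}, extent=4.0)
```

With an extent of 4 kmol/hr, 6 kmol/hr of each reactant leaves the reactor unconverted and is recycled, while 4 kmol/hr of each reactant must be supplied as fresh feed.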
ENERGY BALANCE
It is assumed that the fresh feed enters the reaction block at 298K and the products leave the block at the reactor operating temperature $T_{oper}$. The reactor energy demand therefore has two contributions, one from fresh feed pre-heat and one from the heat of reaction. The heat of reaction is estimated in three steps. First, the total feed (comprising fresh feed and recycled reactants) is cooled from the operating temperature to 298K, the reaction is then performed and finally the entire reactor effluent is reheated to the operating temperature. The reactor energy demand $Q_{reactor}$, in kJ/hr per mole of product, is given by:

$$Q_{reactor}P_p = \sum_s(-\varepsilon_r\nu_s\, ii_s)\int_{298}^{T_{oper}} C_{p,s}^{vap}\,dT + \sum_s n_{si}\int_{T_{oper}}^{298} C_{p,s}^{vap}\,dT + 1000\,\varepsilon_r\Delta H_R^{298} + \sum_s n_{sf}\int_{298}^{T_{oper}} C_{p,s}^{vap}\,dT$$

where $P_p$ is the production rate of desired product and $C_{p,s}^{vap}$ is the vapour heat capacity of component s. Substituting for $n_{sf}$ from equation 45, this reduces to:

$$Q_{reactor} = \frac{1}{P_p}\left\{1000\,\varepsilon_r\Delta H_R^{298} + \varepsilon_r\sum_s \nu_s(1 - ii_s)\int_{298}^{T_{oper}} C_{p,s}^{vap}\,dT\right\} \tag{51}$$
The heat of reaction at 298K, $\Delta H_R^{298}$ in kJ/mol, is estimated from:

$$\Delta H_R^{298} = \sum_s \nu_s \Delta H_{fs}^{298} \tag{52}$$
SYSTEM GIBBS FREE ENERGY
The Gibbs Free Energy of the reaction system $G_{sys}$ (in kJ/hr since $n_{sf}$ is a molar flow) may be estimated from the following expression (for perfect gases or real gases at low pressure):

$$\frac{G_{sys}}{1000} = \sum_s n_{sf}\Delta G_{fs} + RT_{oper}\sum_s n_{sf}\ln\left[\frac{n_{sf}P_{oper}}{\sum_{s'} n_{s'f}}\right]$$

Differentiating this expression with respect to $n_{sf}$ at constant temperature and pressure leads to the condition for chemical equilibrium:

$$\Delta G_R = -RT_{oper}\ln K \tag{53}$$

where R is the gas constant, and the Gibbs Free Energy change of reaction per mole of product $\Delta G_R$ in kJ/mol and the reaction equilibrium constant K are given by:

$$\Delta G_R = \sum_s \nu_s\Delta G_{fs} \tag{54}$$

$$K = \prod_s\left[\frac{n_{sf}P_{oper}}{\sum_{s'} n_{s'f}}\right]^{\nu_s}$$
In order that the chemical equilibrium condition be obeyed, a reaction with a large positive or negative $\Delta G_R$ must exhibit a very small or a very large K respectively. A very small K implies very small $n_{sf}$ values for the products, while a very large K implies very small $n_{sf}$ values for the reactants. In some cases, these $n_{sf}$ values are so close to zero that the optimisation problem becomes poorly scaled. Introducing an $n_{sf}$ lower bound does not solve the problem since any such bound may render the equilibrium condition infeasible. Thus, rather than imposing the chemical equilibrium condition, $G_{sys}$ is minimised directly instead, with an $n_{sf}$ bound in place to prevent scaling problems. While this prevents reactions with large positive or negative $\Delta G_R$ from achieving equilibrium, since the $n_{sf}$ lower bound is small, the solutions are barely affected. The $n_{sf}$ bound is written as follows, with the binary flag $iii_s$ introduced so that $n_{sf}$ takes the value zero for all species not involved in the stoichiometry of immediate interest:

$$n_{sf} \ge 1 \times 10^{-4}(1 - iii_s), \quad \forall s \in S \tag{55}$$
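Since $n_{sf}$ is zero for some s, $iii_s$ must also be introduced into the $G_{sys}$ expression:

$$\frac{G_{sys}}{1000} = \sum_s n_{sf}\Delta G_{fs} + RT_{oper}\left[\sum_s n_{sf}\ln\big((n_{sf} + iii_s)P_{oper}\big) - \sum_s n_{sf}\ln\Big(\sum_{s'} n_{s'f}\Big)\right] \tag{56}$$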
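To make the role of the flag $iii_s$ concrete, the following sketch evaluates the right-hand side of equation 56 for given effluent flows. The function name, the value and units chosen for the gas constant, and the toy species are assumptions for illustration; the property data are not taken from the source.

```python
import math
from typing import Dict

R_GAS = 8.314e-3  # kJ/(mol K); the unit convention is an assumption of this sketch

def gibbs_rhs(n_f: Dict[str, float], dG_f: Dict[str, float],
              iii: Dict[str, int], T: float, P: float = 1.0) -> float:
    """Right-hand side of equation 56 (i.e. G_sys / 1000).

    n_f  : effluent molar flows n_sf
    dG_f : Gibbs energies of formation at T
    iii  : 1 for species not involved in the stoichiometry, else 0
    """
    total = sum(n_f.values())
    formation = sum(n * dG_f[s] for s, n in n_f.items())
    # For an uninvolved species n = 0 and iii = 1, so the term is 0 * ln(P) = 0
    # and the logarithm never sees a zero argument.
    mixing = sum(n * math.log((n + iii[s]) * P) for s, n in n_f.items())
    mixing -= total * math.log(total)
    return formation + R_GAS * T * mixing

with_inert = gibbs_rhs({"A": 2.0, "X": 0.0}, {"A": -10.0, "X": 5.0},
                       {"A": 0, "X": 1}, T=300.0)
```

In an actual evaluation step this function would be the objective passed to an NLP solver, with the flows, extent and temperature as variables; here it is only evaluated at a fixed point.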
Crabtree and El-Halwagi (1994) use a similar approach to deal with species not involved in a particular stoichiometry, although they reported using the reaction equilibrium condition to determine the reaction equilibrium position. Note that the chemical equilibrium condition provides a unique relationship between $n_{sf}$ and $T_{oper}$, whereas in the $G_{sys}$ expression they are independent variables. Thus an additional temperature bound is required to ensure that temperature is consistent with the extent bound from equation 48. This bound is calculated by solving the following equation for T':

$$\Delta G_R(T') = -RT'\ln K^{lo} \tag{57}$$

where $K^{lo}$ is the equilibrium constant evaluated at $\varepsilon_r = \varepsilon_r^{lo}$:

$$\ln K^{lo} = \sum_s \nu_s\ln\big((n_{si} + \nu_s\varepsilon_r^{lo} + iii_s)P_{oper}\big) - \sum_s \nu_s\ln\Big(\sum_{s'}(n_{s'i} + \nu_{s'}\varepsilon_r^{lo})\Big) \tag{58}$$

The reactor operating temperature must then satisfy:

$$T_{oper} \ge T' \tag{59}$$
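The temperature bound of equations 57 to 59 can be sketched as follows. The code evaluates $\ln K^{lo}$ from the feed flows and the minimum extent, then solves equation 57 for T' by bisection. The linear form used for $\Delta G_R(T)$ in the example, the gas constant units, and all function names are assumptions of this sketch; species with a zero coefficient are simply skipped, which is equivalent to the role of $iii_s$.

```python
import math
from typing import Callable, Dict

R_GAS = 8.314e-3  # kJ/(mol K), assumed unit convention

def ln_K_lo(nu: Dict[str, float], n_i: Dict[str, float],
            extent_lo: float, P: float = 1.0) -> float:
    """ln K evaluated at the minimum extent (equation 58)."""
    n = {s: n_i[s] + nu[s] * extent_lo for s in nu}
    total = sum(n.values())
    first = sum(v * math.log(n[s] * P) for s, v in nu.items() if v != 0)
    return first - sum(nu.values()) * math.log(total)

def min_temperature(dG_R: Callable[[float], float], lnK: float,
                    lo: float = 300.0, hi: float = 1500.0,
                    tol: float = 1e-6) -> float:
    """Solve dG_R(T') = -R T' ln K (equation 57) by bisection on [lo, hi]."""
    f = lambda T: dG_R(T) + R_GAS * T * lnK
    f_lo, f_hi = f(lo), f(hi)
    if f_lo * f_hi > 0:
        raise ValueError("no sign change: T' is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)

# Toy reaction A + B -> C on a 10 kmol/hr basis with a minimum extent of 2.5 kmol/hr
lnK = ln_K_lo({"A": -1, "B": -1, "C": 1}, {"A": 10.0, "B": 10.0, "C": 0.0}, 2.5)
# Hypothetical linear dG_R(T) = 50 - 0.1 T (kJ/mol), with ln K = 0 for a simple check
T_prime = min_temperature(lambda T: 50.0 - 0.1 * T, 0.0)
```

If no sign change exists in the allowed temperature range the sketch raises an error; in the formulation above that case corresponds to a stoichiometry that cannot meet the extent bound.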
THERMODYNAMIC AND ECONOMIC CONSTRAINTS
The Gibbs Free Energy of reaction per mole of product is employed to eliminate thermodynamically infeasible solutions, using a 10 kcal/mol (or 41.868 kJ/mol) upper limit as follows:

$$\frac{\Delta G_R}{\nu_p} \le 41.868 \tag{60}$$

The profit associated with each reaction is calculated as follows, assuming that any stoichiometric co-products are sold at their market value:

$$\text{Profit} = \frac{\sum_s \nu_s C_s}{\nu_p} \tag{61}$$
where $C_s$ is the market value of species s, taken from Chemical Prices (1998). Note that individual reaction steps cannot be rejected on the basis of profit since the profit of any one step is not representative of the profit of the entire chemistry.

ENVIRONMENTAL CONSIDERATIONS
The environmental impact directly associated with carrying out each stoichiometry is assumed to arise only from the energy consumption necessary to maintain reactor temperature. By-products are not considered and it is assumed that there are no material emissions of the co-products of any stoichiometry. In addition, the impacts associated with separating the products from the recycle, and with all other downstream processing, are ignored for the present. In these respects, the impact figures calculated here are very much lower bounds for the eventual process impacts.

This simplistic treatment of environmental impact assessment reflects the level of information available at stoichiometry selection. In principle, the full range of life cycle assessment (LCA) based metrics available within the MEIM could be used to develop a full impact vector for each stoichiometry. However, since air emissions are the dominant form of energy associated waste, the critical air mass (CTAM) metric is chosen. According to Stefanis (1996), the critical air mass associated with energy production is $1.629 \times 10^{8}$ kg air/MWh. Assuming that the environmental impact per unit energy of maintaining reactor temperature is the same as that of burning fossil fuels to produce electricity, the environmental impact arising from the reactor energy demand per mole of desired product is:
$$CTAM_E\ (\text{kg air/hr}) = 1.629 \times 10^{[\ldots]}\ \frac{\sqrt{(Q_{reactor})^2}}{3600} \tag{62}$$
Note that it is assumed here for simplicity that the impact of cooling the reactor (which is necessary for negative $Q_{reactor}$) is the same as that of heating it. This is a simplistic assumption; however, in this way reactions which require withdrawal of energy are penalised in environmental terms equally with those which require energy supply.

This assumption is made on the basis that reactions which require energy withdrawal are likely to be exothermic reactions occurring at moderate temperatures, so that the heat of reaction term dominates the energy balance (equation 51). This is only likely to occur towards the reaction temperature lower bound (i.e. 300K), at which the reactor temperature is too low to use cooling water at ambient temperature. Thus, some kind of refrigeration would be required, which carries with it a high energy demand and therefore a high impact, associated with compression requirements.

In order to complete the impact assessment, the input wastes associated with the materials consumed in any stoichiometry must be included. However, the quantification of the input waste of any material can be a lengthy exercise, since all processing steps necessary to produce the material from naturally occurring substances must be considered in accordance with the principles of LCA (Heijungs et al., 1992; ISO 14040, 1997; SETAC, 1993). Thus, rather than performing this exercise for all co-materials, it is more efficient to assess the input wastes only of those materials which are identified as raw materials by the multi-step stoichiometry identification formulation (i.e. those materials with no precursors). Since these materials are not known at the outset, input waste assessment can only be performed after stoichiometry identification.
Provided input wastes are included in this way, consistent impact figures can be obtained for multi-step stoichiometries involving branches of different lengths, and different stoichiometries can be compared on a consistent basis.
7.5 SOLVING THE MULTISTEP STOICHIOMETRY IDENTIFICATION PROBLEM
7.5.1 Overview
It is desirable to use the above model to enumerate and evaluate multistep stoichiometries simultaneously and implicitly within a framework of constrained optimisation. However, the optimisation objective (minimise $G_{sys}$) is not suitable for such an approach since it must be applied to each individual stoichiometric step and furthermore, even for a single step stoichiometry, the model is a large mixed integer nonlinear programming (MINLP) problem. It involves a large number of optimisation variables, including for each species s: the binary variables $i_s$, $ii_s$, $iii_s$, $b_{ns}$ ($\forall n$) and also the binaries $q_{t,es}$ ($\forall t$), $z_{cs}$ and $p$ if the carbon structure constraints are employed, and the continuous variables $\nu_s$, $x_s$, $y_s$, $n_{sf}$, $n_{si}$, $F_s$, $P_s$ and $\Delta G_{fs}$. Furthermore, the relationships between these variables are not trivial, including many instances of products between binary and continuous variables.

Thus, in order to solve the multistep stoichiometry identification problem, a decomposition based approach is adopted in which the single step problem is solved by explicit enumeration and subsequent evaluation of stoichiometries in two sequential steps. This procedure is then applied successively in an algorithm designed to build up multistep reaction stoichiometries. The enumeration and evaluation of single step stoichiometries are discussed below, followed by a description of the multistep stoichiometry identification solution algorithm.

SINGLE STEP STOICHIOMETRY ENUMERATION
The basic single step stoichiometry enumeration formulation consists of equations 18-33. Carbon structure constraints (equations 34-39) are optional and are included on a case by case basis. With the exception of the more complicated carbon structure constraints (equations 37 and 38) all equations are linear. However, as discussed above, provided equations 36, 37 and 39 are solved in advance, equation 38 becomes linear. Thus, with or without carbon structure constraints, the single step stoichiometry enumeration problem can be formulated as a mixed integer linear programming (MILP) problem so that optimal solutions can be guaranteed.
Recognising that simple stoichiometries with few reactants and co-products are more attractive than complex ones (which in general require more complex reaction and separation technologies), a new objective function is introduced in order to extract stoichiometries systematically from the matrix $e_s$, starting with the simplest first. The number of materials involved in any stoichiometry, $N_{spe}$, is obtained by summing the reactant and co-product flag values, so that this objective is written as follows:

$$\text{minimise} \quad \sum_s (i_s + ii_s) = N_{spe} \tag{63}$$

In order to identify a set of candidate stoichiometries this problem must be solved repeatedly, with integer cuts introduced at each iteration to exclude previous solutions. Accordingly, the simplest stoichiometries are enumerated first and as cuts are added the solutions become progressively more complicated.

SINGLE STEP STOICHIOMETRY EVALUATION
The single step stoichiometry evaluation problem consists of the property prediction and reactor process model, equations 40-62. For each stoichiometry, this model is solved immediately after the stoichiometry enumeration model. Thus, $\nu_s$, $x_s$, $y_s$, $i_s$, $ii_s$ and $iii_s$ are known and are treated as parameters in the stoichiometry evaluation problem, which is then reduced to a nonlinear programming (NLP) problem. The optimisation objective is to minimise $G_{sys}$ and the main optimisation variables are $T_{oper}$, $\varepsilon_r$ and $n_{sf}$.
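The repeated "solve, then cut" loop described above can be illustrated with a small brute-force sketch. It is not the MILP formulation of equations 18-33: an exhaustive search over small integer coefficients stands in for the solver, the atom balance is the only constraint, and the four toy species are hypothetical. What it does preserve is the idea of returning the stoichiometry with the fewest species first and excluding each solution with a cut before solving again.

```python
from itertools import combinations, product
from typing import Dict, FrozenSet, List, Optional, Set

# Hypothetical toy species with their element counts
SPECIES = {
    "H2": {"H": 2},
    "O2": {"O": 2},
    "H2O": {"H": 2, "O": 1},
    "H2O2": {"H": 2, "O": 2},
}

def balanced(coeffs: Dict[str, int]) -> bool:
    """True if every element balances for the given stoichiometric coefficients."""
    totals: Dict[str, int] = {}
    for sp, v in coeffs.items():
        for el, k in SPECIES[sp].items():
            totals[el] = totals.get(el, 0) + v * k
    return all(t == 0 for t in totals.values())

def solve_min_species(target: str, cuts: Set[FrozenSet[str]],
                      v_max: int = 2, max_species: int = 3) -> Optional[Dict[str, int]]:
    """Return a balanced stoichiometry producing `target` with the fewest species,
    skipping species sets already excluded by a cut."""
    others = [s for s in SPECIES if s != target]
    values = [v for v in range(-v_max, v_max + 1) if v != 0]
    for size in range(2, max_species + 1):          # smallest N_spe first
        for subset in combinations(others, size - 1):
            if frozenset(subset) | {target} in cuts:
                continue
            for v_t in range(1, v_max + 1):          # target must be a product
                for vs in product(values, repeat=size - 1):
                    coeffs = dict(zip(subset, vs))
                    coeffs[target] = v_t
                    if balanced(coeffs):
                        return coeffs
    return None

def enumerate_stoichiometries(target: str) -> List[Dict[str, int]]:
    cuts: Set[FrozenSet[str]] = set()
    found = []
    while True:
        sol = solve_min_species(target, cuts)
        if sol is None:                              # "infeasible": nothing left
            return found
        found.append(sol)
        cuts.add(frozenset(sol))                     # cut excludes this species set

results = enumerate_stoichiometries("H2O")
```

Because the target coefficient is restricted to positive values, the reverse of each reaction is excluded automatically within a system; the cuts described in the text go further and also exclude reversals in subsequent systems.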
7.5.2 Multistep Stoichiometry Identification Algorithm
OVERVIEW OF ALGORITHM
In order to generate multistep stoichiometries the single step stoichiometry enumeration and evaluation problems are solved successively using a depth first enumeration strategy, in which the desired maximum number of reaction steps is specified in advance. The operation of the algorithm is schematically depicted in Figure 1 for the case where at most three reaction steps are allowed.
[Diagram: tree of systems (System 0; Systems 1A and 1B; Systems 2A, 2B, 2C and 2D), each comprising an Enumeration step and an Evaluation step]
Figure 1: Multistep Stoichiometry Identification Algorithm

At the first level, system zero, the final desired product is the target molecule, and a single stoichiometry involving up to two first generation precursor reactants is extracted from the matrix $e_s$. One of these first generation precursors is then arbitrarily selected as the target molecule for system 1A and a single stoichiometry involving up to two second generation precursor reactants is identified which leads to this compound. One of these second generation precursors is then selected as the target molecule for system 2A. Since system 2A completes this branch of the enumeration exercise, it is solved iteratively until all stoichiometries leading to this second generation precursor target molecule have been enumerated. Once this has been achieved, system 2B is solved iteratively for all stoichiometries leading to the other second generation precursor. With this pair of second generation precursors completely fathomed, system 1A is run again to generate two more, which are then treated as the target molecules for system 2A and system 2B. This process is repeated until all stoichiometries leading to the first generation precursor target molecule have been enumerated. Once this has been achieved, systems 1B, 2C and 2D are employed to enumerate all stoichiometries leading to the other first generation precursor. System zero is then solved again to generate two more first generation precursors, and the whole procedure is repeated until all stoichiometries leading to the final desired product have been enumerated.

In this way, multistep reaction stoichiometries are developed in which a family tree of precursors is linked by individual reaction stoichiometries which lead eventually to the desired product. In principle, this approach may be applied to generate any number of successive reaction steps. Each system comprises both stoichiometry enumeration and evaluation, so that for each stoichiometry, the evaluation problem is solved immediately after the stoichiometry is generated.
Infeasibility in the linear stoichiometry enumeration problem implies that no stoichiometries exist which lead to the particular target molecule, while infeasibility in the non-linear stoichiometry evaluation model implies violation of the thermodynamic constraints (ignoring numerical problems) for a particular stoichiometry. Thus, if the stoichiometry enumeration model is initially infeasible or becomes infeasible after all possible stoichiometries in a particular system have been enumerated, the algorithm immediately moves on to the next system. If, however, the stoichiometry evaluation model is infeasible for a particular stoichiometry, the algorithm continues to enumerate stoichiometries within the same system. The results of the system are stored only if both problems are feasible.

The same occurrence matrix $e_s$ and variables are used throughout all systems, so that each time a system is solved all variables are overwritten. Thus, parameters are employed to store results and to communicate variable values between stoichiometry enumeration and evaluation problems within each system. Different role specification constraints, chemistry constraints or carbon structure constraints may be included in different systems, if required, simply by including different equations in the model definitions. Otherwise, the same equations and models are also employed throughout all systems.
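The depth-first construction of multistep routes can be sketched with a short recursive function. This is a simplified illustration rather than the algorithm itself: the table `STEPS` is a hypothetical stand-in for the results that the enumeration and evaluation problems would return for each target molecule, and a target with no entry plays the role of an infeasible enumeration problem, so that the material is treated as a raw material.

```python
from itertools import product
from typing import Dict, List

# Hypothetical single step results: target -> {step label: precursor reactants}
STEPS: Dict[str, Dict[str, List[str]]] = {
    "P": {"A": ["X", "Y"], "D": ["Z", "W"]},
    "X": {"E": [], "F": []},
    "Y": {"I": []},
    "Z": {"L": []},
}

def routes(target: str, depth: int) -> List[str]:
    """Labels of all routes to `target` using at most `depth` reaction levels."""
    if depth == 0 or target not in STEPS:
        return [""]  # no further steps: the target is treated as a raw material
    found = []
    for label, precursors in STEPS[target].items():
        # Fathom each precursor branch completely before combining the branches
        branches = [routes(p, depth - 1) for p in precursors]
        for combo in product(*branches):
            found.append(label + "".join(combo))
    return found

two_step = routes("P", 2)
one_step = routes("P", 1)
```

Each returned label concatenates the step used at the top level with the steps chosen for its precursors, which mirrors the way combinations of subsequent systems are indexed later in the chapter.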
INTEGER CUTS
Within the algorithm, integer cuts are automatically written each time a system is solved. The cuts are written in such a way that they prevent the same stoichiometry from occurring again both within the current system and within all subsequent systems. Furthermore, they are written to prevent the reappearance of the stoichiometry in both forward and reverse directions. In this way, each material which appears as a reactant in the entire multistep stoichiometry network is fathomed only once, and circuits, in which a reaction appears in both forward and reverse directions within the same multi-step stoichiometry, are avoided. The integer cuts are written in terms of the reactant and co-product flags in much the same way as the chemistry constraints, and since the same reactant and co-product flag variables are used for each system, it is a simple matter to write these constraints in such a way that, once written, they are included in all subsequent systems.

TARGET MOLECULE IDENTIFICATION
At the outset, only the desired product is known, so that a means of communicating target molecule identities between the subsequent systems is required. This is achieved by using the $ii_s$ values from each system as lower bounds for the stoichiometric coefficients in the next. If species k and l are the precursor reactants generated by a certain system, $ii_k$ and $ii_l$ will take the value unity for this system. These species must then be identified as the products of the pair of subsequent systems. However, species k and l must be considered independently, i.e. one in one system and one in the other. In order to do this, the vector $II_s$ must be split so that in one system $i_k = 1$ while $i_l$ (and all other product flags) are unconstrained, whereas in the other system $i_l = 1$ while $i_k$ (and all other product flags) are unconstrained.
This is achieved by incorporating the following equations in all stoichiometry enumeration formulations:

$$ii_s = a_s + b_s, \quad \forall s \in S \tag{64}$$

$$\sum_s a_s = 1 \tag{65}$$

$$\sum_s b_s \le 1 \tag{66}$$
Equation 64 splits the vector $II_s$ from each system into two vectors $A_s$ and $B_s$, of which the elements $a_s$ and $b_s$ are binary variables. According to equations 65 and 66, $A_s$ and $B_s$ may have only one non-zero entry each, so that the non-zero entries in $II_s$ are divided, one into $A_s$ and one into $B_s$. The values of $A_s$ and $B_s$ are then communicated to the subsequent pair of systems through the parameter vectors $AP_s$ and $BP_s$ respectively, by replacing equation 19 with one of the following equations in all systems subsequent to system zero:

$$i_s \ge ap_s, \quad \forall s \in S \tag{67}$$

$$i_s \ge bp_s, \quad \forall s \in S \tag{68}$$
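The splitting of the reactant flag vector can be illustrated as follows. In the formulation the split is performed by the binary variables of equations 64-66 inside the optimisation; this sketch instead performs it directly on a solved flag vector, which is an assumption made for illustration. The function names and species labels are hypothetical.

```python
from typing import Dict, Tuple

def split_reactant_flags(ii: Dict[str, int]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Split reactant flags ii_s into parameter vectors (ap, bp).

    Satisfies ii_s = ap_s + bp_s, sum(ap) = 1 and sum(bp) <= 1
    (equations 64-66) for one or two reactants.
    """
    ones = [s for s, flag in ii.items() if flag == 1]
    if not 1 <= len(ones) <= 2:
        raise ValueError("expected one or two precursor reactants")
    ap = {s: int(s == ones[0]) for s in ii}
    bp = {s: int(len(ones) == 2 and s == ones[1]) for s in ii}
    return ap, bp

def target_coefficient(flags: Dict[str, int], nu: Dict[str, int]) -> int:
    """Pick out the coefficient of the target molecule as sum_s flag_s * nu_s."""
    return sum(flags[s] * nu[s] for s in flags)

ap, bp = split_reactant_flags({"k": 1, "l": 1, "m": 0})
ap_single, bp_single = split_reactant_flags({"k": 1, "l": 0, "m": 0})
```

When only one reactant is present, `bp` is all zeros, which corresponds to the case where the second branch of subsequent systems is omitted.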
Note that equation 66 is written as an inequality to permit stoichiometries in which there is only one reactant (e.g. isomerisation or thermal decomposition). In such cases, all elements of the vector $B_s$ take the value zero and the algorithm omits the entire branch of corresponding subsequent systems. Note also that $\nu_p$, the stoichiometric coefficient of the target molecule for each system subsequent to system zero, which is needed in the stoichiometry evaluation equations, is identified using one of the following equations:

$$\nu_p = \sum_s ap_s \nu_s \tag{69}$$

$$\nu_p = \sum_s bp_s \nu_s \tag{70}$$
where $ap_s$ and $bp_s$ are the parameter values generated by the previous system, and $\nu_s$ are the stoichiometric coefficients of the current system. For all systems, $\nu_p$ is included as a parameter in stoichiometry evaluation.

CALCULATION OF FINAL RESULTS
The profit and impact from each system are stored as parameters immediately after the system is solved. These figures are calculated per mole of the target molecule produced in the current system. The total profit and impact associated with the multistep stoichiometries are calculated per mole of the final product, by starting at the final systems and working forward towards the final product, adding the profits and impacts sequentially. However, the target molecule in a certain system may exhibit a stoichiometric coefficient with any value (subject to $\nu_{max}$) in the previous system where it appears as a reactant, and the product of this previous system may also exhibit any such stoichiometric coefficient value. Thus, the profit and impact of each system must be multiplied by the magnitude of the stoichiometric coefficient of its target molecule as it appears in the previous system, and divided by the magnitude of the stoichiometric coefficient of the product of this previous system, before being added to the profit or impact of the previous system. Since each system may have up to two immediate subsequent systems, the profits and impacts of both subsequent systems must be treated in this way. In addition, each combination of subsequent system stoichiometries must be considered and a separate profit and impact figure calculated for each.

Depending on whether $ap_s$ or $bp_s$ is used in a particular system, $x_{p,k}^{k-1}$, the magnitude of the stoichiometric coefficient which the target molecule of system k exhibits as a reactant in system k-1, is given by one of:

$$x_{p,k}^{k-1} = \sum_s ap_s^{k-1} x_s^{k-1} \tag{71}$$

$$x_{p,k}^{k-1} = \sum_s bp_s^{k-1} x_s^{k-1} \tag{72}$$

where $ap_s^{k-1}$ and $bp_s^{k-1}$ are parameter values from system k-1 and $x_s^{k-1}$ are the stoichiometric coefficients of system k-1. Thus, the profit and impact of system k are multiplied by $x_{p,k}^{k-1}$ and divided by $x_p^{k-1}$ before being added to the profit and impact of system k-1. Using these parameters, profits and impacts are cascaded through the multistep stoichiometry network and a total profit and impact figure is arrived at for each set of stoichiometries which eventually leads to the final product.

Note that each reactor is assumed to be fed at ambient conditions. It is assumed that any cooling which may be required between successive reaction steps to achieve this will be accommodated by energy integration at a later stage of the process design, with no environmental penalty.
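The cascading rule can be written as a short recursion. In this sketch each system is a dictionary holding its own profit or impact per mole of its target (`value`), the magnitude of the stoichiometric coefficient of its product (`x_p`), and a list of subsequent systems paired with the magnitude of the coefficient that their target exhibits as a reactant in the current system. The data layout and the numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from typing import Any, Dict

def cascade(system: Dict[str, Any]) -> float:
    """Total profit (or impact) per mole of this system's target molecule.

    Each subsequent system's total is multiplied by the coefficient magnitude of
    its target as a reactant here and divided by this system's product coefficient.
    """
    total = system["value"]
    for x_reactant, child in system["children"]:
        total += cascade(child) * x_reactant / system["x_p"]
    return total

# Hypothetical two level network: the final product needs 2 mol of precursor X
# and 1 mol of precursor Y per 1 mol of product.
leaf_x = {"value": 0.5, "x_p": 1, "children": []}
leaf_y = {"value": 0.25, "x_p": 1, "children": []}
root = {"value": 1.0, "x_p": 1, "children": [(2, leaf_x), (1, leaf_y)]}
total_profit = cascade(root)
```

Here the total is 1.0 + 2 × 0.5 + 1 × 0.25 = 2.25 per mole of final product. A separate call would be made for each combination of subsequent system stoichiometries.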
7.6 APPLICATION

7.6.1 Case Study: Production of 1-Naphthalenyl Methyl Carbamate
1-naphthalenyl methyl carbamate, also known as carbaryl, was employed as a pesticide (Kalelkar, 1988; Shrivastava, 1987; Worthy, 1985). It was manufactured under the trade name SEVIN by Union Carbide India, Limited (UCIL) in Bhopal until December 1984, when production was terminated following the Bhopal disaster. UCIL's process involved the raw materials 1-naphthol and methyl isocyanate, a toxic substance with a permissible exposure limit (PEL) of 0.02 ppm (AGCIH, 1977; Dagani, 1985). Under disputed circumstances, 45 tons of methyl isocyanate underwent a chemical reaction and were released, killing approximately 2,500 people in the vicinity of the plant and resulting in some 300,000 additional casualties.

Crabtree and El-Halwagi (1994) considered this example with the objective of identifying stoichiometries with more innocuous raw materials, to reduce the potential impact of fugitive emissions. The approach employed here is somewhat different in that the objective is to identify stoichiometries which exhibit low environmental impact under normal operating conditions. While materials with high toxicity can be excluded at the co-material design stage by including equation 17 in the co-material design formulation, this potentially excludes stoichiometries which could be environmentally promising provided proper containment were employed. Thus, no such limit was included in this example. In cases where fugitive emissions are of concern, the methodology for environmental risk assessment of non-routine industrial releases presented by Stefanis and Pistikopoulos (1997) could, in principle, be incorporated as part of stoichiometry
evaluation.

GROUP PRE-SELECTION
According to Worthy (1985), there are two accepted industrial routes to carbaryl, which can be produced with or without methyl isocyanate. The alternative chemistries are shown in Figure 2.

Methyl Isocyanate Route
  CH3NH2 (methyl amine) + COCl2 (phosgene) → CH3-N=C=O (methyl isocyanate) + 2 HCl
  CH3-N=C=O (methyl isocyanate) + 1-naphthol → carbaryl (1-naphthalenyl methyl carbamate)

Non-Methyl Isocyanate Route
  1-naphthol + COCl2 (phosgene) → 1-naphthalenyl chloroformate + HCl
  1-naphthalenyl chloroformate + CH3NH2 (methyl amine) → carbaryl + HCl

Figure 2: Carbaryl Production Routes
For simplicity, to limit the size of the co-material design problem in this illustrative example, the group set is restricted to the simplest set of groups which are required to form the product and industrial co-materials shown in Figure 2. The selected set of UNIFAC groups (eleven in all) then consists of the aromatic groups AC, ACH, ACCl and ACOH, and the groups -CH3, CH3NH-, CH3NH2, -COO-, -CHO, -OH and -Cl. Note that methyl amine (CH3NH2) appears as a class zero group in Constantinou (1996), that is as a complete molecule, so that the NH2 group is not required. Note also that the -Cl group is a category two group.
CO-MATERIAL DESIGN
Using this group set, the co-materials were then constructed by solving the co-material enumeration formulation once for acyclic molecules and once for aromatic molecules. Additional structural restrictions were included, according to the structures of the industrial co-materials: (i) for non-aromatic molecules an upper limit of two groups was imposed, (ii) for aromatic molecules an upper limit of twelve groups was imposed since it is unlikely that carbaryl (which contains twelve groups) would be synthesised from a more complex molecule, (iii) only unsubstituted or monosubstituted aromatics which contain the double ring (naphthyl group) aromatic structure were allowed (since the product contains the naphthyl group and is monosubstituted) by specifying a minimum of seven ACH groups, and a total of ten ACH and AC groups altogether, and (iv) only one substituent group with a carbon free attachment was allowed in the aromatics. In addition, all non-aromatic molecules containing carbon bonds were screened out after enumeration. Thus, chemistries in which the naphthyl group structure is constructed or decomposed, or in which any other carbon-carbon bonds are formed or broken, are avoided.

For the acyclic molecules, constraints were included to prevent chlorine bonding with itself or with any groups of higher category. However, for the aromatic molecules, these constraints were removed, to allow the formation of 1-naphthalenyl chloroformate. The results of co-material enumeration are shown in Figure 3.
1) Naphthalene
2) 1-Chloronaphthalene
3) 1-Naphthol
4) N-Methyl-1-Naphthylamine
5) 1-Naphthalenyl Hydroxyformate
6) 1-Naphthalenyl Chloroformate
7) Carbaryl
8) Chlorine (Cl2)
9) Chloromethane (CH3Cl)
10) Methanol (CH3OH)
11) Chloromethanal
12) Methyl Amine (CH3NH2)
13) Phosgene
14) Methyl Isocyanate
15) Methyl Formamide

Figure 3: Co-Material Design Results - Carbaryl Example
Note that species 8, 11, 13, 14 and 15 are included as additional molecules since none of these can be constructed according to the structural restrictions employed. Four further additional molecules were also included, as shown in Figure 4.

16) Hydrogen (H2)
17) Oxygen (O2)
18) Water (H2O)
19) Hydrogen Chloride (HCl)

Figure 4: Additional Molecules

MULTI-STEP STOICHIOMETRY IDENTIFICATION RESULTS
The solutions of the stoichiometry identification program are presented in the form of a table of stoichiometric coefficients in Table 3, where blank spaces indicate zero coefficients and the species are numbered as above. According to the industrial routes, stoichiometries of up to two steps in length were allowed, with a maximum of four species permitted in any step. The role specification and chemistry constraints employed in this example are given in Appendix A. No carbon structure constraints were employed in this example. A production rate ($\varepsilon_r \nu_p$) lower bound of 2.5 kmol/hr and an allowable reactor temperature range of 300-800K were imposed.

Table 3: Multistep Stoichiometries - Carbaryl Example
Index  Nspe  Toper (K)  εr·νp (kmol/hr)  Profit ($/mol)  CTAM (tn air/mol)

System 0 - Producing Species 7
A      3     300        9.99             0.4508          19.57
B      4     300        5.51             -2.9885         22.22
C      4     300        10.00            0.5026          16.63
D      4     300        3.17             -2.9485         16.08

System 1 - Producing Species 3
E      3     300        20.00            0.5509          6.90
F      4     300        9.34             0.5013          18.23
G      4     300        10.00            0.5015          17.85
H      4     300        10.00            0.5249          13.44

System 1 - Producing Species 14
I      4     800        10.00            0.1887          5.68

System 1 - Producing Species 15
J      4     300        2.75             0.0400          4.20
K      4     736        2.50             0.0535          2050.78

System 1 - Producing Species 6
L      4     300        10.00            3.9556          17.86
M      4     300        10.00            3.4911          14.11
N      4     300        10.00            3.5880          10.27

[Stoichiometric coefficient columns for species 1-19: not legible in this copy]
System zero produced four candidate stoichiometries that satisfy all constraints, in which materials 3, 6, 12, 14 and 15 appear as first generation precursor reactants. Systems 1A and 1B produced a total of ten further stoichiometries leading to all of these materials except species 12, which is allowed only as a reactant in systems 1A and 1B since it could only be produced by decomposing more complex naphthyl molecules.

All stoichiometries except I and K achieve acceptable conversion at 300K. For stoichiometry I, $G_{sys}$ is minimised at 800K and high conversion since $\Delta G_R$ for this reaction has a large negative temperature gradient, so that equation 56 is dominated by its first RHS term. For stoichiometry K, however, reactor temperature has to be elevated to meet the production rate bound, so that this stoichiometry is the only one for which T' > 300K (from equations 57-59).

The two industrial chemistries shown in Figure 2 were reproduced: stoichiometries A and I represent the methyl isocyanate route, and stoichiometries D and N represent the non-methyl isocyanate route. Stoichiometries C and D represent the first and third of the three alternative single step routes put forward by Crabtree and El-Halwagi (1994); their second alternative does not appear here since it involves three apparently simultaneous reactants.

Table 4 shows the total profits and impacts for the individual solutions combined into multi-step stoichiometries. For example, the index AEI denotes the combination of steps A, E and I. Note that the profits reflect only the values of the products minus the values of the reactants, assuming that stoichiometric co-products are sold at their market value, and that in this example, raw materials are assumed input waste free. Note also that stoichiometries with poor conversion are not penalised since the costs and impacts of separation are not included here, and it is assumed that unconsumed reactants are recycled with no loss of heat and no compression or pumping requirements.
Only stoichiometries involving step K can justifiably be eliminated from further consideration on impact grounds, since this step is penalised in impact terms by its high reactor temperature, and only stoichiometries involving steps M or N can justifiably be eliminated on economic grounds. Although steps M and N both exhibit high profits, species 6 is such a high value material that only step L generates sufficient profit to cover the cost of consuming species 6 in system zero. It is for this reason that stoichiometry DL remains competitive despite the poor economic performance of step D, which was rejected by Crabtree and El-Halwagi (1994) on economic grounds. This clearly illustrates the advantages of considering multi-step production routes.

Of the remaining ten stoichiometries, the original industrial chemistry (steps A and I) with the addition of step E, F, G or H to produce species 3 exhibits the most promising economics of all, which is probably why it was selected. Furthermore, the environmental impacts of the routes based on this chemistry are also among the most promising. Of these routes, stoichiometry AEI represents the best compromise solution. Only stoichiometry CEJ exhibits a significantly lower impact than AEI, with only a marginally reduced profit, and so appears to represent the best compromise solution of all. However, conversion is poor for step J, so that higher separation and recycle costs are anticipated. Issues such as this must be explored in order to eliminate further alternatives.

Table 4: Total Profits and Impacts

Index | Total Profit ($/mol) | Total CTAM (tn air/mol)
AEI | 1.1403 | 32.15
AFI | 1.1408 | 43.48
AGI | 1.1409 | 43.10
AHI | 1.1644 | 38.69
BJL | 1.0070 | 44.28
BJM | 0.5426 | 40.53
BJN | 0.6394 | 36.69
BKL | 1.0205 | 2090.86
BKM | 0.5561 | 2087.12
BKN | 0.6529 | 2083.27
CEJ | 1.0453 | 27.72
CEK | 1.0570 | 2074.31
CFJ | 1.0439 | 39.05
CFK | 1.0574 | 2085.64
CGJ | 1.0441 | 38.68
CGK | 1.0576 | 2085.26
CHJ | 1.0675 | 34.27
CHK | 1.0810 | 2080.85
DL | 1.0070 | 33.93
DM | 0.5426 | 30.19
DN | 0.6394 | 26.34
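The elimination argument above can be checked with a short script. This is a minimal sketch under stated assumptions: the profit and impact values are transcribed from Table 4, the pairing of each value with a route label is assumed (in particular the labels of the routes that start with step C), and the non-dominated filter at the end is an illustration added here, not a step of the authors' procedure.

```python
# (total profit in $/mol, total CTAM in tn air/mol) per multi-step route.
# Values transcribed from Table 4; the label-to-value pairing is an assumption.
ROUTES = {
    "AEI": (1.1403, 32.15), "AFI": (1.1408, 43.48), "AGI": (1.1409, 43.10),
    "AHI": (1.1644, 38.69), "BJL": (1.0070, 44.28), "BJM": (0.5426, 40.53),
    "BJN": (0.6394, 36.69), "BKL": (1.0205, 2090.86), "BKM": (0.5561, 2087.12),
    "BKN": (0.6529, 2083.27), "CEJ": (1.0453, 27.72), "CEK": (1.0570, 2074.31),
    "CFJ": (1.0439, 39.05), "CFK": (1.0574, 2085.64), "CGJ": (1.0441, 38.68),
    "CGK": (1.0576, 2085.26), "CHJ": (1.0675, 34.27), "CHK": (1.0810, 2080.85),
    "DL": (1.0070, 33.93), "DM": (0.5426, 30.19), "DN": (0.6394, 26.34),
}

# Step K is excluded on impact grounds, steps M and N on economic grounds.
EXCLUDED_STEPS = {"K", "M", "N"}


def remaining_routes(routes):
    """Drop every route that uses at least one excluded step."""
    return {name: v for name, v in routes.items()
            if not EXCLUDED_STEPS & set(name)}


def non_dominated(routes):
    """Routes for which no other route is at least as good in both
    criteria (higher profit, lower impact) and strictly better in one."""
    keep = []
    for name, (profit, impact) in routes.items():
        dominated = any(
            p2 >= profit and c2 <= impact and (p2 > profit or c2 < impact)
            for other, (p2, c2) in routes.items() if other != name
        )
        if not dominated:
            keep.append(name)
    return sorted(keep)


survivors = remaining_routes(ROUTES)
front = non_dominated(survivors)
print(len(survivors), front)
```

With these inputs ten routes survive the elimination, and the non-dominated subset contains the AEI and CEJ routes singled out in the discussion.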
7.7 CONCLUSIONS
In the work presented here, a procedure for the rapid identification of alternative multi-step stoichiometries has been described in which each stoichiometric step involves whole number stoichiometric coefficients and a limited number of species. The key to the procedure is the introduction of material design principles to formalise the development of a set of co-materials from which stoichiometries are then extracted using an optimisation procedure.

The co-material enumeration procedure is based on a set of structural and chemical feasibility rules from Constantinou et al. (1996). However, rather than employing their molecular generation algorithm, the rules are instead used to develop a set of linear integer constraints governing the numbers and combinations of particular structural groups in a molecule. Combining these rules with the octet rule, and additional structural restrictions to limit the total number of groups, and the numbers of branches, substituents, substituted sites and functional groups, results in the co-material enumeration MILP formulation. This problem is solved repeatedly, introducing integer cuts after each iteration to exclude previous solutions, to produce a set of co-materials.

Stoichiometries are then extracted from this set of materials using an optimisation procedure, in which stoichiometries are explicitly enumerated and subsequently evaluated in two sequential steps. Stoichiometry enumeration includes whole number stoichiometric coefficient constraints, constraints to restrict changes to the carbon skeletons of the reacting species, and case specific constraints based on chemical knowledge. Thermodynamic, economic and environmental impact criteria are employed in the evaluation of the stoichiometries, with aspects of the MEIM (Pistikopoulos et al., 1994) providing the framework for the environmental evaluation of alternatives.

The illustrative example has shown that the co-material design technique provides an interesting set of co-material molecules and that, with the inclusion of a few simple rules based on chemical knowledge, it is possible to limit the quantity of co-materials to a manageable number.
Furthermore, by incorporating simple chemical rules along with thermodynamic, economic and environmental criteria in stoichiometry identification it is possible to identify a small number of alternative stoichiometries which are promising both in terms of economics and environmental impact. Moreover, it has been shown that developing multistep stoichiometries directly can lead to the acceptance of alternatives which would be rejected as single step syntheses. In the illustrative example, existing industrial chemistries were identified as the most promising compromise solutions, with several new and competitive alternatives. This suggests that the approach could lead to promising results in the search for production routes for new molecules. Further reinforcement of this conclusion appears in a second application presented in Chapter 14.
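The repeated solution with integer cuts summarised in these conclusions can be sketched as follows. This is an illustration only: a brute-force search stands in for the MILP solver, and the toy feasibility rule ("exactly two of four groups") and the cost function are hypothetical, not the co-material formulation of this chapter.

```python
from itertools import product


def violates(y, previous):
    """Integer cut for a previous binary solution y*:
        sum_{i: y*_i = 1} y_i - sum_{i: y*_i = 0} y_i <= |{i: y*_i = 1}| - 1
    For binary y the cut is violated only when y equals y*."""
    ones = [i for i, v in enumerate(previous) if v == 1]
    zeros = [i for i, v in enumerate(previous) if v == 0]
    lhs = sum(y[i] for i in ones) - sum(y[i] for i in zeros)
    return lhs > len(ones) - 1


def solve(n, feasible, cost, cuts):
    """Stand-in for a MILP solver: cheapest feasible binary vector
    that satisfies every accumulated integer cut, or None."""
    best = None
    for y in product((0, 1), repeat=n):
        if not feasible(y) or any(violates(y, cut) for cut in cuts):
            continue
        if best is None or cost(y) < cost(best):
            best = y
    return best


def enumerate_all(n, feasible, cost):
    """Solve repeatedly, adding one cut per solution, until infeasible."""
    cuts, found = [], []
    while True:
        y = solve(n, feasible, cost, cuts)
        if y is None:
            return found
        found.append(y)
        cuts.append(y)


# Hypothetical toy problem: choose exactly two of four structural groups.
solutions = enumerate_all(
    4,
    feasible=lambda y: sum(y) == 2,
    cost=lambda y: sum(i * v for i, v in enumerate(y)),
)
print(solutions)
```

Each pass returns a solution not seen before, and the loop stops once the cuts have excluded every feasible point, so the toy problem yields all six two-group combinations.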
7.8 REFERENCES

ACGIH. Airborne Hazards at Work. American Conference of Governmental Industrial Hygienists. Great Britain Factory Inspectorate. London (1977)

Agnihotri, R.B. and R.L. Motard. Reaction Path Synthesis in Industrial Chemistry. Computer Applications to Chemical Engineering, ACS Symposium Series 124, 193-206 (1980)

Androulakis, I.P. Kinetic mechanism reduction based on an integer programming approach. AIChE Journal 46(2), 361-371 (2000)

Buxton, A. Solvent Blend and Reaction Route Design for Environmental Impact Minimisation. PhD Thesis. Imperial College, London (2002)

Buxton, A., A.G. Livingston and E.N. Pistikopoulos. Reaction Path Synthesis for Environmental Impact Minimization. Computers chem. Engng. 21, S959-S964 (1997)

Chemical Prices. Chemical Marketing Reporter 253(8), 25-35 (1998)

Compounds. Dictionary of Organic Compounds. Chapman & Hall, London, 6th Edition (1996)

Constantinou, L., C. Jaksland, K. Bagherpour and R. Gani. Application of the Group Contribution Approach to Tackle Environmentally Related Problems. AIChE Symposium Series, Volume on Pollution Prevention through Process and Product Modifications 303, 105-116 (1994)

Constantinou, L., K. Bagherpour, R. Gani, J.A. Klein and D.T. Wu. Computer Aided Product Design: Problem Formulations, Methodology and Applications. Computers chem. Engng. 20(6), 685-703 (1996)

Corey, E.J. and W.T. Wipke. Computer Assisted Design of Complex Organic Syntheses. Science 166 (1969)

Corey, E.J., W.T. Wipke, R.D. Cramer III and W.J. Howe. Techniques for Perception by a Computer of Synthetically Significant Structural Features in Complex Molecules. J. Am. Chem. Soc. 94, 431 (1972)

Corey, E.J., H.W. Orf and D.A. Pensak. Computer Assisted Synthetic Analysis. The Identification and Protection of Interfering Functionality in Machine-Generated Synthetic Intermediates. J. Am. Chem. Soc. 98, 210 (1976)

Crabtree, E.W. and M.M. El-Halwagi. Synthesis of Environmentally Acceptable Reactions. AIChE Symposium Series, Volume on Pollution Prevention via Process and Product Modifications 90, 117-127 (1994)

Dagani, R. Data on MIC's Toxicity are scant, leave much to be learned. Chemical & Engineering News 63(6), 37-40 (1985)

Derringer, G.C. and R.L. Markham. A Computer-Based Methodology for Matching Polymer Structures with Required Properties. Journal of Applied Polymer Science 30, 4609 (1985)

Edwards, K., T.F. Edgar and V.I. Manousiouthakis. Reaction mechanism simplification using mixed-integer nonlinear programming. Computers chem. Engng. 24, 67-79 (2000)

Fornari, T. and G. Stephanopoulos. Synthesis of Chemical Reaction Paths: The Scope of Group Contribution Methods. Chemical Engineering Communications 129, 135-157 (1994a)

Fornari, T. and G. Stephanopoulos. Synthesis of Chemical Reaction Paths: Economic and Specification Constraints. Chemical Engineering Communications 129, 159-182 (1994b)

Fornari, T., E. Rotstein and G. Stephanopoulos. Studies on the Synthesis of Chemical Reaction Paths - II. Reaction Schemes with Two Degrees of Freedom. Chemical Engineering Science 44(7), 1569-1579 (1989)

Gani, R., B. Nielsen and A. Fredenslund. A Group Contribution Approach to Computer-Aided Molecular Design. AIChE J. 37, 1318-1332 (1991)

Gao, C., R. Govind and H. Tabak. Application of the Group Contribution Method for Predicting the Toxicity of Organic Chemicals. Environmental Toxicology and Chemistry 11, 631-636 (1992)

Gelernter, H., N.S. Sridharan, A.J. Hart, F.W. Fowler and H.J. Shue. An Application of Artificial Intelligence to the Problem of Organic Synthesis Discovery. Topics Curr. Chem. 41, 113 (1973)

Glover, F. Improved Linear Integer Programming Formulations of Nonlinear Integer Problems. Management Science 22(4), 455-460 (1975)

Govind, R. and G.J. Powers. Studies in Reaction Path Synthesis. AIChE J. 27(3), 429-442 (1981)

Heijungs, R., J.B. Guinee, G. Huppes, R.M. Lankreijer, H.A. Udo de Haes, A. Wegener Sleeswijk, A.M.M. Ansems, A.M.M. Eggels, R. van Duin and H.P. de Goede. Environmental Life Cycle Assessment of Products: Background and Guide. Multicopy. Leiden (1992)

Hendrickson, J.B. A General Protocol for Systematic Synthesis Design. Topics Curr. Chem. 62, 49 (1971)

Holiastos, K. and V. Manousiouthakis. Automatic Synthesis of Thermodynamically Feasible Reaction Clusters. AIChE J. 44(1), 164-173 (1998)

ISO 14040. Environmental Management - Life Cycle Assessment - Part 1: Principles and Framework. (1997)

Joback, K.G. Unified Approach to Physical Property Estimation using Multivariate Statistical Techniques. Master's thesis. MIT, Cambridge, Massachusetts (1984)

Joback, K.G. Designing Molecules Possessing Desired Physical Property Values. PhD thesis. MIT, Cambridge, Massachusetts (1989)

Joback, K.G. and G. Stephanopoulos. Designing Molecules Possessing Desired Physical Property Values. Proceedings FOCAPD, CACHE Corporation, Austin, Texas 11, 631-636 (1989)

Joback, K.G. and G. Stephanopoulos. Designing Molecules Possessing Desired Physical Property Values. Advances in Chemical Engineering 21 - Intelligent Systems in Process Engineering, Academic Press (1995)

Kalelkar, A.S. Investigation of Large Magnitude Accidents: Bhopal as a Case Study. Arthur D. Little Inc., Cambridge, Massachusetts (1988)

Kaufmann, G. Computer Design of Synthesis in Organo-Phosphorous Chemistry. Computer-Assisted Design of Organic Synthesis, Table Ronde Roussel UCLAF, Paris (1977)

Knight, J.P. Computer-Aided Tools to Link Chemistry and Design in Process Development. PhD Thesis, Massachusetts Institute of Technology (1995)

Mavrovouniotis, M.L. and D. Bonvin. Design of Reaction Paths. FOCAPD, AIChE Symposium Series 91, 41-51 (1995)

May, D. and D.F. Rudd. Development of Solvay Clusters of Chemical Reactions. Chem. Eng. Sci. 31, 59 (1976)

Perry, R.H. and D. Green. Perry's Chemical Engineers' Handbook. 6th ed. McGraw Hill (1984)

Pistikopoulos, E.N., S.K. Stefanis and A.G. Livingston. A Methodology for Minimum Environmental Impact Analysis. AIChE Symposium Series, Volume on Pollution Prevention through Process and Product Modifications 90(303), 139-150 (1994)

Porter, K.E., S. Sitthiosoth and J.D. Jenkins. Designing a Solvent for Gas Absorption. Trans IChemE 69(A), 229-236 (1991)

Rotstein, E., D. Resasco and G. Stephanopoulos. Studies on the Synthesis of Chemical Reaction Paths - I. Chemical Engineering Science 37(9), 1337-1352 (1982)

SETAC. A Conceptual Framework for Life-Cycle Impact Assessment. (1993)

Shrivastava, P. Bhopal, Anatomy of a Crisis. Ballinger Publishing Company, Cambridge, Massachusetts (1987)

Sirdeshpande, A.R., M.G. Ierapetritou and I.P. Androulakis. Design of flexible reduced kinetic mechanisms. AIChE Journal 47(11), 2461-2473 (2001)

Stefanis, S.K. A Process Systems Methodology for Environmental Impact Minimization. PhD Thesis. Imperial College, London (1996)

Stefanis, S.K. and E.N. Pistikopoulos. A Methodology for Environmental Risk Assessment for Industrial Non-Routine Releases. Ind. Eng. Chem. Res. 36, 3694-3707 (1997)

Ugi, I. and P. Gillespie. Chemistry and Logical Structure. 3. Representation of Chemical Systems and Interconversions by BE matrices and their Transformation Properties. Angew. Chem. Int. Ed. Engl. 10, 914 (1971)

Van Krevelen, D.W. and H.A.G. Chermin. Estimation of the Free Enthalpy (Gibbs Free Energy) of Formation of Organic Compounds from Group Contributions. Chem. Eng. Sci. 1, 66-80 (1951)

Weissermel, K. and H.-J. Arpe. Industrial Organic Chemistry. Second, Revised and Extended Edition. VCH, Weinheim FRG (1993)

Wipke, W.T., H. Braun, G. Smith, F. Choplin and W. Seiber. SECS Simulation and Evaluation of Chemical Synthesis: Strategy and Planning. In: Computer-Assisted Organic Synthesis (W.T. Wipke and W.J. Howe, Eds.) ACS Symposium Series 61 (1976)

Worthy, W. Methyl Isocyanate: The Chemistry of a Hazard. Chemical and Engineering News 63(6), 27-33 (1985)
Appendix A: Role Specification and Chemistry Constraints for Case Study - 1 Manufacture of Carbaryl

A.1 Role Specification Constraints

Table 5 shows the knowledge based role specification constraints employed in the carbaryl example, where R denotes reactant only, P denotes the final product, C denotes product or co-product, N denotes the exclusion of a species from a system and a blank space denotes no restriction.

Table 5: Role Specification Constraints - Carbaryl Example
Entries are listed in species order with blank cells omitted.

Species: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
System 0: R R R R R R P R R R C N C
Systems 1A & 1B: C C C C R C C C R
These constraints were developed specifically for two step stoichiometries according to the following arguments, based on chemical knowledge and the existing industrial chemistries:

• The product (carbaryl, species 7) should appear only as the product in system zero, or as a product or co-product in systems 1A and 1B, never as a reactant.
• Other naphthyl group containing molecules should be reactants only in system zero (i.e. no naphthyl containing co-products should appear in system zero), and naphthyl containing compounds with complex substituents (i.e. species 4, 5, and 6) should be products or co-products only in systems 1A and 1B.
• Methyl isocyanate and methyl formate (species 14 and 15) may be produced only in systems 1A and 1B and consumed only in system zero (i.e. they may not be decomposed to produce simpler molecules).
• H2 (species 16) appears as a co-product only in all systems since hydrogenation reactions are not required.
• HCl (species 19) appears as a co-product only in system zero, and as a reactant (Cl provider) or co-product (as in the industrial chemistries) in systems 1A and 1B.
• H2O (species 18) appears as a possible OH group donor or recipient in all systems.
• O2 (species 17) appears as an oxygen provider in systems 1A and 1B only.
A.2 Chemistry Constraints

The following knowledge based chemistry constraints were employed:

• naphthyl containing species with complex substitutions are not allowed as reactants

ii4 + ii5 + ii6 = 0    (73)

• other naphthyl group containing species may not react with each other

ii1 + ii2 + ii3 + ii7 ≤ 1    (74)
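As an illustration, constraints 73 and 74 can be checked for a candidate reactant set in a few lines. The sketch assumes that the variables in the two constraints are 0/1 indicators that equal 1 when the numbered species acts as a reactant; that reading, the function name and the example reactant sets are assumptions made for this sketch.

```python
def satisfies_chemistry_constraints(reactants):
    """Check constraints (73) and (74) for one candidate stoichiometry.

    reactants: set of species numbers (1..19) acting as reactants.
    The indicator r[i] is 1 if species i is a reactant, else 0
    (assumed meaning of the variables in equations 73 and 74).
    """
    r = {i: int(i in reactants) for i in range(1, 20)}
    eq73 = r[4] + r[5] + r[6] == 0          # no complex naphthyl reactants
    eq74 = r[1] + r[2] + r[3] + r[7] <= 1   # at most one other naphthyl reactant
    return eq73 and eq74


# Hypothetical reactant sets, chosen only to exercise both constraints.
print(satisfies_chemistry_constraints({3, 14}))   # one naphthyl species
print(satisfies_chemistry_constraints({1, 3}))    # two naphthyl species
print(satisfies_chemistry_constraints({4, 18}))   # complex naphthyl reactant
```

The first set passes both constraints, the second breaks constraint 74 and the third breaks constraint 73.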
PART II: Applications of CAMD

The first part of the book dealt with some of the solution techniques commonly employed to tackle the CAMD problem. This part demonstrates the application and practice of some of those techniques to different types of problems in CAMD. Chapters 8 & 9 describe the industrial application of CAMD methods for solvent design and selection. In particular, the use of the hybrid CAMD method (chapter 6) is highlighted. Chapter 10 deals with optimal solvent design for blanket-wash using the optimization-based CAMD method of chapter 3. Chapter 11 extends the application of CAMD from single solvent design to solvent mixture design together with an application example. Chapter 12 provides an example of the application of the global optimization-based CAMD method of chapter 4 to optimal refrigerant design while chapter 13 highlights the application of genetic algorithm-based CAMD (chapter 5) to polymer design. Chapter 14 provides a detailed case study of the application of CAMD to identify multistep reaction stoichiometries (using the method described in chapter 7). Finally, chapter 15 presents the application of CAMD to design of fuel additives employing the genetic algorithm-based CAMD method of chapter 5.
Computer Aided Molecular Design: Theory and Practice
L.E.K. Achenie, R. Gani and V. Venkatasubramanian (Editors)
© 2003 Elsevier Science B.V. All rights reserved.
Chapter 8: CAMD for Solvent Selection in Industry - I
J. M. Vinson
8.1 INTRODUCTION

While the process design for a new commercial drug is not the critical step of getting a new drug to market, it is important to do a good job of scaling from the laboratory to full production. A good pharmaceutical manufacturing process is insensitive to small variations in operating conditions and runs in a reasonable time. Along with the commercial manufacturing needs, batches of active pharmaceutical ingredient (API) are required for clinical trials, so processes in development must be capable of delivering ever-increasing amounts of API.

One aspect of the development of production processes for APIs is the selection of appropriate solvents for dissolution of raw materials, reactions and product crystallization. The most common mechanism for determining solvents is to select from a common set of solvents, as pharmaceutical companies are not interested in developing new solvents for their processes. While this method is sufficient for most cases, situations arise where none of the usual solvents is particularly effective. When the timelines are short, the developers have little choice but to go with the best of the ineffective solvents, or to use a more complex procedure. For example, the procedure might call for several extractions and back-extractions to achieve an acceptable yield with a moderate solvent, whereas a better solvent might be available that can effect the extraction in one or two steps.

CAMD is one tool that can help quickly point to a number of candidate solvents. The goal is to help the development team find candidate solvents that they may not have considered in the normal course of development.

As we have seen throughout this book, computer aided molecular design is a methodology in which molecules are designed to meet specific needs. While the approaches vary widely, depending on the application area, they all require the ability to predict the behavior of the full compound. This is accomplished through molecular dynamics, expert systems, genetic algorithms, group contribution methods, and combinations of these techniques.
The hitch for complex industrial compounds, such as those found in pharmaceutical applications, is that it is not always possible to accurately predict the properties of the compounds. This chapter describes the application of computer-aided molecular design to situations where the standard CAMD techniques do not obviously apply.
8.2 CAMD METHODOLOGY USED

8.2.1 Generalized CAMD Framework

As described in chapter 6, the basic approach to computer-aided molecular design (Harper, 2000) is a three-step process:

1. Pre-Design: Define the problem in terms of desired properties of the compound to be designed. At this stage it is also critical to select the best formulation of the problem, as the problem with the most clarity and the most available data will be easier to solve. Since design is an iterative process, one will often come back to the pre-design stage to evolve the problem definition, based on results obtained in stages two and three.
2. CAMD: Run the actual CAMD algorithm to generate compounds and test them against stated criteria from the pre-design stage.
3. Post-Design: Test the results based on properties that are not easily screened during stage two, such as environmental, health & safety criteria. The compounds must also be tested either in simulation or in the laboratory.
8.2.2 Special Features for Complex Solutes

The very nature of specialty chemical and pharmaceutical development is of working with new compounds. Many of these contain unusual active groups or combinations of active groups that make property predictions dubious by standard means. These factors add an extra degree of difficulty to solvent selection problems. However, one can make use of extensive experimental apparatus to enable computer-aided molecular design. In particular, where traditional methods of solvent selection by experiment do not result in an acceptable solvent, the results of those very experiments can help point researchers in the right direction. The following seven-step procedure (Vinson et al. 2000) details how to combine experimental work with CAMD to help find appropriate solvents for complex solutes.

Step 1. Select N solvents with solubility parameter values between those of hexane (minimum) and water (maximum).
Step 2. Compute solubility of the solute in each of the N solvents using the regular solution theory (requires only solubility parameter value).
Step 3. Plot the calculated solubility values against the known solubility parameter values of each solvent and identify the location of the maximum solubility value together with the corresponding solubility parameter value.
Step 4. Based on the solubility parameter value from step 3, identify compounds with similar solubility parameter values from the database to produce a list of compounds with known properties such as melting points, boiling points, and so on.
Step 5. Use the generated data to define the solvent design/selection problem and go to stage 2 of the generalized CAMD framework (see chapter 6).
Step 6. Validate the selected compounds (solvents) from step 5 by plotting their ability to dissolve the solute on the solubility versus solubility parameter plot. They should all lie near the maximum.
Step 7. Consider other properties as given in the post design phase for further screening and final selection.

Note that the first two steps of this procedure are traditionally conducted in the lab in the search for appropriate solvents. The remaining steps walk researchers through a mechanism to develop a CAMD formulation to find alternate solvents.
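Steps 2 and 3 can be sketched numerically. The example below is a minimal illustration, assuming the Scatchard-Hildebrand form of regular solution theory in the dilute limit (solvent volume fraction taken as 1). Besides the solubility parameters it uses a solute molar volume, heat of fusion and melting point; all solute values and the grid of solvent solubility parameters are hypothetical placeholders, not data from this chapter.

```python
import math

R = 8.314  # gas constant, J/(mol K)


def ideal_ln_x(dh_fus, t_melt, temp):
    """Ideal solubility term: ln x = -(dH_fus / R) * (1/T - 1/Tm)."""
    return -(dh_fus / R) * (1.0 / temp - 1.0 / t_melt)


def regular_solution_x(delta_solvent, delta_solute, v_solute,
                       dh_fus, t_melt, temp=298.15):
    """Mole-fraction solubility of a solid solute, regular solution theory.

    delta_* in MPa^0.5 and v_solute in cm^3/mol, so v * delta^2 is in J/mol.
    Assumes the dilute limit (solvent volume fraction = 1).
    """
    ln_gamma = v_solute * (delta_solvent - delta_solute) ** 2 / (R * temp)
    return math.exp(ideal_ln_x(dh_fus, t_melt, temp) - ln_gamma)


# Hypothetical solute properties (placeholders for illustration only).
SOLUTE = dict(delta_solute=22.0, v_solute=150.0, dh_fus=25000.0, t_melt=465.0)

# Hypothetical grid of solvent solubility parameters, 15 to 48 MPa^0.5.
solvent_deltas = [float(d) for d in range(15, 49)]

# Step 2: solubility in each solvent; Step 3: locate the maximum.
curve = {d: regular_solution_x(d, **SOLUTE) for d in solvent_deltas}
best_delta = max(curve, key=curve.get)
print(best_delta, round(curve[best_delta], 4))
```

In this form the activity coefficient term vanishes when the solvent and solute solubility parameters coincide, so the maximum of the curve sits at the solute value, which is the quantity step 3 reads off the plot.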
8.3 APPLICATION EXAMPLES

With the case studies, we explore the use of CAMD for solvent selection in a number of examples inspired by the pharmaceutical industry. Only the first example is worked in full detail. The second example provides the basic setup of the problem and suggestions as to how one might approach finding appropriate solvents via CAMD. The final example is a challenge problem, which is presented to show the full complexity of solving computer aided molecular design in a real-world situation. The ProCAMD (ICAS Documentation, 2002) software developed at CAPEC (www.capec.kt.dtu.dk) has been used for the solution of the problems.
8.3.1 Example 1: Extraction Solvent Replacement

This case study is an example of CAMD used for solvent selection. Not only does this example show the difficulty of handling complex compounds, but it also demonstrates the need for a well-thought-out problem formulation. In this case, there are several problems to be handled by a new solvent. The first task is to determine which problem has the highest likelihood of successful resolution.

This example is of a reaction system, followed by extraction. The basic chemistry is described in Figure 1. The first reaction is a peptide coupling between compounds A and B with diisopropyl carbodiimide (DIC) as a coupling agent and N-hydroxybenzotriazole as a catalyst. This liquid-phase reaction runs in a solvent mixture of 1:1 dimethylformamide (DMF) and methylene chloride (MeCl). Reactant A has limited solubility in these solvents, thus the reaction runs over several hours. The second reaction is a saponification that hydrolyzes the ethyl group in compound C with 2.5 N sodium hydroxide. The current process calls for no isolation between the first and second reaction, which is common in the pharmaceutical industry due to purity concerns. The second reaction is followed by an extraction in methylene chloride, leaving the product in the aqueous layer. The final workup (not shown in Figure 1) involves an isoelectric precipitation that isolates the product as a zwitterion. The diisopropyl urea (DIU) byproduct of reaction 1 is somewhat soluble in water, which leaves DIU with the product through the precipitation step and necessitates additional back-extractions after the second reaction to purify the aqueous phase.
[Reaction scheme]
Reaction 1: A (solid) + B (solid) + diisopropyl carbodiimide -> C (dissolved) + diisopropyl urea (dissolved). Conditions: 1) 1:1 DMF/CH2Cl2; 2) 0.2 equiv. N-hydroxybenzotriazole; 3) ambient temp., 6 h.
Reaction 2: C (dissolved) + NaOH -> D (dissolved) + CH3CH2OH.
Figure 1: Reaction chemistry details for example 1

While scaling this process from the laboratory to the manufacturing scale, several inefficiencies in this process became evident, and the development team began looking into alternatives. As far as CAMD goes, there are a number of problems to solve, several of which interact with one another. One could attempt to find a replacement solvent for methylene chloride, as it is environmentally undesirable for large-scale operations. A solvent that preferentially removes the DIU impurity from the first reaction mixture would reduce the need for additional workup after the second reaction. Finally, the reaction time could be improved by finding a new solvent (or solvent system) that does a better job of dissolving compounds A and B.
Pre-Design Phase

After studying the available data and the compounds in question, it became clear that the best problem to solve is that of removing the DIU impurity from the post-second-reaction mixture. While it would be more profitable to explore a better reaction system solvent, there is not enough solubility data on the compounds A, B, C and D to attempt solvent design for them. In addition, these compounds have structures that reduce the utility of the group contribution methods. As we shall see in the results, this particular approach can also help replace or reduce the methylene chloride in the process. The options and some comments on their viability are listed in Table 1.
Table 1: Summary of potential problems to solve in Example 1

Option | Comment
Replace methylene chloride | Wide range of possible resolutions, may take too long. Not enough experimental data for this system.
Remove DIU after the first reaction | Reduce the number of extractions, but this may be a large process change. DIU solubility data are available.
Remove DIU after the second reaction | The smallest change, as this is the current process. A better solvent system must be found. DIU solubility data are available.
Find a better solvent for the reactants, A and B | Requires significantly more solubility information than is available for the reactants and products. Compounds are not compatible with current group contribution estimation methods.
In this example, we apply the solvent selection approach for complex compounds described in Section 8.2.2. Steps one and two had been completed in earlier experimental work: DIU solubility was determined in a number of solvents, which happen to span a wide range of total solubility parameters. In selecting solvents, one wants to ensure a good representation of the range of total solubility parameter in order to focus in on the most likely range of total solubility parameter for the designed solvent. Step three of the method produces a plot of solubility data for DIU as a function of total solubility parameter for the solvents. This is shown in Figure 2. Since the peak in Figure 2 is around a total solubility parameter of 22 MPa^0.5, it is most likely that the best solvents for DIU will have a total solubility parameter between 21 and 23 MPa^0.5. It is also clear that the solute is not very soluble in paraffinic or cyclic hydrocarbons (solubility parameter around 15 MPa^0.5) and only slightly so in water (solubility parameter of 47.8 MPa^0.5) and other polar solvents. The likely solvents having similar solubility parameters are acyclic alcohols, ketones, aldehydes, acids, esters and ethers. Aromatic solvents, though they may fit into this total solubility parameter range, will not be considered due to their poor health effects profiles.
[Figure 2 plot: solubility of diisopropylurea (% w/w, axis 0 to 4.5) against solubility parameter (MPa^0.5, axis 14 to 28)]
Figure 2: Plot of solubility versus total solubility parameters for DIU as the solute

In Step four the other properties of the solvent are identified, either by comparison with database materials or by specification of the process. In this case, the process creates the specifications. Since the solvent must be a liquid at the operating conditions, its melting point must be less than 300 K. Since DIU is to be removed from the reaction mixture, the solvent should split the reaction mixture into two phases with the solvent rich phase containing the majority of the DIU, which is not totally miscible in water. As a first pass environmental assessment, the octanol/water partition coefficient should be kept as low as possible. Also, for recovery concerns, the solvent must be easy to separate from DIU and therefore its boiling point should not be greater than 450 K for separation by evaporation or distillation. These target properties are listed in Table 2.
Table 2: Target properties for example 1

Property | Target / Range
Total solubility parameter | 21 - 23 MPa^0.5
Boiling point | Less than 450 K
Melting point | Less than 300 K
Octanol/Water partition coefficient (logP) | Less than 2
Water solubility | Immiscible
Water capacity of solvent | Less than 1.0, preferably zero
Groups to search | CH3, CH2, CH, C, OH, CH3CO, CH2CO, CHO, CH3COO, CH2COO, HCOO, CH3O, CH2O, CH-O, COOH, COO
 | Low environmental impact
 | Limited health & safety concerns
CAMD Phase

Moving into step five of the process described in Section 8.2.2, new compounds were generated by ProCAMD based on the specifications listed in Table 2. This problem formulation generated 3498 compounds, based on combining groups to form only chemically feasible molecules. The octanol/water constraint removed 260 compounds. The total solubility parameter constraint removed 2634 compounds. The melting and boiling point constraints removed 534 and 39 compounds, respectively. The solvent capacity constraint removed another 17 compounds, leaving 14 final compounds. Of these, about half appear in the DIPPR database of compounds. Table 3 lists those designed solvents that appear in the database, together with their water solubility and EH&S properties. Note that 2-butanol and 2-methyl-1-propanol are isomers in terms of the groups that make up the compounds (2 CH3, CH2, CH, OH). The final four compounds (five-carbon alcohols) are also isomers with respect to their groups (2 CH3, 2 CH2, CH, OH). Note that the predicted and actual total solubility parameters do not necessarily match.
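The sequential screening described above can be sketched as a cascade of filters that records how many candidates each constraint removes. This is an illustration only, not the ProCAMD implementation: the candidate records and their property values are hypothetical, and only the property bounds and the reported removal counts are taken from the text.

```python
def cascade(candidates, filters):
    """Apply filters in order; return the survivors and the number of
    candidates removed by each filter."""
    removed = {}
    for name, keep in filters:
        kept = [c for c in candidates if keep(c)]
        removed[name] = len(candidates) - len(kept)
        candidates = kept
    return candidates, removed


# Bounds from Table 2, applied in the order used in the text.
FILTERS = [
    ("logP", lambda c: c["logP"] < 2.0),
    ("solubility parameter", lambda c: 21.0 <= c["delta"] <= 23.0),
    ("melting point", lambda c: c["Tm"] < 300.0),
    ("boiling point", lambda c: c["Tb"] < 450.0),
]

# Hypothetical candidates (placeholder property values).
CANDIDATES = [
    {"name": "cand-1", "logP": 0.9, "delta": 22.1, "Tm": 185.0, "Tb": 391.0},
    {"name": "cand-2", "logP": 3.1, "delta": 21.5, "Tm": 200.0, "Tb": 420.0},
    {"name": "cand-3", "logP": 1.2, "delta": 17.0, "Tm": 180.0, "Tb": 350.0},
    {"name": "cand-4", "logP": 1.5, "delta": 22.5, "Tm": 320.0, "Tb": 440.0},
    {"name": "cand-5", "logP": 1.0, "delta": 21.2, "Tm": 250.0, "Tb": 470.0},
]

survivors, removed = cascade(CANDIDATES, FILTERS)
print([c["name"] for c in survivors], removed)

# Bookkeeping check on the counts quoted in the text:
# logP, solubility parameter, melting point, boiling point, solvent capacity.
reported_removed = [260, 2634, 534, 39, 17]
remaining = 3498 - sum(reported_removed)
print(remaining)
```

Because the filters run in sequence, the count attributed to each constraint depends on the order in which the constraints are applied; tightening an early constraint changes the numbers reported for the later ones.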
Table 3: Potential solvents for Example 1

Compound (CAS #) | Predicted Solubility Param | Solubility Param¹ | Water Solubility² (mg/L) | RTECS Code³
1-Butanol (000071-36-3) | 22.47 | 23.3536 | 6.32E+04 |
t-Butanol (000075-65-0) | 21.47 | 21.6034 | 1E+06 |
2-Methyl-1-propanol (000078-83-1) | 21.83 | 22.9094 | 8.5E+04 | Tumorigen (C); Mutagen (M)
2-Butanol (000078-92-2) | 21.83 | 22.5414 | 1.81E+05 | Reproductive-Effector (T)
1-Pentanol (000071-41-0) | 21.72 | 22.576 | 2.2E+04 |
Ethylene-glycol-monopropylether (002807-30-9) | 21.65 | 20.0055 | 3.169E+05 |
Ethyl Lactate (000097-64-3) | 22.62 | 22.3818 | 1E+06 |
2-Methyl-1-butanol (000137-32-6) | 21.16 | 22.1107 | 2.97E+04 | Primary-Irritant (S)
3-Methyl-1-butanol; Isoamyl alcohol (000123-51-3) | 21.16 | 22.1574 | 2.67E+04 | Tumorigen (C); Human-Data (P); Primary-Irritant (S)
2-Pentanol (006032-29-7) | 21.16 | 21.704 | 4.46E+04 |
3-Pentanol (000584-02-1) | 21.16 | 21.1227 | 5.15E+04 |

Two further RTECS entries belong to rows of this table that cannot be identified here: "Mutagen (M); Reproductive-Effector (T); Human-Data (P); Primary-Irritant (S)" and "Mutagen (M); Primary-Irritant (S)".
To explore the sensitivity of the CAMD method, we can adjust each of these constraints individually through ProCAMD. For example, tightening the logP constraint to less than 1.0 will filter over 1000 compounds and change the results of the subsequent filters as well, leaving ten compounds in the end. This information is listed in Table 4, showing the change and the number of compounds screened out for each condition along with the number of compounds remaining after screening. In a quick assessment, the most sensitive parameter for the overall problems is the boiling point range, as raising the upper bound increases the number of candidates to 96 from four.
(1) Solubility parameter data is from the CAPEC Database (www.capec.kt.dtu.dk).
(2) Water solubility data from: "SRC PhysProp Database" (online version), Syracuse Research Corporation, Syracuse, NY, USA.
(3) RTECS Code is from: WebSpirs version of "RTECS (through 2000/04)."
Table 4: Number of compounds screened out for a variety of conditions
| Change | logP | Sol. Par. | Tm | Tb | Water Capacity | Final Compounds |
|---|---|---|---|---|---|---|
| Original constraints | 260 | 2634 | 181 | 392 | 17 | 14 |
| logP to less than 1 | 1031 | 1991 | 426 | 23 | 17 | 10 |
| Water capacity max 0.5 | 260 | 2634 | 181 | 392 | 28 | 3 |
| Melting point max 250 K | 260 | 2634 | 534 | 39 | 17 | 14 |
| Melting point max 200 K | 260 | 2634 | 598 | 0 | 2 | 4 |
| Boiling point max 500 K | 260 | 2634 | 181 | 282 | 45 | 96 |
| Boiling point max 350 K | 260 | 2634 | 181 | 463 | 0 | 0 |
Post-Design Phase

Steps 6 and 7 of the design procedure for complex solutes call for additional analysis of the candidate solvents found from the CAMD phase. Ideally, we would test the apparently best compounds as solvents for DIU. In this case, of the solvents listed in Table 3, 1-Butanol and 2-Methyl-1-Propanol were easily available from our stockroom. A quick test of DIU solubility showed excellent results for both the butanol (6.25 wt% at 25 °C) and methyl propanol (6.48% at 25 °C). These values were 50% higher than the best solvent tested in our prior work, as shown in Figure 2. We then took a new look at the reactions above and decided to work with butanol as an extraction solvent after the second reaction. We examined two possibilities. The first option was to keep the DMF/MeCl2 mixture for the first reaction and conduct one methylene chloride extraction after reaction two, followed by a butanol extraction. As this option was not substantially different from the standard chemistry, there were no significant improvements in the yield or purity of the mixture after the second reaction. The second option was to remove methylene chloride from the chemistry entirely, as it does not participate in the reaction, and use butanol as the only extraction solvent after reaction two. For comparison purposes the reaction mixture after the second reaction was divided into two portions. One portion was extracted with two methylene chloride treatments per the standard. The other portion was extracted with an equivalent volume of butanol in two extractions. The results for these experiments are listed in Table 5.
Table 5: Example 1 Results

| Content of aqueous layer (percent of original) | MeCl2 Extraction | Butanol Extraction |
|---|---|---|
| DIU total | 0% | 0% |
| D (extraction 1) | 98% | 74% |
| D (extraction 2) | 94% | 99% |
| D total | 92% | 73% |
| DMF | 20% | 11% |
| HOBT | 76% | 8% |
| D-urea impurity | ~95% | ~20% |
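The "D total" entries in Table 5 are consistent with reading each per-extraction value as the fraction of product D retained in the aqueous layer during that step, so that the total is the product of the two. The sketch below illustrates this reading; it is an interpretation of the table, not a statement of how the authors computed the totals, and the function name is hypothetical.

```python
def total_retained_percent(per_extraction_percent):
    """Cumulative percent retained, assuming each value is the fraction
    retained in that extraction step relative to what entered the step."""
    retained = 1.0
    for pct in per_extraction_percent:
        retained *= pct / 100.0
    return retained * 100.0


# Per-extraction retention of product D in the aqueous layer (Table 5).
mecl2_total = total_retained_percent([98, 94])
butanol_total = total_retained_percent([74, 99])

print(f"MeCl2 total:   {mecl2_total:.1f}%")
print(f"Butanol total: {butanol_total:.1f}%")
```

Rounded to whole percent, these values match the 92% and 73% totals listed in the table.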
Overall both solvents remove all the DIU after two extractions, which is the primary goal of the extractions. The butanol system is at a slight disadvantage for product recovery due to the first extraction pulling about 26% of the product, D, into the butanol layer. Clearly, this extraction could be optimized to achieve greater than 90% total recovery of product from the extraction step. The other advantage of the butanol extractions is that the content of both the DMF and HOBT in the aqueous layer is much lower. Finally, the butanol extractions were successful in significantly reducing the level of the D-urea reaction byproduct. While this was not stated as a primary goal, reducing the amount of extraneous organics in the aqueous phase is advantageous to the subsequent precipitation reaction and purification steps. The all-butanol extraction option has the additional advantage of removing methylene chloride entirely from the process.

8.3.2 Example 2: Mass separating agent
In this example, the development team was exploring the design of a manufacturing process to recover pure product and potentially recover solvents for reuse. The method used to design the process was based on the process synthesis procedures of Jaksland (1995). The synthesis procedure explores the properties of the mixture to select appropriate separation techniques. The original design of the process simply used distillation and then n-heptane as an anti-solvent to crystallize the product, which was filtered and dried, as shown in Figure 3. However, there were some disadvantages to heptane, particularly regarding solubility of the impurities that end up as solids with the X-2P product. The challenge for CAMD was to find a suitable replacement mass-separating agent (MSA) for heptane that will cause the X-2P to precipitate out of solution while retaining the toluene and reactants in the liquid phase.
Figure 3: Example 2 process flow

The primary chemistry in this example is

X-2R1 + X-2R2 → X-2P + H2O

This takes place at atmospheric conditions in the presence of toluene as the primary solvent with MTBE carried over from previous processing with X-2R2. Table 6 gives the approximate composition of the post-reaction mixture for which the process synthesis has been conducted.
Table 6: Example 2 post-reaction mixture composition

| Component | State of pure component at 298 K, 1 atm | Tb (K) | Tm (K) |
|---|---|---|---|
| X-2R1 | Solid | 524.33 | 443 |
| MTBE | Liquid | 328.35 | 164.55 |
| X-2R2 | Solid | 445.91 | 300.93 |
| X-2P | Solid | 611.1 | 353.1 |
| Toluene | Liquid | 383.7 | 178.18 |
| Water | Liquid | 373.2 | 273.2 |

Concentration (mol%): only five of the six entries survive, in the order 0.1, 1, 10, 73.9, 10.

As in the original process the best mechanism for removing the MTBE and water from the post-reaction mixture is concentration of the mixture in at least one exchange of toluene. At that point the mixture is essentially free of water. However, care must be taken to not remove too much toluene, as the product tends to form a highly viscous tar with the toluene at higher concentrations. As a result, the mixture passed to the product isolation step must be at least 30 wt% toluene.
Pre-Design Phase

The primary goal of CAMD in this example is to replace heptane with another MSA to effect the precipitation of the X-2P product while retaining the other materials in the liquid phase. Table 7 lists the solubilities of the compounds in heptane.
Table 7: Solubility of compounds in heptane

| Compound | Solubility in heptane (g/cm^3) |
|---|---|
| X-2R1 | 0.0125 |
| X-2R2 | 0.0186 |
| X-2P | 2.83E-04 |
| Toluene | 0.397 |
In the CAMD Pre-Design phase, toluene, X-2R2 and X-2R1 must be miscible in the new solvent, and X-2P must be immiscible. Ideally, the relative values of the solubility will also be greater for the first three and lower for X-2P. Unfortunately, very few of these mixture properties can be predicted due to the complex nature of the solutes. As a first pass, we can use CAMD to find solvents with similar properties to heptane. The target values for this initial CAMD problem are listed in Table 8.
Table 8: Target properties for example 2

| Property | Heptane value | Target / Range |
|---|---|---|
| Boiling point | 371.6 K | Less than 400 K |
| Melting point | 182.6 K | Greater than 150 K |
| Total solubility | 15.2 MPa^0.5 | 14 - 16 MPa^0.5 |
| Groups to search | N/A | CH3, CH2, CH, C, OH, CH3CO, CH2CO, CHO, CH3COO, CH2COO, HCOO, CH3O, CH2O, CH-O, COOH, COO |
| Low environmental impact | | |
| Limited health & safety concerns | | |
CAMD Phase

Based on the specifications above, ProCAMD generates 3498 compounds, filtering 3397 based on solubility parameter, 14 based on melting point and 44 based on boiling point, leaving 43 candidate compounds. Of these compounds, the following were found in the DIPPR databank: MTBE, Ethylal, Ethyl propyl ether, tert-Butyl ethyl ether, Methyl tert-pentyl ether, Diisopropylether, Acetal, n-Butyl ethyl ether, Di-n-propyl ether, and Ethyl tert-pentyl ether.
Post-Design Phase

Based on discussions with the chemist and available compounds from the stockroom, we decided to explore the use of diisopropylether via experimentation. The chemist was also curious about 2-pentanone, which did not appear on the list because its solubility parameter is closer to 18 MPa^0.5. The solubility information for 2-pentanone and diisopropylether is listed in Table 9 and Table 10, respectively.
Table 9: Solubility of compounds in 2-pentanone

| Compound | Solubility in 2-pentanone (g/cm^3) | Relative to heptane solubility |
|---|---|---|
| X-2R1 | 0.3043 | 24.3 |
| X-2R2 | 0.1337 | 7.19 |
| X-2P | 1.685E-04 | 0.60 |
| Toluene | 0.126 | 0.32 |
Table 10: Solubility of compounds in diisopropylether

| Compound | Solubility in diisopropylether (g/cm^3) | Relative to heptane solubility |
|---|---|---|
| X-2R1 | 0.072 | 5.76 |
| X-2R2 | 0.024 | 1.29 |
| X-2P | 4.82E-05 | 0.17 |
| Toluene | 0.326 | 0.82 |
With two potential solvents selected at this point, decisions need to be made as to which properties are the most important. In this case, it is most important to reduce the solubility of the X-2P in the solvent. The fact that diisopropylether has less than 20% of the solubility of heptane for X-2P takes precedence over the better solubility for X-2R1 and X-2R2 in 2-pentanone.
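The "relative to heptane" columns and the selection rule stated above can be reproduced from the solubilities in Tables 7, 9 and 10. This is a minimal sketch of that bookkeeping; the dictionary layout and function name are illustrative assumptions, and only the numerical values come from the tables.

```python
# Solubilities in g/cm^3, taken from Tables 7, 9 and 10.
HEPTANE = {"X-2R1": 0.0125, "X-2R2": 0.0186, "X-2P": 2.83e-4, "Toluene": 0.397}
CANDIDATES = {
    "2-pentanone": {
        "X-2R1": 0.3043, "X-2R2": 0.1337, "X-2P": 1.685e-4, "Toluene": 0.126,
    },
    "diisopropylether": {
        "X-2R1": 0.072, "X-2R2": 0.024, "X-2P": 4.82e-5, "Toluene": 0.326,
    },
}


def relative_to_heptane(solubility):
    """Ratio of each solute's solubility to its solubility in heptane."""
    return {solute: solubility[solute] / HEPTANE[solute] for solute in HEPTANE}


ratios = {name: relative_to_heptane(sol) for name, sol in CANDIDATES.items()}

# Selection rule from the text: the lowest X-2P solubility takes precedence.
best = min(ratios, key=lambda name: ratios[name]["X-2P"])

for name, ratio in ratios.items():
    print(name, {solute: round(value, 2) for solute, value in ratio.items()})
print("preferred MSA:", best)
```

Ranking on the X-2P ratio alone encodes the stated priority; a weighted score over all four solutes would be a different design choice.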
8.3.3 Example 3: Challenge Problem

From a computer-aided design perspective, this problem has proven difficult. The hope of presenting this problem is to give researchers in this area an idea of the complexities that arise in the real world. The mixture in this case contains water, acetonitrile, ammonia, and three difficult-to-model internal compounds, as listed in Table 11. The product, X-3P, is an Ammonia-Bromine salt, which impacts any computations on the mixture. The structures of this compound and the other unique compounds are shown in Figure 4. The mixture is highly non-ideal due to the electrolytes present in X-3P and ammonia.
Table 11: Example 3 mixture composition

| Compound | Wt. % |
|---|---|
| Acetonitrile | 51.7 |
| Water | 29.3 |
| Ammonia | 10.3 |
| X-3P (product) | 7.7 |
| X-3R (reactant) | 0.7 |
| X-3B (byproduct) | 0.4 |
Figure 4: Molecular structures of X-3P, X-3R and X-3B
The goal for the operation is to remove the water (to less than 2 wt%) and to drive the composition of the mixture to approximately 15% X-3P in Acetonitrile. The new solvent should also be a liquid at normal conditions. The suggested properties for such a solvent are listed in Table 12.
Table 12: Target properties for challenge problem

| Property | Target / Range |
|---|---|
| Boiling Point | Less than 400 K |
| Melting Point | Greater than 150 K |
| Miscible with Acetonitrile | |
| Immiscible with Water | |
| Good solvent for X-3P | |
| Low environmental impact | |
| Limited health concerns | |
First try to find solvents that satisfy all the constraints except those related to the solubility of X-3P. Then use experimental data, if available, to find out which of the candidates have good solubility for X-3P. This will reduce the number of candidates. In the final selection, perform simulation as well as more detailed analysis of the property constraints, especially since the property models used in the design-phase may be subject to errors.
8.4 CONCLUSIONS

This chapter demonstrates the utility of computer-aided molecular design even for complex solutes, where the solvent interaction is difficult to determine analytically. The chapter presents a procedure that combines experimental work with CAMD for complex solutes and then goes on to show how this applies in real situations encountered in the pharmaceutical industry. The final example presents a challenge problem for future computer-aided molecular design researchers.
8.5 REFERENCES

P. M. Harper, "A multi-phase, multi-level framework for computer aided molecular design", PhD-thesis, Technical University of Denmark, Lyngby, Denmark, 2000.
ICAS Documentations, Internal Report PEC02-14, CAPEC, Department of Chemical Engineering, DTU, Lyngby, Denmark, 2002.
C. Jaksland, "Separation process synthesis and design based on thermodynamic insights", PhD-thesis, Technical University of Denmark, Lyngby, Denmark, 1996.
J. Vinson, P. M. Harper, R. Gani, "Solvent selection for chemical and pharmaceutical processes", AIChE Annual Meeting, Paper no. 240 c, Los Angeles, USA, November 2000.
Computer Aided Molecular Design: Theory and Practice
L.E.K. Achenie, R. Gani and V. Venkatasubramanian (Editors)
© 2003 Elsevier Science B.V. All rights reserved.
Chapter 9: CAMD for Solvent Selection in Industry - II
J. L. Cordiner
9.1 INTRODUCTION

Fine Chemicals Manufacturing is increasingly looking at reducing the time to market. This means that decisions about the process are pushed further and further back the decision train. These decisions are then required when less and less of the apparently required information is available. Conventional wisdom needs to be tested to consider what information is really needed and what level and quality of decision is required at each stage. In some cases, for example pharmaceuticals, the process route needs to be decided very early for registration reasons. The choice of the route can have large implications on the costs of production and capital requirement. It is then advantageous to have methods to challenge the normal route selection and development processes. This chapter describes two methods & tools that may be used in early evaluation of processing routes related to solvent selection. These two methods & tools are SMSWIN (developed at Syngenta) and ICAS-tools (developed at CAPEC, http://www.capec.kt.dtu.dk). The methodology applied is briefly described and illustrated through two case studies.
9.2 GENERATING AND REVIEWING ALTERNATIVE PROCESS ROUTES

Clearly the synthetic routes from research (see Fig. 1) are usually not practical for a manufacturing setting. The chemist and engineer need to work together to consider how all the routes for consideration will be operated at the manufacturing scale desired by the business. At this stage it is vital that the early evaluation tools are able to aid this process in generating processes that can be radically different from conventional wisdom. Each chemical route can be operated at a manufacturing scale in a number of different ways and these need to be considered in any route evaluation. In addition the early evaluation tools are required to enable comparison of routes and processes so that the most practical options can be chosen. Clearly the level of information on each route will be sparse at this stage and therefore the tools must allow quality decisions to be taken on the limited data. It is therefore important to remember that comparison requires the data to be consistent but not necessarily accurate at this stage. As it is important to consider the whole supply chain in route selection one should use the tools alongside experience from different professionals rather than expecting the tools to do the whole job.
[Figure 1 flow diagram. Labels recoverable from the diagram: research route; generate route options; selection criteria (SHE impact; market requirements (quality; tox); activity; VPC/margin; capital); outline f/s & costs; business targets; process development; FF&P development (FF&P = formulation, fill and pack); product specification and market forecast; design, f/s, cost estimate; THE KEY DECISION POINT; decision to invest; trials to manufacture; ongoing development.]
Figure 1: Schematic of the development process for an agrochemical product (Carpenter [1,2]).

9.2.1 Challenges for the Early Evaluation Tools

The early evaluation tools need to be user friendly, robust and easy to use. In particular the tools need to be as intuitive as possible for the infrequent user, minimising the number of forms to be filled in or clicks required. This can be seen in setting the most commonly used information at very easy and fast reach as shown in Fig. 2. Wherever possible an expert system to select items or calculation methods needs to be employed in such a way that it is easy for the non specialist to use the tool whilst providing sufficient information (knowledge) about the problem and guidance to arrive at an acceptable solution. For example, the physical property method for early (solvent) evaluation and setting up of this method needs to be made very easy. This can be demonstrated by pre-setting the groups for UNIFAC (described in chapter 2 of Part I) for as many molecules and frequently used building blocks for molecules as possible as is done typically for UNIQUAC for molecules. The databases in SMSWIN, ICAS and in most commercial simulators already have this feature.
Figure 2: Property selection from SMSWIN
Many of the process developers will need help in selecting a property method when considering different solvent options. Here an expert system, as highlighted in the form of a decision tree (see Fig. 3), would be beneficial. This allows rapid selection and points to when further advice from the expert system is required.
The tools should be as visual as possible. Many visualisation tools are provided in SMSWIN and ICAS to help process developers to rapidly assess processing route options. Examples include residue maps (for evaluation of feasible separation regions), eutectic/azeotropic diagrams (for evaluation of separation constraints), solubility/saturation plots (for evaluation of phase boundaries) and many more. Having diagrammatic ways of presenting the same data can aid understanding of the solvent-based separation process. For example, a triangular phase diagram highlights the existence of one or two liquid phases in equilibrium with a vapour phase for a ternary mixture consisting of two solutes and a solvent. For the same system, a solvent-free two-dimensional phase diagram can be used to determine (visually) the amount of solvent (or entrainer) needed to break or sufficiently move an azeotrope.
[Figure 3 flowchart, titled "Syngenta Property Method Selection". Labels recoverable from the diagram: properties of component known?; no methods available (need some experimental data); regress data using Aspen or SMSWIN into appropriate model; are you trying to distinguish between isomers?; UNIFAC (Aspen or SMSWIN); is the system at low pressure?; if only moderate pressure use NRTL-HOC, NRTL-SRK or Wilson-SRK (see rules on 2 liquid phases for Wilson/NRTL); use EOS, seek advice; are 2 liquid phases present?; Wilson; NRTL; UNIQUAC; any vapour phase association (e.g. HF, carboxylic acids)?; use Wilson-HOC for carboxylic acids; use ENRTL-HF for HF/H2O; see Notes on Physical Property Method Selection; contact Joan Cordiner (PSG).]
Figure 3: Decision tree property model selection
9.2.2 Facets of Solvent-Based Processing Routes that Need Consideration

As explained in detail in Part-I of this book, there are many issues that need to be considered from materials handling, where solids as well as fluids need to be handled with issues relating to toxicity, environmental impact, etc., through to the issues of packing and formulation. A variety of tools, for example, databases and expert systems, aid the professional in decision points by having the commonly used information easily at hand. It is useful to have a tool that allows a rapid means of reviewing each research process and assessing the manufacturing process options. The chemist or chemical engineer can carry out a rapid assessment using diagrams and tables, seeing the potential issues easily and quickly.

The reaction chemistry can have a large impact on the choice of solvent-based processing route depending on the yield and selectivity obtained. Typically yield is considered most important given that materials may not be recycled due to complexity of separation. Where recycling is possible, the best route for this needs to be considered and yield is therefore not such an issue. Often, it is possible to significantly alter the known reaction chemistry in products as well as in reaction rate via the choice of solvent(s). This is highlighted for the aromatic nucleophilic substitution of the azide ion with 4-fluoronitrobenzene (Cox [3, 4]), where the reaction rate changes by 6 orders of magnitude depending upon the solvent used, as shown in Fig. 4. Clearly choice of solvent is very important.
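The "6 orders of magnitude" statement can be checked against the relative rate constants ks/kH2O reported in Fig. 4. The sketch below is a plain arithmetic check; the pairing of each solvent with its value is read from the figure and should be treated as an assumption, as should all variable names.

```python
import math

# Relative rate constants ks/kH2O as read from Fig. 4 (pairing assumed).
RELATIVE_RATE = {
    "H2O": 1.0,
    "MeOH": 1.6,
    "Me2SO": 1.3e4,
    "HCONMe2": 4.3e4,
    "(Me2N)3PO": 2.0e6,
}

# Orders of magnitude of each solvent's rate relative to water.
orders_of_magnitude = {
    solvent: math.log10(ratio) for solvent, ratio in RELATIVE_RATE.items()
}

# Spread between the slowest and fastest solvent.
span = max(orders_of_magnitude.values()) - min(orders_of_magnitude.values())

for solvent, order in orders_of_magnitude.items():
    print(f"{solvent:<10s} log10(ks/kH2O) = {order:5.2f}")
print(f"span: {span:.1f} orders of magnitude")
```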
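[Reaction scheme: 4-fluoronitrobenzene + N3(-), with relative rate constants in the solvents listed below.]

| Solvent | ks/kH2O |
|---|---|
| H2O | 1 |
| MeOH | 1.6 |
| Me2SO | 1.3×10^4 |
| HCONMe2 | 4.3×10^4 |
| (Me2N)3PO | 2.0×10^6 |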
kH2O = 4.0×10^-8 M^-1 s^-1

Figure 4: Effect of solvent on rate constant

The solvent chosen to enhance the reaction rates can potentially be used for the subsequent separations required and/or may need to be separated. In most fine chemical processes there tends to be more than one stage of chemistry required and often a different solvent may be chosen for each stage, with initial manufactures requiring purification and separation at each stage. This can mean a large number of different solvents are used throughout the process, which involves much separation leading to yield losses. Selecting the solvents for each reaction stage carefully to minimise solvent swaps can be very cost effective and can also increase yield. Clearly any tool that aids solvent selection can radically reduce the capital and operating costs of any route. The tools can lower the experimentation time required by reducing the number of solvents to be tested in the laboratory. One can look at using the tool as doing the experiments quicker. Reducing the experimentation time, and hence aiding faster development, enables more radical processes to be tried. The techniques can then also be used to look at selecting a stage wide or process wide solvent rather than having a number of solvents and solvent swaps through the process.

9.2.3 Solvent Selection Methodology in SMSWIN

The solvent selection methodology employed through SMSWIN is based on database search where the solvents are classified according to Chastrette [5]. According to this classification system, a solvent is classified as protic or aprotic. Aprotic solvents are further classified as dipolar or apolar. The Chastrette classification is highlighted for a selected set of solvents from the database of SMSWIN in Fig. 5.
Figure 5: Solvent taxonomy (from SMSWIN)

With SMSWIN, the following steps are performed in solvent selection.

• Quick and initial scan: Here, a quick scan of the database is made with respect to some known target properties of the solvent. A solubility parameter similar to that of the solute is one property that may be used. Boiling points, melting points, molecular weights, etc., may also be used. Finally, the search is limited to a class of solvents (aprotic or protic). The solvents obtained from this search could then be further analysed. It is quite possible, however, if the database is not very large, that this initial scan may not provide any good candidate solvent.

• Detailed search

  o Solvents matching specified constraints: Under detailed search, a larger database is used and while some of the properties such as boiling point and melting point may be retained, the solubility of the solute in candidate solvents is calculated (rather than implicitly obtained through solubility parameters). All known solvents of a specified class are considered and only those that do not satisfy the specified constraints are rejected. The retained set therefore also includes solvents for which the property models were not available and/or solvents for which the pure component property constraints could not be checked because of missing data in the database. Therefore, at this step, there will be a large list of candidate solvents that will need to be reduced.

  o Analysis of scatter graph: In this step, first a scatter graph of productivity index versus percent recovery of the solvent is plotted for all the solvents. In the next step, those solvents that had missing property model parameters, missing data in the database and/or calculated solubility lower than a specified minimum are removed from the scatter graph. From the remaining solvents, those that belong to a region of potentially promising solvents are retained for the next step (verification of constraints).

  o Verification of constraints: The most promising solvents from the last step are verified for specific desired properties of the solvent. For example, in solvents for enhancing reaction rates, the ability to cause a liquid-liquid split and no formation of eutectic mixtures are important. The candidate solvents are verified for these properties. Those that remain are further analysed in terms of percentage recovery of the solute and the candidates are ordered in terms of the highest percentage recovery.

• Final selection and analysis: The final selection is made from those candidates that satisfy all constraints. At this level, EH&S constraints are added and a further screening is made. From the remaining solvents, one criterion for selection could be cost and/or availability.
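The key point of the detailed search and scatter-graph steps is how missing data is treated: a solvent is first rejected only when a known property violates a constraint, and solvents with missing data are dropped only later. The following is a minimal sketch of that logic under stated assumptions. It is not SMSWIN code; the `Solvent` record, the function names and the example solvents A to E are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Solvent:
    name: str
    boiling_point_c: Optional[float]    # None if missing from the database
    melting_point_c: Optional[float]    # None if missing from the database
    solute_solubility: Optional[float]  # mass fraction; None if no model parameters


def violates(value, lower=None, upper=None):
    """True only when a known value falls outside the bounds.
    A missing value never counts as a violation."""
    if value is None:
        return False
    if lower is not None and value < lower:
        return True
    if upper is not None and value > upper:
        return True
    return False


def detailed_search(solvents, min_bp_c, max_mp_c):
    """Reject only solvents that demonstrably fail a constraint."""
    return [
        s for s in solvents
        if not violates(s.boiling_point_c, lower=min_bp_c)
        and not violates(s.melting_point_c, upper=max_mp_c)
    ]


def prune_for_scatter(solvents, min_solubility):
    """Drop solvents with missing data or with too low a calculated solubility."""
    return [
        s for s in solvents
        if s.boiling_point_c is not None
        and s.melting_point_c is not None
        and s.solute_solubility is not None
        and s.solute_solubility >= min_solubility
    ]


# Hypothetical database entries, for illustration only.
database = [
    Solvent("A", 200.0, -30.0, 0.35),
    Solvent("B", None, 10.0, 0.40),
    Solvent("C", 100.0, -50.0, 0.50),
    Solvent("D", 180.0, 0.0, None),
    Solvent("E", 210.0, 5.0, 0.10),
]

retained = detailed_search(database, min_bp_c=145.0, max_mp_c=25.0)
promising = prune_for_scatter(retained, min_solubility=0.3)
print([s.name for s in retained])
print([s.name for s in promising])
```

Keeping the two stages separate mirrors the text: the first pass produces a deliberately large list, and the second pass narrows it to solvents that can actually be placed on the scatter graph.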
9.2.4 Solvent Selection Methodology in ICAS

The tool-box in ICAS that performs solvent search/design is called ProCAMD (ICAS Documentations [6]). It is based on the hybrid CAMD method described in chapter 6 of this book. ICAS also has a database where a preliminary search can be made based on the specified pure component property constraints and solubility parameters. Alternatively, the search for solvents from SMSWIN could be used as the pre-design phase for ProCAMD. Once the problem has been specified in terms of property constraints and building blocks to be used, ProCAMD generates and tests candidate solvents. All these solvents satisfy the specified property constraints that can be directly calculated through in-house models in ICAS (ProCAMD). As in SMSWIN, the EH&S property constraints are considered in a separate step (post-design phase of the hybrid CAMD method of chapter 6).
9.3 CASE STUDY

9.3.1 Nitric Acid Oxidation of Anthracene to Anthraquinone
This case study highlights the use of SMSWIN and ICAS-ProCAMD (chapter 6) for the solution of a problem involving process wide solvent selection for the nitric acid oxidation of anthracene to anthraquinone. The techniques looked at solvent effect on reaction, solubilizing the starting material, solubility of nitric acid, recovery of product, separation scheme for recovery of solvent, recovery of nitric acid, boiling point and melting point, vapour pressure, price, safety, toxicity factors and many more. The solution details are taken from the MEng-thesis of Bavishi [7]. The problem is first solved with SMSWIN and with the generated problem definition information, the solution of the problem has been repeated with ProCAMD.

Solvent Selection Criteria
Based on the processing constraints, the following desired properties for the solvent are needed.

1. Anthracene has to be soluble in the solvent at 145 °C. The solubility is approximately 0.27 by mass fraction in the existing solvent at the reaction temperature. So ideally we prefer the new solvent to have solubility greater than that.
2. Recovery of Anthraquinone, the product, from the solvent. Ideally we prefer to achieve greater recovery of the product than in the current solvent. We also need to ensure that no eutectic is formed when the product is crystallised.
3. Solubility of Nitric Acid in the solvent needs to be high in order for the instantaneous reaction between the Nitric Acid and Anthracene to take place.
4. Reactivity of the solvent with Nitric Acid, Anthracene and Anthraquinone will need to be known. The solvent in this case is simply a reactant carrier and does not appear in the reaction mechanism. Therefore the solvent should not participate in the reaction.
5. The solvent used needs to be immiscible with water. The process is designed to treat such solvents. Therefore the solvent chosen should form an azeotrope with water, where the liquid splits into two liquid phases with different compositions.
6. The chosen solvent should have a minimum boiling point of 145 °C because the reaction temperature is 145 °C. At this temperature the solvent should be a liquid for liquid phase reaction.
7. The chosen solvent should have a maximum melting point of 25 °C because the product is crystallised at 25 °C. This will minimise the chance of the solvent crystallising out with the product.
8. The solvent will be released to the environment via the effluent stream and via vents. Therefore we want a solvent which is environmentally friendly.
9. The solvent used should also be economically favourable. This factor should not be of great concern as long as a majority of the solvent is being recovered. If the solvent used requires addition of make-up of fresh solvent feed for each batch of reaction, then the cost of the solvent would be a major criterion.
Solvent Selection Using SMSWIN and Published Data

Quick Scan: We need to select a solvent that would not participate in the reaction and would be immiscible with water. Since we want to find solvents that are similar (or better) in solubility than the known solvent, we want to find similar solvents that do not participate in the reaction. So, we are looking for dipolar aprotic solvents with solubility parameters close to that of anthracene (the solute). That is, we are looking for dipolar aprotic solvents that have solubility parameters < 9.7 (cal/cc)^0.5 and > 8.5 (cal/cc)^0.5. The compounds matching this specification are:

| Compound | Solubility parameter ((cal/cc)^0.5) |
|---|---|
| Acetone | 9.62 |
| 2-Butanone | 9.45 |
| Hexamethyl phosphoramide (HMPA) | 8.58 |
| Methyl propyl ketone | 8.99 |
Checking the boiling points for the above compounds showed that, except for HMPA, all other compounds have boiling points < 145 °C. This means that only HMPA is a suitable solvent. However, since very little data and/or property model parameters are available for this compound, this was also rejected. Consequently, the quick scan did not produce any candidates.

Detailed Search: In the next step a detailed search was made within the solvents database of SMSWIN. For all dipolar aprotic solvents, those that had a boiling point > 145 °C and a melting point < 25 °C were collected. If a compound did not have these properties in the database, it was not rejected. For all the compounds (solvents) that were retained, the solubility of anthracene was calculated. Again, if the property model parameters were not available, the corresponding solvent was not rejected. This gave a large search space.
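The quick scan described above amounts to two successive filters: a solubility-parameter window followed by the minimum boiling point. A minimal sketch follows. The solubility parameters are those listed in the text; the boiling points are approximate handbook values added here as an assumption (they are not given in the source), and all names are illustrative.

```python
# (name, solubility parameter in (cal/cc)^0.5, approximate normal boiling point in deg C)
# Boiling points are approximate literature values, NOT taken from the source text.
QUICK_SCAN_DATABASE = [
    ("Acetone", 9.62, 56.0),
    ("2-Butanone", 9.45, 80.0),
    ("Hexamethyl phosphoramide (HMPA)", 8.58, 232.0),
    ("Methyl propyl ketone", 8.99, 102.0),
]

SOLUBILITY_PARAMETER_WINDOW = (8.5, 9.7)  # exclusive bounds, from the text
MIN_BOILING_POINT_C = 145.0               # reaction temperature, from the text


def quick_scan(database, window, min_boiling_point_c):
    """Apply the solubility-parameter window, then the boiling point limit."""
    lower, upper = window
    in_window = [row for row in database if lower < row[1] < upper]
    return [row[0] for row in in_window if row[2] >= min_boiling_point_c]


survivors = quick_scan(
    QUICK_SCAN_DATABASE, SOLUBILITY_PARAMETER_WINDOW, MIN_BOILING_POINT_C
)
print(survivors)
```

Only HMPA passes both filters, which is the compound the text then rejects for lack of data.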
Scatter Graph: In this step, the productivity index and the % recovery of the solute are calculated first. The calculated data for all the solvents are then plotted (as shown in Fig. 6a). In the next step, the number of solvent candidates is reduced by applying the following constraints: the solubility of anthracene must be > 0.3 mass fraction at 145 °C; only solvents with all property model parameters available are considered; and only solvents with known melting point and boiling point temperatures are considered. This produced a much smaller number of solvent candidates located in a well-defined region (see Fig. 6b).

Figure 6a: Scatter graph screening of solvent candidates (before). Scatter graph for temperature range 25.0 °C to 145 °C; axis: % Recovery.

Figure 6b: Scatter graph screening of solvent candidates (after). Scatter graph for temperature range 25.0 °C to 145 °C; axis: % Recovery.
Verification of Constraints: The next step was to verify that each of the candidate solvents from Fig. 6b actually causes a liquid-liquid split and does not form a eutectic mixture.

Final Selection and Analysis: The final candidate solvents are listed in Table 1, ordered in terms of % recovery of anthracene. When the EH&S constraints listed in Table 2 are added, the final list is reduced to those solvents marked in bold letters in Table 1. Of the five solvents that satisfy all constraints, tetralin is easily available and is also the cheapest.

Solvent Selection Using ProCAMD

Using the knowledge (information) generated from the solvent search method in SMSWIN, the same problem is formulated in ProCAMD. The search is made separately for acyclic, cyclic and aromatic compounds. Within each molecular class, molecular types may be pre-selected, and this in turn selects the building blocks for the CAMD phase. Figures 7a and 7b show the general problem specification in ProCAMD. It can be noted from Figs. 7a-7b that preselection of molecule types also means automatic selection of the building blocks.
Table 1: List of solvents satisfying all property constraints except EH&S properties.

SOLVENTS                          PERCENTAGE RECOVERY OF ANTHRAQUINONE (%)
Acetophenone                      97.4
Benzyl chloride                   97.1
1-Chloronaphthalene               97.5
4-Chlorotoluene                   97.8
2-Chlorotoluene                   97.8
1,4-Dichlorobutane                97.6
1,4-Dichloro-2-butene, trans      97.4
m-Divinylbenzene                  98.2
4-Ethyl-m-xylene                  98.0
m-Ethyltoluene                    97.9
1-Heptanal                        98.4
Indane                            97.9
Indene                            98.0
Mesitylene                        97.6
1-Methylindene                    98.2
1-Methylnaphthalene               97.9
1-Methyl-3-n-propylbenzene        98.0
o-Methylstyrene                   98.0
Prehnitol                         97.5
trans-1-Propenylbenzene           98.2
Isopropylbenzene                  98.3
n-Propylbenzene                   98.2
Pseudocumene                      97.6
1,1,2,2-Tetrabromoethane          98.0
Tetralin                          98.0
1,2,3,5-Tetramethylbenzene        97.5
1,2,3-Trimethylbenzene            97.6

The property constraints are given in terms of temperature-independent pure component properties, temperature-dependent pure component properties, mixture properties and azeotrope/miscibility calculations (see Figure 7e for more details on the target properties). Note that the solubility parameter is calculated at 298 K. Among the temperature-dependent properties, only the vapour pressure is specified: > 0 & < 0.0013 (gm 3) at 298 K.
Table 2: EH&S property constraints

PROPERTIES: Very Toxic; Respiratory Sensitisers; Potent Carcinogen; Toxic; Corrosive; Animal Carcinogen; Harmful; Skin/Eye irritants; Non Hazardous; Non Irritant; Non Genotoxic
RELEASE RANGE: < 0.1 mg/m3; < 0.1 ppm; > 0.1 ppm; < 10 ppm; < 1 mg/m3; < 500 ppm; < 10 mg/m3; > 500 ppm; > 10 mg/m3
CLASS: H1; H2; M
Figures 7c and 7d show the specifications for the mixture properties and the azeotrope/miscibility calculations within ProCAMD. The UNIFAC model is selected and anthracene is selected as the solute that needs to be extracted with the solvent (Fig. 7c). From Fig. 7d, it can be noted that an azeotrope with water is specified and a liquid phase split is also specified. Figure 7e shows a typical screen shot when ProCAMD has finished the calculations in the CAMD-phase. ProCAMD did not find any cyclic compounds (because of the limitations of group parameters within the property models) but it did find acyclic compounds and aromatic compounds, listed in Table 3. One of the compounds, 1-Methyl-3-n-propylbenzene has already been found through SMSWIN (see Table 1). Therefore, the post-design phase was not continued further since the analysis had already been done through SMSWIN.
Table 3: List of feasible compounds from ProCAMD

ACYCLIC SOLVENTS      AROMATIC SOLVENTS               CYCLIC SOLVENTS
n-Decylacetate        1,2,3,4-Tetramethylbenzene      No molecule met the specifications
1-Undecanal           1-Methyl-3-n-propylbenzene
n-Nonylacetate
1-Decanal
Methyl decanoate
Figure 7a: General problem specification in ProCAMD
Figure 7b: General problem specification in ProCAMD
Figure 7c: Mixture property specification
Figure 7d: Azeotrope /miscibility calculation specifications
Figure 7e: Screen shot of results from ProCAMD

9.3.2 Case Study 2: Solvent for Dehydration

In this example the problem is to find a solvent to replace toluene as an entrainer in batch dehydration, which is the bottleneck in this stage of a processing route. The existing process operation is carried out by adding toluene to a batch distillation column with a decanter to recover the entrainer from the distillate. The feed to the system contains a number of products, including the intermediate to an agrochemical. The two key components are, however, dimethyl acetamide (DMAC) and water. The other components can be ignored because of their high molecular weight and small impact on the VLE of the water-DMAC system. The current system employs an entrainer because DMAC hydrolyses with water, particularly at elevated temperatures; toluene was therefore selected as an entrainer to allow the separation at lower temperatures. The new solvent would need to fit into the existing equipment with minimal changes. In addition, the purity of the agrochemical intermediate product stream passing to the next stage of the process should remain the same as with toluene as the entrainer.

The following targets need to be matched by any solvent to be selected.
• Final water content of the intermediate product stream should be less than 9 kmol.
• DMAC losses to be controlled by temperature (< 117 °C).
• A maximum of 20 kmol of entrainer can remain with the intermediate product stream.
• Batch dehydration time should decrease in order to reduce cycle time and DMAC losses.
• DMAC loss in distillate should be a maximum of 0.3 kmol%.

Based on the above targets, the selected entrainer needs to have the following properties. Environmental and toxicity constraints are not considered at this stage but will be analysed in a post-design stage (not highlighted in this case study).
• Form a heterogeneous azeotrope with water with a boiling point below 117 °C.
• The liquid-liquid split should be at least as good as that of toluene.
• Separation of DMAC and the entrainer should be good, i.e. no azeotrope should form between the entrainer and DMAC, and the solvent power should be high.

Applying the ProCAMD program, the following candidates have been found. Figure 8a shows the screen shot from ProCAMD highlighting the solution details. Figure 8b confirms that the substitute entrainer satisfies the desired (target) properties. The next step would be to perform batch distillation simulations to verify the functional (operational) target properties and to analyse the environmental and toxicity constraints.
Figure 8a: Problem specification details and solution statistics from ProCAMD
Figure 8b: Problem specification details and feasible solvent from ProCAMD
9.4 CONCLUSIONS & FUTURE CHALLENGES

Many typical processes involve very complex molecules about which there is little information. These complex molecules have many functional groups and may be present together with similar molecules that are produced as by-products or as pre- or post-stage products. Indeed, many final molecules are required as a particular enantiomer. Some typical molecules are shown in Fig. 9 (from Carpenter [2]). The selection of the separation task therefore becomes complicated. It is therefore important to have good predictive tools for the important physical properties, together with the ability to improve these predictions with as much known information as possible. This sort of tool has been developed by the CAPEC Group at the Department of Chemical Engineering of the Technical University of Denmark. There are, however, ways forward that use as much information as is available from the molecule and from similar molecules to give some guidance. This is where using the tools alongside experience and experiment can work very well.
Figure 9: Typical examples of complex molecules (solutes). Structures shown: a substituted diphenyl ether used as a herbicide; a green azo dyestuff for dyeing polyester; a synthetic pyrethroid insecticide.

It is common in many processes to have by-products and intermediates that are very similar in structure to the product; it is also common to have enantiomers where one is the active compound and all other enantiomers are inactive. This makes the selection of the separation, and also the prediction of the properties, more difficult. Measurement of the required physical properties can also be problematic because of the difficulty of producing a pure sample of any by-product. There is therefore a substantial gap in the currently available property prediction methods to be filled. The currently available CAMD methods and tools (see Part I of this book) need to be developed further to take account of wider solvent issues, and could also be extended to route selection, including formulation of active products (for example, surfactant selection). In addition, visualisation tools combined with optimisation that allow selection of separation schemes, taking into account the efficiency of separation (Bek-Pedersen et al. [8]), will prove very useful. Solvent selection tools will also be greatly improved when reaction effects are better predicted. Finally, early evaluation tools are proving very useful in improving solvent-based process route selection practice, bringing chemical engineers and chemists together and facilitating concurrent development that is focussed much earlier, reducing the necessary experimentation and development time-scales.
ACKNOWLEDGEMENTS

Permission to publish from Syngenta is gratefully acknowledged. Thanks to a great many friends and colleagues for advice and information, especially: Dr Keith Carpenter, Dr Alan Hall and Dr Will Wood of Syngenta Technology and Projects, and James Morrison, Consultant.
9.5 REFERENCES

1. K.J. Carpenter, "Chemical Engineering in Product Development - The Application of Engineering Science", Entropic, 223 (2000).
2. K.J. Carpenter, 16th International Symposium on Reaction Engineering (ISCRE 16), 2001.
3. B.G. Cox, "Modern liquid phase kinetics", Oxford Chemistry Primer Series 21, Oxford University Press, UK (1994).
4. B.G. Cox and A.J. Parker, J. Am. Chem. Soc., 95 (1973) 408.
5. Chastrette, JACS, 107 (1985) 1-11.
6. ICAS Documentations, Internal Report PEC02-14, CAPEC, Department of Chemical Engineering, DTU, Lyngby, Denmark, 2002.
7. P. Bavishi, MEng Thesis-2000, Department of Chemical Engineering, Imperial College, London, UK (2000).
8. E. Bek-Pedersen, R. Gani, O. Levaux, Computers and Chemical Engineering, 24 (2000) 253-259.
Computer Aided Molecular Design: Theory and Practice. L.E.K. Achenie, R. Gani and V. Venkatasubramanian (Editors). © 2003 Elsevier Science B.V. All rights reserved.
Chapter 10: Case Study in Optimal Solvent Design

M. Sinha, L. E. K. Achenie & G. M. Ostrovsky
10.1 INTRODUCTION

Solvents are extensively used as a major component of ink in the printing industry. The function of solvents in ink is to act as a vehicle for polymeric resins, pigments and dyes. The ink solvent also assists in wetting and dispersion of dyes and pigments. In letterpress and offset lithographic printing processes, the ink is carried to the plate by means of a train of rubber rollers commonly called "blankets", as shown in Fig. 1. Thus a thin film of ink is distributed over a large surface area on the blankets. These ink solvents are volatile and evaporate to leave behind the pigments and resins on the blanket surface. Cleaning is required whenever the residue build-up affects the print quality, and between print jobs. Paper fibres, ink residue, paper coating and dried ink are the types of material that must be removed from the rubber blankets.
Figure 1: Schematic of Lithographic Printing

One of the most used solvents in lithographic printing is the "blanket wash", which is specially formulated to clean ink and other residue from rubber blankets. Blanket cleaning is accomplished automatically or manually. In an automatic blanket wash process, as shown in Fig. 1, the blanket wash is jet sprayed onto the blanket. Therefore a large amount of the wash is lost by evaporation even before it makes contact with the blanket. Blanket wash solvents are mostly solvent mixtures as opposed to single-component solvents. As such, next to solvent performance, one of the most pressing environmental concerns of the printing industry is the volatile organic component (VOC) level of solvents. At present the VOC levels of solvents used in the printing industry are unusually high, well over 80% and far beyond the industry target of 30%. For example, a commonly used blanket wash, "VM&P naphtha", has a 100% VOC content (United States Environmental Protection Agency, 1997a). Blanket washes and solvents for "rag and bucket" operations are chosen based on their performance and their impact on the environment, health and safety. There is a wide variation in the performance attributes of cleaning solvents from different vendors. To enhance the cleaning operation, companies sometimes mix solvents from different vendors. However, this trial-and-error approach is costly and may not yield a solvent mixture with the desired performance attributes. In addition, the solvent for a cleaning operation may not meet safety, health and environmental restrictions.
Another important issue is minimizing the effect of a solvent on the surface characteristics of the rubber blanket, since a solvent can induce swelling. Swelling severely affects the print quality in lithographic processes. Thus, there is a need to account for it in blanket wash design. The goal of this case study is to design globally optimal solvents to be used for cleaning in lithographic printing. These solvents should (i) have a minimal drying time, (ii) dissolve residue ink, (iii) not swell the blanket, and (iv) be environmentally benign. Drying time is correlated with the heat of vaporization of the solvent. The ink residue is assumed to consist of phenolic resins.
10.2 PROBLEM DEFINITION
The problem as posed can be modelled as a multicriteria optimization problem. However, in the printing industry, there are rather loose and minimal requirements on these attributes. Therefore these attributes are regarded as constraints with given targets (similar to goal programming, Tamiz, 1996). A straightforward approach to modelling the problem as a special kind of multicriteria problem is to consider a lumped objective in which the different criteria appear as terms with appropriate weights. However, this approach forces the solvent formulation engineer to choose appropriate weights (usually of no physical meaning), a rather non-trivial task. A more meaningful and rigorous approach is to consider the problem as a multi-level optimization problem. The latter is rather difficult to solve and has usually been restricted to bi-level optimization problems in which the decision variables are continuous.

We reiterate that the goal of this case study is to design optimal solvents to be used as cleaning agents in the printing industry. These solvents should (i) have a minimal drying time, (ii) dissolve residue ink, (iii) not swell the blanket, and (iv) be environmentally benign. Drying time is correlated with the heat of vaporization of the solvent. The ink residue is assumed to consist of phenolic resins. Solvents that can effectively dissolve the ink residue obey the solute-solvent interaction

\[ R^2 = 4(\delta_D - \delta_D^*)^2 + (\delta_P - \delta_P^*)^2 + (\delta_H - \delta_H^*)^2 \le (R^*)^2 \]

where $\delta_D^* = 23.3$, $\delta_P^* = 6.6$, $\delta_H^* = 8.3$ and $R^* = 19.8$, and $\delta_D$, $\delta_P$, $\delta_H$ are determined from a model, for example a group contribution model (see Table 1).

The heat of vaporization, boiling point and melting point of the solvent are calculated using the Constantinou and Gani (1994) method. The fragment-based method is used to calculate Kow (Lyman et al., 1981). The group contribution parameters for the solubility parameter calculation are based on van Krevelen and Hoftyzer's method (Barton, 1985). The models and their references are summarized in Table 1.

Table 1. Property Prediction Models for CAMD_1 and CAMD_2

Property                              Reference
Solubility Parameter                  Barton, 1985
Boiling Point                         Constantinou and Gani, 1994
Melting Point                         Constantinou and Gani, 1994
Heat of Vaporization                  Constantinou and Gani, 1994
Partition Coefficient (log Kow)       Lyman et al., 1981

We note that the nonlinear property prediction constraints ($\varphi_j$ in PMD) do not employ the $z_{ijp}$ and $w_i$ variables from the Churi-Achenie octet rule implementation (see Chapter 3). Thus the problem is nonlinear with respect to only the $u_{ik}$ variables. In the property prediction models, the nonlinearities are present in all the $u_{ik}$ variables. The estimators for the case study are constructed in the appendix. These estimators are then used in the proposed branch and bound technique to solve the problem.
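As a hedged illustration of the solute-solvent interaction criterion, the sketch below evaluates $R^2$ for one solvent and compares it with $(R^*)^2$, taking $R^* = 19.8$ as in the solubility constraint of the formulations that follow. The target parameters are those given in the text; the solvent parameters used in the example (for methyl ethyl ketone) are typical handbook values and are an assumption of this sketch, not data from the chapter. The function names are hypothetical.

```python
import math

# Target (solute) Hansen parameters and interaction radius from the text.
TARGET = {"D": 23.3, "P": 6.6, "H": 8.3}
R_STAR = 19.8


def hansen_distance_sq(delta_d, delta_p, delta_h, target=TARGET):
    """R^2 = 4(dD - dD*)^2 + (dP - dP*)^2 + (dH - dH*)^2."""
    return (4.0 * (delta_d - target["D"]) ** 2
            + (delta_p - target["P"]) ** 2
            + (delta_h - target["H"]) ** 2)


def dissolves(delta_d, delta_p, delta_h, r_star=R_STAR):
    """True when the solvent lies inside the solute's interaction sphere."""
    return hansen_distance_sq(delta_d, delta_p, delta_h) <= r_star ** 2


# Illustrative solvent parameters (assumed values, not taken from the chapter).
mek = (16.0, 9.0, 5.1)
r2 = hansen_distance_sq(*mek)
print(round(r2, 2), round(math.sqrt(r2), 2), dissolves(*mek))
```

The factor of 4 on the dispersive term means that a mismatch in $\delta_D$ is penalised more heavily than an equal mismatch in $\delta_P$ or $\delta_H$.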
10.2.1 Case Study CAMD_1

In this case study, structural feasibility constraints are employed to ensure feasible molecular structures. For simplicity introduce the notation $\psi_1 = \psi_H$, $\psi_2 = \psi_P$, $\psi_3 = \psi_V$, $\psi_4 = \psi_D$.

The resulting molecular design formulation is shown below.

CAMD_1:

\[ \min \sum_i \sum_j u_{ij} (\Delta H_v)_j \tag{1} \]

subject to

\[ \sum_i \sum_j u_{ij} \le n_{max} \tag{2} \]

\[ \sum_i \sum_j u_{ij} (2 - v_j) = 2 \tag{3} \]

\[ \exp\Big(\big(\sum_i \sum_j u_{ij} (T_b)_j\big) / 204.4\Big) > 323 \tag{4} \]

\[ \exp\Big(\big(\sum_i \sum_j u_{ij} (T_m)_j\big) / 102.425\Big) < 223 \tag{5} \]

\[ \sum_i \sum_j u_{ij} (Z)_j + \sum_i \sum_j u_{ij} (Z')_j < 4.0 \tag{6} \]

\[ 4(\delta_D - 23.3)^2 + (\delta_P - 6.6)^2 + (\delta_H - 8.3)^2 < (19.8)^2 \tag{7} \]

\[ \psi_P - 6.3\,\psi_V \ge 0 \tag{8} \]

\[ \psi_i^L \le \psi_i \le \psi_i^U, \quad i = 1, 2, 3, 4 \tag{9} \]
To solve CAMD_1 we proceed as follows.

Step 1: (a) Decide on the set of groups to be used to form compounds. We choose as basis set twelve groups, namely CH3-, CH2-, Ar-, -Ar-, -OH, CH3CO-, -CH2CO-, -COOH, CH3COO-, -CH2COO-, -CH3O, and -CH2O-. (b) Identify the design variables. These are given by the structural variables $u_{ij}$, which determine whether a particular structural group is present in the molecule.

Step 2: Identify the performance objective. The performance objective is given by the double summation in Eq. (1), which gives the heat of vaporization of the compound.

Step 3: Identify the constraints. Constraints are employed to ensure that the last eight groups in the basis set are not allowed to occur more than twice in a compound:

\[ \sum_i u_{ij} \le 2, \quad j = 5, \ldots, 12 \]

The constraint $\delta_P \ge 6.3$ will ensure minimal blanket swelling. The environmental impact of solvents is accounted for by requiring that the maximum value of the partition coefficient (log Kow) be 4.0. To ensure that the solvent is a liquid at ambient temperature, limits on the boiling point ($T_b$) and melting point ($T_m$) are imposed. The constraints are Eqs. (4) through (9). Eqs. (4) to (7) are the property target constraints, Eq. (8) is the blanket-swelling constraint expressed through the branching functions, and Eq. (9) gives simple bounds on the branching functions.

Step 4: Decide whether to use the Odele-Macchietto or the Churi-Achenie Octet Rule Model. Here we employ the much simpler (although restrictive) Odele-Macchietto model for acyclic compounds, where $v_j$ is the valence of the jth structural group. The model is given in Eq. (3). We also include the molecular structural constraints (Eqs. (2) and (3)).

Step 5: Using information from previous steps, assemble the mathematical program, i.e. the performance objective, constraints, design variables and the Octet Rule Model. Eqs. (1) through (9) make up the mathematical program.

Step 6: Construct linear estimators of the performance objective and the constraints. The simple example in Chapter 3 gives an illustration of how to do this; also see the appendix in this chapter.

Step 7: Enter an iterative loop using the branch and bound (BB) procedure in Section 3.3.1 of Chapter 3. There are two nonconvex constraints. The splitting functions employed are $\psi_D$, $\psi_P$, $\psi_H$ and $\psi_V$. The MILP solver used is a public domain code, lp_solve by Hartmut Schwab, available at (ftp.es.ele.tue.nl/pub/lp_solve). This solver uses the simplex algorithm and a rather simple depth-first strategy. Identify the optimal molecule using information from the solution.
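The structural part of the program (the size limit of Eq. (2), the octet rule of Eq. (3) and the at-most-twice limit on groups 5 to 12) can be checked by brute force for small $n_{max}$. The following is a minimal sketch under stated assumptions: the valences are inferred from the group formulas, the group labels are normalised spellings chosen for this sketch, and a molecule is represented simply as a multiset of groups rather than by the binary variables $u_{ij}$. It ignores all property constraints and does not distinguish isomers.

```python
from itertools import combinations_with_replacement

# Valence (number of free bonds) of each group in the twelve-group basis set.
# These valences are inferred from the group formulas (an assumption).
VALENCE = {
    "CH3-": 1, "-CH2-": 2, "Ar-": 1, "-Ar-": 2, "-OH": 1, "CH3CO-": 1,
    "-CH2CO-": 2, "-COOH": 1, "CH3COO-": 1, "-CH2COO-": 2, "CH3O-": 1, "-CH2O-": 2,
}
GROUPS = list(VALENCE)
LIMITED = set(GROUPS[4:])  # groups j = 5..12 may appear at most twice


def is_feasible(groups, n_max):
    """Check the size limit, the acyclic octet rule and the group-count limit."""
    size_ok = len(groups) <= n_max                         # Eq. (2)
    octet_ok = sum(2 - VALENCE[g] for g in groups) == 2    # Eq. (3)
    limit_ok = all(groups.count(g) <= 2 for g in LIMITED)  # at most twice
    return size_ok and octet_ok and limit_ok


def enumerate_feasible(n_max):
    """List every group multiset with 2..n_max groups that passes is_feasible."""
    found = []
    for n in range(2, n_max + 1):
        for combo in combinations_with_replacement(GROUPS, n):
            if is_feasible(combo, n_max):
                found.append(combo)
    return found


mek = ("CH3-", "-CH2-", "CH3CO-")           # CH3-CH2-CO-CH3
propanol = ("CH3-", "-CH2-", "-CH2-", "-OH")
print(is_feasible(mek, 3), is_feasible(propanol, 3), is_feasible(propanol, 5))
print(len(enumerate_feasible(3)))
```

Because every group here has valence 1 or 2, Eq. (3) forces exactly two terminal (valence-1) groups, which is why the octet rule in this form describes acyclic chains only.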
Six different runs were investigated for case study 1. The six runs correspond to $n_{max}$ of 3, 4, 5, 6, 7, and 10 (CAMD_1a, CAMD_1b, CAMD_1c, CAMD_1d, CAMD_1e, and CAMD_1f, respectively). The corresponding problem dimensions are 36, 48, 60, 72, 84 and 120. For all cases the number of constraints is 15. The termination criterion used is an absolute tolerance of $10^{-3}$. The results are shown in Table 2. Problem CAMD_1a has a very limited search space. A feasible solution was found in the first iteration of the branch-and-bound algorithm. In CAMD_1c, the algorithm took 31 iterations and 351.4 seconds on a 333-MHz DELL Pentium II personal computer. The maximum number of sub-regions constructed is 16. The globally optimal solution corresponded to methyl ethyl ketone (MEK or CH3CH2-CO-CH3) with an objective function value of 35.471 kJ/mole. This compound was found at the 10th iteration with a valid upper bound of 35.471 and a lower bound of 33.99. Since the difference between the upper and lower bounds was more than the tolerance, the algorithm continued executing. The algorithm finally converged to MEK as the global solution after 21 more iterations. The two other feasible compounds found were propanol (CH3-CH2-CH2-OH) and diethyl ketone (CH3-CH2-CO-CH2-CH3). The objective function values for propanol and diethyl ketone were 44.77 kJ/mole and 40.12 kJ/mole, respectively.
Table 2: Application of Reduced Space BB algorithm to CAMD_1

Case       nmax   Variables   Constraints   Iterations   CPU time (min)   Max number of subregions
CAMD_1a    3      36          15            1            0.045            1
CAMD_1b    4      48          15            18           0.86             12
CAMD_1c    5      60          15            31           5.85             16
CAMD_1d    6      72          15            42           17.21            20
CAMD_1e    7      84          15            46           48.45            21
CAMD_1f    10     120         15            67           713.5            21
We note that at any iteration, the solution of the relaxed MILP problem is a structurally feasible compound, since all the structural constraints are linear. During the execution of the algorithm, fifteen different compounds were found. Of these, two compounds besides the optimum satisfied the specified performance constraints. For case CAMD_1e, the number of iterations is 46 and 3 compounds are designed. The maximum number of subregions created is 21. In CAMD_1f, the number of iterations is 67. The maximum number of subregions created is 21. Even though the number of iterations does not grow very much, the CPU time increases. This is because the CPU time associated with each LP solution increases significantly when the number of variables increases. Another desirable property of this algorithm is that a very small number of subregions is created. For the three cases CAMD_1c, CAMD_1e and CAMD_1f, the numbers of subregions created are 16, 21 and 21, respectively. Thus the algorithm is very efficient in terms of storage requirements. It should be noted that as the dimension of the problem increases from 60 to 120, the number of iterations only increases from 31 to 67. This is perhaps a consequence of the fact that the number of branching variables, namely 4, is the same in all the cases. Recall that in all the example problems above, although the number of variables $u_{ij}$ increased from 60 to 120, the number of branching functions is unchanged at 4. In contrast, if we employ the standard full-space BB algorithm, we will need to perform branching with respect to all the variables $u_{ij}$. Here, the number of branching variables ranges from 60 to 120.
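The trends described above can be read directly from the reported results. The short sketch below, which uses only the figures given in the table of CAMD_1 results, checks that each problem dimension equals 12 groups times $n_{max}$ and computes the average CPU time per iteration; the names `RUNS` and `summarize` are hypothetical helpers introduced for this illustration.

```python
# Rows of the CAMD_1 results: (case, n_max, variables, iterations, CPU time in minutes).
RUNS = [
    ("CAMD_1a", 3, 36, 1, 0.045),
    ("CAMD_1b", 4, 48, 18, 0.86),
    ("CAMD_1c", 5, 60, 31, 5.85),
    ("CAMD_1d", 6, 72, 42, 17.21),
    ("CAMD_1e", 7, 84, 46, 48.45),
    ("CAMD_1f", 10, 120, 67, 713.5),
]
N_GROUPS = 12        # size of the basis set
N_BRANCH_FUNCS = 4   # branching functions in the reduced-space algorithm


def summarize(runs):
    """Return per-run statistics derived from the reported figures."""
    rows = []
    for case, n_max, n_vars, iters, cpu_min in runs:
        # One binary variable u_ij per (position, group) pair.
        assert n_vars == N_GROUPS * n_max
        rows.append({
            "case": case,
            "minutes_per_iteration": cpu_min / iters,
            "full_space_branch_vars": n_vars,
            "reduced_space_branch_vars": N_BRANCH_FUNCS,
        })
    return rows


rows = summarize(RUNS)
for row in rows:
    print(row)
```

The per-iteration time rises steadily with problem size while the iteration count grows slowly, which is consistent with the explanation that the cost of each LP solution, not the number of branching steps, dominates the CPU time.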
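10.2.2 Case Study CAMD_2

In this case, the same formulation is solved with the Churi-Achenie model (see Step 4 above). The connectivity variables z and w are employed in the structural representation as described in Section 3 of Chapter 3. The second constraint in CAMD_1 is replaced by the following set of structural constraints. This leads to a large increase in the number of linear structural constraints.

\[ \sum_{p=1}^{n_{max}} \sum_{j=1}^{s_{max}} z_{ijp} = \sum_{k=1}^{m} u_{ik} v_k, \quad i = 1 \ldots n_{max} \tag{10} \]

\[ \sum_{p=1}^{i-1} \sum_{j=1}^{s_{max}} z_{ijp} \ge 1 - w_i, \quad i = 2 \ldots n_{max} \tag{11} \]

\[ \sum_{i=1}^{n_{max}} \sum_{k=1}^{m} u_{ik} + \sum_{i=1}^{n_{max}} w_i = n_{max} \tag{12} \]

\[ w_1 = 0 \tag{13} \]

\[ w_i \le w_{i+1}, \quad i = 1 \ldots (n_{max} - 1) \tag{14} \]

\[ \sum_{j=v_k+1}^{s_{max}} \sum_{p=1}^{n_{max}} z_{ijp} + M u_{ik} \le M, \quad i = 1 \ldots n_{max},\; k = 1 \ldots m \tag{15} \]

\[ \sum_{j=1}^{s_{max}} z_{ijp} = \sum_{j=1}^{s_{max}} z_{pji}, \quad i = 1 \ldots (n_{max} - 1),\; p = (i+1) \ldots n_{max} \tag{16} \]

\[ \sum_{p=1}^{n_{max}} z_{ijp} \le 1, \quad i = 1 \ldots n_{max},\; j = 1 \ldots s_{max} \tag{17} \]

\[ \sum_{k=1}^{m} u_{ik} - \sum_{k=1}^{m} u_{i-1,k} \le 0, \quad i = 2 \ldots n_{max} \tag{18} \]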
This formulation is solved for $n_{max}$ equal to 3, 4 and 5 (CAMD_2a, CAMD_2b, and CAMD_2c). The numbers of search variables are 57, 84 and 115, respectively. The corresponding numbers of constraints are 67, 84 and 113. Note that the formulation is nonlinear with respect to only the $u_{ik}$ variables. The results are summarized in Table 3. The number of variables that participate in the nonlinear terms is the dimension of the $u_{ik}$ variables. The remaining variables determine the connectivity information and appear only in the linear terms in CAMD_2. The dimensions of the vector of variables $u_{ik}$ in the three runs are 36, 48 and 60 (CAMD_2a, CAMD_2b and CAMD_2c).
Table 3: Application of Reduced-Space BB algorithm to problem CAMD_2

Case       nmax   Variables   Constraints   Iterations   CPU time (min)   Max number of subregions
CAMD_2a    3      57          67            1            0.1              1
CAMD_2b    4      89          89            18           3.36             9
CAMD_2c    5      115         113           22           14.5             11
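To see how the connectivity variables enlarge the problem, the sketch below counts the variables of the Churi-Achenie representation as one $u_{ik}$ per (position, group) pair, one $w_i$ per position and one $z_{ijp}$ per (position, site, position) triple. The value $s_{max} = 2$ is an assumption of this sketch, chosen because no group in the twelve-group basis set has more than two free bonds; with it, the count reproduces the search-variable numbers 57, 84 and 115 quoted in the running text. The function name is hypothetical.

```python
def camd2_variable_count(n_max, n_groups=12, s_max=2):
    """Count u, w and z variables for n_max positions (s_max is an assumed value)."""
    n_u = n_max * n_groups        # u_ik: one per (position, group) pair
    n_w = n_max                   # w_i: one per position
    n_z = n_max * s_max * n_max   # z_ijp: one per (position, site, position) triple
    return n_u + n_w + n_z


counts = [camd2_variable_count(n) for n in (3, 4, 5)]
print(counts)
```

Only the $u_{ik}$ block (36, 48 and 60 variables) enters the nonlinear property terms; the quadratic growth comes entirely from the $z_{ijp}$ block, which appears in linear constraints only.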
CAMD_2a corresponds to a problem with a reduced search space restricted by $n_{max} = 3$. For this case the global optimal solution was found in only one iteration. When the search space was increased to 89 and 115 variables, the number of iterations also increased to 18 and 22. For the run CAMD_2c, one of the feasible compounds found in an intermediate step is -CH2O-CH2COO-CH2O-CH2-, a cyclic compound. The structural constraints used in case study 2 allow the design of cyclic compounds. The constraints in case study 1 are restricted to acyclic compounds only.
For about the same number of variables, the number of iterations in case study 2 (CAMD_2) is relatively smaller than in case study 1. In addition, the maximum number of nodes generated in case study 2 is much smaller than in case study 1. This can be attributed to the fact that in CAMD_2 the number of variables appearing in nonlinear terms is much smaller than in problems of similar dimension in CAMD_1.
10.2.3 Case Study CAMD_3

In this case study, solvents are designed with entirely different criteria. Here the most desirable attribute of the solvent is recoverability. That is, after the blanket wash operation is performed, the solvent compounds that evaporate are recovered by a solvent recovery system. This case study attempts to find the solvent compound that will be least expensive to recover. Many competing solvent recovery techniques can be applied, namely condensation, gas adsorption and gas absorption. Here the recovery system is restricted to condensation. A typical condensation recovery system consists of a compressor that takes in the solvent-laden exhaust gases from printing (via the ventilation system) and compresses them to a higher pressure. These high-pressure gases are passed through a condenser that cools the stream. Next the stream is flashed to recover the solvent. The details of the recovery operation have been discussed elsewhere (Sinha, 1999). Here the objective is to find the solvent compound that has the minimal total annualized cost (TAC) associated with recovery. In addition to the functions in (6.4), we use the following branching functions:

\[ \psi_5 = H_{v0} + \sum_i \sum_j u_{ij} H_{vj}, \quad \psi_6 = T_{b0} + \sum_i \sum_j u_{ij} T_{bj}, \quad \psi_7 = P_{comp}, \quad \psi_8 = T_{cond}. \]

The CAMD_3 case study with recovery considerations is:
minTAC=85675*(Ps176163 I" v
o
VM
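subject to:

\[ \sum_i \sum_j u_{ij} \le 4 \tag{19} \]

\[ \sum_i \sum_j u_{ij} (2 - v_j) = 2 \tag{20} \]

\[ \log_{10}(V_m) - \log_{10}(P'_{comp}) - 2.7\,(T_b / T_{cond})^{1.7} < -11.47 \tag{21} \]

\[ \exp\Big(\big(\sum_i \sum_j u_{ij} t_{mj}\big) / 102.425\Big) < 223 \tag{22} \]

\[ 500 < T_{b0} + \sum_i \sum_j u_{ij} T_{bj} < 700 \tag{23} \]

\[ 20 < H_{v0} + \sum_i \sum_j u_{ij} H_{vj} < 80 \tag{24} \]

\[ \sum_i \sum_j u_{ij} (Z)_j + \sum_i \sum_j u_{ij} (Z')_j < 4.0 \tag{25} \]

\[ 4(\delta_D - 23.3)^2 + (\delta_P - 6.6)^2 + (\delta_H - 8.3)^2 < (19.8)^2 \tag{26} \]

\[ \psi_P - 6.3\,\psi_V \ge 0 \tag{27} \]

\[ \psi_i^L \le \psi_i \le \psi_i^U, \quad i = 1, \ldots, 8 \tag{28} \]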
where Eq. (20) is Odele's octet rule implementation; Eqs. (21) to (27) are the recovery, melting point, boiling point, heat of vaporization, octanol-water partition, solvent power and swelling constraints, respectively; and Eq. (28) represents the constraints imposed by the branching functions. The following modified basis with 15 groups is used in this study: [CH3-, CH2-, OH, CH3CO-, -CH2CO-, -COOH, CH3COO-, -CH2COO-, -CH3O, -CH2O-, CH2=CH-, -CH=CH-, -CH2NH2, =CHNH2, CH3NH-]. The aromatic groups are removed and some groups with nitrogen are added to include amines and other nitrogen-containing compounds. There are a total of 8 splitting functions. The last 4 splitting functions are used for the construction of linear underestimators for the objective function, and of an underestimator and an overestimator for the recovery constraint. The construction of estimators is discussed in the attached appendix. This case study has 60 variables and 3 nonlinear constraints. Moreover, the objective function is nonlinear. The condenser temperature can range between 150 K and 298 K. This results in poor scaling and causes difficulty during optimization. To overcome this, we have scaled the condenser temperature to lie between 0.1 and 0.9 such that $T_{cond} = 185\,T' + 131.5$, where $T'$ is the scaled condenser temperature.

The globally optimal compound designed is a diester with the structure CH3(CH2COO)2-CH2NH2. The recovery cost associated with this compound is $25,981. The corresponding compressor pressure is 2 atm and the condenser temperature is 288.75 K. The algorithm took 56 iterations and a CPU time of 41.6 seconds. At termination, the number of nodes (i.e. subregions) is 20. The above problem was solved again with the local optimization software DICOPT in GAMS (Brooke, 1996). The optimal compound found by DICOPT is HO-CH2COOCH3NH, an ester. The objective function value associated with this compound is $106,327, the compressor pressure is 10 atm and the condenser temperature is 298 K.

We note that the extra effort associated with the global optimization is justified: it results in roughly a fourfold reduction in the recovery cost.
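The condenser-temperature scaling and the cost comparison can be verified with a few lines of arithmetic. In this sketch the linear map is derived from the stated ranges (150 K to 298 K mapped onto 0.1 to 0.9), which gives a slope of 185 and an intercept of 131.5, i.e. $T_{cond} = 185\,T' + 131.5$. The function names are hypothetical, and the costs are the two values reported above.

```python
T_MIN, T_MAX = 150.0, 298.0   # condenser temperature range, K
S_MIN, S_MAX = 0.1, 0.9       # scaled range

SLOPE = (T_MAX - T_MIN) / (S_MAX - S_MIN)   # about 185 K per scaled unit
INTERCEPT = T_MIN - SLOPE * S_MIN           # about 131.5 K


def to_kelvin(t_scaled):
    """Map the scaled condenser temperature T' back to kelvin."""
    return SLOPE * t_scaled + INTERCEPT


def to_scaled(t_kelvin):
    """Map a condenser temperature in kelvin to the scaled variable T'."""
    return (t_kelvin - INTERCEPT) / SLOPE


cost_global = 25981.0    # recovery cost of the globally optimal compound
cost_local = 106327.0    # recovery cost of the compound found by the local solver
ratio = cost_local / cost_global

print(round(SLOPE, 3), round(INTERCEPT, 3))
print(round(to_scaled(288.75), 3), round(ratio, 2))
```

Scaling a variable whose natural range is 150 to 298 onto an interval of order one keeps it comparable in magnitude to the binary structural variables, which is the usual motivation for this kind of transformation.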
10.3 DISCUSSION AND CONCLUSIONS

The molecular design problem is reduced to solving an MINLP problem in which the number of binary variables $u_{ij}$ can range from several tens to several hundreds. The use of the standard branch-and-bound method for solving the problem can be computationally intensive, since all the variables $u_{ij}$ must be used as branching variables. To overcome this problem, we have proposed a new strategy. The main idea of the method is to branch on branching functions instead of on all the search variables. This approach results in a decrease in the number of branching variables in our molecular design framework. For example, in case study 1, a problem with 120 nonlinear variables is solved with just 4 splitting variables. This is also demonstrated in the case studies. The maximum number of nodes stored in memory during the search is 21 for CAMD_1e and CAMD_1f and 20 for CAMD_3. In other words, during branch and bound, the bounding operation is performed in the space of the search variables, while the branching operation is performed in a reduced-dimension space defined by the branching (or splitting) functions. The branching functions are determined from the special tree function representation of both the objective function and the constraints. In order to construct the corresponding linear underestimators, we employed the sweep method we developed in our research (Sinha, Achenie and Ostrovsky, 1999) and (Ostrovsky, Achenie and Sinha, 2000). The proposed algorithm scales well. Specifically, as the problem size increases, the computational effort increases almost linearly. We anticipate that this linear behavior will also be exhibited in large molecular design models.
10.4 REFERENCES
See Chapter 3
10.5 APPENDIX: Construction of Estimators
One very important property of a solvent is its ability to dissolve the solute. A solute-solvent interaction is often characterized by the Hansen solubility parameter $\delta_T$ (Archer 1996). This parameter is characterized by three intermolecular interactions, namely the hydrogen bonding interaction ($\delta_H$), the polar interaction ($\delta_P$) and the nonpolar (dispersive) interaction ($\delta_D$) (Hansen, 1971). The mathematical expression for the solvent selection criterion based on the Hansen solubility parameter is

$$(R^{ij})^2 = 4(\delta_D - \delta_D^*)^2 + (\delta_P - \delta_P^*)^2 + (\delta_H - \delta_H^*)^2 \le (R^*)^2 \qquad (29)$$

$$\delta_D = \frac{\psi_D}{V}, \qquad \delta_P = \frac{\sqrt{\psi_P}}{V}, \qquad \delta_H = \sqrt{\frac{\psi_H}{V}} \qquad (30)$$
where

$$\psi_D = \sum_i \sum_j u_{ij} F_{Dj}; \qquad \psi_H = \sum_i \sum_j u_{ij} (-U_{Hj}); \qquad \psi_P = \sum_i \sum_j u_{ij} (F_{Pj})^2; \qquad V = V_0 + \sum_i \sum_j u_{ij} V_j \qquad (31)$$
Here $F_{Dj}$, $F_{Pj}$ and $U_{Hj}$ are the group contribution parameters associated with the dispersion, polar and hydrogen bonding solubility parameters, respectively (Barton 1985). Substituting Eq. (30) into Eq. (29) and multiplying through by $V^2$ we obtain

$$4(\psi_D - \delta_D^* V)^2 + (\sqrt{\psi_P} - \delta_P^* V)^2 + (\sqrt{\psi_H V} - \delta_H^* V)^2 - (R^* V)^2 \le 0 \qquad (32)$$

Note that the nonconvex Hansen solubility design criteria make the solvent design problem multiextremal. $R^*$ is the interaction radius associated with the solute. The distance between the solute and solvent solubility parameters is $R^{ij}$ and can be computed as shown in Eq. (29). $V$ is the molar volume of the solvent. We will now construct linear underestimators for this important constraint. Eq. (32) is made up of four separable terms. The first and the fourth terms are squares of linear expressions. The second and the third terms are relatively more complicated. Using Eq. (30) one can obtain the STF representation of the third term in the form [...]. Therefore the [...] can be used as branching functions. Let us consider, for illustration, the construction of an underestimator for the third term in Eq. (32). First we need to find bounds for [...].
The linear underestimators are constructed in a reverse sweep that starts at the fifth level and goes down. First, a linear underestimator of $\varphi^5$ is constructed in terms of $\varphi^4$ as follows: $L[\varphi^5; S^4] = p_1 \varphi^4 + p_2$. Here $S^4 = (\underline{\varphi}^4, \overline{\varphi}^4)$. The sign of $p_1$, which depends on the interval $(\underline{\varphi}^4, \overline{\varphi}^4)$, determines whether the function $p_1 \varphi^4$ is convex or concave with respect to the variables [...] and [...]. The underestimator now has the following form [...]

The signs of $(p_3 - p_4)$ and $(p_3 + p_4)$ determine if the corresponding functions are concave or convex. Subsequently, the underestimator is constructed with respect to $\psi_H$ and $V$. After rearranging the terms, the linear underestimator is represented as $L[\varphi^5; S^2] = p_8 \psi_H + p_9 V + p_{10}$. We reiterate that the subregion is not in terms of the search variables $u_{ik}$, but rather in terms of functions of $u_{ik}$. Based on the region S the coefficients $p_1$, $(p_3 - p_4)$ and $(p_3 + p_4)$ are calculated and a decision about the construction of the underestimator is made at two levels (Ostrovsky, Achenie and Sinha, 2000). This makes the algebraic structure of the underestimator adaptive.
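For a fixed candidate solvent, the Hansen criterion discussed in this appendix reduces to comparing a weighted squared distance between solubility parameters with the squared interaction radius of the solute. The sketch below evaluates that comparison in the usual Hansen form (a factor of 4 on the dispersive term). It is an illustration only: the function names and all numerical values are hypothetical and are not data from the case studies.

```python
def hansen_distance_sq(solvent, solute):
    """Squared Hansen distance between two (delta_D, delta_P, delta_H) triples."""
    d_d, d_p, d_h = solvent
    s_d, s_p, s_h = solute
    return 4.0 * (d_d - s_d) ** 2 + (d_p - s_p) ** 2 + (d_h - s_h) ** 2


def satisfies_hansen(solvent, solute, r_star):
    """True if the solvent lies within the interaction radius r_star of the solute."""
    return hansen_distance_sq(solvent, solute) <= r_star ** 2


# Hypothetical solubility parameters, used only to exercise the functions.
solute = (17.0, 6.0, 10.0)
candidate = (16.0, 5.0, 8.0)
distance_sq = hansen_distance_sq(candidate, solute)
```

In the design problem itself the solvent parameters are not constants but functions of the binary variables u_ij, which is what makes the constraint nonconvex and motivates the underestimators constructed above.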
Computer Aided Molecular Design: Theory and Practice. L.E.K. Achenie, R. Gani and V. Venkatasubramanian (Editors). © 2003 Elsevier Science B.V. All rights reserved.
Chapter 11: CAMD in Solvent Mixture Design
M. Sinha & L. E. K. Achenie

11.1 INTRODUCTION
Solvents are extensively used in industry to clean equipment parts by removing grease and grime, to suspend solids as in inks and paints, to separate a solid or liquid component from a mixture prior to purification (liquid-liquid extraction and gas absorption), and for many other purposes. Once the media of choice for the processing industry, many organic solvents are being phased out of products and processes for environmental and health reasons (Krishner 1995). One of the most used solvents in a printing press is the "blanket wash", which is specially formulated to clean ink from lithographic printing presses. There are more than 52,000 lithographic printers in the United States (Adrian 1991) and each uses 160 gallons per year (a total of approximately 8 million gallons per year). These solvents are eventually released into the atmosphere, thus posing considerable environmental problems. There is a tremendous need to replace and/or recover these solvents. The search for new solvents is usually based on design heuristics, prior experience and direct experimental studies. This approach is inherently trial and error; it is therefore costly and time-consuming, and it may not necessarily yield the solvent with the desired performance attributes. For example, the solvent mixture may have a drying time that is too long for the intended use. While there is no substitute for experimental study, there is a definite need for a pre-experimental stage that will quickly and cheaply generate new solvents that are promising enough to be considered for the costly and time-consuming experimental stage. In a recent article Zhao and Cabezas (1998) discussed different property requirements that should be satisfied in order to design solvents for different applications.
In our discussions with the Hartford Courant (Hartford Courant), a major performance issue in the selection of a blanket wash solvent is minimizing the effect of the solvent on the surface characteristics of the rubber blanket on which the printing paper is processed. Many solvents swell the rubber blanket. Environmental restrictions and the need to reduce operating costs mandate recycling of the spent solvent in the short term. Previous and existing industrial approaches to solvent selection and substitution have relied on database search and query approaches. For example, SAGE (Center for Aerosol Technology 1993) and SolvDB (National Center of Manufacturing Science) have large databases of existing solvents and associated processes. Through query and answer sessions, the user is led to a suggested selection of solvents. Eastman Chemicals and Dow Chemicals have developed a large database of solvents. The database contains solvent properties such as flash point, thermodynamic functions, vapor pressure and hazard evaluation. Eastman's solvent alternative strategy aims at matching the solubility constants and the evaporation rates (Krishner 1995).

Solvents used in industry are often blended together to meet user requirements. These may include limits on boiling point, viscosity and other transport properties, and on solute-solvent interaction, i.e. the solvent power characterized by its solubility parameters. Other implicit requirements on miscibility have to be satisfied to ensure single-phase mixtures.

Computer Aided Product Design (CAPD) has emerged as a powerful strategy for identifying promising compounds with pre-specified levels of certain thermophysical properties. More formally, CAPD is a reverse engineering procedure that incorporates desired levels of physico-chemical properties directly into the design of products. This approach has been applied to polymer design (Venkatasubramanian and Chan 1995); reinforced polymer composite design (Vaidyanathan and El-Halwagi 1994; Vaidyanathan and El-Halwagi 1996; Vaidyanathan et al. 1998); liquid-liquid extractants (Gani and Brignole 1983; Naser and Fournier 1991); and refrigerants (Duvedi 1995; Duvedi and Achenie 1995; Joback 1989). CAPD has been applied to solvent design problems as well. These include the design of solvents for liquid-liquid extraction (Macchietto et al.
1990; Odele and Macchietto 1993) and gas absorption processes (Pistikopoulos and Stefanis 1998); solvents for separation processes (Pretel et al. 1994); and solvent blends for paint formulation (Klein et al. 1992). Computer-aided mixture design (commonly referred to as the formulation problem) has been applied to solvent design for liquid-liquid extraction (Gani and Brignole, 1983; Macchietto et al., 1990; Odele and Macchietto, 1993), refrigerant mixture design (Duvedi and Achenie, 1997), polymer blend design (Vaidyanathan and El-Halwagi, 1996), coating applications (Dunn et al., 1997) and in the paint and ink industry. A generalized approach for designing mixtures appears in articles by Klein, Wu et al. and by Duvedi and Achenie (Klein et al., 1992; Duvedi and Achenie, 1997). Many mixing rules for the prediction of mixture properties have been developed and reviewed by Horvath (Horvath, 1992).

Even though many water-soluble solvents exist that can be used to make a blanket wash formulation, deciding on the composition of the final wash formulation is a trial and error procedure. Moreover, mixture property prediction for aqueous systems is difficult and the models are highly nonlinear (Reid et al., 1987; Wu, 1987). Thus there is a big incentive to develop a formulation tool that can design aqueous blanket wash blends in the presence of nonlinear (and probably nonconvex) models. In this study, we employ our interval arithmetic based global optimization package LIBRA for the systematic design of optimal water-based blanket wash systems.

The chapter is organized as follows. Section 11.2 describes the use of lithographic blanket washes and develops the mixture design model, followed by a description of interval analysis in Section 11.3. Interval analysis forms the basis for the solution algorithm in Section 11.4. A case study is presented in Section 11.5. Sections 11.6 and 11.7 present conclusions and bibliography. Finally, the appendix section (Section 11.8) gives details of the case study as well as the physical property models employed.
11.2 PROBLEM DEFINITION
11.2.1 Lithographic Blanket Washes
Lithographic printing is the most common printing process and is based on the immiscibility of water and oil. Printing ink, which is insoluble in water, comprises resins and pigments suspended in a petroleum-based solvent. The ink is applied on a printing plate, which is pressed onto printing paper to impart the print image. When the plate is dipped in aqueous fountain solution, the ink and fountain solution repel each other, and the ink is confined to the image area of both the plate and the printed material. Roller blankets are used to carry the print paper. At the end of the print cycle, ink residues on the blanket have to be removed using a solvent-based blanket wash solution.

Solvents are extensively used as a major component of ink in the printing industry. The function of a solvent in ink is to act as a vehicle for polymeric resins, pigments and dyes. The ink solvent also assists in the wetting and dispersion of dyes and pigments. In letterpress and offset lithographic printing processes, the ink is carried to the plate by means of a train of rubber rollers commonly called "blankets", as shown in Fig. 1 in Chapter 10. Thus a thin film of ink is distributed over a large surface area on the blankets. These ink solvents are volatile and evaporate to leave behind the pigments and resins on the blanket surface. Cleaning is required between print jobs and whenever the residue buildup affects the print quality. Paper fibers, ink residue, paper coating and dried ink are the types of material that must be removed from the rubber blankets. One of the most used solvents for lithographic printing is the "blanket wash", which is specially formulated to clean ink and other residue from rubber blankets.
The manual cleaning operation (also termed "rag and bucket") involves wiping down the blanket cylinder with a cloth wipe dampened with blanket wash solution. The large volume of soiled rags from these operations is routinely sent to industrial launderers, who are then faced with the proper disposal of the waste water resulting from laundering the rags. In addition, the industrial launderers are saddled with the inefficiencies of solvent use in the printing industry, since they also have to abide by rigid standards on wastewater pollution levels. Blanket wash solvents are mostly mixtures as opposed to single component solvents. Next to solvent performance, one of the most pressing concerns of the printing industry with regard to the environment is the volatile organic compound (VOC) level of solvents. At present the VOC levels of solvents used in the printing industry are unusually high, well over 80% and far beyond the industry target of 30%. For example, a commonly used blanket wash, "VM&P naphtha", has a 100% VOC content (United States Environmental Protection Agency, 1997a). Another important issue is minimizing the effect of a solvent on the surface characteristics of the rubber blanket, in particular the swelling it induces. Swelling severely affects the print quality in lithographic processes. To enhance the cleaning operation, companies sometimes mix solvents from different vendors. However, as noted earlier, this trial and error approach is costly and may not necessarily yield the solvent mixture with the desired performance attributes. In addition, the solvent for a cleaning operation may not meet safety, health and environmental restrictions.

11.2.2 Mixture Design Problem Formulation
The computer-aided mixture design is composed of three main steps:
(i) Selection of pure components from a database (for example, designing a binary mixture from a set of 10 pure components can result in $\binom{10}{2}$, or 45, combinations),
(ii) Determining the mixture composition that satisfies the property targets, and
(iii) Ranking the candidate mixtures by some criterion such as overall cost.
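The size of the combinatorial first step is easy to check by direct enumeration. The short sketch below counts the binary pairs for a basis set of 10 pure components and, more generally, lists every candidate component set of at most a given size. The function name candidate_sets is hypothetical and used only for this illustration.

```python
from itertools import combinations
from math import comb


def candidate_sets(n, n_max):
    """Yield every set of component indices 0..n-1 with 1 to n_max members."""
    for k in range(1, n_max + 1):
        yield from combinations(range(n), k)


# All binary mixtures that can be formed from a basis set of 10 pure components.
binary_pairs = list(combinations(range(10), 2))
n_pairs = len(binary_pairs)

# Pure components, binaries and ternaries together for the same basis set.
n_up_to_ternary = sum(1 for _ in candidate_sets(10, 3))
```

For small basis sets such an enumeration is cheap, and each candidate set then defines one continuous composition problem.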
The first step is a combinatorial problem; the second step is a continuous problem, which could be nonconvex depending on the nature of the property prediction techniques employed. We propose to use a mixed-integer nonlinear problem formulation that is general enough to handle several types of mixture estimation techniques. Obviously, if the number of combinations (binary, ternary, etc.) of pure components is small, one can enumerate them all and solve a series of continuous nonlinear programs. In the proposed formulation, binary variables are used to denote the presence or absence of a pure-component solvent in the mixture, and a set of continuous variables is used to describe the mole fractions of the components in the mixture. Hence the formulation is mixed-integer in nature. First let us introduce the variables
$y_i$ (binary variable) = 1 if pure component i is present in the mixture, 0 otherwise
$x_i$ (continuous variable between 0 and 1) = mole fraction of pure component i in the mixture

Other parameters include:
$n$ = number of pure component solvents (basis set)
$n_{max}$ = maximum number of pure component solvents in the blend
$P_{ij}$ = property j of pure component i.
Constraints are imposed for (a) limiting the number of pure component solvents in the blend; (b) ensuring that the mole fraction of an absent component is 0; and (c) ensuring that all the mole fractions add to 1.0. These constraints are by no means exhaustive, and several different ones can be added to achieve a specific solvent mixture design objective. An equivalent mathematical programming model is as follows
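$$
\begin{aligned}
P_{mix}:\quad & \min_{x,\,y} \; f(x, y) \\
\text{subject to}\quad & p^L \le P(x, y) \le p^U \\
& \sum_i y_i \le n_{max} \\
& \sum_i x_i = 1 \\
& 0 \le x_i \le y_i \\
& x = [x_1, x_2, \ldots, x_n]^T, \quad y = [y_1, y_2, \ldots, y_n]^T \\
& x_i \in [0, 1] \ \text{real}, \quad y_i \in \{0, 1\} \ \text{binary}
\end{aligned}
\qquad (1)
$$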
$p^L$ and $p^U$ are lower and upper limits on a vector of target properties $P$. These properties may be nonlinear and nonconvex with respect to the search variables $x_i$, $y_i$.
The last inequality constraint in the above formulation ($0 \le x_i \le y_i$) ensures that if component i is not present in the mixture (i.e. $y_i = 0$), then the corresponding composition $x_i$ is also 0. This, however, can lead to cases where the composition of a component that is present is infinitesimally small. To avoid this we replace it by $y_i \varepsilon \le x_i \le y_i (1 - \varepsilon)$, where $\varepsilon$ is a small number (e.g. 0.01).
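The structural constraints of the formulation (everything except the property bounds, which depend on the property models) can be checked for a trial point with a few lines of code. The following is a minimal sketch assuming the modified bound with a small number eps = 0.01. The function name and the numerical tolerance tol are hypothetical and are not part of the formulation.

```python
def structurally_feasible(x, y, n_max, eps=0.01, tol=1e-9):
    """Check the count, mole-fraction-sum and linking constraints for (x, y)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if any(yi not in (0, 1) for yi in y):
        return False                      # y must be binary
    if sum(y) > n_max:
        return False                      # too many components in the blend
    if abs(sum(x) - 1.0) > tol:
        return False                      # mole fractions must add to 1
    # Linking constraint: y_i * eps <= x_i <= y_i * (1 - eps)
    return all(yi * eps - tol <= xi <= yi * (1.0 - eps) + tol
               for xi, yi in zip(x, y))


ok = structurally_feasible([0.3, 0.7, 0.0], [1, 1, 0], n_max=2)
trace_amount = structurally_feasible([0.005, 0.995, 0.0], [1, 1, 0], n_max=2)
absent_but_flagged = structurally_feasible([0.3, 0.7, 0.0], [1, 1, 1], n_max=3)
```

One consequence of the modified bound is that a component flagged as present must have a mole fraction of at least eps, so trace amounts are excluded, as are points where a component is flagged as present but has zero mole fraction.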
11.3 DESCRIPTION OF THE PROPOSED METHOD OF SOLUTION
Since many property estimation techniques are nonconvex, we have developed an interval analysis based optimization strategy that can design (globally) optimal mixtures. Interval analysis has emerged as a reliable mathematical tool that can automatically generate lower and upper bounds for a function (Hansen, 1992). It has been used for solving ordinary differential equations and linear systems, and for verifying chaos. Interval arithmetic, which is at the heart of interval analysis, was developed by Moore (Moore, 1966). In essence, interval analysis based optimization continually deletes portions of the search space with the goal of arriving at a final box of the desired width that contains the global solution. A number of interval-based optimization procedures have been developed (e.g. Hansen, 1992; Moore et al., 1992; Ratschek and Rokne, 1991; Vaidyanathan and El-Halwagi, 1994; Van Iwaarden, 1996). Most of these procedures are tailored for unconstrained optimization problems. In addition, these techniques handle only continuous variables, not discrete ones. Notwithstanding their attractive features, interval-based global optimization methods are in general computationally intensive. To address some of these issues, we have developed new acceleration strategies and extended the capabilities of the algorithm to solve mixed-integer problems.

11.3.1 Brief Introduction to Interval Analysis
An interval, $X_i = [a_i, b_i]$, containing a real variable $x_i$ is characterized by two real scalars $a_i$ and $b_i$ such that $a_i \le x_i \le b_i$. An interval vector $X = (X_1, X_2, \ldots, X_i, \ldots, X_n)^T$ represents a hyper-rectangular region in an n-dimensional space $R^n$ and is referred to as a hyperbox or simply a box. The width of a box, denoted by $w(X)$, may be defined as the largest of the widths of its component intervals $X_i$. There are other definitions for the width.
An interval extension $F(X)$ of a continuous function $f(x)$ is obtained by replacing all variables with their interval counterparts. The resulting function is called an interval function and is given as $F(X) = [F^L, F^U]$, where $F^L$ and $F^U$ are loose lower and upper bounds on $f(x)$ over the box (domain) $X$ (see Figure 2). In general, the smaller the width of $X$, the tighter the bounds are. The true range of a function $f(x)$ over $X$ is denoted by $f(X)$ (or $R(X)$) $= [f^L, f^U]$, where $f^L$ and $f^U$ correspond to the global minimum and maximum over $X$. Moore (Moore, 1979) proved that

$$\lim_{w(X) \to 0} F(X) = f(X)$$

Considerable effort has been expended by interval analysts to produce systematic methods for representing an interval function that gives the sharpest bounds on the range of a real function over an interval (see for example Ratschek and Rokne, 1984; Neumaier, 1990; and Rokne, 1986). It can be shown that for monotonic functions $F(X) = f(X)$. An important property of interval analysis that makes it useful for global optimization is the inclusion of functions. Consider a real-valued function $f(x)$. The interval function $F(X)$ is said to be an inclusion isotone of $f(x)$ if $x \in X$ implies $f(x) \in F(X)$ and also if $Y \subseteq X$ implies that $F(Y) \subseteq F(X)$. Operations such as addition, subtraction, multiplication and division have been developed that are inclusion isotonic. For more details on the operations and interval analysis see (Hansen, 1992; Moore, 1966; Ratschek and Rokne, 1984).
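The behaviour of a natural interval extension can be illustrated with a very small interval class. The sketch below is not the VerGO or LIBRA implementation: it is a toy class written for this illustration, it implements only addition, subtraction and multiplication, and it ignores the outward rounding that a rigorous interval library would apply. The test function f(x) = x(1 - x) is likewise chosen only as an example.

```python
class Interval:
    """Closed interval [lo, hi] with the basic arithmetic needed below."""

    def __init__(self, lo, hi):
        if lo > hi:
            raise ValueError("lower bound exceeds upper bound")
        self.lo, self.hi = float(lo), float(hi)

    def width(self):
        return self.hi - self.lo

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"


ONE = Interval(1.0, 1.0)


def F(X):
    """Natural interval extension of f(x) = x * (1 - x)."""
    return X * (ONE - X)


wide = F(Interval(0.0, 1.0))      # encloses the true range [0, 0.25]
narrow = F(Interval(0.4, 0.6))    # encloses the true range [0.24, 0.25]
```

Both results enclose the true range of f, but loosely, because x appears twice in the expression. The enclosure computed on the narrower box is considerably tighter, in line with the limit result quoted above.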
Figure 2: Continuous function and its interval extension (plot of f(x) against x over the interval X, marking the true range R(X) and the interval extension F(X))

We summarize some of the notation above as
$x_i$ = a real scalar number (e.g. 5.0)
$x$ = a vector of real scalar numbers $x_i$ (e.g. $(1.0, 1.4, 2.5)^T$)
$X_i$ = an interval scalar number (e.g. [2.0, 2.2])
$X$ = a vector of interval scalar numbers $X_i$ (e.g. $[(1.0, 1.2), (1.4, 1.45), (2.5, 2.6)]^T$)
$f(X)$ (or $R(X)$) = $[f^L(X), f^U(X)]$ = the range of a function $f(x)$ on $X$ (this is often unknown)
$F(X) = [F^L(X), F^U(X)]$ = the natural interval extension of a function $f(x)$ on $X$, such that $F^L(X) \le f^L(X) \le f^U(X) \le F^U(X)$.

11.3.2 Global Optimization Methods Based on Interval Analysis
Almost all interval analysis based global optimization algorithms (Ichida and Fuji, 1979; Moore et al., 1992; Vaidyanathan and El-Halwagi, 1994) employ a successive domain reduction approach, eliminating portions of the search region that do not contain the global solution. Consider the continuous optimization model
$$
\begin{aligned}
(\text{globally}) \quad & \min_{x} \; f(x) \\
\text{subject to:}\quad & g(x) \le 0 \\
& h(x) = 0 \\
& x = [x_1, x_2, \ldots, x_n]^T \ \text{real} \\
& x \in X^0
\end{aligned}
\qquad (2)
$$

Almost all domain reduction algorithms use the following tests to systematically remove portions of $X^0$ that cannot contain the global minimum.
i) Upper Bound Test
If the objective value UPBD = $f(x)$ corresponding to a point $x$ in the feasible region (i.e. the search space that satisfies all constraints) is known, then any sub-region $X$ of $X^0$ satisfying $F^L(X) >$ UPBD does not contain the global solution and can be deleted from the search space.

ii) Infeasibility Test
For a sub-region $X$, if $G^L(X) > 0$ then $X$ does not contain any feasible region and can be deleted from the search space (domain).
iii) Monotonicity Test
For an unconstrained problem, the gradient vanishes at the optimal point, $f'(x) = 0$. Let $F'(X)$ be the interval extension of $f'(x)$. Now if

$$0 \notin F'(X)$$

then the sub-region cannot contain an optimal point in its interior, and only the edge of the sub-region is retained and the rest deleted. Note that this test is not appropriate for constrained systems; it can only be applied to a sub-region $X$ for which $G^U(X) \le 0$ and $H(X) = [H^L(X), H^U(X)] = [0, 0]$.
iv) Non-convexity Test
This test is based on the principle that at the optimal point the curvature of the surface defined by the objective function is non-negative; in other words, the Hessian (second partial derivative matrix) of $f(x)$ is positive semi-definite. Let $\tilde{H}(X)$ be the interval Hessian of the function. To apply this test one checks whether $\tilde{H}(X)$ can be positive semi-definite, typically by examining its diagonal elements: if $\tilde{H}_{ii}(X) < 0$ for some $i = 1, 2, \ldots, n$, the Hessian cannot be positive semi-definite anywhere in $X$ and the sub-region is deleted. Again, this test is not appropriate for constrained systems; it can only be applied to a sub-region $X$ for which $G^U(X) \le 0$ and $H(X) = [H^L(X), H^U(X)] = [0, 0]$.
v) Distrust Region Test
Vaidyanathan and El-Halwagi (Vaidyanathan and El-Halwagi, 1994) proposed this test for constrained problems. Here the idea is that once an infeasible point, $x$, is found, a box $X$ is constructed around it such that $G^L(X) > 0$. Then the boxed region $X$ is deleted from the search space. The authors did not directly deal with equality constraints.
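The first three tests reduce to simple comparisons once interval bounds on the objective, the constraints and the gradient are available for a box. The sketch below shows the comparisons only. The function names are hypothetical, the bounds are passed in as plain numbers or (lo, hi) pairs, and the illustrative problem (minimize x^2 subject to 1 - x <= 0, with bounds worked out by hand) is an assumption made for this example.

```python
def upper_bound_test(F_lo, upbd):
    """True if the box can be deleted: its objective lower bound exceeds UPBD."""
    return F_lo > upbd


def infeasibility_test(G_lo):
    """True if some inequality g_j(x) <= 0 is violated everywhere on the box."""
    return any(lo > 0.0 for lo in G_lo)


def monotonicity_test(grad):
    """True if some gradient component interval (lo, hi) excludes zero."""
    return any(lo > 0.0 or hi < 0.0 for lo, hi in grad)


# Illustration: f(x) = x**2, g(x) = 1 - x <= 0, and UPBD = f(1) = 1.
upbd = 1.0

# Box X = [2, 3]: F(X) = [4, 9], G(X) = [-2, -1], F'(X) = [4, 6].
delete_by_bound = upper_bound_test(4.0, upbd)
delete_by_gradient = monotonicity_test([(4.0, 6.0)])

# Box X = [0, 0.5]: G(X) = [0.5, 1], so no point of the box is feasible.
delete_by_infeasibility = infeasibility_test([0.5])
```

As noted in the text, the monotonicity test is meaningful only where the constraints are known to be satisfied over the whole box. The sketch does not check that precondition.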
11.3.3 Modifications Employed in this Study
An interval-based global optimization algorithm can be constructed based on the above tests. However, in our experience it is computationally slow, especially for problems with a large number of constraints. Additional domain reduction tests are therefore proposed next.
Upper Bound via SQP Local Optimization
For the initial search space defined by $X^0$, a good upper bound on the global solution, $f_{UPBD}$, is found using a local optimizer, Sequential (or Successive) Quadratic Programming (SQP) (Bazaraa and Shetty, 1979). SQP has proven to be a very powerful algorithm for gradient-based local optimization. It often requires fewer function evaluations than other competing gradient-based algorithms (Biegler et al., 1997). Subsequently, at any iteration k, the upper bound (UPBD_k) for a sub-region $X_k$ is found via SQP. If this upper bound is lower than the overall upper bound UPBD, then UPBD_k replaces UPBD. In our experience, in many cases SQP finds the global solution in the first few iterations and the remaining iterations are merely used to verify global optimality.
Local Feasibility Test
Here the idea is to relax the optimization model by considering only the convex constraints and to determine whether this relaxed search space contains a feasible solution. This requires the prior specification of which constraints are convex and which are not. This is not always straightforward. However, linear equality and linear inequality constraints are simple convex constraints. Based on this reduced set of constraints the feasibility of a sub-region $X_k$ is checked by solving the following feasibility problem

$$
\begin{aligned}
\min_{\mu,\,x} \quad & \mu \\
\text{subject to}\quad & g_{convex}(x) \le \mu \\
& h_{convex}(x) = 0 \\
& x \in X_k
\end{aligned}
\qquad (3)
$$

The above problem is solved via a local optimization algorithm (SQP). Note that for this problem the local and global solutions are identical. If the problem is infeasible (i.e. $\mu > 0$) then $X_k$ cannot have any feasible point and can be deleted from the search space.
Extension to MINLP problems
Mixture design problems have relatively small dimension. For a design with a basis set of m pure components the interval dimension is 2m. As indicated earlier, current interval based global optimization algorithms can only solve continuous optimization problems. An extension of the algorithm is required to solve mixed integer nonlinear programs (MINLP) such as the mixture design problem discussed earlier. Let a hyperbox be represented by $[x_i^L, x_i^U] \ \forall\, i = 1, 2, \ldots, n$. This box can be partitioned into two sub-boxes at a branch point $x_k^*$. The first box is represented as $[x_i^L, x_i^U] \ \forall\, i \ne k$ and $[x_k^L, x_k^*]$, and the second box is represented as $[x_i^L, x_i^U] \ \forall\, i \ne k$ and $[x_k^*, x_k^U]$. We have employed a modified partitioning strategy for binary variables. If variable k is a binary variable then the partitioning results in two points, 0 and 1, on the k-th dimension. For the above case a binary partition will result in two boxes, namely first box = $[x_i^L, x_i^U] \ \forall\, i \ne k$ and [0, 0], and second box = $[x_i^L, x_i^U] \ \forall\, i \ne k$ and [1, 1] (see Figure 3).
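The two partitioning rules can be sketched as follows. This is an illustration under stated assumptions rather than the LIBRA code: a box is represented as a list of (lower, upper) pairs, the continuous branch point defaults to the midpoint, and the names split_box and widest_dimension are hypothetical.

```python
def widest_dimension(box):
    """Index of the dimension with the largest width."""
    return max(range(len(box)), key=lambda i: box[i][1] - box[i][0])


def split_box(box, k, binary_idx, point=None):
    """Split a box along dimension k into two sub-boxes.

    A continuous dimension is cut at `point` (midpoint by default);
    a binary dimension is replaced by the two points [0, 0] and [1, 1].
    """
    lo, hi = box[k]
    if lo == hi:
        raise ValueError("dimension k is already fixed and cannot be split")
    if k in binary_idx:
        first_k, second_k = (0.0, 0.0), (1.0, 1.0)
    else:
        cut = 0.5 * (lo + hi) if point is None else point
        first_k, second_k = (lo, cut), (cut, hi)
    first, second = list(box), list(box)
    first[k], second[k] = first_k, second_k
    return first, second


# One continuous mole fraction x_1 and one binary variable y_1 (index 1).
box = [(0.0, 0.6), (0.0, 1.0)]
k = widest_dimension(box)                       # the binary dimension is widest
absent, present = split_box(box, k, binary_idx={1})
low, high = split_box(present, 0, binary_idx={1})
```

Once a binary dimension has been split it has zero width, so a rule that always picks the widest dimension will not select it again.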
Figure 3: Branching Strategy on continuous and binary variables

The branching strategy at each iteration plays an important role in the efficiency of this algorithm. The accuracy of the bounds on functions and constraints obtained using interval analysis also depends on the width of X. To address both of these issues, branching is performed along the dimension that corresponds to the maximum width.

11.4 STEP-BY-STEP ALGORITHM FOR THE SOLUTION TECHNIQUE
Here the implementation of a global optimization algorithm (LIBRA) that utilizes the domain reduction strategy is presented. In our implementation we employ the interval arithmetic C++ class definitions, data structures and basic linear algebra operations from the public domain software VerGO (http://www-math.cudenver.edu/~rvan/VerGO/VerGO.html) developed by Iwaarden (1996).
In the local optimization solver (a C++ implementation of Biegler and Cuthrell's SQP algorithm (Biegler and Cuthrell, 1985)) the derivative and Hessian information for all functions and constraints is computed via the automatic differentiation package ADOL-C (Griewank et al., 1996). For this, the function and constraints have to be supplied in a defined format. Automatic differentiation is also used for the interval functions and constraints. The algorithm is implemented in the following steps and is also shown in Figure 4.

STEP 1: In this step the input data is prepared. The problem is specified, including the dimension, the number of inequality and equality constraints and the variable indices corresponding to the binary variables (if MINLP). Bounds on the variables (in other words, the original box, entered as an interval vector $X^0$) are specified and define the original search space. Two lists are initialized: box_list (containing the search space) and good_list (containing the candidates for the global solution). A tolerance ($\varepsilon$) is specified that gives the maximum acceptable width of the final solution box. Finally, the original region $X^0$ is inserted as the only box in box_list.

STEP 2: Check if the original region $X^0$ is feasible. If not, the algorithm terminates with an infeasible solution.

STEP 3: Initialize the lower bound (LWBD) to -infinity and the upper bound (UPBD) to the objective value corresponding to the feasible local solution from SQP. If a local solution is not found, set UPBD to +infinity.

STEP 4: Check if box_list is empty. If yes, go to STEP 8; else go to STEP 5.

STEP 5: Remove the box $X_k$ from the top of box_list. This box corresponds to the box with the maximum width. Find the upper bound UPBD_k for this box via SQP such that UPBD_k = $f(x_k^*)$. If UPBD_k < UPBD, then set UPBD = UPBD_k. If $F^L(X_k) >$ UPBD (the upper bound test), delete $X_k$ and go to STEP 4. Next apply the monotonicity test (ignoring constraints), the nonconvexity test (ignoring constraints) and the local feasibility test. If the box fails any of these tests, delete $X_k$ and go to STEP 4.

STEP 6: Check if the maximum width of $X_k$ is less than the specified tolerance. If yes, insert the box in good_list and go to STEP 4.

STEP 7: Apply the back boxing technique (the process of identifying a box that surrounds a given point such that the objective function is convex on that box; Iwaarden, 1996) to box the maximum convex region around the upper bound solution point $x_k^*$. Insert the solution in good_list and partition the remaining search space of $X_k$. If back boxing is not applicable, partition the box about its maximum width as described earlier. Insert the partitioned boxes into box_list. Here the boxes are inserted in an ordered list with the box with the greatest maximum width on top. Note that back boxing is defined only for an unconstrained problem. For a constrained optimization problem we proceed in one of two ways: (a) apply back boxing to the objective function plus the constraints, which have been weighted with a penalty parameter, or (b) apply back boxing to the objective function and check if the constraints are satisfied in the resulting box. Go to STEP 4.

STEP 8: Rank order the boxes in good_list based on the objective function values. Delete boxes for which $F^L$ is greater than UPBD. Terminate and return the list of good boxes containing the globally optimal solution or set of solutions.

Figure 4: Flowchart of the Global Optimization Algorithm

The attractive features of this algorithm are summarized as:
1. Analytical expressions for the gradients and Hessians of the objective function and constraints need not be supplied. Nor are they computed by finite differences; rather, their analytical forms are constructed automatically by the automatic differentiation package.
2. If a problem has more than one global solution, then the algorithm finds all the globally optimal solution points.
3. It can be used for design under uncertain parameters. This is a distinctive feature of interval-based techniques. In effect, if a parameter is not exactly known, but its nominal value and a corresponding confidence interval or error band are known, then it can be used directly as an interval parameter. Since parameter correlations are not accounted for in the hyperbox defined by the parameter intervals, parametric uncertainty addressed this way will result in rather conservative solutions.

This algorithm has been tested on many benchmark NLP and MINLP problems. We have shown that for MINLP problems the splitting strategy does not result in a total enumeration of all possible solutions.
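The overall loop of Steps 4 to 8 can be illustrated with a heavily simplified sketch for an unconstrained problem in one variable. It is not the LIBRA implementation, and several substitutions are assumptions made for brevity: the SQP step is replaced by evaluating the objective at the box midpoint, only the upper bound test is used for pruning, back boxing is omitted, and no outward rounding is applied. The test function f(x) = (x^2 - 1)^2, which has two global minima at x = -1 and x = 1, is chosen only for illustration.

```python
import heapq


def sqr(iv):
    """Interval extension of the square of an interval (lo, hi)."""
    lo, hi = iv
    if lo >= 0.0:
        return (lo * lo, hi * hi)
    if hi <= 0.0:
        return (hi * hi, lo * lo)
    return (0.0, max(lo * lo, hi * hi))


def f(x):
    return (x * x - 1.0) ** 2


def F(iv):
    """Interval extension of f over the interval iv."""
    s_lo, s_hi = sqr(iv)
    return sqr((s_lo - 1.0, s_hi - 1.0))


def minimize(box, eps=1e-3):
    """Return (UPBD, good_list) for f over the interval `box`."""
    upbd = f(0.5 * (box[0] + box[1]))           # midpoint stands in for SQP
    box_list = [(-(box[1] - box[0]), box)]      # widest box is popped first
    good_list = []
    while box_list:                             # STEP 4
        _, (lo, hi) = heapq.heappop(box_list)   # STEP 5
        if F((lo, hi))[0] > upbd:               # upper bound test
            continue
        mid = 0.5 * (lo + hi)
        upbd = min(upbd, f(mid))
        if hi - lo < eps:                       # STEP 6
            good_list.append((lo, hi))
            continue
        for child in ((lo, mid), (mid, hi)):    # STEP 7, bisection only
            heapq.heappush(box_list, (-(child[1] - child[0]), child))
    # STEP 8: discard boxes that can no longer contain a global minimum.
    good_list = sorted(b for b in good_list if F(b)[0] <= upbd)
    return upbd, good_list


best, boxes = minimize((-2.0, 2.0))
```

The returned boxes cluster around both minimizers, which illustrates the second feature listed above: all globally optimal points are retained rather than only one of them.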
11.5 APPLICATION EXAMPLE
Nearly all conventional blanket washes contain VOCs (Adrian, 1991). In Connecticut alone, about 13.5 tons/year of spent solvent is disposed of by the printing industry (Lomasney, 1994). Of this, only 0.69 tons are aqueous solutions; 10.57 tons are non-halogenated solvents and 2.26 tons are halogenated solvents. Weltman and Evanoff (1991) discussed the benefits of replacing halogenated degreasers with aqueous substitutes. The Printing Industry of America (PIA) and the USEPA started a major initiative in the early 1990s to search for alternative water-based blanket wash solvents (United States Environmental Protection Agency, 1995).
Fairly recently the Toxics Reduction Institute (Toxics Reduction Institute, 1997) developed a water-based solvent, Printwise™, that can be used as a blanket wash. The AG Environmental Products company recently commercialized a water-based blanket wash product, "Soygold®". This solvent is a methyl ester of soybean oil and is completely miscible in water. All these water-based blanket washes have very low VOC content and low environmental impact, are less toxic, are easy to recover, and are amenable to safe disposal. Surfactants and dispersants give solvents better cleaning characteristics and have found increasing use in many blanket wash formulations (Design for the Environment Program, 1997). Many blanket wash formulations use alkyl benzene sulfonates (ABS) as surfactants. These are water-soluble surfactants and cannot be used in hydrocarbon-based blanket washes. Thus there is a strong incentive for designing water-based blanket wash solvent blends. This case study explores the systematic development of aqueous blends for use as blanket wash solvents. Details of the problem formulation and solution are given in the appendix (section 11.8).
11.6 CONCLUSIONS
Computer aided blend design is a highly complicated problem. The general understanding of how blend properties vary with composition has improved substantially over the last few decades, and many companies have developed computer models to predict blend properties. These models are being used successfully by companies such as DuPont (Wu, 1987). However, using such models directly to discover optimal blends is non-trivial. Moreover, identifying a binary solvent blend from a set of single-component solvents involves a combinatorial search for an optimal pair. To address this we have developed an optimization framework for the mathematical representation of the solvent blend design problem. The mathematical framework is an MINLP problem. To solve such design problems we have also developed an interval based global optimization tool, LIBRA. This framework and solution approach have been used to solve an industrially relevant problem: designing optimal blends for blanket wash applications in the printing industry, taking into account solvent power, viscosity and surface tension. Seven binary mixture design problems were solved in the case study, and for each we were able to identify a globally optimal blend composition for the solvent mixture.
11.7 REFERENCES
1. Adrian, J. R., Managing Solvents and Wipers. Pollution Prevention Review, 419-425 (1991).
2. Barton, A. F. CRC Handbook of Solubility Parameters and Other Cohesion Parameters, CRC Press, Inc., Boca Raton, Florida (1985).
3. Bazaraa, M. S., and Shetty, C. M. Nonlinear Programming, Wiley, New York (1979).
4. Biegler, L. T., Grossmann, I. E., and Westerberg, A. W. Systematic Methods of Chemical Process Design, Prentice Hall, New Jersey (1997).
5. Biegler, L. T., and Cuthrell, J. E., "Improved Infeasible Path Optimization for Sequential Modular Simulators - II: The Optimization Algorithm", Comp. & Chem. Eng., 9(3) (1985).
6. Center for Aerosol Technology, "Solvent Alternative Guide (SAGE) Version 1.0 Technical report," Research Triangle Institute, Research Triangle Park, N.C. (1993).
7. Design for the Environment Program, "Cleaner Technologies Substitute Assessment: Lithographic Blanket Washes." EPA 744-R-97-006, USEPA (1997).
8. Dunn, R. F., Dobson, A. C., and El-Halwagi, M. M., Optimal Design of Environmentally Acceptable Solvent Blends for Coating. Advances in Environmental Research, 1(2) (1997).
9. Duvedi, A. P., "Mathematical Programming Based Approaches to the Design of Environmentally Safe Refrigerants," Masters Thesis, University of Connecticut (1995).
10. Duvedi, A. P., and Achenie, L. E. K., Designing Environmentally Safe Refrigerants Using Mathematical Programming. Chemical Engineering Science, 51, 3727-3739 (1996).
11. Duvedi, A. P., and Achenie, L. E. K., On the Design of Environmentally Benign Refrigerant Mixtures: a Mathematical Programming Approach. Computers and Chemical Engineering, 21(8), 915-923 (1997).
12. Gani, R., and Brignole, E. A., Molecular Design of Solvents for Liquid Extraction Based on UNIFAC. Fluid Phase Equilibria, 13, 331 (1983).
13. Griewank, A., Juedes, D., and Utke, J., Algorithm 755: ADOL-C: A Package for the Automatic Differentiation of Algorithms Written in C/C++. ACM Transactions on Mathematical Software, 22(2), 131-167 (1996).
14. Hansen, E. Global Optimization Using Interval Analysis, Marcel Dekker, Inc (1992).
15. Horvath, A. L. Molecular Design, Elsevier (1992).
16. Ichida, K., and Fujii, Y., An interval arithmetic method to global optimization. Computing, 23, 85 (1979).
17. Iwaarden, R. J. V., "An Improved Unconstrained Global Optimization Algorithm," University of Colorado at Denver, Denver (1996).
18. Joback, K. G., and Stephanopoulos, G. "Designing Molecules Possessing Desired Physical Property Values." Proceedings FOCAPD 1989, Snowmass Village, Colorado (1989).
19. Klein, J. A., Wu, D. T., and Gani, R., Computer Aided Mixture Design with Specified Property Constraints. Computers and Chemical Engineering, 16 Supplement, S229 (1992b).
20. Krishner, E. M., Environment, Health Concerns Force Shift In Use Of Organic Solvents. Chemical and Engineering News (June 20), 13-20 (1995).
21. Lamasney, R., Printers, Connecticut Technical Assistance Program (1994).
22. Macchietto, S., Odele, O., and Omatsone, O., Design of Optimal Solvents for Liquid-Liquid Extraction and Gas Absorption Processes. Transactions of the Institute of Chemical Engineers, 68, 429 (1990).
23. Moore, R., Hansen, E. R., and Leclerc, A., Rigorous Methods for Global Optimization. Recent Advances in Global Optimization, C. A. Floudas and M. Pardalos, eds., Princeton University Press (1992).
24. Moore, R. E. Interval Analysis, Prentice-Hall, Englewood Cliffs, New Jersey (1966).
25. Moore, R. E. "Methods and applications of interval analysis." SIAM, Philadelphia (1979).
26. Naser, S. F., and Fournier, R. L., A System for the Design of an Optimum Liquid-Liquid Extractant Molecule. Comput. Chem. Eng., 15(6), 397 (1991).
27. Neumaier, A. Interval Methods for Systems of Equations, Cambridge University Press, London (1990).
28. Odele, O., and Macchietto, S., Computer Aided Molecular Design: A Novel Method for Optimal Solvent Selection. Fluid Phase Equilibria, 82, 47-54 (1993).
29. Pistikopoulos, E. N., and Stefanis, S. K., Optimal solvent design for environmental impact minimization. Computers and Chemical Engineering, 22(6), 717-733 (1998).
30. Pretel, E. J., Lopez, P. A., Bottini, S. B., and Brignole, E. A., Computer-Aided Molecular Design of Solvents for Separation Processes. AIChE Journal, 40(8), 1349-1360 (1994).
31. Ratschek, H., and Rokne, J., Interval Tools for Global Optimization. Computers Math. Applic., 21(6/7), 41-50 (1991).
32. Ratschek, H., and Rokne, J. Computer Methods for the Range of Functions, Halsted Press, New York (1984).
33. Reid, R. C., Prausnitz, J. M., and Poling, B. E. The Properties of Gases and Liquids, McGraw Hill, New York (1987).
34. Rokne, J. H., Low Complexity k-dimensional centered forms. Computing, 37, 247-253 (1986).
35. Tamura, M., Kurata, M., and Odani, H., Bull. Chem. Soc. Japan, 28, 83 (1955).
36. Toxics Reduction Institute, "Demonstration of Printwise™: A "Near Zero" Lithographic Ink and Blanket Wash System." 39, University of Massachusetts, Lowell (1997).
37. United States Environmental Protection Agency, "Cleaner Technologies Substitutes Assessment: Lithographic Blanket Washes." EPA 744-R-97-006, United States Environmental Protection Agency (1997b).
38. Vaidyanathan, R., and El-Halwagi, M., Computer-Aided Design of High Performance Polymers. J. Elastom. Plast., 26(3), 277 (1994a).
39. Vaidyanathan, R., and El-Halwagi, M., Global Optimization of Nonconvex Nonlinear Programs Via Interval Analysis. Computers and Chemical Engineering, 18(10), 889-897 (1994b).
40. Vaidyanathan, R., and El-Halwagi, M., Computer-Aided Synthesis of Polymers and Blends with Target Properties. Ind. Eng. Chem. Res., 35(2), 627-634 (1996).
41. Vaidyanathan, R., Gowayed, Y., and El-Halwagi, M., Computer-aided design of fiber reinforced polymer composite products. Computers and Chemical Engineering, 22(6), 801-808 (1998).
42. Van Iwaarden, R. J., "An Improved Unconstrained Global Optimization Algorithm," Ph.D., University of Colorado, Denver (1996).
43. Venkatasubramanian, V., Chan, K., and Caruthers, J. M., Evolutionary design of molecules with desired properties using the genetic algorithms. J. Chem. Inf. Comput. Sci., 35, 188 (1995).
44. Weltman, H. J., and Evanoff, S. P., "Replacement of Halogenated Solvent Degreasing with Regenerable Aqueous Cleaners," General Dynamics Corporation, Fort Worth, Texas (1991).
45. Wu, D. T., Modeling and simulation in the coating industry. Chemtech (January 1987).
46. Zhao, R., and Cabezas, H., Molecular Thermodynamics in the Design of Substitute Solvents. Industrial and Engineering Chemistry Research, 37, 3268-3280 (1988).
11.8 APPENDIX: DETAILED SOLUTION OF CASE STUDY

11.8.1 Case Study Objective
Even though many water-soluble solvents exist that can be used to make a blanket wash formulation, deciding on the composition of the final wash formulation is a trial-and-error procedure. Moreover, mixture property prediction for aqueous systems is difficult and the models are highly nonlinear (Reid et al., 1987; Wu, 1987). Thus there is a strong incentive to develop a formulation tool that can design aqueous blanket wash blends in the presence of nonlinear (and possibly non-convex) models. In this study, we employ our interval arithmetic based global optimization package LIBRA for the systematic design of optimal water-based blanket wash systems.

11.8.2 Basis Set
The EPA report on blanket wash risk assessment (Design for the Environment Program, 1997) lists 40 different formulations (or solvent blends) used as blanket washes by different printing facilities throughout the United States. However, for proprietary reasons their compositions are not reported. Of these, 21 formulations contain petroleum distillates (hydrocarbons and/or aromatic hydrocarbons), which pose considerable environmental, health and safety risks. Two common aromatic hydrocarbons used in blanket washes are 1-2-4 trimethyl benzene (C9H12) and isomers of xylene (C8H10). Trimethyl benzene has a flash point of 54.4 °C and a log Kow of 3.78. Isomers of xylene have flash points as low as 17 °C and a log Kow of 3.15. Thus both are flammable and have high bioaccumulation and toxicity; they are shown in Table 1 below.

Table 1: Two aromatic hydrocarbons used in many commercial blanket washes.
Property                  1-2-3 Trimethyl Benzene    Xylene (o-xylene, m-xylene, p-xylene)
OSHA PEL                  200 mg/m³                  -
Log Kow                   3.78                       3.15
Log BCF                   2.53                       2.16
Log Koc                   2.86                       -0.69
Water Solubility (g/L)    0.02                       0.1
The pure component solvents employed in this case study are non-halogenated, non-aromatic, water-soluble compounds. In addition, only solvents with relatively small environmental and health impact are selected. These solvents are listed in Table 2 and the desired attributes for an optimal blanket wash formulation are defined in Table 3. Sources: EPA Report on Cleaner Technologies Substitutes Assessment: Lithographic Blanket Washes (United States Environmental Protection Agency, 1997), and SOLV-DB, Solvent Database at "http://solvdb.ncms.org/solvdb.htm", National Center of Manufacturing Science.
These attributes target the solvent power, its flow characteristics, surface contacting and environmental impact. Note that by constraining both density and viscosity, we have constrained the kinematic viscosity (μ/ρ) of the blends. The pure component properties of the basis set are presented in Table 4.

Table 2: Solvents used in the case study to design blends.
1. Methyl Ethyl Ketone (MEK)
OSHA PEL: -; Log Kow = -0.64; Log BCF = -0.7; Log Koc = 0.85; Water Solubility (mg/kg) = ∞
OSHA PEL: 200 mg/m³; Log Kow = 0.29; Log BCF = 0.0; Log Koc = 0.72; Water Solubility (mg/kg) = 223 000
α-Terpineol
OSHA PEL: -; Log Kow = 3.33; Log BCF = 2.30; Log Koc = 1.76; Water Solubility (mg/kg) = ∞
Diethylene Glycol Monomethyl Ether (DGME)
Propylene Glycol (PG)
OSHA PEL: -; Log Kow = -0.92; Log BCF = 0.82; Log Koc = 0.0; Water Solubility (mg/kg) = ∞
N-Methyl Pyrrolidone (NMP)
γ-Butyrolactone (GBL)
OSHA PEL: 100 mg/m³; Log Kow = -1.18; Log BCF = -1.1; Log Koc = -1.3; Water Solubility (mg/kg) = ∞
Water
OSHA PEL: -; Log Kow = -0.11; Log BCF = 0.31; Log Koc = 1.32; Water Solubility (mg/kg) = ∞
Diethylene Glycol Monoethyl Ether (DGEE)
OSHA PEL: 100 mg/m³; Log Kow = -1.18; Log BCF = -1.1; Log Koc = -1.3; Water Solubility (mg/kg) = ∞
Table 3: Desired attributes of an optimal blanket wash blend.

Solvent Power: based on the solubility interaction radius $R^{ij}$ of blend and polymeric resin. The resin is a phenolic resin, Phenodur® 373 U (Barton, 1985). Solubility parameters are $\delta_D = 19.7$, $\delta_P = 11.6$ and $\delta_H = 14.6$, with interaction radius $R^* = 12.7$ (all in MPa^{1/2}).
Density (ρ as specific gravity):                 [0.9 - 1.4]
Viscosity (μ in centipoise):                     [0.8 - 1.4]
Surface Tension (σ in dyn/cm):                   [30.0 - 45.0]
Vapor Pressure (psat in mmHg):                   [0 to 2]
Inhalation Exposure (IE in mg/day):              [0 to 2]
Permissible Exposure Limit (PEL in mg/m³):       [0 to 100]

Table 4: Pure component properties of solvents in the basis set.

Component             μ (cp)    σ (dyn/cm)   psat (mmHg)   δD (MPa^1/2)   δP (MPa^1/2)   δH (MPa^1/2)   ρ (sg)
Methyl Ethyl Ketone   0.378     24.600       95.300        14.100         9.300          9.500          0.801
Butyrolactone         1.700     40.430       3.200         18.600         12.200         14.000         1.120
NMP                   1.660     40.700       0.334         16.500         10.400         13.500         1.001
Propylene Glycol      19.000    36.510       0.200         11.800         13.300         25.000         1.034
DGME                  3.480     28.190       0.180         16.200         9.200          14.300         1.229
DGEE                  3.850     29.530       0.126         16.200         9.200          12.300         1.025
α-terpineol           36.500    31.600       0.490         13.900         7.900          10.200         0.819
Water                 1.000     70.000       50.000        26.500         23.300         14.800         1.000

The mathematical formulation for the problem is

PBLEND:

$$\min \; R^{ij} = \left[ 4(\delta_D - \delta_D^*)^2 + (\delta_P - \delta_P^*)^2 + (\delta_H - \delta_H^*)^2 \right]^{1/2}$$

subject to:

(Solvent Power)
$$\delta_D = \sum_i \phi_i \delta_{Di}, \qquad \delta_P = \sum_i \phi_i \delta_{Pi}, \qquad \delta_H = \sum_i \phi_i \delta_{Hi}, \qquad \phi_i = \frac{x_i V_i}{\sum_i x_i V_i}$$

(Density Constraint)
$$\rho_{mix} = \frac{\sum_i x_i \, MW_i}{\sum_i x_i V_i}, \qquad \rho^L \le \rho_{mix} \le \rho^U$$

(Viscosity Constraint)
$$\ln(\eta_{mix}\varepsilon_{mix}) = x_1 \ln(\eta_1 \varepsilon_1) + x_2 \ln(\eta_2 \varepsilon_2)$$
$$\varepsilon = \frac{V_c^{2/3}}{(T_c M)^{1/2}}, \qquad V_{c(mix)} = \sum_i \sum_j x_i x_j V_{c(ij)}, \qquad M_{(mix)} = \sum_i x_i M_i$$
$$V_{c(ij)} = \frac{\left(V_{ci}^{1/3} + V_{cj}^{1/3}\right)^3}{8}, \qquad \eta^L \le \eta_{mix} \le \eta^U$$

(Surface Tension Constraint)
$$\sigma_{mix}^{1/4} = \psi_w \sigma_w^{1/4} + \psi_o \sigma_o^{1/4}, \qquad \psi_o = 1 - \psi_w$$
$$\log_{10} \frac{\psi_w^{\,q}}{1 - \psi_w} = \log_{10}\left[ \frac{(x_w V_w)^q}{x_o V_o} (x_w V_w + x_o V_o)^{1-q} \right] + 0.441 \, \frac{q}{T} \left[ \frac{\sigma_o V_o^{2/3}}{q} - \sigma_w V_w^{2/3} \right]$$
$$\sigma^L \le \sigma_{mix} \le \sigma^U$$

(Vapor Pressure Constraint)
$$\sum_i x_i \gamma_i P_i^{sat} \le P^{sat}_{max}$$

(OSHA Constraint on Permissible Exposure Limit)
$$118.8 \, x_i \, p_i \le PEL_i$$

(Inhalation Exposure Limit)
$$(0.48 \, \Delta t) \, G_i \le IE_{max}$$

(Water fraction more than 30%)
$$x_w > 0.3$$

(Mole fraction constraint)
$$\sum_i x_i = 1$$
More details of these models are given in Section 11.8.4. An MINLP model was formulated and solved in two ways. In Case 1, the model (PBLEND) was solved by fixing the binary variables, resulting in an NLP model. Specifically, each binary mixture was constructed by fixing the binary variables for water and one of the other pure component solvents to 1; the remaining binary variables were set to 0. The surface tension model requires the calculation of the water fraction $\psi_w$, which becomes an additional search variable. In Case 2 (PBLEND_MINLP) the MINLP model was solved rigorously, i.e. without fixing the binary variables. This is a relatively more difficult problem, with 17 variables and 8 binary variables. In this case too we considered only 2-component blends (i.e. binary blends, not to be confused with binary variables). Thus the solution approach not only picks the best combination of 2-component solvents (combinatorial problem) but also finds the optimal composition (continuous problem).
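The Case 1 enumeration, in which water and exactly one other solvent have their binary variables fixed to 1, can be sketched as below. The short solvent labels and the function name are hypothetical; the NLP solved for each assignment is not shown.

```python
# Short labels for the seven non-water solvents of the basis set (labels are illustrative).
SOLVENTS = ["MEK", "GBL", "NMP", "PG", "DGME", "DGEE", "alpha-terpineol"]


def case1_assignments(solvents=SOLVENTS):
    """Yield (solvent, y), where y fixes the binary variables of one water + solvent blend."""
    for chosen in solvents:
        y = {name: int(name == chosen) for name in solvents}
        y["Water"] = 1  # water is present in every blend
        yield chosen, y


assignments = list(case1_assignments())
```

Each of the seven assignments defines one NLP in the continuous variables only, which is why Case 1 amounts to a complete enumeration of the binary blends with water.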
11.8.3 Results and Discussion
Table 5: Computational results of blend design case study - Case 1.

Component                R^ij     x1            xw             ψw      Iter.   CPU (secs)
MEK-Water                10.34    0.14          0.86           0.31    31      3.4
Butyrolactone-Water      4.40     0.45          0.55           0.10    17      2.5
NMP-Water                6.54     0.46          0.54           0.07    33      4.1
Propylene Glycol-Water   12.94    0.06          0.94           0.47    29      5.9
DGME-Water               10.02    0.07          0.93           0.25    23      13.3
DGEE-Water               8.58     0.10          0.90           0.18    25      8.7
α-terpineol-Water        10.16    0.08          0.924          0.16    21      5.1
Any component-Water      4.40     0.45 (GBL)    0.55 (Water)   0.10    1013    242.0
The models PBLEND and PBLEND_MINLP were solved to obtain 7 different binary mixtures, as shown in Table 5. Consider, for example, propylene glycol and water blends, for which the globally minimal objective function value, $R^{ij}$, is 12.94. Since this exceeds the interaction radius (12.7) of the phenolic resin, no propylene glycol and water blend will fall within the interaction radius. It is therefore expected that no propylene glycol and water blend will be effective in dissolving phenolic resin. Among all solutions, the lowest objective value is achieved by a γ-butyrolactone and water blend, with an interaction radius of 4.4. The attributes of the solvent blends in Table 5 are tabulated in Tables 6a-6b.
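The comparison against the interaction radius can be reproduced directly from the $R^{ij}$ column of Table 5. The snippet below is a small illustrative check; the dictionary keys are shorthand labels, not names used by the authors.

```python
R_STAR = 12.7  # interaction radius of the phenolic resin (Table 3), MPa^(1/2)

# Globally minimal R^ij for each water blend, taken from Table 5.
R_IJ = {
    "MEK": 10.34,
    "GBL": 4.40,
    "NMP": 6.54,
    "PG": 12.94,
    "DGME": 10.02,
    "DGEE": 8.58,
    "alpha-terpineol": 10.16,
}

# A blend can dissolve the resin only if its best R^ij lies inside the radius.
inside_radius = {name: r < R_STAR for name, r in R_IJ.items()}
best_blend = min(R_IJ, key=R_IJ.get)
```

Here `inside_radius["PG"]` is `False` while the other six entries are `True`, and `best_blend` is `"GBL"`, in line with the discussion above.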
Table 6a: Blend Properties.

Component                ρ (sg)    psat (mmHg)   IE# (mg/day)   PEL (mg/m³)
MEK-Water                0.910     13.628        116.825        15.348
Butyrolactone-Water      1.093     1.440         15.235         58.045
NMP-Water                1.001     0.154         1.828          54.339
Propylene Glycol-Water   1.007     0.012         0.115          7.130
DGME-Water               1.067     0.013         0.125          8.917
DGEE-Water               1.011     0.013         0.195          12.173
α-terpineol-Water        0.922     0.037         0.563          8.237

Table 6b: Blend Properties.

Component                δD (MPa^1/2)   δP (MPa^1/2)   δH (MPa^1/2)   σmix (dyn/cm)   μmix (cp)
MEK-Water                20.860         16.932         12.389         35.000          0.830
Butyrolactone-Water      20.359         14.672         14.178         42.862          1.363
NMP-Water                18.259         12.669         13.729         42.380          1.322
Propylene Glycol-Water   23.457         21.230         16.911         38.350          1.218
DGME-Water               23.484         19.171         14.654         30.260          1.181
DGEE-Water               21.841         16.922         13.669         31.572          1.236
α-terpineol-Water        21.075         16.669         12.819         33.592          1.361
# Inhalation Exposure

Water (with pure-component σ = 70) is the major component in all the blends shown in Tables 6a-6b. We note that the surface tension is highly nonlinear, in that a small organic fraction in an aqueous blend results in a very large change in surface tension. For example, 6% of propylene glycol (with σ = 36.51) reduced the aqueous blend's surface tension from 70 to 38.35. This behavior is also observed in practice, as verified by many experimental results (Tamura et al., 1955).

Case 2 (PBLEND_MINLP) is solved next. The solution found is identical to that of the second blend in Case 1 (an aqueous blend with γ-butyrolactone at mole fraction 0.450). However, for this problem the number of iterations (1013) is much higher. Consequently the CPU time is relatively large (242.3 seconds). Thus it appears that when the number of alternatives is small, the MINLP formulation is less efficient than a complete enumeration.
11.8.4 Mixture Property Models
The mixture property models used in the case study are outlined here.
Solvent Power
For a solvent mixture, each component of the solubility parameter can be computed by

$$\delta_D = \sum_i \phi_i \delta_{Di}, \qquad \delta_P = \sum_i \phi_i \delta_{Pi}, \qquad \delta_H = \sum_i \phi_i \delta_{Hi}$$

where $\phi_i$ is the volume fraction of each component, expressed as

$$\phi_i = \frac{x_i V_i}{\sum_i x_i V_i}$$
Phenolic resins are commonly used in printing inks. The dried ink (solute) is assumed to be a phenolic resin, specifically "Super Bakacite® 1001, Reichhold". The solubility parameters of the resin are nonpolar ($\delta_D$) = 23.3, polar ($\delta_P$) = 6.6 and hydrogen bonding ($\delta_H$) = 8.3 MPa^{1/2} (Barton, 1985). The radius of interaction is R = 19.8 MPa^{1/2}. Thus solvents that can effectively dissolve the ink residue satisfy the following solute-solvent interaction constraint:

$$R^{ij} = \left( 4(\delta_D - 23.3)^2 + (\delta_P - 6.6)^2 + (\delta_H - 8.3)^2 \right)^{1/2} < 19.8$$
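The volume-fraction mixing rule and the interaction constraint above can be sketched as follows. This is a minimal illustration: the function names are hypothetical, the solubility parameters come from Table 4, and the molar volumes used in the example are assumed round numbers that do not appear in the chapter.

```python
import math


def volume_fractions(x, v):
    """phi_i = x_i V_i / sum_i x_i V_i for mole fractions x and molar volumes v."""
    total = sum(xi * vi for xi, vi in zip(x, v))
    return [xi * vi / total for xi, vi in zip(x, v)]


def blend_parameters(x, v, delta):
    """Volume-fraction average of the (dD, dP, dH) triples in delta."""
    phi = volume_fractions(x, v)
    return tuple(sum(p * d[k] for p, d in zip(phi, delta)) for k in range(3))


def interaction_distance(blend, resin):
    """R^ij = [4(dD - dD*)^2 + (dP - dP*)^2 + (dH - dH*)^2]^(1/2)."""
    return math.sqrt(4.0 * (blend[0] - resin[0]) ** 2
                     + (blend[1] - resin[1]) ** 2
                     + (blend[2] - resin[2]) ** 2)


RESIN = (23.3, 6.6, 8.3)   # resin parameters quoted in this section, MPa^(1/2)
R_MAX = 19.8               # radius of interaction quoted in this section

GBL = (18.6, 12.2, 14.0)   # Table 4
WATER = (26.5, 23.3, 14.8) # Table 4
MOLAR_VOLUMES = [76.8, 18.0]  # cm3/mol; assumed values for illustration only

blend = blend_parameters([0.45, 0.55], MOLAR_VOLUMES, [GBL, WATER])
dissolves = interaction_distance(blend, RESIN) < R_MAX
```

Because the result depends on the assumed molar volumes, the numbers produced here are not expected to match Table 5 or Table 6b.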
Flow Characteristic (viscosity)
Most liquid mixture viscosity models assume that the pure component models are available. Reid et al. discuss two different models for mixture viscosity, namely (a) the Grunberg and Nissan model and (b) the method of Teja and Rice. While the first model works well for organic liquids, the Teja and Rice model is specially recommended for aqueous blends. The equation for binary mixture viscosity is:
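$$\ln(\eta_{mix}\varepsilon_{mix}) = x_1 \ln(\eta_1 \varepsilon_1) + x_2 \ln(\eta_2 \varepsilon_2)$$

$$\varepsilon = \frac{V_c^{2/3}}{(T_c M)^{1/2}}$$

$$V_{c(mix)} = \sum_i \sum_j x_i x_j V_{c(ij)}$$

$$M_{(mix)} = \sum_i x_i M_i$$

$$V_{c(ij)} = \frac{\left(V_{ci}^{1/3} + V_{cj}^{1/3}\right)^3}{8}$$

where $\eta_i$ is the pure component viscosity evaluated at $T(T_{ci}/T_{cm})$, $V_c$ and $T_c$ are the critical volume and temperature, and $M$ is the molecular weight. Flow characteristic requirements are posed as

$$\eta^L \le \eta_{(product)} \le \eta^U$$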
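A minimal sketch of the Teja and Rice relations above is given below. It assumes that the pure component viscosities have already been evaluated at the reference temperatures $T(T_{ci}/T_{cm})$ and that the mixture critical temperature `tc_mix` is supplied by the caller, since its mixing rule is not reproduced in this section. Function names and the numerical inputs are hypothetical.

```python
import math


def epsilon(vc, tc, m):
    """epsilon = Vc^(2/3) / (Tc * M)^(1/2)."""
    return vc ** (2.0 / 3.0) / math.sqrt(tc * m)


def vc_mixture(x, vc):
    """Vc(mix) = sum_i sum_j x_i x_j Vc(ij), with Vc(ij) = (Vci^(1/3) + Vcj^(1/3))^3 / 8."""
    total = 0.0
    for xi, vci in zip(x, vc):
        for xj, vcj in zip(x, vc):
            vcij = (vci ** (1.0 / 3.0) + vcj ** (1.0 / 3.0)) ** 3 / 8.0
            total += xi * xj * vcij
    return total


def teja_rice_viscosity(x, eta, vc, tc, m, tc_mix):
    """Solve ln(eta_mix * eps_mix) = sum_i x_i ln(eta_i * eps_i) for eta_mix."""
    ln_sum = sum(xi * math.log(ei * epsilon(vi, ti, mi))
                 for xi, ei, vi, ti, mi in zip(x, eta, vc, tc, m))
    m_mix = sum(xi * mi for xi, mi in zip(x, m))
    eps_mix = epsilon(vc_mixture(x, vc), tc_mix, m_mix)
    return math.exp(ln_sum) / eps_mix


# Made-up inputs, only to exercise the functions (not data from the chapter).
eta_mix = teja_rice_viscosity(
    x=[0.4, 0.6], eta=[1.7, 1.0], vc=[250.0, 56.0],
    tc=[700.0, 647.0], m=[86.0, 18.0], tc_mix=670.0,
)
```

A quick sanity check of the sketch is that a "mixture" consisting of one pure component, with `tc_mix` equal to its critical temperature, returns that component's viscosity.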
Surface Contacting
Surface contacting determines how effective the solvent will be in wetting the blanket surface; thus it characterizes the solvent's cleaning ability. High surface tension also translates to more energy utilization, especially if the cleaning is performed via a wiping operation. Surface tension of aqueous solutions is more difficult to predict than that of non-aqueous mixtures because of the nonlinear dependence on mole fraction. A small concentration of organic material may significantly affect the surface-tension value. For many binary organic-aqueous mixtures, the method of Tamura, Kurata, and Odani (Tamura et al., 1955) is recommended (Reid et al., 1987):

$$\sigma_{mix}^{1/4} = \psi_w \sigma_w^{1/4} + \psi_o \sigma_o^{1/4}$$

where
$\sigma_{mix}$ = surface tension of mixture, dyn/cm
$\sigma_w$ = surface tension of pure water, dyn/cm
$\sigma_o$ = surface tension of pure organic compound, dyn/cm
$\psi_o = 1 - \psi_w$

and $\psi_w$ is defined by the relation

$$\log_{10} \frac{\psi_w^{\,q}}{1 - \psi_w} = \log_{10}\left[ \frac{(x_w V_w)^q}{x_o V_o} (x_w V_w + x_o V_o)^{1-q} \right] + 0.441 \, \frac{q}{T} \left[ \frac{\sigma_o V_o^{2/3}}{q} - \sigma_w V_w^{2/3} \right]$$

Here
$x_w$ = bulk mole fraction of pure water
$x_o$ = bulk mole fraction of pure organic component
$V_w$ = molar volume of pure water, cm³/mol
$V_o$ = molar volume of pure organic component, cm³/mol
$q$ = constant to be read from a table; it depends on the type and size of the organic component. For example, $q = n_c$ for fatty acids and alcohols, and $q = n_c - 1$ for ketones, where $n_c$ is the number of carbon atoms in the molecule.

Expected errors are reported to be less than 10% for q less than 5 and less than 20% for q greater than 5.
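The relation for $\psi_w$ is implicit when q differs from 1, so a numerical root search is needed. The sketch below uses plain bisection, which is an implementation choice of this illustration and not necessarily how LIBRA treats $\psi_w$ (in the case study it is an additional search variable). The methanol-water inputs are approximate illustrative values assumed here; they are not taken from the chapter.

```python
import math


def surface_fraction_water(xw, vw, vo, sigma_w, sigma_o, q, temp):
    """Solve log10(psi_w^q / (1 - psi_w)) = rhs for psi_w by bisection."""
    xo = 1.0 - xw
    bulk = math.log10((xw * vw) ** q / (xo * vo) * (xw * vw + xo * vo) ** (1.0 - q))
    rhs = bulk + 0.441 * (q / temp) * (
        sigma_o * vo ** (2.0 / 3.0) / q - sigma_w * vw ** (2.0 / 3.0)
    )
    target = 10.0 ** rhs
    lo, hi = 1e-12, 1.0 - 1e-12  # psi_w^q / (1 - psi_w) increases monotonically on (0, 1)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if mid ** q / (1.0 - mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def mixture_surface_tension(psi_w, sigma_w, sigma_o):
    """sigma_mix^(1/4) = psi_w sigma_w^(1/4) + (1 - psi_w) sigma_o^(1/4)."""
    return (psi_w * sigma_w ** 0.25 + (1.0 - psi_w) * sigma_o ** 0.25) ** 4


# Assumed illustrative inputs: methanol (q = n_c = 1) in water at 303 K.
psi_w = surface_fraction_water(xw=0.878, vw=18.0, vo=41.0,
                               sigma_w=71.18, sigma_o=21.75, q=1.0, temp=303.0)
sigma_mix = mixture_surface_tension(psi_w, 71.18, 21.75)
```

With these inputs a 12 mol% organic fraction already lowers the predicted surface tension from about 71 to the mid-40s dyn/cm, which illustrates the strong nonlinearity discussed in Section 11.8.3.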
Computer Aided Molecular Design: Theory and Practice
L.E.K. Achenie, R. Gani and V. Venkatasubramanian (Editors)
© 2003 Elsevier Science B.V. All rights reserved.
Chapter 12: Refrigerant Design Case Study
A. Apostolakou & C. S. Adjiman
12.1 INTRODUCTION
The CAMD problem formulation presented in Chapter 4 is applied to the problem of refrigerant design. In the early days of the development of refrigerants, chlorofluorocarbons emerged as the most likely candidates for effective refrigerants. However, following their widespread adoption, the impact of fully halogenated chlorofluorocarbons (CFCs) on stratospheric and tropospheric ozone, global warming and acid deposition was detected. As a result of the Montreal Protocol, these compounds have been phased out, stimulating a search for alternatives.

Thermodynamic and transport properties can be used to evaluate the performance of a refrigerant system. The most important thermodynamic properties are the boiling point, vapor pressure, liquid specific heat, and enthalpy of vaporization. Joback and Stephanopoulos (1989) were the first to report the computer-aided molecular design of replacement refrigerants, followed by Gani et al. (1991). Environmental constraints in the form of ozone depletion potential have since been explicitly considered by Duvedi and Achenie (1996) and Churi and Achenie (1996). Sahinidis and Tawarmalani (2000) have applied a global optimization algorithm to the problem.

In this example the refrigerant design problem of finding alternatives to freon-12 (CCl2F2), first proposed by Joback and Stephanopoulos (1989), is considered. The objective is to find a refrigerant as good as or better than freon-12. This is defined as a refrigerant with a larger heat of vaporization and a smaller liquid heat capacity than those of freon-12. The replacement refrigerant obtained should be analyzed with respect to environmental properties. One important consideration in material replacement is the desire to use existing equipment and processing technology. This requires that the old and the substitute material have similar transport and other properties, such as heat capacity and vapor pressure (Sinha et al., 1999).
An existing refrigeration process for which a replacement refrigerant is to be found is defined by the following temperatures (Joback and Stephanopoulos, 1989):
1. evaporating temperature $T_e = 272$ K,
2. condensing temperature $T_{cd} = 316$ K,
3. mean process temperature $T_m = 294$ K.

Saturated conditions are assumed so that the evaporating temperature $T_e$ is equal to the saturation temperature ($T_s$). The relevant properties of freon-12 at the operating conditions are as follows:
1. enthalpy of vaporization at $T_e$: $H_{v,freon}^{T_e} = 18.4$ kJ mol$^{-1}$,
2. liquid specific heat at mean temperature $T_m$: $C_{pl,freon}^{T_m} = 27.1$ cal mol$^{-1}$ K$^{-1}$.
12.2 PROBLEM FORMULATION
The formulation of the CAMD problem proceeds in three stages (Joback, 1987). First, property constraints and the objective function are formulated. Then, group contribution relations between molecular structure and the properties of interest are added to the problem. Finally, the first order groups which are allowed to appear in the candidate refrigerants are selected, using physical information on the problem. The notation of Chapter IV is used.
12.2.1 Property constraints and objective function
The main property constraints arise from the requirement to obtain refrigerants that have a process performance at least as good as freon-12:

$$H_v^{T_e} \ge 18.4 \ \text{kJ mol}^{-1} \qquad (1)$$

$$C_{pl}^{T_m} \le 27.1 \ \text{cal mol}^{-1}\,\text{K}^{-1} \qquad (2)$$
The enthalpy of vaporization should be high to limit the volumetric flow. A low liquid specific heat reduces the amount of refrigerant that flashes upon passage through the expansion valve. The heat capacity constraint is evaluated at the mean process temperature. The vapor pressure of the molecule at the operating temperatures is considered to be an important property for refrigerant design. The lowest pressure in the refrigeration cycle ($P_s^{T_e}$) should be greater than atmospheric. This reduces the possibility of air and moisture leaking into the system. Here, a minimum of 1.4 bar is imposed. On the other hand, a high system pressure increases the size, weight, and cost of equipment. A pressure ratio of 10 is considered to be the maximum for a refrigeration cycle and, as a result, the highest pressure in the system ($P_s^{T_{cd}}$) is limited to a maximum of 14 bar (Joback and Stephanopoulos, 1989):

$$P_s^{T_e} \ge 1.4 \ \text{bar} \qquad (3)$$

$$P_s^{T_{cd}} \le 14 \ \text{bar} \qquad (4)$$
Environmental, health and safety concerns are important issues for refrigerant design. The ozone depletion potential (ODP) and the tropospheric lifetime are related to the environmental impact of a compound. Non-flammability, good stability and low toxicity are also desirable. However, in this case study, only the most fundamental thermodynamic criteria (heat of vaporization, liquid specific heat and vapor pressure) are considered in the optimization formulation. The environmental characteristics and the stability of the designed molecules are considered after the optimization problem has generated some candidate molecules. Once the main property constraints have been identified, the objective function to be minimized can be chosen. While many types of performance objective can be proposed to define a good refrigerant, the function used in this case study emphasizes the need for a high heat of vaporization, $H_v^{T_e}$, and a low heat capacity, $C_{pl}^{T_m}$ (Churi and Achenie, 1996):

$$F = \frac{C_{pl}^{T_m}}{H_v^{T_e}} \qquad (5)$$
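Constraints (1)-(4) and objective (5) can be collected in a few lines, as sketched below. The `Candidate` class and function names are hypothetical, and the two vapor pressures given to the freon-12 reference point are placeholder values chosen only so that the example runs; the chapter quotes only its enthalpy of vaporization and liquid heat capacity.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Estimated properties of a candidate refrigerant (hypothetical container)."""
    hv_te: float   # enthalpy of vaporization at T_e, kJ/mol
    cpl_tm: float  # liquid heat capacity at T_m, cal/(mol K)
    ps_te: float   # vapor pressure at T_e, bar
    ps_tcd: float  # vapor pressure at T_cd, bar


def is_feasible(c):
    """Property constraints (1)-(4)."""
    return (c.hv_te >= 18.4 and c.cpl_tm <= 27.1
            and c.ps_te >= 1.4 and c.ps_tcd <= 14.0)


def objective(c):
    """Objective (5): F = C_pl / H_v, to be minimized."""
    return c.cpl_tm / c.hv_te


# Reference point: H_v and C_pl of freon-12 from the text; pressures are placeholders.
reference = Candidate(hv_te=18.4, cpl_tm=27.1, ps_te=3.0, ps_tcd=10.0)
f_reference = objective(reference)
```

Any feasible candidate has F no larger than `f_reference`, since constraints (1) and (2) bound the denominator from below and the numerator from above.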
12.2.2 Structure-property relationships
In order to relate the property constraints and objective function to molecular structure, the group contribution method proposed by Marrero and Gani (2001) is used to estimate the following four properties:
• normal boiling point, $T_b$ in K,
• critical temperature, $T_c$ in K,
• critical pressure, $P_c$ in bar,
• standard enthalpy of vaporization at 298 K, $H_v^{298}$ in kJ mol$^{-1}$.
Only first order contributions are taken into account in the property estimation functions:
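$$\exp\left(\frac{T_b}{222.543}\right) = \sum_{k \in G_1} n_{1k} C_k^{T_b} \qquad (6)$$

$$\exp\left(\frac{T_c}{231.239}\right) = \sum_{k \in G_1} n_{1k} C_k^{T_c} \qquad (7)$$

$$(P_c - 5.9827)^{-0.5} - 0.108998 = \sum_{k \in G_1} n_{1k} C_k^{P_c} \qquad (8)$$

$$H_v^{298} - 11.733 = \sum_{k \in G_1} n_{1k} C_k^{H_v} \qquad (9)$$

where $C_k^X$ denotes the contribution of first-order group k to property X. The enthalpy of vaporization at $T_e$ is estimated using the Watson relation and the estimates of critical temperature and standard enthalpy of vaporization (Reid et al., 1987):

$$H_v^{T_e} = H_v^{298} \left[ \frac{1 - T_r^{T_e}}{1 - T_r^{298}} \right]^{0.38} \qquad (10)$$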
where $T_r^{T_x}$ denotes the reduced temperature at temperature $T_x$. The liquid specific heat $C_{pl}^{T_m}$ is estimated using the group contribution method of Chueh and Swanson (1973). This method provides an estimate of the heat capacity at 293 K. Since the mean process temperature is 294 K, the heat capacity at 293 K can reasonably be used. The set of groups in the Chueh and Swanson method is different from the set of first order groups listed in Table 1 of Chapter IV. Chueh and Swanson groups will be denoted by bold italics. In order to describe the same set of molecules as with the Marrero and Gani groups, some of the Chueh and Swanson groups must be split, while others must be combined. For instance, the CHOH group of Chueh and Swanson is split into CH and OH in the Marrero and Gani set. Conversely, the CF group of Marrero and Gani does not exist in the Chueh and Swanson set, and can be obtained by combining C and F. Because of these differences between the two sets of groups, some corrections must be introduced in the group contribution formula. Thus, if a CH group is bonded to an OH group, the contribution these two groups make to the heat capacity should be equal to the contribution of CHOH (18.2 cal mol$^{-1}$ K$^{-1}$) and not to that of CH+OH (15.7 cal mol$^{-1}$ K$^{-1}$). Furthermore, the Chueh and Swanson approach requires the application of certain rules which modify the group contribution formula, such as a Cl contribution which depends on the number of Cl atoms bonded to a single carbon. The heat capacity (in cal mol$^{-1}$ K$^{-1}$) of any molecule built from the groups in Table 1, Chapter IV, is therefore given by
$$C_{pl}^{T_m} = \sum_{k \in G_1} n_{1k}\, C_{pl,k} + \sum_{o \in O} C^{O}_{pl,o} \tag{11}$$

where $C_{pl,k}$ is the contribution to the heat capacity for each group $k \in G_1$, $O$ is the set of additional rules which must be followed to calculate the heat capacity, and $C^{O}_{pl,o}$ is the contribution from the application of the $o$-th rule in set $O$. The relevant rules for this example are derived after the set of groups to be used in the design has been chosen. The vapor pressures at the evaporating temperature, $P_s^{T_e}$, and at the condensing temperature, $P_s^{T_{cd}}$, are estimated using the Pitzer expansion with the Ambrose and Walton coefficients (Poling et al., 2000)
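$$\ln\left(\frac{P_s^{T_x}}{P_c}\right) = f_0(T_r^{T_x}) + \omega f_1(T_r^{T_x}) + \omega^2 f_2(T_r^{T_x}) \tag{12}$$

which requires the calculation of the reduced temperatures $T_r^{T_x}$ at both temperatures, the ratio of boiling point to critical temperature $\theta$, and the acentric factor $\omega$, where

$$\omega = \frac{-\ln P_c + \left(5.97616(1-\theta) - 1.29874(1-\theta)^{1.5} + 0.60394(1-\theta)^{2.5} + 1.06841(1-\theta)^{5}\right)/\theta}{\left(-5.03365(1-\theta) + 1.11505(1-\theta)^{1.5} - 5.41217(1-\theta)^{2.5} - 7.46628(1-\theta)^{5}\right)/\theta} \tag{13}$$

$$\theta = \frac{T_b}{T_c} \tag{14}$$

$$\tau_x = 1 - T_r^{T_x} \tag{15}$$

$$f_0(T_r^{T_x}) = \left(-5.97616\,\tau_x + 1.29874\,\tau_x^{1.5} - 0.60394\,\tau_x^{2.5} - 1.06841\,\tau_x^{5}\right) / T_r^{T_x} \tag{16}$$

$$f_1(T_r^{T_x}) = \left(-5.03365\,\tau_x + 1.11505\,\tau_x^{1.5} - 5.41217\,\tau_x^{2.5} - 7.46628\,\tau_x^{5}\right) / T_r^{T_x} \tag{17}$$

$$f_2(T_r^{T_x}) = \left(-0.64771\,\tau_x + 2.41539\,\tau_x^{1.5} - 4.26979\,\tau_x^{2.5} + 3.25259\,\tau_x^{5}\right) / T_r^{T_x} \tag{18}$$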
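The vapor pressure estimate of Eqs. (12)-(18) can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's GAMS implementation. The function names are hypothetical, and the code assumes that $P_c$ enters the logarithm of Eq. (13) in atm (hence the division by 1.01325 when $P_c$ is supplied in bar), which is the convention of Poling et al. (2000). The evaporating temperature of 272 K in the usage lines is an assumed value used only for illustration.

```python
import math

# Coefficients of the Ambrose-Walton functions, Eqs. (16)-(18)
F0 = (-5.97616, 1.29874, -0.60394, -1.06841)
F1 = (-5.03365, 1.11505, -5.41217, -7.46628)
F2 = (-0.64771, 2.41539, -4.26979, 3.25259)


def _f(coeffs, tr):
    """Evaluate one Ambrose-Walton function at reduced temperature tr."""
    if not 0.0 < tr <= 1.0:
        raise ValueError("reduced temperature must lie in (0, 1]")
    a, b, c, d = coeffs
    tau = 1.0 - tr  # Eq. (15)
    return (a * tau + b * tau**1.5 + c * tau**2.5 + d * tau**5) / tr


def acentric_factor(tb, tc, pc_bar):
    """Eq. (13), evaluated at theta = Tb/Tc (Eq. (14))."""
    theta = tb / tc
    ln_pc = math.log(pc_bar / 1.01325)  # assumption: Pc in atm inside the log
    return -(ln_pc + _f(F0, theta)) / _f(F1, theta)


def vapor_pressure(t, tc, pc_bar, omega):
    """Eq. (12): saturated vapor pressure in bar at temperature t (K)."""
    tr = t / tc
    ln_pr = _f(F0, tr) + omega * _f(F1, tr) + omega**2 * _f(F2, tr)
    return pc_bar * math.exp(ln_pr)


# Usage with the estimated Tb, Tc and Pc of CH2F-CF3
omega = acentric_factor(220.5, 376.4, 36.5)
print(round(omega, 3))
print(round(vapor_pressure(272.0, 376.4, 36.5, omega), 1))
```

At $T = T_c$ all three functions vanish, so the sketch returns $P_s = P_c$, which is a quick sanity check on the coefficients.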
12.2.3 Choice of first order groups

The choice of first order groups is based on knowledge of the problem and on the availability of contribution parameters for the group contribution methods employed. Thus, the choice of first order groups is partly based on the functional groups present in currently available refrigerants, typically Cl or F containing groups. However, groups such as CCl2, CHF and CHF2 lack contribution parameters for some of the relevant properties and are therefore not included. The set of seventeen first order groups shown in Table 1 is selected to design acyclic refrigerants. The contributions for these groups in the method of Marrero and Gani (2001) are shown in Table 2.
Table 1: Set of first order groups used for the refrigerant design problem (with group number).

CH3 (1)     COOH (31)      CCl3 (113)    Br (128)
CH2 (2)     CH2Cl (108)    CH2F (114)    Cl (130)
CH (3)      CHCl (109)     CF (116)
C (4)       CCl (110)      CF2 (118)
OH (29)     CHCl2 (111)    CF3 (119)
Table 2: Parameters for the groups and properties used.

Group    Valence   C_Tb,k (4)   C_Tc,k (4)   C_Pc,k (4)   C_Hv,k (4)   C_pl,k (5)
CH3      1          0.8491       1.7506       0.018615      0.217        8.80
CH2      2          0.7141       1.3327       0.013547      4.910        7.26
CH       3          0.2925       0.5960       0.007259      7.962        5.00
C        4         -0.0671       0.0306       0.001219     10.730        1.76
OH       1          2.5670       5.2188      -0.005401     24.214       10.7
COOH     1          5.1108      14.6038       0.009885     17.002       19.1
CH2Cl    1          2.6364       6.2561       0.021419     11.754        0
CHCl     2          2.0246       4.3756       0.015640     12.048        0
CCl      3          1.7049       3.7063       0.009187     16.597        0
CHCl2    1          3.3420       7.8956       0.028236     17.251        0
CCl3     1          3.9093       8.8073       0.036746     20.550        0
CH2F     1          1.5022       3.3179       0.023315      8.238       11.26
CF       3          1.0084       2.1633      -0.010120      6.739        5.76
CF2      2          0.5142       0.8543       0.018572      1.621        9.76
CF3      1          1.1916       1.7737       0.048565      7.352       13.76
Br       1          2.4231       4.5036      -0.001460      9.888        9.0
Cl       1          1.5147       4.0947       0.007923      2.107 (6)    0

(4) From Table 6 of Marrero and Gani (2001). $H_v$ contribution in kJ/mol.
(5) From the Chueh and Swanson method (1973), adapted to this set of first order groups, in cal mol$^{-1}$ K$^{-1}$.
(6) No value for this parameter was provided in Marrero and Gani (2001). This value was regressed using a set of compounds containing the Cl group.

The search space is restricted to single molecules constructed from this set of first order groups. Furthermore, the size of the molecule is limited to 5 first order groups ($N_{max} = 5$). This limit is suitable for the refrigerant problem since molecules with a higher molecular weight do not have vapor pressures in the suitable range for refrigerants (Churi and Achenie, 1996).
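As an illustration of how the Table 2 contributions are used, the following Python sketch estimates the pure-component properties of a small molecule from its first order groups. It is a hedged sketch under stated assumptions rather than the chapter's implementation: the function name `estimate` is hypothetical; the universal constants 222.543 K, 231.239 K, 5.9827 bar and 0.108998 bar$^{-0.5}$ are assumed here to be those of the Marrero and Gani correlations (only 11.733 kJ/mol appears in this section); the Watson exponent is taken as 0.38; and the evaporating temperature of 272 K is an assumed value. No rule corrections to the heat capacity are applied, which is adequate only for molecules without Cl or OH groups.

```python
import math

# Table 2 contributions for the groups used below:
# (C_Tb, C_Tc, C_Pc, C_Hv in kJ/mol, C_pl in cal/(mol K))
CONTRIB = {
    "CH3":  (0.8491, 1.7506, 0.018615, 0.217, 8.80),
    "CH2F": (1.5022, 3.3179, 0.023315, 8.238, 11.26),
    "CF3":  (1.1916, 1.7737, 0.048565, 7.352, 13.76),
    "Br":   (2.4231, 4.5036, -0.001460, 9.888, 9.0),
}

# Universal constants (assumed values, see the text above)
TB0, TC0 = 222.543, 231.239   # K
PC1, PC2 = 5.9827, 0.108998   # bar, bar^-0.5
HV0 = 11.733                  # kJ/mol


def estimate(groups, t_evap=272.0):
    """Estimate properties of a molecule given as a list of group names."""
    s_tb, s_tc, s_pc, s_hv, s_cpl = (
        sum(CONTRIB[g][p] for g in groups) for p in range(5)
    )
    tc = TC0 * math.log(s_tc)
    hv298 = HV0 + s_hv
    # Watson relation, Eq. (10), with exponent 0.38
    hv_te = hv298 * ((1.0 - t_evap / tc) / (1.0 - 298.0 / tc)) ** 0.38
    return {
        "Tb": TB0 * math.log(s_tb),       # K
        "Tc": tc,                         # K
        "Pc": PC1 + (s_pc + PC2) ** -2,   # bar
        "Hv298": hv298,                   # kJ/mol
        "Hv_Te": hv_te,                   # kJ/mol
        "Cpl": s_cpl,                     # cal/(mol K), Eq. (11) without rules
    }


for name, groups in [
    ("CH2F-CH2F", ["CH2F", "CH2F"]),
    ("CH3Br", ["CH3", "Br"]),
    ("CH2F-CF3", ["CH2F", "CF3"]),
]:
    print(name, {k: round(v, 1) for k, v in estimate(groups).items()})
```

Under these assumptions the sketch gives $T_b \approx 220.5$ K, $T_c \approx 376.4$ K and $P_c \approx 36.5$ bar for CH2F-CF3.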
12.2.4 Forbidden bond and other specific constraints

The forbidden bonds for this set can be identified following the systematic strategy described in Section 4.3. All the first order groups belong to the set of "standard groups". Thus, rule 3 is the only rule which may be violated by some of the possible bonds. Group Cl should not be used to form CH2Cl (CH2 and Cl), CHCl (CH and Cl), CCl (C and Cl), CHCl2 (CHCl and Cl), and CCl3 (CCl and 2 Cl groups). The following constraints are therefore imposed:

$$\sum_{i,j} y_{(i,\mathrm{CH_2},a),(j,\mathrm{Cl},a)} = 0 \tag{19}$$

$$\sum_{i,j} y_{(i,\mathrm{CH},a),(j,\mathrm{Cl},a)} = 0 \tag{20}$$

$$\sum_{i,j} y_{(i,\mathrm{C},a),(j,\mathrm{Cl},a)} = 0 \tag{21}$$

$$\sum_{i,j} y_{(i,\mathrm{CHCl},a),(j,\mathrm{Cl},a)} = 0 \tag{22}$$

$$\sum_{j_1 \in V} \sum_{j_2 \in V} \left( y_{(i,\mathrm{CCl},a),(j_1,\mathrm{Cl},a)} + y_{(i,\mathrm{CCl},a),(j_2,\mathrm{Cl},a)} \right) \le 1, \quad \forall i \in V \tag{23}$$
Furthermore, it is desirable to prevent the formation of CFCs by the simultaneous presence of chlorinated groups and fluorinated groups. We define $G_{Cl}$ = {CH2Cl, CHCl, CCl, CHCl2, CCl3, Cl}, the set of chlorinated groups, and $G_F$ = {CH2F, CF, CF2, CF3}, the set of fluorinated groups. We define a binary variable $\zeta_{Cl}$ such that

$$\zeta_{Cl} = \begin{cases} 1, & \text{if there is at least one group in } G_{Cl} \text{ in the compound} \\ 0, & \text{otherwise} \end{cases}$$

Then,

$$u_{i,k} \le \zeta_{Cl}, \quad \forall i \in V, \ \forall k \in G_{Cl} \tag{24}$$

$$\zeta_{Cl} \le \sum_{k \in G_{Cl}} \sum_{i=1}^{5} u_{i,k} \tag{25}$$

$$\zeta_{Cl} + u_{i,k} \le 1, \quad \forall k \in G_F, \ \forall i \in V \tag{26}$$
To calculate the heat capacity, Table 1 is used to identify the relevant groups in the Chueh and Swanson set. They are CH3, CH2, CH, C, COOH, CH2OH, CHOH, COH, OH, Cl, Br, F. The contributions $C_{pl,k}$ for every group $k \in G_1$ are derived from the contributions of the Chueh and Swanson groups and are listed in Table 2. The following rules must be applied:

1. For every Cl group linked to a given carbon group (CH3, CH2, CH, C, CH2=CH, COOH, CH2Cl, CHCl, CCl, CHCl2, CCl3, CH2F, CF, CF2, CF3), add 8.6 cal mol$^{-1}$ K$^{-1}$.
2. For every third or fourth Cl group linked to a single carbon group (CH, C, CHCl, CCl, CHCl2, CCl3, CF), add -2.6 cal mol$^{-1}$ K$^{-1}$.
3. If a given CH2, CH2Cl or CH2F is bonded to at least one OH, add -0.46 cal mol$^{-1}$ K$^{-1}$.
4. If a given CH, CHCl or CHCl2 is bonded to at least one OH, add 2.5 cal mol$^{-1}$ K$^{-1}$.
5. If a given C, CCl, CCl3, CF, CF2 or CF3 group is bonded to at least one OH group, add 14.14 cal mol$^{-1}$ K$^{-1}$.

Rules 1 to 2 are equivalent to the rules of the original method of Chueh and Swanson but apply to the Marrero and Gani groups. Rules 3 to 5 have been added to correct for the differences between the two sets of groups. The mathematical equivalents of these rules are given by the following constraints.

Rule 1 - For every Cl group linked to a given carbon group (CH3, CH2, CH, C, CH2=CH, COOH, CH2Cl, CHCl, CCl, CHCl2, CCl3, CH2F, CF, CF2, CF3), add 8.6 cal mol$^{-1}$ K$^{-1}$.

We introduce a new variable $\mu_{Cl,i}$ which denotes the number of Cl groups linked to a carbon atom at vertex $i$, $i \in V$:

$$\begin{aligned}
\mu_{Cl,i} = {} & \sum_{j} \Big( y_{(i,\mathrm{CH_3},a),(j,\mathrm{Cl},a)} + y_{(i,\mathrm{CH_2},a),(j,\mathrm{Cl},a)} + y_{(i,\mathrm{CH},a),(j,\mathrm{Cl},a)} + y_{(i,\mathrm{C},a),(j,\mathrm{Cl},a)} \\
& + y_{(i,\mathrm{COOH},a),(j,\mathrm{Cl},a)} + y_{(i,\mathrm{CH_2Cl},a),(j,\mathrm{Cl},a)} + y_{(i,\mathrm{CHCl},a),(j,\mathrm{Cl},a)} + y_{(i,\mathrm{CCl},a),(j,\mathrm{Cl},a)} \\
& + y_{(i,\mathrm{CHCl_2},a),(j,\mathrm{Cl},a)} + y_{(i,\mathrm{CCl_3},a),(j,\mathrm{Cl},a)} + y_{(i,\mathrm{CH_2F},a),(j,\mathrm{Cl},a)} + y_{(i,\mathrm{CF},a),(j,\mathrm{Cl},a)} \\
& + y_{(i,\mathrm{CF_2},a),(j,\mathrm{Cl},a)} + y_{(i,\mathrm{CF_3},a),(j,\mathrm{Cl},a)} \Big) \\
& + u_{i,\mathrm{CH_2Cl}} + u_{i,\mathrm{CHCl}} + u_{i,\mathrm{CCl}} + 2u_{i,\mathrm{CHCl_2}} + 3u_{i,\mathrm{CCl_3}}, \quad \forall i \in V
\end{aligned} \tag{27}$$

where the $u_{i,k}$ variables, which denote the existence of group $k$ at vertex $i$, are used to count the Cl atoms which appear in carbon containing first order groups. The contribution from rule 1 is

$$C^{O}_{pl,1} = 8.6 \sum_{i \in V} \mu_{Cl,i} \tag{28}$$
Rule 2 - For every third or fourth Cl group linked to a single carbon group (CH, C, CHCl, CCl, CHCl2, CCl3, CF), add -2.6 cal mol$^{-1}$ K$^{-1}$.

We introduce two new binary variables to identify the presence of a third and fourth Cl atom linked to a given vertex $i$:

$$p3_i = \begin{cases} 1, & \text{if } \mu_{Cl,i} \ge 3 \\ 0, & \text{otherwise} \end{cases} \qquad p4_i = \begin{cases} 1, & \text{if } \mu_{Cl,i} = 4 \\ 0, & \text{otherwise} \end{cases} \tag{29}$$

The values of $p3_i$ and $p4_i$ are set through the following constraints:

$$\begin{aligned}
\mu_{Cl,i} - 2.5 &\le 2.5\, p3_i \le \mu_{Cl,i}, \quad \forall i \in V \\
\mu_{Cl,i} - 3.5 &\le 3.5\, p4_i \le \mu_{Cl,i}, \quad \forall i \in V
\end{aligned} \tag{30}$$

Then, the contribution for rule 2 is given by

$$C^{O}_{pl,2} = -2.6 \sum_{i} \left( p3_i + p4_i \right) \tag{31}$$
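That the linear constraints (30) pin the binary variables to the values defined in (29) can be checked by brute force for every possible chlorine count at a vertex. The short Python sketch below does this; the function name is hypothetical and the check is only an illustration of the logic, not part of the MINLP model itself.

```python
def feasible_p3_p4(mu):
    """Return every (p3, p4) in {0,1}^2 that satisfies constraints (30)."""
    return [
        (p3, p4)
        for p3 in (0, 1)
        for p4 in (0, 1)
        if mu - 2.5 <= 2.5 * p3 <= mu and mu - 3.5 <= 3.5 * p4 <= mu
    ]


for mu in range(5):
    pairs = feasible_p3_p4(mu)
    # Exactly one feasible pair exists for each chlorine count
    p3, p4 = pairs[0]
    rule1 = 8.6 * mu            # Eq. (28), contribution of this vertex
    rule2 = -2.6 * (p3 + p4)    # Eq. (31), contribution of this vertex
    print(mu, pairs, round(rule1 + rule2, 1))
```

For three Cl atoms on one carbon the combined contribution is 2 x 8.6 + 6.0 = 23.2 cal mol$^{-1}$ K$^{-1}$, that is, the third Cl atom effectively contributes 8.6 - 2.6 = 6.0.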
Rule 3 - If a given CH2, CH2Cl or CH2F group is bonded to at least one OH group, add -0.46 cal mol$^{-1}$ K$^{-1}$.

We introduce the binary variable $\zeta_{OH,i,k}$ such that

$$\zeta_{OH,i,k} = \begin{cases} 1, & \text{if there is an OH group linked to group } k \text{ at vertex } i \\ 0, & \text{otherwise} \end{cases} \tag{32}$$

for all $i \in V$ and for all $k \in$ {CH, CHCl, CH2, C, CCl, CF, CF2}. Then,

$$\zeta_{OH,i,k} \le \sum_{j} y_{(i,k,a),(j,\mathrm{OH},a)}, \quad \forall k \in \{\mathrm{CH, CHCl, CH_2, C, CCl, CF, CF_2}\}, \ \forall i \in V \tag{33}$$

The contribution due to rule 3 is then

$$C^{O}_{pl,3} = -0.46 \sum_{i} \Big( \zeta_{OH,i,\mathrm{CH_2}} + \sum_{j} \big( y_{(i,\mathrm{CH_2Cl},a),(j,\mathrm{OH},a)} + y_{(i,\mathrm{CH_2F},a),(j,\mathrm{OH},a)} \big) \Big) \tag{34}$$
Note that since CH2Cl and CH2F can be bonded with at most one OH group, there is no need to define a $\zeta_{OH,i,k}$ variable for these groups.

Rule 4 - If a given CH, CHCl or CHCl2 is bonded to at least one OH, add 2.5 cal mol$^{-1}$ K$^{-1}$.
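$$C^{O}_{pl,4} = 2.5 \sum_{i} \Big( \zeta_{OH,i,\mathrm{CH}} + \zeta_{OH,i,\mathrm{CHCl}} + \sum_{j} y_{(i,\mathrm{CHCl_2},a),(j,\mathrm{OH},a)} \Big) \tag{35}$$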
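Rule 5 - If a given C, CCl, CCl3, CF, CF2, CF3 is bonded to at least one OH, add 14.14 cal mol$^{-1}$ K$^{-1}$.

$$C^{O}_{pl,5} = 14.14 \sum_{i} \Big( \zeta_{OH,i,\mathrm{C}} + \zeta_{OH,i,\mathrm{CCl}} + \zeta_{OH,i,\mathrm{CF}} + \zeta_{OH,i,\mathrm{CF_2}} + \sum_{j} \big( y_{(i,\mathrm{CCl_3},a),(j,\mathrm{OH},a)} + y_{(i,\mathrm{CF_3},a),(j,\mathrm{OH},a)} \big) \Big) \tag{36}$$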
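The effect of rules 3 to 5 can be illustrated on explicit molecular graphs. The Python sketch below is a simplified illustration, not the constraint formulation above: it evaluates the heat capacity directly from a list of groups and bonds, covers only chlorine-free groups (so rules 1 and 2 are not needed), and uses hypothetical names (`heat_capacity`, `OH_RULE`). The group contributions are those of Table 2.

```python
# C_pl,k contributions from Table 2, cal/(mol K), chlorine-free groups only
CPL = {"CH3": 8.80, "CH2": 7.26, "CH": 5.00, "C": 1.76, "OH": 10.7,
       "CH2F": 11.26, "CF": 5.76, "CF2": 9.76, "CF3": 13.76}

# Correction added once per carbon group bonded to at least one OH group
OH_RULE = {
    "CH2": -0.46, "CH2F": -0.46,                        # rule 3
    "CH": 2.5,                                          # rule 4
    "C": 14.14, "CF": 14.14, "CF2": 14.14, "CF3": 14.14,  # rule 5
}


def heat_capacity(groups, bonds):
    """Liquid heat capacity at 293 K for a chlorine-free molecule.

    groups: list of group names, one per vertex
    bonds:  list of (i, j) vertex index pairs
    """
    total = sum(CPL[g] for g in groups)
    for i, g in enumerate(groups):
        if g not in OH_RULE:
            continue
        neighbours = [b if a == i else a for a, b in bonds if i in (a, b)]
        if any(groups[n] == "OH" for n in neighbours):
            total += OH_RULE[g]
    return total


# CH bonded to OH recovers the CHOH contribution of Chueh and Swanson
print(round(heat_capacity(["CH", "OH"], [(0, 1)]), 2))
# Ethanol: CH3-CH2-OH
print(round(heat_capacity(["CH3", "CH2", "OH"], [(0, 1), (1, 2)]), 2))
```

The first call returns 18.2 cal mol$^{-1}$ K$^{-1}$, the CHOH value quoted earlier, rather than the uncorrected CH+OH sum of 15.7.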
12.2.5 Summary of formulation

The formulation involves the following sets and indices:

G = {CH3, CH2, CH, C, OH, COOH, CH2Cl, CHCl, CCl, CHCl2, CCl3, CH2F, CF, CF2, CF3, Br, Cl} - Indices: $k, kk \in G$
$G_{Cl}$ = {CH2Cl, CHCl, CCl, CHCl2, CCl3, Cl}
$G_F$ = {CH2F, CF, CF2, CF3}
V = {1,2,3,4,5} - Indices: $i, j, j_1, j_2 \in V$
O = {1,2,3,4,5} - Indices: $o \in O$

The following variables are defined:

• $\zeta_{Cl} \in \{0,1\}$: whether there is a Cl atom in the molecule
• $u_{i,k} \in \{0,1\}$, $k \in G$, $i \in \{1,...,5\}$: which group is found at which vertex
• $y_{(i,k,a),(j,kk,a)} \in \{0,1\}$, $k \in G$, $kk \in G$, $i \in \{1,...,5\}$, $j \in \{1,...,5\}$: vertex adjacency matrix
• $\zeta_{OH,i,k} \in \{0,1\}$, $k \in$ {CH, CHCl, CH2, C, CCl, CF, CF2}, $i \in \{1,...,5\}$: whether an OH group is linked to $(i,k)$
• $p3_i, p4_i \in \{0,1\}$, $i \in \{1,...,5\}$: whether 3 or 4 Cl atoms are bonded to the carbon at a given vertex
• $H_v^{T_e}, H_v^{298}, C_{pl}^{T_m}, P_s^{T_e}, P_s^{T_{cd}}, P_c, T_r^{T_e}, T_r^{T_{cd}}, T_r^{298}, T_c, T_b, \theta, \omega \in \mathbb{R}$: properties of the molecule
• $C^{O}_{pl,o} \in \mathbb{R}$, $o \in \{1,...,7\}$: heat capacity contributions
• $\mu_{Cl,i} \in \mathbb{R}$, $i \in \{1,...,5\}$: number of Cl atoms linked to a given vertex
• $n_{1k} \in \mathbb{N}$, $k \in G$: number of groups of a given type in the molecule
12.3 PROBLEM SOLUTION

The GAMS interface to DICOPT (DIscrete and COntinuous OPTimizer) is used to solve the resulting mixed-integer nonlinear programming (MINLP) problem (GAMS). Table 3 shows the molecules obtained for the problem described above. The compounds, along with their design property estimates, are listed in decreasing order of optimality. Integer cuts were used to identify different candidates (Floudas, 1995). The other property estimates for the molecules obtained are listed in Table 4 and the corresponding experimental values are shown in Table 5. Table 6 shows that, on the basis of available data, the largest discrepancies occur in the prediction of saturated vapor pressure, which is systematically underestimated. This is largely a result of inaccuracies in the critical properties, and hence in the acentric factor (negative in two cases) and the reduced temperature values. This level of error is not representative of the overall accuracy of the group contribution methods used, but can be expected due to the small size of the molecules considered. This problem may be circumvented by relaxing the pressure constraints to account for uncertainty in the predictions, as suggested by Duvedi and Achenie (1996), for example.
Table 3: Best candidates for the acyclic refrigerant problem, with their design property estimates.

Molecule             Objective        H_v^Te       C_pl^Tm            P_s^Te   P_s^Tcd
                     function value   (kJ mol-1)   (cal mol-1 K-1)    (bar)    (bar)
CH2F-CH2F (R152)     0.748            30.1         22.5               2.4      6.9
CH3Br (R40B1)        0.760            23.5         17.8               1.4      6.7
CH2F-CF3 (R134a)     0.821            30.5         25.0               5.1      13.3

Table 4: Property estimates for the candidate refrigerants.

Molecule     Tb (K)   Tc (K)   Pc (bar)   H_v^298 (kJ mol-1)   ω        θ
CH2F-CH2F    244.8    437.6    47.3       28.2                 -0.073   0.559
CH3Br        263.8    423.9    68.8       21.8                  0.280   0.622
CH2F-CF3     220.5    376.4    36.5       27.3                 -0.043   0.586
Table 5: Experimental values of the candidate properties. All data from SMSWIN unless otherwise indicated; § from Afeefy et al. (2001), † from Langley (1995).

Molecule     Tb (K)   Tc (K)   Pc (bar)   H_v^298 (kJ mol-1)   ω       θ
CH2F-CH2F    283.7    445.0    42.8       ***                  0.222   0.638
CH3Br        276.7    467.0    80.0       23†                  0.192   0.582
CH2F-CF3     246.7    374.3    40.1       18§                  0.327   0.659
Table 6: Best candidates for the acyclic refrigerant problem, with design property estimates calculated from the experimental values in Table 5 using Eqs (10) and (12)-(18).

Molecule     H_v^Te       C_pl^Tm            P_s^Te   P_s^Tcd
             (kJ mol-1)   (cal mol-1 K-1)    (bar)    (bar)
CH2F-CH2F    ***          ***                0.6      2.9
CH3Br        24           ***                0.8      3.8
CH2F-CF3     20           ***                2.8      10.8
The best molecule found, 1,2-difluoroethane, (R152), is known to be toxic. The second best molecule, methyl bromide, was officially listed as an ozone depleting substance in 1992. Finally, the third best molecule, 1,1,1,2-tetrafluoroethane (R134a) has been extensively used as a refrigerant replacement for freon 12, and it passes both ozone depletion and toxicity tests. It should however be noted that it is related to another environmental problem, namely global warming.
12.3 REFERENCES

[1] H.Y. Afeefy, J.F. Liebman, and S.E. Stein, Neutral Thermochemical Data in NIST Chemistry WebBook, NIST Standard Reference Database Number 69, Eds. P.J. Linstrom and W.G. Mallard, National Institute of Standards and Technology, Gaithersburg MD, (http://webbook.nist.gov) (2001).
[2] C.F. Chueh and A.C. Swanson, Estimation of liquid heat capacity, Can. J. Chem. Eng., 51 (1973) 596.
[3] N. Churi and L.E.K. Achenie, Novel mathematical programming model for computer aided molecular design, Ind. Eng. Chem. Res., 35 (1996) 3788.
[4] M.A. Duran and I.E. Grossmann, An Outer-Approximation algorithm for a class of mixed-integer nonlinear programs, Math. Prog., 36 (1986) 307.
[5] A. Duvedi and L. Achenie, Designing environmentally safe refrigerants using mathematical programming, Chem. Eng. Sci., 15 (1996) 3727.
[6] C.A. Floudas, Nonlinear and mixed-integer optimization: Fundamentals and applications, Oxford University Press, Oxford (1995).
[7] GAMS, Generalized Algebraic Modeling System, www.gams.com.
[8] R. Gani, B. Nielsen and A. Fredenslund, A group contribution approach to computer-aided molecular design, AIChE J., 37 (1991) 1318.
[9] K.G. Joback, Designing molecules possessing desired physical property values, Ph.D. thesis, MIT, Cambridge (1987).
[10] K.G. Joback and G. Stephanopoulos, Designing molecules possessing desired physical property values, Foundations of Computer Aided Process Design, (1989) 363.
[11] B.C. Langley, Fundamentals of refrigeration, Delmar Publishers, Albany, N.Y. (1995).
[12] J. Marrero and R. Gani, Group-contribution based estimation of pure component properties, Fluid Phase Eq., 183-184 (2001) 183.
[13] B.E. Poling, J.M. Prausnitz and J.P. O'Connell, The properties of gases and liquids, McGraw-Hill, New York, 5th edition (2000).
[14] R.C. Reid, J.M. Prausnitz and B.E. Poling, The properties of gases and liquids, McGraw-Hill, New York, 4th edition (1987).
[15] N.V. Sahinidis and M. Tawarmalani, Applications of global optimization to process and molecular design, Comp. Chem. Eng., 24 (2000) 2157.
[16] M. Sinha, L.E.K. Achenie and G.M. Ostrovsky, Environmentally benign solvent design by global optimization, Comp. Chem. Eng., 23 (1999) 1381.
[17] SMSWIN, Computer-Aided Process Engineering Center, CAPEC, Technical University of Denmark, http://www.capec.kt.dtu.dk/main/default.htm
Chapter 13: Polymer Design Case Study

P. R. Patkar & V. Venkatasubramanian

12.4 INTRODUCTION

A background of genetic algorithms (GAs) was presented in Chapter 5 of the book and the adaptation of GAs for the problem of computer-aided molecular design was discussed. The framework for evolutionary molecular design using GAs proposed by Venkatasubramanian and co-workers [1] was presented in detail. The utility of the framework was illustrated by means of two example problems in polymer design. The case study problems demonstrated the success of the genetic design approach in locating optimal designs for the desired target constraints. The ability of GAs to discover a diverse population of near-optimal designs was also highlighted. The sample problems discussed in Chapter 5 were relatively simple in terms of both the enforced target constraints and the combinatorial complexity. In the discussion that follows, a bigger polymer design problem is presented, from work done by Venkatasubramanian et al. [2]. The objective of considering the bigger problem is two-fold: first, to investigate the efficacy of the genetic design system for problems with much larger and more complex design spaces, and second, to describe the extension of the original GA framework by incorporating higher-level chemical knowledge to enable better handling of constraints such as chemical stability and molecular complexity. In the sections that follow, the large-scale polymer design problem is first introduced. The details of the GA implementation are omitted since they are almost the same as those discussed for the case studies in Chapter 5. Results for the standard as well as for the knowledge-augmented genetic design framework are presented. Then some aspects concerning parametric sensitivity and robustness of GAs are discussed. Finally, conclusions are offered based on the results of the study.
13.1 POLYMER DESIGN CASE STUDY

Chapter 5 presented two polymer design case studies from investigations by Venkatasubramanian et al. [1]. The genetic algorithm was required to design the polymer repeat structure given certain macroscopic property values. The property values of known polymers, namely (i) polyethylene terephthalate (PET), (ii) polyvinylidene propylene copolymer (PVP), and (iii) polycarbonate of bisphenol-A (PC), were used. The palette of groups for the search was relatively small and involved only four mainchain groups (>C<, -C6H4-, -C=OO-, -O-) and four sidechain groups (-H, -CH3, -F, -Cl). To recapitulate the results, the genetic search discovered all three target polymers in a fraction of the 200 total generations allowed, for all design lengths (maximum repeat structure length, L=2-7 and L=2-10) and for all initial population conditions (random mainchain and sidechain groups, and -CH2- only). For instance, the average generation number for locating the design target first and the success rate (in parentheses) for an initial population initiated with random mainchain and sidechain groups having lengths 2-7 were: (i) PET 11.3 generations (100%), (ii) PVP 28.2 generations (100%), and (iii) PC 41.0 generations (100%). The GA was also able to determine many high-fitness alternate structures.
[Figure 1: structural drawings of the mainchain groups and sidechain groups in the extended palette]

Fig. 1. Extended palette of base groups for the design case study
For the present case study, taken from Venkatasubramanian et al. [2], the design problem was made much larger and the search space more complex by increasing the base group choices to 17 mainchain and 15 sidechain groups. The extended palette of base groups is shown in Fig. 1. In the smaller problem, when the base groups consisted of four mainchain and four sidechain groups, the total number of design candidates was about $1.4 \times 10^{5}$. With the increased number of mainchain and sidechain groups, the search space grew to $1.1 \times 10^{13}$ candidates considering design lengths of 2 to 7. Thus, the search space was about 100 million times larger than that in the earlier study. Also, the number of target polymers evaluated was increased from three in the previous study to nine, as shown in Table 1. The search space was further complicated by the increased number of nonlinear group interactions. For example, for polymer design target 4, the nonlinear van Krevelen group interactions required that every mainchain group, other than the -O- endgroup, and every sidechain group be in its proper position in order to give the optimal fitness of 1. That is, the macroscopic properties depended not only on the group types but also on their exact ordering in the target molecule.
Table 1. Target polymers and their properties

[Repeat-unit structure drawings for TP1 to TP9]

Target     ρ         Tg       α              Cp           K
polymer    (g/cm3)   (K)      (10^-4 K^-1)   (J/(kg K))   (10^9 N/m2)
TP1        1.34      350.8    2.96           1152.67      5.18
TP2        1.18      225.2    2.81           1377.82      2.51
TP3        1.21      420.8    2.90           1135.10      5.40
TP4        1.19      406.8    2.90           1073.96      5.39
TP5        1.28      472.0    2.89            995.95      5.31
TP6        1.25      421.1    2.90           1016.55      6.12
TP7        1.06      322.3    2.98           1455.90      3.85
TP8        1.27      322.1    2.81           1152.67      3.42
TP9        1.09      428.7    2.77           1163.10      4.12

ρ = density, Tg = glass transition temperature, α = thermal expansion coefficient, Cp = specific heat capacity, K = bulk modulus

The number of property constraints was the same as before, at five, and included the following properties: density, glass transition temperature, thermal expansion coefficient, specific heat capacity and bulk modulus. Predicted values of these physical properties for a given molecular structure were calculated by the van Krevelen [3] group contribution methods. The second aspect of the case study involved the incorporation of higher-level chemical knowledge, which is discussed next.
13.1.1 Incorporation of high-level knowledge: Molecular Stability

Higher-level chemical knowledge was incorporated to steer the search towards more chemically realistic and stable polymers. For example, it is commonly known that certain group combinations such as -O-O-O- and -OC=OC=O- lead to chemically unstable structures and are therefore undesirable in candidate solutions presented by the design system. When no such higher-level knowledge was included in the GA, such group combinations were often found in many high-fitness polymers in the smaller case study [1]. Another example of a practical constraint on a design system is environmental acceptability. Certain molecular groups or group combinations are known to be environmentally toxic or unacceptable. This is a common problem in the design of agrochemicals such as fertilizers and pesticides, as well as refrigerants. Yet another important consideration would be the relative ease or difficulty involved in the synthesis or manufacture of the proposed design candidates. It is important to be able to incorporate all such constraints in the design process. In the current study, only stability and molecular complexity constraints were addressed. In the knowledge-augmented GA framework, chromosomes with unstable mainchain group combinations were assigned zero fitness. As a result of natural selection, such solutions were automatically weeded out of the design process and thereby removed from any further consideration. The knowledge incorporated into the algorithm about the stability of nearest neighbor mainchain groups was drawn from Barton and Ollis [4].
13.1.2 Molecular Complexity

Molecular complexity is encoded as a count of the total number of mainchain and sidechain groups and is given by the following equations [5, 6, 7]:

$$F'(x) = F(x) - \beta \times \mathrm{Sig} \times \mathrm{Complexity} \tag{1}$$

$$\mathrm{Sig} = \frac{2}{1 + \exp\left[-\gamma \{F - F_{crit}\}\right]} \tag{2}$$

$$\mathrm{Complexity} = \frac{MC + SC}{MC_{max} + SC_{max}} \tag{3}$$

where $F$ is the fitness value, $\beta$ is a penalty scaling factor, Sig is a sigmoidal fitness function, given by equation (2), that provides a fitness threshold, $F_{crit}$, for the genetic algorithm to start penalizing complex designs, and $\gamma$ is a decay scaling parameter. The complexity measure, given by equation (3), ranges from 0 to 1 and is given by the ratio of the number of mainchain (MC) and sidechain (SC) units in the current design to the maximum allowable mainchain and sidechain units (32 in this case). Thus, the complexity of a polymer repeat structure is viewed in terms of its 'size' as given by the number of units in the repeat structure. The smaller the molecule, the lower is its complexity. In order to encourage the favoring of simple molecules over more complex ones of comparable fitness, a penalty was applied to the fitness. All molecules having fitness values greater than the threshold $F_{crit}$ were penalized as given by equation (1) in direct proportion to their complexity.
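The complexity penalty of equations (1) to (3) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' code: the function names are hypothetical, the numerator of the sigmoid is taken as 2 as printed in equation (2), and the default parameter values ($\beta$ = 0.10, $\gamma$ = 100, $F_{crit}$ = 0.99, 32 maximum units) are those quoted in this chapter.

```python
import math


def complexity(mc, sc, max_units=32):
    """Equation (3): fraction of the maximum allowable number of units."""
    return (mc + sc) / max_units


def penalized_fitness(f, mc, sc, beta=0.10, gamma=100.0, f_crit=0.99,
                      max_units=32):
    """Equation (1): fitness after the complexity penalty is applied."""
    # Equation (2); close to 0 well below f_crit, equal to 1 at f_crit
    sig = 2.0 / (1.0 + math.exp(-gamma * (f - f_crit)))
    return f - beta * sig * complexity(mc, sc, max_units)


# Two designs of equal raw fitness: the smaller one keeps more fitness
small = penalized_fitness(1.0, mc=4, sc=4)
large = penalized_fitness(1.0, mc=8, sc=8)
print(small > large)

# Far below the threshold the penalty is negligible
print(abs(penalized_fitness(0.5, mc=8, sc=8) - 0.5) < 1e-9)
```

Because the sigmoid is nearly zero for fitness values well below $F_{crit}$, the penalty only starts to discriminate between designs once they are already close to the target.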
13.2 GA BASED SEARCH

The evolutionary search approach based on GAs has already been discussed in detail in Chapter 5. The same framework was adopted for the larger polymer design problem. Slight modifications had to be made to handle the constraints arising out of molecular stability and complexity or maximum molecular length. These constraints were handled via suitable modification of the fitness function. A penalty was assigned to the overall fitness for design candidates that violated the defined constraints. The penalized fitness function used for this purpose can be expressed as [8]:

$$F'(x) = F(x) + \varepsilon\, \eta \sum_{i=1}^{P} \varphi_i \tag{4}$$

where $P$ is the total number of constraints, $\eta$ is a penalty coefficient, $\varepsilon$ is -1 for maximization and +1 for minimization problems, and $\varphi_i$ is a penalty related to the $i$-th constraint. As mentioned before, the penalty was very severe for violation of stability constraints. Chromosomes infeasible with respect to stability were directly assigned zero fitness.
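A compact way to read equation (4) together with the zero-fitness rule for unstable chromosomes is sketched below in Python. The function name, the `stable` flag and the example penalty values are hypothetical and serve only to illustrate the sign convention; the value of the penalty coefficient $\eta$ is not given in this section and is an assumption here.

```python
def constrained_fitness(f, penalties, eta=1.0, maximize=True, stable=True):
    """Equation (4), with zero fitness for stability-infeasible designs.

    f:         unpenalized fitness F(x)
    penalties: non-negative penalty terms, one per violated constraint
    eta:       penalty coefficient (assumed value)
    """
    if not stable:
        return 0.0  # unstable mainchain combinations are discarded outright
    eps = -1.0 if maximize else 1.0
    return f + eps * eta * sum(penalties)


# Maximization: penalties reduce the fitness
print(round(constrained_fitness(0.9, [0.1, 0.2], eta=0.5), 3))
# An unstable chromosome always receives zero fitness
print(constrained_fitness(0.9, [], stable=False))
```

For a maximization problem the penalties are subtracted, so a design that violates constraints can never outrank an otherwise identical feasible design.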
The parameter values used for the search are given in Table 2. The design lengths varied from two base group units to a maximum of two units more than the polymer design target. The fitness function gain, $\alpha$, was equal to 0.001. The parameters for equations (1), (2) and (3) were as follows: $F_{crit}$ = 0.99, which resulted in applying the complexity measure only after near optimal solutions were attained; $\gamma$ = 100, which provided a gradual activation of the complexity measure as the fitness approached the critical value; and $\beta$ = 0.10, so that a large penalty reduced the overall design fitness to a point where the genetic algorithm considered the design to be unworthy of further consideration. For statistical significance, results were compiled after 25 runs of 1000 generations each. The genetic design investigations carried out were subdivided into the following scenarios: (i) standard genetic design, (ii) knowledge-augmented genetic design, which penalized unstable mainchain group combinations, and (iii) knowledge-augmented genetic design, which penalized unstable mainchain group combinations and molecular complexity.

Table 2: GA Parameters

Parameter                                             Value
Steady state population                               100
Number of generations                                 1000
Gaussian fitness decay rate (α)                       0.001
Complexity sigmoid gain (β)                           0.1
Complexity penalty (γ)                                100
Maximum polymer length                                Target length + 2
Elitist retention with respect to population size     10%

Genetic operator probabilities:
Crossover                                             0.2
Backbone mutation                                     0.2
Sidechain mutation                                    0.2
Hop                                                   0.2
Deletion                                              0.1
Blending                                              0.1
Insertion                                             0.0

13.3 RESULTS AND DISCUSSION
The results for the different genetic design cases are presented in Table 3. The results are arranged in the following manner. The rows labeled part (a) give the percent success rate in achieving the design objective and the number of successful runs (in parentheses) for each target. Part (b) presents the average generation when the target was first located. The rows labeled part (c) show the average number of distinct high-fitness solutions found for each target. As was expected, the genetic design was not as successful as it was in the smaller case study, when it located the target molecule in every run (i.e. a success rate of 100%). However, the most important observation here was that the genetic design still succeeded in finding the target molecule for eight out of the nine target polymers, even though the search space had exploded by a factor of about 100 million. As seen from part (a) of the table, with the exception of target polymer 4, all target polymers were located at least once by one of the design scenarios (i.e., columns 3-7). From part (b) of Table 3, it is seen that some molecules took longer than others to be discovered. For example, target polymer 7 was always found in less than 100 generations. On the other hand, target polymer 6 was located with varying success (4%-68%) and took more than 400 generations for discovery. Typically, longer molecules that required exact mainchain group ordering and sidechain positioning needed more generations to be discovered. This explained why target polymer 7, which was the only target molecule with no group ordering constraint, was quickly located, while target polymer 6, which required exact ordering, took much longer to discover. The exact ordering requirement and the long backbone structure were also the reasons why target polymer 4 was never discovered in any of the runs of 1000 generations each. Columns five to seven of Table 3 present results for the knowledge-augmented genetic search, where higher-level chemical knowledge about the feasibility and stability of group combinations and molecular complexity was incorporated.

One can observe several general trends from these results. It can be seen that the success rates were higher, in general, with the knowledge-augmented genetic design in comparison with the standard genetic design (part (a) of column 3 vs. columns 5 and 7), when the initial population consisted of random mainchain and sidechain groups. Thus, the addition of higher-level chemical knowledge improved the design efficiency. For column 7, since the complexity measure was applied only after the fitness threshold was exceeded, more generations were required to achieve the target. This also explains why the genetic design was unable to locate target polymers 3, 4, and 9. In summary, it appears that the incorporation of higher-level chemical knowledge not only produced candidates that were chemically feasible, stable, and less complex but also increased the efficiency of the search by eliminating spurious candidates in the genetic design.
Table 3: Results for the genetic search

Columns 3-4: standard GA. Columns 5-6: knowledge-augmented GA with feasible MC combinations. Column 7: knowledge-augmented GA with feasible MC combinations and the complexity penalty.

[Repeat-unit structure drawings for TP1 to TP9]

                   (3)          (4)            (5)          (6)            (7)
Target    Part     random       random MC,     random       random MC,     random
polymer            MC, SC       hydrogen SC    MC, SC       hydrogen SC    MC, SC
TP1       (a)      12% (3)      60% (15)       28% (7)      60% (15)       64% (16)
          (b)      184          300            233          240            428
          (c)      282          192            281          213            166
TP2       (a)      36% (9)      48% (12)       40% (10)     48% (12)       48% (12)
          (b)      411          400            209          522            412
          (c)      6            7              7            6              10
TP3       (a)      0% (0)       4% (1)         8% (2)       12% (3)        0% (0)
          (b)      -            293            640          193            -
          (c)      163          91             161          74             109
TP4       (a)      0% (0)       0% (0)         0% (0)       0% (0)         0% (0)
          (b)      -            -              -            -              -
          (c)      861          564            910          589            570
TP5       (a)      32% (8)      56% (14)       48% (12)     48% (12)       92% (23)
          (b)      400          205            317          232            420
          (c)      175          136            197          142            99
TP6       (a)      8% (2)       68% (17)       4% (1)       32% (8)        16% (4)
          (b)      548          405            529          632            528
          (c)      199          146            314          168            158
TP7       (a)      100% (25)    100% (25)      100% (25)    100% (25)      100% (25)
          (b)      61           61             58           64             85
          (c)      217          188            214          198            163
TP8       (a)      68% (17)     68% (17)       76% (19)     88% (22)       96% (24)
          (b)      210          88             147          109            81
          (c)      162          132            158          161            125
TP9       (a)      8% (2)       4% (1)         4% (1)       4% (1)         0% (0)
          (b)      382          132            513          868            -
          (c)      144          69             174          70             46

(a) target polymer success rate, with the number of times the target was found out of 25 GA runs in parentheses; (b) average generation number for locating the target polymer; (c) number of distinct polymers with fitness ≥ 0.99 (0.985 for TP2); MC = mainchain, SC = sidechain.
The results also suggest that the complexity of the initial polymer population played a role in the success rate of the genetic design. For example, the standard genetic design, in general, gave better results when the initial population sidechains were seeded with hydrogen groups (column 3, part (a) vs. column 4, part (a)). Large improvements were seen for target polymer 1 (12% to 60%) and for target polymer 6 (8% to 68%). Similar results were obtained for the knowledge-augmented genetic design that penalized unstable mainchain structures (column 5, part (a) vs. column 6, part (a)). The best improvements were those for target polymer 1 (28% to 60%) and for target polymer 6 (4% to 32%). Part (c) of Table 3 lists the number of near-optimal or high-fitness solutions that were found for each target. This ability of the genetic design system to find many diverse alternative solutions with properties very close to the desired target properties is one of the most appealing features of the system. The high-fitness threshold was 0.99 for all design targets except polymer 2, for which it was 0.985. The genetic design was unable to find alternative solutions with a fitness value greater than 0.99 for this polymer. It should be noted that while the genetic design did not find the exact target for polymer 4, it did locate from more than 500 to about 900 alternative near-optimal solutions.

13.4.1 Near-optimal solutions
Table 4 presents two of the numerous nearly optimal alternatives for target polymer 4 for each of the scenarios 1-3. As one can see, the alternative solutions were very close to the target properties and had fitness values exceeding 0.99. The average absolute error ranged from 0.25% to slightly over 1.0% of the desired property values. The solutions varied according to the search type. For example, case 1 (basic genetic design) obtained two infeasible polymers. The first used a combination of -O- and >C=O groups instead of the single -O-C=O- group, and the second contained an unstable -O-O-O- group combination. Using the correct -O-C=O- group reduced the fitness to 0.976 and increased the average absolute error to 2.04%. Case 2 produced feasible mainchain structures, but these were generally more complex than those in case 3, which also considered molecular complexity. The number of near-optimal solutions was approximately the same for all genetic design types. Table 5 presents corresponding results for target polymer 3. For this target, as in the case of target polymer 4, all alternative solutions had very high fitness values. Furthermore, these alternative solutions were structurally fairly similar to the actual target. This ability of the genetic design system to deliver a number of nearly optimal solutions structurally similar to the target is of immense practical importance. In several cases, one of the near-optimal candidates could easily turn out to be an attractive and feasible option for further consideration.
13.5 PARAMETRIC SENSITIVITY AND ROBUSTNESS ANALYSES FOR GAs
The performance of GA-based strategies is intimately tied to the different parameters employed in the algorithm. These parameters control the various aspects of the algorithm and hence directly govern the outcome of the search. The discovery of an optimal setting for the parameters, or even the existence of one, can be determined only by experimentation. The results of the GA design system on the case studies, though encouraging, varied widely in terms of success rate as well as the quality of the final solutions obtained. This indicated that a detailed parametric sensitivity analysis was needed in order to improve performance. Such an analysis would help to establish whether an optimal setting could be obtained, independent of the nature of the target structure or design problem. In their previous work, Sundaram and Venkatasubramanian carried out such a parametric sensitivity study in an effort to systematically determine optimal parameter settings [9]. Their investigation also involved a characterization of the search space in order to identify strategies that would allow the GA to exploit the underlying structure of the space. The key results from their work are mentioned below.
Table 4: Near-optimal solutions for target polymer 4 (table body not reproduced). For the target polymer TP4 and for two solutions from each of Case 1 (standard GD), Case 2 (knowledge-augmented GD, stability) and Case 3 (knowledge-augmented GD, stability and complexity), the table lists the polymer design, the % error and the fitness. % error is for {ρ; Tg; α; Cp; K}, followed by the average absolute error %. GD = genetic design.

The study clearly highlighted the absence of a single optimal setting for the parameters examined. In fact, a parameter setting found to work very well for a particular target was found to be non-optimal for a different target. The results implied that an optimal tuning of parameters could be done only on a run-to-run basis. The target-specific nature of the optimal parameter settings exposed an important aspect of the algorithm: the nature of the search space critically influenced the mechanics of the GA. The search-space characterization study illustrated that the structure of the fitness landscape was drastically altered by the target property settings. While in some cases the landscape was amenable to search using convexity-based algorithms, in other cases it remained rather flat but reasonably correlated for small changes. The most important insight provided by the study was that the breadth as well as the depth of the sampling of chromosomes is crucial to the performance of the GA. Stated differently, the diversity of chromosomes sampled during the search is important not only in terms of the variety of the samples, measured by their distances in the search space, but also in terms of the necessary number of samples at a given distance of separation. This becomes even more pronounced under non-binary genetic encoding.
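The kind of parametric sensitivity experiment described above can be sketched with a toy GA. The sketch below is only an illustration under stated assumptions: the building-block contributions, the target property, the Gaussian fitness form with relative error, and the names `run_ga` and `sweep` are all hypothetical and are not the settings or code used in [9]. It sweeps population size and mutation probability and records the success rate and the mean generation at which the fitness threshold was met.

```python
import math
import random

# Toy problem (all numbers are made up): a chromosome is a chain of 8 genes,
# each gene selects one of 6 building blocks with an additive contribution.
CONTRIB = [1.0, 2.3, 3.7, 5.2, 6.1, 8.9]
CHAIN_LEN = 8
TARGET_PROPERTY = 34.7   # reached, for example, by the chain [1, 5, 2, 0, 4, 3, 3, 1]


def fitness(chain, alpha=100.0):
    """Gaussian fitness on the relative property error (1.0 = target matched)."""
    prop = sum(CONTRIB[g] for g in chain)
    rel_err = (prop - TARGET_PROPERTY) / TARGET_PROPERTY
    return math.exp(-alpha * rel_err ** 2)


def run_ga(pop_size, p_mut, rng, max_gen=40, threshold=0.99):
    """Return the generation at which the threshold is first met, or None.

    pop_size is assumed to be even so that parents can be paired.
    """
    n_blocks = len(CONTRIB)
    pop = [[rng.randrange(n_blocks) for _ in range(CHAIN_LEN)]
           for _ in range(pop_size)]
    for gen in range(max_gen):
        scores = [fitness(c) for c in pop]
        if max(scores) >= threshold:
            return gen
        # Fitness-proportionate selection, one-point crossover, per-gene mutation.
        parents = rng.choices(pop, weights=scores, k=pop_size)
        children = []
        for a, b in zip(parents[0::2], parents[1::2]):
            cut = rng.randrange(1, CHAIN_LEN)
            for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
                children.append([rng.randrange(n_blocks) if rng.random() < p_mut else g
                                 for g in child])
        pop = children
    return None


def sweep(pop_sizes, mutation_rates, runs=25, seed=0):
    """Map (pop_size, p_mut) to (success rate, mean generation of success)."""
    results = {}
    for pop_size in pop_sizes:
        for p_mut in mutation_rates:
            rng = random.Random(seed)
            gens = [run_ga(pop_size, p_mut, rng) for _ in range(runs)]
            found = [g for g in gens if g is not None]
            mean_gen = sum(found) / len(found) if found else None
            results[(pop_size, p_mut)] = (len(found) / runs, mean_gen)
    return results


for setting, outcome in sweep([10, 20, 40], [0.01, 0.05, 0.2]).items():
    print(setting, outcome)
```

Repeating such a sweep for several different targets is one simple way to see whether the best setting moves from target to target, which is the behaviour reported in the study.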
Table 5: Near-optimal solutions for target polymer 3 (structure drawings not reproduced)

Polymer design             % error {ρ; Tg; α; Cp; K}           Avg. abs. error   Fitness
Target polymer TP3         {0; 0; 0; 0; 0}                     0%                1.0
Near-optimal solution 1    {0.58; 0.22; 0.89; -1.3; 0.09}      0.62%             0.997
Near-optimal solution 2    {-0.95; 0.3; 0.68; -0.4; 1.5}       0.76%             0.996
Near-optimal solution 3    {-0.61; 0.56; 1.2; -0.09; 2.1}      0.92%             0.993
Near-optimal solution 4    {-1.9; 0.34; -0.5; -2; -0.5}        1.05%             0.992

% error is for {ρ; Tg; α; Cp; K}, followed by the average absolute error %.
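As a small worked check of the "average absolute error" quoted with each error vector, the sketch below recomputes it for the four near-optimal solutions of target polymer 3. The error vectors, printed averages and fitness values are transcribed from Table 5 as far as they can be read; the function name is an assumption. The recomputed averages agree with the printed ones to within about 0.01 percentage points, and the fitness decreases as the average error grows.

```python
def average_absolute_error(errors):
    """Mean of the absolute percentage errors over the five properties."""
    return sum(abs(e) for e in errors) / len(errors)


# (error vector for {rho; Tg; alpha; Cp; K}, printed average in %, fitness)
TABLE_5 = [
    ([0.58, 0.22, 0.89, -1.3, 0.09], 0.62, 0.997),
    ([-0.95, 0.3, 0.68, -0.4, 1.5], 0.76, 0.996),
    ([-0.61, 0.56, 1.2, -0.09, 2.1], 0.92, 0.993),
    ([-1.9, 0.34, -0.5, -2.0, -0.5], 1.05, 0.992),
]

for errors, printed, fit in TABLE_5:
    print(f"recomputed {average_absolute_error(errors):.3f}%  printed {printed}%  fitness {fit}")
```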
In addition to the issue of parametric sensitivity, another important concern relates to the robustness of the genetic search method, and indeed of any design system, to uncertainty in the forward prediction model that is used for fitness evaluation. Every forward model has some level of error associated with it. Depending upon the type and complexity of the property or performance measure at hand, the predictions of a model may be as much as 10-15% off the true values. While such a high degree of error may not be present in predictive models for simpler properties such as density, there would surely be some error. The presence of error may be viewed as uncertainty in the forward predictions. The practical utility of a design system is then related to its performance under such uncertainty. In a recent work, Patkar and Venkatasubramanian [10] studied the robustness of genetic algorithms to model uncertainty in molecular design. The study was carried out using the large polymer design case study. The results were highly encouraging and indicated an overall robust performance of the GA-based design system. For the target polymers considered, the system was successful even with errors as high as 10% in the forward model.
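One simple way to probe this kind of robustness is to perturb the forward-model predictions and see how often the truly best candidate still ranks near the top. The sketch below is a hypothetical illustration and not the procedure of [10]: the candidate property vectors, the uniform relative error model, the Gaussian fitness form with relative errors and the name `top_k_retention` are all assumptions.

```python
import math
import random


def gaussian_fitness(props, target, alpha=100.0):
    """Gaussian fitness on relative property errors (assumed form)."""
    return math.exp(-alpha * sum(((p - t) / t) ** 2 for p, t in zip(props, target)))


def noisy(props, rel_err, rng):
    """Apply an independent relative error of up to +/- rel_err to each prediction."""
    return [p * (1.0 + rng.uniform(-rel_err, rel_err)) for p in props]


def top_k_retention(candidates, target, rel_err, k=3, trials=200, seed=0):
    """Fraction of trials in which the truly best candidate stays in the noisy top k."""
    rng = random.Random(seed)
    best = max(range(len(candidates)),
               key=lambda j: gaussian_fitness(candidates[j], target))
    hits = 0
    for _ in range(trials):
        scores = [gaussian_fitness(noisy(c, rel_err, rng), target) for c in candidates]
        ranked = sorted(range(len(candidates)), key=lambda j: scores[j], reverse=True)
        hits += best in ranked[:k]
    return hits / trials


# Made-up target and candidate property vectors (three properties each).
TARGET = [1.2, 350.0, 2.0]
CANDIDATES = [
    [1.2, 350.0, 2.0], [1.25, 340.0, 2.1], [1.1, 365.0, 1.9],
    [1.4, 300.0, 2.4], [1.0, 400.0, 1.7], [1.3, 330.0, 2.2],
]

for rel_err in (0.0, 0.05, 0.10, 0.15):
    print(rel_err, top_k_retention(CANDIDATES, TARGET, rel_err))
```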
13.6 CONCLUSIONS
The performance of a GA-based approach for large-scale molecular design was investigated with the help of a large polymer design case study. The total number of solution candidates in the present problem was about 100 million times larger than in the example discussed in chapter 5. It was found that, despite the tremendous increase in the search space size and the complex nonlinear group interactions, the genetic design was generally able to find the target molecules. Furthermore, it was also able to provide a diverse collection of design alternatives that nearly satisfy the property constraints. However, the algorithm had a much lower success rate and converged much more slowly than on the smaller problem. The versatility of the genetic search methodology was illustrated in terms of its easy extension to include higher-level chemical knowledge. The objective of incorporating such knowledge was to ensure that more realistic, stable, and less complex solutions were obtained from the search. The results indicated that the inclusion of knowledge not only eliminated the creation of chemically infeasible structures, as expected, but also improved the overall efficiency of the genetic design. In other words, not surprisingly, the search turned out to be more intelligent than in the absence of additional knowledge. It was evident from the case studies that the genetic design system was extremely proficient at rapidly locating favorable regions in the design space. It was, however, less effective at performing very localized searches. This was seen in many design scenarios where the optimal design could be reached by three or four genetic operations but the algorithm took several hundred generations to reach the target. This strongly indicated that tuning the parameters could significantly improve performance. However, parametric sensitivity studies indicated the absence of a single optimal parameter setting. The best settings changed from one target to another and could be determined only by experimentation. The issue of the performance of GAs under forward model uncertainty was briefly addressed. Results from a recent study are encouraging and indicate significant robustness on the part of the genetic design system.

In conclusion, the problem-independent, efficient nature of the versatile genetic approach and the ease with which chemical, biological, design or process knowledge and constraints can be incorporated make the genetic design framework very appealing for CAMD and worthy of further investigation for large-scale molecular design problems.
13.7 LIST OF SYMBOLS AND ABBREVIATIONS

F        fitness value
Fcrit    fitness threshold
α        decay rate for Gaussian fitness function
CAMD     Computer-Aided Molecular Design
GA(s)    Genetic Algorithm(s)
PET      Polyethylene terephthalate
PVP      Poly(vinylidene propylene) copolymer
PC       Polycarbonate of bisphenol-A
MC       mainchain
SC       sidechain

The symbols for the following four entries are not legible in the source (one of them is γ): penalty scaling factor for complexity; complexity gain; penalty coefficient for modified fitness function; penalty related to the i-th constraint.

13.8 REFERENCES

1. V. Venkatasubramanian, K. Chan and J. M. Caruthers, Comput. Chem. Eng., 18 (1994) 833-844.
2. V. Venkatasubramanian, K. Chan and J. M. Caruthers, J. Chem. Inf. Comput. Sci., 35 (1995) 188-195.
3. D. W. van Krevelen, Properties of Polymers; their Correlation with Chemical Structure; their Numerical Estimation and Prediction from Additive Group Contribution, 3rd Ed., Elsevier, Amsterdam, The Netherlands, 1990.
4. D. Barton and W. D. Ollis (Eds.), Comprehensive Organic Chemistry: The Synthesis and Reaction of Organic Compounds, First Edition, Pergamon Press, New York, 1979.
5. E. A. Brignole, S. Bottini and R. Gani, Fluid Phase Equil., 29 (1986) 125-132.
6. K. G. Joback and G. Stephanopoulos, FOCAPD '89, Snowmass, CO, 1989.
7. S. Macchietto, O. Odele and O. Omatsone, Chem. Eng. Res. Des., 68, 5 (1990) 429-433.
8. R. Gani and E. A. Brignole, Fluid Phase Equil., 13 (1983) 331-340.
9. A. Sundaram and V. Venkatasubramanian, J. Chem. Inf. Comput. Sci., 38 (1998) 1177-1191.
10. P. R. Patkar and V. Venkatasubramanian, AIChE J. (submitted for publication, 2002).
Computer Aided Molecular Design: Theory and Practice
L.E.K. Achenie, R. Gani and V. Venkatasubramanian (Editors)
© 2003 Elsevier Science B.V. All rights reserved.
Chapter 14: Case Study in Identification of Multistep Reaction Stoichiometries
A. Buxton, A. Hugo, A.G. Livingston & E.N. Pistikopoulos
14.1 INTRODUCTION

In this chapter, the systematic procedure for the rapid identification of environmentally benign alternative multi-step stoichiometries, as described in Chapter 7, is applied to a case study: the production of acetic acid. Acetic acid is one of the most important aliphatic intermediate compounds. Various of its esters are important for artificial silk manufacture and are used as solvents for resins and paints. Its inorganic salts are used in the dye and clothing industries and in medicine. The scale of production of this molecule makes this an interesting example from the environmental point of view. The background and chemical routes for this example were adapted from Weissermel and Arpe (1993).
14.2 PROBLEM FORMULATION

The problem addressed here may be stated as follows:

Given a desired organic product,

Identify a set of candidate multi-step organic reaction stoichiometries for the production of the desired product which are both economically and environmentally promising.

This requires a three-step procedure: (i) selection of co-material groups, (ii) determination of a set of candidate co-materials, and (iii) identification of a set of promising candidate multi-step stoichiometries. The use of such a structured, stepwise procedure reduces the multi-step stoichiometry identification problem to a manageable size. The key to the procedure is the introduction of co-material design (steps (i) and (ii)). With the product and stoichiometric co-materials known, the identification of feasible reaction stoichiometries is no longer an open-ended problem. The steps of the procedure are described in the following sections.
14.3 METHODOLOGY
As described in Chapter 7, the first step in the methodology is the application of a new group-based co-material enumeration algorithm. By introducing material design principles, through structural and chemical feasibility constraints, a manageable set of raw materials and co-products can be generated. Next, stoichiometries are extracted from the co-material set using a two-step optimisation procedure, including whole number stoichiometric coefficient constraints, carbon structure constraints and case-specific constraints based on chemical knowledge. Thermodynamic, economic and environmental impact criteria are employed in the evaluation of feasible stoichiometries, with aspects of the Methodology for Environmental Impact Minimisation (MEIM) (Pistikopoulos et al., 1994) providing the framework for the environmental evaluation of alternatives. The particular specifications used in the case study for each of these steps follow.

GROUP PRE-SELECTION

There are five established routes to acetic acid; these are shown in Figure 1. As before, for simplicity, group pre-selection was restricted to identifying the simplest set of UNIFAC groups necessary to represent the product and the co-materials involved in these stoichiometries. As a further simplification, the chemistry-specific intermediates peracetic acid and 2-acetoxybutane were not considered as part of group pre-selection, since it is unlikely that they would be produced and consumed in different stoichiometries which lead directly to the desired product. Accordingly, the following thirteen groups were selected: CH3-, -CH2-, -CHO, -CO2H, CH3COO-, -CH=CH-, CH3CO-, HCOO-, CH2=CH-, -OH, H2O, CH3OH, HCOOH. The latter three groups are complete molecules selected from class zero in Constantinou et al. (1996); no category two groups are featured in this example.

CO-MATERIAL DESIGN

Since the established chemistries involve only unbranched acyclic molecules (disregarding 2-acetoxybutane), the co-material enumeration problem was solved for such molecules only, with the following additional structural restrictions based on the established co-materials: (i) an upper limit of four groups per molecule is imposed, and (ii) only one oxygen-containing group is allowed per molecule, since more complex molecules than this are unlikely raw materials and the common industrial by-products are simpler than the product (mostly CO2 and HCO2H).
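The two structural restrictions above can be illustrated with a brute-force enumeration of unbranched chains built from the ten selected sub-molecular groups. This is a minimal sketch and not the enumeration algorithm of Chapter 7: the split into chain-ending and chain-extending groups is an assumption based on the number of free bonds, the function name is hypothetical, and no chemical feasibility screening is applied, so the sketch returns more sequences than the co-material set reported in the results. The three complete-molecule groups (H2O, CH3OH, HCOOH) are not used as chain building blocks.

```python
from itertools import product

# Assumption: groups with one free bond close a chain, groups with two extend it.
# The boolean marks whether the group contains oxygen.
END_GROUPS = {
    "CH3-": False, "CH2=CH-": False, "-CHO": True, "-CO2H": True,
    "CH3COO-": True, "CH3CO-": True, "HCOO-": True, "-OH": True,
}
MID_GROUPS = {"-CH2-": False, "-CH=CH-": False}
HAS_OXYGEN = {**END_GROUPS, **MID_GROUPS}


def enumerate_chains(max_groups=4, max_oxygen_groups=1):
    """List unbranched acyclic group sequences that satisfy both restrictions."""
    chains = set()
    for n_mid in range(max_groups - 1):            # 0 .. max_groups - 2 inner groups
        for left, right in product(END_GROUPS, repeat=2):
            for mids in product(MID_GROUPS, repeat=n_mid):
                chain = (left, *mids, right)
                if sum(HAS_OXYGEN[g] for g in chain) > max_oxygen_groups:
                    continue
                # A chain and its mirror image are the same molecule.
                chains.add(min(chain, chain[::-1]))
    return sorted(chains)


chains = enumerate_chains()
print(len(chains), "candidate group sequences before chemical screening")
print([c for c in chains if len(c) == 2][:5])
```

Acetic acid itself appears in this list as the two-group sequence made of CH3- and -CO2H.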
Oxidation of Acetaldehyde
  CH3CHO (acetaldehyde) + O2 -> CH3CO-O-OH (peracetic acid)
  CH3CO-O-OH + CH3CHO -> 2 CH3CO2H (acetic acid)
  Operated by: UCC (USA), Daicel (Japan) and British Celanese (UK)

Oxidation of Alkanes (n-Butane)
  CH3(CH2)2CH3 (n-butane) + 2.5 O2 -> 2 CH3CO2H (acetic acid) + H2O
  Operated by: Hoechst Celanese, Hüls and UCC (USA)

Oxidation of Alkenes (Butenes)
  CH3CH2CH=CH2 or CH3CH=CHCH3 (1-butene or 2-butene) + CH3CO2H -> CH3CH2CH(O2CCH3)CH3 (2-acetoxybutane)
  CH3CH2CH(O2CCH3)CH3 + 2 O2 -> 3 CH3CO2H (acetic acid)
  Operated by: Bayer and Hüls

Carbonylation of Methanol
  CH3OH + CO -> CH3CO2H
  Operated by: BASF and Monsanto

Isomerisation of Methyl Formate
  CH3OCHO -> CH3CO2H
  Not yet commercialised

Figure 1: Acetic Acid Production Routes

ROLE SPECIFICATION CONSTRAINTS
In line with the industrial routes, stoichiometries of up to two steps in length were allowed, with a maximum of four species permitted in any step. Table 1 shows the knowledge-based role specification constraints employed in the acetic acid example where, as before, R denotes reactant only, P denotes the final product, C denotes product or co-product, N denotes the exclusion of a species from a system and a blank space denotes no restriction. These constraints were again developed specifically for two-step stoichiometries according to the following arguments, based on chemical knowledge and the existing industrial chemistries.

Table 1: Role Specification Constraints - Acetic Acid Example (? marks an entry that is not legible in the source)

Species          1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
System 0         R  C  R  R  R  C  C  R  P  N  R  R  N  R  N  N  R  N  R  N  N  R  R  N  R  N  N  N
Systems 1A, 1B      C  R  R  R  C  ?  C  ?  C  R  N  R  C  N  C  R  R  C  N  C  R  N  C  C  C  C  C
• Alcohols (species 1, 13 and 18) oxidise to aldehydes and then to carboxylic acids in two steps and so are included as reactants only in systems 1A and 1B, and excluded altogether from system zero (except methanol, species 1, which is allowed as a reactant in system zero for carbonylation directly to acetic acid, and is unrestricted in systems 1A and 1B).
• Accordingly, aldehydes (species 8, 14 and 19) are included as products or co-products only in systems 1A and 1B and as reactants only in system zero.
• Unsaturated molecules (species 11, 17 and 22) may be reactants only in all systems; their formation is not considered.
• Alkanes (species 12 and 23) may be oxidised directly to acids. They are therefore included as raw materials only in system zero, and excluded from systems 1A and 1B.
• Higher carboxylic acids (species 15 and 20) are unlikely raw materials and undesirable co-products for a promising stoichiometry. They are therefore excluded altogether.
• Formates (species 24, 25 and 26) and acetates (species 10, 16 and 21) are esters of formic and acetic acids respectively. They are therefore unlikely raw materials, and due to the conditions necessary for esterification (concentrated sulphuric acid) they are also unlikely co-products. They are therefore excluded from system zero (except methyl formate, species 25 in Table 2, for isomerisation) and included only as products or co-products in systems 1A and 1B.
• Formic acid (species 7) is included as a co-product in system zero, since it is a recognised industrial by-product, and is included as a reactant only in systems 1A and 1B to allow the generation of formates.
• Ketones (species 27 and 28 in Table 2) are produced by oxidising secondary alcohols. No such alcohols are included here, so these species are excluded from system zero, and included only as products or co-products in systems 1A and 1B.
• H2O and CO2 (species 2 and 6) are included as co-products only in all systems according to the industrial chemistries.
• CO, O2 and H2 are included as reactants only in all systems.
CHEMISTRY CONSTRAINTS

Knowledge-based chemistry constraints were employed using the binary product and reactant flags, $i_s$ and $\mathit{ii}_s$ respectively, found in the whole number stoichiometry constraints as defined in Chapter 7. It is worth recalling that the binary variable $\mathit{ii}_s$ takes the value zero if species $s$ is a product and unity if species $s$ is a reactant, while $i_s$ takes the value zero when $s$ is a reactant and unity when $s$ is a product.

• alcohols, alkenes, alkanes and aldehydes may not react with each other

$$\mathit{ii}_{1} + \mathit{ii}_{9} + \mathit{ii}_{13} + \mathit{ii}_{18} + \mathit{ii}_{11} + \mathit{ii}_{17} + \mathit{ii}_{22} + \mathit{ii}_{12} + \mathit{ii}_{23} + \mathit{ii}_{8} + \mathit{ii}_{14} + \mathit{ii}_{19} \le 1 \tag{1}$$

• carbonylation (reaction with carbon monoxide) is restricted to alcohols and formates

$$\mathit{ii}_{3} - (\mathit{ii}_{1} + \mathit{ii}_{13} + \mathit{ii}_{18} + \mathit{ii}_{24} + \mathit{ii}_{27} + \mathit{ii}_{28} + \mathit{ii}_{5}) \le 0 \tag{2}$$

• formates must either react with oxygen or carbon monoxide or undergo isomerisation

$$\mathit{ii}_{24} + \mathit{ii}_{27} + \mathit{ii}_{28} - \mathit{ii}_{3} - \mathit{ii}_{4} \le 2 - \sum_{s} \mathit{ii}_{s} \tag{3}$$

• formates may be produced only by esterification of formic acid with the appropriate alcohol

$$2 i_{24} - \mathit{ii}_{7} - \mathit{ii}_{18} \le 0 \tag{4}$$

$$2 i_{27} - \mathit{ii}_{7} - \mathit{ii}_{1} \le 0 \tag{5}$$

$$2 i_{28} - \mathit{ii}_{7} - \mathit{ii}_{13} \le 0 \tag{6}$$

• aldehydes may only be produced by oxidation of the appropriate alcohols or oxidation or hydration of the appropriate unsaturated compounds

$$2 i_{8} - \mathit{ii}_{13} - \mathit{ii}_{11} - \mathit{ii}_{17} - \mathit{ii}_{22} - \mathit{ii}_{2} - \mathit{ii}_{4} \le 0 \tag{7}$$

$$2 i_{14} - \mathit{ii}_{18} - \mathit{ii}_{11} - \mathit{ii}_{22} - \mathit{ii}_{2} - \mathit{ii}_{4} \le 0 \tag{8}$$

$$2 i_{19} - \mathit{ii}_{17} - \mathit{ii}_{4} \le 0 \tag{9}$$

• methanol may only be produced from synthesis gas

$$2 i_{1} - \mathit{ii}_{5} - \mathit{ii}_{3} \le 0 \tag{10}$$
In addition to these constraints, to prevent the formation of carbon-carbon bonds, the carbon structure constraints given in Chapter 7 were employed in a slightly modified form to allow the carbonylation of methanol. Finally, a production rate (crvp) lower bound of 2.5 kmol/hr and an allowable reactor temperature range of 300-800 K were imposed.
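To make the role of the binary flags concrete, the sketch below evaluates three of the chemistry constraints, (1), (2) and (10), for a candidate step given as sets of reactant and product species. It is an illustration only: the helper names are hypothetical, the species index sets are copied from the constraints as printed, the inequalities are read as "less than or equal to", and the real formulation is solved as an optimisation problem rather than checked case by case.

```python
# Species index sets taken from constraints (1) and (2) as printed in the text.
NON_MIXING = [1, 9, 13, 18, 11, 17, 22, 12, 23, 8, 14, 19]
CARBONYLATION_PARTNERS = [1, 13, 18, 24, 27, 28, 5]


def flags(reactants, products, n_species=28):
    """Return (i, ii): i[s] = 1 if s is a product, ii[s] = 1 if s is a reactant."""
    i = {s: int(s in products) for s in range(1, n_species + 1)}
    ii = {s: int(s in reactants) for s in range(1, n_species + 1)}
    return i, ii


def check(reactants, products):
    """Evaluate constraints (1), (2) and (10) for one candidate step."""
    i, ii = flags(set(reactants), set(products))
    return {
        "eq1_no_mixing": sum(ii[s] for s in NON_MIXING) <= 1,
        "eq2_carbonylation": ii[3] - sum(ii[s] for s in CARBONYLATION_PARTNERS) <= 0,
        "eq10_methanol_from_syngas": 2 * i[1] - ii[5] - ii[3] <= 0,
    }


# Carbonylation of methanol: CH3OH (1) + CO (3) -> CH3CO2H (9)
print(check({1, 3}, {9}))
# CO (3) with propene (11): carbonylation partner missing, so (2) is violated
print(check({3, 11}, {9}))
# Methanol (1) from CO (3) and H2 (5) satisfies (10); from O2 (4) and propane (12) it does not
print(check({3, 5}, {1}), check({4, 12}, {1}))
```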
14.4 RESULTS
CO-MATERIAL DESIGN

Applying the co-material design procedure, twenty-one co-materials were constructed. Methanol, water and formic acid are included as additional molecules from class zero in Constantinou et al. (1996) according to the established routes. Carbon monoxide, oxygen, hydrogen and carbon dioxide are included as further additional molecules. All twenty-eight co-materials are shown in Table 2.

MULTI-STEP STOICHIOMETRY IDENTIFICATION RESULTS

The solutions of the stoichiometry identification program are again presented in the form of a table of stoichiometric coefficients in Table 3, where blank spaces indicate zero coefficients and the species are numbered as above.

Table 2: Co-Material Design Results - Acetic Acid Example

No.  Formula              Name
1    CH3OH                Methanol
2    H2O                  Water
3    CO                   Carbon Monoxide
4    O2                   Oxygen
5    H2                   Hydrogen
6    CO2                  Carbon Dioxide
7    HCO2H                Formic Acid
8    CH3CHO               Acetaldehyde
9    CH3CO2H              Acetic Acid
10   CH3OOCCH3            Methyl Acetate
11   CH3CH=CH2            Propene
12   CH3CH2CH3            Propane
13   CH3CH2OH             Ethanol
14   CH3CH2CHO            Propanal
15   CH3CH2CO2H           Propanoic Acid
16   CH3CH2O2CCH3         Ethyl Acetate
17   CH3CH=CHCH3          2-Butene
18   CH3(CH2)2OH          Propanol
19   CH3(CH2)2CHO         Butanal
20   CH3(CH2)2CO2H        Butanoic Acid
21   CH3(CH2)2OOCCH3      Propyl Acetate
22   CH3CH2CH=CHCH3       2-Pentene
23   CH3(CH2)2CH3         n-Butane
24   CH3(CH2)2OCHO        Propyl Formate
25   CH3OCHO              Methyl Formate
26   CH3CH2OCHO           Ethyl Formate
27   CH3COCH3             Propanone
28   CH3CH2COCH3          Butanone
System zero produced nine candidate stoichiometries that satisfy all constraints, in which materials 1, 3, 4, 8, 11, 14, 17, 19, 22, 23 and 27 appear as first generation precursor reactants. Due to the large number of role specification and chemistry constraints, the imposition of carbon structure constraints, and the simplicity of the species, systems 1A and 1B produced only seven further stoichiometries for the production of species 1, 8, 14 and 19. All stoichiometries
Table 3: Multi-step stoichiometry identification results (the stoichiometric coefficient columns for species 1 to 28 are not reproduced; ? marks an entry that is not legible in the source)

Index  Nsp  T (K)  Production rate (kmol/hr)  Profit ($/mol)  CTAM (tn air/mol)
System 0 - Producing Species 9
A      2    300    10.00                      -0.0125         1.68
B      3    300    10.00                       0.0398         0.91
C      3    300    20.00                       0.0359         13.59
D      3    300    40.00                       0.0181         7.17
E      3    300    20.00                       0.0083         6.27
F      4    300    20.00                       0.0515         21.69
G      4    300    40.00                       0.0440         16.43
H      4    300    20.00                       0.0463         22.47
I      4    300    10.00                       0.0787         31.88
System 1 - Producing Species 1
J      ?    300    ?                          -0.0024         3.96
System 1 - Producing Species 19
K      ?    300    20.00                       0.0357         3.53
System 1 - Producing Species 8
L      3    300    20.00                       0.0276         2.40
M      4    680    2.50                        0.0484         3163.81
N      4    435    5.00                       -0.0025         377.02
O      4    300    20.00                       0.0703         20.69
System 1 - Producing Species 14
P      ?    680    2.50                        0.0484         3163.81
molecule is different. This eventuality was not accounted for in the stoichiometry identification formulation. Since the integer cuts are written to prevent any stoichiometry from occurring more than once, no stoichiometries producing species 14 were found and stoichiometry P was added manually after solving the problem. In principle, the integer cuts could be modified to avoid this problem by including information to identify the target molecule, so that only repeated stoichiometries used to produce a previously targeted molecule are excluded. All five established chemistries shown in Figure 1 were reproduced: stoichiometry A representing the isomerisation of methyl formate, stoichiometry B representing the carbonylation of methanol, stoichiometry G representing the oxidation of n-butane, and stoichiometries E and G representing the oxidation of acetaldehyde and butenes respectively, if in somewhat reduced form (i.e. without the inclusion of intermediates). Table 4 shows the total profits and impacts for the individual solutions combined into multi-step stoichiometries. As before, the profits reflect only the values of the products minus the values of the reactants, assuming that stoichiometric co-products are sold at their market value. Once again, raw materials are assumed to be input waste free. As before, stoichiometries with poor conversion are not penalised.

Table 4: Total Profits and Impacts

Index  Total Profit ($/mol)  Total CTAM (tn air/mol)
A      -0.0125               1.68
BJ      0.0373               4.88
C       0.0359               13.59
DK      0.0359               8.94
EL      0.0359               8.67
EM      0.0568               3170.08
EN      0.0058               383.29
EO      0.0787               26.96
F       0.0515               21.69
G       0.0440               16.43
HP      0.0947               3186.28
I       0.0787               31.88
14.5 CONCLUSIONS
In the example presented here, there is a noticeable variation in both total profit and impact figures, so that the most promising solutions can be more easily identified. Clearly, stoichiometries involving steps M, N and P can justifiably be eliminated from further consideration on impact grounds, these steps being penalised in impact terms by high reactor temperature, and stoichiometry A can justifiably be eliminated on economic grounds. Of the remaining eight stoichiometries, the carbonylation of methanol (step B), with the addition of step J to produce methanol, exhibits the lowest impact of all. Although the profit of stoichiometry BJ is only half that of the highest, its impact is so much lower that it represents the best compromise solution. Once again, this clearly illustrates the advantages of considering multi-step production routes, since step J has a negative profit. Stoichiometries C, DK and EL all exhibit both higher impacts and lower profits than BJ, and while stoichiometries EO, F, G and I exhibit higher profits, their impacts rise in parallel with these profits. Thus, stoichiometry BJ is most worthy of further investigation, with stoichiometries EL, DK, C, G, F, EO and I representing progressively less promising alternatives. These results highlight how the technique can assist the identification of a small number of alternative stoichiometries which are promising both in terms of economics and environmental impact. Moreover, the application has shown that developing multi-step stoichiometries directly can lead to the acceptance of alternatives which would be rejected as single-step syntheses.
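The trade-off between profit and impact discussed above can also be examined with a simple non-dominated (Pareto) filter. The sketch below is an illustration and not part of the original methodology: the function name is an assumption, and the profit and impact values are the totals of Table 4 as far as they can be read. An alternative is kept if no other alternative has at least the same profit and at most the same impact while being strictly better in one of the two.

```python
# index: (total profit in $/mol, total CTAM in tn air/mol), transcribed from Table 4
TABLE_4 = {
    "A": (-0.0125, 1.68), "BJ": (0.0373, 4.88), "C": (0.0359, 13.59),
    "DK": (0.0359, 8.94), "EL": (0.0359, 8.67), "EM": (0.0568, 3170.08),
    "EN": (0.0058, 383.29), "EO": (0.0787, 26.96), "F": (0.0515, 21.69),
    "G": (0.0440, 16.43), "HP": (0.0947, 3186.28), "I": (0.0787, 31.88),
}


def pareto_front(options):
    """Return the non-dominated alternatives, ordered by increasing impact."""
    front = []
    for name, (profit, impact) in options.items():
        dominated = any(
            p2 >= profit and i2 <= impact and (p2 > profit or i2 < impact)
            for other, (p2, i2) in options.items() if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front, key=lambda n: options[n][1])


print(pareto_front(TABLE_4))
```

With these values the filter removes C, DK and EL (dominated by BJ) as well as EM, EN and I, which matches the comparison made in the text. It retains A and HP because it applies no cut-off on profit or impact; those are eliminated in the text by separate economic and impact arguments.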
14.6 REFERENCES

Constantinou, L., K. Bagherpour, R. Gani, J.A. Klein and D.T. Wu. Computer Aided Product Design: Problem Formulations, Methodology and Applications. Computers chem. Engng 20(6), 685-703 (1996)
Pistikopoulos, E.N., S.K. Stefanis and A.G. Livingston. A Methodology for Minimum Environmental Impact Analysis. AIChE Symposium Series, Volume on Pollution Prevention through Process and Product Modifications 90(303), 139-150 (1994)
Weissermel, K. and H.-J. Arpe. Industrial Organic Chemistry. Second, Revised and Extended Edition. VCH, Weinheim FRG (1993)
Chapter 15: Molecular Design of Fuel Additives
A. Sundaram, V. Venkatasubramanian & J. M. Caruthers
Computer-aided product design for performance usually involves evolving the best combination from existing prediction methods and search schemes to navigate the space of possible solutions. While the use of existing techniques may serve expediency well, it often comes at the cost of prediction inaccuracy and the inability of the proposed designs to meet actual performance requirements. In this paper we describe the development and implementation of an integrated approach that develops accurate prediction models and efficient design strategies using "design-relevant" functional descriptors and their associated structural building blocks. The forward (prediction) models are constructed to be an optimal trade-off between accuracy and robustness under a hybrid first-principles and neural network framework. The phenomenological component is structured to mirror the product formulation process wherein smaller units (electrons, atoms, molecular fragments) are hierarchically modified to obtain desired performance contributions in larger units (molecules and formulations) that contain them. The nonlinear and often uncertain influence of these contributions on the product performance is then built into the model by a correlative/neural-network approach. An evolutionary design/search strategy is used to reconstruct the molecular solutions models guided by performance objectives. The search strategy is customized to retain feasibility and navigate efficiently by using the design-relevant building blocks. The implementation of this CAMD approach and its effectiveness in designing novel and synthetically feasible fuel additives is demonstrated in this paper.
15.1 INTRODUCTION: PRODUCT DESIGN FOR PERFORMANCE
Product design and development is an important and strategic activity in the chemical and pharmaceutical industry. It is also expensive and time-consuming, often costing millions of dollars and several years in development. The problem of product design involves the identification of formulations that match or closely approximate the desired performance characteristics of the product. This includes characterization of product performance, development of testing and measurement techniques, construction of design candidates and screening for closeness to desired performance levels. Depending upon the domain of application, screening the product candidates might involve use of methods ranging from predictive models to expensive field-testing. Hence,
computer-aided product design, and more specifically CAMD, has been an important area of engineering research over the past several years. The problem comprises two parts: the so-called forward problem, which is the computation or prediction of product performance measures from the product formulation or molecular structure, and the inverse problem, which is the identification of the appropriate product formulation given desired property requirements [1]. The focus of the case study in this chapter is the fuel-additive, which falls under the category of an engineering material. The focus of the design is on field performance and not on some inherent property of the material that exists totally independent of its interaction with its application environment. The design of engineering materials presents the following questions: (i) What are the performance indicators? (ii) Can they be reliably determined with a fundamental approach for a given design? (iii) What is the testing necessary to obtain the performance measure and what is the quality of the experiments/data? (iv) What are the factors affecting synthesis of proposed designs? The above questions contain the keys to constructing the strategy to solve the forward and inverse problems. In this respect, both the forward and the inverse problems have to capture within them the functional aspects of how structure relates to performance. This involves the identification of sub-structural abstractions (building blocks) that translate function into structure through a mechanistic understanding of product performance. The above questions also reflect the uncertainties and unknowns specific to the design problem at hand, and the solution framework must be robust enough to handle them as well.
15.1.1. Design-Relevant Building Blocks
This chapter deals with a case study in the computer-aided molecular design of fuel-additives. Fuel-additives are but one example in the ever-growing class of engineering materials. The design problem involves some aspects well addressed by a traditional CAMD approach and some that are not. The idea behind the approach presented here is the identification and modeling of design-relevant building blocks. These structural (and functional) elements tie together the forward and inverse problems. They reflect, in a transparent fashion, the design process undertaken by an expert formulator in the area. The schematic in Figure 1 further explains this idea. The figure shows the forward and inverse approaches for molecular design along with the formulation and testing cycle for the example of a fuel-additive. It is clear from the figure that both the forward and the inverse approaches are bound to the hierarchy of formulation via key building blocks in the design process. In some cases these building blocks are quite straightforward. For example, in a mixture design problem based on existing compounds, the building blocks are just the decision variables of mixing fractions. However, in a more involved problem such as that of additive design here, the building blocks can be anything from atoms to molecular fragments to synthesized formulations,
all of which influence the final performance. More importantly, these building blocks are not determined by either the forward or inverse strategies but by the formulation process and the physical phenomena at play in the interaction between the product and the intended environment of its use. The most accurate forward models may involve the consideration of electronic and atomic descriptors but the level of control accorded to the formulator (synthesizability constraints) may exist at the molecular level. On the other hand, using only molecular level descriptions might not capture the essentials of the physics that translates structure to property and further into performance. These sub-structures that behave as ideal performance consolidators for the formulation are the design-relevant building blocks of the CAMD problem. The contributions from these building blocks to eventual product performance through a phenomenological model or description are the functional descriptors.
Figure 1: Forward and Inverse Problems in Additive Design: Parallels to Formulation and Testing Cycle
The design-relevant building blocks and their corresponding functional descriptors are identified in Figure 1 for a general fuel-additive design problem. It is noted here that the functional descriptors are holistic and are usually obtained through a first-principles model. In this work, the approach was to mirror the design effort of the formulation chemist as closely as
possible and then to incorporate the model of the underlying physical phenomena to guide that computer-aided process. This gives a dual advantage of building in implicit synthesizability constraints and developing accurate and, more importantly, sensitive decision variables for the inverse design process. Once these design-relevant blocks are ascertained, the approach in the forward problem is to de-couple the fundamental and correlative aspects of performance prediction using a hybrid model. The inverse problem is handled by an evolutionary design approach for reasons outlined later in the chapter. Primarily, the evolutionary strategy is flexible enough to completely disengage the inverse problem from the functional characteristics (linear, nonlinear, etc.) of the prediction algorithm. It allows the designer the freedom to completely replace the forward model in the future with minimal implementation changes in the evolutionary scheme. When progress in experimentation affords prediction models richer in detail and complexity, or when other performance metrics can be modeled, the design framework is essentially plug and play. The designer can then quickly discern how design decisions are altered by the newly modeled effects and different performance criteria.
15.2 PROBLEM DEFINITION: DESIGN OF FUEL ADDITIVES
Fuel-additives are a class of performance modifiers that are added to gasoline to enhance certain properties and/or to provide additional properties not present in the gasoline. Fuel additives are used as combustion modifiers, antioxidants, corrosion inhibitors and deposit control detergents. This effort is focused on the design of fuel-additives that control the deposit formation on the intake-valves of the automobile. Figure 2 shows the schematic of the position of the intake-valve and the interacting components in the automobile. It also shows a schematic of the intake-valve and manifold. The intake-valve forms the opening into the combustion chamber. The fuel-injection nozzles spray gasoline directly on the intake-valve. When the valve opens, it draws a mixture of fuel and air into the combustion chamber, where it is burned to supply power to the automobile. Over a period of time, deposits form on the surface of the intake-valve [2,3]. These are the intake-valve deposits (IVD). These deposits have been documented to affect driveability, cold-start efficiency, knock characteristics and emissions [4,5,6]. The US Environmental Protection Agency (EPA) [7] has adopted a standard test to determine the deposit forming tendency of a fuel package (gasoline + additives) before approving commercialization of the package. The EPA formally adopted the ASTM Standard BMW-IVD test [8] as the performance indicator for the fuel package containing the gasoline and the additive formulation. A 4-cylinder 1985 BMW vehicle is operated over the road for a total of 16,093 km. The daily test cycle consists of 10% city, 20% suburban and 70% highway mileage with an overall average speed of 45 mph. A fuel package must produce an average deposit of less than 100 mg/valve for
certification [3]. The function of the class of fuel-additives we are interested in is the prevention of deposition on the valve by aggregating the precursors of deposit formation produced when the fuel flows through the intake valve and holding them in solution. The mechanism of intake-valve deposit formation is a complex one. Additive and fuel chemistry, operating conditions and flow properties of the additive, fuel and oil all play significant and interacting roles in determining the nature and amount of deposit. The test itself is quite expensive, costing about $8000-$10000 per run. In addition, the above-mentioned controlling parameters are not measured with consistent accuracy, if at all. Sometimes alternate tests that are simpler and less expensive are used in lieu of the regulatory benchmark to lower the expense of the design cycle. These factors make the available formulation vs. performance data sets sparse, noisy and at best mildly inconsistent.
Figure 2: Intake-Valve and Combustion Chamber Manifold
At the outset, we are given the chemical make-up, including detailed structures, of the fuel-additive molecules that comprise a given database of engine test results. The engine test results are from the BMW engine test runs or from equivalent engine tests. The engine test results are in the form of an intake-valve deposit measurement after the standard test run. The database also contains the fuels used in the engine and some fuel characteristics such as the boiling curve. Approximate values of the operating conditions such as temperature are also reported in these databases. These databases form the starting point of the additive design case study. The fuel-additives are actually packages consisting of more than one component/molecule. However, without loss of generality it can be assumed that the principal function of deposit prevention/removal is performed by a core
group of molecules in the package and that all of them have the same or similar functional features that contribute to their performance. The molecular problem then consists of two parts.
i. Given the structure of the fuel-additives and their dosages in a formulation package, predict, to the level of accuracy of the engine or fleet test result, the expected intake-valve deposit on the BMW or an equivalent test.
ii. Given a set of operating conditions that include fuel characteristics and an intake-valve deposit cut-off or similar criteria, determine the molecular structures of the additives that will, or are at least expected to, meet the criteria under those conditions.
15.3 CAMD PROBLEM FORMULATION
Given the expensive testing methods and the apparently intuitive nature of the formulation process, the design of fuel-additives stands to gain tremendously from a computer-aided design process. The aim of this CAMD process is to rationalize both the forward and the inverse approaches to the product design problem (steps (i) & (ii) above). The CAMD formulation in generic terms follows in a straightforward manner from the problem definition. This is depicted with problem-specific detail in Figure 3. The different components of the CAMD formulation are as follows:
15.3.1 Forward Problem
1. Identify the dominant mechanism that determines the performance of the final product, in this case the fuel-additive.
2. Determine the structural components (these could be molecular fragments, atoms or even electronic level components) that are the key players in the performance determining mechanism. These are the design-relevant building blocks.
3. Determine whether the performance indicators of interest could be directly estimated from a mathematical model of the mechanism at a small or reasonable computational expense.
a) If such a description exists then this would serve as our forward model.
b) If not, then an additional statistical model would be required that relates the "outputs" of the mechanism to the performance indicators of interest. Additionally, when several such "outputs" could be extracted from the fundamental model, the optimal set will be determined based on accurate correlation to eventual performance measures.
In cases where no mathematical description is feasible or inexpensive, a purely correlative model could be used to relate performance measures to the functionalities of the design-relevant building blocks. In this case study, an evolutionary search procedure is employed to locate additive candidates that are expected to satisfy pre-set performance criteria. The solution to the
inverse problem is one of combinatorial optimization and all techniques usually applied to the formulation and solution of this problem are applicable here as well. However, we anticipate a hybrid model with a nonlinear phenomenological component as well as a nonlinear statistical model (including neural networks). This makes for a non-linear combinatorial optimization problem involving the use of a black-box objective. Widely used techniques such as knowledge-based enumeration [9], graph theoretic approaches [10] and mathematical programming [11,12] have limited effectiveness in this situation. Stochastic methods such as random generation, simulated annealing and genetic algorithms (GAs) [13,14] are likely candidates because they function independently of the nature of the objective function. Previous work in computer-aided molecular design has also demonstrated GAs to be flexible in capturing the rich underlying chemistry [15,16]. Moreover, they are robust to non-linearities and hence powerful procedures for global search. Evolutionary algorithms are based on the Darwinian model of evolution. A random change followed by natural selection is the principle behind successive screening of populations of solution candidates [13]. At every stage new solutions are created from the current population using genetic operators. The genetic operators provide the moves in the structural space of possible designs. A mutation operator replaces one component of a solution with a randomly chosen but appropriate choice. A crossover operator is applied to two solutions at once, where a randomly chosen component of one solution is exchanged with an appropriate component of the other solution. Every solution in the population is evaluated using the forward model to determine its performance quality or fitness. The next generation is selected from the current set in a random but fitness-proportionate manner.
By maintaining a balance between fostering good building-blocks within the population and introducing random variations from without, the GA provides an environment for optimization to occur. The steps involved in the inverse design phase of the CAMD are as follows.
15.3.2 Inverse Problem
1. Determine the criteria for the performance indicators for additive candidate selection.
2. Identify the choices available under each type of the design-relevant building block.
a) Additionally determine if the above building blocks themselves could be put together from more fundamental structural units.
b) Choose (usually) the smaller of the two sets. This will be the base group set for the evolutionary search algorithm.
3. Identify the rules that govern the construction of the final product from the design-relevant building blocks and, if required, the construction of the building blocks themselves from more basic units. These will determine the constraints to be imposed on the genetic operations.
4. Determine suitable genetic operators. These should include at least the following types.
a) A random candidate generator
b) Mutation: A move that replaces a single component of a candidate additive with a suitable choice.
c) Crossover: A move involving two additive candidates where one or more components of one additive are exchanged for one or more of the other to produce two offspring.
The details of each step with regard to the specifics of the fuel-additive design problem (Figure 3) are taken up in the next section.
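The three operator types listed in step 4 can be sketched generically as follows. This is a minimal illustration, not the chapter's implementation: the slot names, the choice sets in `CHOICES` and the labels such as `"H1"` are hypothetical placeholders for the base group set, and no chemical feasibility rules are applied here.

```python
import random

# Hypothetical base group set: allowed choices for each building-block slot.
CHOICES = {
    "head": ["H1", "H2", "H3"],
    "linker": ["L1", "L2"],
    "tail": ["T1", "T2", "T3", "T4"],
}
SLOTS = ("head", "linker", "tail")


def random_candidate(rng):
    """Random candidate generator: one allowed choice per slot."""
    return {slot: rng.choice(CHOICES[slot]) for slot in SLOTS}


def mutate(candidate, rng):
    """Replace a single, randomly chosen component with a different allowed choice."""
    child = dict(candidate)
    slot = rng.choice(SLOTS)
    alternatives = [c for c in CHOICES[slot] if c != child[slot]]
    child[slot] = rng.choice(alternatives)
    return child


def crossover(parent_a, parent_b, rng):
    """Exchange one randomly chosen component between two parents; two offspring."""
    slot = rng.choice(SLOTS)
    child_a, child_b = dict(parent_a), dict(parent_b)
    child_a[slot], child_b[slot] = parent_b[slot], parent_a[slot]
    return child_a, child_b


rng = random.Random(0)
p1, p2 = random_candidate(rng), random_candidate(rng)
m = mutate(p1, rng)
c1, c2 = crossover(p1, p2, rng)
```

Passing an explicit `random.Random` instance keeps the moves reproducible, which is convenient when comparing search runs.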
Figure 3: Overall CAMD formulation for additive design to minimize IVD
15.4 SOLUTION STRATEGY
15.4.1 Forward Problem
As described in Section 15.2, the engine testing for intake-valve deposit determination is a long and expensive process. The lack of controlled experiments and a dearth of measured fundamental parameters make a purely first-principles forward model extremely difficult to determine and impossible to verify. On the other hand, noise, consistency and sparsity problems in the data sets make purely statistical or correlative approaches inaccurate and unreliable from a mechanistic viewpoint. However, formulation chemists in the industry often work with the same quality and amount of data and are able to produce marginal to significant improvements
in the formulations. They use a hybrid approach that combines the best possible phenomenological description of fuel-additive performance with the intangible but important ability to intuit performance from structure. The essence of this work, and indeed of any CAMD for engineering materials, is to rationalize this approach. We do this by hybridizing the fundamental and knowledge-based components using a first-principles + statistical/neural-network regression approach. We bind this forward modeling technique to the process of formulation by the identification and use of design-relevant building blocks at the outset. Before any models can be determined, we need a mechanistic description of the chemistry of deposit removal. Figure 4 shows schematically how a fuel-additive molecule works in a fuel+oil milieu. The fuel-additive is sustained in solution by tail-like components in its structure, which we simply refer to as the tail. These polydisperse structures are strongly bound to a "deposit-attracting" core through a chemical block called the linker. The core of the molecule that performs the activity of scavenging the deposit forming precursors is referred to as the "head". In effect, we have three major "functional" components in the fuel-additive molecule that serve as the design-relevant building blocks. Any changes to the structure of the fuel-additive molecule, be they small substitutions or large structural changes, affect the overall performance of the additive via their effect on the roles of the head, linker and tail. The head, linker and tail components act as performance consolidators of all the structure/chemistry of the additive molecule. As shown in Figure 4, the stability of the additive is the chief performance determining property of the additive in the fuel. As the additive breaks down in the fuel milieu, the stability of the additive degrades and so does its ability to sustain the deposit-scavenging function.
The primary functional descriptor capturing the interacting roles of the head, linker and tail is the time varying stability of the additive molecule in the gasoline. The kinetics of additive degradation can be modeled as a series of differential equations on the "effective additive length". As mentioned previously, the tails are polydisperse and hence they have a distribution of lengths. The "effective additive length" is the total length of the additive molecule including the head, linker and the tail. With first-order degradation kinetics (a reasonable assumption), the concentration of effective additive lengths is given by equation (1).

$$
\begin{aligned}
\frac{dX_{H}}{dt} &= k_H \sum_{i=0}^{N} X_{H+L+i} \\
\frac{dX_{H+L}}{dt} &= -k_H X_{H+L} + k_L X_{H+L+1} + k_L \sum_{i=2}^{N} X_{H+L+i} \\
\frac{dX_{H+L+1}}{dt} &= -(k_H + k_L)\, X_{H+L+1} + k_T \sum_{i=2}^{N} X_{H+L+i} \\
\frac{dX_{H+L+j}}{dt} &= -\Big\{ k_H + k_L + \sum_{i=2}^{j} k_T \Big\} X_{H+L+j} + k_T \sum_{i=j+1}^{N} X_{H+L+i}, \qquad 1 < j \le N-1 \\
\frac{dX_{H+L+N}}{dt} &= -\Big\{ k_H + k_L + \sum_{i=2}^{N} k_T \Big\} X_{H+L+N}
\end{aligned}
\tag{1}
$$
$X_i$ is the concentration of additive of length 'i' units. The k's are the different rate constants for bond breakage. The variables in the above equation are explained in Figure 4. Using this model, the distribution of the structures of the additive can be determined as a function of time.
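As a rough illustration of how such first-order scission kinetics can be integrated in time, the sketch below assumes a single linear chain (head, linker, then up to N tail units) with three rate constants: `k_h` for the head-linker bond, `k_l` for the linker-tail bond and `k_t` for each tail-tail bond. The name `k_t`, the numerical values and the explicit Euler scheme are assumptions made for this example and are not taken from the chapter.

```python
def degradation_rates(x, k_h, k_l, k_t):
    """Right-hand side of the assumed first-order scission model.

    x[j] is the concentration of intact additive carrying j tail units
    (j = 0 is head + linker only). Returns (d(free head)/dt, [dx_j/dt]).
    """
    n = len(x) - 1
    dx = [0.0] * (n + 1)
    # Free head is released whenever a head-linker bond breaks.
    dx_head = k_h * sum(x)
    # Head + linker: lost by head-linker scission, formed when a tail detaches.
    dx[0] = -k_h * x[0] + k_l * sum(x[1:])
    for j in range(1, n + 1):
        loss = k_h + k_l + (j - 1) * k_t   # j - 1 tail-tail bonds can break
        gain = k_t * sum(x[j + 1:])        # longer chains cut back to j units
        dx[j] = -loss * x[j] + gain
    return dx_head, dx


def integrate(x0, k_h, k_l, k_t, t_end, dt=1e-3):
    """Explicit Euler integration; adequate for a sketch, not for stiff systems."""
    x_head, x = 0.0, list(x0)
    for _ in range(int(round(t_end / dt))):
        dx_head, dx = degradation_rates(x, k_h, k_l, k_t)
        x_head += dt * dx_head
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]
    return x_head, x


# Hypothetical start: all additive carries the full tail of N = 4 units.
x0 = [0.0, 0.0, 0.0, 0.0, 1.0]
x_head, x = integrate(x0, k_h=0.05, k_l=0.1, k_t=0.2, t_end=5.0)
```

Under these assumptions the total amount of head-containing species is conserved, which gives a quick consistency check on any rate expressions of this type.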
Figure 4: Function of additive molecules in the fuel
The objective of the first-principles modeling is not so much to capture in exact quantitative detail the different mechanisms involved, but to acknowledge the physics behind the relevant mechanisms in order to get a rank-ordering of the performance of different additives. Figure 5 shows some generic structures of additives. Additives may contain linkers and heads with more than one site for tails and linkers respectively (Types II & III in Figure 5). Also, more than one tail might appear in succession along a single branch of a linker (Variant 2 in Figure 5). The concentration of "effective lengths" of the additive structures with more than one tail in parallel is obtained as a sum from a joint distribution of several additive structures with single tail strands of various lengths. Once this is done, the time-concentration behavior of the additives with different tail lengths is obtained as a solution to the above set of differential equations. The degradation of the additive is assumed to take place between units, primarily across the weakest links between them. This is similar to pyrolysis and the activation energies for this breakage are determined in an analogous manner.
Figure 5: Generic topology and connectivity of additive molecules
The stability or solubility of a solute (additive) in a solvent (fuel) is determined by the relative cohesive energy densities of the solute and the solvent. The cohesive energy density is one of the most widely used indicators of solubility/miscibility and is characterized by the Hildebrand parameter [17]. This is a measure of the internal-energy density and represents the amount of energy required to move two molecules of a species to infinite separation in solution. The extent of solvation of a substance A in a solvent B depends on how close the cohesive energies of the two substances are. The condition for solubility (and hence stability) becomes
$$
\left| \delta_{Additive}(t) - \delta_{Fuel} \right| \le \lambda_{soluble}
\tag{2}
$$
$\lambda_{soluble}$ is a pre-set solubility bound, which is usually less than 5 [18]. $\delta_{Additive}$ and $\delta_{Fuel}$ are the Hildebrand parameters of the additive and fuel respectively. As the additive degrades, its structure and therefore its Hildebrand parameter value changes over time in the fuel. For a given value of $\delta_{Fuel}$ and $\lambda_{soluble}$, the exact fraction of additives that meet the solubility criterion can then be determined by applying (2). Since the additive needs to remain solvated as long as possible in order to continue removing deposit precursors, the time varying solubility measure becomes a key indicator of its stability and hence its overall performance in the fuel. We define the amount of additive that is solubilized in the fuel at any point in time as the amount of active additive in the fuel.
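A minimal sketch of how the active fraction could be evaluated at one instant is shown below. It assumes that a length distribution (mole fractions) and a Hildebrand parameter for each effective length are already available; the function name `active_fraction` and all numerical values are hypothetical.

```python
def active_fraction(mole_fractions, deltas, delta_fuel, lam_soluble):
    """Fraction of additive whose Hildebrand parameter satisfies the solubility bound."""
    total = sum(mole_fractions)
    active = sum(
        x for x, d in zip(mole_fractions, deltas)
        if abs(d - delta_fuel) <= lam_soluble
    )
    return active / total if total > 0 else 0.0


# Hypothetical snapshot of the length distribution at one instant in time.
x = [0.1, 0.2, 0.3, 0.4]           # mole fraction of each effective length
delta = [24.0, 20.5, 18.0, 16.5]   # Hildebrand parameter of each structure
frac = active_fraction(x, delta, delta_fuel=15.5, lam_soluble=3.0)
```

Repeating this evaluation over a series of time points gives the activity vs. time curve for a candidate additive.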
The Hildebrand parameter can be estimated by group contribution methods, of which the modified Hansen's method [18,19] is the most suitable for this case. Hansen's group contribution method determines three separate contributions to the Hildebrand parameter or the cohesive energy of a molecule. These are due to dispersion, polarity and hydrogen bonding. A molecule is first split up into a set of functional groups that have fixed contributions to each of the above terms. The molar volume of each functional group is also estimated from group contribution (of Hansen [18]). The Hildebrand parameter for a molecule containing $N_f$ functional groups is estimated by equation (3) [18,19]. $\delta_d$, $\delta_p$ and $\delta_h$ are the dispersion, polar and hydrogen bonding contributions for the entire molecule, and $F_{i,d}$, $F_{i,p}$ and $U_{i,h}$ are the functional group contributions to each of those three terms respectively. $V_i$ is the group contribution to the molar volume of the molecule from functional group 'i'. $\delta_i$ are the Hildebrand parameters estimated for each species, $\delta_T$ is the total Hildebrand parameter value of all the additive in the fuel and $X_i$ is the mole fraction of an additive molecule with an effective length of 'i' units. The mole fractions are also distributions (varying with effective length) changing over time due to the degradation reactions. At a given instant of time, one can then determine the additive length distribution curve. Each point on this curve refers to an "effective" additive structure whose components in terms of head, linker and number of tail units can be determined. From the above group contributions the Hildebrand parameter (cohesive energy density) of each additive molecule in the distribution at a given instant in time can be calculated.
Using the criterion of equation (2) then, for a given $\lambda_{soluble}$ the amount of active additive in the fuel can be determined as a function of time. This is the first-principles component of the forward model.
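[Equation (3): group-contribution expressions for $\delta_d$, $\delta_p$, $\delta_h$ and the molar volume $V$ of a molecule with $N_f$ functional groups; not legible in the source.]

$$
\delta_T = \sum_{i=1}^{N} X_i(t)\, \delta_i
\tag{4}
$$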
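For orientation, the sketch below computes a Hildebrand parameter from group contributions using the standard Hoftyzer-Van Krevelen form of Hansen's method, and then the mole-fraction-weighted total for a distribution of structures. This standard form is an assumption: the modified method used in the chapter may differ in detail. The entries in `GROUPS` are illustrative numbers only and should not be read as tabulated data.

```python
import math

# Illustrative group table (assumed values): name -> (F_d, F_p, U_h, V)
GROUPS = {
    "CH3": (420.0, 0.0, 0.0, 33.5),
    "CH2": (270.0, 0.0, 0.0, 16.1),
    "OH": (210.0, 500.0, 20000.0, 10.0),
}


def hildebrand(group_counts):
    """Total solubility parameter of a molecule given {group name: count}."""
    f_d = sum(n * GROUPS[g][0] for g, n in group_counts.items())
    f_p_sq = sum(n * GROUPS[g][1] ** 2 for g, n in group_counts.items())
    u_h = sum(n * GROUPS[g][2] for g, n in group_counts.items())
    v = sum(n * GROUPS[g][3] for g, n in group_counts.items())
    delta_d = f_d / v                   # dispersion contribution
    delta_p = math.sqrt(f_p_sq) / v     # polar contribution
    delta_h = math.sqrt(u_h / v)        # hydrogen-bonding contribution
    return math.sqrt(delta_d ** 2 + delta_p ** 2 + delta_h ** 2)


def total_delta(mole_fractions, deltas):
    """Mole-fraction-weighted Hildebrand parameter of the additive distribution."""
    return sum(x * d for x, d in zip(mole_fractions, deltas))


alkane = hildebrand({"CH3": 2, "CH2": 4})
alcohol = hildebrand({"CH3": 1, "CH2": 5, "OH": 1})
delta_t = total_delta([0.5, 0.5], [alkane, alcohol])
```

In this form a purely hydrocarbon structure has only a dispersion contribution, so adding a polar or hydrogen-bonding group raises the estimated parameter.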
The first-principles model predicts the fuel activity as a function of time. What remains is to correlate this activity vs. time curve to the eventual performance indicator, which in this case is the amount of intake-valve deposit. The forward branch of Figure 1 illustrates how one might accomplish this. The functional descriptors obtained from the first-principles model will
have to be correlated against the IVD data from the databases. Both linear and non-linear regression models should be explored. A point to note is that the $\lambda_{soluble}$ parameter can be varied within its normal bounds to obtain different curves (for every datapoint in the database), and the data picked from the corresponding curves for all the datapoints can then be regressed against the intake-valve measurements. The $\lambda_{soluble}$ value corresponding to the best regression model is then chosen as the optimal bound. For this case study, neural networks [22] are the nonlinear method of choice. They are relatively function free and easy to implement. Moreover, we do not have large fundamental constraints on the regression models and this makes the situation further attractive for the use of neural networks. Linear models as well as different architectures of neural networks are implemented and the optimal one determined based on accuracy in prediction. The results are discussed in Section 15.5.1.
15.4.2 Inverse Problem
The design problem involves the construction of optimal fuel-additive molecules given desired IVD requirements. For reasons outlined earlier, an evolutionary search is employed to achieve this. Unlike deterministic approaches such as mathematical programming, which contain a formulation phase and a solution phase, evolutionary approaches usually contain only a solution phase. While the details specific to the problem are explicitly modeled in the formulation phase of a math programming approach, these have to be dealt with in the solution phase for an evolutionary method. This implies that each evolutionary algorithm differs in most aspects from one applied in a different domain. However, major components of evolutionary search procedures have some common spirit even across different application domains. These aspects were outlined previously. For the fuel-additive design problem the components of an evolutionary design procedure are customized as follows.
Representation: With the identification of the design-relevant building blocks this step is straightforward. We choose to represent an additive molecule as containing a head, linkers and tails. But there are constraints on how these components are put together, based on their chemical make-up as well as rules of synthesizability. To accommodate these rules, the head, linker and tail representation is recast into an object-oriented representation as shown in Figure 6. Under each object category, information about generic object properties such as compatibility lists, connectivity and group contributions is retained. In addition, when these objects are connected to form additive molecules, specific adjacency information is also retained in the object structure.
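One way such an object-oriented representation might look is sketched below. The class and field names (`Head`, `Linker`, `Tail`, `site_compat`, `compatible_tails`) and the example labels are hypothetical, chosen only to mirror the properties named above: compatibility lists, connectivity, group contributions and adjacency.

```python
from dataclasses import dataclass, field


@dataclass
class Tail:
    name: str
    n_units: int                                  # number of repeat units
    groups: dict = field(default_factory=dict)    # group counts for contributions


@dataclass
class Linker:
    name: str
    tail_sites: int                               # connectivity: tails it can carry
    compatible_tails: frozenset                   # compatibility list (tail names)
    tails: list = field(default_factory=list)     # adjacency: attached tails

    def attach(self, tail: Tail) -> None:
        if len(self.tails) >= self.tail_sites:
            raise ValueError(f"{self.name}: no free tail site")
        if tail.name not in self.compatible_tails:
            raise ValueError(f"{tail.name} is not compatible with {self.name}")
        self.tails.append(tail)


@dataclass
class Head:
    name: str
    site_compat: dict                             # site id -> compatible linker names
    linkers: dict = field(default_factory=dict)   # adjacency: site id -> Linker

    def attach(self, site: str, linker: Linker) -> None:
        if site in self.linkers:
            raise ValueError(f"{site} is already occupied")
        if linker.name not in self.site_compat[site]:
            raise ValueError(f"{linker.name} cannot bind to {site}")
        self.linkers[site] = linker


head = Head("H1", {"site1": {"L1", "L2"}, "site2": {"L1"}})
linker = Linker("L1", tail_sites=2, compatible_tails=frozenset({"T1"}))
linker.attach(Tail("T1", n_units=12))
head.attach("site2", linker)
```

Putting the compatibility checks inside `attach` means every construction step either yields a feasible structure or fails immediately, so generation and the genetic operators can share the same checks.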
Feasibility Rules: Chemical and synthesizability rules for the design problem basically fall into two categories: (i) Disallowed combinations of head-linker
and/or linker-tail pairs. (ii) Feasible construction based on existing connectivity. The second category is transparently imposed via the object-oriented structure. Since we keep track of the actual connectivity (valence) of the head, linker and tail as well as the connectedness information in the molecule, feasibility can be enforced during generation, mutation and recombination (crossover). The first rule is also complicated by the fact that a head may contain more than one type of site, not all of which are compatible with a given linker. Imposing the first rule during generation is straightforward. However, the genetic operators are modified from their generic counterparts to seek feasibility as well as recover feasibility after operation.
Figure 6: Object-oriented representation for head, linker and tail in evolutionary search
Genetic Operators: Mutation and Crossover are the genetic operators of choice for this problem. Unlike the operators widely used for evolutionary methods in literature, we can customize them to better reflect the chemistry of the additive design problem. The first step is to bind these operators to the component being operated upon. For each of the four components of head, linker, tail and branch (a linker+tail path when several such structures are connected in parallel to the head), one mutation and crossover operator is created. Within each of these operators, we also need to reflect the feasibility rules to ensure legitimate product candidates. The outline of feasibility
enforcement within the operators is similar across operators and only two will be discussed here.
Branch Crossover: The crossover operation involving one branch of a fuel-additive molecule with a branch from another molecule is shown in Figure 7. The constraint to be considered is one of head-linker compatibility. The linker attached to one branch should be compatible with the linker site on the head component of the other molecule. This is enforced as shown in Figure 7. During the pairing phase, after one parent has been chosen, the second parent is chosen so that it contains at least one branch that is compatible with a site on the head of the first parent. As shown in the example in Figure 7, Branch-1 is compatible with both sites on the head of its parent (Parent-1), as well as the single site on Parent-2. On the other hand, the single branch of Parent-2 can go only on Site-1 of Parent-1. A simple crossover of Branch-1 of Parent-1 with the single branch of Parent-2 results in infeasibility. This is averted by switching the Branch-1 of Parent-1 onto Site-2 and moving the crossed over branch from Parent-2 onto that vacant site. Similar operations can be (and are) defined for the other crossover operators.
Figure 7: Branch crossover operator
Linker Mutation: This operation replaces a linker on a chosen parent with a compatible linker from the base set. This is done in two steps. First, a linker is chosen at random from a set of linkers compatible with the head-linker attachment site. This linker is used as the replacement if a sufficient number (as many as the branching of the new linker) of tails on the removed linker are compatible with the sites on the new one. If not, different rearrangements
of the tails are explored to identify a compatible configuration. If no such arrangement can be found, this linker is dropped from the list of compatibles (only for this instance of the selection procedure) and a different linker component is chosen. If the chosen linker has more branch points than the linker that it is replacing, the additional branches/tails are chosen at random, but compatible with the vacant attachment sites on the linker. If the chosen linker has fewer branch points, the surplus tails after feasible assignment are discarded. The crossover rate is set at 0.60 and the mutation rate at 0.40. The rate for each individual type of crossover and mutation is the same within the class. The large mutation rate is warranted by the non-binary representation.
Fitness Function: The hybrid forward model can be de-coupled to design either for maximum additive stability (output of first-principles model) or for low intake-valve deposit performance (final output of the hybrid model). The fitness function returns a value characterizing the quality of the solution, which is usually a number between zero and one, the larger being more desirable. For the solubility objective, the fitness function is directly defined as the fraction of the initial amount of additive that is active (at a prespecified time t) based on the definition of activity given earlier. For the IVD based objective, the fitness function is defined as follows:
$$F = \begin{cases} e^{-\alpha\,(IVD_{pred} - IVD_{ref})}, & IVD_{pred} > IVD_{ref} \\ 1, & IVD_{pred} \le IVD_{ref} \end{cases} \qquad (5)$$
An α value of 0.002 is used for this case study. The various steps involved in the evolutionary design algorithm are shown in Figure 8. The initialization for the algorithm is a single randomly generated (but feasible) lead structure. Copying the single lead structure as many times as the population size gives the very first generation. The fitness of each individual in the population is evaluated and the population is sorted accordingly. The top few solutions in each generation are retained into the next generation. A fitness proportionate selection is employed to select parents who then undergo mutation and crossover to form the rest of the population. The cycle is continued until some pre-specified termination criteria are met.
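A minimal sketch of the IVD fitness function and of one generation of the search loop may help fix ideas. It assumes the exponential-penalty reading of Equation (5), that is, F = exp(-α(IVD_pred - IVD_ref)) above the reference value and F = 1 otherwise. The reference value of 10 mg is an assumption taken from the design target quoted later in the chapter, and the names `ivd_fitness`, `next_generation`, `vary` and `n_elite` are hypothetical. The `vary` callable stands in for the mutation and crossover operators, which are not reproduced here.

```python
import math
import random

def ivd_fitness(ivd_pred, ivd_ref=10.0, alpha=0.002):
    """Fitness in (0, 1]: 1 at or below the reference IVD,
    exponentially decaying above it (assumed form of Equation (5))."""
    if ivd_pred <= ivd_ref:
        return 1.0
    return math.exp(-alpha * (ivd_pred - ivd_ref))

def next_generation(population, fitness, vary, n_elite=2, rng=random):
    """Build one new generation: keep the n_elite fittest individuals,
    then fill the remainder with offspring of parents drawn by
    fitness-proportionate (roulette-wheel) selection."""
    ranked = sorted(population, key=fitness, reverse=True)
    weights = [fitness(ind) for ind in ranked]
    children = list(ranked[:n_elite])
    while len(children) < len(population):
        p1, p2 = rng.choices(ranked, weights=weights, k=2)
        children.append(vary(p1, p2))
    return children

# Toy usage: individuals are represented directly by a predicted IVD (mg),
# and "variation" just averages the two parents.
population = [5.0, 12.0, 30.0, 50.0]
new_population = next_generation(
    population, ivd_fitness, lambda a, b: (a + b) / 2,
    n_elite=2, rng=random.Random(0),
)
```

In a real run the individuals would be additive structures and `fitness` would call the hybrid forward model before applying `ivd_fitness`. The elitism step corresponds to retaining the top few solutions in each generation.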
15.5 RESULTS AND DISCUSSION
The database for the case study consists of engine test results (provided by the Lubrizol Corporation) based on three different engines, which are referred to as the BMW, Honda and Ford databases. The intake-valve deposit measurement is the performance indicator of interest here. For each test, the databases also provide the structure of the additive package used and some characteristics of the gasoline. The solubility distribution of additives of different effective length as a function of time is determined from the first-principles model. Using the fuel character, a reasonable fuel cohesive energy density is determined and the active additive concentration profile is calculated. Both linear and neural-network models are examined. As mentioned earlier the stability bound (δ_soluble) is adjusted to get the best correlative curve. The amount of active additive remaining in the fuel at different times is used as the input to the regression model.
Figure 8: Evolutionary Design of Fuel Additives
15.5.1 Forward Model
Figure 9 (reproduced from Sundaram et al. [20]) shows the cohesive energy density profile (characterized by the Hildebrand parameter) of different additives from the BMW database. It is clear that the fundamental model is quite successful in capturing significant stability differences between the additive packages. Fraction 2 is a Type III structure and Fraction 1 is a Type II structure. The larger non-polar (more soluble) contribution from the additional tails leads to a smaller initial value of the Hildebrand parameter for Fraction 2. But as the tails start degrading over time, the presence of a highly polar additional linker in Fraction 2 has a significant contribution to the final Hildebrand value, making it less soluble. Two additives that perform similarly in a given fuel over a particular time scale may indeed have different stability if either the fuel or the time scale of interest is different. The differences between Package 2 & Package 3 are primarily due to the different polar contributions of their core fragments. The first-principles model, as demonstrated by the above results, is able to distinguish between the chemical nature of the sub-structures in different additives and pick out the stability contributions from the topology or connectivity of the additive molecules.
Figure 9: Solubility profiles for different packages in the BMW database (Reproduced from [20])
As a second level of validation of the model, we try to ascertain the importance of the stability argument in the eventual intake-valve deposit performance of the additive. This is an underpinning assumption for the regression model. To this end, the additive packages from the FORD database (refers to the type of engine test used) are assigned relative quality measures. An expert assigned these relative quality indicators based on previous experience with these additive packages. The solubility predictor was compared against the expert-assigned measure. The resulting correlation had an R² of 0.965. This demonstrates quite clearly that stability is a significant discriminator between additives.
Table 1: Comparison of performance of different regression models for IVD prediction from stability descriptors. Summarized from Sundaram et al. [20].
Database   Model                                 RMSE (mg) in Cross-Validation
BMW        QSAR                                  214
BMW        QSAR (PCA)                            172
BMW        QSAR (PLS)                            142
BMW        Solubility (NN: 4 Radial Basis)       124
Honda      Solubility (Linear)                   33
Honda      Solubility (NN: 2 Tan-Sigmoid)        35
Honda      Solubility (PLS-NN: 2 Tan-Sigmoid)    30.7
Honda      Solubility (PLS-NN: 3 Tan-Sigmoid)    31.2
Honda      Solubility (PLS-NN: 4 Tan-Sigmoid)    31.4
Variables, Projections (entries as listed): 36 None None 6 None 1 None 30 PLS (3) 30 None 30 PLS (4) 30 PLS (3) 30 PLS (3)
The regression models are developed to correlate the activity profile of the additive to the intake-valve deposit measurement. Both linear and neural network models can be constructed for this purpose. The models are developed, i.e. their parameters are determined, during the training phase, where only a part of the data set is used for this determination. In the testing phase, the model is presented with the rest of the data that was not used in training. The predictions in the testing phase are compared against the actual measured values and the error is taken to be representative of the quality of the regression model. In bootstrapped cross-validation mode, several different partitions of the data are made (into training and testing subsets) and the models are trained and tested on each partition. The overall average error during testing across all partitions is then reported as the quality of the model and the best of several competing models is chosen [21]. Neural networks are function-free nonlinear models. But their architectures in terms of the number of neurons, layers and the transfer functions can be varied. By varying these parameters, different architectures of the standard feed-forward network [22] can be examined. In addition the so-called PLS-NN architectures are also examined. Briefly, these are models where the neural networks are fit to the residuals of successive linear models extracted by applying the partial least squares (PLS) technique [23]. The PLS approach involves successive projections of the input variables into linear combinations based on maximum correlation with the output. For further details the reader is referred to the different papers in this area [23,24,25]. The results for the forward modeling effort are summarized in Table 1. The first column refers to the database. The second column indicates the model type.
The QSAR models refer to quantitative structure activity relationships and the ones mentioned in the second column refer to those in use at Lubrizol. The solubility models are the ones based on the functional stability description of the molecules. The second column also details the model type (such as linear, NN, PCA, PLS etc). The third column shows the number of variables in the model (before projections). The fourth column indicates the projection type and the number of projections eventually used in the modeling. The last column is the RMSE (in mg) based on cross-validation over 10 different partitions of the data sets. For the BMW database, the quality of the data is not very good. The data are engine test results over ten years and the experimental errors were quite large [26]. Adding to this is the small size of the dataset (92 points). Even with these limitations the hybrid model clearly outperforms the best QSAR models and the error reduction is quite significant. For the HONDA database the models perform up to the quality of the data. Linear models from the HONDA database perform quite closely to the neural networks. However, they were more sensitive to data partitioning and so the neural network models are favorable in this regard. The PLS-NN models perform a little better than the standard feed-forward architectures and are the model of choice for the database. The hybrid first-principles + regression approach to forward modeling is quite accurate given the sparseness of the data and large experimental errors. Additionally, this approach provides an intermediate indicator of stability (the amount of active additive) which by itself can be used as a relatively easy to measure performance and modeling standard.
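The cross-validation procedure used to compare the regression models in this section can be sketched as follows. This is a simplified illustration, not the procedure of [21]: the partitions here are drawn by repeated random splitting without replacement, the model is a one-variable least-squares line rather than a neural network or PLS model, and the names `fit_line`, `cross_validated_rmse`, `n_partitions` and `test_fraction` (with its value of 0.3) are assumptions. Only the use of 10 partitions and the averaging of the test error across partitions follow the text.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def cross_validated_rmse(xs, ys, n_partitions=10, test_fraction=0.3, seed=0):
    """Average test RMSE over repeated random train/test partitions."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    n_test = max(1, int(round(test_fraction * len(idx))))
    rmses = []
    for _ in range(n_partitions):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        # Parameters are determined on the training subset only.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # The error is measured on data not used in training.
        sq = [(ys[i] - (a + b * xs[i])) ** 2 for i in test]
        rmses.append(math.sqrt(sum(sq) / len(sq)))
    return sum(rmses) / len(rmses)

# Toy usage with synthetic, noise-free data (not the engine test data).
xs = [float(i) for i in range(20)]
ys = [2.0 + 3.0 * x for x in xs]
score = cross_validated_rmse(xs, ys)
```

Competing models would each be scored this way on the same data, and the one with the lowest average test error retained. Sensitivity to the partitioning, noted above for the linear Honda models, would show up as a wide spread of the per-partition values before averaging.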
15.5.2 Evolutionary Design of Fuel-Additives
Design for maximum solubility
Table 2 shows the results of the evolutionary search with the solubility as the objective. The objective is to find the additive molecule(s) that are most soluble in a given solvent (characterized by a solvent Hildebrand parameter shown in column one) at a given instant of time. The design procedure was allowed to run for 25 generations with a population of 25 molecules. The base set of design-relevant groups consisted of 25 heads, 9 linkers and 3 tails. The time instant of interest was then varied, keeping the fuel Hildebrand parameter fixed, to determine a different set of designs for maximum solubility. Table 2 shows the solubilized fractions of the additive for four different solvents (fuels) and 3 different time instants (τ = 1, 5 and 10). The solubility vs. time curves for the additive molecules identified to be highly soluble at τ = 1 are then used to determine their solubility at τ = 5 and 10. This is reported in columns 3(A) and 4(A) in Table 2. The solubility values are then compared with those of the additives designed for maximum solubility at times τ = 5 and 10. These values are reported in columns 3(B) and 4(B) of Table 2. It is clear from the table that the additive design is sensitive to the nature of the fuel and the time on stream. Additives that perform well at short times of contact need not do so for longer times and the difference is larger as the fuel becomes increasingly polar (larger δ). The results demonstrate that the evolutionary algorithm is successfully exploiting the differences in short and long term behavior between different additives through the use of the design-relevant building blocks.
Table 2. Evolutionary design of additives for stability: Results
Columns: δ (MPa^1/2) [Fuel]; 2(A) best design at τ = 1; 3(A) solubility at τ = 5 of best design at τ = 1; 3(B) best design at τ = 5; 4(A) solubility at τ = 10 of best design at τ = 1; 4(B) best design at τ = 10.
δ      τ = 1    τ = 5            τ = 10
       2(A)     3(A)    3(B)     4(A)    4(B)
19     0.96     0.77    0.79     0.55    0.64
21     0.96     0.82    0.85     0.7     0.73
23     0.92     0.67    0.72     0.45    0.61
25     0.42     0.31    0.41     0.08    0.5
Design for minimum IVD
The forward model for this run consisted of the complete hybrid model that predicts the expected intake-valve deposit given the structure of the additive, the fuel structure or character and the operating conditions. The regression model used is the PLS-NN model trained on the HONDA database. The objective now is to find designs that are predicted to produce an IVD close to or less than 10 mg. The fitness function of Equation (5) is used here. Again the evolutionary algorithm is allowed to run for 25 generations with a population size of 25. The rank, fitness, IVD and structural details of some of the top ten additive molecules are reported in Table 3, for three different runs. For proprietary reasons, the structures are not revealed. However the structure is described in terms of the generic structural/connectivity definitions used in Figure 5. Some structures contain commonly used components based on the additive molecules in the databases. However the combinations (of the head, linkers and tails) are novel, leading to different predicted performance. The best design giving a predicted IVD of 8.9 mg (structure 3-A) contains commonly used components and in fact is identified to possess good synthesis potential.
Table 3. Optimal solutions from the evolutionary design of fuel-additives for intake-valve deposit performance
Run Identifier   Fitness   Rank   IVD Predicted (mg)   Type                  Comments
1-A              0.997     1      11.4                 Type III              Novel Structure; Components not usually used in database
1-B              0.996     2      11.5                 Type III              Novel Structure
1-C              0.993     6      12                   Type II               Variant of structure in BMW database
2-A              0.999     1      10.1                 Type II               Novel Structure; Different from 1-A; Contains a rarely used linker type
2-B              0.989     2      12.6                 Type II               Slight variation of a commonly found structure in the databases
2-C              0.983     4      13.2                 Type II; Variant 2    A two tails in series variation of 2-B
3-A              1         1      8.9                  Type III              Novel Structure; Distinct from both 1-A & 2-A; High synthesis potential
3-B              0.994     2      11.9                 Type III; Variant 2   Variation of 3-A
3-C              0.993     3      12.1                 Type III              Different core compared to 2-B; An additional branch
15.6 CONCLUSIONS
Design of engineering materials involves design for performance instead of inherent properties. A CAMD procedure for engineering materials should capture the phenomenological underpinnings of how molecular structure interacts with the environment leading to performance. The linchpin in this approach is the identification of structural aggregates that consolidate the influential effects (on performance) of smaller units. They act as sensitive design-relevant building blocks and intimately tie the forward and inverse problems together. We have demonstrated through this case study how design-relevant building blocks can be identified for a real industrial problem in fuel-additive design. A hybrid model based on functional descriptors derived from these building blocks was implemented. The model captures the chemistry and physics of how additives sustain their deposit removal ability while maintaining robustness to noise and sparseness in the data. An evolutionary search procedure was implemented to determine optimal additive molecules that meet desired performance criteria. The design algorithm was customized to handle inherent constraints and hence avoids some of the feasibility pitfalls of stochastic algorithms. The inverse design algorithm was shown to locate optimal solutions that also possessed a high potential for synthesis. In this case study, we concentrated on the stability of the additive as the function influencing its deposit removal capability. While this is the predominant effect, other factors such as the susceptibility of the core to nucleophilic attack and the tendency of the additive to degrade rapidly in the combustion chamber (just enough stability), for instance, are also important to lesser extents. These effects could be modeled separately through relevant mechanistic descriptions. The point to note here is that the design-relevant building blocks are the same for these models also. But the contributing subunits of the building blocks and their functional interactions will be different. The building blocks allow for different functional descriptions while retaining the same design-level structural abstraction. The hybrid model for IVD prediction was a black-box model due to the use of a neural network. Even the phenomenological component of the hybrid model turned out to be nonlinear in this case. However in other design domains where the forward models are linear or transformably so, suitable deterministic techniques should be used. Even in this case study other stochastic methods such as simulated annealing [27] that have been successfully applied to other CAMD problems [28] could be used to tackle the inverse problem. The essential idea and indeed the power behind this CAMD approach is the identification of the most sensitive set of design/decision variables based on a first-principles understanding of how structure relates to performance. Then the best search procedure can be customized to navigate the designs cast in that decision space.
Acknowledgements
We would like to thank the Lubrizol Corporation, Wickliffe, OH for their support of this work as well as the data. Thanks are also due to Dr. Dan Daly at Lubrizol for numerous discussions and help in understanding fuel-additives.
15.7 REFERENCES
[1] Chan, K., Computer-Aided Molecular Design Using Genetic Algorithms, Ph.D. Thesis, Purdue University, 1994.
[2] Kalghatgi, G.T., Deposits in Gasoline Engines - A Literature Review. SAE Technical Paper Series: Lubricants and Fuels, 1990. 99:4(902015): p. 639-667.
[3] Lacey, P.I., Kohl, K.H., Stavinoha, L.L., and Estefan, R.M., A Laboratory-Scale Test to Predict Intake Valve Deposits. SAE Technical Paper Series: Lubricants and Fuels, 1997. 106:4(972833): p. 880-891.
[4] Grant, L.J. and R.L. Mason, SwRI-BMW N.A. Intake Valve Deposit Test - A Statistical Review. SAE Technical Paper Series: Lubricants and Fuels, 1992. 101:4(922215): p. 1221-1230.
[5] Graham, J.P. and B. Evans, Effect of Intake Valve Deposits on Driveability. SAE Technical Paper Series: Lubricants and Fuels, 1992. 101:4(922259): p. 1231-1245.
[6] Houser, K.R. and T.A. Crosby, The Impact of Intake Valve Deposits on Exhaust Emissions. SAE Technical Paper Series: Lubricants and Fuels, 1995. 103:4(922259): p. 1432-1451.
[7] Office of Mobile Resources, Final Rule: Certification Standards for Deposit Control Gasoline Additives, 1996, US Environmental Protection Agency.
[8] ASTM D5500-94. Standard test method for vehicles evaluation of unleaded automotive spark-ignition engine fuel for intake valve deposit formation, Section 5, Vol 5.03, American Society for Testing and Materials, Jan 1995.
[9] Gani, R. and Fredenslund, Aa., Computer Aided Molecular and Mixture Design With Specified Property Constraints, AIChE J., 1991, 37(9): p. 1318.
[10] Skvortsova, M.I., et al., Inverse Problem in QSAR/QSPR studies for the case of topological indices characterizing molecular shape (Kier Indices), J. Chem. Inf. Comput. Sci., 1993, 33: p. 630-634.
[11] Churi, N. and Achenie, L.E.K., Novel Mathematical Programming Model for Computer Aided Molecular Design, Ind. Eng. Chem. Research, 1996, 35: p. 3788-3794.
[12] Maranas, C.D., Optimal Computer-Aided Molecular Design: A Polymer Design Case Study, Ind. Eng. Chem. Res., 1996, 35: p. 3403-3414.
[13] Holland, J.H., Adaptation in Natural and Artificial Systems. 1975, Ann Arbor: University of Michigan Press.
[14] Goldberg, D.E., Genetic Algorithms in Search, Optimization and Machine Learning. 1989, Reading, Mass.: Addison-Wesley.
[15] Venkatasubramanian, V., Chan, K. and Caruthers, J.M., Genetic Algorithmic Approach for Computer-Aided Molecular Design, in Computer-Aided Molecular Design. 1995. p. 396-414.
[16] Venkatasubramanian, V., et al., Computer-aided Molecular Design using Neural-Networks and Genetic Algorithms, in Genetic Algorithms in Molecular Modeling, J. Devillers, Editor. 1996: London. p. 271-302.
[17] Hildebrand, J.H. and R.L. Scott, Regular Solutions. 1962.
[18] Barton, A.M., CRC Handbook of solubility parameters and other cohesion parameters. 1991: CRC Press.
[19] Meusberger, K.E., Pesticide Formulations. Am. Chem. Soc. Symp. Ser., 1988. 371: p. 151.
[20] Sundaram, A., et al., Design of Fuel Additives Using Neural Networks and Evolutionary Algorithms, AIChE Journal, 2001, 47(6): p. 1387-1405.
[21] Schenker, B. and Agarwal, M., Cross-Validated Structure Selection for Neural-Networks. Computers Chem. Eng., 1996. 20(2): p. 175-186.
[22] Haykin, S., Neural Networks: A Comprehensive Foundation. 1999.
[23] Andersson, G., Kaufmann, P. and Renberg, L., Nonlinear Modeling with a Coupled Neural Network - PLS Regression System. Journal of Chemometrics, 1996. 10: p. 605-614.
[24] Geladi, P. and Kowalski, B.R., Partial Least-Squares Regression: A Tutorial. Analytica Chimica Acta, 1986. 185: p. 1-17.
[25] Qin, S.J. and McAvoy, T.J., Nonlinear PLS Modeling Using Neural Networks. Computers Chem. Eng., 1992. 16: p. 379-391.
[26] Arters, D.C., E.A. Schiferl, and D.T. Daly, Variability of Intake Valve Deposit Measurements in the BMW Vehicle Intake Valve Deposit Test. SAE Technical Paper Series, 1997. SP-1277(971723): p. 67-80.
[27] Kirkpatrick, S., et al., Optimization by Simulated Annealing, Science, 1983, 220: p. 671-680.
[28] Marcoulaki, E.C. and Kokossis, A.C., Molecular Design Synthesis using Stochastic Optimisation as a Tool for Scoping and Screening, Computers Chem. Engng., 1998, 22(Supplement): p. S11-S1.
PART III:
Computer Aided Product Design
The first two parts of the book focussed on the problem of computer aided molecular design (CAMD). The broader problem of design of new materials or products as against molecules is highlighted in this final part of the book. Some of the new frontiers for the computer aided product design (CAPD) problem are presented. Finally, the outstanding issues and challenges are discussed.
Computer Aided Molecular Design: Theory and Practice
L.E.K. Achenie, R. Gani and V. Venkatasubramanian (Editors)
© 2003 Elsevier Science B.V. All rights reserved.
Chapter 16: Challenges and Opportunities for CAMD
R. Gani, L. E. K. Achenie & V. Venkatasubramanian
The problem of molecular design is only a special case of the much broader problem involving design of new materials, formulations or structured products as against simply new molecules (or mixtures of molecules). The expanded problem of computer-aided product design (CAPD) can thus be stated as the determination of the optimum material, structured product and/or formulation to meet a given set of design objectives. The chapters in Parts I and II of this book have primarily been concerned with molecular (and mixture) design of small molecules. Chapters 5 and 13 have discussed the design of polymers, but they have concentrated on bulk properties and not on polymer properties dependent mainly on differences in the polymer structure at the mesoscopic and/or microscopic level. The methods and tools described in this book can, however, serve as the basis for solving problems not covered in detail in the earlier chapters. For example, using higher-order groups, topological indices and/or higher level molecular structural representations, the methods described in chapters 2-7 can easily be adapted to design large, complex molecules that are usually isomers or multiple conformations of a specific molecular type. Use of higher-level molecular representations will also require the use of property estimation techniques that employ such molecular structural information. The objective of this chapter is to highlight some of the challenges and opportunities related to problems not covered by the earlier chapters.
16.1 CHALLENGES
Klientjens (1999) provided a useful list of challenges in terms of structured material products that adapt their properties to suit their environment or that remember their previous shape. As examples, Klientjens lists some target functions (needs) for these structured materials: materials that contract like a muscle, materials that change in color upon a change in thermodynamic conditions, materials whose viscosity changes when introduced into an electromagnetic field and many more. Realization of these and other challenging products could be achieved by addressing (finding answers to) the following questions and other related questions:
• Can we manipulate the structures of our products at the micro-, meso- and/or macroscopic levels in order to give the product a desired functionality?
• Can we produce a desired chemical/biochemical/agrochemical product by finding the optimal reaction and processing path?
• Can we a priori identify the products for which an appropriate processing route is achievable?
• How can we validate and test the desired functional properties (such as controlled release) of the product?
• How can we enhance the functional properties of products?
• How can the optimal interactions between product and process design be explored?
• How can we identify the ingredients (additives) that, when added to a product (such as flavors, paints, pesticides, etc.), enhance the functional properties of the product and formulation?
16.1.1 From Molecules to Materials
Material design differs significantly from the more traditional design problem. In traditional design, the component behavior is often well known or can be described by relatively simpler models. On the other hand in the realm of material design, the primary challenge lies in the determination of the model of the material behavior. Furthermore, the multitude of possible chemical structures or formulations results in a combinatorially complex design space. Notwithstanding these challenges, present day material design enjoys the advantage of availability of rich pools of material data due to the advent of high-throughput experimentation (HTE). Not only do there exist large collections of data but also new data is being generated now at incredible rates. Consequently, another key issue in material design becomes the suitable extraction of knowledge from such vast reservoirs of data and its appropriate exploitation in the overall design exercise. At the same time, despite the high-throughput screening tools, the sheer complexity of the design space necessitates experimental design so as to intelligently focus the data collection process towards the promising regions of the design space. The need of the hour for computer aided product design is then a rational framework that can address all the issues mentioned above and seamlessly integrate the processes of forward model development via knowledge extraction, solution of the inverse problem and design of experiments. Caruthers et al. (2002) discuss such a framework in the domain of catalyst design. The ideas presented by Caruthers and coworkers are applicable to the generic material design context and we briefly relate the same here.
Hybrid Modeling Approach
The framework integrates the computer-aided knowledge extraction process with HTE and expert knowledge so as to fully exploit HTE. It is important to note that the vast reservoirs of available data simply offer information and do not necessarily present it directly in the form of corresponding knowledge. In order to extract such knowledge from HTE data, the framework utilizes advanced models and novel software architectures that strive to approximate the thought process of the human expert. The overall material design problem is again viewed to consist of two components, analogous to that in the case of molecular design as was mentioned in the very first chapter of the book. This is illustrated in Fig. 1.
[Figure 1 labels: Forward Problem; Predictive Model; Material Composition, Operating Conditions; Material Performance; Design; Inverse Problem]
Figure 1: Components of the material design problem
The forward model is used to connect the material composition and/or high-level descriptors of the composition to the performance of the material in the application of interest. An inverse model relates the performance to the desired composition or formulation. By definition, design is the solution of the inverse model. Although the inverse solution is often the primary technological objective, a rational design process requires good, robust, forward models. In turn, in order to develop good forward models it is imperative to possess in-depth knowledge about the system of interest. However the development of the model presents some unique challenges. Often, it is intractable to develop first principles models alone that connect the material composition all the way to the material performance. At the same time, while a large and diverse data set is essential, purely data-driven models are also usually inadequate. These difficulties necessitate the use of advanced hybrid modeling techniques where first principle models are used in concert with data driven models, like neural networks, and expert knowledge. Fig. 2, taken from Caruthers et al. (2002), schematically illustrates the concept of such a hybrid, integrated modeling framework.
Figure 2: Schematic of modeling architectures. (a) Traditional approach where models do not interact and (b) new hybrid approach where models work in concert.
The most complex material design problems are often those where the underlying systems either involve reactions, or have time-evolving performance properties of interest, or both. For instance, biological systems form an especially important class of such complex systems involving reactions where metabolic pathway modeling becomes important. To model these complex reactive systems, more sophisticated knowledge architectures are required. In general, for a chemical or biological system, a kinetic model is required to model the reactions. Then the parameters of the kinetic model need to be determined from experimental data assuming a particular reaction mechanism or metabolic pathway. To facilitate the process, the overall modeling effort is structured in two parts. First a fundamental model of the system based on physics, chemistry and/or biological knowledge about the system is developed that would provide suitable descriptors of the system. These descriptors would then be used to determine the parameters of the kinetic model, which is the second part of the overall forward model. This two-part forward model approach is depicted in Fig. 3, which is a generic version of a figure from Caruthers et al. (2002). One may be tempted to eliminate the fundamental (chemistry, physics or biology) models as well as the kinetic models, and attempt to directly correlate descriptors of the system with performance. However, to develop a forward model reliable enough that it may be extrapolated to new regions of composition space for the overall material problem, it is often essential to utilize all available knowledge about the system at hand.
[Figure 3 labels: Material Library; HTE data; Performance Curves; Chemistry, Physics or Biology Models; Kinetic Model; Rules; Material Structure to Model Parameters; Model Parameters to Performance]
Figure 3: Schematic of the overall forward model for systems involving reactions
As shown in Fig. 3, the knowledge about the system captured by the fundamental model may be in terms of chemistry, physics or biology rules. These rules may themselves arise out of both first principle as well as expert knowledge. The development of the fundamental model as well as the kinetic model will be entirely determined by the selection of the rules expected to govern the underlying chemical, physical or biological system at hand. In other words, the process of model development is actually the process of selecting the appropriate set of governing rules.
K n o w l e d g e Extraction: F r o m Rules to F e a t u r e s The overall forward model may now be defined as the clear and precise representation of all forms of knowledge about the system including first principle, data-driven and expert knowledge. If one wants the full benefits of HTE and the ability to do design, there is no alternative to model development. The reasons are very clear. First, for most complex design problems, the composition space can be so large that even HTE cannot fully explore it. Second, in order to obtain more than just correlations using HTE, knowledge must be extracted, and the extraction of knowledge must be automated so as to keep up with the rate at which data is now becoming available. Finally and probably, most importantly, the systems being modeled are usually far too complex so that the number of ideas that must be addressed simultaneously often exceeds the capacity of h u m a n experts. Consequently, a computer-aided Knowledge Extraction (KE) engine with capabilities for both model refinement and formulation of new, critical high-throughput experiments is a necessary component for effective design. Fig. 4, again a generic version of a figure from Caruthers et al. (2002) shows the idea of knowledge extraction and the resulting flow of knowledge. Knowledge extraction is not a model, but rather a process as explained below. To start with, a set of rules is postulated (possibly by a h u m a n expert) to best describe the system at hand. The rules lead to a fundamental model with chemistry, physics or biology knowledge embedded in it. The rules also lead to a kinetic model. Using HTE data, the model parameters are determined and the system performance predictions are obtained. Rather than the quantitative predictions, the qualitative features of the performance prediction vis-a-vis those in the HTE data are often more critical to evaluate the extent of the inadequacy of the postulated model. 
To handle the shortcomings of the model, the model refinement process is invoked, which reselects a set of rules, and the process repeats. At the same time, in order to better discriminate between different models or obtain data on different features of the system, new experiments are formulated. Thus, each iteration of the KE process leads to a potentially improved model as well as more discriminating HTE data. The continued interplay between theory and experiment via both a computer-based system and human experts ultimately results in the generation of new knowledge. If the KE engine, the HTE data and the human expert are working in concert, the process should begin to converge with each cycle. At the end of the convergence process, the forward model would be essentially complete for the class of systems studied, so that it may then be used in conjunction with a suitable inverse solution method to solve the material design problem.
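The iterative loop described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the architecture of Caruthers et al. (2002): the two candidate "rule sets" (first-order and second-order decay), the synthetic data, the tolerance and all function names are invented for the example. Each rule set yields a one-parameter kinetic model, the parameter is fitted to the data by a coarse grid search, and "model refinement" simply moves on to the next rule set when the misfit exceeds the tolerance.

```python
import math

# Hypothetical rule sets: each gives the concentration at time t for a rate
# constant k, starting from an initial concentration of 1.0.
def first_order(k, t):
    return math.exp(-k * t)

def second_order(k, t):
    return 1.0 / (1.0 + k * t)

RULE_SETS = {"first_order": first_order, "second_order": second_order}

def fit(model, data):
    """Grid-search k over 0.01 .. 5.00, minimising the sum of squared errors."""
    best_k, best_sse = None, float("inf")
    for i in range(1, 501):
        k = i * 0.01
        sse = sum((model(k, t) - c) ** 2 for t, c in data)
        if sse < best_sse:
            best_k, best_sse = k, sse
    return best_k, best_sse

def knowledge_extraction(data, tol=1e-6):
    """Try rule sets in turn; stop at the first whose misfit is within tol."""
    history = []
    for name, model in RULE_SETS.items():
        k, sse = fit(model, data)
        history.append((name, k, sse))
        if sse <= tol:
            break  # model judged adequate, no further refinement
    return history

# Synthetic stand-in for HTE data, generated from second-order kinetics (k = 0.8).
data = [(t, 1.0 / (1.0 + 0.8 * t)) for t in (0.5, 1.0, 2.0, 4.0)]
history = knowledge_extraction(data)
selected = history[-1]
print(selected[0], round(selected[1], 2))
```

A real KE engine would compare qualitative features of the performance curves rather than a single squared-error score, and would also propose new experiments in each cycle; both are omitted here for brevity.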
Figure 4: The Knowledge Extraction process (diagram not reproduced; it links Rules, Model, High Throughput Experiments, Performance Curves / Feature Extraction, Model Refinement and Formulation of Experiments)

16.1.2 Selection of Reaction Schemes
The selection of the best reaction scheme (item 2 of the list of challenges) for the industrial production of chemicals is probably the most challenging unsolved problem. The development of processes requires the consideration of a wide variety of reactions that transform raw materials into desired products. The identification of the most efficient reaction routes connecting the raw materials and the desired products is known as reaction path synthesis. The need to discover and develop alternative or new reaction paths to existing or new chemicals attracts more and more attention because of changing availability of raw materials and energy, the constraints posed by ecological and health considerations, and the shifting requirements of the market. The synthesis of new chemical reaction paths is an attractive objective within the general field of process synthesis. Reaction path synthesis, that is, the generation of a network of alternative routes for the manufacture of a desired product and the selection of an optimal route, represents a key step in process design. Significant advances have been made in the synthesis, analysis and evaluation of alternative reaction routes (Govind & Powers, 1981; Rotstein, Resasco & Stephanopoulos, 1982; Crabtree & El-Halwagi, 1994; Fornari & Stephanopoulos, 1994; Buxton, Livingston & Pistikopoulos, 1997; Li, Hu, Li & Shen, 2000).
Usually, the synthesis problem falls into one of the following classes: (a) Given a desired product, identify the feasible sets of raw materials, as well as the pathways that correspond to feasible mechanisms - retrosynthesis; (b) Identify potential products starting from available raw materials using feasible mechanisms; (c) Bridge the gap between the given raw materials and products using feasible mechanisms; (d) Identify chemicals and feasible intermediate reaction steps to bridge a given situation with a desired situation, and the reverse - Solvay-type clusters of reactions. Any computer-aided synthesis procedure requires the solution of the following problems: (i) representation of chemical compounds and reactions; (ii) selection of the appropriate synthesis strategy; and (iii) criteria for evaluation. The existing approaches to the identification of reaction paths can be divided into two categories: information-based systems with their roots in chemistry and logic-based systems with their roots in mathematics. The first approach is based on the chemist's point of view and uses vast amounts of data, which encode available knowledge of chemistry, generating only the rational alternatives. The second approach uses logical constructs of chemistry and mathematical representations of molecules and reactions; it can synthesize completely novel reaction pathways, but it also generates large numbers of unattractive solutions. In each approach there is a trade-off between the generality of the methods (the ability to represent many alternatives) and their predictive power (the ability to represent specific reactions in detail), according to the representation system employed. The future of computer-aided reaction path synthesis systems depends on their ability to provide quantitative evaluation of the generated alternatives.
Thus, the priority is to estimate the reactivity, equilibrium conversion, by-product formation etc.; for large-scale commodity products, the reaction path synthesis must be coupled with processing requirements. The key for any quantitative evaluation of the generated alternatives is the prediction of the amount of desired product that can be obtained from each reaction route. Therefore, for each feasible reaction (stoichiometric and thermodynamic), the equilibrium and kinetic conditions must be evaluated. For a limited number of reactions, it is possible to find equilibrium and kinetic data in the literature or to conduct experiments in order to evaluate them. Generally reaction kinetic data (rate constants) are determined experimentally and only knowledge-based systems include such data. However, the reaction path synthesis problem is likely to include hundreds or thousands of both known and new reactions, for which a rapid equilibrium and kinetic evaluation must be implemented. For small molecules and very simple (unimolecular) reactions, computational chemistry may help evaluate the kinetics. For the generation of alternative reaction paths for the production of a given chemical product, a three-step methodology can be considered as
follows: (1) Co-material design step; (2) Reaction path synthesis step; (3) Post-synthesis step. In the first step, the chemical species included in the reaction network are generated using a group-based CAMD procedure (Gani et al., 1991; Constantinou et al., 1995). Careful pre-selection of these compounds provides an early opportunity to limit the size of the problem. Thus, starting from the desired product and based on its known chemistry, the appropriate set of groups is selected. The CAMD procedure systematically generates the stoichiometric co-material candidates, considering the specified type of compounds and property constraints. Also, to complete any stoichiometry, it may be necessary to include some simple additional molecules, which cannot be systematically designed using the CAMD procedure. In order to develop the reaction path network using the computer-aided reaction synthesis algorithm, it is necessary to create a specific structure and reaction representation. In the second step (reaction path synthesis), these representations are constructed and subsequently the reaction tree is generated, considering role specifications, chemistry constraints and stoichiometric coefficient constraints. The identification of feasible stoichiometries is performed using a multi-level "Generate and Test" procedure (see Chapters 7 and 14), which applies the atom balance and the thermodynamic and equilibrium conversion constraints in order to reduce the problem dimensions and to screen out the infeasible stoichiometries. Thermodynamic properties such as the Gibbs free energy of reaction serve as the basis for the selection strategy. The resulting reaction path network includes only the stoichiometrically and thermodynamically feasible reactions. In the third step (post-synthesis), the results from the reaction path synthesis step are analyzed in order to identify the most promising reaction routes in terms of economics, operability, reliability, environmental impact, etc.
The analysis involves large amounts of information, such as reaction conditions and kinetics needed to perform process design, simulation and optimization. Various simplifications have been suggested, the most important of which is the hierarchical evaluation of reaction schemes, which progresses through different levels of required detail, e.g. evaluation of alternative reaction schemes at equilibrium conditions, kinetically controlled conditions, considering overall gross added-value economics, toxic raw materials or by-products, etc.
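The screening idea behind the multi-level "Generate and Test" procedure can be illustrated with a small Python sketch. It is not the algorithm of Chapters 7 and 14; it is a simplified stand-in that assumes a fixed list of co-materials, integer stoichiometric coefficients between -3 and 3 (negative for reactants, positive for products), and a crude thermodynamic test that keeps only stoichiometries with a negative standard Gibbs energy of reaction. The Gibbs energies of formation are rounded textbook values (gas phase, 298 K) used purely for illustration, and the methanol target is an assumed example.

```python
from itertools import product

# name: (elemental composition, approximate Gibbs energy of formation in kJ/mol)
SPECIES = {
    "CO":    ({"C": 1, "O": 1}, -137.2),
    "CO2":   ({"C": 1, "O": 2}, -394.4),
    "H2":    ({"H": 2}, 0.0),
    "H2O":   ({"H": 2, "O": 1}, -228.6),
    "CH3OH": ({"C": 1, "H": 4, "O": 1}, -162.0),
}
TARGET = "CH3OH"                           # desired product (assumed example)
CO_MATERIALS = ["CO", "CO2", "H2", "H2O"]  # assumed co-material candidates

def atom_balance(coeffs):
    """Net number of atoms of each element over the whole reaction."""
    totals = {}
    for name, nu in coeffs.items():
        for element, count in SPECIES[name][0].items():
            totals[element] = totals.get(element, 0) + nu * count
    return totals

def gibbs_of_reaction(coeffs):
    """Standard Gibbs energy of reaction from formation values (kJ/mol)."""
    return sum(nu * SPECIES[name][1] for name, nu in coeffs.items())

def generate_and_test(max_coeff=3):
    balanced, feasible = [], []
    levels = range(-max_coeff, max_coeff + 1)
    for nus in product(levels, repeat=len(CO_MATERIALS)):
        coeffs = dict(zip(CO_MATERIALS, nus))
        coeffs[TARGET] = 1  # basis: one mole of desired product
        # Level 1: discard stoichiometries that violate the atom balance.
        if any(v != 0 for v in atom_balance(coeffs).values()):
            continue
        balanced.append(coeffs)
        # Level 2: crude thermodynamic screen on the Gibbs energy of reaction.
        if gibbs_of_reaction(coeffs) < 0.0:
            feasible.append(coeffs)
    return balanced, feasible

balanced, feasible = generate_and_test()
for coeffs in feasible:
    print(coeffs, round(gibbs_of_reaction(coeffs), 1))
```

With these numbers the atom balance leaves four candidate stoichiometries, and the thermodynamic screen removes the CO2 hydrogenation route, whose standard Gibbs energy of reaction is slightly positive at 298 K. A practical implementation would evaluate equilibrium conversion at the actual reaction conditions rather than rely on the sign of a standard-state value.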
16.1.3 Drug Design
In a recent article, Garg and Achenie (2001) discussed the use of mathematical programming for designing drugs with desired properties. The mathematical programming formulation is solved to obtain optimal descriptor values, which are then employed in the Cerius2 modeling environment to infer the optimal lead candidates, in the sense that they exhibit both high selectivity and activity while ensuring low toxicity. Both linear and non-linear quantitative structure activity relationships (QSARs) were developed for use in the approach. The modeling approach was demonstrated for a class of non-classical antifolates for Pneumocystis carinii and Toxoplasma gondii dihydrofolate reductase. Some of the potential leads found in this study have biological properties similar to those in the open literature.
Background
Pneumonia and toxoplasmosis are the major causes of morbidity and mortality in AIDS patients (De Vita et al., 1987). The opportunistic pathogens Pneumocystis carinii (pc) and Toxoplasma gondii (tg), respectively, cause these diseases via the dihydrofolate reductase (DHFR) enzyme in AIDS and other immunocompromised patients. Existing therapies are either too toxic or not very selective between human DHFR and pcDHFR (or tgDHFR) (Walzer et al., 1988 and Kovacs et al., 1988). Currently available antifolate therapies for pc and tg infections (namely, trimethoprim and pyrimethamine) are weak inhibitors of DHFR. On the other hand, trimetrexate and piritrexim, although 100-10,000 times more potent than trimethoprim and pyrimethamine, are unfortunately strong inhibitors of DHFR from mammalian sources (Gangjee et al., 1993, 1995, 1996a, and Rosowsky et al., 1994, 1995). There has been a flurry of research activity reported in the open literature (Chio et al., 1991, Piper et al., 1996, Gangjee et al., 1996b, 1997, 1998) focused on the design of drugs that are selective, i.e. simultaneously active against pcDHFR and tgDHFR and relatively inactive against human DHFR. In these studies, the researcher typically takes an antifolate backbone, changes some of the functional groups, synthesizes the molecule, and performs bioassays to determine whether it is a potential lead candidate. This process is very time consuming and expensive. Compounding the problem, there are several hundred, or even millions, of possible molecules that could be screened through this approach, which is a Herculean task for any research group to accomplish in a reasonable amount of time. Garg and Achenie propose a computer aided molecular design approach instead.
Problem Formulation
A CAMD approach is formulated to identify potential leads that are selective, i.e. simultaneously active against pcDHFR and tgDHFR and relatively inactive against human DHFR. In the suggested approach, two sub-problems are considered, namely, a forward problem and an inverse problem. In the forward problem, models are developed to predict the selectivity and activity from molecular descriptors. In the inverse problem, the optimal values (based on selectivity and activity) of the molecular descriptors are determined and an appropriate molecular structure is inferred.
For the forward problem, several antifolates, each with different inhibitory activity characteristics and with the general structure

[General antifolate structure (diagram not recoverable); visible labels: NH2, H2N, R1, R2, R2', R3', R5']

where W = [N, CH], X = [N, CH], Y = [N, CH2], Z = [N, CH2], R1 = [no substituent, H, Cl, CH3], R2 = [H, CH3] and R3 = [H, CH3, CHO, CH2CCH, CH(CH3)2, CH2CH3], are fed into the MSI Cerius2 modeling environment. The latter then gives a unique set of descriptor values corresponding to each molecule. Next, a quantitative structure activity relationship (QSAR) between the activities of the antifolate molecules and the descriptor values is developed. QSAR models are also developed for the selectivities of the antifolate molecules for pcDHFR (and tgDHFR) versus rlDHFR. The inverse problem is solved using one of the QSAR models. This results in a set of optimal descriptor values for potential lead candidates with both high selectivity and activity values. The inverse problem is given by
\[
\max_{d} \; \mathrm{Selectivity}(d) \quad \text{subject to} \quad \mathrm{Activity}(d) \ge \mathrm{Activity\_low}
\]
where d is the vector of descriptors. Selectivity and Activity are the models generated from the forward problem. Activity_low is the activity above which the drug has a significant biological effect. The formulation reflects the fact that the selectivity of a drug is more critical than its activity, since the activity of any drug can be controlled through its dosage and/or dosage form. Note that the drug can be given at a chosen level of potency or frequency, and the dosage form can be intravenous or oral. The mathematical programming model can handle objectives and constraints different from the ones suggested above. The optimal descriptor values from the model are then used to identify the appropriate substituents on the antifolate backbone. In other words, these descriptor values are used to infer the important structural features necessary to attain the desired properties.
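To make the structure of the inverse problem concrete, the following Python sketch maximizes a selectivity model subject to a lower bound on an activity model. It is an illustration only and does not reproduce the models or solver of Garg and Achenie (2001): the two descriptors (named `logp` and `volume` here), the linear QSAR coefficients, the descriptor levels and the value of `ACTIVITY_LOW` are all invented, and the mathematical program is replaced by exhaustive enumeration over a small discrete grid.

```python
from itertools import product

# Hypothetical linear QSAR models from a "forward problem"; the coefficients
# are made up for illustration and have no experimental basis.
def selectivity(d):
    logp, volume = d
    return 2.0 - 0.5 * logp + 0.01 * volume

def activity(d):
    logp, volume = d
    return 1.0 + 0.8 * logp - 0.004 * volume

ACTIVITY_LOW = 2.1  # assumed threshold for a significant biological effect

# Assumed discrete levels for each descriptor.
LOGP_LEVELS = [0.5 * i for i in range(0, 9)]              # 0.0, 0.5, ..., 4.0
VOLUME_LEVELS = [float(v) for v in range(100, 401, 50)]   # 100, 150, ..., 400

def solve_inverse():
    """Maximise selectivity(d) subject to activity(d) >= ACTIVITY_LOW."""
    best_d, best_s = None, float("-inf")
    for d in product(LOGP_LEVELS, VOLUME_LEVELS):
        if activity(d) < ACTIVITY_LOW:
            continue  # violates the activity constraint
        s = selectivity(d)
        if s > best_s:
            best_d, best_s = d, s
    return best_d, best_s

d_opt, s_opt = solve_inverse()
print(d_opt, s_opt)
```

The optimal descriptor vector returned by such a search is only an intermediate result; as described above, it still has to be mapped back to substituents on the antifolate backbone.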
16.1.4 Formulations
The mixing of materials to achieve a new or improved product is practiced in many industries, including paints and dyes, foods, personal care, detergents, plastics and pharmaceutical development. Formulated intelligent industrial products use specifically chosen mechanisms to serve the customer by accurately exerting their desired features. According to Kind (2002), the desired features consist of:
Performance - nutritional value, health care, disease prevention, body care, surface protection, crop protection, chemical activity, etc.
Convenience - controlled release of the active substance at the location and instant of maximum effect and minimal environmental impact, ease of handling, ease of application, absence of unwanted side effects, etc.
Design of formulated products requires knowledge of the interaction between the microscopic structure and the process at the molecular and nano-scale on the one hand and the macroscopic, consumer-oriented properties of the product on the other hand. Formulation design problems such as mixing and blending of oils or solvents can be tackled by currently available CAMD techniques. For the design of polymer formulations (or blends) and of ingredients for food, drugs, pesticides, etc. with a desired delivery/penetration, however, models of different scales of size, time and complexity need to be integrated with CAMD techniques. This is a very rapidly growing research area, and the development of integrated computer aided methods and tools for formulation design is certainly feasible. Much work, however, is necessary to measure and collect the data needed for the development of new property models.
16.2 CURRENT TRENDS TOWARDS PROBLEM SOLUTIONS

16.2.1 Flexible solution strategies
The step in the product design procedure dealing with the manufacture of the product is in most cases applied in a sequential manner, after the completion of the first three steps (identify needs, design products and test products). For simple solvent design problems, simultaneous solution approaches for process-product design have recently been reported by Hostrup et al. (1999) and Linke and Kokossis (2001). Mathematical programming approaches incorporating product and process design, while attractive, first need to overcome the problem of property models. As pointed out by Gani (2001), the property models for product design may not be suitable for process design and vice versa. In addition, once a property model is selected for inclusion into the process model, the application range in terms of additional new mixtures (generated by the product design steps) is restricted since for the generated molecules, either
the model parameters may be unavailable or the property model may not be suitable. Since in mathematical programming techniques changing the model equations (included as equality constraints) causes discontinuities in the solution trajectory, it may become extremely difficult to achieve convergence if multiple versions of models for the same properties are used. This problem may, however, be overcome using mixed integer mathematical programming formulations that use logic or binary variables to represent different models. Recently, Gani and Pistikopoulos (2002) and Eden et al. (2002) proposed the solution of process as well as product design problems as a series of reverse problems. Just as molecular design problems may be formulated as reverse property estimation problems, process design problems may also be formulated as reverse simulation problems. That is, determine the design targets from the process models, given the known process input information and the desired process output information. Eden et al. (2002) have shown that solution of this reverse simulation problem does not require the use of property models in the process model equations, since the unknown design targets are functions of the target properties. This means that the target properties can be determined from the solution of the reverse simulation problem by solving a set of linear equations (in most cases), and from these the design targets are calculated. The advantage of this procedure is that solution complexity is reduced without sacrificing solution accuracy. Also, note that the dependence on property models for performing mass- and energy-balance calculations has been eliminated (from this step). In order to complete the process design, conditions of temperature, pressure and/or composition are determined in the next step, where a reverse property estimation problem is solved.
Here, the property values (calculated from the first step) and mixture compounds are known but the variables defining the condition of operation (temperature, pressure and/or composition) are unknown. As long as the target properties (from the reverse simulation step) are matched to some degree of tolerance, any number of property models may be used for this reverse property estimation step. The hybrid CAMD methods are designed to handle multiple property models (see figures 18-19 in chapter 6) and have the flexibility to design the condition of operation if the compounds are known and vice versa. Note that the reverse simulation problems may also be solved in order to "define needs" for the CAMD problem. Integrated process-product design problems may also be tackled by decomposing the problem into two reverse problems and iterating on the connection between them until the optimal solution has been achieved (see figure 6). That is, the reverse simulation problem determines the target design values for a specified set of product qualities (chemical identities are not necessary since property models are not used in the reverse simulation problem). The reverse property estimation problem is then solved by matching the design targets, generating new input/output stream parameters and determining the chemical identities. This information is fed back to the reverse simulation problem and a new iteration loop is started. Eden et al. (2002) provide an illustrative example of how the reverse problem formulation works for integrated process-product design.
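The two-step decomposition can be illustrated with a toy Python example. It is a sketch under strong assumptions and is not taken from Eden et al. (2002): the process is a single mixer, the property is assumed to mix linearly on a flow basis, and the candidate compounds with their linear property models P(T) = a + b*T are hypothetical. Step 1 (reverse simulation) obtains the target property of the make-up stream from a linear balance without any property model. Step 2 (reverse property estimation) then finds which candidates, and at what temperature, match that target.

```python
def reverse_simulation(f_feed, p_feed, f_makeup, p_out):
    """Target property of the make-up stream from a linear mixing balance:
    (f_feed + f_makeup) * p_out = f_feed * p_feed + f_makeup * p_makeup."""
    f_out = f_feed + f_makeup
    return (f_out * p_out - f_feed * p_feed) / f_makeup

# Hypothetical candidates with invented linear property models P(T) = a + b*T,
# each assumed valid only between t_min and t_max (temperatures in K).
CANDIDATES = {
    "solvent_A": {"a": 10.0, "b": -0.020, "t_min": 280.0, "t_max": 360.0},
    "solvent_B": {"a": 6.0,  "b": -0.010, "t_min": 280.0, "t_max": 360.0},
    "solvent_C": {"a": 12.0, "b": -0.025, "t_min": 280.0, "t_max": 360.0},
}

def reverse_property_estimation(p_target):
    """Return {compound: temperature} for every candidate that can reach the
    target property inside the validity range of its property model."""
    matches = {}
    for name, m in CANDIDATES.items():
        t = (p_target - m["a"]) / m["b"]  # invert P(T) = a + b*T
        if m["t_min"] <= t <= m["t_max"]:
            matches[name] = t
    return matches

# Step 1: process-level targets, no property model needed.
p_target = reverse_simulation(f_feed=100.0, p_feed=2.0, f_makeup=50.0, p_out=2.5)
# Step 2: compounds and operating temperature that meet the target.
matches = reverse_property_estimation(p_target)
print(p_target, matches)
```

Because the property models appear only in the second step, a different or additional model can be swapped into `CANDIDATES` without touching the balance in the first step, which is the flexibility argued for in the text.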
Figure 6: Reverse problem formulations for integrated process-product design (Eden et al. 2002)

Another opportunity in developing flexible solution strategies is in the area of formulated products. Here, the design problem is to find a formulation that, when added to another product, enhances its function. Thus, the design of the formulation (commonly known as the active ingredient) and the testing of the final product need to be performed simultaneously. Examples include drug delivery, pesticide uptake, polymer blends for specific applications, inhibitors for drugs, elastomers in chemical products and many more. In all these product design problems, models of the process (phenomena) need to be combined with the search for property-based formulations. In many cases, one starts by modeling the diffusion process (for example, in drug delivery) and then relates the sensitive parameters of this process to the target properties of the desired active ingredients. The process model in this case is represented by a system of partial differential-algebraic equations in which a number of the terms are represented by properties such as diffusion coefficient, thermal conductivity, viscosity, etc., for which property models (or constitutive models) need to be introduced. The difficulty in solving these problems in a general manner is that the property model may be
valid only for a certain range of conditions and/or mixtures. The other difficulty is that the models may be complex, requiring higher levels of molecular structural information. Therefore, multi-level and multi-scale modeling approaches need to be considered together with an optimization strategy to identify the best formulation. Again, a decomposition of the problem into reverse problems may be a more pragmatic and flexible way to solve these product design problems.
16.3 ENABLING TECHNOLOGIES
The Journal of Computer-Aided Molecular Design, started in 2001, had an initial emphasis on drug discovery and design. To get a sense of current trends in CAMD (at least from a drug discovery point of view), one needs to look at the areas in which the journal actively solicits manuscripts: theoretical chemistry; computational chemistry; computer and molecular graphics; molecular modeling; protein engineering; drug design; expert systems; general structure-property relationships; molecular dynamics; and chemical database development and usage. Researchers in CAMD (as defined in this book) are making contributions in all of the above areas except theoretical chemistry and computer and molecular graphics. Some of the above areas are further discussed below. In his lecture, Tomasi (1999) quoted Dirac (1929) as follows:
"The underlying physical laws necessary for the mathematical theory of a large part of physics and the whole of chemistry are thus completely known, and the difficulty is only that the exact application of these laws leads to equations much too complicated to be soluble. It therefore becomes desirable that approximate practical methods of applying quantum mechanics should be developed, which can lead to an explanation of the main features of complex atomic systems without too much computation."

Developers of computational chemistry software (such as Gaussian, Jaguar, GAMESS) have largely heeded the advice given in the second sentence, although the amount of computation is still too large. In computational chemistry, sophisticated algorithms from numerical mathematics are employed to elucidate molecular properties. In addition, fast computers (for example parallel/distributed computers, networked computers, and agent-based computing) are employed. If the view expressed in the opening sentence is correct (most computational chemists agree with this view), then expected future advances in both hardware (faster and cheaper computers) and computational chemistry software will revolutionize property modeling at various scales. Since the availability, accuracy and speed of computation of property models are the "Achilles' heel" of CAMD, the effect on CAMD would be very profound indeed.
Current computational chemistry algorithms have had reasonable successes. For example, Sandia National Labs (Albuquerque, NM, http://www.bmpcoe.org/bestpractices/internal/sandi/sandi 57.html) have reported the use of quantum chemistry modeling to determine the structure and energetics of a newly discovered fullerene (an allotrope of carbon). Nyden and Brown (1993) report using molecular dynamics simulations and cone calorimeter measurements to gauge the effects of electron beam irradiation and heat treatments on the flammability of the honeycomb composites used in certain parts of commercial aircraft. The interactions between a protein and substrates dictate essentially all functions of an organism (http://www.ram.org/research/pfp.html). Protein-substrate and protein-protein interactions depend critically on the 3D structure of the protein. This 3D structure results from folding of the protein, a problem that scientists have spent well over 30 years trying to understand. Understanding and modeling protein folding (i.e. activity) has enormous implications in drug discovery, since in principle the 3D structure of a protein may then be controlled (i.e. designed) in order to dock properly to a given substrate or protein. Modeling and simulating protein folding is rather complex and computer intensive. The protein-folding problem can benefit from recent advances in computing and algorithms, including global optimization (see, for example, Floudas et al. (1999)). Likewise, property modeling and prediction in CAMD can benefit greatly from global optimization algorithms under certain conditions: the algorithms should be easy to use, intelligent enough to quickly recognize whether or not a solution exists, robust enough to converge to a solution if one exists, and, most importantly, memory usage and computations should scale almost linearly with the problem size.
In the chemical database area there is a concentrated effort to archive experimental data and property prediction methods in an intelligent database that allows easy retrieval of pertinent information. The CAPEC database (http://www.capec.kt.dtu.dk/main/software /database) at the Technical University of Denmark has a large collection of experimental data, which can be used for developing new models or verifying the accuracy of existing predictive models with respect to application in CAMD. The database from Cranium (Molecular Knowledge Systems, Inc., http://www.molknow.com/) provides physical property estimation. In addition, the database available in SMSWIN (see chapter 9) has a large collection of compound-related data. The PARIS-II project (Cabezas, 2000) offers solvent selection tools based on database search and on-line calculation of properties. The database area would benefit greatly from advances in data mining, knowledge extraction (see section 16.1.1) and protocols such as XML. As computational chemistry becomes more accurate for both small and large molecules, gaps (absence of experimental data) in
property databases can be partially offset by computational chemistry data.
16.4 CONCLUSIONS
We conclude the book with a note on the key improvements required in the product design (inverse) problem strategies to meet the product needs of the future. Before that, we briefly summarize some of the aspects of the CAMD problem that were addressed in the book. The book primarily focused on the computer aided molecular design problem and highlighted its key issues. A background was provided on the required forward modeling effort, ranging from linear group-contribution models to hybrid approaches based on complex knowledge-extraction architectures that strive to integrate first-principles, expert and data-driven knowledge. In terms of the inverse or reverse problem, a variety of methods were discussed in detail, including generate-and-test methods, mathematical programming, evolutionary algorithms and hybrid models. The case studies presented in the book highlighted the application and practice of some of these methods.
16.4.1 Advanced Product Design Strategies
Much of the current work in product design is carried out through empirical, trial and error approaches involving time-consuming experiments. It is important to capture the knowledge gained from past experiments and apply it in a systematic manner so that future efforts will need fewer trials and therefore fewer experiments. In this context, a major effort is needed to understand the molecular structure-property relationships, collect experimental data, develop mathematical models, and apply the solution techniques to identify/design new products and processing routes. Recent successful applications of CAMD to the development of new agrochemicals, materials, and pharmaceuticals can be found in the book edited by Reynolds et al. (1995). Most of these successes employ techniques such as CoMFA, molecular dynamics, de novo ligand design, QSAR, molecular orbital methods, and genetic algorithms. In these applications important properties include interfacial phenomena and pharmacokinetic properties such as transport and metabolism. This chapter introduced the broader problem of material or product design. It provided a flavor of the kind of modeling effort that will be required when the system at hand or its performance measure is too complex to be modeled by simple property prediction methods. In such cases, the forward model itself will be a complicated and computationally intensive process. With the forward model presenting the computational bottleneck, the inverse problem solution strategy will itself also need to be
modified. Typical inverse methods discussed in this book, such as conventional mathematical programming or simple genetic algorithms, will no longer be able to solve the inverse problem in a reasonable amount of time when the forward model is computationally intensive. These search algorithms will have to be redesigned so that all knowledge (fundamental or expert) about the underlying system is exploited to obtain a guided search procedure capable of exploring the search space rapidly despite the computational limitations posed by the forward model. As of now, only some highly preliminary efforts exist towards this end and the problem is far from solved. However, this is the key challenge that will ultimately have to be dealt with to produce an intelligent and efficient, knowledge-driven design system capable of handling the complex material design problems in the years to come.
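One simple way to read "guided search" is that cheap knowledge about the system is used to decide which candidates deserve an expensive forward-model evaluation at all. The Python sketch below illustrates only that idea and is not a method proposed in this book: the forward model is a trivial stand-in that counts its own calls, and the expert rule in `knowledge_filter` is hypothetical.

```python
calls = {"forward": 0}

def expensive_forward_model(x):
    """Stand-in for a costly simulation; counts how often it is invoked."""
    calls["forward"] += 1
    return -(x - 37) ** 2  # performance measure peaks at x = 37

def knowledge_filter(x):
    """Hypothetical expert rule: only odd designs between 20 and 60 are plausible."""
    return 20 <= x <= 60 and x % 2 == 1

def guided_search(candidates):
    best_x, best_y = None, float("-inf")
    for x in candidates:
        if not knowledge_filter(x):
            continue  # rejected without paying for a forward-model evaluation
        y = expensive_forward_model(x)
        if y > best_y:
            best_x, best_y = x, y
    return best_x, best_y

best_x, best_y = guided_search(range(0, 100))
print(best_x, best_y, calls["forward"])
```

Here the rule removes 80 of the 100 candidates before any forward-model call is made. The price is that the quality of the result depends entirely on the rule being right: a filter that excludes the true optimum cannot be corrected by the search that follows it.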
16.4.2 Multidisciplinary Approach
Since chemical product design problems are multidisciplinary in nature, development of a systematic framework based on identified workflow and data-flow for the various inter-related activities would make a significant contribution. The framework needs to consider the human-computer interactions and allow the human to control the workflow, while the computer performs the calculation-intensive tasks in the workflow and most of the tasks in the data-flow. In this way, the human concentrates on the tasks he/she can solve efficiently while the computer concentrates on the tasks it can perform very efficiently. The systematic framework could serve as the basis for state-of-the-art computer-aided tools utilizing existing databases, mathematical models and efficient solution techniques. Note that while the computer-aided tools will depend on the availability of appropriate models, the systematic framework can be used even if the models are not available.
16.5 REFERENCES
1. A. Buxton, A.G. Livingston and E.N. Pistikopoulos. Reaction Path Synthesis for Environmental Impact Minimization. Computers Chem. Engng. 21, S959-S964 (1997)
2. J.M. Caruthers, J.A. Lauterbach, K.T. Thomson, V. Venkatasubramanian, C.M. Snively, A. Bhan, S. Katare and G. Orkarsdottir, J. Catalysis, (2002), submitted for publication.
3. L.-C. Chio and S.F. Queener, "Identification of highly potent and selective inhibitors of Toxoplasma gondii dihydrofolate reductase." Antimicrob. Agents Chemother. 37 (1991) 1914-1923.
4. E.W. Crabtree and M.M. El-Halwagi. Synthesis of Environmentally Acceptable Reactions. AIChE Symposium Series, Volume on Pollution Prevention via Process and Product Modifications 90, 117-127 (1994)
375 5. .
7.
8.
.
10.
11.
12.
13.
14.
15.
16.
17.
V.T. De Vita Jr., S. Broder, A. S. Fauci, J. A. Kovacs, B. A. Chabner, Ann. Intern. Med. 106 (1987) 568- 581. P. A. M. Dirac, 'Quantum mechanics of many-electron systems', Proceedings of the Royal Society (London), A 123 (1929) 714-733. M.R. Eden, S. B. Jorgensen, R. Gani, M. E1-Halwagi, "Property integration - A new approach for simultaneous solution of process and molecular design problems", Computer Aided Chemical Engineering, J. Grievink and J. van Schijndel (Editors), Vol. 10 (2002) 79-84. C.A. Floudas, J.L. Klepeis and P.M. Pardalos, "Global Optimization Approaches in Protein Folding and Peptide Docking", DIMACS Series in Discrete Mathematics and Theoretical Computer Science, (Ed. F. Roberts), 47 (1999) 141-171. T. Fornari, and G. Stephanopoulos. Synthesis of Chemical Reaction Paths: The Scope of Group Contribution Methods. Chemical Engineering Communications 129, 135-157 (1994) A. Gangjee, A. P. Vidwans, A. Vasudevan, S. F. Queener, R. L. Kisliuk, V. Cody, R. Li, N. Galitsky, J. R. Luft, and W. Pangborn, "Structure-based design and synthesis of lipophilic 2,4-diamino-6substituted quinazolines and their evaluation as inhibitors of dihydrofolate reductases and potential antitumor agents" J. Med. Chem. 41 (1998) 3426-3434. A. Gangjee, A. Vasudevan, S. F. Queener, and R. L. Kisliuk "2,4Diamino-5-deaza-6-substituted pyrido[2,3-d]pyrimidine antifolates as potent and selective nonclassical inhibitors of dihydrofolate reductases" J. Med. Chem. 39 (1996a) 1438-1446. A. Gangjee, A. Vasudevan, S. F. Queener, and R. L. Kisliuk, "6Substituted 2,4-diamino-5-methylpyrido[2,3-d]pyrimidines as inhibitors of dihydrofolate reductases from Pneumocystis carinii and Toxoplasma gondii and as antitumor agents" J. Med. Chem. 38 (1995) 1778-1785. A. Gangjee, J. Shi, S. F. Queener, L. R. Barrows, and R. L. Kisliuk, "Synthesis of 5-Methyl-5-deazononclassical antifolates as inhibitors of dihydrofolate reductases and as potential antineumocystis, antitoxoplasma, and antitumor agents" J. 
Med. Chem. 36 (1993) 3437-3443. A. Gangjee, R. Devraj, and S. F. Queener, "Synthesis and dihydrofolate reductase inhibitory activities of 2,4-diamino-5-deaza and 2,4-diamino-5,10-dideaza lipophilic antifolates" J. Med. Chem. 40 (1997) 470-478. A. Gangjee, Y. Zhu, S. F. Queener, P. Francom, A. D. Broom, A.D. "Nonclassical 2,4-Diamino-8-deazafolate analogues as inhibitors of dihydrofolate reductases from rat liver, Pneumocystis carinii, and Toxoplasma gondii." J. Med. Chem. 39 (1996b) 1836-1845. R. Gani, "Computer aided process/product synthesis and design: Issues, needs and solution approaches", paper 264a, AIChE Annual Meeting, Reno, USA, Nov. 4-9, 2001. R. Gani, E. N. Pistikopoulos, "Property modelling and simulation for
376
18.
19. 20.
21. 22.
23.
24.
25.
26.
27.
28.
29.
30.
product and process design", Fluid Phase Equilibria, 194-197 (2002) 43-59. S. Garg, and L.E.K. Achenie, "Mathematical Programming Assisted Drug Design for Non-classical Antifolates," Biotechnology Progress, 17 (2001) 412-418. R. Govind, and G.J. Powers. Studies in Reaction Path Synthesis. AIChE J. 27(3), 429-442 (1981) M. Hostrup, P. M. Harper, R. Gani, 'Design of Environmentally Benign Processes: Integration of Solvent Design and Process Synthesis', Computers and Chemical Engineering, 23 (1999) 13941405. M. Kind, personal Communications, University of Stuttgart, Germany (2002). L. Klientjens, Thermodynamics of organic materials. A challenge for the coming decades", Fluid Phase Equilibria, 158-160 (1999) 113121. J.A. Kovacs, C. A. Allegra, J. C. Swan, J.C. Drake, J.E. Parrillo, B. A. Chabner, and H. Masur, "Potent antipneumocystis and antitoxoplasma activities of piritrexim, a lipid-soluble antifolate" Antimicrob. Agents Chemother. 32 (1998) 430-433. M. Li, S. Hu, Y. Li and J. Shen. Reaction Path Synthesis for a Mass Closed-Cycle System. Computers Chem. Engng. 24, 1215-1221 (20O0) P. Linke, A. Kokossis, "Simultaneous synthesis and design of novel chemicals and chemical process flowsheets", Computer Aided Chemical Engineering, J. Grievink and J. van Schijndel (Editors), Vol. 10 (2002) 115-120. J. E. Nyden, M. R.; Brown, J. E., "Computer-Aided Molecular Design of Fire Resistant Aircraft Materials" Federal Aviation Administration (FAA). International Conference for the Promotion of Advanced Fire Resistant Aircraft Interior Materials. February 911, 1993, Atlantic City, NJ, 147-158 pp, 1993. J.R. Piper, C.A. Johnson, C.A. Krauth, R. L. Carter, C. A. Hosmer, S. F. Queener, S.E. Borotz, and E. R. Pfefferkorn "Lipophilic antifolates as agents against opportunistic infections. 1. Agents superior to Trimetrexate and Piritrexim against Toxoplasma gondii and Pneumocystis carinii in in vitro evaluations" J. Med. Chem. 39 (1996) 1271-1280. C.H. 
Reynolds, "Computer-Aided Molecular Design Applications in Agrochemicals, Materials, and Pharmaceuticals", Edited by C. H. Reynolds, M. K. Holloway, and H. K. Cox] ACS Symposium Series 589, ACS, Washington DC, (1995) 396-414. A. Rosowsky, C.E Mota, J.E. Wright, and S.F. Queener, "2,4Diamino-5-chloroquinozoline analogues of trimetrexate and piritrexim: Synthesis and antifolate activity" J. Med. Chem. 37 (1994) 4522-4528. A. Rosowsky, R. A. Forsch, and S.F. Queener, "2,4Diaminopyrido[3,2-d]pyrimidine inhibitors of dihydrofolate
377 reductase from Pneumocystis carinii and Toxoplasma gondii" J. Med. Chem. 38 (1995) 2615-2620. 31. E. Rotstein, D. Resasco and G. Stephanopuolos. Studies on the Synthesis of Chemical Reaction Paths - I. Chemical Engineering Science 37(9), 1337-1352 (1982) 32. J. Tomasi, "Towards 'chemical congruence' of the models in theoretical chemistry", H Y L E - An International Journal for the Philosophy of Chemistry, 5 (1999) 79-115. 33. P.D. Walzer, C.K. Kim, J.M. Foy, M.T. Cushion, "Inhibitors of folic acid synthesis in the treatment of experimental Pneumocystis carinii pneumonia" Antimicrob. Agents. Chemther. 32 (1988) 96.
Glossary of Terms

ABS: Alkyl Benzene Sulfonates
API: Active Pharmaceutical Ingredient
Aprotic solvent: A term used to describe solvents of both high and low polarity which do not readily give a proton to a base
Basis set: The set of groups from which a molecule may be assembled
BB: Branch and bound method. It is a strategy for obtaining the global minimum (or maximum) of a mathematical program that has discrete variables (sometimes in addition to continuous variables)
Binary variable: Has a value of either 0 or 1
Building block: The pieces the molecular models are assembled from in CAMD - a group or fragment
CAMbD: Computer aided mixture/blend design
CAMD: Computer Aided Molecular Design - the generation of molecules from fragments using a computerised technique
CAMD design algorithm: The set of sub-algorithms used to solve a CAMD problem
CAMD framework: The overall collection of algorithms for formulating, solving and analysing the solution results of CAMD problems, and the sequence they must be applied in
CAMD problem: The task of generating compounds matching a set of properties
CAMD solution: A solution to a CAMD problem
CAMD solution step: The general procedure of solving CAMD problems - in the developed framework this is performed in the design phase
CAMS: Computer Aided Molecular Search - identification of compounds having specific properties by systematic searching in databases
Candidate selection: Finding the most promising candidates among the solutions from the design phase
CAPD: Computer aided product design (CAMD + CAMbD)
Cardinality: Number of members
CFC: Chlorofluorocarbon
Chem3D: Commercial molecular modeling software
Chemometrics: The simulation of reaction systems with kinetic models and principal factor analysis to identify the major pathways
CHRIS: Database on the internet (see chapter 1)
Comaterial: Stoichiometric by-product
Computational load: The amount of calculations required to solve a CAMD problem
Connectivity: How the atoms of a compound are interconnected
CPLEX: MILP solver (www.cplex.com)
CSTR: Continuous Stirred Tank Reactor
CTAM: Critical Air Mass
Database lookup: Performing searches in a database for records with specific data
Descriptors (structural): Numbers or other information describing something about the structure of a molecule - a group vector is a set of structural descriptors
Design considerations: The aspects taken into account when formulating a CAMD problem - explicit considerations must be reformulated as constraints in order to apply the CAMD algorithm, while implicit considerations are treated by restricting the types of molecules generated or by analysing the results in the post-design phase
Constraint specification: Expressing the problem formulation as constraints
Design constraints: The requirements a compound should fulfill in terms of properties
Design phase: The part of the CAMD framework responsible for solving a CAMD problem by using the CAMD design algorithm
Design specifications: The same as constraints
Desirable property: A constraint on a property controlling the suitability of a compound. Typically a relative specification such as "as high as possible", "as low as possible" or "as close to a goal value as possible"
Desirable qualities: Qualities that would enhance the suitability of a compound
DIC: Diisopropyl carbodiimide
DICOPT: Discrete and Continuous Optimizer (this solver is also available in GAMS)
Dimensionality: The dimensionality of a molecular model is a measurement for the level of detail contained in it
DIU: Diisopropyl Urea
DMAC: Dimethyl Acetamide
DMF: Dimethyl Formamide
EH&S: Environment, Health and Safety
EH&S properties: Constraints relating to the Environment, Health & Safety of an operation
Environmental impact: The consequences of discharging a compound into the environment
Essential property: A constraint indicating a property value or interval required in order for a compound to be used for a particular application
Essential qualities: Qualities that a compound must possess in order to be usable
Estimated properties: Properties estimated using a property prediction method
External substance: Compound not participating as a reactant or product in a process
Feasibility: If a compound can be expected to exist in nature or be synthesized successfully
Feasibility requirements: The requirements a molecular model must fulfill in order to be regarded as being a feasible compound
FMS: Final molecular structures
Formulation: CAPD problems that refer to mixture/blend design and/or addition of additives to a product in order to enhance the product performance or quality (see chapters 11 and 16)
Forward problem: Property prediction
Fragment: A fragment or sub-part of a molecule, typically the same as a group - in CAMD fragments are used as building blocks
Functional group: A group defining the family of the compound it appears in (e.g. OH defines a compound as being an alcohol)
Free-attachment: A group has a free attachment if it is available for bonding to another group
GA: Genetic Algorithm
GAMS: General Algebraic Modeling System (commercial software)
GCA: Group Contribution Approach - property prediction based on the assumption that a fragment has the same contribution to a property regardless of the compound it is found in
GC-EOS: Group contribution equation of state
Generate and Test: A combinatorial approach
Generation level: The level of detail in the generated molecule models
Global optimization: Identification of the absolute minimum (or maximum) point within the range of allowed values of the design variables and the region defined by the constraints
Group: A clearly defined substructure of a molecule; it is part of a group-set and forms the basis for the GCA prediction methods
Group classification: The subdivision of the groups from a group-set into classes and categories
Group vector: A collection of groups (taken from a group-set) that defines a compound. Each group appears in the vector the number of times it can be found in the compound
Group-set: A set of molecular fragments used to describe compounds. An example is the groups defined in the UNIFAC method
Hetero atom: Non-carbon atom in (aromatic) ring
Hildebrand (solubility) parameter: An indicator for solubility/miscibility
HMPA: Hexamethyl Phosphoramide
HSDB: Database on the internet (see chapter 1)
Hybridization: The bond configuration for an atom
Level: The generation steps taken in the design phase
ICAS: Integrated Computer Aided System that contains ProCAMD as one of the tools
IMS: Intermediate molecular structures
IVD: Intake Valve Deposits
LCA: Life Cycle Assessment
LIBRA: Interval arithmetic based global optimization package
Log P: Octanol-water partition coefficient
Mathematical programming: A mathematical model consisting of a performance objective, constraints (including material balances and other process constraints) and design variables that can be manipulated to optimize the performance objective
MC: Main-chain (see chapters 5 & 13)
MEIM: Method for Environmental Impact Minimization
Metha groups: Groups with the same combination properties (see chapter 2)
MILP: Mixed Integer Linear Programming - solving linear optimisation problems where some variables must have an integer value
MINLP: Mixed integer nonlinear program. This is a mathematical program involving both discrete and continuous design variables
MM2: A molecular mechanics method for doing calculations on 3D molecule models
MOLDES: CAMD software (see chapter 2)
Molecular detail: The amount of structural detail embedded in a molecular model
Molecular model: An electronic representation/model of a molecular structure
Molecular modeling: Calculations of compound structures and properties using 3D molecular models
Molecule representation: The structure of a molecular model
MOPAC: A computer program for doing ab initio calculations on molecules. Available in many versions but commercially sold by Fujitsu Inc.
MSA: Mass separating agent
MTBE: Methyl Tertiary Butyl Ether
MW: Molecular weight
NFA: Number of free attachments
NN: Neural Network
Nonconvex equation: Does not have a unique minimum point
Nonlinear equation: Variables appear with indices other than 1
ODP: Ozone Depletion Potential
Optimal solution: In MINLP based CAMD solution algorithms, the compound having the optimum value of the objective function. In the developed framework, the compound most suited for the intended use
OSL: MILP solver (www.research.ibm.com/osl)
PARIS-II: A software developed by US-EPA for solvent selection
PEL: Permissible Exposure Limit
Performance criteria: A measurement of how well a compound performs a given task
PET: Polyethylene Terephthalate
PFR: Plug Flow Reactor
PLS: Partial Least Squares
Post-design phase: The part of the CAMD framework responsible for analysing the results obtained from the design phase
Pre-design phase: The part of the CAMD framework responsible for formulating a CAMD problem
Primary properties: Properties predicted purely on the basis of the molecular structure
Problem (CAMD) formulation: Identifying the goals of the design process
Properties: The physical and chemical properties of a compound (e.g. boiling point, melting point etc.)
ProCAMD: Software developed at CAPEC based on the hybrid CAMD method
Property intervals: The interval a property value must lie in, in order to be suitable for a given application
Property level: How complex the calculation of a given property with a given method is
Property prediction for CAMD - level 1: The prediction of properties based on molecular structural information
Property range: The total set of properties that has to be calculated in order to evaluate all the design constraints
Property trust: A qualitative measure for the quality of an estimation
ProPred: Pure component property estimation package
PVP: Poly(vinylidene propylene) copolymer
QSAR: Quantitative Structure Activity Relationships. Property prediction techniques relating an activity (toxicity, biodegradability, bioaccumulation) to the molecular structure
QSPR: Quantitative Structure Property Relationships. Property prediction based on the assumption that a property is related to the molecular structure of the compound. Related to GCA methods
Qualities: A qualitatively defined behavior or capability, like "liquid at ambient temperature" or "good solvent for phenol"
Reverse problem: CAMD could be regarded as the reverse of property prediction
RTECS: Database on the internet (see chapter 1)
SC: Side-chain (see chapters 5 & 13)
Secondary properties: Properties predicted using primary properties and/or temperature and pressure
SEVIN: The trade name for 1-naphthalenyl methyl carbamate
SMS: Solvent Molecular Structure
SMSWIN: Software developed at Syngenta, which is useful for solvent selection (see chapter 9)
SOLV-DB: Solvent Database (see chapter 11)
Solvent: A solvent is that constituent of a solution that is liquid in the pure state, is usually present in the larger amount and has dissolved the other constituent (a solute) of the solution. The solute may be a solid, a liquid or a gas. The solvent may be a single compound or a mixture of compounds
Spanning tree: A tree in a graph including all vertices
SQP: Sequential (or successive) Quadratic Programming
Steric information: Information regarding the (relative) spatial positions of atoms (to each other)
Structural feasibility: An implementation of the octet rule constraint
Structure (molecular): The internal organization of atoms and bonds that form a molecule - represented in calculations by a molecular model
Substructure searches: The identification of fragments in a molecule model by use of an algorithm
Target property in CAMD: A physical property whose value needs to be within a given range for the candidate molecule (product)
Uncertainty (properties): The inverse of Property Trust
Undesirable candidates: Candidates not fulfilling the requirements regarding feasibility or properties
Undesirable qualities: Qualities not desired in a candidate compound
UNIFAC: Group contribution based model for predicting the liquid phase activity coefficients of compounds present in a mixture
UNIQUAC: Model for estimation of liquid phase activity coefficients (requires information on molecule-molecule interaction as opposed to group-group interaction)
UPBD: Upper bound of objective function
US-EPA: United States Environmental Protection Agency
VOC: Volatile Organic Compound
WAR: Waste Reduction algorithm
Subject Index

Acetic Acid Production Routes - Carbaryl Example
Adaptation of Genetic Operators - Polymer Design
Additional Structural Restrictions
ADOL-C
Advanced Product Design Strategy
Analysis of Design Solutions
Application Example - Optimization in CAMD
Application Example - Problem Description
Atom Balance
Basic Set - Design of Aqueous Blanket Wash Blends
Branch-and-Bound Algorithm Preliminaries
Calculation of Properties in Level 2
CAMD Algorithm
CAMD Framework
CAMD Phase - Extraction Solvent Replacement
CAMD Phase - Mass Separating Agent
CAMD Problem Formulation - Fuel Additives
CAMD Problem Specification
CAPD
Carbaryl Production Routes
Carbon Structure Constraints
Case Studies - GA based CAMD
Case Study CAMD_1 - Optimal Solvent Design
Case Study CAMD_2 - Optimal Solvent Design
Case Study CAMD_3 - Optimal Solvent Design
Case Study in Identification of Multistep Reaction Stoichiometries
Case Study in Optimal Solvent Design
Case Study Objective - Design of Aqueous Blanket Wash Blends
Case Study: Production of 1-Naphthalenyl Methyl Carbamate
Challenge Problem - CAMD Industrial Example
Challenges
Challenges and Opportunities for CAMD
Challenges for the Early Evaluation Tools
Chemical Feasibility Rules
Chemistry Constraints
Chemistry Constraints - Carbaryl Example
Choice of First Order Groups - Refrigerant Design
Churi-Achenie Octet Rule Model
Classification of Groups
Co-Material Design
Co-Material Design (Results) - Carbaryl Example
Co-Material Design Procedure
Combination & Feasibility Rules
Construction of Estimators - Optimal Solvent Design
Creation of Atomic Based Adjacency Description
Current Trends Towards Problem Solutions
Decision Tree Property Model Selection
Definition of Structural Variables
Description of Group Contribution Method
Design for Maximum Solubility - Design of Fuel Additives
Design for Minimum IVD - Design of Fuel Additives
Design of an Aromatic Compound
Design Phase
Design-Relevant Building Blocks
Desirable Properties
DICOPT (Discrete and Continuous OPTimizer)
Drug Design
EH&S and Special Properties
Enabling Technologies
Essential Properties
Evolutionary Design of Fuel Additives
Extension of Hybrid CAMD Method to Complex Molecules
Extraction Solvent Replacement - CAMD Industrial Example
Extractive Distillation
Facets of Solvent-Based Processing Routes that Need Consideration
Feasibility Criteria for the Synthesis of Linear Branched Structures
Final Candidate Selection
First Order Groups and their Bonds
Fitness Function - Polymer Design
Flexible Solution Strategies
Flowchart of the Global Optimization Algorithm
Forbidden Bond and Other Specific Constraints - Refrigerant Design
Formulations
Forward Problem (results) - Design of Fuel Additives
Forward Problem (solution strategy) - Design of Fuel Additives
From Molecule to Materials
GA - Background
GA - Building Block Hypothesis
GA - Fitness Function
GA - Forma Theory
GA - Genetic Encoding
GA - Implementation
GA - Replacement Policy
GA - Schema Theory
GA - Selection of Parents
GA - The Polymer Design Problem
GA Based Search - Polymer Design
GA Parameters - Polymer Design
GAMS Interface
General Problem Formulation - Optimization Methods in CAMD
Generalized CAMD Framework
Generation Algorithm for Level 1 - Hybrid CAMD Method
Generation Algorithm for Level 2 - Hybrid CAMD Method
Generation Algorithm for Level 3 - Hybrid CAMD Method
Generation Algorithm for Level 4 - Hybrid CAMD Method
Generation Level
Generation of 3D Structures
Generation of Feasible Molecular Structures
Generation of Group Vectors from 1st-Order Groups
Generation of Structural Isomers from Group Vectors
Genetic Algorithms & Genetic Programming
Genetic Algorithms Based CAMD
Genetic Search Results - Polymer Design
Global Optimization Methods Based on Interval-Analysis
Hybrid CAMD Method
Hybrid Generate & Test CAMD Algorithm
Hybrid Modelling Approach
Identification of Environmentally Benign Stoichiometries
Identification of Forbidden Bonds Between Groups
Identification of Multistep Reaction Stoichiometries
Incorporation of High-Level Knowledge: Molecular Stability
Insertion and Deletion - Polymer Design
Integration of Process-Product Design
Interval Analysis - Brief Introduction
Inverse Problem (solution strategy) - Design of Fuel Additives
Knowledge Base - Hybrid CAMD Method
Knowledge Extraction - From Rules to Features
LIBRA
Linear Estimators and Branching Functions - B&B Method
Liquid Extraction
Lithographic Blanket Washes
Lower Bound Algorithm - B&B Method
Main-chain Mutation and Side-chain Mutation - Polymer Design
Mass Separating Agent - CAMD Industrial Example
Method & Constraint Selection - Hybrid CAMD Method
Methods & Tools - Optimization in CAMD
Mixture Design Problem Formulation
Mixture Properties
Mixture Property Models - Design of Aqueous Blanket Wash Blends
Molecular Complexity - Polymer Design
Molecular Design - Generation & Test Methods
Molecular Design of Fuel Additives
Molecular Encoding Technique
Molecular Representation
Molecular Structure Representation
Molecular Synthesis
Molecule Representation - Polymer Design
Multidisciplinary Approach
Multi-Step Stoichiometry Identification Results
Multi-Step Stoichiometry Identification Results - Carbaryl Example
Multistep Stoichiometry Identification Algorithm
Near-optimal Solutions - Polymer Design
New Group Combination Property Characterization
Nitric Acid Oxidation of Anthracene to Anthraquinone - CAMD Industrial Example
Octet Rule
Odele-Machietto Octet Rule Model
Optimization Methods in CAMD - I
Optimization Methods in CAMD - II
Parametric Sensitivity and Robustness Analyses for GAs
Polymer Design Case Study
Post-Design Phase
Post-Design Phase - Extraction Solvent Replacement
Post-Design Phase - Mass Separating Agent
Pre-Design Phase
Pre-Design Phase - Extraction Solvent Replacement
Pre-Design Phase - Mass Separating Agent
Prediction of Properties
Primary Pure Component Properties
Problem Definition - Design of Fuel Additives
Problem Definition - Design of Optimal Solvent (Case Study)
Problem Definition - Optimization Methods in CAMD
Problem Formulation Algorithm - Hybrid CAMD Method
Problem Formulation - Reaction Stoichiometries Case Study
Problem Solution - Refrigerant Design
Problem Type and Solution - Optimization Methods in CAMD
ProCAMD & Chem3D
Product Design for Performance
Properties Handled in Level-1
Property Constraints
Property Constraints and Objective Function - Refrigerant Design
Property Level
Property Prediction in Level 3
Property Range
Property Trust
Proposed GA Framework - Polymer Design
Reactor Process Model Equations
Reducing the Combinatorial Size of the Problem
Refrigerant Design Case Study
Results & Discussion - Design of Aqueous Blanket Wash Blends
Results and Discussion - Design of Fuel Additives
Reverse Problem Formulations for Integrated Process-Product Design
Role Specification Constraints
Role Specification Constraints - Carbaryl Example
Role Specification Constraints - Reaction Stoichiometries Case Study
Schematic of Lithographic Printing
Secondary Pure Component Properties
Selection of Branching Function - B&B Method
Selection of Reaction Schemes
Single Step Stoichiometry Enumeration
Single-point Crossover - Polymer Design
SMSWIN
Solvent for Dehydration - CAMD Industrial Example
Solvent for Ethanol Recovery
Solvent for Separation of n-Propyl Acetate from n-Propyl Alcohol
Solvent Mixture Design
Solvent Selection Criteria - Nitric Acid Oxidation Example
Solvent Selection in Industry - I
Solvent Selection in Industry - II
Solvent Selection Methodology in ICAS
Solvent Selection Methodology in SMSWIN
Solvent Selection using ProCAMD
Solvent Selection using SMSWIN - Nitric Acid Oxidation Example
Solving the Multistep Stoichiometry Identification Problem
Special Features for Complex Solutes - CAMD Framework
SQP
Step by Step Algorithm for Solution Technique - B&B Method
Step by Step Algorithm for the Solution Technique
Stoichiometry Identification Formulation
Structural Feasibility Constraints
Structure-Property Relationships - Refrigerant Design
Summary of Problem Formulation - Refrigerant Design
System Gibbs Free Energy
Target Molecule Identification
Target Polymers and their Properties - Polymer Design
Test or Molecule Evaluation Stage
The Algebra of Genetic Algorithms
The Blending Operator - Polymer Design
The Evolution of CAMD
The Hop-mutation Operator - Polymer Design
Thermodynamic and Environmental Property Equations
Upper Bound Algorithm - B&B Method
VerGO
What is CAMD?
Whole Number Stoichiometries Constraints
Author Index

L. E. K. Achenie
C. S. Adjiman
A. Apostolakou
E. A. Brignole
A. Buxton
J. M. Caruthers
M. Cismondi
J. L. Cordiner
R. Gani
P. M. Harper
M. Hostrup
A. Hugo
A. G. Livingston
G. M. Ostrovski
P. Patkar
E. N. Pistikopoulos
M. Sinha
A. Sundaram
V. Venkatasubramanian
J. M. Vinson