134
Structure and Bonding Series Editor: D. M. P. Mingos
Editorial Board: P. Day · X. Duan · L. H. Gade · K. R. Poeppelmeier G. Parkin · J.-P. Sauvage
Structure and Bonding Series Editor: D. M. P. Mingos Recently Published and Forthcoming Volumes
Vol. 134, 2010: Data Mining in Crystallography. Volume Editors: Hofmann, D. W. M., Kuleshova, L. N.
Vol. 133, 2009: Controlled Assembly and Modification of Inorganic Systems. Volume Editor: Wu, X.-T.
Vol. 132, 2009: Molecular Networks. Volume Editor: Hosseini, M. W.
Vol. 131, 2009: Molecular Thermodynamics of Complex Systems. Volume Editors: Lu, X., Hu, Y.
Vol. 130, 2008: Contemporary Metal Boron Chemistry I. Volume Editors: Marder, T. B., Lin, Z.
Vol. 129, 2008: Recognition of Anions. Volume Editor: Vilar, R.
Vol. 128, 2008: Liquid Crystalline Functional Assemblies and Their Supramolecular Structures. Volume Editor: Kato, T.
Vol. 127, 2008: Organometallic and Coordination Chemistry of the Actinides. Volume Editor: Albrecht-Schmitt, T. E.
Vol. 126, 2008: Halogen Bonding: Fundamentals and Applications. Volume Editors: Metrangolo, P., Resnati, G.
Vol. 125, 2007: High Energy Density Materials. Volume Editor: Klapötke, T. H.
Vol. 124, 2007: Ferro- and Antiferroelectricity. Volume Editors: Dalal, N. S., Bussmann-Holder, A.
Vol. 123, 2007: Photofunctional Transition Metal Complexes. Volume Editor: Yam, V. W. W.
Vol. 122, 2006: Single-Molecule Magnets and Related Phenomena. Volume Editor: Winpenny, R.
Vol. 121, 2006: Non-Covalent Multi-Porphyrin Assemblies: Synthesis and Properties. Volume Editor: Alessio, E.
Vol. 120, 2006: Recent Developments in Mercury Science. Volume Editor: Atwood, David A.
Vol. 119, 2005: Layered Double Hydroxides. Volume Editors: Duan, X., Evans, D. G.
Vol. 118, 2005: Semiconductor Nanocrystals and Silicate Nanoparticles. Volume Editors: Peng, X., Mingos, D. M. P.
Vol. 117, 2005: Magnetic Functions Beyond the Spin-Hamiltonian. Volume Editor: Mingos, D. M. P.
Vol. 116, 2005: Intermolecular Forces and Clusters II. Volume Editor: Wales, D. J.
Data Mining in Crystallography Volume Editors: Detlef W.M. Hofmann Liudmila N. Kuleshova
With contributions by J. Apostolakis · C. Buchsbaum · H. Cheng · S. Höhler-Schlimm · D.W.M. Hofmann · R.L. Jernigan · A. Kloczkowski · K. Rajan · S. Rehme · T.Z. Sen
Dr. Detlef W.M. Hofmann Center for Advanced Studies Research & Development in Sardinia (CRS4) Località Piscina Manna, Ed. 1 09010 Pula CA Italy
[email protected]
Liudmila N. Kuleshova Center for Advanced Studies Research & Development in Sardinia (CRS4) Località Piscina Manna 09010 Pula CA Italy
[email protected]
ISSN 0081-5993    e-ISSN 1616-8550
ISBN 978-3-642-04758-9    e-ISBN 978-3-642-04759-6
DOI 10.1007/978-3-642-04759-6
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2009936786
© Springer-Verlag Berlin Heidelberg 2010
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Cover design: KünkelLopka GmbH, Heidelberg, Germany
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Series Editor Prof. D. Michael P. Mingos Principal St. Edmund Hall Oxford OX1 4AR, UK
[email protected]
Volume Editors Dr. D.W.M. Hofmann
Liudmila N. Kuleshova
Center for Advanced Studies Research & Development in Sardinia (CRS4) Località Piscina Manna, Ed. 1 09010 Pula CA Italy
[email protected]
Center for Advanced Studies Research & Development in Sardinia (CRS4) Località Piscina Manna 09010 Pula CA Italy
[email protected]
Editorial Board Prof. Peter Day
Prof. Dr. Kenneth R. Poeppelmeier
Director and Fullerian Professor of Chemistry The Royal Institution of Great Britain 21 Albermarle Street London W1X 4BS, UK
[email protected]
Department of Chemistry Northwestern University 2145 Sheridan Road Evanston, IL 60208-3133 USA
[email protected]
Prof. Xue Duan
Prof. Gerard Parkin
Director State Key Laboratory of Chemical Resource Engineering Beijing University of Chemical Technology 15 Bei San Huan Dong Lu Beijing 100029, P.R. China
[email protected]
Department of Chemistry (Box 3115) Columbia University 3000 Broadway New York, New York 10027, USA
[email protected]
Prof. Lutz H. Gade Anorganisch-Chemisches Institut Universität Heidelberg Im Neuenheimer Feld 270 69120 Heidelberg, Germany
[email protected]
Prof. Jean-Pierre Sauvage Faculté de Chimie Laboratoires de Chimie Organo-Minérale Université Louis Pasteur 4, rue Blaise Pascal 67070 Strasbourg Cedex, France
[email protected]
Structure and Bonding Also Available Electronically
Structure and Bonding is included in Springer's eBook package Chemistry and Materials Science. If a library does not opt for the whole package the book series may be bought on a subscription basis. Also, all back volumes are available electronically. For all customers who have a standing order to the print version of Structure and Bonding, we offer the electronic version via SpringerLink free of charge. If you do not have access, you can still view the table of contents of each volume and the abstract of each article by going to the SpringerLink homepage, clicking on "Chemistry and Materials Science," under Subject Collection, then "Book Series," under Content Type and finally by selecting Structure and Bonding. You will find information about the
– Editorial Board
– Aims and Scope
– Instructions for Authors
– Sample Contribution
at springer.com using the search function by typing in Structure and Bonding. Color figures are published in full color in the electronic version on SpringerLink.
Aims and Scope The series Structure and Bonding publishes critical reviews on topics of research concerned with chemical structure and bonding. The scope of the series spans the entire Periodic Table and addresses structure and bonding issues associated with all of the elements. It also focuses attention on new and developing areas of modern structural and theoretical chemistry such as nanostructures, molecular electronics, designed molecular solids, surfaces, metal clusters and supramolecular structures. Physical and spectroscopic techniques used to determine, examine and model structures fall within the purview of Structure and Bonding to the extent that the focus
is on the scientific results obtained and not on specialist information concerning the techniques themselves. Issues associated with the development of bonding models and generalizations that illuminate the reactivity pathways and rates of chemical processes are also relevant. The individual volumes in the series are thematic. The goal of each volume is to give the reader, whether at a university or in industry, a comprehensive overview of an area where new insights are emerging that are of interest to a larger scientific audience. Thus each review within the volume critically surveys one aspect of that topic and places it within the context of the volume as a whole. The most significant developments of the last 5 to 10 years should be presented using selected examples to illustrate the principles discussed. A description of the physical basis of the experimental techniques that have been used to provide the primary data may also be appropriate, if it has not been covered in detail elsewhere. The coverage need not be exhaustive in data, but should rather be conceptual, concentrating on the new principles being developed that will allow the reader, who is not a specialist in the area covered, to understand the data presented. Discussion of possible future research directions in the area is welcomed. Review articles for the individual volumes are invited by the volume editors. In references Structure and Bonding is abbreviated Struct Bond and is cited as a journal. Impact Factor in 2008: 6.511; Section “Chemistry, Inorganic & Nuclear”: Rank 2 of 40; Section “Chemistry, Physical”: Rank 7 of 113
Preface
Humans have been "manually" extracting patterns from data for centuries, but the increasing volume of data in modern times has called for more automatic approaches. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology have greatly expanded data collection and storage. As data sets have grown in size and complexity, direct hands-on data analysis has increasingly been augmented with indirect, automatic data processing. Data mining has been developed as the tool for extracting hidden patterns from data by using computing power and applying new techniques and methodologies for knowledge discovery. It has been aided by other developments in computer science, such as neural networks, clustering, genetic algorithms (1950s), decision trees (1960s) and support vector machines (1980s). Data mining commonly involves four classes of tasks:
• Classification: arranges the data into predefined groups. For example, an e-mail program might attempt to classify an e-mail as legitimate or spam. Common algorithms include nearest neighbor, the naive Bayes classifier and neural networks.
• Clustering: is like classification, but the groups are not predefined, so the algorithm tries to group similar items together.
• Regression: attempts to find a function which models the data with the least error. A common method is to use genetic programming.
• Association rule learning: searches for relationships between variables. For example, a supermarket might gather data on what each customer buys. Using association rule learning, the supermarket can work out which products are frequently bought together, which is useful for marketing purposes. This is sometimes referred to as "market basket analysis."
The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by the data mining algorithms are necessarily valid. It is common for data mining algorithms to find patterns in the training set which are not present in the general data set; this is called overfitting. To overcome this, the evaluation
uses a test set of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish spam from legitimate e-mails would be trained on a training set of sample e-mails. Once trained, the learned patterns would be applied to the test set of e-mails on which it had not been trained, and the accuracy of these patterns can then be measured from the number of e-mails they classify correctly. A number of statistical methods, such as ROC (receiver operating characteristic) curves, may be used to evaluate the algorithm. If the learned patterns do not meet the desired standards, then it is necessary to reevaluate and change the preprocessing and data mining. If the learned patterns do meet the desired standards, then the final step is to interpret the learned patterns and turn them into knowledge.

In recent years, data mining has been widely used in areas of science and engineering, such as bioinformatics, genetics, medicine, education and electrical power engineering, as well as in businesses and governments, to sift through volumes of data such as airline passenger trip records, census data and supermarket scanner data to produce market research reports.

The present volume, dedicated to the application of Data Mining in Crystallography, is organized as follows. The chapter "An Introduction to Data Mining," written by J. Apostolakis, provides an introduction and a short overview of the mathematical concepts and ideas behind the data mining methods most relevant to crystallography. Some applications are described in the following chapters (and we hope these will make the mathematical concepts of the introduction more accessible to a wide range of readers); other methods have not yet found wide application in crystallography, but we hope the chapter will open up possibilities for future discoveries in this field.

Crystallographers were among the first scientists to recognize the importance of collecting data on crystal structures. As a result, a large amount of crystal structural data is nowadays collected in several large crystallographic databases. The comprehensiveness of the data collection, the structure and quality of the data and the selection of relevant data sets are extremely important to get reasonable results from the fully automatic procedure of data mining. As one example, the Inorganic Crystal Structure Database (ICSD), a source of information for crystallographers, mineralogists, physicists and chemists, is presented in the chapter "Data Bases, the Base for Data Mining" by Ch. Buchsbaum, Sabine Höhler-Schlimm, and S. Rehme, along with a short overview of all existing crystal structural databases.

In the chapter "Data Mining and Inorganic Crystallography" by K. Rajan, an overview is given of the types of information that can be gleaned by applying data mining and statistical learning techniques to inorganic crystallography. The focus is on two broad areas, classification and prediction of data, the two primary roles of data mining as a field. A fundamental issue in inorganic crystallography is to understand the relationship between chemical stoichiometry and crystal structure. The relationship between specific compounds and specific crystal structures is usually developed heuristically by surveying the crystallographic data of known compounds.
This process of structure–chemistry association has laid the historical foundations of identifying crystal structure prototypes and structural classifications.
This demonstrates how informatics can quantitatively accelerate the discovery of structure–chemistry relationships and also be used as the foundation for developing structure–chemistry–property relationships.

In the chapter "Data Mining in Organic Crystallography" by D.W.M. Hofmann, two current applications of data mining are highlighted: cluster analysis and support vector machines (SVMs). SVMs are used to find errors in the Cambridge Structural Database of small-molecule crystal structures and to derive force fields without any hypothesis about the functional form. Since the accuracy of the force fields derived by data mining depends on the number of known crystal structures, this approach should be favored in the long term. The second method, clustering, has been introduced in this field only very recently. An obvious application is the screening of databases to remove undesired repetitions of crystal structures; this is important for all databases, virtual as well as experimental. Its application is also interesting in crystal structure determination, where it can be used to find isostructural crystal structures. Even with this simple application, the knowledge about regularities between isostructural crystal structures gives very valuable information for crystal engineering. A third method, principal component analysis, might become more important in the future, as it is already in use in inorganic crystallography.

The last chapter of the volume, "Data Mining for Protein Secondary Structure Prediction," written by A. Kloczkowski et al., is dedicated to the application of data mining techniques to extract predictive information from protein, DNA and RNA databases. Data mining in biology is a rapidly growing field of science, combining molecular biology, computational methods and statistics for analyzing and processing biological data; this has led to the development of a new field of science, bioinformatics. Mining for information in biological databases involves various forms of data analysis such as clustering, sequence homology searches, structure homology searches, examination of statistical significance, etc. In particular, the data mining of structural fragments of proteins from known structures in the Protein Data Bank significantly improves the accuracy of secondary structure prediction. Since the crystallization of these objects is the most serious bottleneck in high-throughput protein structure determination by diffraction methods, it is worth noting that the data mining approach is also used to characterize the biophysical properties and conditions that control and improve protein crystallization.

The tendency toward increasing accuracy of crystal structure data promises a better quality of patterns obtained by data mining in the future, because the quality of the result depends strongly on the amount, quality and reliability of the data used. The aim of the volume is to show the possibilities of the method for knowledge discovery in crystallography. We hope that it will make data mining more accessible in crystallography and allow new applications in the field and the discovery of nontrivial and scientifically relevant knowledge.

Pula, September 2009
Detlef W.M. Hofmann Liudmila N. Kuleshova
Contents
An Introduction to Data Mining
Joannis Apostolakis ..... 1

Data Bases, the Base for Data Mining
Christian Buchsbaum, Sabine Höhler-Schlimm, and Silke Rehme ..... 37

Data Mining and Inorganic Crystallography
Krishna Rajan ..... 59

Data Mining in Organic Crystallography
Detlef W.M. Hofmann ..... 89

Data Mining for Protein Secondary Structure Prediction
Haitao Cheng, Taner Z. Sen, Robert L. Jernigan, and Andrzej Kloczkowski ..... 135

Index ..... 169
Struct Bond (2010) 134: 1–35
DOI: 10.1007/430_2009_1
© Springer-Verlag Berlin Heidelberg 2009
Published online: 1 September 2009
An Introduction to Data Mining Joannis Apostolakis
Abstract Data mining aims at the automated discovery of knowledge from typically large repositories of data. In science this knowledge is most often integrated into a model describing a particular process or natural phenomenon. Requirements with respect to the predictivity and the generality of the resulting models are usually significantly higher than in other application domains. Therefore, in the use of data mining in the sciences, and crystallography in particular, methods from machine learning and statistics play a significantly higher role than in other application areas. In the context of Crystallography, data collection, cleaning, and warehousing are aspects from standard data mining that play an important role, whereas for the analysis of the data techniques from machine learning and statistical analysis are mostly used. The purpose of this chapter is to introduce the reader to the concepts from that latter part of the knowledge discovery process and to provide a general intuition for the methods and possibilities of the different tools for learning from databases.
Contents
1 Introduction ..... 2
2 Unsupervised Methods ..... 4
   2.1 Density Estimation ..... 4
   2.2 Clustering ..... 15
3 Supervised Methods ..... 19
   3.1 Classification ..... 22
   3.2 Regression ..... 30
   3.3 Bias vs. Variance (Fit vs. Generalizability) ..... 32
4 Conclusion ..... 33
References ..... 34
J. Apostolakis, Ludwig-Maximilians-Universität München, Institut für Informatik, Oettingenstraße 67, 80538 München, Germany; e-mail:
[email protected]
Abbreviations

DBMS   Database management system
ICA    Independent component analysis
KDD    Knowledge discovery in databases
LDA    Linear discriminant analysis
PCA    Principal component analysis
PLS    Partial least squares
RMSD   Root mean square deviation
SVD    Singular value decomposition
SVM    Support vector machine
1 Introduction

In the last century the scientific and technological basis for high performance computing and electronic data management has been laid. Assuming that no major breakthroughs in the principles of storing and handling information electronically will take place for the next 50 years or so, the continuous optimization of processes, hardware, algorithms, software, and standardization in the storage of data, together with the ever increasing scope of application of computer systems for storing all different sorts of data, is expected to continue to change our professional and private lives at a fast pace. At the same time, the ever widening use of automated processes for measuring data, both as part of everyday transactions and as part of scientific experimentation, has led and is continuing to lead to an explosion in the amount of generated data in a number of aspects of everyday life and in the different scientific disciplines. The identification of relevant information in the data and the generation of new knowledge from the existing data is an important process for decision support on the basis of the data, and it is not manageable manually, simply due to the sheer size of the existing data sets and databases. Data mining or knowledge discovery in databases describes the identification of structure or patterns or particular relationships in the data to allow predictions for the future and to support decisions on questions related to the data. According to the definition by Fayyad et al. [9], data mining is a specific part of knowledge discovery in databases (KDD) which entails the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. This description indicates a certain vagueness in the type of information that is expected to be found in the data, and indeed it is often the case that data mining is highly explorative in nature, and we are simply trying to find patterns in the data which warrant further investigation. On the other hand, in many cases, especially in scientific applications, the search for patterns has a given focus. For example, we may be looking for patterns that are found in a particular group of data and not in a second group, or patterns that fit to or could be explained by a particular theoretical model.
An alternative approach to giving a definition to data mining is to investigate its uses and the motivation behind its application. While database management systems (DBMS) have been highly successful in allowing highly efficient retrieval of specific data from very large collections of data, it is clear that in most cases retrieval itself is not sufficient for decision support. We illustrate this point with a simple example from banking. A given person with a certain income, a given account balance, and a certain history with the bank asks for credit. In order for the bank to decide on the risk of the particular credit it could retrieve all identical cases in its history and assess historically the probability of the credit being defaulted. However, the question about what constitutes an identical case arises. Is the account balance enough, or does the current income also play a role? What other type of information can be relevant for the decision? It becomes readily clear that if a number of such criteria is used for the definition of an identical request, the answer to the database query will in most cases be an empty set, or one that simply contains the current request. Data mining can be used to identify both the relevant criteria and their weight for deciding this type of query. In science, in particular, the retrieval based approach corresponds to a case by case type of study, or the search for a single counterexample that may disprove a certain hypothesis. However, as ever more complex systems fall into the scope of quantitative science it is also clear that our description of such complex systems will often be accurate only in an average sense. We are therefore interested in general patterns that are often true for different data, even if there are single exceptions. The exceptions themselves can be of particular interest for further development of a particular theory and should be identified in a second step, for example as outliers, i.e., those data that do not fit into the general pattern. At the same time the identification of patterns within a set of data allows us to compress its description. This is important in the interaction between machines and humans. Compared to computers, humans are better at understanding a general description of a data set, while the former are obviously better at memorizing its single elements. The identification of patterns in a set of data allows us humans to get a better impression of the information in that set and condense it into knowledge. Furthermore, data compression allows us to identify redundancies in the data. Classical data mining aims at the discovery of knowledge as this can be understood in a simple way by a human. This is illustrated by the famous example of the identification of the empirical rule that people who buy diapers are also more likely to buy beer. In crystallography, however, the human user is a scientist for whom a vector of weights can be more intuitive than an association of the type mentioned. This is not to say that scientists are weird in any particular way, it is simply often the case in the scientific domain to use data and computational methods to parametrize a given physical model. Scientists tend to interpret the world in terms of such models and a reasonable parametrization of a simple model will in most cases be significantly more intuitive than some newly discovered rule that is derived from a particular database and may or may not have a more general validity. 
Therefore, in the use of data mining in the sciences, and crystallography in particular, methods from machine learning and statistics play a significantly higher role than in other application
areas. In the context of Crystallography, data collection, cleaning, and warehousing are aspects from standard data mining that play an important role, whereas for the analysis of the data more general techniques from machine learning and statistical analysis continue to play an important role. The purpose of this chapter is to introduce the reader to the concepts from that latter part of the knowledge discovery process to provide a general intuition for the methods and possibilities of the different tools for learning from databases. The description will focus on the two classical aspects of pattern recognition, unsupervised and supervised methods.
2 Unsupervised Methods

Unsupervised methods aim at the identification of structure or patterns in the data, which were not previously known. Structure in the data is often found as the general relations between the data descriptors or at the basic level of similarity between pairs of data points. The former is the basis for general density estimation approaches such as principal components analysis or naive Bayes approaches, while the latter is the basis for clustering and the identification of discrete structures in the data. The identification of similarities and differences in an efficient manner is therefore an important task in data mining which often precedes further analysis. The comparison between descriptors is typically measured by simple statistics such as correlation or mutual entropy. Objects, on the other hand, generally require the definition of corresponding similarity or distance functions that take as arguments the two objects that need to be compared. The reason for this is simply that in the latter case the objects need to be compared with respect to a number of different types of properties, and the combined similarity needs to be determined by weighting partial similarities of different types. On the other hand, the comparison of descriptors is in general a more straightforward question as we simply compare the values of the same type of property over different objects. In the following sections, some of the typical correlation and similarity functions will be described.
2.1 Density Estimation

2.1.1 Correlations and Dependencies in the Descriptors

Linear Correlation

The simplest approach for identifying correlation between two variables is to assume that they are linearly related. For two variables x and y for which we have N observations, Pearson's correlation coefficient r is defined as
r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}.    (1)
The observant reader will notice that the correlation coefficient corresponds to the scalar product of the normalized and centralized data vectors x′:

x'_i = \frac{x_i - \bar{x}}{\sqrt{\sum_i (x_i - \bar{x})^2}}.    (2)
The representation of a correlation coefficient as a scalar product provides significant insight into its properties, as it allows a direct geometric representation. In fact, one of the main advantages of linear methods is the fact that they can generally be visualized in a manner that provides some intuition of the results and the possible problems. Linear correlation takes values from −1 for perfect anti-correlation, through 0 for no linear correlation, to 1 for perfect correlation. It can be calculated on the basis of relatively few data points and is very useful for screening for dependencies among thousands of variables. It does however have a number of disadvantages. It is quite sensitive to outliers. A few points or a single point lying outside the bulk of the rest of the points can lead to either a very strong correlation or destroy an existing correlation (see Fig. 1). In both cases, the overall correlation is dominated by a few or only one point, which may after all be a true measurement error. Rank correlation (the correlation of ranks, as opposed to that of the properties themselves) can be used to ameliorate some of the problems connected to the sensitivity to outliers, but it does not solve problems connected to nonlinearity. As the name suggests, linear correlation does not necessarily detect nonlinear relations. The most typical example of that is the quadratic relation shown in Fig. 2. The linear correlation coefficient between X and Y for that data is zero even though clearly Y depends quadratically on X. For more general and especially unknown relations, the correlation between nonlinear transformations of the original variables and particularly mutual entropy and information gain are used.
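The scalar-product view of (1) and (2) is easy to check numerically. The following sketch (illustrative only; the data, the linear relation and the use of NumPy are assumptions, not taken from the chapter) standardizes two vectors as in (2) and verifies that their dot product reproduces Pearson's r:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.7 * x + rng.normal(scale=0.5, size=200)    # hypothetical, linearly related data

def standardize(v):
    """Center and scale as in (2): subtract the mean, divide by the root of the sum of squares."""
    v = v - v.mean()
    return v / np.sqrt(np.sum(v ** 2))

r_dot = standardize(x) @ standardize(y)   # scalar product of normalized, centralized vectors
r_ref = np.corrcoef(x, y)[0, 1]           # Pearson's r for comparison
print(r_dot, r_ref)                        # the two values agree to numerical precision
```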
Fig. 1 Examples where single outliers lead to overestimation (left plot) or underestimation (right plot) of the overall correlation
Fig. 2 A simple quadratic dependence between two variables shows no linear correlation
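Both failure modes discussed above, the outlier sensitivity of Fig. 1 and the blindness to the quadratic dependence of Fig. 2, can be reproduced in a few lines. This is a sketch on synthetic data using SciPy (the data and the use of scipy.stats are assumptions, not a reproduction of the figures):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(1)

# A single extreme point can create an apparent correlation in otherwise uncorrelated data.
x = np.append(rng.normal(size=50), 20.0)
y = np.append(rng.normal(size=50), 20.0)
print("with outlier: Pearson r = %.2f, Spearman rho = %.2f"
      % (pearsonr(x, y)[0], spearmanr(x, y)[0]))

# A quadratic dependence (as in Fig. 2) gives essentially zero linear correlation.
x = np.linspace(-1.0, 1.0, 101)
y = x ** 2
print("quadratic: Pearson r = %.2f" % pearsonr(x, y)[0])
```

Rank correlation (Spearman) tames the outlier but, like Pearson's r, still misses the symmetric quadratic relation, which motivates the more general dependence measures discussed next.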
Statistical Dependence

The most general form of correlation between any two properties is that of statistical dependence. Unlike linear correlation or nonlinear correlation, the question of statistical dependence does not depend on any assumptions about the type of the correlates or the functional form of the dependence. Statistical dependence is directly related to the question of whether knowledge of one fact, for example the value of a given variable, in any way influences our knowledge of another variable. To understand statistical dependence it is, however, necessary to introduce some basic concepts from statistics. This is worth our time at this stage as these concepts also form the basis for related methods, such as density estimation, naive Bayes methods, and independent component analysis.

Basic Concepts

Possibly the best way to understand statistical dependence is to study a simple example of dice throwing. Assume we are given two fair dice and these are thrown one after the other. We will call the values of the two dice x and y, respectively. As we are dealing with fair dice we have p(x = i) = p(y = i) = 1/6 for i ∈ 1..6. Now imagine we are taking statistics over different properties of the result, for example the probability that the value of the die is an even number (p(even)), whether it is divisible by three (p(3, 6)), and whether it is a number smaller than or equal to 3 (p(x ≤ 3)). The probability of each of these more complex results can be obtained as the sum of the basic cases (p(x = i)) consistent with the result. For example, p(even) = p(2) + p(4) + p(6) = 0.5. Since any number can be either even or uneven, it is clear that p(uneven) = 1 − p(even) = p(1) + p(3) + p(5) = 0.5. By simple counting of the possible results consistent with the different events we see that p(even) = p(x ≤ 3) = 1/2 and p(3, 6) = 1/3. For most people it is quite clear
that the probability of a combined result in two sequential throws of the die, for example the probability that in the first throw we obtain an even number, while in the second throw we obtain a small number is equal to the product of the single probabilities. p(even, x ≤ 3)seq = p(even) ∗ p(x ≤ 3) = 0.5 ∗ 0.5 = 0.25.
(3)
This works for any other combination of events, and we say that the result of consecutive throws are (statistically) independent. The situation changes if we study those events for a single throw of the die. The question is what is the probability that the result is both even and a small number? Clearly this event is only fulfilled by the value x = 2, and we know that this probability is equal to 1/6, which is not equal to the result obtained under an independence assumption (3). By taking both observations from a single die we have connected their outcome in a way that at first may not be completely clear. To understand this result it is necessary to see how the two events are related. The fact that the value of the die is an even number influences the fact that it is at the same time a small number. There is only one small number among the even numbers. So in the subset of even numbers in our set the probability of picking an even number has changed from 1/2 to 1/3. This probability is written p(x ≤ 3|even) = 1/3, and read as the probability of a small number given the fact that it is an even number. Now we can write the overall probability for p(even, x ≤ 3)single = p(x ≤ 3|even) ∗ p(even) = 1/3 ∗ 1/2 = 1/6.
(4)
And this is now the general equation for obtaining the probability of combined events X and Y . p(X,Y ) = p(X|Y ) ∗ p(Y ) = p(Y |X) ∗ p(X),
(5)
where we have now also shown the symmetry of the operation. The sequence in which we regard the two different events is not relevant, it works both ways. This equation is also correct for the independent case (e.g., the case of the sequential die throws). In that case we simply have p(x ≤ 3|even) = p(x ≤ 3)
(6)
because the probability of throwing a small number with the second die does not change when we know the result of the first die. This is in fact the condition for statistical independence of two quantities. Two properties x and y are statistically independent when p(x|y) = p(x). In other words, the knowledge of x does not change in any way our knowledge of y. The distribution of x is independent of the value of y. It is easy to show that from p(x|y) = p(x) follows p(y|x) = p(y):

p(x, y) = p(x|y) p(y) = p(x) p(y) = p(y|x) p(x),    (7)

from which, for the parts of the space for which p(x) ≠ 0, we derive the identity.
In the next example, we show the power of statistical dependence and why it is so relevant for machine learning and statistics. Returning to our previous example, we note that the two properties appear to be of similar type. Both are properties of the values obtained from the die. However, we could design a similar experiment that leads to the same distribution of values and value pairs, which has a significant difference from the current example: We now have two different types of dice. Both have only three values, one only the even values, one only the odd numbers. We put both dice in a bag and choose one without looking, and then throw it. We then report the result only with respect to whether the obtained number is high or low. The question is, can we deduce from that information something about the identity of the thrown dice (i.e., whether it is the even or odd die)? For example we report that the value is high. What is the probability that it was thrown with the even die? In other words, we are looking for p(even|high) (read: probability of having used the even die given the fact that we got a high value) p(even, high) = p(high|even) ∗ p(even) = p(even|high) ∗ p(high)
(8)
from which we get by a simple rearrangement

p(even|high) = \frac{p(high|even)\, p(even)}{p(high)} = \frac{2/3 \cdot 1/2}{1/2} = \frac{2}{3},    (9)
correspondingly, for the probability that it was thrown with the odd die we have

p(odd|high) = \frac{p(high|odd)\, p(odd)}{p(high)} = \frac{1/3 \cdot 1/2}{1/2} = \frac{1}{3},    (10)
and fittingly the probabilities add up to 1 as one of the two possibilities must be true. As the denominator is identical for both terms it is often neglected and we instead note the proportionality: p(odd|high) ∼ p(high|odd)p(odd).
(11)
Now the interesting thing is that we have a higher probability for the even die being behind the result, thus we can make a prediction on an unknown variable of the system. This prediction is only possible because the unknown variable and the measured variable are statistically dependent. The fly in the ointment is that given the current status there is still a significant probability of one third that the result was obtained by an odd die. How can we increase our knowledge of the hidden variable, i.e., make a more confident prediction of the identity of the thrown die? There is a very simple way to do this. We need to increase the sampling. To this end we have to change the rules of the game. The die is still selected from the bag, but then it is thrown N times instead of only once, and the result is reported as the sequence S of high and low values that are thrown. In that sequence M high values and N − M low values are obtained. Disregarding the exact sequence and calculating only the probability of it containing M high values
in N throws, we have that

p(even|S) = \frac{p(S|even)\, p(even)}{p(S)} = \frac{(2/3)^M (1/3)^{N-M} (1/2)}{p(S)} = 2^{M-1} \left(\frac{1}{3}\right)^{N} p(S)^{-1},

correspondingly, for the probability that it was thrown with the odd die we have

p(odd|S) = \frac{p(S|odd)\, p(odd)}{p(S)} = \frac{(1/3)^M (2/3)^{N-M} (1/2)}{p(S)} = 2^{N-M-1} \left(\frac{1}{3}\right)^{N} p(S)^{-1}.
For comparison, often the ratio of the two probabilities is used, as a number of factors fall off to simplify the overall expression:

\frac{p(even|S)}{p(odd|S)} = 2^{2M-N}.    (12)
Now assuming we have thrown the die 20 times, achieving a high value in 12 cases, the ratio is equal to 2^4 = 16, meaning there is a (2^4 + 1)^{-1} < 6% probability that this result was obtained with the odd die, while most probably it was obtained with the even die. By further increasing the sampling (number of throws), we can take this result to practical certainty. The interesting thing about this whole example is that it demonstrates how we can use modeling and statistics to derive information about unknown properties of a given system (the identity of the die). The statistics part of this last example is obvious, however, where does the modeling come into play? In this case the model is given a priori by the rules of the game (drawing a die from two possible types and then throwing it N times). In a more typical scenario, we would be given sets of results of this game, say 10 different sequences of high/low values of length N, and we would be asked to propose a model that explains and possibly even predicts to a certain extent certain properties of these results. Clearly in most cases finding a model will be more difficult than actually parameterizing it. However, one of the important quality criteria of a model is the question, how well and how easily can it be parameterized to fit the data of the process it is to describe? Furthermore, machine learning provides a number of standard models that appear to describe many processes reasonably well, so that the choice often becomes one of matching the particular data to a given model.
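A few lines of exact arithmetic make the numbers in (12) concrete. The sketch below is hypothetical code (not from the chapter) that evaluates the posterior odds for the dice experiment and confirms that they reduce to 2^(2M−N):

```python
from fractions import Fraction

def posterior_odds(N, M):
    """Odds p(even|S) : p(odd|S) for M high values in N throws.

    The 'even' die shows a high value with probability 2/3, the 'odd' die with 1/3;
    both dice are drawn with prior probability 1/2, so the priors cancel in the ratio."""
    like_even = Fraction(2, 3) ** M * Fraction(1, 3) ** (N - M)
    like_odd = Fraction(1, 3) ** M * Fraction(2, 3) ** (N - M)
    return like_even / like_odd

N, M = 20, 12
odds = posterior_odds(N, M)
print(odds == Fraction(2) ** (2 * M - N))   # True: the ratio is exactly 2**(2M-N) = 16
print(float(1 / (odds + 1)))                # ~0.059, i.e. below 6% probability for the odd die
```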
One basic question that often comes up is why statistical models should be a good choice for modeling a given system. There are two main reasons for this. First, statistical noise is often contained in scientific data and is by definition treated best as a random variable drawn from a given error distribution. Second, variables which per se are unknown and not controlled in the experiment are often best treated as additional randomly fluctuating terms. On the other hand, there are also certain dangers in using statistical dependence to derive relationships between properties. The first is the fact that any relationships that need to be determined require the assessment of the corresponding multidimensional distributions p(y, x). Their determination requires sufficient statistics over the relevant range of values, a requirement that is often overlooked. Poor statistics can indicate dependencies where none exist. This problem is often referred to as spurious correlations. The second is that any relationships that are derived are strictly valid only for the data set they are derived from. Therefore, even if we say in the above example that the properties of evenness and smallness appear statistically dependent, this is only true for the value range of the die. This dependence certainly has no general validity; we cannot claim that even numbers are in general larger than odd numbers. This fact is often overlooked, and the generalization of dependencies derived from a given data set to general rules is a risky business.

Information Gain

The earlier description explains the concept of statistical dependence. To use it for identifying relevant and significant dependencies in the data it is necessary to quantify the magnitude of the dependence. This can be achieved on the basis of the idea of information gain. Information gain describes to which extent knowledge of one property's value influences our knowledge of a second property, just as knowledge of the results from the dice experiment increased our knowledge of the type of die that was being used. To identify the information gain it is necessary to identify a well-defined measure of the amount of information we have about a particular property. Clearly, as we are dealing with probability distributions, the measure should be able to cope with them. In a landmark paper that practically initiated the field of information theory, Claude Shannon showed [31] that any measure of information that is consistent with three basic axioms (additivity, continuity, and monotonicity) necessarily has the functional form of an entropy, i.e., the measure of disorder in a thermodynamic system as derived by Ludwig Boltzmann. Without going into details here, we will simply mention that the amount of information required to describe the state of a given system (and possibly its time evolution) is given by the number of bits required for defining it. The number of bits is minimized by an encoding that uses few bits to describe probable states, and more for less probable states. Shannon could show that for an optimal encoding the number of bits required for describing a long sequence of states is proportional to the entropy H of the system:

H(x) = -\sum_x p(x) \lg p(x).    (13)
As a result of Shannon's insight, entropy and information are often used interchangeably when they apply to questions of information theory. Due to the confusion arising from the use of information in a colloquial sense, we will generally use the term entropy here. A few interesting properties of the entropy need to be mentioned here:
• States with zero probability have no contribution to the entropy.
• The entropy is always nonnegative.
• The entropy is minimal (zero) for a δ-distribution, i.e., one that is equal to 1 for a particular state and 0 for all the others. In that case we have certainty about the state of the system and we need no bits at all to communicate the current state.
• The entropy is maximal for a uniform distribution.
The interesting property of H is that it describes in a well-defined way the "broadness" of a particular distribution even though it is not affected by its topology. The more bits we need on average to further define the system, the higher its entropy, the less we know about its current state at any given time. To understand the term information gain we simply need to introduce the conditional entropy:

H(x|y) = -\sum_y p(y) \sum_x p(x|y) \lg p(x|y).    (14)
The information gain I is defined as I(x; y) = H(x) − H(x|y)
(15)
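Equations (13)–(15) can be evaluated directly from an empirical joint distribution. The sketch below is an illustration on an invented joint distribution (the table of probabilities and the use of NumPy are assumptions); base-2 logarithms are used, so all quantities are in bits:

```python
import numpy as np

def entropy(p):
    """H = -sum p lg p, as in (13); zero-probability states contribute nothing."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint distribution p(x, y): rows are values of x, columns are values of y.
pxy = np.array([[0.30, 0.10],
                [0.05, 0.25],
                [0.05, 0.25]])
px = pxy.sum(axis=1)    # marginal p(x)
py = pxy.sum(axis=0)    # marginal p(y)

H_x = entropy(px)
# Conditional entropy (14): average over y of the entropy of p(x|y).
H_x_given_y = sum(py[j] * entropy(pxy[:, j] / py[j]) for j in range(len(py)))
print("H(x) =", H_x, " H(x|y) =", H_x_given_y, " I(x;y) =", H_x - H_x_given_y)
```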
The information gain describes how much we learn about the possible value of x if we know the value of y. Clearly, when x and y are statistically independent, p(x) = p(x|y), H(x) = H(x|y), and the information gain is zero. The conditional entropy and the information gain are practically the most general measures of dependence between variables, because they do not make any assumptions on the functional form of the dependence. They form the basis of a number of important techniques in pattern recognition and analysis, such as independent component analysis [28] and decision trees [29]. The price for their generality is that they depend on the determination of the distributions of the variables, which generally requires significantly larger data sets than are required for linear correlation.

Association Rule Mining

Given large databases of experimental data it is natural to search for associations or rules that relate certain features of the data. Examples of this type of correlations and dependencies have been manually identified in crystallography [5], through manual interrogation of the CSD database [3]. The authors have found diverse relations between properties of the compounds, e.g., the presence of chiral centers and crystal packing features, such as space group preferences. These types of relations can be used to check the validity of new structures, especially when these are predicted.
Association rule mining is an empirical, yet automated approach for identifying this type of statistical dependencies between data features in a given data set. In its simplest form it can be seen as looking for significant co-occurrences of certain features in a set of item sets. The item sets in our terminology correspond to the single data points or objects. It is an empirical search for a dependency of the type "from A follows B." Given a particular data set of objects that can have features A and B, we want to know how high the probability is that when A is given, B will also be true for the particular object. The first measure of the association between A and B is its support in the data, s(A, B). This is given by the ratio of the number N_{AB} of objects for which both A and B are true over the overall number N of objects (data points) in the set. The support is important to limit the search to patterns that are statistically significant. The second relevant measure is the confidence c(A, B), which again is given by the ratio of N_{AB} over the number N_A of objects for which A is true. The confidence is practically the empirical measure of the conditional probability p(B|A). As such it is directly connected to the statistical dependence between A and B and to the information between A and B. In association rule mining we are looking for all associations between two features A_i and B_j with s(A_i, B_j) ≥ s_{thresh} and c(A_i, B_j) ≥ c_{thresh}, with s_{thresh} and c_{thresh} user-defined parameters of the search. The advantage of association rule mining is that highly efficient algorithms exist for it (e.g., [2]), and that it is therefore the method of choice when large numbers of both features and data are to be mined. It is particularly useful for deriving empirical rules in cases where little previous knowledge exists. In a scientific setting this corresponds to identifying phenomenological associations which can then be further substantiated by theory. While association mining is one of the cornerstones of data mining, it may be of different relevance for different scientific disciplines. Up to now it has found relatively little application in crystallography, while its relevance for other fields of science, such as association studies in genetics, is obvious [25]. Text mining can be seen in some of its variants as a particular type of association rule mining. The simple co-occurrence of certain terms in sentences or scientific abstracts may be taken to indicate a connection between the terms, and this is indeed often used to extract more complex dependencies from unstructured scientific texts (e.g., [12, 13]). In crystallography, text mining has been used to support the automatic creation of CrystalEye, a database of structures, from scientific publications [26].
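Support and confidence as defined above are simple counting operations. The toy item sets below are invented for illustration (they do not come from the CSD or any other database), and the feature names are hypothetical:

```python
# Each item set lists the features that hold for one object (a hypothetical toy example).
itemsets = [
    {"chiral", "P2_1"}, {"chiral", "P2_1"}, {"chiral", "P2_1", "hydrate"},
    {"P2_1"}, {"chiral"}, {"hydrate"},
]

def support(a, b):
    """s(A, B): fraction of objects for which both A and B hold."""
    return sum(a in s and b in s for s in itemsets) / len(itemsets)

def confidence(a, b):
    """c(A, B): empirical estimate of p(B|A), i.e. N_AB / N_A."""
    n_a = sum(a in s for s in itemsets)
    n_ab = sum(a in s and b in s for s in itemsets)
    return n_ab / n_a

print(support("chiral", "P2_1"))      # 3/6 = 0.5
print(confidence("chiral", "P2_1"))   # 3/4 = 0.75
```

A rule "chiral → P2_1" would be reported only if both values exceed the user-defined thresholds s_{thresh} and c_{thresh}.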
2.1.2 Density Estimation

The representation of a large data set by an approximate probability distribution allows the condensation of the information in the database and the identification of outliers as well as the identification of the relations between variables. In general, the different methods that exist in this direction rely on performing some decomposition of the overall probability density as a product of independent probabilities. The most
straightforward approach there is naive Bayes; however, methods like principal components analysis and independent component analysis have the same aim. They first look for linear combinations of the original variables that appear truly independent of each other and therefore constitute a better basis for the decomposition.
Naive Bayes

Similar to the calculation of the probability of a model (11) given one-dimensional data, for a general high-dimensional probability distribution we have:

p(C|X_1, X_2, \ldots, X_N) \sim p(X_1, \ldots, X_N|C) \cdot p(C)    (16)
The first term on the right side needs to be either determined explicitly, which requires a significant amount of data as an N-dimensional space needs to be sampled, or it needs to be approximated. The simplest approximation is to assume the independence of the conditional probabilities:

p(X_1, X_2, \ldots, X_N|C) = p(X_1|C) \cdot p(X_2|C) \cdots p(X_N|C)    (17)
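Once per-feature conditional probabilities have been estimated, applying the factorization (17) amounts to multiplying table lookups. The sketch below uses invented probability tables and class names (all of them assumptions for illustration) to score two classes for one observation:

```python
import numpy as np

# Hypothetical per-feature conditional probabilities p(X_i = value | C) for two classes;
# each dictionary maps a feature value to its probability under the given class.
p_given_C = {
    "A": [{0: 0.8, 1: 0.2}, {0: 0.6, 1: 0.4}, {0: 0.3, 1: 0.7}],
    "B": [{0: 0.3, 1: 0.7}, {0: 0.5, 1: 0.5}, {0: 0.9, 1: 0.1}],
}
prior = {"A": 0.5, "B": 0.5}

x = [1, 0, 1]   # one observation of the three features

def naive_bayes_score(c):
    """p(X_1,...,X_N | C) * p(C) under the independence assumption (17)."""
    likelihood = np.prod([p_given_C[c][i][xi] for i, xi in enumerate(x)])
    return likelihood * prior[c]

scores = {c: naive_bayes_score(c) for c in prior}
print(scores)   # the class with the larger score is the naive Bayes prediction
```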
This approximation, plugged into (16), is called the naive Bayes approach. This type of idea finds its application in crystallography as it underlies the work on potentials of mean force such as those suggested by Sippl [14, 32] for the prediction of crystal structures of proteins. These methods have been derived on the basis of a quasi-physical model but are in essence naive Bayes approaches of the type described here. In a potential of mean force the statistics from a database is used to obtain probabilities, from which again an energy (potential) is derived. The potential is proportional to the logarithm of the derived probabilities. The additivity of the single terms in the potential again means that the overall probability for a given structure is decomposed as a product of the type shown in (17).

PCA, ICA, Nonlinear PCA

For reasons that practically go back to the central limit theorem, data will often look like an extended cloud as in Fig. 3. In this case, the first reduction we can perform is to identify the center of the distribution. In the next step, we generally would like to obtain an idea of the width and the "direction" of the distribution. One approach would be to identify the two points lying furthest apart from each other and report this as the maximal width. However, this measure would depend crucially on only two data points, while revealing relatively little on the rest of the distribution. A significantly more robust feature of the distribution is the main (or principal) direction of the data "cloud" (see Fig. 3). This can be found as the direction of maximal variance in the data. To understand what that means, consider a direction vector d in the data space. If we project each data point on d we obtain the component of each data point along that direction. The principal component is the vector d which shows
Fig. 3 Principal components of a two dimensional distribution. The solid arrow indicates the direction of the first component, the broken arrow the direction of the second component
the largest variance of the corresponding components in the data. It is the vector that best captures the breadth of the data distribution. The way to find it is relatively simple. Assuming d is normalized, the product c = X ∗ d between the data matrix X and the vector d gives a vector c with the components of each data point along that direction. The variance of the components is given as
\sum_i c_i^2 - \Big(\sum_i c_i\Big)^2.    (18)
The second sum is the square of the average of the components. Assuming that the data are centered, this term is zero independent of the direction of d, and to maximize the variance of the data all we need to do is maximize the first term. That again is equal to the scalar product of c with itself:

c^T c = d^T X^T X \, d.    (19)
The matrix X^T X is the positive semi-definite, symmetric scatter matrix of the data. The normalized vector d maximizing (19) is the eigenvector with the highest eigenvalue. Once we have found the main (principal) direction of the distribution, it is reasonable to ask about the secondary direction of the distribution and so on. It turns out that these correspond to the other eigenvectors of the covariance matrix. The different principal components of the data are normal to each other and can be used to describe the density of the data as a Gaussian distribution.
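The construction just described, maximizing d^T X^T X d over normalized vectors d, is what any PCA routine does internally. The following sketch uses synthetic data (the data, and the choice of NumPy and scikit-learn, are assumptions) to extract the principal directions from the eigendecomposition of the scatter matrix and to compare them with an off-the-shelf PCA:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Synthetic two-dimensional "cloud" with an elongated main direction, as in Fig. 3.
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X = X - X.mean(axis=0)                       # center the data first

scatter = X.T @ X                            # the scatter matrix X^T X of (19)
eigvals, eigvecs = np.linalg.eigh(scatter)   # symmetric matrix: eigenvalues in ascending order
principal = eigvecs[:, np.argmax(eigvals)]   # eigenvector with the largest eigenvalue

print(principal)
print(PCA(n_components=2).fit(X).components_[0])   # same direction, possibly with opposite sign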
An alternative and more general measure of a distribution is its information content. In independent component analysis (ICA) [28] the direction of maximal entropy is searched for. This is a significantly more complex task than finding the direction of maximal variance. Once the first such direction is found, the next direction is found in the subset of all component vectors which is statistically independent of those found up to that point, hence the name. The interesting property of ICA with respect to PCA is that it does not make any explicit assumptions on the distribution of the data (i.e., whether they are reasonably represented by a Gaussian density). Further related procedures like nonlinear PCA allow the identification of nonlinear coordinates that describe the data. We will simply mention here the kernel PCA approach (see e.g., [7]) as it is closely related to support vector machines (SVM) and is easily explained. In kernel PCA, instead of working with the covariance matrix X^T X, the kernel matrix K_{ij} is used. Just as in SVMs the kernel matrix defines a nonlinear transformation of the data. The linear PCA in the transformed space then corresponds to nonlinear components in the original space.
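Both ICA and kernel PCA are available off the shelf; the sketch below (synthetic, non-Gaussian source data and default parameters, all of which are assumptions) is only meant to show how the nonlinear and independence-based variants are invoked, not a worked analysis:

```python
import numpy as np
from sklearn.decomposition import FastICA, KernelPCA

rng = np.random.default_rng(3)
# Two hypothetical non-Gaussian source signals, linearly mixed into observed data X.
S_true = rng.uniform(-1.0, 1.0, size=(300, 2))
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
X = S_true @ A.T

# Independent components: directions chosen by statistical independence rather than variance.
S_est = FastICA(n_components=2, random_state=0).fit_transform(X)

# Kernel PCA: linear PCA in the feature space implied by the kernel (here an RBF kernel),
# which corresponds to nonlinear components in the original space.
Z = KernelPCA(n_components=2, kernel="rbf").fit_transform(X)

print(S_est.shape, Z.shape)   # (300, 2) each
```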
2.2 Clustering

PCA and related methods describe the general distribution of uniform data; however, they do not perform that well when the data contains more than one group. Identifying this type of grouping or classes in the data is useful for a number of reasons. First, it can give a contracted representation of the data distribution. Second, the identification of groups is often an aim in itself in the sense that it allows a first classification of the data on the basis of which further analysis can be performed. Furthermore, the identification of single outliers that appear to be very different from the rest of the data can be either of particular interest as such or used for the identification of measurement errors. A number of general reviews on clustering have been given (e.g., [17]). It lies beyond the scope of this chapter to go into deep detail on the different available types of clustering methods, but we will pick out some of the simplest and best known methods and explain some basic concepts used in classifying clustering algorithms. The basis for clustering is the choice of a distance function between two data points and then its generalization to the distance (compactness) of clusters. While the distance of two objects is a general question, the distance between clusters is a question inherent to clustering which gives rise to related algorithms for hierarchical clustering. In the following sections we describe a number of common choices for object similarity and we discuss clustering itself.
2.2.1 Similarity Functions

As explained above, the identification of discrete structures in the data requires the definition of a similarity or distance function which allows the grouping of data into
substructures. Given the representation of the objects M_1 and M_2, an exact similarity function s(M_1, M_2) needs to be defined which typically should return 1 for identity and 0 or some negative number for absolutely no similarity. It is important to bear in mind that here identity implies only representation identity, not necessarily object identity. Often, a distance d(M_1, M_2) is used instead of a similarity. A distance of 0 implies (representation) identity. Distances are not bounded from above, unless the representations are normalized. In general, it is simple to obtain a transformation from distances to similarities by an equation of the type:

s(M_1, M_2) = 1 − a · d(M_1, M_2)
(20)
with a as the suitable scaling constant. The simple transformation between the two measures allows interchangeable use of the two. For the direct calculation of similarities the object is typically represented as some type of real valued or integer vector. A simple form of similarity measure is then the cosine coefficient, see e.g., [10],

s(M_1, M_2) = \frac{\sum_j M_{1,j} M_{2,j}}{\sqrt{\sum_j M_{1,j}^2 \; \sum_j M_{2,j}^2}},
(21)
which in principle is a correlation coefficient over the different descriptors of the two objects. Furthermore, the Tanimoto coefficient is also often used:

S_T(M_1, M_2) = \frac{\sum_j M_{1,j} M_{2,j}}{\sum_j M_{1,j}^2 + \sum_j M_{2,j}^2 − \sum_j M_{1,j} M_{2,j}},
(22)
which for nominal attributes extends to,
S_T(M_1, M_2) = \frac{|M_1 ∩ M_2|}{|M_1 ∪ M_2|}.
(23)
Intuitively the latter is a ratio of the number of properties shared by two objects over the total number of properties that hold for either of the two. Of the distance metrics the most commonly used are the Manhattan distance and the Euclidean distance, which are defined as follows:

D_M(M_1, M_2) = \sum_j |M_{1,j} − M_{2,j}|,    (24)

D_E(M_1, M_2) = \sqrt{\sum_j (M_{1,j} − M_{2,j})^2}.    (25)

Both are special cases of the p-norm, for p = 1 and p = 2:

D_p(M_1, M_2) = \left( \sum_j |M_{1,j} − M_{2,j}|^p \right)^{1/p}    (26)
The effect of the magnitude of p is that as it increases it weights the most deviating features ever more. For large p the p-norm converges towards the max-norm:

D_{max}(M_1, M_2) = \max_j |M_{1,j} − M_{2,j}|,    (27)

which has the effect of giving back the largest deviation among the features of two objects. Its advantage is that it gives an upper bound on the difference of any given feature between two objects, but it has the disadvantage of being too sensitive to single values. From that point of view the optimal choice would be a low p, for example the Manhattan distance. The Euclidean distance is most often used, due to its geometrical interpretation and the fact that it allows analytical solutions to superposition problems. The root mean squared deviation (RMSD), often used to compare different structures of identical molecules, is effectively also a Euclidean distance. Furthermore, the least squares method for superposing two structures given by Kabsch effectively minimizes the Euclidean distance of two molecular structure vectors [18, 19].
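The similarity and distance measures above are one-liners in practice; a minimal sketch (assuming NumPy as the numerical library, which the text does not prescribe) could look like this:

```python
import numpy as np

def cosine_similarity(m1, m2):
    """Cosine coefficient, (21): correlation-like similarity of two descriptor vectors."""
    return np.dot(m1, m2) / np.sqrt(np.dot(m1, m1) * np.dot(m2, m2))

def tanimoto(m1, m2):
    """Tanimoto coefficient, (22)."""
    prod = np.dot(m1, m2)
    return prod / (np.dot(m1, m1) + np.dot(m2, m2) - prod)

def p_norm_distance(m1, m2, p=2):
    """p-norm distance, (26); p=1 is the Manhattan, p=2 the Euclidean distance."""
    return np.sum(np.abs(m1 - m2) ** p) ** (1.0 / p)

def max_norm_distance(m1, m2):
    """Max-norm distance, (27): the largest single-feature deviation."""
    return np.max(np.abs(m1 - m2))
```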
2.2.2 Clustering Algorithms

Hierarchical Clustering

Hierarchical clustering is defined recursively and leads to a cluster tree that represents the single steps taken during the clustering, thereby allowing the user to decide a posteriori on the number of clusters. For the recursive clustering two fundamental choices exist: either placing all data into a single cluster and then iteratively breaking it up into smaller clusters (divisive or partitioning approach), or starting with each data point in a single cluster and then merging clusters (agglomerative).

Agglomerative Clustering

Hierarchical agglomerative clustering starts with each data point defined as a (singleton) cluster and continues iteratively, always reducing the number of clusters by one, until all data points have been merged into a single cluster. At each iteration the algorithm simply decides which two clusters are most similar and joins them into a new one, thus reducing the overall number of clusters by one. Thus upon clustering N objects a cluster tree with N levels is created. The user can then select the number of clusters he requires. The different variants of hierarchical agglomerative clustering differ simply in their definition of cluster similarity.
Single Linkage

In single linkage clustering [20] the distance between two clusters is equal to the minimal distance of any pair of their elements. Single linkage clustering is therefore
very efficient in its computation and simple to understand, but suffers from the disadvantage that it will often tend to produce large elongated clusters that are internally not necessarily compact. The reason is that the merging criterion does not take the overall structure of the new cluster into account.
Complete Linkage

Complete linkage clustering [35] avoids the problem of large clusters by using the compactness of the merged cluster as the merging criterion. At each stage, for each possible cluster pair the maximum distance between any pair of objects in the two clusters is determined, and the cluster pair with the lowest maximal distance is merged. Complete linkage tends to create many small clusters at intermediate levels of the cluster tree. While complete linkage clustering is not able to identify clusters of complex (e.g., elongated) structure, even when they clearly exist in the data, it has often been observed in practice to produce very useful hierarchies [16].
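As an illustration of how the linkage criterion enters the agglomerative procedure, the following is a deliberately naive sketch (NumPy assumed; function and parameter names are illustrative, not from the text):

```python
import numpy as np

def agglomerative_clustering(X, n_clusters, linkage="single"):
    """Naive hierarchical agglomerative clustering: repeatedly merge the two
    closest clusters until n_clusters remain. linkage='single' uses the minimal,
    'complete' the maximal pairwise distance between the clusters' members."""
    clusters = [[i] for i in range(len(X))]                     # start: one singleton per point
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    def cluster_distance(a, b):
        pairwise = dist[np.ix_(a, b)]
        return pairwise.min() if linkage == "single" else pairwise.max()

    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]                 # merge the closest pair
        del clusters[j]
    return clusters
```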
Average/Medoid

An intermediate form of linkage is the use of the average of the distances in the merged cluster. In some versions the average distance to the centroid is taken, which then again corresponds to a minimization of the squared error, in the sense that the two clusters are merged that lead to the lowest possible (average) deviation from the cluster representative (the centroid).

Partitional Algorithms

The disadvantage of agglomerative methods is that they start by clustering single objects into small clusters, so that global properties of the data distribution can only be taken into account relatively late in the recursion. Especially for large data sets one is often interested in obtaining few clusters that describe the global structure of the data and not so much the details. It is therefore reasonable to start from the other side, i.e., placing all data points in a single cluster and then picking the least compact cluster and dividing it into two clusters. The problem with this approach is that the division of a single cluster into two subclusters is a more difficult problem than the inverse direction. The reason for this is simple: in the agglomerative approach there are at each stage m(m − 1)/2 possibilities for the next step (where m is the number of clusters at this step). In divisive clustering, in every recursive step first the most inhomogeneous (least compact) cluster needs to be identified, and then the split that leads to the two most compact clusters needs to be found. Given a cluster with m objects there exist 2^(m−1) possibilities for dividing it into two clusters.
K-means

K-means is a simple partitional algorithm [1, 36], which leads to good results when the number of required clusters is predefined or the user has a reasonable intuition about it. The algorithm starts from an initial assignment of the data into k clusters. It then calculates the centroid of each cluster and reassigns all objects to the cluster with the nearest centroid. These two steps are reiterated until convergence, that is, until no data point switches cluster and no centroid changes its position. The algorithm is in general not guaranteed to converge to a globally optimal solution; however, in practice it is known to converge very fast.
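A minimal sketch of the two alternating steps just described might look as follows (NumPy assumed; the initialization by picking k random data points is one common choice, not prescribed by the text):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch: alternate between assigning points to the
    nearest centroid and recomputing the centroids, until nothing changes."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]     # initial centroids
    for _ in range(n_iter):
        # assignment step: index of the nearest centroid for every point
        labels = np.argmin(np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1), axis=1)
        # update step: each centroid becomes the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):                # converged
            break
        centroids = new_centroids
    return labels, centroids
```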
3 Supervised Methods

Supervised methods explicitly train the prediction of one or more target attributes of objects based on a given training set for which we know the answer. For example, if we want to train a method for predicting tomorrow's stock prices on the basis of their previous performance, it would be sensible to start with a data set for which we know the correct answer and to find the best algorithm in a dry run. In our example that could mean trying to predict today's prices (which we already know) on the basis of the previous prices, and assuming that the best method for prediction will not change on a daily basis. The advantage of this is that we can use the known error to identify the best model and further improve on it. From a practical point of view, in supervised learning the task consists in predicting a given target attribute (often called the label) of a certain object, given a number of the object's features. To allow a mathematical approach to this task, objects are often described as ordered pairs of N-tuples of feature values and corresponding target attributes. Each feature may be numerical or nominal. In the simple case of numeric features the object can be described as a vector x_i ∈ R^N. For simplicity we will generally assume numerical attributes unless otherwise noted. The target attribute can be nominal, or discrete, in which case we speak of classification and the value of the attribute is called the label or the class of the object, or it can be a quantity, in which case we speak of regression. The two tasks of classification and regression are closely related, both conceptually and from the point of view of the methodological approaches used for attacking them. Regression, for example, can be seen as a generalized classification problem with continuous class labels. Especially in cases where the target quantity is fraught with relatively high experimental error, it is not uncommon to discretize continuous quantities, assigning to each value a qualitative description such as low, middle, and high. In this case the regression can then be attacked as a classification problem. On the other hand, methods from regression have been used to address classification problems, by simply pretending that the class labels are different numbers that must be fitted by the model. In the following, a description of the general concepts and problems underlying both classification and regression is given.
The task in both cases is to construct a model which relates the feature values to the target attribute by a model function f (xi ), minimizing the total cost function C().
\sum_i C(f(x_i), y_i),    (28)
where y_i is the value of the target attribute, and x_i the corresponding feature vector of the ith object. Often enough in supervised learning one is simply given the target attribute, with everything else being more or less open to the modeler's choice: a typical task could be to predict tomorrow's stock prices. In general, no one will care how you do it as long as it really is predictive. So the choice of features and of the model functions are quite often part of the task. Typically the cost function takes the value of 0 for identical arguments (i.e., when model and target attribute agree) and is nonnegative otherwise. As the minimization of C() takes a central role in the definition of the model building procedure, its choice is of utmost relevance for the properties of the learning algorithm. On the one hand it can be used to define what the user considers to be a good model, and on the other hand its mathematical properties are important for the design and the properties of the learning algorithm. One possibility for the cost function is to simply count the number of errors made by the model:

C(f(x_i), y_i) = δ(f(x_i), y_i),
(29)
where the δ() function takes a value of 0 for identical arguments, and is 1 otherwise. This cost function is intuitive and generally applicable, both for classification and regression type problems; however, it does have a significant drawback, which is best seen in the case of regression. A continuous valued real model function f() will in general have a zero probability of recovering the exact target attribute. Therefore the value of the cost function will almost always be 1, indicating an error. This problem can be alleviated by the use of a so-called ε-insensitive variant of the cost function, i.e., one that returns 0 if the two arguments are within a tolerance radius ε of each other, and 1 otherwise. However, there is still a significant problem with this type of error counting cost function: the cost is a discontinuous function of the arguments, which means that gradient based approaches cannot be used to improve an initial model. In other words, since there is no differentiation between small and large errors, it is not possible, for a given model, to use the cost function itself to identify local improvements of the model. In contrast to an error counting cost function, other cost functions are significantly easier to minimize, and can in some cases lead to an optimal parametrization of the model. Possibly the most widely used cost function (especially in regression) is the squared deviation between model and target attribute:

C(f(x_i), y_i) = (f(x_i) − y_i)^2.
(30)
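The three cost functions discussed here are easy to state in code; the following sketch uses illustrative names (the tolerance radius eps corresponds to ε above):

```python
def zero_one_cost(f_x, y):
    """Error counting cost, (29): 1 whenever model and target disagree."""
    return float(f_x != y)

def eps_insensitive_cost(f_x, y, eps=0.1):
    """Epsilon-insensitive variant: deviations below eps are not counted as errors."""
    return float(abs(f_x - y) > eps)

def squared_cost(f_x, y):
    """Squared deviation, (30): differentiable, so gradients can guide the fit."""
    return (f_x - y) ** 2
```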
The relevant mathematical properties of a cost function are its continuity over its arguments and its convexity, both of which are important for the optimization of models. At the same time the cost function must implement what we
consider to be a reasonable penalty for poor predictions, depending on where and when they happen.
The next important choice in building a system for supervised learning is the hypothesis space, from which a particular model function can be chosen. A typical example is linear systems, where the model function is given as a weighted sum of the feature values:

f(x_i) = \sum_j w_j x_{ij} = w · x_i,    (31)
where the w_j are the weights for each feature of the object. The above equation defines the model space, while a particular choice of weight vector defines the particular model. The advantage of linear models is that they usually have better convergence properties than general nonlinear models, and are therefore simple to optimize, and that they allow an intuitive interpretation of the results: the absolute value of a weight can be seen as direct information on the relevance of the corresponding feature for the model. The sign of the weight shows whether the feature correlates or anti-correlates with the target attribute. The disadvantage of linear models is that they often do not appear to lead to as good results as nonlinear models. Clearly, not every relation that we want to model is truly linear. However, often the apparently better approximation of a given data set by a nonlinear model is due to the tendency of the latter to overfit the data. This point will be discussed below. Nonlinear models can be given by any nonlinear function, which in general is continuous in the features, though it does not need to be. A straightforward approach to obtain a nonlinear model is to submit the features to a nonlinear transformation and apply a linear model in the new feature space. Classification can also be seen as a particular case of a nonlinear model. The typical perceptron practically uses only the sign of the value of (31), leading to
f(x_i) = sgn\left( \sum_j w_j x_{ij} \right).    (32)
This functional form only takes on two values typically assumed to be −1 for negative and 1 for nonnegative argument. It is thus suited for two class (binary) classification by assigning each value to one of the classes. The disadvantage of the above functional form is that it is discontinuous at the zero point, which can be a drawback for local optimization methods that use gradients. In neural networks it is used in a continuous variant such as
f(x_i) = σ\left( \sum_j w_j x_{ij} \right),    (33)
where σ() is some sigmoidal function. This form is suited for gradient based optimization of the weights. A discussion of both linear and nonlinear models for classification and regression is given in Sect. 3.1.
3.1 Classification

Linear classification and regression methods practically reduce the problem of estimating an overall probability distribution in a high dimensional feature space down to one or two relevant dimensions. These methods are discussed in the corresponding sections. Nevertheless, the ideas from statistical dependence play an important role in a number of highly successful methods such as independent component analysis and decision trees, and will therefore be described in some more detail, to show how statistical dependence is generally quantified.
3.1.1 Linear Methods

The task of classification finds its simplest instance in the linear binary classification problem. In it, we are given K points lying in an N-dimensional feature space belonging to two different target classes, and we are asked to find a plane that separates the two classes of points, i.e., all points belonging to one class lie on one side of the plane, while the second class lies on the other side of the plane. If no such plane exists then the data are called inconsistent or inseparable. In that case the task is to find a plane that separates the classes as well as possible. Given a particular plane, its orientation is given by the normal vector that passes through the origin of the coordinate system. The position of the plane is given by the bias b (see Fig. 4). For a given point the prediction is obtained by testing the following criterion:

f(x_i) = \sum_j w_j x_{ij} ≷ b,    (34)
Fig. 4 Two-dimensional linear separation problem. The two classes of objects are represented by empty and filled circles, respectively. An example of an object vector is shown by the dashed arrow, the separating plane is the bold line in the middle, and the bold arrow is the weight vector. The red arrow indicates the margin of a given point; the blue arrow is the minimal margin of all points and therefore the margin of the separation
where x_i is the vector corresponding to the ith point and w is the normal to the separating plane. The sum is a scalar product between the weight vector and the point, and if the weight vector w is normalized it corresponds to the length of the projection of x_i on the weight vector (see Fig. 4). Typically the two classes of points are labeled with y_i = −1 for one class (the negative class) and y_i = +1 for the other. This allows a simplification of the above equation:

f(x_i) = y_i (w · x_i − b) > 0,    (35)

where we have now chosen the simpler scalar product notation. If a given margin γ exists:

f(x_i) = y_i (w · x_i − b) > γ.    (36)

We note here that the weight vector does not need to be normalized. Scaling of the vector has the effect that γ and b need to be scaled by the same constant. This often leads to the following formulation:

f(x_i) = y_i (w · x_i − b) > 1,
(37)
since assuming a margin exists we can always scale the weight vector and the bias to achieve the above equation, and have thus hidden one variable from view. As can be seen in these last equations, classification is a question of solving a system of linear inequalities. In crystallography, very similar approaches have been used to derive potentials for the scoring of crystal structure candidates [4, 22, 37]. To a certain extent the approach has been to compare correct crystal structures to decoys for different molecules. Effectively all correct structures correspond to the positive class while all decoys correspond to the negative class. Finding a weight vector that distinguishes between these two classes optimally is then equivalent to finding a parametrization for the scoring function that scores correct crystal structures better than their decoys.

LDA

Fisher's linear discriminant [11] and the closely related linear discriminant analysis (LDA) are the oldest methods for classification and are still used in many applications. While LDA works for classification with many classes, we will discuss it here in an application to two classes. The basic idea is to find the vector on which the projections of the data points have maximal variance between the classes relative to the variance within the classes. This ratio can be understood as a signal-to-noise ratio for the separation of the classes. Formally we write:

w_{opt} = \max_w J(w)    (38)

with

J(w) = \frac{w^T S_B w}{w^T S_W w}    (39)
and S_B and S_W the corresponding scatter matrices, the between-class scatter (S_B) and the within-class scatter (S_W):

S_B = \sum_c N_c (μ_c − \bar{x})(μ_c − \bar{x})^T,    (40)

S_W = \sum_c \sum_{i \in c} (x_i − μ_c)(x_i − μ_c)^T    (41)

with μ_c and \bar{x} the center (mean) of class c and of all data, respectively. Without going into details we will mention that the problem can be solved as a generalized eigenvalue problem. The advantage and disadvantage of LDA over other methods is its reliance on the overall distribution of the data points. It practically approximates the densities of the classes as normal distributions and then finds the most appropriate direction for telling them apart. Assuming that the densities can be modeled correctly by a normal distribution, and that the data are sufficient for a correct estimation, LDA is the method of choice. However, this is often not the case, and therefore discrete methods such as perceptrons or SVMs are often used. Nevertheless, due to its simplicity and clear mathematical background, it is still commonly used in many applications.
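A sketch of the two-class computation described by (38)-(41) could look as follows (NumPy and SciPy are assumed for the generalized eigenvalue problem; in practice S_W may need a small regularization term if it is singular):

```python
import numpy as np
from scipy.linalg import eigh   # generalized symmetric eigenproblem

def lda_direction(X, y):
    """Fisher discriminant direction: maximize w^T S_B w / w^T S_W w by solving
    the generalized eigenproblem S_B w = lambda S_W w and taking the leading vector."""
    overall_mean = X.mean(axis=0)
    S_B = np.zeros((X.shape[1], X.shape[1]))
    S_W = np.zeros_like(S_B)
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        diff = (mu_c - overall_mean)[:, None]
        S_B += len(Xc) * diff @ diff.T               # between-class scatter, (40)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)           # within-class scatter, (41)
    # S_W must be positive definite here; add e.g. 1e-6 * identity if it is not
    eigval, eigvec = eigh(S_B, S_W)
    return eigvec[:, np.argmax(eigval)]
```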
Perceptron

Novikoff [27] suggested the perceptron algorithm for identifying a separating plane and showed that for separable data the algorithm is guaranteed to find a correct solution in less than a given number of steps. The algorithm is a simple stochastic gradient algorithm that works online, i.e., it can work while the data are being read in. One of the interesting properties of the perceptron is that it only corrects the current solution for the separating plane when it encounters a misclassified point. Therefore, the result only depends on points that were misclassified at some point in the course of learning. This property is important as it sets the relation of the perceptron to the SVMs discussed in the next section. The overall algorithm is very simple. It starts by initializing the values for the weight vector and the bias, and proceeds to iterate over all points until it finds a misclassified point. When it does, it corrects the weight vector w and the bias b in a way that takes the position and the label of the offending point into account. The strength of the correction can be controlled by the learning rate ξ; however, it can be shown that the convergence properties of the algorithm do not depend significantly on the learning rate. While the perceptron algorithm is very elegant, not least because of its provable properties, it is probably mainly known because of its relevance for neural networks, where it is the basic unit within each of the neurons.

SVM

One problem of the perceptron is that the resulting plane depends on the sequence in which the points are read in, making it a somewhat arbitrary choice. SVMs solve this
Algorithm 1 The perceptron algorithm
In: data x_1, ..., x_m ∈ R^n, labels y_1, ..., y_m ∈ {−1, 1}, learning rate ξ
Out: number of errors k, weight vector w_k and bias b_k

function perceptron(X, Y, ξ)
    initialize w_0 ← 0, b_0 ← 0, k ← 0, R ← max_i ||x_i||
    repeat
        for i = 1 to m
            if y_i (w_k · x_i + b_k) ≤ 0 then        (point i is misclassified)
                w_{k+1} ← w_k + ξ y_i x_i
                b_{k+1} ← b_k + ξ y_i R^2
                k ← k + 1
            end if
        end for
    until no more mistakes in the for loop
    return k, w_k, b_k
problem by searching for the separating plane with maximal margin (see e.g., [6]). The margin of the separating plane is the minimal distance of any point to it. The intuitive idea is that the maximal margin plane will be more robust with respect to noise, and will therefore generalize better to new data. Furthermore, it is well defined, so the result is independent of the sequence with which the data are read in. The margin-determining points, i.e., those with a minimal distance to the separating plane, define the plane's position and direction [24]. Similar to the perceptron, if all other points are left out of the optimization the result will be identical. These points are called support vectors. Technically the optimization is performed on the basis of the last representation of the classification function: f(x_i) = y_i (w · x_i − b) ≥ 1.
(42)
The norm of the weight vector w is minimized subject to the fulfillment of the above constraints (one for each data point). This leads to a quadratic optimization problem that can be efficiently solved with a number of methods [34]. Already for perceptrons it is possible to show that the weight vector can be written as a linear combination of the data points:

w = \sum_k a_k x_k,    (43)
where the a_k are the weights of the data points. In SVMs the a_k are nonzero only for the support vectors; in other words the coefficient vector a is sparse. The support vectors are the points for which the equality condition is fulfilled:

f(x_i) = y_i (w · x_i − b) = 1.
(44)
Substituting (43) into the classification equation we obtain the following:

f(x_i) = y_i \left( \left( \sum_k a_k x_k \right) · x_i − b \right) > 1,    (45)
which becomes

f(x_i) = y_i \left( \sum_k a_k (x_k · x_i) − b \right) > 1    (46)

or

f(x_i) = y_i \left( \sum_k a_k K_{ki} − b \right) > 1,    (47)
where K is the positive semi-definite matrix of scalar products of data points. The interesting thing in this representation is that it becomes clear that once the matrix K has been constructed the dimensionality of the original space is irrelevant. The complexity of the problem depends only on the number of data, not on the number of dimensions. SVMs can lead to a well-defined solution also in the case of inseparable data. The trick there is to allow a so-called slack variable ξ in the classification:
f(x_i) = y_i \left( \sum_k a_k K_{ki} − b \right) > 1 − ξ.    (48)
During optimization the ξ is minimized along with the length of the weight vector w. The relevance of the additional term in the minimization is controlled by a constant C which needs to be set by the user.
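In practice one rarely implements the quadratic optimization by hand. The following sketch assumes the scikit-learn library is available (not something the text prescribes); its SVC class exposes the constant C and the kernel choice discussed here:

```python
import numpy as np
from sklearn.svm import SVC

# toy two-class problem
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, size=(50, 2)), rng.normal(loc=+1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# C controls the weight of the slack term; kernel='linear' gives a plane in input space
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(len(clf.support_))          # number of support vectors (the points with nonzero a_k)
print(clf.predict([[0.2, -0.3]]))
```

Switching to kernel="rbf" corresponds to the Gaussian kernel discussed in the nonlinear methods below; the rest of the interface stays the same.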
3.1.2 Nonlinear Methods

Neural Networks

Neural networks originated in [23] and were initially intended as a model of the brain. One of the simplest and most common types of neural network is the feed forward neural network with a single hidden layer. It is based on the connection of a number of single perceptrons in the way shown in Fig. 5. Typically, an input layer of neurons, one for each feature of the data, is followed by a number of layers of hidden neurons, whose outputs are connected to the next layer until we reach the output layer. Each neuron in the network receives input from all neurons in the previous layer (except for the input layer) and feeds its output to all neurons in the following layer. For simplicity we will discuss here only simple feed forward neural networks with a single hidden layer, as it has been shown that given enough neurons in the hidden layer this architecture can approximate any function [8, 15, 21]. Each neuron in the hidden or output layers is a simple perceptron, with the difference that its output is passed through a sigmoid function to obtain continuous gradients of the output over the weights. Each connection between two neurons is assigned a weight, and the training of the neural network consists of an optimization of the weights so that the output values in the output layer are slowly fitted
Fig. 5 A simple feed forward three layer artificial neural network. In the input layer the two features are fed into the network. Each node in the network is connected to the next layer by an edge which feeds its output weighted by w_ij into the next node. The nodes in the hidden and output layers are typically perceptrons, which are trained by backward error propagation
to the target values for the given task. For the fit the best known algorithm is back propagation of the error [30, 38]. This basically corresponds to using an error functional for the deviation between the target output and the current output, and calculating the gradient of the error functional with respect to each of the weights in the neural net. The back propagation of the error is therefore, in essence, a simple gradient optimization. Neural networks enjoyed enormous popularity at the beginning of the 1990s, mainly because of their high flexibility and ability to fit any type of function. However, their popularity has waned to a large extent, for a number of reasons. The choice of architecture (number of hidden layers and the neurons therein) is not always trivial, and, being complex nonlinear systems, they pose a multiple minima problem that needs to be solved in training. Both problems are addressed in a significantly simpler way by SVMs.
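The back propagation step described above amounts to applying the chain rule layer by layer. The following self-contained sketch (NumPy assumed; network size, learning rate, and the XOR toy data are illustrative choices, not from the text) shows a single-hidden-layer network trained with the squared error:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, y, n_hidden=5, lr=1.0, n_epochs=10000, seed=0):
    """One-hidden-layer feed-forward network trained by back propagation:
    the gradient of the squared error is pushed back through the output
    and hidden layers to update all weights and biases."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], n_hidden))   # input -> hidden weights
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.5, size=(n_hidden, 1))            # hidden -> output weights
    b2 = np.zeros(1)
    y = y.reshape(-1, 1)
    for _ in range(n_epochs):
        h = sigmoid(X @ W1 + b1)                              # hidden layer activations
        out = sigmoid(h @ W2 + b2)                            # network output
        delta_out = (out - y) * out * (1 - out)               # error signal at the output
        delta_hidden = (delta_out @ W2.T) * h * (1 - h)       # propagated back to hidden layer
        W2 -= lr * h.T @ delta_out / len(X)
        b2 -= lr * delta_out.mean(axis=0)
        W1 -= lr * X.T @ delta_hidden / len(X)
        b1 -= lr * delta_hidden.mean(axis=0)
    return W1, b1, W2, b2

# XOR, a classic example that no linear model can fit; with enough epochs the
# outputs approach [0, 1, 1, 0] (convergence depends on the random initialization)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)
W1, b1, W2, b2 = train_mlp(X, y)
print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).round(2))
```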
SVM

One of the most obvious ways to obtain a nonlinear method is to perform a nonlinear transformation on the original data space and then use a linear method in the transformed space (often called feature space). The problem with this approach is the choice of the transformation, and its representation in the original space. The
so-called kernel trick has been used to obtain transformations that are exceedingly simple to work with. It was originally shown in the SVM literature (e.g., [24, 33]) and we will describe it here in short. Recall that in (48) the matrix K of scalar products of the data points is positive semi-definite. Conversely, if we are able to devise a transformation of that matrix which leads to a new matrix K′ that is also positive semi-definite, we can work with it as though there existed a vector space to which each original data point had been transformed. A number of such kernel functions are known, most notably polynomial transformations of the original matrix, e.g.,

K′_{ij} = K_{ij}^n    (49)

with n a positive integer, or the Gaussian kernel

K′_{ij} = exp(−K_{ij}^2 / σ^2).    (50)
These transformations define a feature space of possibly infinite dimension (this is the case for the Gaussian kernel), which, however, we never have to deal with, as we only work with the kernel matrix K′. The other properties of SVM, such as the support vector representation of the separating plane (it is now a plane only in the feature space; in the input space it is a complex surface), are not affected by the kernel trick. For specific applications specific kernels have been written, some with more success than others, and they can be at times very useful. Our own experience with SVMs is that the most relevant kernels are the linear (i.e., linear SVMs) and the Gaussian kernel.
Decision Trees

Decision trees are a very simple and powerful approach for performing classification and regression. Decision trees are recursive structures that describe a decision making approach. As an example, imagine we need to find all high quality structures of small molecules in a given crystallography database. In Fig. 6 the corresponding decision process as it could be implemented is shown. Each node represents an action on the data. For internal nodes (i.e., nodes that have children) of the tree the action is a test on the data. Depending on the test's outcome the data are redirected to the corresponding child node. The process is repeated until the data reach a leaf (a childless node). In the example, data enter the highest node (the root of the tree) and are then directed to one of the node's children, depending on the outcome of the action. If the test is negative the data are redirected to the child node that discards the data. Nodes such as this one, which have no children, are called leaves and are generally not associated with a test. If the test were positive the corresponding data would be directed to the next node, which tests for the size of the data. To understand the way that decision trees are used in learning, imagine you are given a structure selection which had been made at a previous point in time, and are asked to update it. The rules according to which this selection was made are not
Fig. 6 Decision tree for selecting high resolution small molecule structures from a database
available anymore, and annotations have changed in the data, so that there may not exist a well-defined (small) rule set that identifies the selection as the correct subset. The task is to reproduce the structure of the decision tree from the knowledge of the classification of previous data examples. There exist two sets, one consisting of those that were included, and one of those that were discarded. Obviously, one could create a rather complex tree that consists of many sequential nodes, where the identity of each data point is tested against the different identities of the known positives. Such a procedure would have a perfect fit, because it would perfectly recover the original selection; however, it would be completely useless for the classification of new structures. This is once more an overfitting example, where the training data are simply memorized. To obtain a decision tree with more predictive value we need to concentrate on common properties of as many as possible of the positive examples. This is indeed the approach that is taken for automatically generating decision trees. At each stage the algorithm looks for the criterion that appears to be most informative with respect to the classification of the current data. In our example, it would test a number of different properties such as molecular size, crystal symmetry, crystal dimensions, and resolution, with respect to the statistical dependence between these variables and the class variable (i.e., the question of whether the corresponding data have been included or excluded from the set). The test for statistical dependence is based on the information gain described above. If the data are reasonably clean, and there are no strong correlations between the two relevant descriptors and other tested descriptors, the algorithm should be able to identify these as the most informative. The most informative of the two (we assume it is the size) will be used for the root node of the constructed tree, and the most informative threshold will be used to partition the data for the redirection to the children. That means that those data that are larger than the threshold will be sent to the first child and the rest to the second. If there are no errors in the annotation, then in the first child we would have a clean class: all data
that are larger than the threshold would automatically belong to the discarded class. In practice such "clean" nodes are identified as leaves, and the corresponding action in the node is set to predict the most represented class; in this case the action would be to discard. The data arriving at the positive node (i.e., those that lie below the size threshold) would not yet form a clean class, and a further step of search for the most informative descriptor would be necessary.
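The split selection described here can be made concrete with a small information-gain computation (a sketch with illustrative function names; the entropy used is the Shannon entropy referred to in the text):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a class label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(values, labels, threshold):
    """Information gain of splitting on 'values <= threshold':
    entropy before the split minus the weighted entropy of the two children."""
    left = labels[values <= threshold]
    right = labels[values > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    n = len(labels)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - children

def best_split(values, labels):
    """Scan candidate thresholds of one descriptor and return the most informative one."""
    candidates = np.unique(values)
    gains = [information_gain(values, labels, t) for t in candidates]
    return candidates[int(np.argmax(gains))], max(gains)
```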
3.2 Regression

Regression attempts to find a function f(x) connecting the properties (features) of a given object to one or more quantitative target properties. The simple difference between predicting quantitative vs. qualitative properties (class labels) makes it, from a technical point of view, significantly simpler to perform a regression than a classification, since continuous optimization methods can be used. On the other hand, in the application realm it is often more difficult to obtain quantitative data, as these require more precise measurements. Often both regression and classification are used on the same data, to either increase the robustness of a prediction (classification) or to increase its resolution (regression).
3.2.1 Least Squares

Least squares optimization is the best known approach for multilinear regression. Given the data matrix X, with x_{ij} the j-th feature of the i-th object, and the target values y_i for each object, we are looking for the weight vector w with w_j the weight of the j-th feature for the prediction. The prediction in linear regression is given by the following term:

y′_i = \sum_j w_j x_{ij}.    (51)

This can be written in matrix form:

y′ = X · w    (52)

and we are looking for the w that in some way minimizes the average error of the prediction e = y − y′. In least squares the square of the norm of the error, |e|^2, is minimized. The least squares problem is a quadratic problem with a continuous solution space. It is easy to show that the solution is given by the pseudoinverse of the matrix X. As the matrix X is in general not square, one approach is to multiply both sides of the previous equation with X^T from the left, leading to:

X^T y = X^T X · w,    (53)
X^T X is a symmetric matrix and can theoretically be inverted; however, there is still one problem: it generally is not of full rank. Therefore, some eigenvalues are equal to zero. The way to solve this problem is to write

X^T X = O^T D O,    (54)

with O a unitary matrix of eigenvectors and D a diagonal matrix with elements d_{ii} ≥ 0. We form the pseudoinverse D^{inv} with

d_{ii}^{inv} = 0 if d_{ii} = 0, and d_{ii}^{inv} = d_{ii}^{−1} otherwise.    (55)

We then write

(X^T X)^{inv} = O^T D^{inv} O.    (56)

This leads to

w = (X^T X)^{inv} X^T y.    (57)
Practically the same result can be obtained via singular value decomposition. In PCA-based regression, the condition d_{ii} = 0 in (55) is substituted by d_{ii} ≤ threshold. The threshold is set such that only the N most important (highest) eigenvalues are used for the formation of the pseudoinverse. This has the effect of removing noisy components from the data and making the model more robust. Further improvements can be achieved with PLS [39] or linear SVMs for regression. The latter have the interesting property of optimizing a measure of error that is insensitive to small deviations. This is called ε-insensitive loss. The use of ε-insensitive loss leads to a sparse representation of the result, i.e., the resulting weight vector can be given as a linear combination of relatively few support vectors. Nevertheless, SVD and PCA/PLS based approaches practically dominate the field of linear regression.
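Equations (54)-(57) translate almost literally into code; the following sketch (NumPy assumed) also shows how raising the eigenvalue threshold turns plain least squares into the PCA-based regression mentioned above:

```python
import numpy as np

def pseudoinverse_regression(X, y, threshold=1e-10):
    """Least squares weights via the eigendecomposition of X^T X, (54)-(57).
    Eigenvalues at or below 'threshold' are discarded, as in PCA-based regression."""
    eigval, eigvec = np.linalg.eigh(X.T @ X)          # columns of eigvec are eigenvectors
    d_inv = np.array([1.0 / d if d > threshold else 0.0 for d in eigval])
    XtX_inv = eigvec @ np.diag(d_inv) @ eigvec.T      # (X^T X)^inv
    return XtX_inv @ X.T @ y                          # w = (X^T X)^inv X^T y

# ordinary least squares corresponds to keeping all nonzero eigenvalues;
# a larger threshold keeps only the dominant components and makes the fit more robust
```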
3.2.2 Nonlinear Regression

The simplest possibility to obtain nonlinear regression is to generate new features that are nonlinear functions of the original features, and perform a linear regression on the new, expanded feature space. However, this approach soon leads to large and complex spaces that are not particularly intuitive. Often some type of feature selection is then added to reduce the number of fitted parameters. More typical approaches for nonlinear regression are SVMs and neural networks, which are used to a significant extent. The main problem with nonlinear regression is the choice of the functional form and its flexibility. Choosing too flexible a method will allow fitting anything, similar to the situation where too many variables are used to perform a linear fit. This problem is briefly discussed below.
Fig. 7 Different models for fitting the same data. Top left a linear model with only one parameter, the slope. Top right a linear model with two parameters. Bottom left a simple nonlinear model. Bottom right a flexible nonlinear model
3.3 Bias vs. Variance (Fit vs. Generalizability)

One of the most important properties of hypothesis spaces is their size, in other words the flexibility of the corresponding models. To obtain a reasonable fit it is clearly necessary to have some variables that affect the exact properties of the model. For example, in Fig. 7, if the model is given by a line and we are allowed to fit only its slope but not the position where it cuts the x-axis, then we obviously will not obtain a particularly good model. This model is also not expected to be a good choice for making predictions on new data points, as it already shows that for a large number of training data it leads to poor predictions. Allowing some more parameters, such as the intercept or parameters that describe the curvature of the model, makes it more flexible and improves the fit (e.g., plots top right and bottom left, respectively). If we introduce enough flexibility we can achieve a perfect fit (bottom right). However, the model in this case will most probably not be predictive. It will simply have memorized the single data points, without really learning anything that is of use for a new data point. This is called overfitting or overtraining. This shows that the two limiting
cases of having too rigid and too flexible models can both lead to poor prediction. Therefore, it is necessary to find a way to balance fitting capacity vs. predictive accuracy. The first step in this direction is to find a way to assess the predictivity of the methods. This is usually done empirically through a training set/test set distinction, leave-one-out, or n-fold cross validation. In all of these methods a part of the data is set apart for testing (test set). The fitting is performed on the remaining data, and the prediction is tested on the test set data. In n-fold cross validation the data are randomly separated into n groups. Each group is used once as a test set, with training being performed on the remaining data. Leave-one-out is in principle n-fold cross validation, where n is equal to the number of data in the original data set. While leave-one-out prediction has the advantage of being very systematic, it generally gives a too optimistic representation of the true predictivity of the model. Finally, as these methods are generally used to set the parameters of the method, it is important to still have an independent test set which has never been used in the complete process. The problem of overfitting is a rather difficult problem as it generally needs to be solved empirically. It is also a problem that is being rediscovered time and again, as it appears in different forms across the disciplines, whenever simple models need to be fitted to experimental data. One of the problems with overtraining is that the flexibility of the models is sometimes dictated by the particular application. Assuming we would like to predict the stability of a complex in a crystal structure, we could do so using as features all interatomic distances observed in the structure. For each type of distance we would need to have some parameter that indicates whether that particular contact is stabilizing or not and, if so, to what extent. Automatically, this gives a large number of parameters, which leads to problems when the number of structures used for learning is limited. A parametrization can be found that seemingly perfectly explains the training data but is very poor when applied to new "unseen" data. One approach to ameliorate this problem is feature selection. The idea here is to remove features, and the corresponding weights, which are deemed to be of lower relevance for the particular question. Decision trees are in fact a greedy method for feature selection, and other methods have been adapted to perform feature selection as well.
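The n-fold scheme described here is simple to implement; a sketch follows (NumPy assumed; the callable train_and_predict stands for any user-supplied model and is purely illustrative). Setting n_folds equal to the number of data points gives leave-one-out validation:

```python
import numpy as np

def n_fold_cv(X, y, train_and_predict, n_folds=5, seed=0):
    """n-fold cross validation: each fold serves once as the test set while the
    model is fitted on the remaining data; returns the prediction error per fold.
    train_and_predict(X_train, y_train, X_test) must return predictions for X_test."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))                  # random separation into n groups
    folds = np.array_split(indices, n_folds)
    errors = []
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        y_pred = train_and_predict(X[train_idx], y[train_idx], X[test_idx])
        errors.append(np.mean((y_pred - y[test_idx]) ** 2))   # error on the held-out fold
    return errors
```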
4 Conclusion

This chapter has provided a short overview of the mathematical and algorithmic concepts and ideas behind the most relevant methods for data mining in crystallography. While not all of these have actually been used to a large extent, the purpose has been to provide a short introduction to the possibilities of the methods and the concepts involved. It is hoped that this will make data mining and machine learning more accessible to the general practitioner in crystallography and allow new applications in the field and the discovery of nontrivial and scientifically relevant knowledge.
References

1. Lloyd SP (1982) Least squares quantization in PCM. IEEE Trans Inform Theory 28:129–137
2. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th International Conference on Very Large Data Bases, pp 487–499
3. Allen F (2002) The Cambridge Structural Database: a quarter of a million crystal structures and rising. Acta Crystallogr Sect B 58:380–388
4. Apostolakis J, Hofmann DW, Lengauer T (2001) Derivation of a scoring function for crystal structure prediction. Acta Crystallogr Sect A 57(Pt 4):442–450
5. Brock C (1996) Investigations of the systematics of crystal packing using the Cambridge Structural Database. J Res Natl Inst Stand Technol 101(3):321–325
6. Burges C (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 2:121–167
7. Campbell C (2002) Kernel methods: a survey of current techniques. Neurocomputing 48:63–84
8. Cybenko G (1989) Approximation by superpositions of a sigmoidal function. Math Control Signals Syst 2(4):303–314
9. Fayyad UM, Piatetsky-Shapiro G, Smyth P (1996) From data mining to knowledge discovery: an overview. Adv Knowl Discovery Data Mining
10. Fechner U, Schneider G (2004) Evaluation of distance metrics for ligand-based similarity searching. Chembiochem 5(4):538–540
11. Fisher R (1936) The use of multiple measurements in taxonomic problems. Ann Eugenics 7:179–188
12. Fundel K, Güttler D, Zimmer R, Apostolakis J (2005) A simple approach for protein name identification: prospects and limits. BMC Bioinformatics 6(Suppl 1):S15
13. Fundel K, Küffner R, Zimmer R (2007) RelEx – relation extraction using dependency parse trees. Bioinformatics 23(3):365–371
14. Hendlich M, Lackner P, Weitckus S, Floeckner H, Froschauer R, Gottsbacher K, Casari G, Sippl MJ (1990) Identification of native protein folds amongst a large number of incorrect models. The calculation of low energy conformations from potentials of mean force. J Mol Biol 216(1):167–180
15. Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal approximators. Neural Networks 2(5):359–366
16. Jain A, Dubes R (1988) Algorithms for clustering data. Prentice-Hall, Englewood Cliffs, NJ
17. Jain A, Murty M, Flynn P (1999) Data clustering: a review. ACM Comput Surv 31(3):264–322
18. Kabsch W (1976) A solution for the best rotation to relate two sets of vectors. Acta Crystallogr A 32(5):922–923
19. Kabsch W (1978) A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallogr A 34(5):827–828
20. King B (1967) Step-wise clustering procedures. J Am Stat Assoc 69:86–101
21. Leshno M, Lin V, Pinkus A, Schocken S (1993) Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks 6(6):861–867
22. Maiorov VN, Crippen GM (1992) Contact potential that recognizes the correct folding of globular proteins. J Mol Biol 227(3):876–888
23. McCulloch W, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys 5:115–133
24. Müller K, Mika S, Rätsch G, Tsuda K, Schölkopf B (2001) An introduction to kernel-based learning algorithms. IEEE Trans Neural Netw 12:181–201
25. Mullins IM, Siadaty MS, Lyman J, Scully K, Garrett CT, Miller WG, Muller R, Robson B, Apte C, Weiss S, Rigoutsos I, Platt D, Cohen S, Knaus WA (2006) Data mining and clinical data repositories: insights from a 667,000 patient data set. Comput Biol Med 36(12):1351–1377
26. Murray-Rust P (2008) Ser Rev 34(1):52–64
27. Novikoff A (1962) On convergence proofs of perceptrons. In: Symposium on the Mathematical Theory of Automata, vol 12. Polytechnic Institute of Brooklyn, pp 615–622
28. Hyvärinen A, Oja E (1997) A fast fixed-point algorithm for independent component analysis. Neural Comput 9:1483–1492
29. Quinlan JR (1996) Improved use of continuous attributes in C4.5. J Artif Intell Res 4:77–90
30. Rumelhart D, Hinton G, Williams R (1986) Learning internal representations by error propagation, vol 1. MIT Press, pp 318–362
31. Shannon C (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423
32. Sippl MJ (1990) Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J Mol Biol 213(4):859–883
33. Smola A, Schölkopf B (1998) On a kernel-based method for pattern recognition, regression, approximation, and operator inversion. Algorithmica 22:211–231
34. Smola A, Schölkopf B (2004) A tutorial on support vector regression. Stat Comput 14:199–222
35. Sneath P, Sokal R (1973) Numerical taxonomy. Freeman, San Francisco
36. Steinhaus H (1956) Sur la division des corps matériels en parties. Bull Acad Polon Sci IV:801–804
37. Thomas PD, Dill KA (1996) An iterative method for extracting energy-like quantities from protein structures. Proc Natl Acad Sci USA 93(21):11628–11633
38. Werbos P (1974) Beyond regression – new tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University
39. Wold H (1966) Estimation of principal components and related models by iterative least squares. In: Multivariate analysis. Academic Press, New York
Struct Bond (2010) 134:37–58 DOI: 10.1007/430_2009_5 © Springer-Verlag Berlin Heidelberg 2009 Published online: 1 September 2009
Data Bases, the Base for Data Mining

Christian Buchsbaum, Sabine Höhler-Schlimm, and Silke Rehme
Abstract Data collections provide a basis for solving numerous problems by a data mining approach. The advantage of data mining consists in retrieving new knowledge from existing information. The comprehensiveness of the data collection, the structure and quality of the data, and the selection of relevant data sets are extremely important to get correct results. In the crystallographic field, scientists will find several databases dealing with crystal structures of inorganic and organic compounds, or proteins. Usually databases have detailed data evaluation mechanisms integrated in their database production process and offer comprehensive and reliable data sets. The CIF standard enables scientists to exchange the data. As an example, the Inorganic Crystal Structure Database (ICSD), a source of information for crystallographers, mineralogists, physicists, and chemists, will be presented here. The ICSD contains about 120,000 entries (March 2009) of fully determined crystal structures. This chapter gives a detailed description of the data collection, the contents of the data fields, the data evaluation, and finally the search functionality of the ICSD database.

Keywords: Crystallographic databases · ICSD · Data collection · Data evaluation · Database functionality · Database design · Search strategies

Contents
1 Introduction ........ 38
2 Contents ........ 40
  2.1 General ........ 40
  2.2 Description of Data Fields ........ 40
  2.3 Structure Types ........ 44
C. Buchsbaum, S. Höhler-Schlimm, and S. Rehme () Fachinformationszentrum Karlsruhe, Hermann-von-Helmholtz-Platz 1, 76344 Eggenstein-Leopoldshafen e-mail:
[email protected]
3 Acquisition of Information ........ 45
4 Revision and Evaluation ........ 47
  4.1 Formal Checks ........ 47
  4.2 Verification of Contents ........ 47
5 Database Design ........ 49
  5.1 Access to ICSD Data ........ 49
6 Retrieval Examples ........ 50
  6.1 General Strategies ........ 50
  6.2 The Black Tar Mystery ........ 52
  6.3 Example: Searching for Ice Ic ........ 53
7 Outlook ........ 54
References ........ 57
1 Introduction

With the abundance of all kinds of extensive data collections and the emergence of new technical possibilities, data mining has become an important issue in recent years. In the field of crystallography, databases like ICSD, CSD, ICDD, and others offer unheard-of possibilities, e.g., in the fields of molecular design and crystal engineering. For this, the available data are evaluated using different criteria. In all fields of science, data mining would be impossible without extensive databases. Since data mining is a powerful approach to analyze data and predict properties, the structure of the database and the standardization of its data are of prime importance. Any analysis of selected data fields will be useless if the information in these data fields is not available in standardized form. Looking at the content of crystallographic databases, standardization starts with the unification of units and ends with the calculation of standardized crystal data or the classification of structures into structure types. A standard data exchange format, the Crystallographic Information File (CIF), was defined in the early nineties in order to enable the exchange of data between the most varied applications. Consistent application of this standard has made it possible to evaluate data deposited at FIZ, CCDC, or publishing houses without difficulty. Completeness of the data material is the second important aspect. The more extensive the data material, the more satisfactory the results of the statistical methods will be in the end. Thirdly, the data quality is a decisive factor. Prior to the data mining process, the data should be checked and evaluated carefully. Last but not least, the choice of data sets is an important concern. The higher the amount of relevant content in the selected clusters, the more accurate the results of the data mining process will be. Databases like ICSD, CSD, or PCD offer manifold search fields and combination possibilities, which leads to precise search results. Some aspects of retrieval are dealt with in this chapter in more detail. The chapter will focus on the above mentioned aspects. As a producer of the ICSD database, FIZ Karlsruhe has decades of experience in compiling data, processing data, and making them available to customers. This chapter describes the FIZ Karlsruhe procedures for data collection and data processing, which are probably similar to those used by other producers of crystallographic databases.
Table 1 An (incomplete) overview of available crystal structure databases

Provider         Raw data               No. of entries (year)   Search fields                 Software
FIZ              ICSD                   120,000 (2009)          Structure                     FindIt, WWW
ICDD             ICSD + LPF             285,000 (2008)          Powder pattern + structure    PDF4+/PDF2
CSD              CSD                    470,000 (2009)          Structure                     ConQuest
Crystal Impact   PCD                    165,000 (2008)          Structure + powder pattern    PCD
PDB              PDB                    57,000 (2009)           Structure                     N/A
COD              AMS + COD              48,000 (2006)           Structure                     Online, free
NISTMet          Metals                 40,000+                 Structure                     N/A
Toth             Metals File/CrystMet   126,000 (2008)          Powder pattern                TothToolkit
PCD              PDB                    52,000 (2009)           Structure                     Online, free
crystaleye       SI from publishers     100,000 (2007)          Structure                     Online, free
The above-mentioned databases provide very good coverage of the various aspects of chemistry. Crystal structures and powder patterns of organic and inorganic compounds, metals, and proteins can be searched. Each database is unique in its character, but there are overlaps resulting, e.g., from similar scopes or from long years of cooperation between producers. Many crystal structures are also available on the servers of publishing houses as supplementary information (SI) attached to scientific publications. These have grown into extensive free data collections during the past few years. On the other hand, databases that are available only at a cost still have an advantage in completeness of coverage. Most of these databases date back many years, which means many years of data collection and deep analysis of publications. Also, their evaluation process is more complex, and their software is more sophisticated and offers more and better functionality. Of course, the above-mentioned CIF format is a useful tool for combining data from various databases and using them for scientific studies. Databases of crystal structures (see Table 1 for an overview) are usually regarded as archives of unit cells and atomic coordinates. Coordinates have proved to be valuable input for, e.g., Rietveld refinements [18, 19] and similar refinements. Besides this, crystallographic databases contain much more valuable information, which can readily support high-quality research. In order to give an idea of the concepts behind such a database, this chapter explains certain details of the Inorganic Crystal Structure Database (ICSD). The ICSD [5, 6] contains information on all structures:
• Which have no C–C and no C–H bonds
• Which include at least one of the nonmetallic elements H, He, B–Ne, Si–Ar, As–Kr, Te, I, Xe, At, Rn
• Whose atomic coordinates have been fully determined or were derived from the corresponding structure types
Recently, crystal structure data of metallic and intermetallic compounds were introduced into ICSD. Each structure determination reported in the literature forms its own entry. The earliest entry comes from Bragg's paper on sodium chloride in 1913 [7, 8]; the most complex is the structure of the mineral johnsenite-(Ce), reported in 2006 by Grice and Gault, which contains 22 different elements [11]. At present (March 2009) the ICSD contains 120,000 entries. The database is maintained by the Fachinformationszentrum (FIZ) Karlsruhe with the assistance of several contributors around the world. The database is provided as a stand-alone version for installation on a local computer (FindIt, in cooperation with NIST), as a local intranet version suitable for small groups of users, and as a WWW version, which is hosted and maintained by FIZ Karlsruhe [1].
2 Contents
2.1 General
The ICSD contains records of all inorganic crystal structures published since 1913, including the very first crystal structure determination by X-rays (NaCl by Bragg and Bragg). Inorganic compounds in the context of ICSD are defined as compounds without C–C and/or C–H bonds and containing at least one of the following nonmetallic elements: H (D, T), He, B, C, N, O, F, Ne, Si, P, S, Cl, Ar, As, Se, Br, Kr, Te, I, Xe, At, Rn. To appear in ICSD a crystal structure has to be fully characterized, i.e., it has to contain unit cell data and all atomic coordinates as well as a fully specified composition. Crystal structure data are analyzed and categorized by experts with the help of sophisticated computer programs. During the recording process, additional information such as the Wyckoff sequence, Pearson symbol, molecular formula and weight, calculated density, ANX formula, mineral name, structure type, etc. is generated and added to the crystal structure data. Currently (ICSD 2009/1) the database contains 120,000 entries. More than 90% of the entries in ICSD are represented by compounds with 2–5 elements (Fig. 1). A statistical view of binary metal–element compounds, sorted by groups of the periodic table, can be seen in Fig. 2.
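Some of these derived fields follow directly from the deposited data. As a rough illustration (and not the actual FIZ Karlsruhe code), the calculated density can be obtained from the unit cell, the number of formula units Z, and the molecular weight of the formula unit; the sketch below uses the general triclinic volume formula and an illustrative NaCl example.

```python
import math

AVOGADRO = 6.02214076e23  # mol^-1

def cell_volume(a, b, c, alpha, beta, gamma):
    """Triclinic cell volume in A^3; cell edges in A, angles in degrees."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg)

def calculated_density(a, b, c, alpha, beta, gamma, z, mol_weight):
    """Density in g/cm^3: rho = Z * M / (N_A * V), with V converted from A^3 to cm^3."""
    v_cm3 = cell_volume(a, b, c, alpha, beta, gamma) * 1e-24
    return z * mol_weight / (AVOGADRO * v_cm3)

# NaCl (Bragg 1913): a = 5.64 A, Z = 4, M = 58.44 g/mol -> rho of about 2.16 g/cm^3
print(round(calculated_density(5.64, 5.64, 5.64, 90, 90, 90, 4, 58.44), 2))
```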
2.2 Description of Data Fields
The entries stored in the ICSD give full structural and bibliographic information including:
• Chemical name and phase designation
• Special name record
[Figure 1: bar chart of the number of ICSD entries vs. the number of constituent elements (elements and binary through heptanary compounds). The largest classes contain 42,539 (35.22%), 30,804 (25.50%), 22,959 (19.01%), and 14,277 (11.82%) entries; the smallest contain 5,315 (4.40%), 1,850 (1.53%), and 1,373 (1.14%).]
Fig. 1 Overview of multinary compounds. Compounds with 2–5 elements make up more than 90% of ICSD
[Figure 2: bar chart of the number of entries for binary metal (M) compounds with elements of groups 14–17 (M–Group 14, M–Group 15, M–Group 16 without oxides and sulfides, M–oxides, M–sulfides, M–Group 17); the classes contain 5,738, 2,559, 2,368, 2,247, 1,761, and 1,510 entries.]
Fig. 2 Overview of compounds with metal (M) and elements of groups 14–17 of the periodic table
• CAS Registry Numbers
• Mineral name and origin
• Chemical formula
• Unit cell dimensions and measured density
• Number of formula units
• Hermann–Mauguin space group symbol
• Atomic coordinates and site occupation
• Oxidation state of the elements
• Thermal parameters
• Temperature and pressure of measurement
• Reliability index
• Method of measurement
• Author, journal, volume and page, year of publication
• Title of paper
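To make this record structure more concrete, the sketch below gives a hypothetical, strongly simplified Python representation of such an entry. The field names are illustrative only; the real ICSD schema is far richer (more than 35 tables and 300 fields, as discussed in the Outlook).

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AtomSite:
    element: str            # element symbol, may include D for deuterium
    label: int              # simple integer atom identifier
    oxidation_state: float  # formal charge, 0 if it cannot be assigned
    wyckoff: str            # multiplicity plus Wyckoff letter, e.g. "8a"
    x: float
    y: float
    z: float
    occupancy: float = 1.0  # site occupation factor

@dataclass
class IcsdEntry:
    collection_code: int
    chemical_name: str
    formula: str
    cell: tuple              # (a, b, c, alpha, beta, gamma)
    z: int                   # number of formula units
    space_group: str         # Hermann-Mauguin symbol as published
    sites: List[AtomSite] = field(default_factory=list)
    r_value: Optional[float] = None
    reference: str = ""
```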
Chemical names follow the IUPAC rules; further electronic processing might result in difficulties, though. Modifications to the rules have been made to produce a standardized name suitable for computer treatment. The name should reflect the composition and structure of the compound, if possible. Phase designations are taken from the paper and standardized. Both the originally published as well as the standardized crystal structures are available. The special name record provides additional identification of the material. It contains the substance numbers from Landolt–Börnstein where the user may find further information and references to the compound in question. The mineral name follows the conventions of Strunz. However, some older names are included and in some cases the names of families of minerals such as feldspars and zeolites are given, even when the entry describes a synthetic member. The origin of the sample has been given if possible. The chemical formula is given in the normal structured form; atoms on identical sites, chemical building units or complexes are given in parentheses. The atomic symbols follow the normal sequence, e.g., (NH4 )2 (PO3 F)(H2 O)2
or
(Nb0.55 Zr0.45 )(O1.1 F1.8 )
For solid solutions or nonstoichiometric compounds the formula of the actual sample investigated is given. Minor constituents which cannot be observed by X-ray diffraction may be omitted from the formula and thus from the table of atomic coordinates. In these cases a second formula record containing the analytical composition is given, e.g.
Mineral: Cyprine
Chemical formula: Ca29 Al4 Cu Al8 Si18 O68 (O H)10
2nd formula record: (Ca28.28 Mn0.68) Al4 (Fe0.29 Cu0.71) (Al6.36 Mg0.56 Ti0.03 Zn0.97) (Si17.51 Al0.49) O68 ((O H)8.5 F1.5)
Because the number of formula units Z in the unit cell is always given as an integer, it is occasionally necessary to multiply the formula unit by a decimal fraction.
The unit cell dimensions are given in Ångström units and degrees. The number of formula units Z in the unit cell is also given in this record. The space group is given by the Hermann–Mauguin symbol. More than 400 different settings of the space groups have been used to report inorganic crystal structures, and to avoid introducing errors by transforming these to a standard setting, the authors' original settings have been retained. To make the space group symbol setting consistent some conventions and additions have been adopted:
• The bar always follows the number.
• The letter Z has been added at the end to indicate that the origin is at the centre of inversion.
• The letter S at the end of the symbol indicates a special origin (e.g., the second setting given in International Tables or a setting indicated by the symmetry operators that follow).
• Monoclinic space groups are always given the full symbol.
• Rhombohedral settings are indicated by the letter R at the end of the symbol.
• The obverse hexagonal setting of rhombohedral space groups is indicated by the letter H and the reverse setting by HR.
Symmetry records are included to give the symmetry operators of the special position when a nonstandard space group setting has been used. The atomic coordinates are preceded by the element symbol (which may include D for deuterium) followed by an atom identifier which is always a simple integer, regardless of the identifier used by the author. This is followed by the oxidation state, which is the formal charge of the atom in the most probable ionic formulation. Standard rules for the determination of oxidation states are applied. In some cases it may not be possible to assign individual oxidation states, and all the atoms of that element will then be assigned the same (possibly nonintegral) oxidation number. If an oxidation state for some atom cannot be assigned by any of these methods it is set to zero. In any case the sum of all the oxidation numbers in the unit cell must be zero. After the oxidation number follow the multiplicity and the Wyckoff symbol of the site, the coordinates (in decimals) with their standard deviations (as given by the author), and the site occupation factor. The site occupation factor always relates to the number of positions and only occurs if it is different from one. All the elements occupying the same site are listed separately, each with its own site occupation factor (unless this is less than 0.01). Hydrogen (and occasionally other) atoms whose coordinates have not been determined are included in the atom list without coordinates but with an appropriate site occupation factor (which in this case may be greater than 1.0) in order that the formula calculated from the atom list agrees with the structured chemical formula. The number of hydrogen atoms attached to each anion is indicated in those cases where the coordinates are not given (e.g., OH2, NH3). Thermal parameters (atomic displacement factors) are stored in the form given by the authors, either
exp(−B sin²θ/λ²)
or
exp(−8π²U sin²θ/λ²).
Temperature factors that are only deposited are indicated by L, but wherever possible they are included in the database. Anharmonic temperature factors are not included, but their existence in the original paper is indicated by the remark "AHT." The reliability index R serves as a rough measure of the quality of the structure determination. Authors use different definitions, and the lowest value is usually included in the database. Comments on the structure and its determination are included in the remarks, the more common of which have been coded into three-letter symbols. Unless otherwise stated, it is assumed that the structure determination was carried out at room temperature and pressure using single crystal X-ray diffraction. Any other method is indicated by the use of standardized remarks. A full description of each record is given in the Coding Instructions, which can be obtained from the authors.
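Two of the conventions just described lend themselves to a short illustration: the two displacement-factor forms are related by B = 8π²U, and the oxidation numbers, weighted by multiplicity and site occupation, must sum to zero over the unit cell. The sketch below is illustrative only and not part of the ICSD software; the spinel-like example uses MgAl2O4 occupancies.

```python
import math

def u_to_b(u_iso):
    """Convert U (A^2) to B (A^2): exp(-8*pi^2*U*sin^2(theta)/lambda^2) = exp(-B*sin^2(theta)/lambda^2)."""
    return 8.0 * math.pi ** 2 * u_iso

def charge_sum(sites):
    """Sum of oxidation_state * multiplicity * occupancy over all atom sites.

    Each site is a dict such as {"ox": -2, "mult": 32, "occ": 1.0};
    the result should be zero (within a small tolerance) for a valid entry.
    """
    return sum(s["ox"] * s["mult"] * s["occ"] for s in sites)

sites = [{"ox": +2, "mult": 8,  "occ": 1.0},   # Mg on 8a
         {"ox": +3, "mult": 16, "occ": 1.0},   # Al on 16d
         {"ox": -2, "mult": 32, "occ": 1.0}]   # O on 32e

print(round(u_to_b(0.005), 3))  # ~0.395 A^2
print(charge_sum(sites))        # 0 -> electroneutral
```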
2.3 Structure Types
Structure types were introduced into ICSD in 2005 [2]. For this purpose new standard remarks (labels) had to be integrated into the database; their names are TYP and STP, and they can be assigned to each entry. Every group of entries sharing a certain TYP label is represented by one member of the group. This arbitrarily chosen member serves as the structure prototype and is additionally marked with the STP label. Methods had to be developed to overcome the difficulties of determining structure types automatically and assigning crystal structures to their structure prototype [20]. Two definitions, as given in an IUCr report [10], proved to be suitable and fully sufficient as a theoretical concept for the task, namely isopointal and isoconfigurational structures. Two structures are described as isopointal if they have the same space-group type or belong to two enantiomorphic space-group types, and if the occupied atomic positions are the same in both structures, i.e., the sequence of occupied Wyckoff positions is identical for both structures after standardization. For differently standardized crystal structures, the Wyckoff sequence may depend on the selected cell origin. For example, in spinels (space group Fd-3m, with two standard origin settings) an origin shift by 1/2, 1/2, 1/2 will change the Wyckoff sequence from e d a into e c b. Isoconfigurational structures form a subset of the isopointal structures: such structures have to be isopointal, and in addition the crystallographic orbits of all their Wyckoff positions have to be similar. It has to be noted that the second definition is difficult to deal with, since the exact meaning of "similar geometric interrelationships" is not specified. For the introduction of structure types, novel methods that combine different criteria therefore had to be introduced. Following the definition of Lima-de-Faria, an a priori definition of geometric criteria is used for the distinction of structure families.
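As a toy illustration of the first criterion only (and not the algorithm actually used for ICSD [20]), the sketch below compares two standardized entries by space-group number and by their multiset of occupied Wyckoff positions; enantiomorphic pairs, origin ambiguities, and the geometric (isoconfigurational) criterion are deliberately ignored.

```python
from collections import Counter

def wyckoff_sequence(sites):
    """Multiset of occupied Wyckoff positions, e.g. {'8a': 1, '16d': 1, '32e': 1}."""
    return Counter(site["wyckoff"] for site in sites)

def possibly_isopointal(entry1, entry2):
    """True if both entries share the space-group number and Wyckoff sequence."""
    return (entry1["spacegroup_number"] == entry2["spacegroup_number"]
            and wyckoff_sequence(entry1["sites"]) == wyckoff_sequence(entry2["sites"]))

spinel_1 = {"spacegroup_number": 227,
            "sites": [{"wyckoff": "8a"}, {"wyckoff": "16d"}, {"wyckoff": "32e"}]}
spinel_2 = {"spacegroup_number": 227,
            "sites": [{"wyckoff": "8a"}, {"wyckoff": "16d"}, {"wyckoff": "32e"}]}
print(possibly_isopointal(spinel_1, spinel_2))  # True
```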
3 Acquisition of Information
The ICSD database was created in 1978, with annual updates comprising about 1,000–2,000 structures. During the past decades, the number of publications on inorganic crystal structures has increased steadily, and a more professional approach to database processing was called for. Today, about 7,000 new structures are recorded annually for ICSD, and the existing structures are regularly revised, corrected, and updated (see Fig. 3). This was made possible, among other things, by software enhancements that enabled today's largely automatic data acquisition and processing procedure. The number of modified structures has increased considerably during the past four years. This also included the deletion of many duplicates, which had been incorporated into ICSD in the 1980s as a result of distributed update procedures among the partners. These were problems of a more or less technical nature; they could be solved completely, and the database quality has increased dramatically during the past few years. In the early days of database development, crystal structures were recorded manually and checked only intellectually. Today, many records are incorporated electronically as CIF records, and verifications and checks are made automatically (see Sect. 4). A look at Fig. 4 shows that about 40% of all structures contained in ICSD are taken from the 12 most relevant journals. In addition to these, ICSD also contains structures that derive from more than 1,000 journals published worldwide. One can easily imagine that it is much more difficult to find the remaining 60% of structures and incorporate them in the database. On the other hand, it is this 60% that decides the completeness of coverage and thus the quality of the ICSD database. Information on the existence of relevant structures is obtained in various ways, e.g., by regular scanning of expert journals and also by regular searching of bibliographic databases
[Figure 3: bar chart of the number of new and modified crystal structures per year, 1981–2008.]
Fig. 3 Input development in ICSD over the years 1981–2008
Fig. 4 The 12 most productive journals in ICSD: Acta Crystallographica (A–E), J. of Solid State Chemistry, Z. für Anorg. und Allg. Chemie, J. of Alloys and Compounds, J. of the Less-Common Metals, American Mineralogist, Inorganic Chemistry, Z. für Kristallographie, Z. für Naturforschung, Kristallografiya, Materials Research Bulletin, Physica C
[Figure 5: flowchart of the ICSD data input procedure. Information reaches the database by electronic transmission of data from authors via telecommunication networks, by scanning of the most important journals, by searches in bibliographic databases for relevant articles, and by references from users to missing data. New entries are assigned ICSD numbers and checked for duplicates, the original documents are ordered, the data are transformed into the input format or excerpted/keyboarded, further data are added manually from the printed article, automatic and intellectual checking procedures are applied, and old data are corrected before the record enters the ICSD database.]
Fig. 5 Procedure of data input for ICSD. After [3]
and Internet publications. FIZ Karlsruhe also receives information from queries to our crystal structure depot and by scientists informing the authors of structures that should be incorporated. As soon as FIZ Karlsruhe receives information on publications that may contain relevant structures, FIZ experts start to check the contents of the original publication (see Fig. 5 for a schematic overview). If the information is up to the standard of
the ICSD quality criteria, the publication will be processed for incorporation in the database. If the information is incomplete, the publishers’ sites will be analyzed for any further structural information that may be obtained. If necessary, the authors will be contacted as well. There are two ways of data acquisition: Parts of the structures are still processed manually; on the other hand, increasing numbers of crystal structures are imported as CIF files. Electronic import is, of course, much more efficient and less prone to errors. Most of the CIF files can be generated from the crystal structure depot that is available at FIZ Karlsruhe. FIZ Karlsruhe started this depot of crystal structure data more than 25 years ago, first in printed form and, since the early nineties, also increasingly in electronic form. Today, only electronic data are stored in the depot. The crystal structure depot enables authors to store their extensive data and to refer to the stored record(s) in their publications. Access to the crystal structure depot is free on request for scientists (see also http://www.fiz-karlsruhe.de/request for deposited data.html).
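To give a flavour of why electronic CIF import is so much less error-prone than manual excerption, the deliberately minimal sketch below pulls a few single-valued items out of a CIF file. Real CIF handling (loops, multi-line values, standard uncertainties) requires a proper parser, and the tag list is only a small selection of standard CIF data names.

```python
WANTED = ("_cell_length_a", "_cell_length_b", "_cell_length_c",
          "_cell_angle_alpha", "_cell_angle_beta", "_cell_angle_gamma",
          "_symmetry_space_group_name_H-M", "_chemical_formula_sum")

def read_cif_items(path):
    """Collect simple 'tag value' pairs for a few standard CIF data names."""
    items = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.strip().split(None, 1)
            if len(parts) == 2 and parts[0] in WANTED:
                items[parts[0]] = parts[1].strip("'\"")
    return items

# Example (hypothetical file name):
# print(read_cif_items("deposited_structure.cif"))
```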
4 Revision and Evaluation
To maintain a high quality standard of ICSD, it is absolutely necessary to validate the compiled data prior to publication in the database. This step is essential also in the interest of the database user; it is the central element of database production. As mentioned earlier, many checks today are made automatically. This comprises formal checks, plausibility checks, and checks of certain constraints resulting from mathematical and physical laws. The sections below will present some examples.
4.1 Formal Checks
Formal checks are independent of the content; they comprise, e.g., duplication checks of new records, missing field entries, correctness of bibliographic data, standardization of authors' names, syntax, etc.
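A duplication check of new records can, in its simplest form, compare composition, space group, and cell parameters within tolerances. The sketch below is such a naive illustration; the real checks also use reduced cells, bibliographic data, and expert review, and all field names and tolerances here are assumptions.

```python
def possible_duplicate(rec1, rec2, length_tol=0.01, angle_tol=0.5):
    """Flag two records as possible duplicates when composition, space group,
    and cell parameters agree (relative length tolerance, absolute angle tolerance)."""
    if rec1["formula"] != rec2["formula"] or rec1["spacegroup"] != rec2["spacegroup"]:
        return False
    lengths_close = all(abs(a - b) / b < length_tol
                        for a, b in zip(rec1["cell"][:3], rec2["cell"][:3]))
    angles_close = all(abs(a - b) < angle_tol
                       for a, b in zip(rec1["cell"][3:], rec2["cell"][3:]))
    return lengths_close and angles_close

halite_1913 = {"formula": "Cl1 Na1", "spacegroup": "F m -3 m",
               "cell": (5.64, 5.64, 5.64, 90, 90, 90)}
halite_new  = {"formula": "Cl1 Na1", "spacegroup": "F m -3 m",
               "cell": (5.6402, 5.6402, 5.6402, 90, 90, 90)}
print(possible_duplicate(halite_1913, halite_new))  # True
```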
4.2 Verification of Contents
These checks refer to the correctness of the chemical and crystallographic information; they are of decisive importance for the scientist user. Among others, they comprise the following checks:
• Plausibility and validity of cell
• Matching of cell and space group
• Validity of oxidation state
• Multiplicity
• Site occupation
• Electroneutrality
• Molecular formula
• Plausibility of isotropic/anisotropic temperature factors
• Interatomic distances
These verifications are regular background processes during data processing; their results are evaluated intellectually by experts. Typical validation rules are:
• Plausibility and validity of cell: For the plausibility of the cell the angles must satisfy α + β + γ ≤ 360°, α ≤ β + γ, β ≤ α + γ, and γ ≤ α + β; for the validity of the cell the calculated density must lie in the range 0.5 g cm⁻³ < ρ < 20 g cm⁻³.
• Validity of the sum formula: The sum formula is calculated from the atomic parameters, site occupations, and site multiplicities and compared with the corresponding formula given by the author.
A small sketch of how such rules can be expressed in code is given at the end of this section. There are cases in which uncertainties remain even after automatic verification and intellectual checking. It is a basic principle of ICSD that original data should be retained as far as possible unless the database expert finds errors of a fundamental and obvious nature. In such cases the original data will be corrected, and the record will contain a comment written by the database expert to show where changes were made. If no corrections are made, so-called test flags are introduced instead. Examples are:
– Difference between the calculated formula and the formula in the formula field is tolerable.
– Deviation of the charge sum from zero is tolerable.
– Calculated density is unusual but tolerable.
– Displacement factors are those given in the paper but are implausible or wrong.
– A site occupation is implausible but agrees with the paper.
– Lattice parameters are unusual but agree with the paper.
– Coordinates are those given in the paper but are probably wrong.
– Reported coordinates contain an error; values were corrected.
In addition to the automatic checks described above, there are some aspects of ICSD that are verified and, if necessary, corrected only intellectually. This includes, e.g.:
• Relevance decision
• Chemical nomenclature
• Mineralogical nomenclature
• General, free remarks added to the structures
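The cell plausibility and density rules quoted above translate directly into code. The following sketch is purely illustrative; the field names, the charge-sum tolerance, and the flag texts are assumptions and do not reproduce the ICSD implementation.

```python
def cell_angles_plausible(alpha, beta, gamma):
    """Angle plausibility: alpha + beta + gamma <= 360 and each angle <= sum of the other two."""
    return (alpha + beta + gamma <= 360.0
            and alpha <= beta + gamma
            and beta <= alpha + gamma
            and gamma <= alpha + beta)

def density_valid(rho):
    """Calculated density in g/cm^3 must lie between 0.5 and 20."""
    return 0.5 < rho < 20.0

def validate_entry(entry):
    """Return a list of test flags instead of silently changing the data."""
    flags = []
    if not cell_angles_plausible(*entry["angles"]):
        flags.append("implausible cell angles")
    if not density_valid(entry["calc_density"]):
        flags.append("calculated density outside 0.5-20 g/cm^3")
    if abs(entry["charge_sum"]) > 0.05:
        flags.append("deviation of the charge sum from zero")
    return flags

print(validate_entry({"angles": (90, 90, 90), "calc_density": 2.16, "charge_sum": 0.0}))  # []
```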
5 Database Design
5.1 Access to ICSD Data
The ICSD database is offered via various channels:
• WWW Version. The ICSD Web version has been available since June 2009. It offers both the flexibility of a browser-based interface and the functionality of a graphical user interface. The new Web version was developed in 2008/2009 by FIZ Karlsruhe in order to meet both the increased requirements of the user community (user-friendly interface, easy navigation, up-to-date retrieval interface and visualization, flexible export of data) and the requirements of modern software development. For a schematic view of the database design see Fig. 6.
[Figure 6: layered architecture of the new ICSD Web version. The presentation layer (ICSD Main, ICSD Visualization) provides search and query management, CIF export, a Jmol 3D visualization applet, powder pattern calculation, and distance/angle calculation and runs on a Web application server; the business layer runs on an application server and uses a User Rights Management System and external executables (Lazy Pulverix, Bonds); the persistence layer holds the ICSD relational database and a user database.]
Fig. 6 Modules of the new ICSD Web version
• STN Version. The ICSD STN version is fully integrated in STN, the Scientific and Technical Information Network. The graphical user interfaces for this version of ICSD are STN Express and/or STN on the Web. The visualization components of the ICSD Web version are also available for the STN version.
• Intranet Version. The ICSD Intranet version has been available since June 2009. It is based on the previous ICSD Web version developed by A. Hewat and FIZ Karlsruhe. It can be installed locally in a corporate or campus network. It will be replaced as soon as possible by the new ICSD Web version components.
• CD-ROM Version. The PC-based search and retrieval interface was developed by NIST.
The Web system is based on a Java Enterprise Edition (JEE) multitier architecture. Each layer encapsulates a specific set of functionalities. The session management located in the Web application server tracks and authorizes the user's activities utilizing the services of a User Rights Management System. From there it is connected asynchronously to an Enterprise Resource Planning (ERP) system. The business process components run on a multithreaded application server. Some scientific executables are bound externally to fulfill the requirements of the user communities. This new Web version of ICSD will be used in the following sections for screenshots and examples.
6 Retrieval Examples
6.1 General Strategies
ICSD Web provides various retrieval options, from classic bibliographic searches through simple or combined searches to complex search strategies. The user is offered clearly structured search masks in which bibliographic data, chemical compositions, selected symmetry properties, measuring conditions, etc. can be entered, combined logically, and stored. The complex search strategies comprise all published data that are contained in the database, as well as data derived or calculated from these (e.g., structure types, Pearson symbol, ANX formula, interatomic distances, etc.). As a matter of principle, all records are stored twice, i.e., both in the published form and in a standardized form; they can be compared directly with each other or with the original data in a synoptic view. Searching via the reduced cell is possible as well. Depending on the query, the search result can be used for ordering the document in full text, e.g., in the case of reviews. Further, structured data can be exported in CIF format [12] for use as input data for all common visualization programs (Mercury [15], Diamond [17], Jmol [16]). The PDF number provides a reference to the powder pattern database of ICDD (PDF2/PDF4+) [13]. The following sections contain information for choosing the optimum search strategy, which should be both highly selective and comprehensive. Often, there are
several ways to arrive at the desired search result. In many cases, the Basic Search will be sufficient; this strategy is the most user-friendly and permits intuitive use. If this mode is not precise enough, there is also the Advanced Search mode, which offers several options. For example, the user can search all settings of the desired space group; he can search groups in the periodic table of the elements instead of single elements, or he can open a browser window to get a general idea of the available data. This may be useful, e.g., if the correct spelling of a name or structure type is not known and if wild card searching would produce too many undesired results. If a cell or space group is used as a start, the user must choose the data source he would like to access, i.e., standardized data, published original data, or the reduced cell data (Niggli). Search in standardized data is recommended whenever a comparison of several structures in a synoptic view is intended. It is also possible here to compare the published data with the standardized data. Unusual cell settings in a publication may not be of particular relevance; on the other hand, some settings may illustrate developments in phase changes or contain information on a group theory context. A look at the published data is particularly useful in these cases. Searching in reduced cell data has the advantage that, e.g., in the case of pyroxenes, the search is independent of the number of formula units per cell in a record (which is unknown to begin with). For example, via the reduced cell one finds both FeSiO3 (Z = 16) and Fe2 Si2 O6 (Z = 8), which would not be possible when searching via the formula with stoichiometric coefficients. Another advantage of reduced cell searching is the fact that the search is independent of the chemical composition and may provide unexpected results (e.g., isomorphic compounds, mixed crystal series). On the other hand, when searching in the reduced cell one needs to choose the optimal tolerance and to know how to interpret the results. Many users prefer searches for isotypic compounds in ICSD Web in order to obtain initial values for a new structural search. For classification of simple inorganic structures, the ANX type was used in the "Strukturberichte." The letters A–M represent the cations, N–R the elements with oxidation number zero, and S–Z the anions. Further details like the treatment of hydrogen, partially occupied sites, elements with different oxidation numbers, etc. are contained in the scientific reference manual of the ICSD Web database. The ANX type "AX" comprises structure types that are as different as NaCl, NiAs, ZnS (wurtzite type or zinc blende type), which can be derived from different packing patterns and also have different types of gaps (octahedral gaps, tetrahedral gaps) and different coordination polyhedra. The characteristic "AX" therefore is not sufficient for accurate classification. ICSD Web offers the additional option of searching for the Pearson symbol, which is described in detail in the scientific manual. To give an example: A search for mP4 (monoclinic, primitive cell with four atoms) alone is too unspecific to yield satisfactory results. More detailed searching, e.g., with space group number = 11 and the number of elements permitted = 2, will find only records of the compound NiTi on the one hand and records of the high-pressure phases of AgCl, AgBr, and AgI on the other hand.
If one applies the ANX concept to these two groups, AgCl belongs to “AX” (cation–anion) while NiTi belongs to “NO” (elements, oxidation number = 0). The Wyckoff sequence is 2e in both cases, i.e., it cannot be used for further differentiation. This is where ICSD Web offers the possibility of viewing both structures side by side in a synoptic view. ICSD Web also gives the user the option of further differentiation of the two groups by entering a structure type according to Allmann. As described in Sect. 2.3, Allmann developed a structure type concept based on 11 criteria, which enables further detailing of the structure description and a combination of these structures in the ICSD database. The concept is described in detail in the publication by Allmann and Hinek [2]; it regularly comprises the space group, Wyckoff sequence, and Pearson symbol as characteristics. For further differentiation, the ANX type, (c/a)min , (c/a)max , βmin , βmax , space group number as well as the occurrence or absence of certain elements are introduced. New records are automatically selected by these criteria and are assigned to structure types accordingly. In the above case, the high-pressure phases of AgCl, AgBr, and AgI are of the AgCl (mP4) type while NiTi belongs to the NiTi type. On the other hand, if we reverse the search by looking for the NiTi structure in ICSD Web, we will find that NiTi is the only representative of this type so far in the ICSD Web database. Another feature for visualization of crystal structures is the so-called Movie Display, where two structures are displayed alternatingly in a single window. This is an interesting feature, e.g. for illustrating slight differences between structures. One example of this is the alternating display of the superconductor YBa2 Cu3 O(7−x) in the space groups Pmmm and P4/mmm as described in [9].
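A toy derivation of an ANX-like label from a composition is sketched below. Only the letter assignment quoted above (A–M for cations, letters from X onwards for anions, hydrogen ignored) is taken from the text; the ordering by increasing count and all other details are assumptions and do not reproduce the special cases treated in the ICSD reference manual. The second example reproduces the label A4B7X52 used in the next section.

```python
def anx_label(species):
    """species: list of (element, oxidation_state, count_per_formula_unit)."""
    cations = sorted(c for e, ox, c in species if ox > 0 and e != "H")
    anions = sorted(c for e, ox, c in species if ox < 0)
    label = ""
    for letter, count in zip("ABCDEFGHIJKLM", cations):
        label += letter + (str(count) if count != 1 else "")
    for letter, count in zip("XYZ", anions):
        label += letter + (str(count) if count != 1 else "")
    return label

# Spinel MgAl2O4 -> "AB2X4"; Al4H2(SO4)7(H2O)24 -> "A4B7X52"
print(anx_label([("Mg", +2, 1), ("Al", +3, 2), ("O", -2, 4)]))
print(anx_label([("Al", +3, 4), ("H", +1, 50), ("S", +6, 7), ("O", -2, 52)]))
```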
6.2 The Black Tar Mystery
A spectacular example of a problem solved with ICSD is found in a publication by Kaduk [14]. In a BP refinery, a pump valve was clogged by an unknown black, viscous mass. It was feared that sulfuric acid might be released and react with the aluminium casing, although it was unclear whether this would happen at the onset or in the course of the process. The powder diagram (Fig. 7) indicated the presence of Al4 H2 (SO4 )7 (H2 O)24 , FeSO4 (H2 O), and Al2 (SO4 )3 (H2 O)17 ; the tar-like substance could be removed by washing with acetone. The crystal structure of Al4 H2 (SO4 )7 (H2 O)24 was unknown at the time. For a Rietveld refinement, it was first necessary to identify the structure. A search for the ANX formula (in this case A4B7X52; Al, S: cations, O: anion) resulted in a single hit, i.e., the compound Cr4 H2 (SO4 )7 (H2 O)24 . The two compounds were found to be isostructural. A Rietveld refinement with Al on the positions of the Cr atoms provided an excellent result.
[Figure 7: powder diffraction pattern (intensity in counts per second vs. 2θ from 10° to 60°) of the acetone-washed grey powder, with the matching reference patterns 22-0006 (Al4H2(SO4)7·24H2O, aluminum hydrogen sulfate hydrate) and 72-0981 (Cr4H2(SO4)7(H2O)24, chromium hydrogen sulfate hydrate).]
Fig. 7 Powder pattern of the unknown black substance
It was found that sulfuric acid was indeed released, which reacted with parts of the pump to form metastable Al4 H2 (SO4 )7 (H2 O)24 . Corrosion occurred in the course of the process and not (only) at the onset. The problem could be solved and the production process was optimized. Since then, the crystal structure of Al4 H2 (SO4 )7 (H2 O)24 has been given a record in the ICSD database (Collection Code 77310).
6.3 Example: Searching for Ice Ic
The following example shows how to search for and find crystal structures of ice Ic. As mentioned in Sect. 6.1, the first approach is to use the Basic Search function of ICSD Web. Entering H2 O in the appropriate field and limiting the number of elements to 2 gives the first set of results (Fig. 8). Unfortunately, among the results are entries for crystal structures of H2 O2 as well as H2 O2 ·H2 O, which are not of interest in this particular search (Fig. 9). A more suitable approach to the problem is the Advanced Search function (Fig. 10). Since we know that we are looking for ice, there is probably a structure type for it in ICSD. Opening the list of structure types and entering "H2 O" we find five results, among them the desired ice Ic. The results can be seen in Fig. 11. The result set includes one crystal structure of D2 O (entry 5), although we did not explicitly include any chemical element in the search. For a comprehensive overview it is possible to display the set of crystal structures or the simulated powder patterns in a synoptic view (Fig. 12).
Fig. 8 Basic search. Enter H2 O into the Composition field (1) and 2 into the Number of Elements field (2)
Fig. 9 Basic search results. Entries of H2 O2 (1 and 2) are not desired in this search
7 Outlook
As mentioned before, ICSD has undergone significant changes during the past decades. On the one hand, there is an exponential increase in the structures added annually. Bibliometric analyses have shown that the number of structures contained in ICSD has doubled every 10–11 years (Fig. 13) [4]. On the other hand, database experts are working continually on filling in the gaps in the older data (Fig. 14). Of
Fig. 10 Advanced search. Possible candidates for structure types beginning with “H2 O” (1). We enter the appropriate structure type into the corresponding search field (2)
Fig. 11 Advanced search results. All entries with the structure type of ice Ic are included
course, this dramatic increase in ICSD structures made it necessary to have a more efficient search functionality for quick and selective retrieval of the desired results. In 2001, ICSD changed over to a relational database system, which was a decisive step forward. The contents were then presented in 25 tables and about 200 database fields. The new Web interface permits searches for an even larger number of criteria; as a result, ICSD today comprises more than 35 tables and far more than 300 fields. The most important innovations were the introduction of structure types and the calculation of standardized data, which resulted in new search options.
Fig. 12 The synoptic view allows a quick comparison of crystal structures
[Figure 13: cumulative number of entries (0–120,000) vs. year (1920–2020), with a third-degree polynomial fit, R² = 0.9996.]
Fig. 13 Cumulative number of database records added per publication year. Fit with a polynomial of degree 3; parameters: a0 = −1,630,888,372, a1 = 2,538,524.873, a2 = −1,317.0895, a3 = 0.227785, R² = 0.9996
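Such a growth-curve fit is straightforward to reproduce with standard tools; the sketch below fits a cubic polynomial to cumulative counts per year, as in Fig. 13. The year/count pairs are rough illustrative values, not the actual ICSD statistics behind the figure.

```python
import numpy as np

# Illustrative (not actual) cumulative entry counts per publication year.
years = np.array([1920, 1940, 1960, 1980, 1990, 2000, 2009], dtype=float)
cumulative = np.array([50, 2000, 8000, 30000, 50000, 80000, 120000], dtype=float)

coeffs = np.polyfit(years, cumulative, deg=3)   # returns a3, a2, a1, a0
fitted = np.polyval(coeffs, years)

ss_res = np.sum((cumulative - fitted) ** 2)
ss_tot = np.sum((cumulative - cumulative.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(coeffs, round(r_squared, 4))
# A rough extrapolation of the fitted curve:
print(np.polyval(coeffs, 2020))
```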
Further continuous enhancement of the content of ICSD is envisaged also for the future. This includes on the one hand continuous updating for completeness and on the other hand the introduction of new data fields and contents as well as constant
[Figure 14: number of entries per publication year, 1950–2008, comparing the status of ICSD in 2005 with the status in 2009.]
Fig. 14 Gaps in old data are continuously filled
improvement of the database functionality. The focus is on high database quality, as this is the only way to enable extensive and complex data analyses and therefore data mining. The search strategies and examples presented here may give an idea of the many possibilities offered by a crystal structure database for solving problems in materials science. Search functionalities will develop further in the next years. New technologies such as Web 2.0 and the semantic web will also have their impact on databases like the ICSD. We can expect that future databases will offer more possibilities for interaction, e.g., easy integration of data into existing data collections or ways to annotate or comment on data. Standard exchange formats like CIF already make it possible to switch to other databases and consult them with regard to the problem at hand. It may be possible to develop networks of different databases to link different types of content. This may lead to advanced solutions in fields like data mining.
References
1. ICSD is available at FIZ Karlsruhe at http://www.fiz-karlsruhe.de/icsd.html or http://icsd.fiz-karlsruhe.de (2009)
2. Allmann R, Hinek R (2007) The introduction of structure types into the inorganic crystal structure database ICSD. Acta Crystallogr Sect A 63:412–417
3. Behrens H (1996) Data import and validation in the inorganic crystal structure database. J Res Natl Inst Stand Technol 101:365–373
4. Behrens H, Luksch P (2006) A bibliometric study in crystallography. Acta Crystallogr Sect B 62:993–1001
5. Belsky A, Hellenbrandt M, Karen VL, Luksch P (2002) New developments in the Inorganic Crystal Structure Database (ICSD): accessibility in support of materials research and design. Acta Crystallogr Sect B 58:364–369
6. Bergerhoff G, Brown ID (1987) Crystallographic databases. International Union of Crystallography, Chester, pp 77–95
7. Bragg WH, Bragg WL (1913) The reflection of X-rays by crystals. Proc Roy Soc London Ser A, Containing Papers of a Mathematical and Physical Character 88:428–438
8. Bragg WL (1913) The structure of some crystals as indicated by their diffraction of X-rays. Proc Roy Soc London Ser A, Containing Papers of a Mathematical and Physical Character 89:248–277
9. Cava RJ, Hewat AW, Hewat EA, Batlogg B, Marezio M, Rabe KM, Krajewski JJ, Peck WF Jr, Rupp LW Jr (1990) Structural anomalies, oxygen ordering and superconductivity in oxygen deficient Ba2YCu3Ox. Physica C 165:419–433
10. Lima-de-Faria J, Hellner E, Liebau F, Makovicky E, Parthé E (1990) Nomenclature of inorganic structure types. Report of the International Union of Crystallography Commission on Crystallographic Nomenclature Subcommittee on the Nomenclature of Inorganic Structure Types. Acta Crystallogr Sect A 46:1–11
11. Grice JD, Gault RA (2006) Johnsenite-(Ce): a new member of the eudialyte group from Mont Saint-Hilaire, Quebec, Canada. Can Mineral 44:105–115
12. Hall S, McMahon B (eds) (2005) Definition and exchange of crystallographic data. International Tables for Crystallography, vol G. Springer, Dordrecht
13. Kabekkodu SN, Faber J, Fawcett T (2002) New Powder Diffraction File (PDF-4) in relational database format: advantages and data-mining capabilities. Acta Crystallogr Sect B 58:333–337
14. Kaduk JA (2002) Use of the inorganic structure database as a problem solving tool. Acta Crystallogr Sect B 58:370–379
15. Macrae CF, Bruno IJ, Chisholm JA, Edgington PR, McCabe P, Pidcock E, Rodriguez-Monge L, Taylor R, van de Streek J, Wood PA (2008) Mercury CSD 2.0 – new features for the visualization and investigation of crystal structures. J Appl Crystallogr 41:466–470
16. McMahon B, Hanson RM (2008) A toolkit for publishing enhanced figures. J Appl Crystallogr 41:811–814
17. Pennington WT (1999) Diamond – visual crystal structure information system. J Appl Crystallogr 32:1028–1029
18. Rietveld HM (1967) Line profiles of neutron powder-diffraction peaks for structure refinement. Acta Crystallogr 22:151–152
19. Rietveld HM (1969) A profile refinement method for nuclear and magnetic structures. J Appl Crystallogr 2:65–71
20. Villars P, Cenzual K (eds) (2004) Structure types. Part 1: space groups (230) Ia-3d – (219) F-43c. Landolt–Börnstein – Group III Condensed Matter, vol 43A1. Springer, Berlin
Struct Bond (2010) 134:59–87
DOI: 10.1007/430_2009_4
© Springer-Verlag Berlin Heidelberg 2009
Published online: 1 September 2009
Data Mining and Inorganic Crystallography
Krishna Rajan
Abstract Inorganic crystallography is a data intensive field of science. Much of the work in this vast field is focused on ways to acquire, model, organize and manage that data. To a lesser extent, there are efforts to survey that information from which one hopes to glean patterns of behavior that would offer insight into the complex chemical and geometrical relationships governing the existence or stability of a given compound. In this article we provide an overview of the types of information that can be gleaned by applying data mining and statistical learning techniques to inorganic crystallography. The focus of the paper is on two broad areas, classification and prediction of data, which are the two primary roles of data mining as a field.
Keywords: Crystal structure · Data mining · Descriptors · Dimensionality reduction · Prediction
Contents
1 Introduction: Knowledge Discovery in Inorganic Crystallography ............. 60
2 Data Systematics: Enumeration Schemes for Inorganic Crystallography ........ 62
3 Data Classification: Structure–Chemistry Relationships ..................... 64
  3.1 Geodesic Distance Measures for Data Clustering ......................... 69
  3.2 Topology Preserving Distance Measures for Data Clustering .............. 71
4 Data Prediction: Structure–Property Relationships .......................... 73
5 Data Mining Aided Structure Prediction ..................................... 75
6 Descriptor Refinement via Data Mining ...................................... 78
7 Conclusions ................................................................ 79
References ................................................................... 84
K. Rajan
Department of Materials Science and Engineering, Iowa State University, Ames, IA, USA
e-mail: [email protected]
Abbreviations
GA       Genetic algorithms
IsoMap   A nonlinear data dimensionality reduction algorithm
KDD      Knowledge discovery in data(bases)
LLE      Locally linear embedding
PCA      Principal component analysis
PLS      Partial least squares
QSAR     Quantitative structure–activity relationship
SVM/SVR  Support vector machines/regression
1 Introduction: Knowledge Discovery in Inorganic Crystallography
At first glance, the need for data mining in crystallography seems natural, as crystallographic data sets are large and the search for the appropriate information could clearly be aided by search algorithms. However, the focus of this article is to discuss how data mining of inorganic crystallography databases (or other forms of information repositories) can serve as a tool that goes far beyond a "search and retrieve" function to one of scientific enquiry and discovery of knowledge. Broadly speaking, data mining techniques have two primary functions, pattern recognition and prediction, both of which form the foundations for understanding materials behavior. Following the treatment of Tan et al. [1], the former, which is more descriptive in scope, serves as a basis for deriving correlations, trends, clusters, trajectories and anomalies among disparate data. The interpretation of these patterns is intrinsically tied to an understanding of materials physics and chemistry. In many ways this role of data mining is similar to the phenomenological structure–property paradigms that play a central role in the study of engineering materials, except that now we will be able to recognize these relationships with far greater speed and not necessarily depend on a priori models, provided of course we have the relevant data. The predictive aspect of data mining tasks can serve for both classification and regression operations. Data mining, which is an interdisciplinary blend of statistics, machine learning, artificial intelligence and pattern recognition, is viewed as having a few core tasks:
Cluster analysis seeks to find groups of closely related observations and is valuable in targeting groups of data that may have well behaved correlations and can form the basis of physics-based as well as statistically based models. Cluster analysis, when integrated with high throughput experimentation, can serve as a powerful tool for rapidly screening databases.
Predictive modeling helps build models for targeted objectives (e.g., a specific materials property) as a function of input or exploratory variables. The
success of these models also helps refine the usefulness and relevance of the input parameters.
Association analysis is used to discover patterns that describe strongly associated features in data (for instance the frequency of association of a specific materials property with a materials chemistry). Such an analysis over extremely large data sets is made possible by the development of very high speed search algorithms and can help to develop heuristic rules for materials behavior governed by many factors.
Anomaly detection does the opposite by identifying data or observations significantly different from the norm. Being able to identify such anomalies or outliers is critical in materials science, since it can serve to identify a new class of materials with an unusual property or to anticipate potentially harmful effects, which are often identified only through retrospective analysis.
The concept of "knowledge discovery in data" (or KDD, as described in the data mining literature [2, 3]) lies at the core of the discipline of data mining and consists of a series of steps, each of which can be viewed in isolation or in groups as constituting informatics (see Fig. 1). The original data come of course from experimentally generated diffraction studies of inorganic crystals. It should be
[Figure 1: schematic of the KDD workflow, from diffraction data and crystallographic databases (original data), through data warehousing, preprocessing (data enumeration schemes), feature extraction and data transformation (dimensionality reduction, clustering), and data mining and visualization, to statistical learning, interpretation, and knowledge patterns.]
Fig. 1 A schematic providing an overview of the range of objectives and steps associated with knowledge discovery in databases (a formal term known as “KDD” in the field of data mining). Examples of the types of data gathering and analysis techniques specifically associated with inorganic crystallography in the context of the steps in KDD are given and elaborated in the text above. In this article we focus on the latter steps outlined in the dashed boxes and examples of types of analysis associated with these data mining activities are cited
pointed out that data mining provides an opportunity to actually predict unknown structures, but that will be discussed later on. The concept of data warehousing, as the name suggests, refers to the building of a repository of that information. In the case of crystallography, the organization of that repository is manifested in the crystallographic tables, which are of course based on group theory and provide a clear formalism for the hierarchy of information and the critical descriptors that describe a crystal structure. These important steps in data mining in fact constitute a major fraction of the reports in crystallography. This chapter addresses the other steps in the informatics toolkit: preprocessing, feature extraction, data transformation, data mining and visualization, pattern recognition, and finally interpretation leading to knowledge discovery. In the context of inorganic crystallography, preprocessing refers to what we do with the data that we get from databases before embarking on a study, such as developing enumeration schemes which subsequently help to identify systematic behavior among data sets and thus permit feature extraction. These systematics, however, are limited to two and three dimensions; exploring higher levels of dimensionality requires the data to be transformed mathematically using methods that reduce the dimensionality of the data so that they can be visualized and mined using statistical learning techniques. From this higher level of data complexity, patterns can be resolved in the data using techniques such as clustering and prediction methods. When combined with a thorough understanding of the physics, chemistry and crystallographic principles associated with inorganic crystal structures, the physical meaning of the observed patterns can be interpreted, and hence knowledge discovery is achieved. The following discussion expands on each of these data mining steps.
2 Data Systematics: Enumeration Schemes for Inorganic Crystallography
The organization of data can occur in many ways. Crystallographic tables, for instance, have an organizational hierarchy that is ultimately based on group theory. Other forms of data organization are derived by exploiting systematics in data [4–19]. The discovery of such systematics in data (whether derived from computation or from experiments) is often empirical. There are numerous examples in the history of inorganic crystallography where such trends in data have been defined through heuristic observation. Structure maps, for instance, are a consequence of such efforts to develop enumeration schemes for crystallographic data [20–23]. Villars et al. [24] have surveyed both single elemental structures and nearly fifteen thousand binary, ternary and quaternary compounds and have phenomenologically linked a variety of descriptors related to properties based on both experimental and computationally derived data. They came up with enumeration schemes that identified both structural and chemical attributes, for instance linking the role of valency to atomic number. They used these phenomenologically derived relationships to
Fig. 2 A plot of a dimensionless ratio of valency effects vs. atomic size effects based on Villars et al. enumeration schemes which permits clear separation of over 2,300 binary systems into compound formers (blue) and noncompound formers (yellow) (From [24])
discover patterns among thousands of compounds and to identify which combinations of elements can produce binary, ternary and quaternary compounds (Fig. 2). Other approaches for tracking and predicting systematics in crystallographic data include the use of tiling theory, especially for framework-like structures. For instance, Foster et al. [25] have developed a systematic enumeration of three-dimensional periodic networks based on advances in mathematical tiling theory. These enumeration schemes provide a means of identifying clusters of possible geometric arrangements in complex inorganic crystallographic structures. Another well established approach to developing data systematics is the use of geometrical modeling equations that relate a complex array of geometrical parameters such as bond angles, bond lengths and other geometric and electronic structure information. This approach allows one to explore scalar variations in individual characteristics of a crystal structure [26, 27]. For instance, following the work of Sickafus [26], Suh and Rajan [27] have used this approach to track the influence of the lattice parameter and the dilation of the anion position (termed the anion parameter, u) away from an ideal cubic close packed (ccp) spinel structure. The numerous relationships are shown in Fig. 3 and are associated with the equations listed in Table 1.
Fig. 3 Governing factors for spinel nitrides, AB2 N4. Note that the expressions of variables for bond lengths and polyhedral volumes follow the work of Sickafus et al. [26]; A and B represent tetrahedral (tet) and octahedral (oct) cation sites, respectively, and square and triangle represent an octahedral vacancy and a tetrahedral vacancy, respectively. Shown are plots of some of these equations quantifying the effects of varying lattice parameters and other geometrical parameters on bonding as measured by the anion parameter (from Suh and Rajan [27])
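As a small illustration of how such geometrical modeling equations are used, the sketch below evaluates a few of the descriptors collected in Table 1 (Eqs. 3, 4, 33 and 36) for a given lattice parameter a and anion parameter u; the numerical values are purely illustrative, not data for any specific spinel nitride.

```python
import math

def bond_length_A_N(a, u):
    """Eq. 3: BL(A-N) = sqrt(3) * a * (u - 1/4)."""
    return math.sqrt(3.0) * a * (u - 0.25)

def bond_length_B_N(a, u):
    """Eq. 4: BL(B-N) = a * [2(u - 3/8)^2 + (5/8 - u)^2]^(1/2)."""
    return a * math.sqrt(2 * (u - 0.375) ** 2 + (0.625 - u) ** 2)

def volume_tet_A8a(a, u):
    """Eq. 33: V_tet(A-8a) = (8/3) a^3 (u - 1/4)^3."""
    return (8.0 / 3.0) * a ** 3 * (u - 0.25) ** 3

def volume_oct_B16d(a, u):
    """Eq. 36: V_oct(B-16d) = (16/3) a^3 (1/2 - u)^2 (u - 1/8)."""
    return (16.0 / 3.0) * a ** 3 * (0.5 - u) ** 2 * (u - 0.125)

a, u = 8.0, 0.38  # hypothetical lattice parameter (A) and anion parameter
print(bond_length_A_N(a, u), bond_length_B_N(a, u))
print(volume_tet_A8a(a, u), volume_oct_B16d(a, u))
```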
3 Data Classification: Structure–Chemistry Relationships
The search for stable compound structures based on information about the constituent elements is a classic crystal chemistry problem. For the prediction of possible new materials, the use of structure maps has been a traditional approach for searching for stable phases. Structure mapping has played an important role as a useful a priori guide for finding stable phases and serves as a visualization tool for structure–property relationships in a bivariate way. Physical factors governing stable crystal structures serve as the coordinates of structure maps [28]. With carefully chosen physical factors, each compound can be spatially identified by its structure type. From the viewpoint of informatics, a structure map is a classification tool whereby, through the choice of appropriate coordinates, one can map the clustering of crystal structure related data. There is of course a long and distinguished tradition of such maps, including Mooser–Pearson plots, Phillips and van Vechten diagrams, Goldschmidt diagrams and Pettifor plots, just to mention a few examples [29–31]. Each of these mapping schemes identifies some key parameters related to electronic or crystal structure information which are placed on orthogonal axes, and the
Table 1 Geometrical expression of crystallographic descriptors (from [26])
3.  BL(A–N) = A8a − N32e = √3 a(u − 1/4)
4.  BL(B–N) = B16d − N32e = a[2(u − 3/8)² + (5/8 − u)²]^(1/2)
20. A8a − □16c or B16d − Δ8b,48f = (√3/8)a
21. A8a − Δ48f = (1/4)a
22. A8a − Δ48f (2nd n.n.), B16d − B16d, or B16d − □16c = (√2/4)a
23. A8a − B16d, B16d − Δ8b, or B16d − Δ48f (2nd n.n.) = (√11/8)a
24. A8a − A8a or A8a − Δ8b = (√3/4)a
25. (N32e − N32e)1 = 2√2 a(u − 1/4)
26. (N32e − N32e)2 = 2√2 a(1/2 − u)
27. (N32e − N32e)3 = 2a{(u − 3/8)² + 1/32}^(1/2)
28. Δ8b − N32e = √3 a(1/2 − u)
29. (Δ48f − N32e)1 = a{2(u − 1/4)² + (1/2 − u)²}^(1/2)
30. (Δ48f − N32e)2 = a{(u − 1/4)² + 2(1/2 − u)²}^(1/2)
31. □16c − N32e = a{2(u − 3/8)² + (u − 1/8)²}^(1/2)
33. Vtet(A−8a) = (8/3)a³(u − 1/4)³
34. Vtet(A−8b) = (8/3)a³(1/2 − u)³
35. Vtet(A−48f) = (1/3)a³(1/2 − u)(u − 1/4)
36. Voct(B−16d) = (16/3)a³(1/2 − u)²(u − 1/8)
37. Voct(B−16e) = (16/3)a³(u − 1/4)²(5/8 − u)
38. Voct(N−32e) = (1/192)a³
occurrence of a given crystal chemistry is then plotted accordingly. The resulting diagram maps out the relative positions of structure types, from which one tries to discern qualitatively whether there are strong associations of certain crystal types with certain bivariate combinations of parameters. This, however, does not address the multivariate nature of the parameters associated with crystal chemistry and hence provides a strong motivation to apply the data dimensionality reduction techniques discussed earlier. Building on this multidimensional character of crystallographic data, Suh and Rajan have for instance demonstrated a strategy for developing structure maps without any a priori assumption of which two parameters are to be selected for developing a structure map [32]. They treated this as a multivariate analysis problem in which they collectively input many of the well known and accepted variables that can have an influence (termed "latent variables" in the jargon of informatics) on the occurrence of a given type of crystal structure [33–40]. We demonstrate that by using this approach we can in fact identify the appropriate selection of variables for use in two-dimensional structure maps. The first step in this process is applying data dimensionality reduction techniques such as Principal Component Analysis. The computational details are not described here, but the reader is referred to numerous standard texts in the field [41–46] as well as to papers that have used this specific multivariate data analysis method in materials science applications [47–58].
Principal component analysis (PCA) (also referred to as singular value decomposition) is one of a family of related techniques, including Factor Analysis and Principal Coordinate Analysis, that provide a projection of complex data sets onto a reduced, easily visualized space. The problem with multivariate data is that it cannot be displayed in just two or three dimensions. For more than two dimensions, we have to project the data onto a plane. This projection changes with its direction; or, in other words, the projected image changes if the data points are rotated in the N-dimensional space. One might now ask how to find a rotation of the data (or of the axes – which is quite the same) which displays a maximum of information in the projected data. The strategy of the analysis involves the manipulation of a data matrix where the goal is to represent the variation present in many variables using a small number of factors. A new row space is constructed in which to plot the data, where the axes represent the weighted linear combinations of the variables affecting the data. Each of these linear combinations is independent of each other and hence orthogonal. The data when plotted in this new space is essentially a correlation plot, where the position of each data point not only captures all the influences of the variables on that data but also its relative influence compared to the other data. Descriptors can, for instance, be a variety of geometrical and/or electronic parameters associated with crystal structure (see, for example, Fig. 4). The enormous number of descriptors makes screening through scatter maps a prerequisite. From the screened descriptor space, we can select a region for the solution of our problem. We
Fig. 4 Examples of different framework structures predicted by enumeration schemes developed through tiling theory and the corresponding space groups. The reader is directed towards the original reference for the details of the notation. (From [25])
can also simplify this selection by compressing the dimensionality of the descriptor space by linear combinations of the original descriptors. PCA gives us a method to transform multiple descriptors into a much smaller set of descriptors without losing much information. This allows easier visualization and also easier prediction and classification (Fig. 5). Mathematically, PCA relies on the fact that most of the descriptors are interrelated and these correlations in some instances are high. It results in a rotation of the coordinate system in such a way that the axes show a maximum of variation (covariance) along their directions. This description can be mathematically condensed to a so-called eigenvalue problem. The data manipulation involves decomposition of the data matrix X into two matrices V and U. The two matrices V and U are orthogonal. The matrix V is usually called the loadings matrix, and the matrix U is called the scores matrix. The loadings can be understood as the weights for each original variable when calculating the principal component (PC). The matrix U contains the original data in a rotated coordinate system. The mathematical analysis involves finding these new "data" matrices U and V. The dimension of U (i.e., its rank) that captures all the information of the entire data set X is far less than that of X (i.e., the number of variables), ideally 2 or 3. One now compresses the N-dimensional plot of the data matrix X into a two or three dimensional plot of U and V. The eigenvectors of the covariance matrix constitute the PCs. The corresponding eigenvalues give a hint as to how much "information" is contained in the individual components. The first PC accounts for the maximum variance (eigenvalue) in the original dataset. The second PC is orthogonal (uncorrelated) to the first and accounts for most of the remaining variance. Thus the mth PC is orthogonal to all others and has the mth largest variance in the set of PCs. Once the N PCs have been calculated using eigenvalue/eigenvector matrix operations, only PCs with variances above a critical level are retained. The M-dimensional PC space has retained most of the information from the initial N-dimensional descriptor space, by projecting it into orthogonal axes of high variance. The complex tasks of prediction or classification are made easier in this compressed, reduced dimensional space. In this chapter we will be providing numerous examples of such plots and much of our data mining analysis will be based on data represented in this reduced dimensional space. Using AB2N4 spinel nitrides as a template, Suh and Rajan [32] have assessed the statistical interdependency of each of the descriptors that may influence chemistry–structure–property relationships. Using PCA, they demonstrated that classical versions of structure maps (from the early work of Hill [59]) based on heuristic observations for this class of crystal chemistry can in fact be reproduced via data mining (Fig. 6).
The informatics approach hence provides an alternative method for visualizing structure maps as well as interpreting structure-property relationships.
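To make the procedure concrete, the following is a minimal sketch (not part of the original study) of how a compounds-by-descriptors matrix can be autoscaled and projected onto its principal components with scikit-learn; the tiny matrix simply reuses a few rows (atomic number, melting point, electronegativity and lattice constant) from the semiconductor table in the appendix, purely for illustration.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Compounds-by-descriptors matrix; rows reuse the values for AlN, AlP, AlAs,
# GaAs and ZnS (atomic number, melting point, EN, lattice constant) from the
# appendix table, purely for illustration.
A = np.array([
    [10, 498.0, -1.21, 3.11],   # AlN
    [14, 625.0, -0.68, 5.47],   # AlP
    [23, 1011.5, -0.63, 5.66],  # AlAs
    [32, 545.0, -0.57, 5.65],   # GaAs
    [23, 540.0, -1.21, 5.41],   # ZnS
])

X = StandardScaler().fit_transform(A)   # unit-variance (UV) autoscaling
pca = PCA()                             # keep all principal components
scores = pca.fit_transform(X)           # compounds projected into PC space
loadings = pca.components_.T            # weight of each descriptor in each PC

print("explained variance ratio:", pca.explained_variance_ratio_)
print("scores (first two PCs):\n", scores[:, :2])
print("loadings (first two PCs):\n", loadings[:, :2])

Inspecting the loadings shows which original descriptors dominate each PC, and therefore which parameters would be the most informative axes for a conventional two-dimensional structure map.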
[Fig. 5 schematic: a data matrix of M compounds × N descriptors (latent variables/response metrics) spans an N-dimensional data space with axes Descriptor 1 ... Descriptor n; principal component analysis (the goal is to find a projection that captures the largest amount of variation in the data) performs the dimensionality reduction into the principal component (eigenvector) space (PC1, PC2, PC3), where each PC is described as a linear combination of the weighted contribution of each variable; binning/clustering follows, each data point representing a correlation position of the compound as influenced by all descriptors, and patterns of clustering in PCA space (which may involve other statistical and data-mining techniques) are integrated into a materials science interpretation for knowledge discovery.]
Fig. 5 Summarizes the procedural logic of PCA. From a set of N correlated descriptors, we can derive a set of N uncorrelated descriptors (the principal components). Each principal component (PC) is a suitable linear combination of all the original descriptors. The first PC accounts for the maximum variance (eigenvalue) in the original dataset. The second PC is orthogonal (uncorrelated) to the first and accounts for most of the remaining variance. Thus the mth PC is orthogonal to all others and has the mth largest variance in the set of PCs. Once the N PCs have been calculated using eigenvalue/eigenvector matrix operations, only PCs with variances above a critical level are retained. The M-dimensional PC space has retained most of the information from the initial N-dimensional descriptor space, by projecting it into orthogonal axes of high variance. The complex tasks of prediction or classification are made easier in this compressed space. If we assume that information from the data can be gained only if the variation along an axis is a maximum, we have to find the directions of maximum variation in the data. In addition, these new axes should again be orthogonal to each other. To find the new axes, the direction of the maximum variation should be found first to take it for the first axis. Thereafter we use another axis which is normal to the first and rotate it around the first axis until the variation along the new axis is a maximum. Then we add a third axis, again orthogonal to the other two and in the direction of the maximum remaining variation, and so on (ideally we can capture all the meaningful relationships in three dimensions). (From [36, 50])
Fig. 6 Comparison of a principal component plot (a) and a structure map (b) using the parameterization based on enumeration schemes proposed by Hill [59]. The similarity between the two plots is what is brought to attention here. This serves as an example of how data mining techniques using discrete geometrical and electronic parameters, without any prior knowledge of any enumeration schemes or crystal chemistry equations, can still recover the identification of critical parameters governing crystal structure. (From [51])
While PCA is a very powerful technique, it is only a first order step among the approaches available for data dimensionality reduction. As shown above, PCA can find patterns and associations and provides a useful visualization tool for tracking correlations in high dimensional data. However, this assumes a geometric visualization that can be interpreted in terms of linear mappings with Euclidean distances, i.e., that the variables contributing to the properties interact as a linear combination. To handle nonlinear relationships we need other ways of finding meaningful low dimensional structures hidden in high dimensional observations [60]. In the following discussion we briefly summarize two approaches and show some representative results on the same spinel data sets discussed for the PCA (Fig. 7). We shall give examples of two methods, one based on distance-preserving and the other on topology-preserving mappings. The PCA method described earlier uses a Euclidean distance measure, which works well if all the latent variables interact as a linear combination of each other. If the data correlation space, however, is such that these contributions interact in a nonlinear fashion that makes them invisible to PCA, we then need to explore nonlinear data dimensionality reduction techniques.
3.1 Geodesic Distance Measures for Data Clustering A sense of the challenge in defining a neighborhood distance is well demonstrated by the "Swiss roll" template, where the shortest path between points on the surface of the Swiss roll can either be treated as a straight line in high dimensional space or as a curved line following the trajectory of the Swiss roll surface
Fig. 7 A schematic showing the hierarchical relationships of a few dimensionality reduction techniques (adapted from [60]). Linear – Euclidean: PCA (principal component analysis). Non-linear – Euclidean: NLM (non-linear mapping); geodesic: Isomap; PDL (pre-defined lattice): self-organizing maps; DDL (data-driven lattice): locally linear embedding (LLE)
(Fig. 8). The straight line, however, does not capture the true similarity between points. The IsoMap technique effectively unravels this shape into two dimensions, revealing the true relationships between points residing on this complex surface by using geodesic distance metrics rather than Euclidean metrics [61]. From this geometric three dimensional example, we can extend this to N dimensions and ask the question of how latent variables influencing crystal structure and chemistry, for instance, are related without the assumption of linear interactions between the descriptors. In the crystallography example, our "Swiss roll" is actually a very high dimensional "data correlation" surface for which we do not have a geometric description, and hence searching for clustering in a Euclidean framework may not capture the right relationships between variables. The IsoMap approach unravels that surface into two dimensions. We have for instance repeated our PCA calculations on spinels using IsoMap (Fig. 9) [62]. In this case the two dimensional mappings are qualitatively similar, but upon closer examination we can observe that the neighborhood associations of certain compounds are different, suggesting that the impact of certain descriptors is uniquely different for those compounds.
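As an illustration of the geodesic idea (not the calculation of the original study), the sketch below applies scikit-learn's Isomap to the synthetic Swiss-roll data set and contrasts it with a plain PCA projection; the neighborhood size follows the K = 7, N = 1,000 example of Fig. 8.

import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

# Synthetic Swiss-roll data set (N = 1,000 points on a rolled-up 2D sheet).
X, color = make_swiss_roll(n_samples=1000, random_state=0)

# Linear projection: uses straight-line (Euclidean) distances in the ambient space.
X_pca = PCA(n_components=2).fit_transform(X)

# Nonlinear embedding: shortest paths on a K-nearest-neighbor graph approximate
# geodesic distances along the manifold (K = 7, as in the Fig. 8 example).
X_iso = Isomap(n_neighbors=7, n_components=2).fit_transform(X)

print("PCA embedding shape:   ", X_pca.shape)
print("Isomap embedding shape:", X_iso.shape)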
Fig. 8 The “Swiss roll” data set, illustrating how IsoMap exploits geodesic paths for nonlinear dimensionality reduction. (a) For two arbitrary points (circled) on a nonlinear manifold, their Euclidean distance in the high dimensional input space (length of dashed line) may not accurately reflect their intrinsic similarity, as measured by geodesic distance along the low dimensional manifold (length of solid curve). (b) The neighborhood graph G constructed in step one of IsoMap (with K = 7 and N = 1, 000 data points) allows an approximation (red segments) to the true geodesic path to be computed efficiently in step two, as the shortest path in G. (c) The two dimensional embedding recovered by IsoMap in step three, which best preserves the shortest path distances in the neighborhood graph (overlaid). Straight lines in the embedding (blue) now represent simpler and cleaner approximations to the true geodesic paths than do the corresponding graph paths (red) (From [61])
Fig. 9 Shows that, except for one data point, PCA and Isomap yield almost identical results as far as the classification of samples is concerned. The spinel chemistry denoted by the red circle lies on the opposite side of the reduced dimensionality space as determined by Isomap, suggesting subtleties in the role the descriptors have on this particular compound (Hf3N4) relative to all the other compounds. (Meco, Suh and Rajan, unpublished work [62])
3.2 Topology Preserving Distance Measures for Data Clustering Locally Linear Embedding (LLE) considers the high dimensional data as locally linear patches. The linear patches are put in the embedded low dimensional space so
that the local neighborhood of each data point is preserved. This is obtained by using the same reconstruction matrix calculated in the original high dimensional space as weights in the low dimensional embedded space to reconstruct the data points. LLE calculates the low dimensional embedding of high dimensional data by preserving the neighborhood structure of each data point. In the original data space, the K nearest neighbors are found for each data point. The basic idea of LLE is to reconstruct each data point in the low dimensional space using its K neighbors (Fig. 10). The reconstruction matrix is made up of linear coefficients that reconstruct each data point from its neighbors and is obtained by minimizing the error between one data point and the linear combination of its neighbors with the constraint that the rows of the weight matrix sum to one [63]. Li and Rajan [64] have used this approach to accurately generate classifications among the spinel structures into semiconductor, metal and semimetal behavior.
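The following sketch (illustrative only, not the calculation of Li and Rajan) shows how a locally linear embedding of a scaled descriptor matrix could be computed with scikit-learn; the number of neighbors and the descriptor matrix itself are placeholders.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))                 # placeholder: 40 compounds x 6 descriptors
X_scaled = StandardScaler().fit_transform(X)

# The steps mirror Fig. 10: find the K nearest neighbors of each point, compute
# the reconstruction weights, then solve for low-dimensional coordinates that
# preserve those weights.
lle = LocallyLinearEmbedding(n_neighbors=8, n_components=2)
Y = lle.fit_transform(X_scaled)

print("embedding shape:", Y.shape)
print("reconstruction error:", lle.reconstruction_error_)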
Fig. 10 Steps of locally linear embedding: (1) assign neighbors to each data point Xi (for example by using the K nearest neighbors); (2) compute the weights Wij that best linearly reconstruct Xi from its neighbors; (3) compute the low dimensional embedding vectors Yi (from [63])
4 Data Prediction: Structure–Property Relationships The orthogonality of the PCs eliminates the multidimensional problem, but the problem of choosing an optimum subset of predictors remains. A possible strategy is to keep only the first few components; however, these are chosen to explain X rather than Y, and so nothing guarantees that the PCs, which "explain" X, are relevant for Y. The next step in examining the data in PCA space is to see if we can develop models based on this new data set. In other words we want to try to set up a regression model between the scores and not just the original data. It is here that we introduce the concept of partial least squares regression (PLS). PLS models are based on PCs of both the independent data X and the dependent data Y. The central idea is to calculate the PC scores of the X and the Y data matrix and to set up a regression model between the scores (and not the original data). Hence, in contrast to PCA, PLS regression finds components from X that are also relevant for Y. Specifically, PLS regression searches for a set of components (called latent vectors) that performs a simultaneous decomposition of X and Y with the constraint that these components explain as much as possible of the covariance between X and Y. This step generalizes PCA. It is followed by a regression step where the decomposition of X is used to predict Y. As noted by Tobias [65], the emphasis in PLS is on predicting the responses and not necessarily on trying to understand the underlying relationship between the variables. For example, PLS is not usually appropriate for screening out factors that have a negligible effect on the response. However, when prediction is the goal and there is no practical need to limit the number of measured factors, PLS can be a useful tool. When the number of factors gets too large (for example, greater than the number of observations), you are likely to get a model that fits the sampled data perfectly but that will fail to predict new data well. This phenomenon is called over-fitting. In such cases, although there are many manifest factors, there may be only a few underlying or latent factors that account for most of the variation in the response. The general idea of PLS is to try to extract these latent factors, accounting for as much of the manifest factor variation as possible while modeling the responses well. For this reason, the acronym PLS has also been taken to mean "projection to latent structure." Hence PLS expresses a dependent variable (target property) in terms of linear combinations of the PCs. As the descriptors themselves are interdependent (covariant), the PCs so generated are independent (orthogonal). The PLS equation, then, assumes the following form for the case of "n" descriptors: Target property = a0 + a1(PC1) + a2(PC2) + a3(PC3) + · · · + an(PCn). The PLS method can be applied to rationalize the materials attributes relevant to a materials function or property, and this permits one to use PLS methods to develop explicit quantitative relationships that identify the relative contributions of the different data descriptors, combined linearly, to the final property.
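A minimal sketch of PLS regression with scikit-learn is given below (not from the original work); the descriptor matrix X, the target property y, and the number of latent components are synthetic placeholders chosen only to illustrate the workflow.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 7))                            # placeholder: 30 compounds x 7 descriptors
y = X @ rng.normal(size=7) + 0.1 * rng.normal(size=30)  # synthetic target property

X_scaled = StandardScaler().fit_transform(X)

# PLS extracts latent vectors that explain the covariance between X and y,
# then regresses the target property on those latent vectors.
pls = PLSRegression(n_components=3)
pls.fit(X_scaled, y)

print("R^2 on the training data:", pls.score(X_scaled, y))
print("regression coefficients:", pls.coef_.ravel())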
While a fundamental tenet in materials science is to establish structure–property relationships, it is the life sciences and organic chemistry community that has formally introduced the concept of quantitative structure–activity (or property) relationships (QSAR or QSPR). Using this approach, Suh and Rajan [66] explored the attributes used in electronic structure calculations and their influence on predicting bulk modulus. Using PLS, a QSAR was developed relating the bulk modulus to a variety of descriptors:
Bulk modulus = 1.00096 EN − 0.35682 x − 0.77228 BL(A–N) − 0.83367 BL(B–N) + 0.03296 Q*tet + 0.18484 Q*oct − 0.1350 Q*N    (1)
where EN is the weighted electronegativity difference, x the internal anion parameter, BL(A–N) the A–N bond length, BL(B–N) the B–N bond length, Q*tet the Mulliken effective charge for the tetrahedral site ion, Q*oct the Mulliken effective charge for the octahedral site ion and Q*N the Mulliken effective charge for the N ion. By systematically exploring the number and type of variables needed, they found very strong agreement in being able to predict properties consistent with ab-initio calculations based strictly on a data driven analysis (Fig. 11). Based on our QSAR formulation, the role of the effective charge (Q*) in enhancing modulus is particularly notable. This is consistent with theoretical studies that show it is the effective charge parameter which helps to define the degree of charge transfer and the level of covalency associated with the specific site occupancy of a given
Fig. 11 Example of data table and quantum mechanical descriptors used in PLS based predictions of bulk modulus as compared to ab-initio calculations; the plot compares the PLS-derived bulk modulus (GPa, vertical axis) with the ab initio derived bulk modulus (GPa, horizontal axis), both spanning roughly 220–340 GPa, for spinel nitrides including c-SiC2N4, c-TiC2N4, c-GeC2N4, c-TiGe2N4, c-TiZr2N4 and c-GeSi2N4. (From [59])
species. Ab-initio calculations of this effective charge can then be used as a major screening parameter in identifying promising crystal chemistries for promoting the modulus. Hence, using PLS to develop a QSAR formulation combined with an interpretation of the physics governing these materials can indeed be valuable. Our predictions fit well with systems of similar electronic structure and allow us to clearly identify outliers based on these quantum mechanical calculations. Based on these predictions we can now effectively accelerate materials design by focusing on promising candidate chemistries. Those selected can then be subjected to further analysis via experimentation and computational methods to validate crystal-structure-level properties. The data generated by these selective experiments and computations also serve to refine the next generation of "training" data for another iterative round of data mining, which permits a further refinement of high-throughput predictions.
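To show how such a QSAR can be used for screening, the sketch below simply evaluates Eq. (1) for a list of candidate descriptor sets and ranks them; the descriptor values are placeholders and must in practice be prepared (scaled) exactly as in the original PLS fit, so the predicted moduli shown here are purely illustrative.

# Coefficients are those of Eq. (1); descriptor values below are hypothetical placeholders.
def bulk_modulus_qsar(EN, x, BL_AN, BL_BN, Q_tet, Q_oct, Q_N):
    """Evaluate the linear QSAR of Eq. (1) for one candidate chemistry."""
    return (1.00096 * EN - 0.35682 * x - 0.77228 * BL_AN - 0.83367 * BL_BN
            + 0.03296 * Q_tet + 0.18484 * Q_oct - 0.1350 * Q_N)

candidates = {
    "candidate-1": dict(EN=0.8, x=0.26, BL_AN=-0.3, BL_BN=-0.2, Q_tet=0.5, Q_oct=0.4, Q_N=-0.6),
    "candidate-2": dict(EN=0.5, x=0.25, BL_AN=-0.1, BL_BN=-0.4, Q_tet=0.7, Q_oct=0.2, Q_N=-0.5),
}

# Rank the candidate chemistries by their predicted (relative) bulk modulus.
for name, desc in sorted(candidates.items(),
                         key=lambda kv: bulk_modulus_qsar(**kv[1]), reverse=True):
    print(name, round(bulk_modulus_qsar(**desc), 3))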
5 Data Mining Aided Structure Prediction The prediction of the structures of unknown compounds has been a grand challenge in crystallography [67–78]. Different strategies to attack this problem using data mining and/or statistical methods as part of their computational toolkit have been proposed. We shall summarize a few of those approaches here. Ceder and colleagues [79, 80] have used data mining on a high dimensional space of structural formation energies, creating a 114-dimensional space. They applied PCA to identify correlations among ab initio energies of different structures in different materials from this large search space acquired from crystallographic databases. They developed an algorithmic approach using Bayesian probabilities to help predict crystal structures based on the availability of known information and other data mining methods. From this they could use data mining to screen for the most likely crystal structure from a candidate list of compounds using first principles calculations (Fig. 12). Zunger and colleagues [81, 82] have used genetic algorithms (GA) as the primary data mining strategy to select the best components of model Hamiltonians from a vast set of possible crystal structures using a handful of first principles calculations. They have successfully predicted unknown ground states for some chemically distinct systems with the accuracy of a first principles Hamiltonian. They built their approach around a pool of atomistic clusters, termed Many Body Interaction Types (MBITs), on which they applied GA methods to search for the best MBIT to fit a known set of structures. Their algorithm was successful in finding the correct MBIT over extremely large search spaces (Fig. 13). Woodley et al. [83] have also applied GA techniques for structure prediction. They developed a multistage scheme for the structure prediction of viable clusters, in which a GA is implemented within the first stage to generate approximate structures that can be refined via direct energy minimization techniques at a later stage. Inclusion and exclusion zones, regions of space where
Fig. 12 An example developed for predicting the structure of AgMg3. (a) A candidate list of crystal structures for AgMg3 on the basis of the limited data available at other compositions (green box). The structures are ordered by decreasing probability within their data mining algorithm (which they term "DMSP"). This ordering is compared with a ranking on the basis of the frequency with which these structures occur in the experimental database (parenthesized value in the candidate list). (b) Ab initio formation enthalpy (with respect to the pure elements) of the top five structures along with 26 additional structure types calculated to aid in verifying the prediction. (From [72])
Fig. 13 Genetic algorithm (GA)-based identification of the optimally predictive sets of MBITs: Cluster-expansion-generated input data for the Mo–Ta alloy so that the underlying MBITs are known exactly. The plot shows the development of cross-validation scores of the entire population of trial solutions as a function of GA generation, and the set of MBITs that constitutes the optimum solution. (From [73])
atoms are allowed and forbidden, respectively, are used to reduce the search space and to steer the predicted structures so that they have a predefined architecture, e.g., bubble-like as opposed to dense phase-like. In a final stage, electronic, rather than atomistic, calculations were performed. The method has, for instance, successfully generated sets of stable II–VI semiconductor clusters (MX)n, with n ranging from 1 to 15 (Fig. 14). Javed et al. [84] have used support vector machines (SVM) to predict lattice constants associated with orthorhombic ABO3 perovskites. Unlike the other methods, they have used data mining methods as a way to directly make predictions. They took advantage of the predictive capability of techniques such as support vector regression (SVR), which finds a functional form for the data that can then be used to predict values for samples not previously presented to the SVR. SVR is very useful for nonlinear regression estimation. The estimated function is a linear expansion in terms of functions defined on a certain subset of the data
Fig. 14 Examples of predicted ZnS cluster geometries derived from genetic algorithms (From [75])
(support vectors), and the final number of coefficients in such an expansion does not depend on the dimensionality of the space of input variables. These two properties make SVM an especially useful technique for dealing with very large data sets in a high dimensional space.
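A minimal sketch of support vector regression applied to lattice-constant prediction is given below (not the model of Javed et al.); the feature set, kernel choice, hyperparameters, and training data are all placeholders.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))           # placeholder features, e.g. ionic radii and electronegativities
a = 5.5 + 0.3 * X[:, 0] - 0.1 * X[:, 1] + 0.05 * rng.normal(size=50)   # synthetic lattice constants

# Scale the features, then fit an RBF-kernel support vector regressor; the fitted
# model is a linear expansion over the support vectors, so the number of terms
# does not depend on the dimensionality of the input space.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.01))
model.fit(X, a)

X_new = rng.normal(size=(3, 4))        # descriptor vectors of "unseen" compounds
print("predicted lattice constants:", model.predict(X_new))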
6 Descriptor Refinement via Data Mining The above discussion has shown how data mining can help develop new classification schemes in crystallography linking chemistry with structure. This was based on developing a diverse crystal descriptor database. Sometimes, however, while the information may appear to be comprehensive, it is not sufficient if we want to go to the next step of utilizing data mining as a predictive tool. The in silico design of materials for specific property values involves: (a) the use of the proper descriptors containing pertinent information that influences the property; and (b) the proper data mining method. The former involves choosing structural, chemical, and processing variables based on experimental and theoretical evidence and statistical criteria. The choice of data mining method in the latter is a function of both the data distribution and the query to be answered. In the following we discuss the value of "secondary" or derived descriptors in attempting to identify descriptor parameterizations that may contain more physically meaningful information, which can subsequently serve as a guide to the above design process. We have used PCA primarily as a tool to seek classification behavior and PLS to lay down the basis for predictive analysis. Only when we include the proper secondary descriptors do we have a higher predictive capability. In general, primary descriptors are property, structural and process variables that are measurable, or computable, and that have a physical meaning attributed to them. Thus, primary descriptors like the lattice parameters are not optimized statistically but are used because they can be measured comparatively easily and have a well-established scientific significance. The ability to use data mining methods effectively is undermined if we choose to use only these unoptimized variables. This is where "secondary descriptors" can fill the gap well. These are essentially descriptors derived from the primary variables in order to enhance the data mining performance. It is important to note that this is not the same as using some random variables. A random sequence may very well correspond to the primary descriptor itself, but that does not undercut the value of the primary descriptor, because it has a well-established physical significance. In a similar vein, if the secondary descriptors have physical meaning relevant to the problem at hand, we cannot dismiss the use of these descriptors (Fig. 15). For instance, Rajagopalan and Rajan [85] have used data mining techniques to develop secondary descriptors, extending from zeolite frameworks with no compositional information to zeolite analogues with different compositions. We have also demonstrated that our selection of secondary descriptors can capture
[Fig. 15 panels: (left) PLS prediction with space group descriptors; (right) PLS prediction with lattice constants, space groups and secondary descriptors. Each panel plots the predicted M–O bond length against the measured M–O bond length (approximately 1.6–2.0 Å) for the numbered framework structures in the data set.]
Fig. 15 A comparison of the two data mining based predictive models for predicting bond lengths in complex framework structures. The figure on the right with the better fit is based on data mining calculations using additional “secondary descriptors.” (From [77])
the complexities of chemistry and other geometrical parameters: while PCA using primary descriptors alone often lacks physical and chemical meaning, the secondary descriptors can offer much more promise.
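As a rough illustration of the idea (not the calculation of Rajagopalan and Rajan), the sketch below derives a few "secondary" descriptors from hypothetical primary ones and compares PLS fits of a bond length with and without them; all names and data are placeholders generated synthetically.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)
a, b, c = rng.uniform(4, 7, (3, 60))            # placeholder lattice constants
space_group = rng.integers(1, 231, 60)          # placeholder space-group numbers
bond_length = 1.6 + 0.05 * (a * b * c) ** (1 / 3) + 0.02 * rng.normal(size=60)

primary = np.column_stack([a, b, c, space_group])
# "Secondary" descriptors derived from the primary ones (e.g. cell volume, axial ratios).
secondary = np.column_stack([a * b * c, a / c, b / c])
augmented = np.column_stack([primary, secondary])

# Compare PLS fits of the bond length with and without the derived descriptors.
for label, X in [("primary only", primary), ("primary + secondary", augmented)]:
    Xs = StandardScaler().fit_transform(X)
    r2 = PLSRegression(n_components=3).fit(Xs, bond_length).score(Xs, bond_length)
    print(f"{label}: R^2 = {r2:.3f}")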
7 Conclusions A fundamental issue in inorganic crystallography is to understand the relationship between chemical stoichiometry and crystal structure. The relationship between specific compounds and specific crystal structures is usually developed heuristically by surveying crystallographic data of known compounds. This process of structure–chemistry association has laid the historical foundations of identifying crystal structure prototypes and structural classifications. In this chapter we demonstrate how informatics can quantitatively accelerate the discovery of structure–chemistry relationships, which can also be used as the foundation for developing structure–chemistry–property relationships. Acknowledgments The author gratefully acknowledges support from the National Science Foundation–International Materials Institute Program for the Combinatorial Sciences and Materials Informatics Collaboratory (CoSMIC-IMI), grant # DMR-08–33853; the Air Force Office of Scientific Research, grant # FA95500610501 and grant # FA95500810316; and the Defense Advanced Research Projects Agency – Center for Interfacial Engineering for MEMS, grant # HR 0011–06–1–0049.
Appendix Computational Case Study: Principal Component Analysis For Classifying Crystal Chemistry–Property Relationships In Semiconductors (Adapted From [86])
In this chapter, we present a tutorial type example of one of the techniques described in this article, namely principal component analysis. We shall take an example of how we can map classification schemes for chemistry–property relationships in compound semiconductors, and provide a step-by-step link between the algorithmic strategies of PCA and the manipulation of the data associated with this method. Given a data matrix A, we wish to calculate the matrix X representing the principal components, which is then plotted in terms of scores and loadings:
1. Scaling (for example, UV scaling): the raw m × n data matrix A = (aij) is converted into the scaled matrix X = (aij/skj), where skj is the scaling factor for column j.
2. Covariance matrix: S = cov(X) = X^T X/(m − 1).
3. Eigenvalue decomposition of S, which leads to the model X = t1 p1^T + t2 p2^T + ... + tk pk^T + E, with k ≤ min{m, n}, where the ti are the scores and the pi the loadings.
The data matrix A in this case lists a series of attributes associated with each compound in the database. In this case it is a 44 × 6 matrix (44 compounds with 6 variables):
Compound  Atomic no.  Melting point  VE  Radii  EN  Lattice const. (Å)
AlN  10  498  4  1.135  −1.21  3.11
AlP  14  625  4  0.435  −0.68  5.47
AlAs  23  1,011.5  4  0.26  −0.63  5.66
AlSb  32  918.5  4  −0.09  −0.5  6.14
GaN  19  182.5  4  1.155  −1.15  3.16
GaP  23  309.5  4  0.455  −0.62  5.45
GaAs  32  545  2.5  0.28  −0.57  5.65
GaSb  41  603  4  −0.07  −0.44  6.1
InN  28  246  4  1.51  −1.22  3.54
InP  32  373  4  0.81  −0.69  5.87
InAs  41  759.5  4  0.635  −0.64  6.06
InSb  50  666.5  4  0.285  −0.51  6.48
ZnS  23  540  9  0.78  −1.21  5.41
ZnSe  32  593  9  0.595  −1.1  5.67
ZnTe  41  707  9  0.21  −0.94  6.1
CdSe  41  544  9  0.93  −1.14  6.05
CdTe  50  658  9  0.545  −0.98  6.48
MgS  14  655  4  0.93  −1.34  5.7
MgSe  23  708  4  0.745  −1.23  5.9
(AlGa)0.5N  14.5  340.25  4  1.135  −1.21  3.14
(AlIn)0.5N  19  372  4  1.135  −1.21  3.32
(GaIn)0.5N  23.5  214.25  4  1.155  −1.15  3.31
(ZnMg)0.5S  18.5  597.5  6.5  0.78  −1.21  5.52
(SSe)0.5Mg  18.5  681.5  4  −0.93  1.34  5.81
(SSe)0.5Zn  27.5  566.5  9  −0.78  1.21  5.52
(ZnMg)0.5Se  27.5  650.5  6.5  0.595  −1.1  5.78
(ZnCd)0.5Se  36.5  568.5  9  0.595  −1.1  5.86
(SeTe)0.5Zn  36.5  650  9  −0.595  1.1  5.9
(SeTe)0.5Cd  45.5  601  9  −0.93  1.14  6.28
(ZnCd)0.5Te  45.5  682.5  9  0.21  −0.94  6.27
(AlGa)0.5P  18.5  467.25  4  0.435  −0.68  5.46
(PAs)0.5Ga  27.5  502.75  4  −0.455  0.62  5.61
(AlGa)0.5As  27.5  853.75  4  0.26  −0.63  5.65
(GaIn)0.5P  27.5  341.25  4  0.455  −0.62  5.68
(AlIn)0.5P  23  499  4  0.435  −0.68  5.69
(AsSb)0.5Al  27.5  965  4  −0.26  0.63  5.88
(AlZn)0.5As  27.25  951.25  6.25  0.26  −0.63  5.87
(AsSb)0.5Ga  36.5  649.5  4  −0.28  0.57  5.85
(AlGa)0.5Sb  36.5  760.75  4  −0.09  −0.5  6.11
(GaIn)0.5As  36.5  727.75  4  0.28  −0.57  5.82
(PAs)0.5In  36.5  566.25  4  −0.81  0.69  5.94
(AlIn)0.5Sb  41  792.5  4  −0.09  −0.5  6.27
(GaIn)0.5Sb  45.5  634.75  4  −0.07  −0.44  6.26
(AsSb)0.5In  45.5  713  4  −0.635  0.64  6.24
The data has to be scaled since the variables have different units, and we then calculate the covariance matrix S associated with the scaled matrix of A, which we designate as X (Note: EN – electronegativity; VE – number of valence electrons); X is thus the auto-scaled 44 × 6 version of A. Eigenvalue decomposition of S yields the eigenvalues Λ = diag(3.0563, 1.2247, 0.8700, 0.5865, 0.2204, 0.0420), associated with PC1 through PC6, respectively, and the corresponding eigenvectors P, the 6 × 6 loadings matrix whose columns give the weight of each of the six original descriptors in each principal component. This now allows us to calculate the scores matrix T = XP, a 44 × 6 matrix in which each compound is expressed in the rotated (principal component) coordinate system. [The full numerical X, P (loadings), and T (scores) matrices of the original worked example are not reproduced here.]
We are now ready to plot the principal component projections for both the scores and the loadings; in this example the first two principal components are used for the plots, with PC1 capturing 50.94% and PC2 capturing 20.41% of the total variance.
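The appendix workflow can be reproduced with a few lines of NumPy, as sketched below; this is illustrative only, and since the matrix shown is just a fragment of the full 44 × 6 table, the resulting variance percentages will differ from the PC1/PC2 values quoted above.

import numpy as np

# Fragment of the 44 x 6 data matrix A (atomic no., melting point, VE, radii, EN,
# lattice constant); the full table is given above.
A = np.array([
    [10, 498.0, 4.0, 1.135, -1.21, 3.11],   # AlN
    [14, 625.0, 4.0, 0.435, -0.68, 5.47],   # AlP
    [19, 182.5, 4.0, 1.155, -1.15, 3.16],   # GaN
    [32, 545.0, 2.5, 0.280, -0.57, 5.65],   # GaAs
    [50, 666.5, 4.0, 0.285, -0.51, 6.48],   # InSb
    [23, 540.0, 9.0, 0.780, -1.21, 5.41],   # ZnS
    [32, 593.0, 9.0, 0.595, -1.10, 5.67],   # ZnSe
    [50, 658.0, 9.0, 0.545, -0.98, 6.48],   # CdTe
])

X = (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)   # UV (auto) scaling
S = X.T @ X / (X.shape[0] - 1)                     # covariance matrix of the scaled data

eigvals, eigvecs = np.linalg.eigh(S)               # eigenvalue decomposition
order = np.argsort(eigvals)[::-1]                  # sort PCs by decreasing variance
eigvals, P = eigvals[order], eigvecs[:, order]     # P = loadings matrix

T = X @ P                                          # T = scores matrix
print("variance captured by each PC (%):", 100 * eigvals / eigvals.sum())
print("scores on the first two PCs:\n", T[:, :2])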
References 1. Tan P-N, Steinbach MV, Kumar V (2006) Introduction to data mining. Addison-Wesley, Boston 2. Maimon O, Last M (2001) Knowledge discovery and data mining. Kluwer Academic Publishers, Dordrecht 3. Pal NR, Jain L (eds) (2004) Advanced techniques in data mining and knowledge discovery. Springer, London 4. Belsky A, Hellenbrandt M, et al (2002) New developments in the inorganic crystal structure database (ICSD): accessibility in support of materials research and design. Acta Crystallogr B 58:364–369 5. Bergerhoff G, Berndt M, et al (1999) Concerning inorganic crystal structure types. Acta Crystallogr B 55:147–156 6. Collins A, Barr G et al (2007) The application of cluster analysis to identify conformational preferences in enones and enimines from crystal structural data. Acta Crystallogr B 63: 469–476 7. Foster MD, Simperler A, et al (2004) Chemically feasible hypothetical crystalline networks. Nat Mater 3:234–238 8. Foster MD, Treacy MMJ, et al (2005) A systematic topological search for the framework of ZSM-10. J Appl Crystallogr 38:1028–1030 9. Han SX, Smith JV (1999) Enumeration of four-connected three-dimensional nets. I. Conversion of all edges of simple three-connected two-dimensional nets into crankshaft chains. Acta Crystallogr A 55:332–341 10. Han SX, Smith JV (1999) Enumeration of four-connected three-dimensional nets. III. Conversion of edges of three-connected two-dimensional nets into saw chains. Acta Crystallogr A 55:360–382 11. Hawthorne F (1994) Structural aspects of oxide and oxysalt crystals. Acta Crystallogr B 50:481–510 12. Igartua JM, Aroyo MI, et al (1999) Search for Pnma materials with high-temperature structural phase transitions. Acta Crystallogr B 55:177–185 13. Igartua JM, Aroyo MI, et al (1996) Systematic search of materials with high-temperature structural phase transitions: Application to space group P2(1)2(1)2(1). Phys Rev B 54(18):12744–12752 14. Jenkins HDB, Roobottom HK et al (1999) Relationships among ionic lattice energies, molecular (formula unit) volumes, and thermochemical radii. Inorg Chem 38(16):3609–3620 15. Kiselyova NN (2000) Databases and semantic networks for the inorganic materials computer design. Eng Appl Artif Intell 13:533–542 16. Kroumova E, Aroyo MI, et al (2000) Systematic search of displacive ferroelectrics. Ferroelectrics 241:1939–1946 17. Lord EA, McKay AL (2003) Periodic minimal surfaces of cubic symmetry. Curr Sci 85: 346–362 18. Majda D, Paz FAA, et al (2008) Hypothetical zeolitic frameworks: in search of potential heterogeneous catalysts. J Phys Chem C 112:1040–1047 19. Mercier PHJ, Le Page Y et al (2005) Geometrical parameterization of the crystal chemistry of P63/m apatites: comparison with experimental data and ab initio results. Acta Crystallogr B 61:635–655 20. Feng LM, Jiang LQ et al (2008) Formability of ABO(3) cubic perovskites. J Phys Chem Solids 69:967–974 21. Hauck J, Mika K (2002) Structure maps for crystal engineering. Cryst Eng 5:105–121 22. Korotkov AS, Alexandrov NM (2006) Structure quantitative map in application for AB2 X4 system. Comput Mater Sci 35:442–446 23. Michael Clark P, Lee S et al (2005) Transition metal AB3 intermetallics: structure maps based on quantum mechanical stability. J Solid State Chem 178:1269–1283
24. Villars P, Daams J, Shikata Y, Rajan K, Iwata S (2008) A new approach to describe elemental-property parameters. Chem Met Alloys 1:1–23 25. Foster MD, Friedrichs OD et al (2004) Chemical evaluation of hypothetical uninodal zeolites. J Am Chem Soc 126:9769–9775 26. Sickafus KE, Wills JM, Grimes NW (1999) J Am Ceram Soc 82:3279–3292 27. Suh C, Rajan K (2009) Informatics for chemical crystallography. J Met 61:48–53 28. Pettifor DG (1995) In: Westbrook JH, Fleischer RL (eds) Structure mapping, vol 1. Wiley, Chichester, pp 419–438 29. Mooser E, Pearson WB (1959) Acta Cryst 12:1015–1022 30. Pettifor D (1995) Bonding and structure of molecules and solids. Oxford University Press, Oxford 31. Phillips JC, Van Vechten JA (1969) Phys Rev Lett 22:705–708 32. Suh C, Rajan K (2008) Data mining and informatics for crystal chemistry: establishing measurement techniques for mapping structure-property relationships. J Mater Sci Technol (in press) 33. Söderberg K, Boström M, Kubota Y, Nishimatsu T, Niewa R, Häussermann U, Grin Y, Terasaki O (2006) J Solid State Chem 179:2690–2697 34. Villars P (1983) J Less Common Met 92:215–238 35. Villars P, Hulliger F (1987) J Less Common Met 132:289–315 36. Villars P, Phillips JC (1988) Phys Rev B 37:2345–2348 37. Wei S, Zhang SB (2001) Phys Rev B 63:045112–045118 38. Zhang SB, Cohen ML (1989) Phys Rev B 39:1077–1080 39. Zhang SB, Cohen ML, Phillips JC (1988) Phys Rev B 38:12085–12088 40. Zunger A (1981) In: O'Keeffe M, Navrotsky A (eds) A pseudopotential viewpoint of the electronic and structural properties of crystals, vol I. Academic Press, NY, pp 73–135 41. Jackson JE (1991) A users guide to principal components. Wiley, NY 42. Johnson RA, Wichern DW (2002) Applied multivariate statistical analysis. Upper Saddle River, Prentice Hall 43. Jolliffe IT (2002) Principal component analysis. Springer-Verlag, NY 44. Bajorath J (2002) Integration of virtual and high-throughput screening. Nat Rev Drug Discov 1(11):882–894 45. Quakenbush J (2001) Computational analysis of microarray data. Nat Rev Genet 2:418 46. Li G, Rosenthal C, Rabitz H (2001) High dimensional model representations. J Phys Chem A105:7765–7777 47. Rajan K, Rajagopalan A, Suh C (2002) Data mining and multivariate analysis in materials science. In: Gaune-Escard M (ed) Molten salts – fundamentals to applications. Kluwer Academic, Norwell, MA, p 241–248 48. Suh C, Rajagopalan A, Li X, Rajan K (2003) Chemical discovery in molten salts through data mining. In: Øye HA, Jagtøyen A (eds) International Symposium on Ionic Liquids; Festschrift in honor of Prof M. Gaune-Escard. Norwegian University of Science and Technology, Trondheim, Norway, pp 587–599 49. Gadzuric S, Suh C, Gaune-Escard M, Rajan K (2006) Extracting information from molten salt database. Met Trans A 37:3411–3414 50. Rajan K (2008) Combinatorial materials sciences: experimental strategies for accelerated knowledge discovery. Annu Rev Mater Res vol 347 51. Kamath C, Wade N, Karypis G, Pandey G, Kumar V, Rajan K, Samatova NF, Breimyer P, Kora G, Pan C, Yoginath S (2009) Scientific data analysis. In: Shoshani A, Rotem D (eds) Scientific data management. Taylor and Francis, UK (in press) 52. Rajan K (2009) Combinatorial materials science and EBSD. In: Schwartz AJ, Kumar M, Adams BL (eds) Electron backscatter diffraction in materials science-2. Taylor and Francis, UK 53. Zaki M, Rajan K (2001) Data mining: a tool for materials discovery. Proceedings of 17th CODATA meeting, Baveno, Italy. www.codata.org
54. Rajan K (2000) An informatics approach to interface characterization: establishing a “materials by design” paradigm. In: Ankem S, Pande CS (eds) Science and technology of interfaces. TMS, Warrendale, PA 55. Rajan K, Suh C, Rajagopalan A, Li X (2002) Quantitative structure-activity relationships (QSARs) for materials science. In: Takeuchi I et al (eds) Artificial intelligence and combinatorial materials science, MRS, Pittsburgh PA, 700, pp S7.5.1–10 56. Suh C, Rajagopalan A, Li X, Rajan K (2002) Applications of principal component analysis in materials science. Data Sci J 1:19–26 57. Rajan K, Rajagopalan A, Suh C (2002) Data mining and multivariate analysis in materials science. In: Gaune-Escard M (ed) Molten salts – fundamentals to applications. Kluwer Academic, p 241–248 58. Rajan K (2005) Materials Informatics. Mater Today 8:38 59. Hill AC (1979) Phys Chem Miner 4:317–339 60. Lee JA, Verleysen M (2007) Nonlinear dimensionality reduction. Springer, NY 61. Tanenbaum JB, deSilva V, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290:2319–2323 62. Meco H, Suh C, Rajan K (2008) unpublished work 63. Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290:2323–2326 64. Li X, Rajan K (2008) unpublished work 65. Tobias RD (1997) An introduction to partial least squares regression. SAS Institute, Cary, NC 66. Suh C, Rajan K (2005) Virtual screening and QSAR formulations for crystal chemistry. QSAR Comb Sci 24:114–119 67. Baur WH, Danl D (1998) Can we predict crystal structures of minerals? Leopoldina Meeting on can Crystal Structures be Predicted. Dresden, Germany 68. Burton AW (2007) A priori phase prediction of zeolites: case study of the structure-directing effects in the synthesis of MTT-type zeolites. J Am Chem Soc 129:7627–7637 69. Caracas R, Wentzcovitch RM (2006) Theoretical determination of the structures of CaSiO3 perovskites. Acta Crystallogr B 62:1025–1030 70. Della Valle RG, Venuti E, et al (2008) Are crystal polymorphs predictable? The case of sexithiophene. J Phys Chem A 112:6715–6722 71. Doll K, Schon JC, et al (2007) Global exploration of the energy landscape of solids on the ab initio level. Phys Chem Chem Phys 9:6128–6133 72. Freeman CM, Newsam JM et al (1993) J Mater Chem 3:531–535 73. Gagliardi L (2006) Prediction of new inorganic molecules with quantum chemical methods. Theor Chem Acc 116:307–315 74. Le A (2005) Inorganic structure prediction with GRINSP. J Appl Crystallogr 38:389–395 75. Meden A (2006) Inorganic crystal structure prediction – a dream coming true? Acta Chimica Slovenica 53:148–152 76. Mellot-Draznieks C, Dutour J, et al (2004) Hybrid organic-inorganic frameworks: routes for computational design and structure prediction. Angew Chem Int Ed 43:6290–6296 77. Mellot-Draznieks C, Ferey G (2005) Assembling molecular species into 3D frameworks: Computational design and structure solution of hybrid materials. Prog Solid State Chem 33:187–197 78. Mellot-Draznieks C, Girard S, et al (2002) Computational design and prediction of interesting not-yet-synthesized structures of inorganic materials by using building unit concepts. Chem Eur J 8:4103–4113 79. Curtarolo S, Morgan D, et al (2003) Predicting crystal structures with data mining of quantum calculations. Phys Rev Lett 91:135503 80. Fischer CC, Tibbetts KJ, et al (2006) Predicting crystal structure by merging data mining with quantum mechanics. Nat Mater 5:641–646 81. 
Hart GLW, Blum V, Walorski MJ, Zunger A (2005) Nat Mater 4:391–394 82. Blum V, Hart GLW, Walorski MJ, Zunger A (2005) Phys Rev B 72:165113
83. Woodley SM, Sokol AA, et al (2004) Structure prediction of inorganic nanoparticles with predefined architecture using a genetic algorithm. Zeitschrift Fur Anorganische Und Allgemeine Chemie 630:2343–2353 84. Javed SG, Khan A, et al (2007) Lattice constant prediction of orthorhombic ABO3 perovskites using support vector machines. Comput Mater Sci 39:627–634 85. Rajagopalan A, Rajan K (2005) In: Maier W, Potyrailo RA (eds) Combinatorial and highthroughput discovery and optimization of catalysts and materials. CRC press, NY 86. Suh C, Rajan K (2003) Combinatorial design of semiconductor chemistry for bandgap engineering: “virtual” combinatorial experimentation. Appl Surf Sci 223:148–158
Struct Bond (2010) 134:89–134 DOI: 10.1007/430_2009_2 © Springer-Verlag Berlin Heidelberg 2009 Published online: 1 September 2009
Data Mining in Organic Crystallography Detlef W.M. Hofmann
Abstract Large data bases of crystal structures available today provide prospects for data mining. In this chapter we will demonstrate the derivation of an empirical force field from an experimental data base and its application to the most challenging problems to be solved in structural organic crystallography, e.g. estimation of sublimation energy, crystal structure prediction, and crystal engineering. Another example of data mining given in the chapter is the clustering of crystal structures. This approach allows for the recognition of similarities between crystals and for the removal of manifold (duplicate) entries in virtual and experimental databases.
Keywords: Crystal structure prediction · Data mining · Force field · Powder diagram comparison · Crystal structure similarity
Contents
1 Introduction .......................................................... 90
2 Cluster Analysis in Crystallography .................................... 93
2.1 Types of Clustering Algorithms ........................................ 93
2.2 Recognition of the Similarities of Crystal Structures ................. 94
2.3 Clustering of Crystal Structures ...................................... 96
2.4 Crystal Structure Determination as an Application of Similarity Searching ... 98
3 Support Vector Machines and Intermolecular Interactions ................ 100
3.1 Algorithm of Data Mining Force Field Derivation ...................... 101
3.2 Recognition of Faulty Crystal Structures in Data Bases ............... 109
3.3 Validation of DMFF ................................................... 111
4 Applications of the Data Mining Force Field in Organic Crystallography . 118
4.1 Docking .............................................................. 119
4.2 Melting Points ....................................................... 121
5 Data Mining and Crystal Structure Prediction ........................... 123
5.1 Generation of Crystal Structures ..................................... 123
5.2 Minimization ......................................................... 125
5.3 Ranking of Crystal Structures with Data Mining Potentials ............ 126
5.4 Completeness of Structure Generation ................................. 126
5.5 Screening of Manifold Crystal Structures ............................. 127
6 Conclusions ............................................................ 131
References ............................................................... 131
D.W.M. Hofmann (✉)
CRS4, Edificio 1, Località Piscinamanna, 09010 Pula (CA), Italy
e-mail: [email protected]
1 Introduction The mainstream in computational crystallography of organic compounds today is crystal engineering, which combines the prediction, design and synthesis of organic crystals with desired properties. A large percentage of the descriptors usually used for the prediction of the properties of solids are determined by the properties of molecules. This approach runs up against the long-standing problem in crystal structure prediction of organic molecules: the ability of molecules to form more than one crystal structure (polymorphism). The properties of polymorphs can be completely different, even if the crystals are formed by the same atoms or molecules. Polymorphism has a strong industrial impact on the production of pharmaceuticals, agrochemicals, pigments, dyestuffs, foods, and explosives. Therefore, descriptors referring only to molecular information are insufficient for the purpose of crystal engineering. The final crystal structure depends, in addition, on the intermolecular interaction during the self-assembly process of crystallization (self-recognition), the thermodynamic properties of the system, kinetic factors such as heat and pressure, the type of solvent and so on. Consequently, a successful crystal engineering strategy requires taking these additional descriptors into account as intrinsic to the crystal structure prediction.
Crystal Structure Prediction
Over the past couple of decades, many methods have been developed for the purpose of crystal structure prediction. Usually, these methods involve three general steps:
• Calculating three-dimensional molecular structures from the chemical diagrams;
• Generating the crystal packings and minimizing lattice energies;
• Ranking the generated crystal packings.
A complete analysis of the reliability of individual methods and an objective picture of the status of the field can be found in the periodic blind tests started in 1999 [14, 15, 37, 46]. The crystal structure prediction of small organic molecules is usually approached as a problem in global energy minimization, assuming that the resulting structure is determined solely by thermodynamics. The main focus is the location of the lowest energy structures on the complex energy surface with a huge number of local minima. The huge number of local minima is not just an artifact of the prediction procedure; it also reflects the ability of many organic molecules
to form different crystal structures under the same or different conditions of crystallization [63] (polymorphism), making it extremely difficult to predict a certain crystal structure [65]. The obstacle for this energy-based approach is that most molecules are found to have many distinct crystal packing possibilities within a very narrow energy range [21]. A successful approach to crystal structure prediction requires a high quality force field and a good global optimization procedure. Until recently, developmental work focused on improving atom-atom model potential methods, both in the functional forms used and their parametrization. In spite of all the attempts to improve the atom-atom model potential method, until a few years ago crystal structure prediction within the framework of Molecular Mechanics was considered an unsolvable problem [18, 24, 70]. The ab initio approach for crystal structure prediction still remains too complicated for routine usage. Nevertheless the problem is of great significance because many important physical properties of solids (such as solubility, bioavailability, color and nonlinear optical properties) depend on the crystal structure.
Force Fields
It is important to note that the usage of the term "force field" in crystallography, chemistry and computational biology differs from the standard usage of the term in physics. In chemistry, "a force field" is defined as a potential function, while in physics, the term is used to denote the negative gradient of a scalar potential. In the context of molecular mechanics, a force field refers to the functional form and parameter sets used to describe the potential energy of a system of particles (typically, but not necessarily, atoms). Force fields are the heart of molecular mechanics (MM) and classical molecular dynamics (MD). The data mining force field (DMFF), being derived from static data at standard conditions, intrinsically includes the entropy. Therefore, it is appropriate for application in MM calculations at standard conditions. The importance of the entropic term becomes obvious for crystal structures, where, very often, at different conditions one can find different polymorphs. In contrast to MM, the MD approach requires force fields without the entropic term of the free energy, as, during the simulations, the temperature and the entropy appear as a result of the simulation. Force fields for MD can be derived by fitting the parameters to properties obtained from structural data on liquids at different temperatures and pressures (e.g. radial distribution functions [31]). The basic functional form of a force field encapsulates both bonded terms relating to atoms that are linked by covalent bonds, and nonbonded "noncovalent" terms. In crystallography, the Buckingham (6-exp) and the Lennard-Jones (6–9, 6–12) potentials are the most common for describing noncovalent forces. In widespread use are the parametrizations obtained from experimental data by Dreiding [41], ECEPP [44], COMPASS (for nonbonded terms [61]) and/or from high-level quantum mechanical calculations, (generalized) AMBER [66], DMAREL [68], and COMPASS for bonded terms [61]. Some of these force fields contain only the nonbonded parameters and assume the molecules to be rigid [26, 68]; other force
fields include bonded terms and describe flexible molecules [41, 44, 51, 61, 66]. As a new development, we can mention the reactive force fields, which can be used to model the formation and breaking of bonds in molecular dynamics simulations [17, 31]. Structures of small molecules [41] and proteins [66], radial distribution functions [31], and densities of liquids [61] have been used as experimental data for deriving parameters. In addition, it is necessary to mention a new method of energy evaluation, including atom-atom potentials derived purely from first principles calculations (tailor-made force field [50]) and the direct application of first principles electronic structure calculations to the crystal structure [52]. However, these promising approaches suffer from very long computational times (0.4 teraflop/crystal), even for medium-sized molecules of ten atoms. These developments contributed to the understanding of crystal structure prediction using the very quick (1.8 megaflop/crystal) data mining approach. As all types of nonbonded forces are strongly environment-dependent, it is necessary to derive the parameters directly from the crystal structures, rather than from the gas phase. This was achieved by Hofmann [26] using the data mining approach. He derived the force field functions as well as the sets of parameters from the crystal structural database itself. In addition, assuming that the observed crystal structures must be stable as well as kinetically accessible, using such information could be a way of adding some kinetic factors to the descriptors. In this chapter we will focus on the derivation of an empirical force field from an experimental structural data base and its application to the most challenging questions in structural organic crystallography.
Data Mining Data mining [16] is a very powerful approach to analyzing data and predicting properties. Today, it offers a suite of tools, which have been described in “An Introduction to Data Mining”. Typical areas of application are customer behavior, stock price prediction, and the process control of industrial plants. In contrast to other procedures, this approach does not need any a priori hypothesis about the behavior of the systems or about structure–property interconnections and allows, therefore, the description of highly complex processes. For crystal structure prediction, the analysis of intermolecular interactions is of great importance. Two of the data mining tools are of special importance for the derivation of an analytical expression for the intermolecular interactions: cluster analysis and the support vector machine. The first method allows for the comparison and clustering of crystal structures, the validation of the results of crystal structure prediction, the detection of duplicate entries in the databases, polymorph screening, and crystal structure determination from powder diagrams and from isostructural crystals. The second method allows for the separation of correct experimental crystal structures from wrong ones and from virtual crystal structures generated by a program. The obtained weights can be interpreted as energies belonging to the descriptors; they define
an intermolecular force field. The third method, which will not be described here, is principal component analysis. This tool might gain importance as soon as a method for crystal structure prediction is successful. It allows for the recognition of the most important descriptors and for a reduction in the complexity of the problem. This can significantly shorten the CPU time presently required.
2 Cluster Analysis in Crystallography Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Cluster analysis in crystallography is the classification of crystal structures into different groups (clusters), so that the structures in each subset share some common traits. The task of comparing crystal structures and identifying structural similarity is of immediate need in crystal engineering. The comparison of crystal structures, followed by clustering, is very important for crystal structure prediction (where a huge number of generated hypothetical structures may converge to the same structure, and it is highly beneficial to remove duplicates to produce a list of unique structures), as well as for crystal structure determination (where hundreds of simulated powder diagrams have to be compared with an experimental powder diagram), or for the databases themselves (to find errors and duplicate entries). Therefore, cluster analysis requires both a clustering algorithm and a method for similarity recognition.
2.1 Types of Clustering Algorithms Data clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find successive clusters using previously established clusters, whereas partitional algorithms determine all the clusters at once. Partitional algorithms divide a set X into non-overlapping “parts”, “blocks” or “cells” that cover all of X. More formally, these “cells” are both collectively exhaustive and mutually exclusive with respect to the set being partitioned. Hierarchical algorithms can be agglomerative (“bottom-up”) or divisive (“top-down”). Agglomerative algorithms begin with each element as a separate cluster and then merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters. Two-way clustering, co-clustering or bi-clustering, which should be mentioned here, are the names for clustering in which not only the objects are clustered, but also the features of the objects, i.e., if the data are represented in a data matrix, the rows and columns are clustered simultaneously. The most popular clustering algorithm in biology, pharmacy and crystallography is hierarchical clustering with the traditional representation of the hierarchy as a tree, or a dendrogram (Fig. 3), with individual elements at one end and a single
cluster containing every element at the other. Very famous examples of dendrograms are the evolutionary tree of Darwin and family trees. Cutting the tree at a given height gives a clustering at a selected precision; e.g., in family trees, different levels of relationship can be selected. Cutting at zero gives identities. The most common algorithms for agglomerative hierarchical clustering, used, for example, in the case of crystal structure determination, are single linkage, complete linkage, average linkage, and centroid linkage. Clustering starts by assigning each structure of the set to its own cluster. For example, if we have a set of 50 crystal structures, we will define 50 different clusters. Now, clusters are successively merged. For this purpose, the distance between the clusters has to be calculated. The distance between two clusters A and B is usually given by one of the following definitions:
Complete linkage: the maximum distance between elements of each cluster,
$$d(A,B) = \max\,\{\, d(x, y) : x \in A,\ y \in B \,\} \qquad (1)$$
Single linkage: the minimum distance between elements of each cluster,
$$d(A,B) = \min\,\{\, d(x, y) : x \in A,\ y \in B \,\} \qquad (2)$$
Average linkage: the mean distance between elements of each cluster,
$$d(A,B) = \frac{1}{|A|\cdot|B|} \sum_{x\in A}\sum_{y\in B} d(x, y) \qquad (3)$$
Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion), or when there is a sufficiently small number of clusters (number criterion). A graphical representation of the results of hierarchical cluster analysis is the dendrogram. The branches represent clusters obtained at each step of hierarchical clustering.
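To make the linkage definitions (1)–(3) concrete, the following sketch performs agglomerative clustering on a precomputed distance matrix. It is a minimal illustration under assumed conventions (the function names, the stopping rules and the greedy pair search are not taken from FlexCryst or any other program mentioned in this chapter).

```python
import numpy as np

def cluster_distance(D, A, B, linkage="average"):
    """Distance between clusters A and B (lists of structure indices),
    computed from the pairwise distance matrix D."""
    block = D[np.ix_(A, B)]
    if linkage == "single":        # minimum distance, Eq. (2)
        return block.min()
    if linkage == "complete":      # maximum distance, Eq. (1)
        return block.max()
    return block.mean()            # average linkage, Eq. (3)

def agglomerative(D, linkage="average", d_max=np.inf, k_min=1):
    """Merge the closest clusters until either the distance criterion d_max
    or the number criterion k_min stops the agglomeration."""
    clusters = [[i] for i in range(len(D))]     # each structure starts as its own cluster
    while len(clusters) > k_min:
        pairs = [(cluster_distance(D, clusters[i], clusters[j], linkage), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)                    # closest pair of clusters
        if d > d_max:                           # clusters too far apart: stop merging
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Applied to a set of crystal structures, D would contain the pairwise similarity indices discussed in the next subsection; cutting at d_max = 0 returns only groups of (numerically) identical structures.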
2.2 Recognition of the Similarity of Crystal Structures Today, several different computational methods to estimate the similarity of crystal structures, as well as the associated programs for the recognition of similar crystal structures, have been developed in crystallography: CRYCOM (uses unit cell, space group, and fractional coordinate data) [19], Polymorph Predictor (uses the radial distribution function) [64], COMPACK (uses the relative position and orientation of molecules) [9]. Crystal structure similarity can also be identified by comparing computed powder patterns [28, 33, 54]. As one of the main difficulties in comparing crystal structures based on cell parameters, one should mention the possibility of describing one and the same crystal structure in different ways. Such different descriptions can arise from the selection of different cell vectors or
Fig. 1 The crystal structure BUWKEJ. The published unit cell is indicated by planes. The standard unit cell is displayed by black lines
by choosing different space groups at the X-ray analysis stage. A crystal cell is described by an origin and three cell vectors (a, b, and c), which can always be transformed into another description with a′, b′, and c′, where the transformed vectors are linear combinations of the original vectors. As an example, we show (Fig. 1) the crystal structure with reference code BUWKEJ taken from the CSD. The referred crystal structure was given as a monoclinic cell, indicated in the figure by the planes. This unit cell, according to the rules for standardization, can be transformed into an orthorhombic unit cell. The new unit cell is indicated in the figure by black lines. A visual analysis of the molecular packing features of BUWKEJ points to another error: the packing contains unusually large empty spaces. This means that, in reality, the crystal structure of BUWKEJ should belong to a space group of higher symmetry. In Sect. 3.2, we will describe the possibility of a systematic search for this kind of error (and others) in databases. Another problem arises often: if the molecule itself has some symmetry, e.g. in a crystal structure of the space group P1̄, a molecule that occupies a crystallographic inversion center might be described in space group P1 as well. For a long time, crystallographers have followed rules to define a unique description [38]. However, the comparison remains difficult, especially if, in the decision tree, the crystal cell parameters are close to one of the criteria. Therefore, a successful algorithm for similarity searching in crystal structures should compare distances between selected atoms/positions rather than lattice vectors or space symmetry. This can be performed by the comparison of the molecule centers [20], of the environments of the atoms [9], or of the reciprocal crystal structures [28, 29]. The similarity index introduced by Hofmann [28] is based on comparing integrated powder diagrams instead of the powder diagrams themselves. The index is defined as the mean difference between the normalized simulated and observed integrated powder diagrams and is proportional to the surface between the two curves (Fig. 2). This index was originally developed for the purpose of crystal structure determination from unindexed powder diagrams (see Sect. 2.4), but thanks to its simplicity, it can be used for the relatively fast comparison and clustering of huge amounts
Fig. 2 Comparison of the integrated experimental and simulated powder diagrams for pigment PY111 for the predicted structure of rank 1, similarity index s_int = 0.41
of crystal structures (first converted to a powder diagram representation) in experimental as well as in virtual databases. In contrast to traditionally used similarity indices, the proposed method remains valid for comparisons with large deviations in the cell constants. Refinement according to this index closes the gap between crystal structure prediction and automated crystal structure determination. Once a similarity index is introduced, the similarity matrix or, more generally, a distance matrix between crystal structures can be calculated. Starting from this distance matrix, the crystal structures can be clustered; identical and similar structures among the predicted crystal packings (or in databases) can be identified and removed; and crystal structure determination from unindexed powder diagrams can be performed.
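A simplified numerical form of such an integral similarity index is sketched below: both diagrams are normalized, cumulatively integrated over 2θ, and the mean absolute difference of the two integrated curves is returned. The exact normalization and weighting used in [28] and in FlexCryst are not reproduced here; this is only an assumed, illustrative form.

```python
import numpy as np

def integral_similarity(two_theta, I_exp, I_sim):
    """Mean difference between normalized, integrated powder diagrams,
    proportional to the area enclosed between the two integrated curves."""
    F_exp = np.cumsum(I_exp) / np.sum(I_exp)   # normalized integrated experimental diagram
    F_sim = np.cumsum(I_sim) / np.sum(I_sim)   # normalized integrated simulated diagram
    return np.trapz(np.abs(F_exp - F_sim), two_theta) / (two_theta[-1] - two_theta[0])
```

Evaluating this function for every pair of structures (after simulating their powder diagrams on a common 2θ grid) yields the distance matrix needed for the clustering sketch given above.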
2.3 Clustering of Crystal Structures As an example of clustering, we show in Fig. 3 a dendrogram produced by the FlexCryst software, based on the similarity index suggested in [28, 29].
Fig. 3 Clustered crystal structures
The set of crystal structures (about 50 REFCODEs) for the analysis was taken randomly from the list of refcodes of the CSD around refcode BUWKEJ (plus the reduced-cell presentation BUWKEJ-reduced-cell). For all crystal structures of this set, the powder diagrams have been calculated, and cluster analysis according to the similarity indices has been performed. From the obtained dendrogram, it follows that in two cases the similarity index between structures is (nearly) zero. The first case of a zero value of the similarity index, between BUWKEJ and BUWKEJ-reduced-cell, has
Fig. 4 Similar crystal structures: The two crystal structures are isostructural and differ only by the substitution of the chlorine atom by a methoxy group (see bottom)
been described above. Both structures are identical and, consequently, this identity is visible in the dendrogram. The second case concerns the two structures BUWGEF and BUWGEF01. The structure BUWGEF01 is a refined redetermination of BUWGEF. The accuracy of BUWGEF01 is higher; however, on our scale the difference between the crystal structures is negligible. An interesting feature of the dendrogram is that it can indicate isostructural crystal structures as well. In the dendrogram, the next closest vicinity between two different crystal structures is found for the structures with refcodes BUWDUS and BUWFEE. In Fig. 4 the two crystal structures are shown. Both structures are very similar and belong to space group P21. The molecules differ only by the substitution of the chlorine atom in BUWDUS by a methoxy group in BUWFEE. The unit cells also differ slightly: a = 10.538, b = 16.333, c = 9.680 Å, β = 100.06° for BUWDUS compared to a = 10.359, b = 16.474, c = 9.654 Å, β = 100.82° for BUWFEE.
2.4 Crystal Structure Determination as an Application of Similarity Searching An example of the practical application of cluster analysis in crystallography is structure determination from unindexed powder diagrams. In principle, crystal structure determination from powder X-ray data can be considered a simplified case of crystal structure prediction: the X-ray
diffraction patterns can be included as additional information at the stage of crystal structure generation, as a boundary condition on the lattice parameters (case of indexed powder diagrams), or at the stage of ranking of the generated crystal packings, where hundreds of simulated powder diagrams have to be compared with the experimental powder diagram (case of unindexed powder diagrams). The algorithm of crystal structure determination from unindexed powder diagrams works as follows. As a first step, crystal structures are generated; then their powder diagrams are simulated and compared with the experimental powder diagram. Afterwards, the crystal structures found to be closest to the experiment have to be minimized according to the evaluation function (in our case, the similarity between the experimental and simulated powder diagrams). The similarity index plays a crucial role here. The most widely used similarity index for comparing powder diagrams is the mean square deviation between two diagrams (R-value, Rietveld). Unfortunately, this comparison is very sensitive even to very small shifts of the peaks and does not allow minimization to be performed in the case of slightly different unit cell parameters. The other similarity indices mentioned above [28, 33] are tolerant to these tiny cell differences. The index of Karfunkel [33], nevertheless, arbitrarily provides either positive or negative values as a result of using a nondefinite matrix for the folding, and therefore this method cannot be used for refinement. The advantages of the integral similarity index [28] are many: similar powder diagrams can be recognized even if peaks are shifted; the similarity index is calculated quickly because the evaluation works pointwise; and during refinement, interchanged peaks can be reordered correctly. As an example of a crystal structure determination, we show the case of Pigment Red 181 (PR181) [28]. The comparison of the predicted crystal structures with the experimental powder diagram and subsequent structure refinement indicates the most promising crystal structure at rank 12. The initial value of the similarity index of this structure, 3.58%, decreased during refinement to 0.29%. In Fig. 5 one can observe shifted peaks at 10° and 20° of 2θ for the predicted crystal structure. The shift of these peaks (010 and 020) is caused by the fact that the predicted cell constant is 0.4 Å too short. During refinement, this inaccuracy was corrected and the cell vector was elongated from 9.21 to 9.63 Å. Another problem for common refinement would be caused by the two peaks at 25° and 27°. In the predicted crystal structure, both peaks overlap and are located at 27°. The algorithm correctly recognizes that this peak has to be split and separates it. As in the case of interchanged peaks, other similarity indices would have a local minimum for this situation, making further refinement impossible. Another possibility of applying cluster analysis to crystal structure determination is the comparison of the experimental powder diagram with the available crystal structures in the database, with the aim of searching for isostructural crystal structures (similar to the case of the isostructural crystals BUWDUS and BUWFEE). By subsequent refinement of the predicted crystal structure, the experimental structure can be determined [59].
The crystal structures of Pigment Red 3 (1-((4-methyl-2-nitrophenyl)azo)-2-naphthol, refcode MNIPZN) and Pigment Orange 5 (1-((2,4-dinitrophenyl)azo)-2-naphthol, refcode CICCUN) have been solved in this way. In
Fig. 5 Structure determination of Pigment Red 181. From the predicted polymorphs, the structure closest to the experimental powder diagram is selected. After refinement, both powder diagrams are nearly identical
a similar manner, the crystal structure of Pigment Red 170 was solved: first, the crystal structure of its methyl derivative was found, and afterwards the crystal structure of Pigment Red 170 was found to be isostructural to it.
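The whole workflow of this subsection can be summarized in a few lines of hypothetical driver code: simulate a powder diagram for every generated (or predicted) structure, rank by the integral similarity index, and refine the best candidates against the experimental diagram. The functions simulate_powder_diagram and refine_against_pattern stand in for the corresponding program routines and are assumptions, not documented API.

```python
def determine_from_unindexed_powder(candidates, two_theta, I_exp,
                                    simulate_powder_diagram, refine_against_pattern,
                                    n_best=10):
    """Rank candidate crystal structures against an unindexed experimental powder
    diagram (using integral_similarity from the sketch above) and refine the best ones."""
    scored = []
    for structure in candidates:
        I_sim = simulate_powder_diagram(structure, two_theta)
        scored.append((integral_similarity(two_theta, I_exp, I_sim), structure))
    scored.sort(key=lambda item: item[0])        # smallest index = most similar diagram
    return [refine_against_pattern(s, two_theta, I_exp) for _, s in scored[:n_best]]
```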
3 Support Vector Machines and Intermolecular Interactions The second tool of data mining mentioned in the introduction, the support vector machine (SVM), is of special importance for the investigation of intermolecular interactions and the derivation of a new force field. This method allows for the separation of the correct experimental crystal structures from wrong and/or virtual crystal structures generated by a program. The weights of the descriptors obtained by this procedure, interpreted as energies, define an intermolecular force field. In this chapter, we describe the derivation of a force field by data mining. The obtained force field consists of a set of potentials which can be physically interpreted as the interaction energy between a pair of atoms depending on the distance between these atoms. In contrast to existing potential functions (see Introduction), only minimal assumptions were made to derive these new potentials, and the functional form (such as Lennard-Jones or Buckingham) was not fixed. Similar attempts have been made in protein structure prediction [39, 48]. However, the
extracted group increments could not be compared to physical models directly, only indirectly. The applied SVM method provides two results: the weights of the descriptors, which can be interpreted as a force field, and the outliers, which allow the recognition of errors in the database.
3.1 Algorithm of Data Mining Force Field Derivation The algorithm for deriving a force field by data mining with an SVM can be divided into several steps. Collecting the data. The needed data sets have to be collected. Today, a huge amount of data (in our case, crystal structures) is collected in several big crystallographic databases. Here, to derive our force field, we use the structural data stored in the Cambridge Structural Database, maintained by the Cambridge Crystallographic Data Centre since 1970. Definition of the descriptors. All the descriptors necessary for a successful description of the system have to be defined. In our case, this means defining all the variables which might influence the final crystal structure. In this chapter, the chosen descriptors are the intermolecular interactions, and these interactions are assumed to act pairwise. The required data are the distances between the atoms in the crystal structures. However, additional factors might influence the final crystal structure formed by a molecule, and in more advanced predictions these descriptors would have to be taken into account. Parametrization. Weights have to be assigned to all the descriptors. These weights can have some physical meaning, but this is not necessary. With the help of these weights, the data (crystal structures) are separated into two different classes, the existing crystal structures and the virtual or distorted crystal structures. If the descriptors are the atom-pair distances, then the weights correspond to energies. Validation. A validation has to be performed. The common way is to separate the data into a training set and a validation set. If the obtained trained potentials are able to predict the validation set, the data mining is considered successful. Since data mining in crystallography is quite new, we will show, in a wide variety of further validations, the significance of the derived force field. These validations are not part of data mining and will be described in a separate section.
Collecting of the Data As already mentioned above, crystallographers were among the first to recognize the importance of collecting data on crystal structures for discovering new knowledge. Consequently, a huge amount of crystal structure data is nowadays collected in several big crystallographic databases: • The Protein Data Bank (PDB) is a repository for 3-D structural data of proteins and nucleic acids. The data, typically obtained by X-ray crystallography
or NMR spectroscopy and submitted by biologists and biochemists from around the world, are released into the public domain and can be accessed for free.
• The ICSD is a database of inorganic crystal structures and is described in more detail in “Data Basis, the base for Data Mining”. It contains information on all inorganic crystal structures published since 1913, including pure elements, minerals, metals, and intermetallic compounds (with atomic coordinates). The ICSD contained 103,679 entries as of mid-2008.
• The Cambridge Structural Database (CSD) [1, 7] is a repository for small organic and organometallic molecule crystal structures, obtained mainly by single-crystal X-ray analysis. The CSD is compiled and maintained by the Cambridge Crystallographic Data Centre.
• A new development in this field is web-oriented databases based on text mining, e.g. CrystalEye [47]. This software harvests and aggregates crystal structures from accessible Web sites and joins them into one database. Presently, it accesses over 100,000 structures published in the last 15 years. In essence, CrystalEye transforms the scholarly publications of crystallography into a giant database.
• Virtual databases are databases that can be generated by quantum chemical calculations. In addition to the structures, these databases in general contain the energy of each structure and other information. Therefore, they are commonly used for the (recursive) fitting of classical methods and for attempts to reproduce the energy within the classical model as well as possible [50, 53]. A fitting procedure based on these data needs smaller data sets and less computational time than data mining; however, it requires additional information about non-equilibrium crystal structures. This information can be obtained from high-quality quantum chemical calculations.
The CSD might be the most important database for organic chemistry. Today, this database contains nearly half a million crystal structures. The exponential increase of these data leads to an approximate doubling of their amount every seven years, as can be seen from the statistics in Fig. 6. The tendency towards increasing accuracy of the crystal structure data promises a better quality of the parameters obtained by data mining in the future, because the quality of the result depends strongly on the amount, the quality and the reliability of the data used.
Definition of Descriptors of the DMFF To define a force field, it is necessary to define a functional form of potentials and sets of parameters. An intermolecular force field assigns to two molecules an interaction energy depending on the relative orientation of the interacting molecules. In most of the force field approaches, the interaction energy between two molecules I and J is expressed as the sum of all pair–atom interactions. If one molecule contains
Fig. 6 The histograms below show the growth of the Cambridge structural database (CSD) since 1970
$n_I$ atoms and the other $n_J$ atoms, the interaction energy can be calculated according to the atom–atom potential approach of Kitaigorodski [35] as follows:
$$E_{IJ} \approx \sum_{i=1}^{n_I}\sum_{j=1}^{n_J} \varepsilon_{ij}(r_{ij}). \qquad (4)$$
In the literature, there is a large number of proposed functional forms for the pair potential $\varepsilon_{ij}(r_{ij})$. The oldest potential was derived from theoretical considerations about the dispersion and electrostatic interactions. For the long-range dispersion interaction it yields an $r^{-6}$ law and for the electrostatic term $r^{-1}$. This potential has been completed with short-range interactions. For computational convenience, an $r^{-12}$ term has been chosen [36]:
$$\varepsilon(r) = 4\varepsilon_{\mathrm{vdW}}\left[\left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6}\right] + \frac{q_i q_j}{r} \qquad (5)$$
or an exponential term [8]:
$$\varepsilon(r) = \sum\left( A\,e^{-Br} - C\,r^{-6} + \frac{q_i q_j}{r}\right). \qquad (6)$$
It was already recognized in the 1970s that this functional form is not suited to describe hydrogen bonds, and an additional 10–12 potential was proposed for hydrogen bonds [42]:
$$\varepsilon(r) = \sum\left( A\,r^{-10} - B\,r^{-12} + \frac{q_i q_j}{r}\right). \qquad (7)$$
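For reference, the 6–12 and exp-6 forms of Eqs. (5) and (6) can be evaluated directly; a minimal sketch is given below. The parameter names and units are illustrative (charges in arbitrary units, distances in Å).

```python
import numpy as np

def lennard_jones_6_12(r, eps_vdw, sigma, qi, qj):
    """6-12 (Lennard-Jones) potential with a Coulomb term, cf. Eq. (5)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * eps_vdw * (sr6 ** 2 - sr6) + qi * qj / r

def buckingham_exp6(r, A, B, C, qi, qj):
    """exp-6 (Buckingham) potential with a Coulomb term, cf. Eq. (6)."""
    return A * np.exp(-B * r) - C * r ** -6 + qi * qj / r
```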
Later, more and more complex forms of the potentials were developed. As an example, we give here the shifted force n–m potential [11]:
$$\varepsilon(r) = \frac{\alpha E_0}{n-m}\left[ m\beta^n\left\{\left(\frac{r_0}{r}\right)^{n} - \left(\frac{1}{\gamma}\right)^{n}\right\} - n\beta^m\left\{\left(\frac{r_0}{r}\right)^{m} - \left(\frac{1}{\gamma}\right)^{m}\right\}\right] + \frac{nm\alpha E_0}{n-m}\left(\frac{r-\gamma r_0}{\gamma r_0}\right)\left\{\left(\frac{\beta}{\gamma}\right)^{n} - \left(\frac{\beta}{\gamma}\right)^{m}\right\} + \frac{q_i q_j}{r} \qquad (8)$$
with
$$\gamma = \frac{r_{\mathrm{cutoff}}}{r_0}, \qquad \beta = \gamma\left(\frac{\gamma^{m+1}-1}{\gamma^{n+1}-1}\right)^{\frac{1}{n-m}}, \qquad \alpha = \frac{n-m}{n\beta^m\left(1 + \frac{m/\gamma - m - 1}{\gamma^m}\right) - m\beta^n\left(1 + \frac{n/\gamma - n - 1}{\gamma^n}\right)}$$
This development faces two problems. On the one hand, the final theoretical shape of the potential is still under discussion. On the other hand, even for the simple expressions in molecular mechanics, by far the largest part of the CPU time is used for the evaluation of the intermolecular interactions. For very complex analytical expressions, the CPU times for the evaluation of these terms become extremely high. This fact, along with the large amount of experimental data available, motivates the development of the data mining approach. It has already been mentioned that the force field obtained by data mining is free from any fixed functional form. It defines the energy function of the intermolecular interactions $\varepsilon_{ij}(r_{ij})$ by auxiliary points. The energy of a pairwise interaction is taken into account if the distance $r_{ij}$ between the atoms is below the value of $r_{\mathrm{cutoff}} \approx 5.77$ Å. The indication for this cutoff was the convergence of all derived potentials to zero at this distance. Another reason was that the cutoff distance should be as short as possible in order to avoid the unnecessary evaluation of a huge number of interactions. All force fields differentiate between different atom types. The number of atom types varies between the different approaches; in the simplest approach, the atomic number is assigned as the atom type. The most complex approach assigns to each atom of the molecule its own atom type. In the described DMFF, an intermediate complexity has to be selected because of the problem of overfitting. As has already been pointed out, the simplest approach is not able to describe hydrogen bonds. Therefore, in the DMFF the atomic numbers of atoms i and j are assigned as the atom types n and m, and the hydrogen atom is handled exceptionally with regard to hydrogen bonds: hydrogen atom types have been split into four types, hydrogen bound to carbon, hydrogen bound to nitrogen, hydrogen bound to oxygen, and all other hydrogen atoms. The number of atom types $n_{\text{atom types}}$ along with the number of auxiliary points $n_{\text{auxiliary points}}$ defines the needed number of parameters:
$$n_{\text{parameters}} = n_{\text{auxiliary points}} \cdot n_{\text{atom types}} \qquad (9)$$
Fig. 7 Overfitting: The polynomial fit results are perfect and all points lie on the line. In fact the linear relationship is lost and the fit makes no sense
It should also be taken into account that the number of parameters and the available experimental data must be balanced. If the force field has too many parameters $n_{\text{parameters}}$ in comparison to the amount of data available, an absurd and false force field may fit the data perfectly. This effect is called overfitting. In Fig. 7 the overfitting of a few points is illustrated: by selecting a polynomial of too high a degree for the fitting, all points are found on one curve, but the linear relationship is lost. On the other hand, if the force field has too few parameters, the system will be oversimplified, and reliable predictions will not be possible. For the selection of the auxiliary points, two aspects have to be taken into account. First, the points should be closely connected to the final energy expression to allow fast evaluation. Since the energy expression always contains the distance between two atoms,
$$r_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}, \qquad (10)$$
a reasonable choice is a function of $r_{ij}^2$. This helps to avoid time-consuming operations such as extracting the square root. Second, the points should be placed densely in the curved, short-range region and sparsely in the flat, long-range region. Hence, the auxiliary points can favorably be defined as
$$K = \left\{ k_{mnl} : k_{mnl} = \frac{100}{l} \right\} \quad\text{with}\quad 3 \le l \le 60. \qquad (11)$$
The density of the auxiliary points depends on the factor 100. This factor leads to sufficient accuracy without possible overfitting.
The continuity of the potential is very important for the application of the force field in various minimization procedures. To ensure the continuity of the energy function, one has to interpolate between the auxiliary points. The simplest way is linear interpolation, which is obtained as follows:
$$\varepsilon_{mnr} = (l + 1 - f)\,\varepsilon_{mnl} + (f - l)\,\varepsilon_{mn,l+1} \quad\text{with}\quad l = \lfloor f \rfloor \;\text{and}\; f = \frac{100}{r_{ij}^2}. \qquad (12)$$
Now, if one uses this atom–atom pair potential in the equation for the calculation of the intermolecular interactions, one obtains:
$$E_{IJ} = \sum_{i=1}^{n_I}\sum_{j=1}^{n_J} \varepsilon_{\mathrm{type}(i),\,\mathrm{type}(j),\,r} = \sum_{i=1}^{n_I}\sum_{j=1}^{n_J} \varepsilon_{mnr}. \qquad (13)$$
The maximum distance $r_{\mathrm{cutoff}}$ between two atoms which is still taken into consideration in the calculations arises from (11):
$$r_{\mathrm{cutoff}} = \sqrt{k_3} = \sqrt{\frac{100}{3}} \approx 5.77\ \text{Å}. \qquad (14)$$
The total number of parameters to be determined according to (9), with approximately 100 atom types and 100 auxiliary points, is 10,000. The determination of this large number of parameters will be described in the next section.
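A minimal sketch of how such a tabulated potential can be stored and evaluated is given below. The auxiliary-point grid of Eq. (11), the linear interpolation in f = 100/r² of Eq. (12) and the pairwise sum of Eq. (13) follow the text; the table of weights (eps_table) is assumed to come from the parametrization described in the next section, and the data layout is an illustrative choice, not the one used in FlexCryst.

```python
import numpy as np

R_CUTOFF = (100.0 / 3.0) ** 0.5          # ~5.77 Angstrom, Eq. (14)
L_MIN, L_MAX = 3, 60                     # range of auxiliary points, Eq. (11)

def pair_energy(eps_table, m, n, r2):
    """Tabulated atom-atom potential evaluated by linear interpolation, Eq. (12).
    eps_table[(m, n)][l] holds the weight at auxiliary point l for atom types m, n
    (the key (m, n) is assumed to be stored in a canonical, symmetric order)."""
    f = min(100.0 / r2, float(L_MAX))    # descriptor value for the squared distance
    if f < L_MIN:                        # r beyond the cutoff: no contribution
        return 0.0
    l = min(int(f), L_MAX - 1)           # index of the lower auxiliary point
    eps = eps_table[(m, n)]
    return (l + 1 - f) * eps[l] + (f - l) * eps[l + 1]

def intermolecular_energy(eps_table, types_I, xyz_I, types_J, xyz_J):
    """Pairwise sum over all atoms of the two molecules I and J, Eq. (13)."""
    E = 0.0
    for m, ri in zip(types_I, np.asarray(xyz_I, dtype=float)):
        for n, rj in zip(types_J, np.asarray(xyz_J, dtype=float)):
            E += pair_energy(eps_table, m, n, float(np.sum((ri - rj) ** 2)))
    return E
```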
Parametrization The parametrization in our method is achieved by the following procedure. We divide the crystal structures into two classes. One class contains all the correct experimental structures, and the other class contains all the erroneous and distorted crystal structures. The idea is based on the assumption that each experimental crystal structure must be in a local minimum of the free energy. This assumption is a direct consequence of the second law of thermodynamics: the free energy of a chemical system is always at a minimum in equilibrium (or at a local minimum in metastable states). The other assumption is that the interaction energy between two molecules I and J can be expressed as the sum of all pair-atom interactions (see (4)). This assumption is fundamental in classical mechanics. Provided that the database itself is error free, erroneous crystal structures have to be generated and added to the database for our purpose. In the described case, for each crystal structure crystal$_n$, the cell parameters have been slightly distorted and five erroneous structures crystal$_{nm}$ have been generated. We can then impose a minimum condition,
$$\text{Score}_n \le \text{Score}_{nm} + \delta e, \qquad (15)$$
where Score$_n$ is the score of the experimental crystal structure, while Score$_{nm}$ is the score of the distorted crystal structure.
The weights $w_{mnl}$ of the defined descriptors $k_{mnl}$ can be determined by defining a cost function. This cost function J measures the distorted structures which violate the minimum condition:
$$J(\varepsilon) = \sum_{n=1}^{400{,}000}\;\sum_{m=1}^{10} \max\left(0,\ \text{Score}_n - \text{Score}_{nm} + \delta e\right). \qquad (16)$$
This is the expression to be minimized. The aim of the minimization, with neural networks or a simplex algorithm, is to separate the decoys and the experimental structures as far as possible by a hyperplane, as depicted in Fig. 8. The vector perpendicular to this plane contains the weights w of the scoring function. The above inequality (15) cannot be satisfied completely, even with optimal parameters. After minimization, one always finds some outliers. This has two reasons: on the one hand, the databases are not free from errors and inaccuracies (this point will be discussed in more detail in Sect. 3.2); on the other hand, only a restricted number of descriptors is taken into account. In fact, in this force field, only atomic distances and very few atom types are taken into account as influences; nevertheless, it is known that crystal structures depend on the temperature T and the pressure p. If one looks at this dependence in a phase diagram, it is easy to recognize that, according to the conditions, different crystal structures, called polymorphs, can occur. As the force field selected here contains neither pressure nor temperature dependence, it is clear that such a phenomenon cannot
Fig. 8 Experimental structure with ten decoys. The structures are described by two selected types of contacts. Two of the decoys are outliers and can not be separated by the hyperplane (in the two-dimensional case, it reduces to a line)
be calculated with the force field. Newer methods assign to each atom of a given molecule its own atom type. This method promises to be very successful; however, the force field loses its universality and has to be determined anew for each molecule [52]. Besides the serious mistakes discovered during the analysis of the outliers, the positions of hydrogen atoms, in particular, are often determined only very imprecisely. This is because of the low diffraction of X-rays by hydrogen atoms and/or the low quality of the crystals.
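A minimal sketch of the separation step, Eqs. (15) and (16), is given below: each crystal structure is represented by a descriptor vector (e.g. counts of atom-pair contacts per auxiliary point), and the weight vector is adjusted whenever the margin condition of Eq. (15) is violated. The simple gradient-descent update shown here is only one possible way to minimize such a hinge-type cost function; it is not the SVM solver actually used for the DMFF.

```python
import numpy as np

def train_weights(x_exp, x_decoys, delta=1.0, lr=1e-3, epochs=50):
    """x_exp: (N, d) descriptor vectors of the experimental structures.
    x_decoys: (N, M, d) descriptor vectors of the distorted structures of each entry.
    Returns weights w that try to satisfy w.x_exp <= w.x_decoy + delta, Eq. (15)."""
    N, d = x_exp.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for n in range(N):
            for m in range(x_decoys.shape[1]):
                # one term of the cost function, Eq. (16)
                violation = w @ x_exp[n] - w @ x_decoys[n, m] + delta
                if violation > 0:
                    # lower the score of the experimental structure relative to the decoy
                    w -= lr * (x_exp[n] - x_decoys[n, m])
    return w
```

After training, the components of w can be interpreted as the (unscaled) values of the pair potentials at the auxiliary points; their scaling to kJ/mol is discussed in Sect. 3.3.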
Validation As the last step in data mining, a validation has to be performed. In general, one selects a set of data for this purpose and keeps it apart during the parametrization. In our example, all database entries with REFCODEs starting with “A” have been selected. If the determined weights w separate the validation set with similar quality, the procedure has been successful. The main aim of the validation is to get information about a possible overfitting. Another validation, which is not always available, is the checking of the result for its reasonableness. In our case, because a large number of force fields have been published, we can presume that the potentials should be smooth, that they should rise quickly for short distances, and that, after passing a minimum, they should converge slowly to zero. In Fig. 9, the obtained auxiliary points for the C...C interaction can be seen. For distances above 3 Å, the fitted curve coincides with our expectations; for shorter distances, a significant overfitting occurs. As we will discuss in Sect. 3.3, the obtained curves have to be preprocessed. Also, the scale of the energy remains undetermined in this first step.
Fig. 9 The interaction potential for C...C obtained by data mining (trained potential and fitted curve)
3.2 Recognition of Faulty Crystal Structures in Data Bases In principle, error recognition (or cleanup of the data) belongs to the first step of data mining, but, in fact, a large number of errors in the data can only be recognized after the first attempts at dividing the crystal structures into two classes as described above (see section “Parametrization”). This division is an excellent tool for finding errors in the database. Since data mining is susceptible to errors, the outliers should be analyzed carefully and the faulty structures removed before the dividing procedure is performed again. For example, for one subset including all the crystal structures of space group P1, we observed an error rate of 30%. Obviously, data mining cannot achieve any reasonable result in such a case. A large number of errors can be found by a careful analysis of the result of the division; the outliers in Fig. 8 indicate such errors in the database. In our case, this analysis allowed us to find several hundred errors in the data. The faulty crystal structures were communicated to the administrators of the database and amended accordingly. Here, we will describe in more detail the two main sources of errors observed in the database, mentioned in Sect. 2.2: on the one hand, wrong space groups were assigned to crystal structures; on the other hand, the setting was wrong (for example P21/n instead of P21/a).
Correction of the Space Group Figure 10 illustrates the case of a wrong space group assignment for a crystal. This crystal structure, mentioned already in Sect. 2.2 (Fig. 1), shows an unusually large empty space in the packing. After transformation of the original crystal structure, assigned as P1̄, to the reduced standard cell, it becomes obvious that the crystal might belong to an orthorhombic rather than to a triclinic space group. Checking the possible supergroups of the crystal structure reveals a realistic crystal structure in the space group Pbn21. After applying the corresponding symmetry operations, only an additional correction of the positions of the hydrogen atoms of the methyl group was needed. Once such errors are recognized, one can try to develop filters which will find all similar errors. Intuition gives a clear hint for a filter for this kind of error: obviously, a tool is required which can estimate the empty space or, expressed differently, check whether the number of atoms in the cell is consistent with the cell volume. Such a rule has already been developed. According to this rule, the volume of the unit cell V is proportional to the number of non-hydrogen atoms multiplied by 18 Å³ [34]. A check of this rule reveals that it is too imprecise to find all the errors, even if the volume is twice as high as it should be. The rule fails especially in the case of small organometallic compounds: barium formate, dilithium malonate, lithium phosphonoacetate, and dirubidium tartrate are examples where this rule fails by more than 25%. To be able to recognize faulty space group assignments, an increment system was developed to predict the crystal densities [25]. Similar increment systems have been implemented before in commercial software [43]. Several hundred errors
Fig. 10 Wrong assignment of the space group for crystal structure BUWKEJ (CSD 5.29 Release 2007)
were found with our increment system. Besides the illustrated singular case, a systematic error became obvious: nearly one third of the crystal structures reported in space group P1 could be correctly reassigned to P1̄. This error might be caused by a typing error of non-scientific staff, who considered the bar to be obsolete.
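The simplest form of the volume filter discussed above can be written in a few lines; the 25% tolerance used here is an arbitrary, illustrative threshold, and a production filter would rely on the increment system of [25] rather than on a flat 18 Å³ per non-hydrogen atom.

```python
def volume_is_suspicious(cell_volume, n_non_hydrogen_atoms, tolerance=0.25):
    """Flag a unit cell whose volume deviates strongly from the
    18 A^3-per-non-hydrogen-atom rule [34]."""
    expected = 18.0 * n_non_hydrogen_atoms
    return abs(cell_volume - expected) > tolerance * expected
```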
Correction of the Setting The second source of errors is a wrong setting. For example, the crystal structure of (7R*,10R*,11R*)-7,11-dibromo-8,8,10-trimethyl-1,5-dioxaspiro(5.5)undecane is given in the setting P21/c. If one now calculates the interatomic distances in this crystal, a number of unusually short van der Waals contacts appear; they are marked in purple in Fig. 12. Changing to the setting P21/n removes all the clashes, and only a bromine–hydrogen distance remains highlighted. These interactions are known to contribute an extra stabilization to crystals. In this simple case, the error causes very short intermolecular distances, and an experienced crystallographer can recognize it without difficulty. Sometimes, however, several different settings are free from interatomic clashes. Therefore, we developed a filter based on the energy. All crystal structures were calculated in all possible settings of the given space group. If the lowest energy was calculated for a setting other than the declared one, the structure was marked as possibly faulty and was reported
Fig. 11 Comparison of different procedures for the prediction of the crystal structure density: Hofmann [25], Kempster [34], Mighell [43]
to CCDC for examination. In the presented example, the declared crystal structure has an energy of −54.80 kJ mol⁻¹, compared to −120.45 kJ mol⁻¹ for the correct setting. For the wrong setting, we observed a systematic error, too: a large number of crystal structures in space group P212121 showed a wrongly handed axis.
3.3 Validation of the DMFF It has been mentioned earlier, in the section “Parametrization”, that the determined weight vector w can be interpreted as a scoring function and that, after proper scaling, the parameters of the force field can be obtained.
Scaling of the Energy Vector w for Sublimation Energies The scaling of the energy vector w can be done with the help of sublimation energies. The sublimation energy $E_p^{\text{sublimation}}$ corresponds to the intermolecular interaction energy of a crystal. This correspondence is not entirely accurate, because a molecule is more mobile in the gas phase and possesses translational and rotational degrees of freedom. If a crystal is evaporated, the supplied energy is used to break the intermolecular interactions. The internal degrees of freedom, like oscillations and rotations,
Fig. 12 Wrong assignment of the setting for crystal structure FABFOE (CSD 5.29 Release 2007)
also become partly activated by the supplied energy. Therefore, a molecule has a higher energy in the gas phase than in the solid state. However, these entropic terms are very hard to compare with the total energy of a crystal cell, so that, roughly, the sublimation energy and the cell energy can be equated:
$$E_p^{\text{sublimation}} \approx E_p. \qquad (17)$$
The scaling factor $s_{KJ}$ was established by the minimization of the root mean square (rms) deviation for known sublimation energies. Here, the sublimation energies listed in [22] have been taken into account.
$$\frac{\partial J^2}{\partial s_{KJ}} = \frac{\partial}{\partial s_{KJ}} \sum_{p=1}^{n_{\text{experiment}}} \left(E_p^{\text{sublimation}} - \text{Score}_p\right)^2 = 0. \qquad (18)$$
Fig. 13 Measured and calculated sublimation energies
The factor $s_{KJ}$ converts the weight vector from an arbitrary energy scale to an energy in kilojoule per mole. The weights $w_{mnk}$, $3 \le k \le 100$, multiplied by the scaling factor give the atom pair potential between the atom types m and n in kilojoule per mole:
$$\varepsilon\,[\mathrm{kJ}] = s_{KJ}\, w. \qquad (19)$$
The average error between the calculated and measured sublimation energies is 9.1 kJ mol⁻¹. This is in the range of the other sources of error. In Fig. 13, all the experimental crystal structures of a given molecule are colored uniformly and connected. Several crystal structures occur more than once; partly these are repeated single-crystal determinations and partly other polymorphs of the substance. Looking at the data around 40 kJ in Fig. 13, a marked group of this kind can be seen. The energy differences between the different crystal structures of this molecule are of the magnitude of the average error. Also, it is necessary to note that the sublimation energy is difficult to measure experimentally and, hence, is itself error-afflicted. For these reasons, the precision of the estimation of the sublimation energy is limited. This also explains why the force field derived by Filippini et al. [22] has a similar error. As soon as the weights of the auxiliary points w obtained by data mining are scaled to energies, the atom–atom pair potentials of the new force field (DMFF) are defined. However, the potentials still have no analytical expression. To make it possible to validate them, it is necessary to compare the potential curves, and the depths and positions of their minima, with existing force fields.
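The scaling of Eqs. (18) and (19) amounts to an ordinary least-squares fit of a single factor; setting the derivative in (18) to zero gives the closed-form expression used in the sketch below. The raw scores and experimental sublimation energies are assumed to be given as matching arrays.

```python
import numpy as np

def fit_scale_factor(raw_scores, e_sublimation):
    """Least-squares scale factor s_KJ mapping raw scores to kJ/mol, cf. Eq. (18):
    d/ds sum_p (E_p - s * score_p)^2 = 0  =>  s = sum(E_p * score_p) / sum(score_p^2)."""
    scores = np.asarray(raw_scores, dtype=float)
    energies = np.asarray(e_sublimation, dtype=float)
    return float(np.dot(energies, scores) / np.dot(scores, scores))

def rms_error(raw_scores, e_sublimation):
    """Root-mean-square deviation between scaled scores and measured sublimation energies."""
    s = fit_scale_factor(raw_scores, e_sublimation)
    residuals = np.asarray(e_sublimation, dtype=float) - s * np.asarray(raw_scores, dtype=float)
    return float(np.sqrt(np.mean(residuals ** 2)))
```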
Validation of the Homoatomic Pair Potentials In the section “Definition of Descriptors of the DMFF”, some of the most common functional forms implemented for the intermolecular interactions in different molecular modeling programs are listed. In Fig. 14, we present the fitting of the auxiliary points of the new potentials (black points) for some homoatomic interactions by the 6–12 potential (5) (red points) and the exp-6 potential (6) (green points). As follows from this figure, both potentials describe the H(C)...H(C) and C...C interactions excellently. Nevertheless, in the case of C...C, one can see a clear overfitting in the repulsive region of short distances. The S...S interaction is still described satisfactorily, but in the case of the O...O interaction, the deviations are noticeable even if an electrostatic term with optimized charges is added to the functional forms. The potential function describing all homoatomic
Fig. 14 Comparison between different functional forms for homoatomic interactions
interactions excellently was found to be an exponential plus one Gaussian (exp-1G) (black curve):
$$\varepsilon_{mnr} = a_0\, e^{-a_1 r} - a_2\, e^{-a_3 (r-a_4)^2}. \qquad (20)$$
The checking of the new potentials over their valid range has shown that in the long-range region the new potentials rise towards zero faster than the $r^{-6}$ term normally responsible for dispersion forces. In the short-distance range, overfitting can occur, as already mentioned for the C...C potential. For this reason, some points have to be excluded in the fitting procedure. As shown in Fig. 14, for the carbon–carbon potential the first two points have been excluded; for the oxygen–oxygen potential, the first point has been excluded from the fitting. The above-mentioned discrepancies in the fitting of the O...O potential point to possible hydrogen bonding. A careful check of all atom pairs able to form hydrogen bonds shows that their functional form cannot be reproduced by any of the commonly used potentials. In Fig. 15, the cases of H(O)...O and H(N)...O hydrogen bonding are shown. In this figure, the derived DM potentials are fitted with several functions. Even the functions specially extended for this case with an additional 1/r term for the Coulomb interaction cannot reproduce the derived potentials satisfactorily. We have found that all the potentials of hydrogen bonding can be fitted accurately with a function of one exponential plus two Gaussians (exp-2G):
$$\varepsilon_{mnr} = a_0\, e^{-a_1 r} - a_2\, e^{-a_3 (r-a_4)^2} - a_5\, e^{-a_6 (r-a_7)^2}. \qquad (21)$$

Fig. 15 Fit of a hydrogen bond by different functions
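Fitting the exp-1G and exp-2G forms of Eqs. (20) and (21) to the scaled auxiliary points is a standard nonlinear least-squares problem; one possible sketch using scipy is shown below. The starting values are arbitrary placeholders and would need tuning for each atom pair.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_1g(r, a0, a1, a2, a3, a4):
    """Exponential plus one Gaussian, Eq. (20)."""
    return a0 * np.exp(-a1 * r) - a2 * np.exp(-a3 * (r - a4) ** 2)

def exp_2g(r, a0, a1, a2, a3, a4, a5, a6, a7):
    """Exponential plus two Gaussians, Eq. (21), used for hydrogen-bond potentials."""
    return exp_1g(r, a0, a1, a2, a3, a4) - a5 * np.exp(-a6 * (r - a7) ** 2)

def fit_pair_potential(r_points, eps_points, model=exp_2g):
    """Least-squares fit of the scaled (r, epsilon) auxiliary points to a chosen model."""
    n_params = 8 if model is exp_2g else 5
    p0 = np.ones(n_params)               # crude initial guess; adjust per atom pair
    params, _ = curve_fit(model, r_points, eps_points, p0=p0, maxfev=20000)
    return params
```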
Validation of the Heteroatomic Pair Potentials The obtained DM potentials of the heteroatomic interactions are of special importance. In all previous force fields, the parameters of the heteroatomic interactions are derived by combination rules: the interaction energy of two dissimilar atoms (e.g. C...O) is an average of the interaction energies of the corresponding identical atom pairs (i.e., C...C and O...O). The parameters are then defined either as the geometric or as the arithmetic mean. A more complex expression, developed already in 1931, takes the polarizability of the atoms into account [60]. In most force fields, one of these rules is applied. In this way, the N(N+1)/2 parameter sets for the N atom types can be reduced to the N sets of atomic parameters obtained from the homoatomic interactions. The rules can be easily understood after rewriting the 6–12 potential formula (5):
$$\varepsilon(r) = \varepsilon^{\mathrm{vdW}}\left[\left(\frac{r^{\mathrm{vdW}}}{r}\right)^{12} - 2\left(\frac{r^{\mathrm{vdW}}}{r}\right)^{6}\right]. \qquad (22)$$
In this formula, the parameters $\varepsilon^{\mathrm{vdW}}$ and $r^{\mathrm{vdW}}$ are the coordinates of the potential minimum. In general, these parameters are designated as the van der Waals energy and the van der Waals radius. The heteroatomic interactions can now be written as the arithmetic mean of these parameters:
$$\varepsilon_{ij}^{\mathrm{vdW}} = \frac{\varepsilon_{ii}^{\mathrm{vdW}} + \varepsilon_{jj}^{\mathrm{vdW}}}{2} \quad\text{and}\quad r_{ij}^{\mathrm{vdW}} = \frac{r_{ii}^{\mathrm{vdW}} + r_{jj}^{\mathrm{vdW}}}{2}, \qquad (23)$$
as the geometric mean,
$$\varepsilon_{ij}^{\mathrm{vdW}} = \sqrt{\varepsilon_{ii}^{\mathrm{vdW}}\, \varepsilon_{jj}^{\mathrm{vdW}}} \quad\text{and}\quad r_{ij}^{\mathrm{vdW}} = \sqrt{r_{ii}^{\mathrm{vdW}}\, r_{jj}^{\mathrm{vdW}}}, \qquad (24)$$
or, according to Lorentz–Berthelot, as the geometric mean in the energy and the arithmetic mean in the radius:
$$\varepsilon_{ij}^{\mathrm{vdW}} = \sqrt{\varepsilon_{ii}^{\mathrm{vdW}}\, \varepsilon_{jj}^{\mathrm{vdW}}} \quad\text{and}\quad r_{ij}^{\mathrm{vdW}} = \frac{r_{ii}^{\mathrm{vdW}} + r_{jj}^{\mathrm{vdW}}}{2}. \qquad (25)$$
Data mining makes it possible to derive the heteroatomic potentials directly from the experimental data and to inspect the existing combination rules. In Fig. 16, the Lorentz–Berthelot expression is examined: the DMFF is compared with the curve calculated by (25) from the van der Waals parameters of the homoatomic interactions. For clarity, the curves of the homoatomic interactions are added too. One can see that this combination rule works very well for the C...S interaction; it is satisfactory for the C...O and H(C)...S interactions; and it is still acceptable for the H(C)...C interaction. But for H(C)...O and especially for O...S interactions, it fails completely.
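For completeness, the three combination rules (23)–(25) are given in code form below; fed with homoatomic van der Waals parameters such as those in Table 1, they return the heteroatomic parameters that Fig. 16 compares with the directly derived DMFF curves. This is an illustrative transcription of the equations, not code taken from any of the cited programs.

```python
from math import sqrt

def arithmetic_rule(eps_ii, r_ii, eps_jj, r_jj):
    """Arithmetic means, Eq. (23)."""
    return (eps_ii + eps_jj) / 2.0, (r_ii + r_jj) / 2.0

def geometric_rule(eps_ii, r_ii, eps_jj, r_jj):
    """Geometric means, Eq. (24)."""
    return sqrt(eps_ii * eps_jj), sqrt(r_ii * r_jj)

def lorentz_berthelot(eps_ii, r_ii, eps_jj, r_jj):
    """Geometric mean in the energy, arithmetic mean in the radius, Eq. (25)."""
    return sqrt(eps_ii * eps_jj), (r_ii + r_jj) / 2.0

# Example: C...S from the DMFF homoatomic values of Table 1 (kcal/mol, Angstrom)
eps_CS, r_CS = lorentz_berthelot(0.056, 3.88, 0.332, 4.00)
```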
Fig. 16 Combination rules (geometric mean) for heteroatomic interactions
Table 1 Van der Waals parameters for carbon/sulfur interactions

Force field            r_CC^vdW (Å)   ε_CC^vdW (kcal mol⁻¹)   r_SS^vdW (Å)   ε_SS^vdW (kcal mol⁻¹)
DMFF [27]              3.88           0.056                   4.00           0.332
Amber [66]             3.7–4.0        0.245–0.387             4.00           0.45
Dreiding [40]          3.88–4.18      0.095–0.305             4.14–4.24      0.215–0.305
UFF [55]               3.851          0.105                   4.035          0.274
Scheraga et al. [42]   3.70–4.12      0.038–0.141             3.78           0.043
TRIPOS [10]            3.4            0.0951                  3.6            0.3440
Validation of the Van der Waals Parameters (Energies and Radii) Finally, the parameters of the van der Waals interactions of the different force fields can be compared. An accurate comparison, however, is hard to make. Each parameter set depends strongly on the atom types introduced, the cutoff criterion, the functional form of the potentials, the calculation of the electrostatic energy, and the set of charges used in this term. This can be illustrated for the example of the hydrogen bond. The total energy is the sum of the interaction of the oxygen of the hydroxyl group with the other oxygen atom, and that of the hydrogen atom with the oxygen atom. The sum of both energies should always be equal for all force fields; however, this sum can be distributed between the two interactions in very different ways. In the DM force field and the force field by Scheraga et al. [42], the interaction H(O)...O is described by its own potential; the other force fields recover this energy in electrostatic terms or in other ways. This situation causes a large problem for molecular modeling: since all the parameters depend strongly on the choice of these details and all the parameters have to be consistent, the mixing of parameters from different force fields becomes (nearly) impossible. In Table 1, we show the parameters for some homoatomic van der Waals interactions. The van der Waals radii have very close values in all force fields; the energy, however, varies over a rather wide range. Another advantage of the DMFF is the existence of parameters for all the elements occurring in the database. This is not the case for most force fields; they are parametrized only for the common elements important in drug design (C, H, N, O, S, etc.). Some force fields are extended to further elements by extrapolation. These values are not reliable, because they have never been checked. The DMFF includes such exotic elements as Hg, Te, Cs, U, and Pu in the list.
4 Applications of the Data Mining Force Field in Organic Crystallography In the following section, we will present some selected case studies with the DMFF; the application of the DMFF to crystal structure prediction will be described in the section “Data Mining and Crystal Structure Prediction”.
4.1 Docking An aim of bioinformatics is to support the development of new drugs. It tries to find new drugs or to optimize known ones with the aim of changing their effect on enzymes: to increase, decrease, or inhibit their activity. An example of an inhibitor is carbon monoxide. Because it is colorless, odorless, tasteless, and non-irritating, it is one of the most dangerous poisons. Carbon monoxide binds to hemoglobin, producing carboxyhemoglobin, and the traditional belief is that carbon monoxide toxicity arises from the formation of carboxyhemoglobin, which decreases the oxygen-carrying capacity of the blood. This inhibits the transport, delivery, and utilization of oxygen. Nicotine increases the activity of an enzyme: by binding to ganglion-type receptors, nicotine increases the flow of adrenaline and acts as a stimulating hormone. Molecular modeling predicts the formation of such complexes between potential drugs and enzymes. The strength of the complex can be estimated by scoring or energy functions. In contrast to scoring functions, which give a score not necessarily related to the energy, and to force fields based on quantum mechanics, which give the enthalpy, the force field obtained by data mining, the DMFF, can assign to the complex a binding affinity that includes the entropy. From the binding affinity, the effect of the molecule on the activity of the enzyme can be deduced. In many cases, the problem can be restricted to the calculation of the intermolecular energy alone, because drug molecules are very often rigid (see Fig. 17). On the other hand, flexible molecules often show much more biological activity and are more useful as lead structures for drug design: due to their flexibility, they can more easily reshape themselves to fit a given pocket of an enzyme. Flexible drugs lose degrees of freedom during docking, which results in a decrease of the entropy. As a result, the binding affinity of flexible molecules is lower than that of rigid molecules, and the equilibrium in (26) is shifted to the left-hand side for a large entropy loss:
$$H_{\text{enzyme}} + H_{\text{drug}} - T S_{\text{drug}} \;\rightleftharpoons\; H_{\text{complex}} \qquad (26)$$
Fig. 17 Some examples of well-known and widespread drugs: clopidogrel (antiplatelet, annual sales 6,057,000 USD), progesterone (contraception), caffeine (30 mg/espresso), nicotine (1 mg/cigarette), and Valium
Fig. 18 An example of a ligand–protein complex. The protein has a pocket, which is filled with a ligand
The docking algorithm is divided into steps as in other MM problems. As a first step, structures, here complexes, have to be generated. These structures have to be minimized, and duplicate structures have to be removed. Finally, a ranking has to be performed. In the example of a complex between a drug and an enzyme (here a protein), shown in Fig. 18, the ranking has been performed according to the energies from the DMFF. The generation of the complexes has been done with FlexX [56–58]. This program has special algorithms to generate possible complexes (poses) very quickly for a given molecule and enzyme. These simulations have been done on a benchmark of 147 known complexes. For each pair of drug and enzyme, 200 reasonable complexes have been generated. The experimental structure was added to each of the 147 sets to guarantee at least one correct structure among the complexes. With the program FlexCryst, the structures have been minimized by a simplex algorithm. The original pose has also been minimized, whereby the ligand is slightly shifted from its original position.
Fig. 19 The rmsd of the lowest ranked structure for 147 sets of complexes. In the background is the empirical scoring function of FlexX, in the foreground that of the DMFF
In Fig. 19, we show the results of the ranking (as histograms of rmsd) for the given complexes, obtained with the DMFF implemented in the program FlexCryst and with one of the typical scoring functions implemented in the program FlexX. One can see that, with the DMFF, the experimental structure was ranked first 89 times. This compares very well with the scoring function implemented in FlexX, which assigned the first rank to the experiment only 22 times. In the following 12 cases, the complexes obtained with the DMFF have a lower energy than the experiment itself, but are still placed very close to the experiment, with an error between 0 and 1 Å. The scoring function of FlexX favors structures with an error between 2 and 5 Å. The difference in the ranking results arises because of the different databases selected for the parametrization: the DMFF is determined from X-ray data, which are accurate to hundredths of an Å, whereas the scoring function of FlexX is based on the PDB. The data of the PDB are derived by NMR and are accurate only to within some Å.
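The evaluation behind Fig. 19 reduces to computing, for every complex, the rmsd between the top-ranked pose and the experimental ligand position and then histogramming these values. A minimal sketch, assuming matched atom ordering between the two coordinate sets, is given below.

```python
import numpy as np

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched (N, 3) coordinate sets."""
    diff = np.asarray(coords_a, dtype=float) - np.asarray(coords_b, dtype=float)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

def ranking_statistics(top_ranked_poses, experimental_poses, bins=(0.0, 1.0, 2.0, 5.0, 10.0)):
    """Histogram of the rmsd of the top-ranked pose for each complex (cf. Fig. 19)."""
    values = [rmsd(pose, ref) for pose, ref in zip(top_ranked_poses, experimental_poses)]
    counts, _ = np.histogram(values, bins=bins)
    return counts     # counts[0] = number of complexes reproduced to within 1 Angstrom
```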
4.2 Melting Points The phenomenon of the alternation of the melting points of alkanes has been known for over 100 years. A. Baeyer [2] mentioned that the increase in the melting points of fatty acids with increasing chain length is not monotonic: the melting points of chains with an even number of C atoms are relatively higher than those with an odd number. Later, it was found that this is true not only for the fatty acids, but for almost all alkanes, including α-substituted and α,ω-disubstituted alkanes. In the standard textbooks of organic chemistry, this is generally explained by packing effects. Nevertheless, the details of this phenomenon were not understood at the molecular level, and the existing computational methods were not able to simulate it because of a lack of proper parameterization.
Only in the last few years have crystals suitable for crystal structure determination been obtained, owing to the development of modern methods of crystal growth in capillaries [4]. Recently [5], the crystal structures of all alkanes from ethane to nonane were investigated, and a correlation between the densities and the melting points was found. It was noted that the densities in the investigated rows of alkanes alternate as well: compared with the "even" alkanes, the "odd" alkanes have a relatively lower density. Later, the same was found for the α,ω-alkanediols and α,ω-alkanediamines [62]. For these substances the lattice energy was also calculated, using the Dreiding force field. As expected, the "even" α,ω-disubstituted alkanes have a relatively higher lattice energy. If one splits the lattice energy into the energy of the hydrogen bridges and the van der Waals interaction, both terms again reflect the alternation. We used the DMFF to examine the melting point alternation in the investigated row of α,ω-dihaloalkanes [13]. In this case it was also easy to recognize the alternation of the melting points (see Fig. 20). The alternation of the lattice energy was most clearly pronounced for the bromides. For the diiodoalkanes the alternation of the lattice energy was also recognizable; however, diiodooctane does not fit into the scheme, by reason of its unusual crystal structure, and the melting points show no clear trend. The opposite situation is realized for the melting points and lattice energies of the dichlorides: the alternation is clearly marked for the melting points, but the lattice energy alternates only for short chains; for longer chains it stays almost constant.
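Such an alternation can be quantified very simply from tabulated data. The sketch below, with invented placeholder values (not the measured data behind Fig. 20), computes the deviation of each melting point from the mean of its two neighbours; a consistent sign flip between even and odd chain lengths signals the alternation discussed in the text.

```python
import numpy as np

# chain lengths n(C) and melting points in degrees C (placeholder values,
# not the experimental data shown in Fig. 20)
n  = np.arange(2, 11)
tm = np.array([-40.0, -52.0, -30.0, -46.0, -18.0, -38.0, -10.0, -30.0, -2.0])

def alternation(values):
    """Deviation of each point from the mean of its two neighbours.
    Alternating signs along the chain indicate even/odd alternation."""
    return values[1:-1] - 0.5 * (values[:-2] + values[2:])

dev = alternation(tm)
print(dict(zip(n[1:-1].tolist(), np.round(dev, 1).tolist())))
# consistently positive deviations at even n and negative at odd n
# (or vice versa) indicate melting point alternation
```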
Fig. 20 Melting points (Tmp in °C, top panel) and lattice energies (ΔH in kJ mol−1, bottom panel) of the α,ω-dihalogenalkanes (diiodo-, dibromo-, and dichloroalkanes) as a function of the number of carbon atoms n(C) (Carsten Schauerte)
5 Data Mining and Crystal Structure Prediction

This last section is dedicated to a challenging problem of crystal engineering, crystal structure prediction (CSP), and to the application of both of the above-mentioned data mining methods to existing problems in this field. The common algorithm for crystal structure prediction, usually comprising the steps of generation, minimization, and sorting of the crystal structures according to the energy or scoring function used (ranking), can be complemented by an estimation of the completeness of the crystal structure generation and by the removal of manifold (duplicate) crystal structures with the instruments of data mining. The first three steps are common to many applications of molecular mechanics, e.g., the ligand–protein docking and the conformational search mentioned above. The crystal structure generation is obviously always the first necessary step. The second step, the minimization of the lattice energy, is required because a system in equilibrium occupies a minimum of the energy; the minimized structures can then be ranked in the third step according to their final energies. The estimation of the completeness of the crystal structure generation and the removal of manifold crystal structures are not strictly necessary, but the data about the total number of minima and about manifold structures contain important information about the overall energy landscape, including information about the global minimum, i.e., the lowest energy of all the structures. Structures close in energy to the global minimum can be important as metastable polymorphs or as minor conformations of a molecule.
5.1 Generation of Crystal Structures

The generation of crystal structures is not a problem from the theoretical point of view; however, the different procedures differ greatly in speed and in the quality of the starting points. The purpose of crystal structure generation is to produce as few crystal structures of a given molecule as possible, while at least one of them must be close to the global minimum. If we compare this goal with the simplest conceivable algorithm, generating crystal structures randomly, it becomes clear that such an algorithm generates (indeed quickly) a huge number of crystal structures, but only very seldom a meaningful one. Hence, algorithms were developed in two directions: either to avoid generating pointless crystal structures, or to filter them out quickly. If we analyze the random generation algorithm, two basic kinds of senseless generated crystal structures become obvious. On the one hand, the contacts between neighboring molecules can be too short (the molecules even interlock); on the other hand, the molecules can lie arbitrarily far from each other in space. To avoid the generation of crystal structures with non-interacting molecules, several strategies have been developed. The oldest strategy is based on a consecutive construction of the crystal: first a dimer of molecules is formed; then tetramers (and subsequently n-mers) are formed by adding further molecules step by step, until the full crystal is defined [23].
The algorithm becomes especially fast if the n-mers (and consequently the crystal structures) are built with the boundary condition that at least one atom of one of two interacting molecules has to lie on an interaction surface of the other molecule. An interaction surface is defined here such that a potential atom placed on this surface has an especially strong interaction with the molecule [30]; these interaction surfaces therefore correspond to a part of the Connolly surface. The formation of a hydrogen bridge, an example of an especially strong interaction, is shown schematically in Fig. 21. Another approach is based on an estimate of the crystal density. Since the density can be estimated accurately [25], a random cell can be scaled to the correct volume, and the molecule is positioned randomly in this cell. This guarantees interaction between the molecule and its images. Crystal structures with unusually short distances between neighboring molecules are sorted out easily: the atomic distances between the molecules have to be checked, and if extremely short distances appear, the crystal structure is rejected. The generation is commonly restricted to a few space groups. The reason for this restriction lies in the frequency of occurrence of the different space groups. In Table 2 we show the frequency of occurrence of space groups in the CSD. One can see that
Fig. 21 Model of a hydrogen bond. The two interaction centers, oxygen and hydrogen, lie on the interaction surface of the other unit, forming a hydrogen bond

Table 2 The ten most frequent space groups in the CSD

Rank  Space group no.  Space group name  Frequency  % of CSD
1     14               P21/c             151,852    35.1
2     2                P1̄                98,645     22.8
3     19               P212121           34,708     8.0
4     15               C2/c              34,622     8.0
5     4                P21               23,733     5.5
6     61               Pbca              15,246     3.5
7     33               Pna21             6,177      1.4
8     62               Pnma              5,439      1.3
9     9                Cc                4,653      1.1
10    1                P1                4,181      1.0
nearly 80% of the known crystal structures belong to only five space groups, and the first ten groups already cover about 90% of all known crystal structures. Alternative approaches to searching in the most popular space groups are to generate P1 unit cells with varying numbers of molecules in the unit cell [32], or to consider all 230 space groups [52].
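A minimal sketch of the density-based generation strategy described above is given below. The contact cutoff of 2.0 Å and the sampling ranges for the cell constants are illustrative assumptions, not the values used in FlexCryst, and the estimated density would in practice come from a method such as that of [25].

```python
import numpy as np

def random_cell_scaled_to_density(mol_mass, n_mol, density, rng):
    """Draw random cell lengths and angles, then rescale the cell so that its
    volume reproduces the estimated crystal density (density in g cm^-3,
    mol_mass in atomic mass units, n_mol molecules per cell)."""
    lengths = rng.uniform(3.0, 20.0, size=3)              # a, b, c in Angstrom
    angles = np.radians(rng.uniform(60.0, 120.0, size=3)) # alpha, beta, gamma
    ca, cb, cg = np.cos(angles)
    volume = np.prod(lengths) * np.sqrt(
        1.0 - ca**2 - cb**2 - cg**2 + 2.0 * ca * cb * cg)
    target = n_mol * mol_mass * 1.66054 / density          # target volume in A^3
    lengths *= (target / volume) ** (1.0 / 3.0)
    return lengths, np.degrees(angles)

def too_short_contacts(intermolecular_distances, cutoff=2.0):
    """Reject structures whose shortest intermolecular atom-atom distance
    falls below the cutoff (interlocking molecules)."""
    return np.min(intermolecular_distances) < cutoff

rng = np.random.default_rng(1)
print(random_cell_scaled_to_density(mol_mass=180.2, n_mol=4, density=1.3, rng=rng))
```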
5.2 Minimization

Many well-examined algorithms are available for the minimization of the generated crystal structures. In molecular modeling, gradient algorithms are used most often for the minimization. The existing procedures differ in the highest order of derivatives that is taken into consideration. The oldest optimization technique, going back to Gauss, is known as steepest descent; it considers the first-order derivatives. Another algorithm is Newton–Raphson, which is of second order. Procedures of higher order are more expensive to program and often offer no advantage in computing speed. The higher the order of the chosen procedure, the larger the step width during the minimization; on the other hand, the expenditure required for the calculation of the derivatives rises. This applies to both numerically and analytically calculated derivatives. Often the analytical expressions are so complicated that no further computational advantage can be achieved. Gradient-based algorithms require a certain smoothness of the function and fail completely for δ-functions: for example, an upright needle on a table cannot be found by a gradient method. A promising method is the simplex method [49], which is implemented in FlexCryst. It is of zeroth order and does not need gradients at all; the step adjusts automatically to the smoothness of the function during the minimization. In the simplex method, the functional values are first calculated for n + 1 points on an n-dimensional surface. Afterwards, the point with the highest functional value is reflected through the hyperplane defined by the other points, and the functional value of this reflected point is calculated. If the new value is smaller than the previous one, the move is extended; if the functional value of the reflected point is higher in energy, the move is shortened. Continuing this procedure results in an arbitrarily small simplex; hence, the procedure is stopped when the volume of the simplex falls below a certain limit. Further algorithms for minimization are simulated annealing and genetic algorithms, which do not need derivatives either. The genetic algorithm has been tested for crystal structure prediction [45]. Simulated annealing, although very common in molecular mechanics, does not seem promising for crystal structure prediction: the energy surface contains a large number of local minima (polymorphs) with similar energy, these minima are separated by large energy barriers, and to pass from one minimum to the next the crystal structure has to be destroyed completely. These conditions are opposite to the requirements for simulated annealing.
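For illustration, a derivative-free Nelder–Mead simplex minimization [49] can be run with SciPy as sketched below; the `lattice_energy` function is a toy stand-in, not the DMFF, and the tolerances and starting vector are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize

def lattice_energy(x):
    """Toy stand-in for a lattice-energy function of the structural
    parameters x (cell constants, molecular position and orientation)."""
    return np.sum((x - 1.0) ** 2) + 0.1 * np.sum(np.cos(5.0 * x))

x0 = np.zeros(6)  # arbitrary starting structure parameters
result = minimize(lattice_energy, x0, method="Nelder-Mead",
                  options={"xatol": 1e-4, "fatol": 1e-6, "maxiter": 5000})
print(result.x, result.fun)
```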
5.3 Ranking of Crystal Structures with Data Mining Potentials

The ranking of crystal structures, which follows the minimization, is normally performed according to the estimated lattice energies. The methods for lattice energy calculation currently vary widely: from simple empirical Lennard-Jones [6, 66] and Buckingham potentials [67], through the more complicated multipole descriptions [3], to ab initio calculations. A purely ab initio approach can hardly describe intermolecular interactions, and only very time-consuming multireference configuration interaction calculations estimate the intermolecular interaction correctly. Therefore it becomes necessary to add a correction term to ab initio descriptions. An explicit correction of electron–electron interactions was calculated in Coulomb-hole Hartree–Fock [12], and in density functional theory a large number of correlation functionals has been developed. Despite these improvements, significant difficulties remain in using density functional theory for a proper description of intermolecular interactions, especially of van der Waals (dispersion) forces. Therefore, in the recent ab initio approach [50], empirical van der Waals corrections were added as well. Since, in the end, nearly all feasible methods of lattice energy calculation need an empirical correction for the van der Waals interaction, empirical potentials are of high importance. The empirical potentials obtained by the data mining approach (see section "Parametrization") can be parametrized even in cases where the lack of thermodynamic data does not allow the derivation of parameters in the traditional way. Data mining is an excellent method to obtain parameters based only on structural data. For this purpose, an overdetermined equation system has to be written. For simple models, the resulting equation system can be solved by singular value decomposition and, for more complex models, by recursive fitting. Another advantage of the applied method is that the parametrization can be obtained for all kinds of atoms available in the structural data bases. In the described case, the atomic numbers have been chosen as descriptors, and only the hydrogen atom was handled exceptionally, with regard to the hydrogen bonds: hydrogen atoms have been split into four types – hydrogen bound to carbon, hydrogen bound to nitrogen, hydrogen bound to oxygen, and all other hydrogen atoms. Further improvements of the DMFF will involve splitting the other atoms into different types as well.
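As a minimal sketch of the least-squares step just mentioned: an overdetermined linear system A·p = b (rows = structural observations, columns = potential parameters) can be solved by singular value decomposition. The matrix and right-hand side below are random placeholders, not real contact statistics from a structural data base.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 12))   # 500 observations, 12 pair-potential parameters
b = rng.normal(size=500)         # placeholder target values

# solve the overdetermined system in the least-squares sense via SVD
U, s, Vt = np.linalg.svd(A, full_matrices=False)
tol = 1e-10 * s.max()
s_inv = np.where(s > tol, 1.0 / s, 0.0)   # discard near-singular directions
params = Vt.T @ (s_inv * (U.T @ b))

# equivalently: params, *_ = np.linalg.lstsq(A, b, rcond=None)
```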
5.4 Completeness of Structure Generation

The estimation of the completeness of the structure generation is a natural follow-up step when crystal structures are generated randomly. The tools of data mining (namely cluster analysis) allow one to check whether the crystal structure generation is complete and, accordingly, to estimate the total number of minima on the energy landscape and to derive the completeness of the polymorph search. For this, after minimization, the generated crystal structures have to be clustered in order to estimate
the numbers of singlets $n_1^{\mathrm{cluster}}$, doublets $n_2^{\mathrm{cluster}}$, triplets $n_3^{\mathrm{cluster}}$, and so on up to clusters of higher cardinality. If we normalize these numbers $n_k^{\mathrm{cluster}}$ by the total number of minima $n^{\mathrm{minima}}$ (which, of course, is unknown), one obtains the probability $P_k$ of finding a cluster of a certain cardinal $k$. This distribution, assuming an equally distributed randomness of finding a certain minimum, obeys the Poisson equation:

\[ P_k = n_k^{\mathrm{cluster}} / n^{\mathrm{minima}} = \frac{\lambda^k}{k!}\, e^{-\lambda} \tag{27} \]

In this formula, λ is the number of generated crystal structures divided by the total number of minima $n^{\mathrm{minima}}$. If the probabilities of two different cardinals are divided, the equation can be solved for λ and expressed in terms of the cluster frequencies:

\[ \lambda = 2\, \frac{n_2^{\mathrm{cluster}}}{n_1^{\mathrm{cluster}}} \tag{28} \]

Once λ has been obtained, the percentage of missed local minima can easily be calculated from the Poisson distribution:

\[ P_0 = e^{-\lambda} \tag{29} \]

The percentage of covered (generated) minima, $P_{\mathrm{coverage}}$, is the complement of the probability of missed minima and is obtained by:

\[ P_{\mathrm{coverage}} = \sum_{k=1}^{\infty} P_k = 1 - P_0 \tag{30} \]

In Fig. 22, the progress of the search is illustrated for the sample case of 1-aza-2-cyclobutanone (refcode FEPNAP in the CSD), a small molecule determined by Yang et al. [69]. In the depicted case, a crystal structure prediction was performed; for simplicity, the calculation was restricted to the four most frequent space groups. On the time scale from 40 to 320 s one can see the change in the numbers of crystal structures which belong to a singlet, a doublet, a triplet, or to a cluster of higher cardinality. After 320 s, the clustering of the predicted polymorphs divides the structures into 25 singlets, 11 doublets, 7 triplets, and 9 clusters of higher cardinality. Insertion of these values into (30) results in a total coverage of 59%.
5.5 Screening of Manifold Crystal Structures

When the steps of crystal structure generation and minimization are completed, the next step is to get rid of the manifold (multiply generated) structures. Manifold crystal structures appear as a result of the random generation algorithm and present a notorious problem for CSP. As we have shown in the section above, the generated crystal structures are distributed over the different local minima approximately according
Fig. 22 During the CSP, the generated crystal structures are periodically inspected (here after 40, 80, 160, and 320 s); the bars show the percentages of structures belonging to singlets, doublets, triplets, and higher clusters. In time, the ratio between the singlets and doublets diminishes, indicating the progress of the coverage of the search space
to the Poisson distribution. The closer the generation comes to full coverage, the more frequently manifold generated polymorphs appear. Similarly to the cluster analysis of crystal structures in data bases described in Sect. 2.3, which helps to recognize manifold crystal structures, cluster methods can be used to screen the manifold generations of predicted crystal structures. For this purpose, the similarity matrix for the low-ranking (low-energy) crystal structures is calculated and the crystal structures are ordered in a dendrogram according to the similarity index. Then, clusters are formed below an estimated maximum similarity index (the threshold, red line in the figure). Finally, from each cluster found, just one crystal structure is retained and the other structures of the cluster are discarded. Such methods were first described by van Eijck and Kroon [20]. For illustration, we show in Fig. 23 the clustering of the prediction results for the sample molecule 1-aza-2-cyclobutanone (FEPNAP) after 80 s. For simplicity of the presentation, we restricted the prediction to one space group (P1̄) and to a very short calculation time (80 s). After 80 s, 29 polymorphs had been generated. The experimental crystal structure and the minimized experimental crystal structure were added to the set for clustering (31 structures in total). The threshold for cluster formation depends on the tolerance of the minimization. In the illustrated case the accuracy of the minimization is very high and the threshold takes a very low value; as a result, the experimental and the minimized experimental crystal structures fall into different clusters. In total, with the chosen threshold, one cluster with five structures, three doublets, and 20 singlet clusters have been formed. The cluster with the lowest energy ranks contains five members (ranks one, two, three, six, and ten) with energies from 54.71 to 53.19 kJ mol−1. The rank most similar to the minimized experimental structure (54.80 kJ mol−1) was found to be rank number four (53.87 kJ mol−1). The other two doublets comprise ranks 8, 12 and 21, 23, respectively. After the clustering, the manifold crystal structures are removed, and the total
Fig. 23 Clustering of predicted polymorphs. From each cluster only one representative is retained
Fig. 24 Comparison between experimental structure and Rank 2 (in green) of the predicted crystal structures for 1-Aza-2-cyclobutanone
number of polymorphs is reduced to 22 structures. After the screening, the dendrogram no longer shows any clusters below the selected threshold. From the first cluster, only structure number one, with the lowest energy (54.72 kJ mol−1), is retained. The minimized experimental structure is most similar to rank number two (number four before clustering). In Fig. 24, eight molecules of the experimental structure are superimposed on eight molecules of the predicted polymorph number two. From this picture it becomes clear that both structures are essentially identical; the correlation between the two structures is 90%. The molecules in the crystals are held together in typical dimers via N–H...O hydrogen bonds. This structural feature, inherent in nearly all molecules containing an NH–CO group, is easily reproduced by the DMFF.
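A minimal sketch of the screening step is given below, assuming a precomputed pairwise similarity matrix between the predicted structures. The conversion of similarity to distance, the single-linkage clustering, the threshold value, and the "keep the lowest-energy member" criterion are illustrative assumptions, not the FlexCryst settings.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def screen_manifold(similarity, energies, threshold=0.05):
    """Cluster structures whose pairwise distance (1 - similarity) falls
    below the threshold and keep one representative per cluster (here the
    member with the minimum of the supplied energy values; adapt the
    criterion to the energy convention in use)."""
    distance = 1.0 - np.asarray(similarity, dtype=float)
    np.fill_diagonal(distance, 0.0)
    condensed = squareform(distance, checks=False)
    tree = linkage(condensed, method="single")
    labels = fcluster(tree, t=threshold, criterion="distance")
    keep = []
    for label in np.unique(labels):
        members = np.where(labels == label)[0]
        keep.append(int(members[np.argmin(np.asarray(energies)[members])]))
    return sorted(keep)   # indices of the retained, manifold-free structures
```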
Comparison with the Experimental Structure

A comparison of the results of crystal structure prediction with experimental data remains important as long as CSP is not 100% successful and additional information is necessary to select the actually existing structure from the list of predicted polymorphs. The usual application is to assess the performance of algorithms developed for CSP and to provide an objective picture of the status of the field in the blind tests organized by the Cambridge Structural Database. A blind test normally involves a set of molecules with unknown (prior to the performance of the predictions)
crystal structures. Traditionally, the rms deviations of a selected set of molecules, the moments of inertia, or the symmetry operations of the crystal structures were used for evaluating the results of a prediction [9, 19, 20]. For this evaluation, the use of the cluster method with the similarity index working in reciprocal space, introduced in [28] and Sect. 2.2, allowed the calculation times to be reduced by an order of magnitude in the last blind test. Besides the simple task of finding, among the list of polymorphs, the crystal structure most similar to the experiment, clustering gives further information about the quality of the force field for a given problem: if the predicted structures group far away from the experimental structure, this indicates a systematic error in the force field. The other application, mentioned in Sect. 2.4, is crystal structure determination from unindexed powder diagrams. This opens up the possibility of solving a crystal structure even if other X-ray methods fail.
6 Conclusions

In this chapter we have highlighted mainly two current applications of data mining in organic crystallography: cluster analysis and support vector machines (SVM). The SVMs are used to find errors in the data bases and to derive force fields without any hypotheses about the functional form. Since the accuracy of the force fields derived by data mining grows with the number of known crystal structures, this approach should be favored in the long term. The second method, clustering, has been introduced in this field only very recently. An obvious application is the screening of data bases to remove undesired repetitions of crystal structures; this is important for virtual as well as for experimental data bases. Its application is also interesting in crystal structure determination, where it can be used to find isostructural crystal structures. Beyond this simple application, the knowledge of regularities among isostructural crystal structures gives very valuable information for crystal engineering. A third method, principal component analysis, might become more important in the future, as it is already in use in inorganic crystallography.
References

1. Allen FH, Kennard O (1993) 3D search and research using the Cambridge Structural Database. Chem Des Autom News 8:31–37
2. Baeyer A (1877) Ber Chem Ges 10:1286
3. Beyer T, Day GM, Price SL (2001) The prediction, morphology, and mechanical properties of the polymorphs of paracetamol. J Am Chem Soc 123:5086–5094
4. Boese R, Nussbaumer M (1994) Transformations and interactions in organic crystal chemistry. International Union of Crystallography
5. Boese R, Weiss HC, Bläser D (1999) Die Schmelzpunktalternanz der kurzkettigen n-Alkane: Einkristallröntgenstrukturanalysen von Propan bei 30 K und von n-Butan bis n-Nonan bei 90 K. Angew Chem 111:1042–1045
6. Brooks BR, Bruccoleri RE, Olafson BD, States DJ, Swaminathan S, Karplus M (1983) CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. J Comp Chem 4:187–217
7. Bruno IJ, Cole JC, Lommerse JPM, Rowland RS, Taylor R, Verdonk M (1997) IsoStar: a library of information about nonbonded interactions. J Comp-Aided Mol Design 11:525–537
8. Buckingham RA (1938) The classical equation of state of gaseous helium, neon, and argon. Proc Phys Soc London A 168:264–283
9. Chisholm J, Motherwell WDS (2005) Compack: A program for identifying crystal structure similarity using distances. J Appl Cryst 38:228–231
10. Clark M, Cramer RD III, van Opdenbosch N (1989) Validation of the general purpose Tripos 5.2 force field. J Comp Chem 10:982–1012
11. Clarke J, Smith W, Woodcock L (1986) Short range effective potentials for ionic fluids. J Chem Phys 84:2290–2294
12. Clementi E, Hofmann DWM (1995) Coulomb-hole Hartree-Fock functional for molecular systems. J Mol Struc (Theochem) 330:17–31
13. Schauerte C, Buchsbaum C, Fink L, Hofmann DWM, Schmidt MU, Knipping J, Boese C (2005) Crystal structures of trans- and cis-octenes. Acta Cryst A61:C290–C291
14. Day GM, Cooper TG, Cabeza AJC, Hejczyk KE, Ammon HL, Boerrigter SXM, Tan J, Valle RGD, Venuti E, Jose J, Gadre SR, Desiraju GR, Thakur TS, van Eijck BP, Facelli JC, Bazterra VE, Ferraro MB, Hofmann DWM, Neumann MA, Leusen FJJ, Kendrick J, Price SL, Misquitta AJ, Karamertzanis PG, Welch GWA, Scheraga HA, Arnautova YA, Schmidt MU, van de Streek J, Wolf A, Schweizer B (2009) Significant progress in predicting the crystal structures of small organic molecules: a report on the fourth blind test. Acta Cryst B65:107–125
15. Day GM, Motherwell WDS, Ammon HL, Boerrigter SXM, Valle RGD, Venuti E, Dzyabchenko A, van Eijck BP, Erk P, Facelli JC, Bazterra VE, Ferraro MB, Hofmann DWM, Leusen FJJ, Liang C, Pantelides CC, Karamertzanis PG, Price SL, Lewis TC, Torrissi A, Nowell H, Scheraga H, Arnautova Y, Schmidt MU, Schweizer B, Verwer P (2005) A third blind test of crystal structure prediction. Acta Cryst B61:511–527
16. Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley & Sons, New York
17. van Duin A, Dasgupta S, Lorant F, Goddard WA III (2001) ReaxFF: A reactive force field for hydrocarbons. J Phys Chem A 105:9396–9409
18. Dunitz JD (2003) Are crystal structures predictable? Chem Commun:545–548
19. Dzyabchenko A (1994) Method of crystal-structure similarity searching. Acta Cryst B 50:414–425
20. van Eijck BP, Kroon J (1998) Fast clustering of equivalent structures in crystal structure prediction. J Comp Chem 18:1036–1042
21. van Eijck BP, Mooij WTM, Kroon J (1995) Attempted prediction of the crystal structure of monosaccharides. Acta Cryst B 51:99
22. Filippini G, Gavezzotti A (1993) Empirical intermolecular potentials for organic crystals: the '6-exp' approximation, revisited. Acta Cryst B 49:868–880
23. Gavezzotti A (1991) Generation of possible crystal structures from the molecular structure for low-polarity organic compounds. J Am Chem Soc 113:4622–4629
24. Gavezzotti A (1994) Are crystal structures predictable? Acc Chem Res 27:309–314
25. Hofmann DWM (2002) Fast estimation of crystal densities. Acta Cryst B 58:489–493
26. Hofmann DWM, Apostolakis J (2003) Crystal structure prediction by data mining. J Mol Struc (Theochem) 647:17–39
27. Hofmann DWM, Kuleshova LN (2005) New force field for molecular simulation and crystal design developed based on the data mining method. Cryst Rep 50:335–337
28. Hofmann DWM, Kuleshova LN (2005) A new similarity index for crystal structure determination from X-ray powder diagrams. J Appl Cryst 38:861–866
29. Hofmann DWM, Kuleshova LN (2006) A method for automated determination of the crystal structure from X-ray powder diffraction data. Cryst Rep 51:452–460
30. Hofmann DWM, Lengauer T (1997) A discrete algorithm for crystal structure prediction of organic molecules. Acta Cryst A53:225–235
31. Hofmann DWM, Kuleshova LN, D'Aguanno B (2007) A new reactive potential for the molecular dynamics simulation of liquid water. Chem Phys Lett 448:138–143
32. Pillardy J, Arnautova YA, Czaplewski C, Gibson KD, Scheraga HA (2001) Conformation-family Monte Carlo: A new method for crystal structure prediction. Proc Nat Acad Sci USA 98:12351–12356
33. Karfunkel HR, Rohde B, Leusen FJJ, Gdanitz RJ, Rihs G (1993) Continuous similarity measure between nonoverlapping X-ray powder diagrams of different crystal modifications. J Comp Chem 14:1125
34. Kempster CJE, Lipson H (1972) Rapid method of assessing the number of molecules in the unit cell of an organic crystal. Acta Cryst B 28:3674–3674
35. Kitaigorodskii AJ (1961) Organic chemical crystallography. Consultants Bureau, New York
36. Lennard-Jones JE (1931) Cohesion. Proc Phys Soc 43:461–482
37. Lommerse JPM, Motherwell WDS, Ammon HL, Dunitz JD, Gavezzotti A, Hofmann DWM, Leusen FJJ, Mooij WTM, Price SL, Schweizer B, Schmidt MU, van Eijck BP, Verwer P, Williams DE (2000) A test of crystal structure prediction of small organic molecules. Acta Cryst B56:697–714
38. Looijenga-Vos A, Buerger MJ (2006) Determination of space groups. In: International Tables for Crystallography Volume A: Space-Group Symmetry, International Union of Crystallography, pp 44–54
39. Maiorov VN, Crippen GM (1992) Contact potential that recognizes the correct folding of globular proteins. J Mol Bio 227:876–888
40. Rappe AK, Casewit CJ, Colwell KS, Goddard WA III, Skiff WM (1992) UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations. J Am Chem Soc 114:10024–10035
41. Mayo SL, Olafson BD, Goddard WA III (1990) Dreiding: A generic force field for molecular simulations. J Phys Chem 94:8897–8909
42. McGuire R, Momany F, Scheraga H (1972) Energy parameters in polypeptides. V. An empirical hydrogen bond potential function based on molecular orbital calculations. J Phys Chem 76:375–393
43. Mighell AD, Hubbard CR, Stalick JK, Santoro A, Snyder RL, Holomany M, Seidel J, Lederman S (1987) NBS*AIDS83, crystal data and JCPDS editor program, copyright 1987, U.S. Department of Commerce. National Bureau of Standards (USA), Gaithersburg
44. Momany FA, Carruthers LM, McGuire RF, Scheraga HA (1974) Intermolecular potentials from crystal data. III. Determination of empirical potentials and application to the packing configurations and lattice energies in crystals of hydrocarbons, carboxylic acids, amines, and amides. J Phys Chem 78:1595–1620
45. Motherwell W (2001) Crystal structure prediction and the Cambridge Structural Database. Mol Cryst Liq Cryst Sci Tech Mol Cryst Liq Cryst 356:559–567
46. Motherwell WDS, Ammon HL, Dzyabchenko A, Erk P, Dunitz JD, Gavezzotti A, Hofmann DWM, Leusen FJJ, Lommerse JPM, Mooij WTM, Price SL, Scheraga H, Schweizer B, Schmidt MU, van Eijck BP, Verwer P, Williams DE (2002) Crystal structure prediction of small organic molecules: a second blind test. Acta Cryst B 58:647–661
47. Murray-Rust P (2008) Open data in science. Ser Rev 34:52–64
48. Vendruscolo M, Najmanovich R, Domany E (2000) Can a pairwise contact potential stabilize native protein folds against decoys obtained by threading? Proteins: Struct, Funct Genet 38:134–140
49. Nelder J, Mead R (1964) A simplex method for function minimization. Comput J 7:308–313
50. Neumann M, Perrin M (2005) Energy ranking of molecular crystals using density functional theory calculations and an empirical van der Waals correction. J Phys Chem B 109:15531–15541
51. Neumann MA (2008) Tailor-made force fields for crystal-structure prediction. J Phys Chem B 112:9810–9829
52. Neumann MA, Leusen FJJ, Kendrick J (2008) A major advance in crystal structure prediction. Angew Chem Int Ed 47:2427–2430
53. Price SL (2004) Quantifying intermolecular interactions and their use in computational crystal structure prediction. Cryst Eng Commun 6:344–353
54. Putz H, Schön JC, Jansen M (1999) Combined method of ab initio structure solution from powder diffraction data. J Appl Cryst 32:864–870
55. Rappe AK, Casewit CJ, Colwell KS, Goddard WA III, Skiff WM (1992) UFF, a rule-based full periodic table force field for molecular mechanics and molecular dynamics simulations. J Am Chem Soc 114:10024–10035
56. Rarey M, Kramer B, Lengauer T, Klebe G (1996) A fast flexible docking method using an incremental construction algorithm. J Mol Biol 261:470–489
57. Rarey M, Wefing S, Lengauer T (1996) Placement of medium-sized molecular fragments into the active site of proteins. J Comp-Aided Mol Des 10:41–54
58. Rarey M, Kramer B, Lengauer T (1997) Multiple automatic base selection: Protein–ligand docking based on incremental construction without manual intervention. J Comp-Aided Mol Design 11:369–384
59. Schmidt MU, Buchsbaum C, Schnorr JM, Hofmann DWM, Ermrich M (2007) Pigment Orange 5: crystal structure determination from a non-indexed X-ray powder diagram. Z Krist 222:30–33
60. Slater J, Kirkwood JG (1931) The van der Waals forces in gases. Phys Rev 37:682–697
61. Sun H, Ren P, Fried J (1998) The COMPASS force field: Parameterization and validation for phosphazenes. Comput Theor Polym Sci 8:229–246
62. Thalladi VR, Boese R, Weiss HC (2000) The melting point alternation in α,ω-alkanediols and α,ω-alkanediamines: interplay between hydrogen bonding and hydrophobic interactions. Angew Chem Int Ed Engl 39:918–922
63. Threlfall TL (1995) Analysis of organic polymorphs: A review. Analyst 120:2435
64. Verwer P, Leusen F (1998) Computer simulation to predict possible crystal polymorphs. Rev Comp Chem 12:327–365
65. Wawak RJ, Gibson KD, Liwo A, Scheraga HA (1996) Theoretical prediction of a crystal structure. Proc Natl Acad Sci USA 93:1743
66. Weiner SJ, Kollmann PA, Nguyen DT, Case DA (1986) An all atom force field for simulations of proteins and nucleic acids. J Comp Chem 7:230–252
67. Williams DE (1996) Ab initio molecular packing analysis. Acta Cryst A 52:326–328
68. Willock DJ, Price SL, Leslie M, Catlow CRA (1995) The relaxation of molecular crystal structures using a distributed multipole electrostatic model. J Comp Chem 16:628–647
69. Yang Q, Seiler P, Dunitz J (1987) β-Propiolactam (1-aza-2-cyclobutanone) at 170 K. Acta Cryst C 43:565–567
70. Zimmerman SC (1997) Putting molecules behind bars. Science 276:543
Struct Bond (2010) 134:135–167 DOI:10.1007/430 2009 3 © Springer-Verlag Berlin Heidelberg 2009 Published online: 1 September 2009
Data Mining for Protein Secondary Structure Prediction Haitao Cheng, Taner Z. Sen, Robert L. Jernigan, and Andrzej Kloczkowski
Abstract Accurate protein secondary structure prediction from the amino acid sequence is essential for almost all theoretical and experimental studies on protein structure and function. After a brief discussion of the application of data mining to the optimization of crystallization conditions for target proteins, we show that data mining of structural fragments of proteins from known structures in the protein data bank (PDB) significantly improves the accuracy of secondary structure predictions. The original method was proposed by us a few years ago and was termed fragment database mining (FDM) (Cheng H, Sen TZ, Kloczkowski A, Margaritis D, Jernigan RL (2005) Prediction of protein secondary structure by mining structural fragment database. Polymer 46:4314–4321). This method gives excellent accuracy for predictions if similar sequence fragments are available in our library of structural fragments, but is less successful if such fragments are absent from the fragment database. Recently we have improved secondary structure predictions further by combining FDM with classical GOR V predictions (Kloczkowski A, Ting KL, Jernigan RL, Garnier J (2002a) Combining the GOR V algorithm with evolutionary information for protein secondary structure prediction from amino acid sequence. Proteins 49:154–66; Sen TZ, Jernigan RL, Garnier J, Kloczkowski A (2005) GOR V server for protein secondary structure prediction. Bioinformatics 21:2787–8) to form a combined method, the so-called consensus database mining (CDM)
(Sen TZ, Cheng H, Kloczkowski A, Jernigan RL (2006) A Consensus Data Mining secondary structure prediction by combining GOR V and Fragment Database Mining. Protein Sci 15:2499–506). FDM mines the structural segments of the PDB and utilizes structural information from the matching sequence fragments for the prediction of protein secondary structures. By combining it with the GOR V secondary structure prediction method, which is based on information theory and Bayesian statistics, coupled with evolutionary information from multiple sequence alignments (MSA), our CDM method guarantees improved accuracies of prediction. Additionally, with the constant growth in the number of new protein structures and folds in the PDB, the accuracy of the CDM method is clearly expected to increase in the future. We have developed a publicly available CDM server (Cheng H, Sen TZ, Jernigan RL, Kloczkowski A (2007) Consensus Data Mining (CDM) Protein Secondary Structure Prediction Server: combining GOR V and Fragment Database Mining (FDM). Bioinformatics 23:2628–30) (http://gor.bb.iastate.edu/cdm) for protein secondary structure prediction.

Keywords: Data mining · Protein structure prediction · Secondary structure prediction · Protein crystallography · Structural fragments · Fragment database mining · Consensus database mining · Crystallization data mining

H. Cheng, R.L. Jernigan, and A. Kloczkowski (✉)
Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA 50011, USA, and L. H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA 50011–3020, USA
e-mail: [email protected]

T.Z. Sen
1025 Crop Genome Informatics Laboratory, Ames, IA 50011, USA, and Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA

Contents
1 Introduction 137
2 Crystallization Data Mining in Protein X-Ray Structure Determination 140
3 General Overview of Data Mining Methods in Biology 142
4 Measures of Quality of Prediction 143
5 Review of Protein Secondary Structure Prediction Methods 145
6 GOR V Method 148
7 Fragment Database Mining Method 149
8 Consensus Data Mining Method 155
9 Secondary Structure Prediction Servers Based on GOR V and Consensus Data Mining Methods 160
10 Discussion 162
References 163

Abbreviations
AA Amino acid
AI Artificial intelligence
BLAST Basic local alignment search tool
BLOSUM Blocks of amino acid substitution matrix
BMCD Biological macromolecular crystallization database
C Coil
CASP Critical assessment of techniques for protein structure prediction
CB513 Cuff and Barton dataset of 513 sequences
CDM Consensus data mining
DSSP Dictionary of secondary structure assignments
DT Decision tree
E Extended (β-strand)
EM Electron microscopy
EMBL European molecular biology laboratory
FDM Fragment data mining
FN False negative
FP False positive
GOR Garnier Osguthorpe Robson method
H Helix
H/P Hydrophobic/Polar
JCSG Joint center for structural genomics
MSA Multiple sequence alignment
NIH National institutes of health
NMR Nuclear magnetic resonance
NN Neural network
PAM Percent accepted mutation
PDB Protein data bank
PSI-BLAST Position specific iterated basic local alignment search tool
SOV Segment overlap coefficient
SVM Support vector machine
TN True negative
TP True positive
1 Introduction

The prediction of protein structure, and of the related protein function, from the amino acid sequence is one of the most important problems in modern bioinformatics. Recently, with the completion of many large-scale genome sequencing projects, this problem has become even more important, since the rapidly growing number of protein sequences requires more and more structure predictions. Genome sequencing provides a huge amount of amino acid sequence data, while the corresponding structural information is much more difficult to obtain experimentally. The gap between the number of known protein amino acid sequences and the number of known structures in the protein data bank (PDB) [1] constantly and rapidly increases. Some methods, such as homology modeling or threading, are useful but not always feasible, making major advances in protein structure prediction from sequences of the utmost importance. Although the prediction of tertiary structure is one of the ultimate goals of protein science, the prediction of secondary structure from sequences remains a more feasible intermediate step in this direction. Furthermore, knowledge of the secondary structure can usually serve as input for the prediction of tertiary structure. Instead of predicting the full three-dimensional structure, it is
Fig. 1 Secondary structure elements: the α-helix (a), β-strands forming parallel (b) and antiparallel (c) β-sheets
much easier to predict the simplified aspects of structure, namely the key structural elements of the protein and the location of these elements along the protein amino acid sequence. This reduces the complex three-dimensional problem to a much simpler one-dimensional problem. The fundamental, and most frequently recurring, elements of the secondary structure of proteins are α-helices, β-strands, coils, and turns. The α-helix is a regularly coiled conformation that resembles a spring (Fig. 1a). The structure of the α-helix is stabilized by hydrogen bonds formed between the backbone N–H groups and the backbone C=O groups that are four residues apart along the sequence. The α-helices may be either right-handed or left-handed, although left-handed helices are quite rare. Atoms in α-helices are densely packed, with each amino acid corresponding to a 100° turn of the helix. As a result there are 3.6 residues per turn, and the distance between two turns is 5.4 Å. The first geometrically correct model of the α-helix was proposed by Linus Pauling and his collaborators in 1951 [2, 3]. Other frequently observed secondary structure elements in proteins are β-strands. A β-strand is a conformation of a fragment of the amino acid chain in which the peptide backbone is almost completely extended. Additionally, β-strands are usually shorter than α-helices: their typical length is 5–10 residues, whereas α-helices, especially transmembrane ones, can be several times longer. Most β-strands are located adjacent to other β-strands and form β-sheets stabilized by interstrand hydrogen bonds between the N–H groups of one β-strand and the C=O groups of another β-strand. If two β-strands have the same direction (defined by the vector between the terminal atoms N and C), the resulting β-sheet is parallel (Fig. 1b). If the directions of two β-strands are opposite, they form an antiparallel β-sheet (Fig. 1c). The simplest case of such an antiparallel arrangement is the β-hairpin, where two antiparallel strands are connected by a very short loop. Turns are elements of protein secondary structure which occur when two Cα atoms of residues that do not belong to α-helices or β-strands approach each other closely (less than 7 Å). The remaining unstructured, more irregular part of the protein structure is called coil.
All these elements of secondary structure can be easily observed by crystallographers in the three-dimensional structures of proteins in the PDB, and visual assignment of the secondary structures was used by crystallographers until the mid-1980s. However, such visual observation is highly subjective, and a more rigorous definition of the various elements of protein secondary structure from the atomic coordinates in the PDB became a necessity. In 1983 Kabsch and Sander [4] developed a classification of the elements of secondary structure based mostly on the location of hydrogen bonds between the backbone carbonyl (C=O) and N–H groups in proteins. They developed the dictionary of secondary structure assignments (DSSP). The DSSP algorithm is now widely used in protein science to assign the secondary structure from atomic coordinates, and its application assures a uniform definition of secondary structures. They also developed the DSSP server at EMBL in Heidelberg, with all proteins in the PDB having been given DSSP assignments (ftp://ftp.embl-heidelberg.de/pub/databases/dssp/). Although DSSP is the most popular method, there are alternative assignment methods, such as STRIDE [5]. According to the DSSP classification there are eight types of secondary structure assignment, denoted by letters: H (α-helix), E (extended β-strand), G (310 helix), I (π-helix), B (bridge, a single residue β-strand), T (β-turn), S (bend) and C (coil). Coil is defined as a structural element which does not belong to any of the other seven classes. The 310 helix differs from a regular α-helix in that each amino acid corresponds to a 120° turn of the helix, instead of the standard value of 100°. Because of this, the 310 helix has three residues per turn, whereas the α-helix has 3.6. The hydrogen bonds in the 310 helix are formed between the ith and (i + 3)rd residues, while in the α-helix hydrogen bonds occur between the ith and (i + 4)th amino acids. The π-helix is a rare form of helix, in which the hydrogen bonds are formed between the ith and (i + 5)th residues. Each amino acid in a π-helix corresponds to an 87° turn of the helix, and because of this a π-helix has 4.1 residues per turn. The bridge is a single residue β-strand, which because of its small size is quite difficult to predict accurately from the sequence. Eight types of secondary structure are, however, too many for the existing methods of secondary structure prediction. Instead, usually only three states are predicted: helix (H), extended (β-sheet) (E) and coil (C). Some authors use the symbol L (loop) instead of the letter C for coil. Early methods of protein secondary structure prediction tried to predict four secondary structure elements, including turns as a separate fourth structural element. Most current methods of protein secondary structure prediction focus on the prediction of only three secondary structure elements, treating turns as coil. There are many different ways to translate the eight-letter DSSP alphabet into the three-letter code. The most common is the one used in the critical assessment of structure prediction (CASP) [6] experiment, a biennial competition (with several hundred participants from all over the world) to predict protein structures blindly from their sequences. Structures of newly determined, but not yet published, PDB entries are used in the competition. The CASP experiments, initiated by John Moult in 1994, gave protein structure prediction a
substantial new impetus. The CASP experiment helps to assess and compare various prediction algorithms and techniques, and to establish the current state of the art of prediction methodology. The translation of the eight-letter DSSP code into the three-letter H, E, C code used in the CASP experiment is as follows: Helices (H, G and I) in the DSSP code are assigned the letter H in the threeletter secondary structure code, while strands (E) and bridges (B) in the DSSP code are translated into sheets (E) in the three-letter code. Other elements of the DSSP structure (T, S, C) are treated as coil (C). There are, however, other possible ways to make these assignments. Some authors translate I (π-helix) into coil (which is not so important because of the scarcity of I structures). Frishman and Argos [5] assumed that the DSSP H and E are translated to H and E in the three state codes, and all other letters of the DSSP code are translated to coil (C); and additionally that helices shorter than 5 residues (HHHH or less) and sheets shorter than three residues (EE) are coils. Some other authors have treated helical DSSP G elements as helices H in the three-letter code only if they are neighbors to H sequences, but isolated G elements were treated as coil [7, 8]. The bridge (B) is the DSSP structure which is most difficult to locate on the sequence, because it is only one residue long, so frequently the corrections for bridges B are done, e.g., BC is translated to EE and BCB to CCC [9]. Many programs include correction algorithms that remove very short secondary structure sequences as most likely being assignment errors. Additionally there are different versions of DSSP assignment algorithms that are not identical. For example the DSSP algorithm used in the PDB differs slightly from the original DSSP algorithm developed at EMBL. The original EMBL algorithm takes into account interchain hydrogen bonds for proteins composed of several chains. The PDB algorithm uses only intrachain hydrogen bonds, and completely neglects interchain bonds. Additionally, hydrogen bond placements can differ due to ambiguities and errors in experimental crystallographic data. The theoretical problem of one-dimensional secondary structure prediction with three states (H, E, C) can be reduced even further to only two states by defining the structural states of residues with respect to water accessibility, with residues being designated as either buried inside protein core and inaccessible to water, or being on the protein surface and easily accessible to water [9]. This binary classification of structural elements of proteins corresponds to a simple hydrophobic/polar (H/P) model of proteins. Such simple models have frequently been used in molecular biology to reduce the computational complexity of the protein folding problem.
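The CASP-style translation just described is easy to express as a lookup table. The short sketch below follows the mapping given in the text (H, G, I → H; E, B → E; everything else → C); treating unknown or blank DSSP characters as coil is an assumption added here for robustness.

```python
# CASP-style reduction of the eight-letter DSSP alphabet to three states
DSSP_TO_3 = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strand and single-residue bridge
    "T": "C", "S": "C", "C": "C",   # turns, bends and coil
}

def to_three_state(dssp_string):
    """Translate a DSSP assignment string into the H/E/C code."""
    return "".join(DSSP_TO_3.get(s, "C") for s in dssp_string)

print(to_three_state("CHHHHGGTTEEEEBSC"))  # -> CHHHHHHCCEEEEECC
```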
2 Crystallization Data Mining in Protein X-Ray Structure Determination The ultimate goal of protein science is to determine the three-dimensional structures of all the proteins, and to find their functions and interactions with all other proteins and ligands. There are several ways to solve protein structures, such as X-ray crystallography, nuclear magnetic resonance (NMR) and electron microscopy (EM).
According to the most recent (December 2008) PDB statistics, there are 46,656 X-ray structures, 7,598 NMR structures, and 204 EM structures in the database (including obsolete and removed structures). This shows that X-ray crystallography continues to be the most important approach to determining protein structures. Additionally, the resolution of X-ray-determined structures is usually higher than the resolution of NMR-determined structures and much higher than the resolution of EM-determined structures. NMR methodology is also applicable only to relatively small proteins, since for larger proteins (with the number of residues N > ∼150) the NMR spectrum becomes too complicated for a detailed analysis. X-ray crystallography is based on the diffraction of an X-ray beam by the crystal lattice, discovered in 1912 by Max von Laue: the incoming beam is scattered in certain directions, and the directions and intensities of these reflections depend on the type and distribution of the atoms within the crystal. Shortly afterwards, Bragg showed that X-ray scattering can be used for structure determination. One of the first protein structures, that of hemoglobin, was determined by Perutz by X-ray crystallography in 1959. Despite recent progress in protein structure determination, there is still a huge gap between the numbers of protein sequences and protein three-dimensional structures. The latest advances in X-ray crystallography and associated technologies allow a protein structure to be determined within a few hours [11–14]. This corresponds, however, to an ideal case, with optimal crystallization conditions and all necessary resources in place. In practice, protein crystallization is a slow process that creates a bottleneck for protein X-ray crystallography. Numerous efforts have focused on efficiently and systematically determining the critical variables (known as the crystallization space) for optimized crystal size and morphology by mining crystallization data. Those variables include: concentration and nature of the protein, salt concentration and type, pH, buffer, additives, temperature, precipitant type and concentration, etc. Positive and negative data combinations are equally important for predicting crystallization conditions to accelerate the structure determination process. One of the earliest large-scale mining efforts was the construction of the biological macromolecular crystallization database (BMCD) [15, 16]. It is the most extensive database of crystallization parameters available to the public. Its entries include protein name, protein concentration, crystallization precipitant, pH, temperature, unit cell, and resolution, i.e., the parameters used to successfully crystallize a given protein. Two disadvantageous factors, however, limit the use of this database for determining crystallization conditions: first, it only includes positive data, collected solely from successful crystallization experiments; second, the preparation methods for different entries in the database are significantly dissimilar. Despite this deficiency, the BMCD database led to the development of many popular sparse matrix screens. Researchers from the Hauptman-Woodward Medical Research Institute in Buffalo, in collaboration with their colleagues from Canada, developed a high-throughput crystallization and evaluation setup with an information repository that analyzes data from past crystallization experiments for the design of new crystal growth experiments [17].
Two research groups, the Joint Center for Structural Genomics (JCSG) and the University of Toronto, independently mined crystallization data in an effort to determine the smallest sets of conditions (minimal screens) that could crystallize the maximum number of proteins in the data [18, 19]. The JCSG mined a dataset of 539 T. maritima proteins and identified the ten most effective conditions and the 108 best conditions, which crystallized 196 and all 465 proteins, respectively. Sixty-seven conditions were defined as the most essential to promote protein crystallization and referred to as the core screen. Together with the next 29 most effective conditions (the expanded core screen), the expanded screen is now used regularly for initial crystallization trials [20]. The Toronto group mined a data set of 775 proteins with 48 conditions [18]. By using different sample preparation processes and crystallization conditions, they showed that proteins from different species vary in their crystallization behavior. A minimal screen with just 6 conditions produced 205 crystallizations, or 61% of the 338 crystallized proteins in the set, whereas one with 24 conditions produced 318 protein crystals, or 94% of all crystallized proteins. These results also show that many of the conditions used in sparse matrix screens are redundant, and that current screens are not sufficient to crystallize all proteins, since 76 of the JCSG proteins and 417 of the Toronto proteins could not be crystallized at all. Segelke [21] and Rupp [22] suggested random sampling of crystallization space, and DeLucas et al. [23] applied incomplete factorial screens, to streamline the crystallization procedure. For some specific target proteins, those strategies are more appropriate and efficient for identifying crystallization conditions. Oldfield applied data mining of protein fragments to build molecular replacement models that are used for solving the phase problem in X-ray crystallography [24]. It is obvious that mined crystallization conditions alone are not enough to guarantee robust screens for crystallization of all protein targets: protein biophysical properties, including sequence information, predicted secondary structure, order and disorder information, sample preparation methods, and other experimental parameters, must also be evaluated and considered. An appropriate future goal would be to determine the crystallization potential based on sequence information alone. Enhanced procedures for protein structure prediction and determination, combined with improved crystallization condition mining, should enable a better understanding of the biological functions of proteins.
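The minimal-screen idea can be phrased as a set-cover problem: choose as few crystallization conditions as possible such that every crystallizable protein is covered by at least one of them. The greedy heuristic below is one simple, common way to mine such screens; it is only an illustration of the idea, not the procedure actually used by the JCSG or Toronto groups, and the input data structure (a mapping from condition to the set of proteins it crystallized) is an assumption.

```python
def greedy_minimal_screen(hits, max_conditions=None):
    """hits: dict mapping condition -> set of proteins crystallized by it.
    Repeatedly pick the condition that crystallizes the most not-yet-covered
    proteins, until nothing new is gained (or a size limit is reached)."""
    covered, screen = set(), []
    remaining = dict(hits)
    while remaining and (max_conditions is None or len(screen) < max_conditions):
        best = max(remaining, key=lambda c: len(remaining[c] - covered))
        gain = remaining[best] - covered
        if not gain:
            break
        screen.append(best)
        covered |= gain
        del remaining[best]
    return screen, covered

# toy usage:
# hits = {"cond A": {"p1", "p2"}, "cond B": {"p2", "p3", "p4"}, "cond C": {"p1"}}
# screen, covered = greedy_minimal_screen(hits)
```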
3 General Overview of Data Mining Methods in Biology

Biological data mining is a rapidly growing new field of science. It combines molecular biology, computational methods, and statistics for analyzing and processing biological data. With the completion of the human genome project and the mass-scale sequencing of whole genomes of many other organisms, huge amounts of constantly growing biological data have been collected. All these data are being deposited and stored in databases of genomes and of protein, DNA, and RNA sequences.
To analyze and process these enormous amounts of biological data, new data analysis methodologies have been developed. They are based on mapping, searching, and analyzing patterns in sequences and three-dimensional structures through data mining techniques. This has led to the development of a new field of science: bioinformatics. Mining for information in biological databases involves various forms of data analysis such as clustering, sequence homology searches, structure homology searches, examination of statistical significance, etc. The most basic data mining tool in biology is the basic local alignment search tool (BLAST), used to examine a new nucleic acid or protein sequence. BLAST compares the new sequence with all sequences in the database to find those most similar to the query sequence. Data mining tasks can be descriptive, uncovering patterns and relationships in the available data, or predictive, based on models derived from these data. Most popular and frequently used are automated data mining tools that employ sophisticated algorithms to discover hidden patterns in biological data. Data mining is an attempt to discover new knowledge from an enormous amount of collected, constantly growing biological data. It can be applied to a variety of biological problems such as analysis of protein interactions, finding homologous sequences or homologous structures, multiple sequence alignment, construction of phylogenetic trees, genomic sequence analysis, gene finding, gene mapping, gene expression data analysis, drug discovery, etc. All these different problems can be studied using various data mining tools and techniques.
4 Measures of Quality of Prediction

To measure the quality of protein secondary structure prediction, an accuracy matrix [A_{ij}] of size 3 × 3 is introduced, where the indices i and j run over the three states H, E, C. The ijth element A_{ij} of the accuracy matrix is the number of residues predicted to be in state j which, according to the DSSP data, are actually in state i. The sum over the columns of the matrix A then gives the number of residues n_j predicted to be in state j [9],

n_j = \sum_{i=1}^{3} A_{ij} .    (1)

On the other hand, the sum over the rows of A gives the number of residues N_i which, according to the experimental data, are in state i,

N_i = \sum_{j=1}^{3} A_{ij} .    (2)
The diagonal elements of A count the correct predictions for each of the three structural states, and the off-diagonal elements contain the information about wrong predictions.
The most commonly used parameter measuring the accuracy of protein secondary structure prediction is the parameter Q3, defined as

Q_3 = \frac{\sum_{i=1}^{3} A_{ii}}{N} ,    (3)

which gives the percentage of all correctly predicted residues within the three states (H, E, C). Here N is the total number of residues in the sequence,

N = \sum_{i=1}^{3} N_i = \sum_{j=1}^{3} n_j .    (4)
We may also define parameters measuring individually the correctness of prediction for each of the structural classes,

q_i = \frac{A_{ii}}{N_i} \quad \text{for } i = H, E, C.    (5)
Usually the easiest to predict are helices (H) and coils (C), and the most difficult to predict are β-sheets (E). For three equally populated states, random assignment would give a probability of 33.3% for each state, although the three states are not usually equally populated in real proteins. Another parameter measuring the quality of prediction is the correlation coefficient C_i proposed by Matthews [25],

C_i = \frac{TP_i \, TN_i - FN_i \, FP_i}{\sqrt{(TN_i + FN_i)(TN_i + FP_i)(TP_i + FN_i)(TP_i + FP_i)}} ,    (6)
where TP_i, TN_i, FN_i, and FP_i are the numbers of true positives, true negatives, false negatives, and false positives for the ith secondary structure element, respectively. Definitions of these numbers in terms of the elements A_{ij} of the accuracy matrix are as follows:

TP_i = A_{ii}, \quad TN_i = \sum_{k \neq i} \sum_{j \neq i} A_{jk}, \quad FP_i = \sum_{j \neq i} A_{ij}, \quad FN_i = \sum_{j \neq i} A_{ji} .    (7)
The correlation coefficient allows us to compare the result of a prediction with the completely random assignment. For a perfect prediction the Matthews coefficient Ci = 1 while for the completely random case Ci = 0 (negative values of Ci are also
possible, for predictions worse than random). Other quantities used for the assessment of prediction accuracy are the average
length of each type of structural element (helices, sheets and coils), the number of the various secondary structure elements in the protein, and the segment overlap coefficient (SOV), which penalizes wrongly predicted disruptions in secondary structure elements and is normalized to 100% for a perfect prediction [7, 26, 27]. For good predictions, the number of structural elements, their predicted average lengths and their overlaps should be close to the experimental data. On average, proteins contain about 30% helical structure (H), about 20% β-strand (E) and about 50% coil (C) structure. This means that even the most trivial prediction algorithm, which assigns all residues to the coil (C) state, would give around 50% correct prediction. The coil is also the easiest state to predict, while strands (E) are the most difficult. The difficulty of predicting β-sheets is due to their relative rarity and to the irregular, nonlocal nature of their contacts, in contrast to α-helices, where contacts are well localized along the sequence (the ith and (i + 4)th residues have a nonbonded contact).
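The measures defined above are straightforward to compute once the 3 × 3 accuracy matrix has been accumulated. The following minimal sketch implements Q3 of (3) and the Matthews coefficients of (6)-(7) for an arbitrary accuracy matrix; the matrix values shown are made up for illustration only.

import math

STATES = ("H", "E", "C")

def q3(A):
    # Q3 of Eq. (3): A[i][j] = residues predicted in state j that are actually in state i.
    total = sum(sum(row) for row in A)
    return sum(A[i][i] for i in range(3)) / total

def matthews(A, i):
    # Matthews correlation coefficient C_i of Eq. (6), with TP, TN, FP, FN
    # taken from Eq. (7); note that C_i is unchanged if FP and FN are interchanged.
    tp = A[i][i]
    fp = sum(A[i][j] for j in range(3) if j != i)
    fn = sum(A[j][i] for j in range(3) if j != i)
    tn = sum(A[j][k] for j in range(3) for k in range(3) if j != i and k != i)
    denom = math.sqrt((tn + fn) * (tn + fp) * (tp + fn) * (tp + fp))
    return (tp * tn - fn * fp) / denom if denom else 0.0

# Illustrative (made-up) accuracy matrix: rows = actual state, columns = predicted state.
A = [[310,  20,  70],
     [ 15, 150,  55],
     [ 60,  40, 280]]
print("Q3 =", round(q3(A), 3))                       # 0.74
for i, s in enumerate(STATES):
    print("C_" + s, "=", round(matthews(A, i), 3))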
5 Review of Protein Secondary Structure Prediction Methods

The first attempts at secondary structure prediction date from the 1970s, with the works of Chou and Fasman [28], Lim [29, 30] and Garnier et al. [31] (the GOR I method). All of these methods were based on single sequences and gave cross-validated prediction accuracies below 60%. These early works were based on single-residue statistics of amino acids in the various structural elements and on amino acid properties. The predictions were made by using a sliding window of a certain size (for example, a width of four residues, a characteristic length for helical contacts, in the Chou and Fasman method [28], or a width of 17 residues in the GOR I method [31]), but only single-residue statistics for each residue within such a window were used for the prediction. This limited information made these predictions seriously deficient. A significant improvement in protein secondary structure prediction was made by using pair-wise statistics for blocks of residues in secondary structure segments within windows (GOR III–IV). The practical implementation of this approach is also based on a window of a certain width, which is repositioned along the protein chain. The statistics of the residues within the window are then used to predict the conformational state of the residue at the center of the window. By moving the window along the chain, the secondary structure states of all residues from the N-terminus to the C-terminus are predicted. This window-based scheme has been used in many different secondary structure prediction methods based on various techniques, such as information theory (GOR III [32], GOR IV [33]), neural networks (NN) [34–41], nearest-neighbor algorithms [7, 8, 42–46], and several other approaches [7, 47–52]. The accuracy of these methods, based on single-sequence analysis, improved significantly, breaking the 60% level, but remained below the 70% limit even for the most successful methods.
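A minimal sketch of such a single-residue, sliding-window predictor is shown below. The propensity table is a stand-in with made-up values (real methods such as Chou-Fasman or GOR derive their statistics from a structural database); the point is only to illustrate how a window is moved along the sequence and the central residue assigned to the highest-scoring state.

# Illustrative single-residue propensities; the numerical values are made up.
PROPENSITY = {
    "A": {"H": 1.4, "E": 0.8, "C": 0.8},
    "G": {"H": 0.6, "E": 0.8, "C": 1.5},
    "V": {"H": 1.0, "E": 1.6, "C": 0.6},
    # ... the remaining 17 amino acids would be filled in from a real statistics table
}
DEFAULT = {"H": 1.0, "E": 1.0, "C": 1.0}

def predict(sequence, half_window=8):
    """Assign each residue the state with the largest propensity summed
    over a sliding window centred on it (window size 2*half_window + 1)."""
    prediction = []
    for i in range(len(sequence)):
        lo, hi = max(0, i - half_window), min(len(sequence), i + half_window + 1)
        scores = {"H": 0.0, "E": 0.0, "C": 0.0}
        for aa in sequence[lo:hi]:
            for state, p in PROPENSITY.get(aa, DEFAULT).items():
                scores[state] += p
        prediction.append(max(scores, key=scores.get))
    return "".join(prediction)

print(predict("AVGAVVGGAAVA"))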
In the last 15 years, major progress has been made in the accuracy of predicting secondary structure from sequence. The improvement has been obtained by using, instead of a single sequence, multiple sequence alignments (MSA) containing evolutionary information about protein structures. Multiple sequence alignment information was first used in 1987 by Zvelebil et al. [53] and later (1993) by Levin et al. [54] and independently by Rost and Sander [39] for the prediction of secondary structure. It gave a significant boost to the accuracy of secondary structure predictions. The most successful methods, such as PHD [55] and its most recent versions, or PSIPRED [36], achieved a prediction accuracy above 76%. The main reason that information from MSA improves the prediction accuracy is that during evolution protein structure is more conserved than sequence, which consequently leads to the conservation of long-range information. One may suppose that part of this long-range information is revealed by multiple alignments. Many proteins have a similar structure while having sequence identities as low as 20%. Protein function is more vital for evolutionary survival than is sequence conservation, so random mutations that destroy the function of a sequence usually cause the mutated sequence to be eliminated during evolution. New efficient alignment programs such as PSI-BLAST [56] allow for easy use of the alignment information for secondary structure prediction. Multiple sequence alignment enables identification of the evolutionarily conserved residues and leads to an improvement in the prediction of secondary structure [57–60]. The inclusion of homologous sequences with a low (25–30%) sequence identity to the query sequence improves the secondary structure prediction significantly. Sequences from the multiple sequence alignment that are highly similar to the query sequence do not improve the secondary structure prediction; rather, the improvement comes from the most diverse sequences. The methodology of secondary structure prediction is usually based on having a database of sequences with known secondary structures. The prediction algorithm finds relations between sequence and structure by using the sequences in the database and then uses them to predict the secondary structure of new sequences, different from those in the database. Most protein secondary structure prediction algorithms are based on machine learning techniques. The first machine learning algorithms applied to secondary structure prediction were NN. Later, Hidden Markov Models were applied, and most recently support vector machines (SVM) have been used. These machine learning algorithms are first trained (which requires a large amount of computation time) on the sequences from the data set, and later used for predictions on a different set of sequences. Consequently, the success of the predictions depends to a large extent on the proper choice of sequences for the database. The database should cover all types of proteins in the most representative way, and no two proteins in the database should be too similar. Usually it is required that the similarity between the sequences in the database be as low as possible (10–20% or less).
If the protein sequence whose secondary structure is being predicted is very similar to one of the sequences in the database, then the predictions are usually better.
Currently, no secondary structure prediction technique yields better than 80% accuracy in cross-validated predictions, as measured by Q3 [59]. For example, the most successful techniques based on NN, such as PHD [55] and PSIPRED [36], report accuracies around 76%. This limitation in accuracy is subsequently transferred into many tertiary structure prediction methods, which rely on predicted secondary structures as starting structures, thus limiting their performance. Secondary structure prediction is an active research area. Recently, SVM [61, 62], sequence-based two-level [63] and dihedral angle-based [64] neural network algorithms were successfully used, with accuracies below 80%. NN were also applied to cases where secondary structures are classified not only into three categories (helix, sheet, and coil) but also into seven categories in a more detailed representation [65]. Accurate prediction of protein secondary structure is essential for many bioinformatics applications: it allows faster structural alignments based on secondary structure topologies [66], improved structural understanding whenever tertiary structures are not available [67] (especially for membrane proteins [68]), and more accurate tertiary structure predictions [69, 70]. Tertiary structure prediction has two main limitations: (1) tertiary structure prediction methods cannot provide a three-dimensional model for all proteins, and (2) when the methods do offer a model, the model resolution can vary from 2–3 to even tens of Angstroms [71]. By contrast, though having various accuracies, secondary structure prediction methods always provide a secondary structure model. Improved secondary structure prediction can also lead to enhanced structural comparisons and searches, as well as to the identification of distant homologies. Despite the variety of prediction methods, the cross-validated 80% accuracy barrier has not yet been overcome. Is there a structural explanation for this limit? In a recent, interesting work, Kihara [72] pointed out the importance of long-range interactions in the formation of secondary structure. He argued that as long as secondary structure predictions are based on a sliding window, the long-range effects not only on β-sheets but even on helices will be treated in a limited fashion. A comparison of accuracies as a function of residue contact order (i.e., the number of sequential residues separating two amino acids that are in contact) seems to support this argument, at least for some helical and coil fragments, and provides interesting conclusions about protein folding [73]. However, the accuracies for some other helices with high contact order were also low, suggesting that there might be other effects not taken into account in the present secondary structure prediction algorithms. As the performance of the various prediction programs depends on the databases used and on the set of sequences for which the predictions are made, comparison of the accuracy of the various methods should be done carefully by using cross-validation techniques. Cuff and Barton [48, 49] proposed a database of nonredundant domains for the unbiased testing of prediction algorithms. The first version contained 396 sequences and the latest database contains 513 nonredundant sequences. We have utilized this database of 513 sequences in our work.
6 GOR V Method

The GOR program was one of the first methods proposed for protein secondary structure prediction from the sequence. The original paper (GOR I) was published by Garnier, Osguthorpe and Robson in 1978, with the first letters of the authors' names forming the name of the program [31]. The method has been continuously improved and modified over the last 30 years [32, 33, 74, 75]. The first version (GOR I) used a rather small database of 26 proteins with about 4,500 residues. The next version (GOR II) used the enlarged database of 75 proteins of Kabsch and Sander containing 12,757 residues [74]. Both versions predicted four conformations (H, E, C and turns T) and used singlet frequency information within the window (so-called directional information). Starting with GOR III [32], the number of predicted conformations was reduced to three (H, E, C). The GOR III method additionally used information about the frequencies of pairs (doublets) of residues within the window, based on the same database as the earlier version. The GOR IV [33] version used 267 protein chains (with crystallographic resolution of at least 2.5 Å) containing 63,566 residues and was based on single sequences, without considering evolutionary information from multiple sequence alignments. The GOR algorithm is based on information theory combined with Bayesian statistics. It computes the conditional probability P(S|R) of observing conformation S (where S is one of the three states: helix (H), extended (E) or coil (C)) for a residue of type R (where R is one of the 20 possible amino acids), and the probability P(S) of the occurrence of conformation S. The conformational state of a given residue in the sequence depends not only on the type of the amino acid R but also on the neighboring residues along the chain within the sliding window. GOR IV uses a window of 17 residues, i.e., a given residue together with the eight nearest neighboring residues on each side is analyzed. According to information theory, the information function of a complex event can be decomposed into the sum of information of simpler events. The GOR IV method assumed that the information function is a sum of information from single residues (singlets) and pairs of residues (doublets) within the sliding window. It computes from the database the pair frequencies of residues R_j and R_{j+m} occurring in the various conformations. By using the frequencies calculated from the database, the program can predict the probabilities of the conformational states for a new sequence. The accuracy of prediction with the GOR IV program based on single sequences (without multiple alignments), tested on the database of 267 sequences with the rigorous jack-knife methodology, was 64.4%. Other machine learning methods using single sequences have similar or lower success rates for secondary structure prediction. The full jack-knife procedure means that each time a prediction is made for a given sequence, that sequence is removed from the database and the spectrum of frequencies used for the prediction is recalculated without including information about the query sequence. Application of the full jack-knife procedure to machine learning methods is computationally infeasible, since it requires retraining of the machine every time. Over several decades, the GOR method has been constantly improved by including larger databases and more detailed statistics. With these improvements, the
Q3 accuracy reached 64% in GOR IV. However, studies by other groups showed that the accuracy of secondary structure prediction methods could be significantly increased by including evolutionary information in the form of MSA [42, 53, 55] (for a recent review see [76]). In the most recent version, GOR V [77], evolutionary information in the form of MSA is included using PSI-BLAST [56] (the GOR V server is available at http://gor.bb.iastate.edu) [78]. MSA are generated using PSI-BLAST with the NR database, allowing up to five iterations. MSA increase the information content and therefore allow improved discrimination of secondary structures. In the last stage, heuristic rules related to the predicted secondary structure distribution are used to improve the predictions. With the help of evolutionary information, the full jack-knifed prediction accuracy of GOR V using the Cuff and Barton dataset reaches Q3 = 73.5%, an almost 10% increase over the previous GOR IV performance. The segment overlap (SOV) [27], an alternative to Q3 as a measure of prediction accuracy, is also higher, at 70.8%.
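The information-theoretic scoring underlying GOR can be illustrated schematically. The sketch below is only a simplified singlet (directional information) version, not the actual GOR V implementation; the frequency tables are placeholders, and real GOR parameters are estimated from the structural database (with GOR III/IV adding pair, i.e., doublet, terms).

import math

# Schematic GOR-style scoring: the information that residue R at window offset d
# carries about the conformation S of the central residue,
#   I(S; R, d) = log[ P(S | R at offset d) / P(S) ].
P_STATE = {"H": 0.30, "E": 0.20, "C": 0.50}           # background state frequencies
# P_COND[(residue, offset)][state]: conditional frequencies (illustrative values only)
P_COND = {("A", 0): {"H": 0.42, "E": 0.14, "C": 0.44},
          ("A", 1): {"H": 0.38, "E": 0.17, "C": 0.45}}

def info(state, residue, offset):
    cond = P_COND.get((residue, offset), P_STATE)      # fall back to background
    return math.log(cond[state] / P_STATE[state])

def gor_score(sequence, i, state, half_window=8):
    """Sum of singlet (directional) information over the 17-residue window."""
    total = 0.0
    for d in range(-half_window, half_window + 1):
        if 0 <= i + d < len(sequence):
            total += info(state, sequence[i + d], d)
    return total

# The predicted state of residue i is the argmax over {H, E, C} of gor_score(seq, i, S).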
7 Fragment Database Mining Method

The fragment database method proposed by us in 2005 is based on the finding by Baker and collaborators, in the Rosetta algorithm [79, 80] for tertiary structure prediction, that structural fragments are the most useful information relating sequences and structures. Rosetta takes the distribution of local structures adopted by short sequence segments and identifies patterns (so-called I-sites) that correspond to local structures of proteins in the PDB. It assembles the I-site library that is later used for the prediction of protein three-dimensional structure. The Rosetta method considers both local and superlocal sequence-structure biases and uses a fragment-insertion Monte Carlo method to build the three-dimensional structure of the protein. Its success in structure and function predictions shows the importance of the information contained in structural fragments for theoretical predictive studies of proteins. A similar approach was used by us in the Fragment Database Mining (FDM) method. The method is based on alignments of the query sequence with all other sequences of known structure from the PDB, obtained using BLAST. The fragments of the alignments belonging to proteins from the PDB are then used for further analysis. In the prediction and evaluation part, for each query sequence from the dataset, we assign weights to the matching segments obtained from BLAST, calculate normalized scores for each residue, predict the secondary structure for that residue according to the normalized scores, and finally calculate Q3 and Matthews' correlation coefficients. In the weight assignment part, several parameters are considered, including different substitution matrices, similarity/identity cutoffs, the degree of exposure of residues to solvent, and protein classification and size. Two strategies are applied to predict the secondary structure according to the normalized scores of residues. One is to choose the highest-scoring structure class as the prediction, and the other is to use artificial intelligence (AI) approaches to choose a classification based on training.
Fig. 2 The secondary structure prediction scheme using fragment database mining (FDM): target sequence → sequence alignment (BLAST) against the PDB (identity scores) → selection of structural fragments → weight assignment → normalized scores for H, E and C → predicted secondary structure
Figure 2 shows the flowchart of our approach. The database of 513 nonredundant domains developed by Cuff and Barton [48, 49] (CB513) has been used. Local sequence alignments were generated by using BLAST with different parameters. Specifically, we used several different substitution matrices [81], including BLOSUM-45, BLOSUM-62, BLOSUM-80, PAM-30 and PAM-70. We used the three-state representation of secondary structure: helix (H), extended (β-sheet) (E), and coil (C), applying the standard reduction from the eight-letter DSSP alphabet to the three-letter code. Helices (H, G, and I) in the DSSP code were assigned the letter H in the three-letter secondary structure code; strands (E) and bridges (B) were translated into sheets (E), whereas the other elements of the DSSP structure (T, S, C) were translated into coil (C). For the weight assignments we defined identity scores and their powers (id^c, where c is a positive real number) as the weights of the matching segments. Here id is the ratio of the number of exact residue matches to the total number of residues in the matching segment. Weights were then adjusted to obtain the best match. This is illustrated in Fig. 3. At each position, the predicted secondary structure is determined by the secondary structures of the matches at that position. Each match is
Fig. 3 An example showing a query sequence (positions 1–9) and its four matching segments, based on sequence matches (sequences not shown). The matching segments are expressed as secondary structure elements, and each segment is shown with its weight (w = 0.9, 0.7, 0.6 and 0.8 for matches 1–4, respectively)
assigned a weight according to the similarity or identity score of its BLAST alignment. At each position, the weights are normalized, and the normalized scores for each secondary structure state at that position are calculated. We define s(H, i) as the normalized score for position i to be in state H,

s(H, i) = \frac{\sum w(H, i)}{\sum w(H, i) + \sum w(E, i) + \sum w(C, i)} .    (8)
Here w(H, i) is the weight for one matching segment whose residue at the ith position is in a helix, and w(E, i) and w(C, i) are defined similarly. For example, in Fig. 3, s(H, 2) = 0.7/(0.9 + 0.7 + 0.8) = 0.292 and s(E, 4) = (0.7 + 0.8)/(0.9 + 0.7 + 0.6 + 0.8) = 0.5. The secondary structure state having the highest score is chosen as the final prediction for a given residue in the sequence. For the ith position of a query sequence, we have three normalized scores, s(H, i), s(E, i), and s(C, i), one for each secondary structure state. In our prediction scheme, we always choose the highest of these three scores to determine the secondary structure at the ith position. We have used two different families of substitution matrices, PAM and BLOSUM. The percent accepted mutation (PAM) matrices were introduced by Dayhoff [82] to quantify the amount of evolutionary change in a protein sequence, based on observations of how often different amino acids replace one another in evolution. The blocks substitution matrices (BLOSUM) were introduced by Henikoff [81] to obtain a better measure of the differences between two proteins, specifically for distantly related ones. The BLOSUM matrices are derived from the observed frequencies of substitutions in blocks of local alignments of related proteins. We used several versions of these matrices in BLAST: BLOSUM-45, -62, -80 and PAM-30, -70. For example, PAM-30 refers to 30 accepted mutations per 100 amino acids of sequence, and BLOSUM-62 means that the matrix was derived from sequence blocks clustered at the 62% identity level. We tried different combinations of matrices and identity powers. The best result was obtained by using BLOSUM 45 and id^3 as the weight assignment method. Figure 4 shows the average prediction accuracies (Q3) obtained using the different substitution matrices.
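A small sketch of the scoring step of (8) is given below, using the four weights of Fig. 3. Each match is represented as a dictionary from query position to its secondary structure letter; the per-position letters are assumptions chosen only so that the two worked examples above (s(H, 2) = 0.292 and s(E, 4) = 0.5) are reproduced.

# Normalized scores of Eq. (8), illustrated with the four weights of Fig. 3.
matches = [
    (0.9, {1: "E", 2: "E", 3: "H", 4: "H", 5: "E"}),
    (0.7, {2: "H", 3: "H", 4: "E", 5: "E", 6: "C"}),
    (0.6, {4: "C", 5: "H", 6: "C", 7: "C"}),
    (0.8, {1: "H", 2: "E", 3: "C", 4: "E", 5: "C"}),
]

def normalized_score(matches, state, pos):
    """s(state, pos): weighted fraction of matches covering pos that put it in 'state'."""
    num = sum(w for w, seg in matches if seg.get(pos) == state)
    den = sum(w for w, seg in matches if pos in seg)
    return num / den if den else 0.0

def predict_position(matches, pos):
    """Choose the state with the highest normalized score at this position."""
    return max("HEC", key=lambda s: normalized_score(matches, s, pos))

print(round(normalized_score(matches, "H", 2), 3))   # 0.292
print(normalized_score(matches, "E", 4))             # 0.5
print(predict_position(matches, 4))                  # E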
Fig. 4 Prediction accuracies measured by Q3 computed for the CB513 dataset using five different substitution matrices (BLOSUM 45, BLOSUM 62, BLOSUM 80, PAM 30, PAM 70) and several different powers of the identity score (id^{1/3}, id^{1/2}, id, id^2, id^3)
Fig. 5 Prediction accuracy Q3 using different identity cutoffs
To test the concept of fragment assembly for secondary structure prediction, we set limits on the extent of similarity or identity of the fragments to be included, using different cutoffs of 99, 90, 80, 70, and 60% of the similarity or identity scores. Matches having similarity or identity scores higher than a cutoff were eliminated from the lists of matching segments used for calculating normalized scores. Figure 5 shows the results obtained for BLOSUM 45 and id^3. We observe a steady decrease in Q3 with decreasing cutoff. Matches with the highest identity scores (greater than the id cutoff) were filtered out, but "reasonably" high-id matches were kept. We define the "reasonable" high-id matches to be those that have relatively high identity scores (greater than the id cutoff), are not too short (more than five residues), and are not too long (less than 90 or 95% of the length of the query sequence). The prediction accuracies Q3 were compared for three cases, always using BLOSUM 45 with id cutoff 0.90. In case 1, all high-id
matches (matches with identity scores higher than the identity cutoff) are filtered out. This case is used as a control. In case 2, sequences with identity scores greater than the id cutoff (0.90 here), lengths longer than five residues, and lengths less than 90% of the query sequence are retained. In case 3, sequences with identity scores greater than 90%, lengths longer than five residues, and lengths below 95% of the query sequence are kept. Table 1 shows the accuracies for all three cases. We observe again that id^3 gives the best accuracy. To additionally include approximate tertiary information in the computations, we calculated the degree of exposure to the solvent for all available PDB sequences by using the Naccess software (http://www.bioinf.manchester.ac.uk/naccess, Hubbard and Thornton 1992). We used this information to differentiate buried and exposed residues by assigning them different weights. If the relative accessibility of a residue computed by Naccess is less than 5.0%, the residue is regarded as buried, while if the relative accessibility is greater than 40.0%, the residue is assumed to be exposed; the others are regarded as intermediate cases. Buried residues were weighted most heavily. We made the following linear changes to the weights of residues: if a residue is buried, its weight is multiplied by 2; if it is intermediate, the original weight is multiplied by 1.5; if it is exposed, the weight is unchanged. Table 2 gives the Q3 results, which show that the differentiation of buried and exposed residues does not have a significant effect. These results were obtained for weights id^3, an id cutoff of 0.99, and the BLOSUM 45 substitution matrix. We also studied the effect of protein size on the accuracy of prediction. We divided all proteins in our dataset into four groups according to sequence length: very small (n ≤ 100 residues), small (100 < n ≤ 200 residues), large (200 < n ≤ 300 residues), and very large (n > 300). No optimization was applied. Table 3 shows the prediction accuracies for proteins of different sizes. The accuracies vary from 0.911 for the group of very small proteins to 0.948 for large ones.

Table 1 Prediction accuracies (Q3) with identity cutoff 0.90 for the three studied cases (BLOSUM 45 substitution matrix)

High-id matches processing    id^{1/3}   id^{1/2}   id      id^2    id^3
Case 1 (all filtered out)     0.675      0.680      0.697   0.725   0.735
Case 2                        0.677      0.683      0.701   0.730   0.740
Case 3                        0.678      0.683      0.702   0.731   0.742
Table 2 Prediction accuracies (Q3) with accessibility of residues taken into account in the weighting scheme (wt – weight, SA – solvent accessibility)

Residue status definition       Q3
Control (status not applied)    0.825
wt = 2 if SA ≤ 5                0.828
wt = 2 if SA ≤ 20               0.829
Table 3 The accuracies of predictions for proteins of different lengths

Groups    CB513    Very small    Small    Large    Very large
Q3        0.931    0.911         0.936    0.948    0.940
We also applied several AI techniques to modify the final secondary structure assignment in the decision step. Instead of assigning the secondary structure at a specific position based on the highest normalized score, we applied AI methods to choose among the normalized scores according to learning results from training sets. We used three different AI methods, decision trees (DT), NN, and SVM, and compared their predictions. The main idea of these AI approaches is to gather information from a training set and use it to make predictions for a new test set. In our case, the ratio of the numbers of randomly chosen training and test sequences was 4:1. We first formed a file containing all the normalized scores for all the query sequences from the benchmark dataset, then randomly partitioned these scores into training and test sets, and finally applied the AI approaches for the prediction. We used the substitution matrix BLOSUM 45 and discarded all matches better than 90%. The best prediction accuracy for the test set without using AI methodology was Q3 = 0.720. Figure 6 shows a comparison among the three different AI approaches, decision trees, neural networks and support vector machines, for different window sizes. Only small improvements over Q3 = 0.720 in prediction accuracy are observed for DT, NN and SVM, even for optimal window sizes. We also calculated the average Matthews correlation coefficients for α-helices (Cα), β-strands (Cβ) and coil (Cc). Table 4 shows these results averaged over the number of sequences (sequence average) or over the number of amino acids (AA average). The highest prediction accuracy of our method, when all parameters are optimally tuned, is Q3 = 0.931. Of course this value of Q3 is overestimated, since in real applications a new query sequence would not likely have a perfect match in an alignment against any database. After filtering out such matches (with identity scores above a cutoff value), the best prediction accuracy for cutoff 0.99 was
Fig. 6 Normalized score based prediction using three different AI approaches at different window sizes (DT – decision tree, NN – neural network, SVM – support vector machine)

Table 4 Correlation coefficients for α-helix, β-strand and coil

        id cutoff 0.99                      id cutoff 0.90
        Sequence average    AA average      Sequence average    AA average
Cα      0.682               0.810           0.549               0.625
Cβ      0.614               0.780           0.472               0.589
Cc      0.688               0.739           0.552               0.553
0.825, and for cutoff 0.90 it was 0.735. We notice that perfect and near-perfect matches play an important role in the accuracy of the prediction; even a 0.01 decrease in the cutoff leads to a sharp drop in the prediction accuracy. Overall, AI approaches that determine the final predictions from the previously obtained normalized scores yield only slightly better results.
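For the decision step, any standard classifier library can be used. The sketch below trains the three classifier types mentioned above on windows of normalized scores using scikit-learn; the feature matrix is randomly generated stand-in data, whereas a real run would use the per-residue normalized scores from the benchmark set with the 4:1 training/test split described above.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Stand-in data: for each residue, a window of normalized scores (3 states x window size 7,
# flattened) and its true state; real input would come from the FDM scoring stage.
rng = np.random.default_rng(0)
X = rng.random((2000, 3 * 7))
y = rng.choice(["H", "E", "C"], size=2000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

classifiers = {
    "DT":  DecisionTreeClassifier(max_depth=8),
    "SVM": SVC(kernel="rbf", C=1.0),
    "NN":  MLPClassifier(hidden_layer_sizes=(30,), max_iter=500),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, "test accuracy (Q3-like):", round(clf.score(X_test, y_test), 3))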
8 Consensus Data Mining Method

In order to improve the accuracy of secondary structure predictions beyond the FDM method, we developed a new hybrid method, consensus data mining (CDM), which combines our two previous successful secondary structure prediction methods, the FDM [83] and the GOR V algorithm [77, 78]. The basic assumption behind this approach is that the combination of two complementary methods can enhance the overall performance of secondary structure prediction. We utilize the distinct advantages of both methods: FDM relies on the availability of sequentially similar fragments among the PDB structures, which leads to highly accurate (much better than GOR V) predictions when such fragments are available, but fails when perfectly matched fragments are missing. GOR V, on the other hand, predicts the secondary structures of less similar fragments fairly accurately, even when suitable structural fragments are missing. The CDM algorithm uses a single parameter, the sequence identity threshold, to decide whether to apply the FDM or the GOR V prediction at a given site. This basic idea of the CDM method is illustrated in Fig. 7, where the first row is a part of the query sequence and the second and third rows are the FDM and GOR V predictions.
Fig. 7 The graphical representation of the CDM method. For a given sequence fragment, the FDM and GOR V three state predictions are calculated. Then, according to a sequence identity threshold (shown as a straight horizontal line), the regions with higher identity scores (above the line) predicted by FDM are selected, and the rest by GOR V. The final predictions are highlighted in black background
In order to decide which method is used at each site of the CDM prediction, an identity score map is first generated. Depending on the sequence identity score at a site, either FDM (if the site has a score higher than the sequence identity threshold) or GOR V is used. The highlighted portions of Fig. 7 specify which predictions are used in CDM. The success of FDM largely depends on the availability of fragments similar to the target sequence. In practice, however, the availability of similar sequences can vary significantly. In order to analyze the relationship between the performance of CDM and the sequence similarity of the fragments, we have methodically excluded fragment alignments with sequence identities above a certain limit, which we call the upper sequence identity limit. The upper sequence identity limit is not an additional parameter of the CDM method; these results simply demonstrate what results would be expected in the absence of fragments with similarities above that limit. The performance of all secondary structure prediction methods can be improved with MSA: the GOR V method tested with the full jack-knife methodology yields an accuracy of 73.5% when MSA are included; otherwise the accuracy is about 10% lower. To explore the full range of performance of our hybrid method, we used GOR V predictions both with and without MSA. One of the significant advantages of FDM is its applicability to various evolutionary problems, because the algorithm does not rely exclusively on the sequences with the highest sequence similarity, but assigns weights to BLAST-aligned sequences that apparently capture divergent evolutionary relationships. As a result, CDM, which incorporates FDM, can be successfully used even when there is a wide range of sequence similarities among the BLAST-identified sequences. Although the availability of sequences with high similarity in the PDB essentially depends on the target sequence, the question remains as to what the optimum value of the sequence identity threshold should be. To identify this optimal threshold, we applied the CDM algorithm to our data set with a wide range of identity thresholds, as shown in Fig. 8. The plots show a distinct dependence of CDM on the upper sequence identity limit. We observe a 10% drop in the prediction accuracy when the upper sequence identity limit drops from 100 to 99%. Our results show that a 50% sequence identity threshold gives the best performance of the CDM method over the range of upper sequence identity limits, as seen in Fig. 8. This optimum value increases to 55% when MSA are incorporated into GOR V. Figure 9 illustrates the dependence of accuracy on the upper sequence identity limit as a function of the sequence identity threshold, demonstrating that a sequence identity threshold of 50% gives the highest prediction accuracy Q3 of CDM. It also displays the strong dependence of the performance of CDM on the upper sequence identity limit: the sharp drop in the accuracy of prediction when almost identical sequences are removed clearly demonstrates the importance of the availability of highly homologous sequences for successful secondary structure prediction. This strong dependence also explains why secondary structure predictions fail to reach high accuracies, signifying the limitation of short-range treatments in prediction algorithms.
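The consensus step itself is simple, and a minimal sketch is given below. For each residue, it takes the FDM prediction when the local identity score at that site is at or above the threshold (50% was found optimal), and the GOR V prediction otherwise. The per-residue predictions and identity scores are assumed to be precomputed; the example inputs are purely illustrative.

def consensus(fdm_pred, gorv_pred, identity, threshold=0.50):
    """Combine per-residue FDM and GOR V predictions (CDM).
    fdm_pred, gorv_pred: strings over {H, E, C}; identity: per-residue identity
    score of the FDM fragment covering that site (0 where no fragment matches)."""
    assert len(fdm_pred) == len(gorv_pred) == len(identity)
    return "".join(
        f if s >= threshold else g
        for f, g, s in zip(fdm_pred, gorv_pred, identity)
    )

# Illustrative inputs (not real predictions):
fdm   = "HHHHCCEEE"
gorv  = "HHCCCCEEC"
ident = [0.9, 0.9, 0.9, 0.4, 0.4, 0.4, 0.8, 0.8, 0.3]
print(consensus(fdm, gorv, ident))   # 'HHHCCCEEC' (FDM where identity >= 0.5, GOR V elsewhere)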
Fig. 8 Effect of the sequence identity threshold on the accuracy of prediction Q3 of the consensus data mining method. The upper sequence identity limit has been varied from 100 to 50%. The box around the value at 50% for the sequence identity threshold contains most maxima in the individual curves
Fig. 9 Accuracy of prediction Q3 of the consensus data mining method as a function of the upper sequence identity limit. The different curves were obtained by varying the sequence identity thresholds. The sequence identity threshold of 50% gives the best results. At the 50% threshold, CDM always performs better than individual FDM applied to the whole sequence (full coverage)
Fig. 10 The length distribution of fragments predicted by FDM in CDM as a function of upper sequence identity limit. The sequence identity threshold is 50%. The upper sequence identity limit values are identified on the individual curves
We have also analyzed the lengths of the fragments predicted by FDM in the final consensus predictions (Fig. 10). The results are shown as a function of the upper sequence identity limit. When the upper sequence identity limit is 100%, the fragment lengths are distributed almost evenly, showing only two small peaks around 21 and 36. The remaining plots show similar curves peaking around 14, 16, 18 and 20. With decreasing upper sequence identity limit, more FDM-predicted fragments are utilized in CDM: the numbers of fragments are 510, 716, 974, 1,097, and 1,153 for the upper sequence identity limits of 100, 99, 90, 80, 70, and 60, respectively. Lower values of the upper sequence identity limit, however, decrease the average length of the fragments. Table 5 shows the prediction accuracies of FDM applied individually, of the FDM and GOR V methods for the sequence regions to which they are applied within CDM, and of the consensus (CDM) method, for a range of upper sequence identity limits. The coverage of the FDM method is also shown (the coverage of a specific method is defined as the fraction of residues whose prediction by that method is used in the consensus prediction). The average cross-validated (by the jack-knife methodology) accuracy of GOR V is 73.5% when MSA and heuristic rules are used. In the absence of MSA, the jack-knifed accuracy drops to 67.5%. Note that the accuracy of the individual GOR IV (the previous version) was 64.4%. The 3.1% difference arises as a result of the heuristic rules based on the lengths of the helix and β-sheet predictions: if their lengths are too short (e.g., helices shorter than five residues or sheets shorter than three residues), the predictions are converted into coils.
Table 5 The prediction performance of the fragment database mining (FDM), GOR V and CDM methods with the applied sequence identity threshold of 50% for varying upper sequence identity limit. The table shows Q3 for FDM applied individually, for the FDM and GOR V methods for the parts of the sequence they are applied to within CDM, and for CDM for the cases when GOR V is used without and with MSA. The third column shows the coverage of the FDM method, i.e., the fraction of residues for which the FDM prediction was used in CDM. Averaging over all 513 sequences has been performed

Upper limit   FDM Q3 (individual)   FDM coverage in CDM   FDM Q3 in CDM   GOR V Q3 (no MSA)   CDM Q3 (GOR V, no MSA)   GOR V Q3 (MSA)   CDM Q3 (GOR V, MSA)
100           0.931                 0.99                  0.940           0.577               0.932                    0.500            0.931
99            0.827                 0.88                  0.889           0.639               0.833                    0.688            0.843
90            0.742                 0.65                  0.804           0.638               0.752                    0.692            0.769
80            0.713                 0.60                  0.772           0.639               0.725                    0.696            0.745
70            0.694                 0.56                  0.753           0.636               0.706                    0.691            0.728
60            0.680                 0.53                  0.736           0.636               0.693                    0.693            0.717
The identification of the ranges of parameters where CDM performs better than the individual methods is crucial. The data in Table 5 clearly demonstrate that when the upper sequence identity limit is greater than or equal to 90%, CDM confers higher accuracy than individual GOR V with or without MSA. Additionally, on average CDM is always better over the entire sequence than individual FDM, regardless of the upper sequence identity limit. For the cases of 100 and 99% upper sequence identity, only a small portion of the sequence is predicted by GOR V (1 and 12%, respectively). At these upper sequence identities, GOR V without MSA performs better than GOR V with MSA for this small portion. Only when the upper sequence identity limit falls to 90% or below does GOR V with MSA perform better. Although it is generally assumed that adding MSA to predictions increases the accuracy, the data in Table 5 clearly demonstrate that MSA are not effective for low sequence identities. Another interesting feature shown in Table 5 is the coverage of the FDM method, i.e., the fraction of FDM predictions in the consensus CDM method. When the upper sequence identity limit drops from 100 to 90%, the FDM coverage plummets from 99 to 65%, illustrating the lack of aligned sequences with high identity. Compare this with the 12% of coverage lost when the upper sequence identity limit drops further from 90 to 60%. Another measure of prediction accuracy besides Q3 is the Matthews correlation coefficient. The correlation coefficients for the three secondary structure elements, α-helices (H), β-sheets (E) and coil (C), are shown in Fig. 11. The curves in Fig. 11 were obtained at the sequence identity threshold of 50% for a varying range of upper sequence identity limits. As for the majority of secondary structure algorithms, the correlation coefficients are highest for α-helices (H), followed by those for β-sheets (E), and lastly for coils (C). The correlation coefficients obtained by CDM show a consistent and smooth monotonic decrease with decreasing upper sequence identity limit.
Fig. 11 The Matthews correlation coefficients for CDM predictions for individual secondary structure element types as a function of upper sequence identity limit
9 Secondary Structure Prediction Servers Based on GOR V and Consensus Data Mining Methods

In 2003 we created the GOR V web server for protein secondary structure prediction, freely available to the public. The GOR V algorithm combines information theory, Bayesian statistics and evolutionary information. In its fifth version, the GOR method reached (with the full jack-knife procedure) a prediction accuracy of Q3 = 73.5%. The GOR V server is freely accessible to public users and private institutions at http://gor.bb.iastate.edu/. The server became highly popular and is constantly used by tens of thousands of users from all over the globe. The GOR V server works in the following manner. When the input sequence is provided by the user, the server, trained on 513 proteins, calculates the helix, sheet and coil probabilities at each residue position and makes an initial prediction based on the structural state having the highest probability at each site. After this initial prediction, heuristic rules are applied. These rules include converting helices shorter than five residues and sheets shorter than two residues to coil. For a more detailed discussion of these heuristic rules, please refer to the original GOR V papers [75, 77]. As output, the user receives the secondary structure prediction for the input sequence and the probabilities for each secondary structure element at each position. The prediction results are shown in the web browser, which should stay open during the run, and are also sent to the
e-mail address previously provided by the user. Any run-time error message will appear in the web browser, and if any problem arises, the user can contact the system administrator via the e-mail address provided on the web page. For a sequence of 100 amino acids, the secondary structure prediction takes ∼1 min. However, the most time-consuming steps are the PSI-BLAST alignments, which in some cases (many hits and slowly converging iterations) may take considerable time. We have also successfully tested the GOR V server for sequences of up to 300 amino acids. Currently, the server is a Linux box running the RedHat Enterprise 3.0 system, with 4.5 GB of RAM and 140 GB of disk storage. The program code is compiled using the Intel Fortran Compiler 8.0.034, and the web interface is established with a CGI script written using HTML and Perl. In the future, we will enhance the GOR V server in both hardware and software for improved performance, especially if user demand suggests the need. Similarly to the GOR V server, in 2007 we developed a new CDM secondary structure prediction server, freely available to the public at http://gor.bb.iastate.edu/cdm/. Figure 12 shows the homepage of the CDM server, where the user is asked to enter his/her e-mail address and the sequence information as a series of one-letter amino acid codes (up to 5,000 residues). As an option, the user can also provide a sequence name for his/her convenience. Once the information is submitted, the server checks the validity of the e-mail address and the sequence information, and then sends a confirmation page to the browser (or an error page if there is a problem). At this point, the server accepts the job and the user can close the web browser at any time without disturbing the job run.
Fig. 12 Web page of CDM server available at the address: http://gor.bb.iastate.edu/cdm/
Another Perl script then takes over and runs BLAST against the PDB sequences and PSI-BLAST against the nr (nonredundant) database. The results of these searches are then fed to FDM and GOR V, respectively. When the FDM, GOR V, and CDM runs are completed, the following information is sent to the user's e-mail address (as html links to the output files on the server): the secondary structure predictions of FDM, GOR IV, GOR V, and CDM; the secondary structure prediction weights for each site for GOR V; and the fragment alignments and their identity scores used by FDM. The predictions of FDM, GOR V, and CDM are provided in two formats: either in a single line (for each method of prediction), or formatted so that each line contains up to 80 residues. These two formats should be sufficient to facilitate the visualization of the prediction results for most users. The CDM server uses the RedHat Enterprise 3.0 system, built on a Dell Xeon with 4.6 GB of RAM and 140 GB of disk storage. The server-side CGI script is a combination of HTML and Perl, and the program code is written in C++ (FDM and CDM) and Fortran (GOR V). The server is housed in the L.H. Baker Center for Bioinformatics and Biological Statistics at Iowa State University. Both GOR V and CDM are open source programs that are freely available to the academic community. According to recent federal government policy, all software and databases developed with funding from NIH grants should be freely available to the public and distributed as open source projects. Below we provide a short list of other useful servers for biochemical research:

PDB: http://www.rcsb.org/pdb (database of proteins with known structures)
SwissProt: http://ca.expasy.org/sprot/ (database of all protein sequences)
Gene Ontology: http://www.geneontology.org/ (gene and gene product annotations)
BLAST: http://blast.ncbi.nlm.nih.gov/ (sequence similarity searches and alignments)
PSIPRED: http://bioinf.cs.ucl.ac.uk/psipred/ (secondary structure prediction)
I-TASSER: http://zhang.bioinformatics.ku.edu/I-TASSER/ (tertiary structure prediction)
10 Discussion

The accuracy of secondary structure prediction is important for modeling the three-dimensional structures of proteins. We have shown that data mining of fragments of proteins with known structures deposited in the PDB can significantly improve protein structure prediction. We have developed the Fragment Database Mining method based on this idea. Additionally, we combined the FDM method with our earlier algorithm GOR V to develop the more highly accurate CDM method, which exploits the availability of aligned sequences of high similarity with known structures. The CDM method is an alternative to other currently available secondary structure prediction algorithms, especially when MSA with high similarities are included
in the predictions. Our results show that the accuracy of the method ranges from 67.5 to 93.2%, depending on the similarity of the target sequence to the sequences of known PDB structures. This represents a significant improvement over the original GOR V method (accuracy with multiple sequence alignments: 73.5%) and, when strong sequence similarity is present, a slight (about 1%) but consistent gain over the Fragment Database Mining method. Our consensus method shows that hybrid methods have the potential to consistently improve on the secondary structure prediction performance of individual methods. We have shown that the combination of methods with different strengths can greatly benefit from the availability of experimentally determined structures and has the potential to enhance secondary structure predictions considerably. We have developed freely available public servers for the prediction of protein secondary structure from sequence: an earlier one that relies on GOR V predictions, and a new, improved one that utilizes the CDM methodology. Our results show that data mining of known protein structures is an extremely useful approach to protein secondary structure prediction. Future progress in protein secondary structure prediction can be obtained by combining the newest AI methods with data mining. We are currently working on the application of a new technique, the so-called extreme learning machine method, to protein secondary structure prediction, and our preliminary results are highly encouraging. Since most of the errors in protein secondary structure prediction occur at the ends of secondary structure elements, future methods must address this problem. We have also discussed the applicability of data mining to the optimization of crystallization conditions for target proteins. Since obtaining protein crystals for X-ray crystallography is the most costly and time-consuming step in protein structure determination, the optimization of crystallization conditions is currently one of the most important problems in structural genomics.

Acknowledgments It is a pleasure to acknowledge the financial support provided by NIH grants 1R01GM073095, 1R01GM072014, and 1R01GM081680.
References

1. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The Protein Data Bank. Nucleic Acids Res 28:235–42 2. Pauling L, Corey RB (1951) Configuration of polypeptide chains. Nature 168:550–1 3. Pauling L, Corey RB, Branson HR (1951) The structure of proteins; two hydrogen-bonded helical configurations of the polypeptide chain. Proc Natl Acad Sci USA 37:205–11 4. Kabsch W, Sander C (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577–637 5. Frishman D, Argos P (1995) Knowledge-based protein secondary structure assignment. Proteins 23:566–79 6. Moult J, Pedersen JT, Judson R, Fidelis K (1995) A large-scale experiment to assess protein structure prediction methods. Proteins 23:ii–v
7. Biou V, Gibrat JF, Levin JM, Robson B, Garnier J (1988) Secondary structure prediction: combination of three different methods. Protein Eng 2:185–91 8. Salamov AA, Solovyev VV (1995) Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignments. J Mol Biol 247:11–5 9. Rost B, Sander C (2000) Third generation prediction of secondary structures. Methods Mol Biol 143:71–95 10. Jancarik J, Kim S (1991) Sparse matrix sampling: a screening method for crystallization of proteins. J Appl Crystallogr 24:409–411 11. Kingston RL, Baker HM, Baker EN (1994) Search designs for protein crystallization based on orthogonal arrays. Acta Crystallogr D Biol Crystallogr 50:429–40 12. McPherson A (1999) Crystallization of Biological Macromolecules. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, p 586 13. Saridakis E, Chayen NE (2000) Improving protein crystal quality by decoupling nucleation and growth in vapor diffusion. Protein Sci 9:755–7 14. Scott WG, Finch JT, Grenfell R, Fogg J, Smith T, Gait MJ, Klug A (1995) Rapid crystallization of chemically synthesized hammerhead RNAs using a double screening procedure. J Mol Biol 250:327–32 15. Gilliland GL, Tung M, Ladner J (1996) The Biological Macromolecule Crystallization Database and NASA Protein Crystal Growth Archive. J Res Natl Inst Stand Technol 101:309–20 16. Gilliland GL, Tung M, Ladner JE (2002) The Biological Macromolecule Crystallization Database: crystallization procedures and strategies. Acta Crystallogr D Biol Crystallogr 58:916–20 17. Jurisica I, Rogers P, Glasgow JI, Fortier S, Luft JR, Wolfley JR, Bianca MA, Weeks DR, DeTitta GT (2001) Intelligent decision support for protein crystal growth. IBM Syst J 40:394–409 18. Kimber MS, Vallee F, Houston S, Necakov A, Skarina T, Evdokimova E, Beasley S, Christendat D, Savchenko A, Arrowsmith CH, Vedadi M, Gerstein M, Edwards AM (2003) Data mining crystallization databases: knowledge-based approaches to optimize protein crystal screens. Proteins 51:562–8 19. Page R, Grzechnik SK, Canaves JM, Spraggon G, Kreusch A, Kuhn P, Stevens RC, Lesley SA (2003) Shotgun crystallization strategy for structural genomics: an optimized two-tiered crystallization screen against the Thermotoga maritima proteome. Acta Crystallogr D Biol Crystallogr 59:1028–37 20. Page R, Stevens RC (2004) Crystallization data mining in structural genomics: using positive and negative results to optimize protein crystallization screens. Methods 34:373–89 21. Segelke B (2001) Efficiency analysis of sampling protocols used in protein crystallization screening. J Cryst Growth 232:553–562 22. Rupp B (2003) Maximum-likelihood crystallization. J Struct Biol 142:162–9 23. DeLucas LJ, Bray TL, Nagy L, McCombs D, Chernov N, Hamrick D, Cosenza L, Belgovskiy A, Stoops B, Chait A (2003) Efficient protein crystallization. J Struct Biol 142:188–206 24. Oldfield TJ (2001) Creating structure features by data mining the PDB to use as molecular-replacement models. Acta Crystallogr D 57:1421–1427 25. Matthews BW (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta 405:442–51 26. Rost B, Sander C, Schneider R (1994b) Redefining the goals of protein secondary structure prediction. J Mol Biol 235:13–26 27. Zemla A, Venclovas C, Fidelis K, Rost B (1999) A modified definition of Sov, a segment-based measure for protein secondary structure prediction assessment. Proteins 34:220–3 28. 
Chou PY, Fasman GD (1974) Prediction of protein conformation. Biochemistry 13:222–45 29. Lim VI (1974a) Algorithms for prediction of alpha-helical and beta-structural regions in globular proteins. J Mol Biol 88:873–94 30. Lim VI (1974b) Structural principles of the globular organization of protein chains. A stereochemical theory of globular protein secondary structure. J Mol Biol 88:857–72
31. Garnier J, Osguthorpe DJ, Robson B (1978) Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J Mol Biol 120: 97–120 32. Gibrat JF, Garnier J, Robson B (1987) Further developments of protein secondary structure prediction using information theory. New parameters and consideration of residue pairs. J Mol Biol 198:425–43 33. Garnier J, Gibrat JF, Robson B (1996) GOR method for predicting protein secondary structure from amino acid sequence. Methods Enzymol 266:540–53 34. Frishman D, Argos P (1997) Seventy-five percent accuracy in protein secondary structure prediction. Proteins 27:329–35 35. Holley LH, Karplus M (1989) Protein secondary structure prediction with a neural network. Proc Natl Acad Sci USA 86:152–6 36. Jones DT (1999) Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 292:195–202 37. Petersen TN, Lundegaard C, Nielsen M, Bohr H, Bohr J, Brunak S, Gippert GP, Lund O (2000) Prediction of protein secondary structure at 80% accuracy. Proteins 41:17–20 38. Qian N, Sejnowski TJ (1988) Predicting the secondary structure of globular proteins using neural network models. J Mol Biol 202:865–84 39. Rost B, Sander C (1993) Prediction of protein secondary structure at better than 70% accuracy. J Mol Biol 232:584–99 40. Rost B, Sander C, Schneider R (1994a) PHD–an automatic mail server for protein secondary structure prediction. Comput Appl Biosci 10:53–60 41. Stolorz P, Lapedes A, Xia Y (1992) Predicting protein secondary structure using neural net and statistical methods. J Mol Biol 225:363–77 42. Levin JM, Garnier J (1988) Improvements in a secondary structure prediction method based on a search for local sequence homologies and its use as a model building tool. Biochim Biophys Acta 955:283–95 43. Levin JM, Robson B, Garnier J (1986) An algorithm for secondary structure determination in proteins based on sequence similarity. FEBS Lett 205:303–8 44. Salamov AA, Solovyev VV (1997) Protein secondary structure prediction using local alignments. J Mol Biol 268:31–6 45. Salzberg S, Cost S (1992) Predicting protein secondary structure with a nearest-neighbor algorithm. J Mol Biol 227:371–4 46. Yi TM, Lander ES (1993) Protein secondary structure prediction using nearest-neighbor methods. J Mol Biol 232:1117–29 47. Barton GJ (1995) Protein secondary structure prediction. Curr Opin Struct Biol 5:372–6 48. Cuff JA, Barton GJ (1999) Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins 34:508–19 49. Cuff JA, Barton GJ (2000) Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins 40:502–11 50. Karplus K, Barrett C, Hughey R (1998) Hidden Markov models for detecting remote protein homologies. Bioinformatics 14:846–56 51. King RD, Sternberg MJ (1990) Machine learning approach for the prediction of protein secondary structure. J Mol Biol 216:441–57 52. Ouali M, King RD (2000) Cascaded multiple classifiers for secondary structure prediction. Protein Sci 9:1162–76 53. Zvelebil MJ, Barton GJ, Taylor WR, Sternberg MJ (1987) Prediction of protein secondary structure and active sites using the alignment of homologous sequences. J Mol Biol 195: 957–61 54. Levin JM, Pascarella S, Argos P, Garnier J (1993) Quantification of secondary structure prediction improvement using multiple alignments. Protein Eng 6:849–54 55. 
Rost B (1996) PHD: predicting one-dimensional protein structure by profile-based neural networks. Methods Enzymol 266:525–39
166
H. Cheng et al.
56. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–402 57. Di Francesco V, Garnier J, Munson PJ (1996) Improving protein secondary structure prediction with aligned homologous sequences. Protein Sci 5:106–13 58. Lecompte O, Thompson JD, Plewniak F, Thierry J, Poch O (2001) Multiple alignment of complete sequences (MACS) in the post-genomic era. Gene 270:17–30 59. Rost B (2001) Review: protein secondary structure prediction continues to rise. J Struct Biol 134:204–18 60. Russell RB, Barton GJ (1993) The limits of protein secondary structure prediction accuracy from multiple sequence alignment. J Mol Biol 234:951–7 61. Hua S, Sun Z (2001) A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. J Mol Biol 308:397–407 62. Nguyen MN, Rajapakse JC (2005) Two-stage multi-class support vector machines to protein secondary structure prediction. Pac Symp Biocomput 346–57 63. Huang X, Huang DS, Zhang GZ, Zhu YP, Li YX (2005) Prediction of protein secondary structure using improved two-level neural network architecture. Protein Pept Lett 12:805–11 64. Wood MJ, Hirst JD (2005) Protein secondary structure prediction with dihedral angles. Proteins 59:476–81 65. Lin K, Simossis VA, Taylor WR, Heringa J (2005) A simple and fast secondary structure prediction method using hidden neural networks. Bioinformatics 21:152–9 66. Krissinel E, Henrick K (2004) Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallogr D Biol Crystallogr 60:2256–68 67. Wray LV Jr, Fisher SH (2007) Functional analysis of the carboxy-terminal region of Bacillus subtilis TnrA, a MerR family protein. J Bacteriol 189:20–7 68. Kashlan OB, Maarouf AB, Kussius C, Denshaw RM, Blumenthal KM, Kleyman TR (2006) Distinct structural elements in the first membrane-spanning segment of the epithelial sodium channel. J Biol Chem 281:30455–62 69. Jayaram B, Bhushan K, Shenoy SR, Narang P, Bose S, Agrawal P, Sahu D, Pandey V (2006) Bhageerath: an energy based web enabled computer software suite for limiting the search space of tertiary structures of small globular proteins. Nucleic Acids Res 34:6195–204 70. Meiler J, Baker D (2003) Coupled prediction of protein secondary and tertiary structure. Proc Natl Acad Sci USA 100:12105–10 71. Moult J (2006) Rigorous performance evaluation in protein structure modelling and implications for computational biology. Philos Trans R Soc Lond B Biol Sci 361:453–8 72. Kihara D (2005) The effect of long-range interactions on the secondary structure formation of proteins. Protein Sci 14:1955–63 73. Tsai CJ, Nussinov R (2005) The implications of higher (or lower) success in secondary structure prediction of chain fragments. Protein Sci 14:1943–4 74. Garnier J, Robson B (1989) The GOR method for predicting secondary structures in proteins. In: Fasman GD (ed) Prediction of protein structure and the principles of protein conformation. Plenum, New York, pp 417–465 75. Kloczkowski A, Ting KL, Jernigan RL, Garnier J (2002b) Protein secondary structure prediction based on the GOR algorithm incorporating multiple sequence alignment information. Polymer 43:441–449 76. Simossis VA, Heringa J (2004) Integrating protein secondary structure prediction and multiple sequence alignment. Curr Protein Pept Sci 5:249–66 77. 
Kloczkowski A, Ting KL, Jernigan RL, Garnier J (2002a) Combining the GOR V algorithm with evolutionary information for protein secondary structure prediction from amino acid sequence. Proteins 49:154–66 78. Sen TZ, Jernigan RL, Garnier J, Kloczkowski A (2005) GOR V server for protein secondary structure prediction. Bioinformatics 21:2787–8 79. Simons KT, Kooperberg C, Huang E, Baker D (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 268:209–25
Data Mining for Protein Secondary Structure Prediction
167
80. Simons KT, Ruczinski I, Kooperberg C, Fox BA, Bystroff C, Baker D (1999) Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins 34:82–95 81. Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 89:10915–9 82. Dayhoff MO, Schwartz RM, Orcutt BC (1978) Atlas Protein Seq Struct, Suppl., 345–352 83. Cheng H, Sen TZ, Kloczkowski A, Margaritis D, Jernigan RL (2005) Prediction of protein secondary structure by mining structural fragment database. Polymer 46:4314–4321
Index
ABO3 perovskites, orthorhombic 77
Adrenaline 119
Agglomerative clustering 17
α-ω-Alkandihalogenides 122
Alkanes, melting points 122
AMBER 91
Amino acid sequence data 137
Anion parameter 63
Anomaly detection 61
Association analysis 61
Association rule mining 11
Atom-atom potentials 103
  tailor-made force field 92
Atomic coordinates 43
1-Aza-2-cyclobutanone (FEPNAP) 127, 128, 130
Basic local alignment search tool (BLAST) 143
Bayesian statistics 136
Bias vs. variance (fit vs. generalizability) 32
Biological macromolecular crystallization database (BMCD) 141
Black tar mystery, ICSD search 52
Boltzmann, L. 10
Buckingham potentials 126
BUWDUS 98
BUWFEE 98
BUWGEF 98
BUWGEF01 98
BUWKEJ 95, 110
C-C interactions 114
Caffeine 119
Cambridge Structural Database (CSD) 102
Carbon/sulfur interactions, van der Waals parameters 118
Carbon monoxide 119
Carboxyhemoglobin 119
Chemical formulas 42
Chemical names, IUPAC 42
Classification 22
Clopidogrel 119
Cluster analysis 60, 92, 93
Clustered crystal structures 97
Clustering 15
  algorithms 17, 93
  hierarchical 17, 93
COMPACK 94
COMPASS 91
Complete linkage clustering 18
Confidence 12
Connolly surface 124
Consensus database mining (CDM) 135, 136, 155, 160
Correlation, linear 4
Correlation coefficients 4
Critical assessment of structure prediction (CASP) 139
CRYCOM 94
Crystal structures 59
  clustering 96
  databases, overview 39
  determination 98
  faulty 109
  generation 123
  prediction (CSP) 89, 90, 123
  ranking 126
  similarities 89
CrystalEye 102
Crystallization data mining 136
  proteins 140
Crystallization space 141
Crystallographic databases 37
Crystallographic descriptors, geometrical expression 65
Crystallographic Information File (CIF) 38
Data classification 64
Data "cloud" 13
Data clustering 69
  topology preserving distance measures 71
Data collection 37
Data evaluation 37
Data fields 40
Data mining 1, 59, 89, 92, 136
  crystal structure prediction 92
  methods in biology 142
Data mining force field (DMFF), derivation 101
  application 118
  descriptors 102
  parametrization 106
  validation 111
Data prediction, structure–property relationships 73
Data preprocessing 62
Data systematics 62
Data training 75
Database design 37, 49
Database functionality 37
Database management systems (DBMS) 3
DDL (data-driven lattice) 70
Decision trees 28
Density estimation 4, 12
Descriptor refinement, data mining 78
Descriptors 59
DFT 126
Diagram comparison 89
Dihalogenalkanes, melting points/lattice energies 122
Dilation of anion position 63
Dimensionality reduction 59
  techniques 70
Distance function 15
DMAREL 91
DMSP 76
Docking 119
Dreiding force field 122
Drug and enzyme 120
Drug development 119
DSSP algorithm 139
ECEPP 91
Eigenvalues 31, 67
Eigenvectors 14, 31, 67
EMBL algorithm 139
Entropy 11, 91
Enumeration schemes 62
Euclidean distance 16
FABFOE 112
Fatty acids 121
Faulty crystal structures 109
Feature space 27
Feed forward neural network 26
Fisher's linear discriminant 23
FlexCryst 125
Flexible molecules 92
Force field 89, 91
  derivation 101
Fragment database mining (FDM) 136, 149, 155
Genetic algorithms (GA) 75, 125
Genome sequencing 137
Geodesic distance
Geometrical modeling equations 63
Goldschmidt diagrams 64
GOR V 148
  secondary structure prediction 160
Helix predictions 138, 158
Hemoglobin 119
Hierarchical clustering 17, 93
Homoatomic pair potentials, validation 114
Hydrogen bonds 103, 115, 124
Ice Ic 53
Independent component analysis (ICA) 13, 15
Information gain 11
Inorganic Crystal Structure Database (ICSD) 37, 102
  access 49
  Web version, modules/search strategies 49, 50
Inorganic crystallography, enumeration schemes 62
  knowledge discovery 60
Intermolecular interactions 100
IsoMap technique 70
K-means 19
Kernel PCA 15
Knowledge discovery in databases (KDD) 2, 61
Latent variables 65
Lattice constants 77
Lattice energies 90
Lattice parameters 63
LDA (linear discriminant analysis) 23
Least squares optimization 30
Lennard-Jones potentials 126
Ligand–protein docking 120, 123
Linear correlation 4
Linkage 94
Liquids, densities 92
Locally linear embedding (LLE) 70, 71
Manhattan distance 16
Many Body Interaction Types (MBITs) 75
Melting points 121
Mineral names 42
Minimization 125
Molecular dynamics (MD), force fields 91
  simulations, formation/breaking of bonds 92
Molecular mechanics (MM), force fields 91
Mooser-Pearson plots 64
Multiple sequence alignments (MSA) 146
Naive Bayes 13
Neural networks 26
Nicotine 119
Outliers, overestimation/underestimation 5
Overfitting/overtraining 32, 105
Oxidation numbers 43
PCA (principal component analysis) 13, 65, 66, 93
  nonlinear 13
  procedural logic 68
  semiconductors 70
PDL (pre-defined lattice) 70
Perceptron algorithm 24, 25
Perovskites, orthorhombic 77
Pettifor plots 64
Phase designations 42
Phillips and van Vechten diagrams 64
Pigment Orange 5 (1-((2,4-dinitrophenyl)azo)-2-naphthol, refcode CICCUN) 99
Pigment Red 3 (4-methyl-2-nitrophenylazo-2-naphthol, refcode MNIPZN) 99
Pigment Red 170 100
Pigment Red 181 (PR181) 99
Polymorph Predictor 94
Polymorphism 90, 91
Polymorphs 90, 107, 113, 123, 125
  clustering 129
  predicted 127
Powder 89
  patterns/diagrams 39, 50, 52, 93
  indexed 99
  unindexed 98, 131
Powder X-ray data 98
Prediction, quality 143
Predictive modeling 60
Progesterone 119
Protein Data Bank (PDB) 101, 137
Proteins 92
  coil 138
  crystallization 136, 142
  folding 140
  secondary structure prediction 136, 145
  structure prediction 136
  X-ray structure determination, crystallization data mining 140
PSI-BLAST 146
QSAR/QSPR 74, 75
Radial distribution functions 92
Rank correlation 5
Regression 19, 30
  nonlinear 31
Reliability index 44
Retrieval 50
Root mean squared deviation (RMSD) 17
Search strategies 37
Secondary descriptors 78
Secondary structure prediction 136, 160
Segment overlap coefficient (SOV) 145
Self organization maps 70
Self-assembly (self-recognition) 90
Shannon, C. 10
β-Sheet predictions 138, 158
Similarity functions 15
Similarity index 95, 99
Simplex method 125
Single linkage clustering 17
Singular value decomposition, see also PCA 66
Space group preferences 11
Space groups 109
  CSD 124
  Hermann–Mauguin symbol 43
Special name records 42
Spinel nitrides 64, 67
Statistical dependence 8
Structural fragments 136
Structure generation 126
Structure maps 62, 69
Structure prediction, data mining aided 75
Structure–chemistry relationships 64
Sublimation energies 111, 113
Supervised methods 19
Support vector machines (SVM) 15, 24, 27, 92, 100, 146
Support vector regression (SVR) methods 77
"Swiss roll" template 69
Symmetry records 43
Tanimoto coefficient 16
Temperature factors 44
Text mining 12
Thermal parameters (atomic displacement factors) 43
Tiling theory 66
Unit cell dimensions 43
Unsupervised methods 4
Valency effects vs. atomic size effects 63
Validation 101
Validity 48
Valium 119
Van der Waals parameters 118
Virtual databases 102
Water, ICSD search 53
Wyckoff sequence 44
X-ray structure determination, proteins 140
ZnS cluster geometries 77