Computer-Assisted Structure Elucidation

Computer-Assisted Structure Elucidation Dennis H. Smith, Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.fw0...

Author: Dennis H. Smith

14 downloads 918 Views 2MB Size Report

This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!

Report copyright / DMCA form

DOWNLOAD PDF

Computer-Assisted Structure Elucidation Dennis H. Smith,

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.fw001

Stanford

University

EDITOR

School of

Medicine

A symposium sponsored by the Division of Chemical Information at the 173rd Meeting of the American Chemical Society, New Orleans, La., March 23, 1977

ACS SYMPOSIUM SERIES

AMERICAN CHEMICAL SOCIETY WASHINGTON, D. C. 1977

54

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.fw001

Library of Congress CIP Data Computer-assisted structure elucidation. (ACS symposium series; 54 ISSN 0097-6156) Bibliography: p. Includes index. 1. Chemical structure—Data processing—Congresses. I. Smith, Dennis H . , 1942. II. American Chemical Society. Division of Chemical Information. III. Series: American Chemical Society. ACS symposim series; 54. QD471.C594 ISBN 0-8412-0384-9

543'.08 ACSMC8

77-24427 54 1-151

Copyright © 1977 American Chemical Society All Rights Reserved. No part of this book may be reproduced or transmitted in any form or by any means—graphic, electronic, including photocopying, recording, taping, or information storage and retrieval systems—without written permission from the American Chemical Society. PRINTED IN T H E U N I T E D STATES O F AMERICA

ACS Symposium Series

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.fw001

Robert F. Gould, Editor

Advisory

Board

D o n a l d G . Crosby Jeremiah P. Freeman E. Desmond Goddard Robert A . Hofstader J o h n L. Margrave N i n a I. M c C l e l l a n d J o h n B. Pfeiffer Joseph V . Rodricks Alan

C. Sartorelli

Raymond B . Seymour Roy L. W h i s t l e r Aaron W o l d

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.fw001

FOREWORD T h e A C S S Y M P O S I U M SERIES was f o u n d e d i n 1 9 7 4 to

provide

a m e d i u m for p u b l i s h i n g s y m p o s i a q u i c k l y i n b o o k f o r m .

The

f o r m a t of the SERIES p a r a l l e l s that of its predecessor, A D V A N C E S IN CHEMISTRY

SERIES, except that i n o r d e r to save t i m e

the

p a p e r s are n o t typeset b u t are r e p r o d u c e d as t h e y are s u b m i t t e d b y the authors i n c a m e r a - r e a d y

form.

A s a further

means of s a v i n g t i m e , the papers are not e d i t e d or r e v i e w e d except b y the s y m p o s i u m c h a i r m a n , w h o b e c o m e s e d i t o r t h e book.

P a p e r s p u b l i s h e d i n the A C S S Y M P O S I U M

of

SERIES

are o r i g i n a l c o n t r i b u t i o n s not p u b l i s h e d e l s e w h e r e i n w h o l e or major p a r t a n d i n c l u d e reports of r e s e a r c h as w e l l as r e v i e w s since s y m p o s i a m a y e m b r a c e b o t h types of p r e s e n t a t i o n .

PREFACE "Elucidation

of

u n k n o w n m o l e c u l a r structures o c c u p i e s

a

significant

a m o u n t of the chemist's t i m e i n m a n y areas of c h e m i c a l research. A v a r i e t y of p h y s i c a l a n d c h e m i c a l m e t h o d s are a v a i l a b l e to assist i n this task.

Suitably programmed

d i g i t a l c o m p u t e r s are a d d i t i o n a l tools

w h i c h c a n h e l p to solve s t r u c t u r a l p r o b l e m s .

If u s e d i n t e l l i g e n t l y , the

s p e e d a n d thoroughness of t h e c o m p u t e r c a n b e a p o w e r f u l asset to t h e Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.pr001

c h e m i s t , b o t h i n d e c r e a s i n g t i m e r e q u i r e d for analysis a n d i n e n s u r i n g that a l l p l a u s i b l e alternatives h a v e b e e n c o n s i d e r e d .

(Computers play a

c r i t i c a l role i n x-ray c r y s t a l l o g r a p h i c t e c h n i q u e s also.

However,

such

t e c h n i q u e s w e r e not a d d r e s s e d i n this s y m p o s i u m . ) T h i s s y m p o s i u m w a s h e l d at a t i m e w h e n : (1) M o s t o r g a n i z a t i o n s i n v o l v e d i n c h e m i c a l r e s e a r c h h a v e a v a i l a b l e c o m p u t e r systems to h a n d l e m a n y phases of a c q u i r i n g a n d r e d u c i n g experimental data (2) Some of t h e earliest m e t h o d s u s i n g c o m p u t e r s to assist i n struct u r e e l u c i d a t i o n , e.g., l i b r a r y s e a r c h t e c h n i q u e s , are w i d e l y a v a i l a b l e a n d are i n c o r p o r a t e d i n t o c o m m e r c i a l systems (3) C o m p u t e r n e t w o r k i n g is m a k i n g m o r e recent p r o b l e m - s o l v i n g p r o g r a m s a v a i l a b l e to m a n y chemists at r e l a t i v e l y l o w cost These developments

have placed powerful

computer

systems

i n the

l a b o r a t o r y for r o u t i n e w o r k a n d , b y resource s h a r i n g v i a n e t w o r k i n g , h a v e r e d u c e d t h e l a g t i m e for a p p l i c a t i o n of n e w c o m p u t e r t e c h n i q u e s to days or w e e k s r a t h e r t h a n m o n t h s or years. T h e p a r t i c i p a n t s i n this s y m p o s i u m are u s i n g c o m p u t e r s i n several different

ways

to h e l p solve u n k n o w n structures.

Methodologies

cussed i n c l u d e l i b r a r y s e a r c h t e c h n i q u e s , a u t o m a t e d

interpretation

disof

d a t a , p a t t e r n r e c o g n i t i o n , s t r u c t u r e generation, a n d r a n k i n g of c a n d i d a t e structures b a s e d o n p r e d i c t i o n of s p e c t r o s c o p i c or c h e m i c a l b e h a v i o r .

It

is s t r i k i n g to see a c o n v e r g e n c e of ideas a n d t e c h n i q u e s i n these s e e m i n g l y diverse m e t h o d s .

If one v i e w s s t r u c t u r e e l u c i d a t i o n as a t r a n s f o r m a t i o n

of d a t a ( a n d representations of other i n f o r m a t i o n a b o u t a n u n k n o w n ) i n t o representations of m o l e c u l a r s t r u c t u r e w h i c h are p o t e n t i a l solutions, t h e n the reasons for the c o n v e r g e n c e are o b v i o u s .

T h e c o m p u t e r is a

p o w e r f u l a i d e i n c a r r y i n g out these transformations. I express m y sincere t h a n k s to M o r t o n M u n k for his h e l p f u l s u g gestions i n a r r a n g i n g the s y m p o s i u m a n d for c h a i r i n g a p o r t i o n of the vii

p r o g r a m . T h e officers of the D i v i s i o n of C h e m i c a l I n f o r m a t i o n ,

especially

C y n t h i a O ' D o n a h u e , w e r e m o s t h e l p f u l i n assisting m y efforts. S t a n f o r d U n i v e r s i t y S c h o o l of M e d i c i n e Stanford, Calif.

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.pr001

M a y 3 1 , 1977

viii

DENNIS

H.

SMITH

1 Computer-Assisted Structure Identification of U n k n o w n Mass Spectra

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch001

R. VENKATARAGHAVAN, H. E. DAYRINGER, G. M. PESYNA, B. L. ATWATER, I. K. MUN, M. M. CONE, and F. W. McLAFFERTY Department of Chemistry, Cornell University, Ithaca, NY 14853

Mass spectrometry has become a routine technique for structure identification in a number of applications (1). Gas chromatograph/mass spectrometer/computer (GC/MS/COM) systems capable of producing a mass spectrum every second are commercially available (2). Voluminous amounts of data are generated with such systems using subnanogram amounts of sample. For full utilization of this highly specific information it is essential to employ computer techniques. Such computer-aided structure identification from mass spectrometric data has taken two distinct directions (3). The first utilizes "retrieval" systems which compare the unknown data to a library of reference spectra to report compounds with a high degree of similarity. A number of techniques have been employed for the retrieval approach (3). The second approach involves interpretive schemes that attempt to identify part or all of the unknown structure from correlations of mass spectral fragmentation behavior. Pattern recognition (4) and artificial intelligence (5) are examples of such schemes that have been employed for interpreting mass spectral data of specific classes of compounds. We will describe here a retrieval Probability Based Matching (PBM) system (6, 7) and an interpretive Self-Training Interpretive and Retrieval System (STIRS) (8 -11)developed for the analysis of low resolution mass spectra. Both these systems are available on a computer network (TYMNET) from an IBM-370/168 computer system at Cornell University to outside users. Probability Based Matching System It has been shown that to increase the relevancy of information retrieved from a library of data it is essential to attach proper weighting to the contents of the system (12). The PBM system employs a probability weighting to both the mass and abundance data (6, 7) • The abundance values are weighted according to a log normal distribution (13) and the masses are given a uniqueness value based on their occurrence probability 1

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch001

2

COMPUTER-ASSISTED STRUCTURE ELUCIDATION

i n a m a s s s p e c t r a l d a t a b a s e of 1 8 , 8 0 6 different c o m p o u n d s (14). The P B M s y s t e m a l s o u s e s a r e v e r s e s e a r c h s t r a t e g y , i n d e p e n d e n t l y p r o p o s e d b y A b r a m s o n (15), w h i c h i s v a l u a b l e i n i d e n t i f y i n g components of a m i x t u r e . This t e c h n i q u e demands that the p e a k s of the r e f e r e n c e s p e c t r u m be p r e s e n t i n the u n k n o w n , but not that a l l p e a k s of the u n k n o w n be p r e s e n t i n the r e f e r e n c e . The d e g r e e of m a t c h of the r e f e r e n c e to the u n k n o w n i s i n d i c a t e d w i t h a c o n fidence i n d e x K, b a s e d on the s t a t i s t i c a l p r o b a b i l i t y that this degree of m a t c h o c c u r r e d by c o i n c i d e n c e ; d e t a i l s of the method h a v e b e e n d e s c r i b e d e l s e w h e r e (6, 7 ) . A s t a t i s t i c a l e v a l u a t i o n of P B M ' s performance w a s made u s i n g " u n k n o w n " m a s s s p e c t r a , for e a c h of w h i c h at l e a s t one other s p e c t r u m of the same compound w a s present i n the d a t a b a s e . L o w a n d h i g h m o l e c u l a r w e i g h t s e t s , e a c h of ~ 4 0 0 u n k n o w n s p e c t r a r e m o v e d at r a n d o m from the d a t a b a s e , w e r e r u n t h r o u g h the P B M s y s t e m , and the r e s u l t s evaluated u s i n g r e c a l l and r e l i a b i l i t y as measures of performance. R e c a l l (RC) i s d e f i n e d a s the number of r e l e v a n t s p e c t r a a c t u a l l y r e t r i e v e d and r e l i a b i l i t y (RL) i s t h e p r o p o r t i o n o f r e t r i e v e d s p e c t r a w h i c h a r e a c t u a l l y relevant. In a d d i t i o n to t h e s e terms i t i s d e s i r a b l e to e x p r e s s the performance of automated s y s t e m s i n terms of f a l s e p o s i t i v e s (FP), the p r o p o r t i o n of s p e c t r a p r e d i c t e d i n c o r r e c t l y (16). R

C

=

:

c

/

V (

I

c

P

(2)

V

+

FP = I / P f

(1)

c

(3)

f

where I = number of c o r r e c t p r e d i c t i o n s , P = t o t a l p o s s i b l e n u m b e r o f c o r r e c t p r e d i c t i o n s , If = n u m b e r o f f a l s e p r e d i c t i o n s , and P = t o t a l p o s s i b l e number of f a l s e p r e d i c t i o n s . At the 50% r e c a l l l e v e l the r e l i a b i l i t i e s for the l o w and h i g h m o l e c u l a r w e i g h t sets w e r e 65% and 4 2 % , counting as correct only predicted s t r u c tures w h i c h are i d e n t i c a l to the u n k n o w n . I n v a r i a b l y r e t r i e v a l s y s t e m s p r e d i c t s i m i l a r s t r u c t u r e s i n a d d i t i o n to the i d e n t i c a l s t r u c t u r e . In the e v a l u a t i o n of P B M r e s u l t s four c l a s s e s of s i m i l a r i t y w e r e d e f i n e d : I, i d e n t i c a l c o m p o u n d or s t e r e o i s o m e r ; I I , c l a s s I o r a r i n g p o s i t i o n i s o m e r ; I I I , c l a s s II o r a h o m o l o g ; I V , c l a s s III o r a n i s o m e r o f c l a s s III c o m p o u n d f o r m e d b y m o v i n g o n l y o n e c a r b o n a t o m . It w a s f o u n d t h a t w h e n c l a s s IV t y p e c o m pounds were a c c e p t e d as correct predictions the r e l i a b i l i t y of the s y s t e m i n c r e a s e d to 95% at the same r e c a l l l e v e l . R e c e n t l y , i t has b e e n found that the performance of P B M for the i d e n t i f i c a t i o n of c o m p o n e n t s i n a m i x t u r e c a n be e n h a n c e d (17) b y i n c o r p o r a t i n g a s p e c t r u m s u b t r a c t i o n p r o c e d u r e s i m i l a r t o the one p r o p o s e d b y H i t e s a n d B i e m a n n (18). The method subtracts the reference compound matched by P B M w i t h the h i g h e s t c o n f i d e n c e i n d e x (or a n y o t h e r i n t h e l i s t o f p r e d i c t e d s p e c t r a ) f r o m t h e unknown spectrum and matches the r e s i d u a l peaks against the c

f

c

1.

VENKATARAGHAVAN E T A L .

Structure

of Unknown

Mass Spectra

3

reference f i l e b y P B M . This operation i s p a r t i c u l a r l y v a l u a b l e for identifying a minor component m i s s e d by the reverse search proc e d u r e w h e n there i s s u b s t a n t i a l o v e r l a p i n the s p e c t r a of the major and minor c o m p o n e n t , or w h e n amount of the latter f a l l s o u t s i d e the l i m i t s set for " p e r c e n t c o m p o n e n t " or " p e r c e n t c o n tamination" #

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch001

Self-Training Interpretive and Retrieval

System

The STIRS s y s t e m i s a n i n t e r p r e t i v e s c h e m e t h a t t r a i n s i t s e l f for the i d e n t i f i c a t i o n of different s t r u c t u r a l features i n a n u n k n o w n b y u t i l i z i n g s p e c i f i c c l a s s e s o f m a s s s p e c t r a l d a t a (8). Table I shows the fifteen data c l a s s e s u s e d ; although these have b e e n s e l e c t e d for t h e i r s t r u c t u r a l s i g n i f i c a n c e , there are no p r e d e s i g n a t e d c o r r e l a t i o n s of s p e c i f i c s p e c t r a l d a t a w i t h c o r r e s ponding s t r u c t u r e s . For e a c h unknown spectrum the system matches its data i n each c l a s s against the corresponding c l a s s d a t a of a l l r e f e r e n c e s p e c t r a a n d c o m p u t e s a m a t c h f a c t o r (MF) i n d i c a t i n g the degree of s i m i l a r i t y . In e a c h data c l a s s the fifteen r e f e r e n c e c o m p o u n d s o f h i g h e s t M F v a l u e s a r e s a v e d . If a p a r t i c u l a r substructure(s) i s found i n a s i g n i f i c a n t p r o p o r t i o n of t h e s e c o m p o u n d s , its p r e s e n c e i n the u n k n o w n i s p r o b a b l e . A b s e n c e of a s u b s t r u c t u r e i s not p r e d i c t e d , as the m a s s s p e c t r a l features of one s u b s t r u c t u r e c a n be made n e g l i g i b l e by the p r e s e n c e of a more p o w e r f u l f r a g m e n t a t i o n - d i r e c t i n g g r o u p . The d a t a b a s e for the s y s t e m i n c l u d e s i n f o r m a t i o n from 2 9 , 4 6 8 different o r g a n i c compounds containing the common elements H , C , N , O, F, S i , P, S, C I , B r , a n d / o r I. A l l s t r u c t u r e s of t h e s e c o m p o u n d s h a v e b e e n c o d e d i n W i s w e s s e r L i n e N o t a t i o n (WLN) to f a c i l i t a t e c o m puter h a n d l i n g of s t r u c t u r e d a t a . To u t i l i z e t h e i n f o r m a t i o n p r o v i d e d b y t h e STIRS s y s t e m , the r e s u l t s for e a c h d a t a c l a s s are e x a m i n e d and the common s t r u c t u r a l features i d e n t i f i e d . To a i d t h i s p r o c e s s , i n a r e c e n t l y i m p l e m e n t e d s y s t e m (9), t h e c o m p u t e r e x a m i n e s t h e d a t a for t h e p r e s e n c e of 179 f r e q u e n t l y f o u n d s u b s t r u c t u r e s (19). The p r o b a b i l i t y for the p r e s e n c e i n the u n k n o w n of e a c h s u b s t r u c t u r e i s predicted u s i n g a random drawing m o d e l . Knowing the frequency of o c c u r r e n c e of a s p e c i f i c s u b s t r u c t u r e i n the f i l e , t h i s m e t h o d i n d i c a t e s the p r o b a b i l i t y that the p r e d i c t i o n of its p r e s e n c e i n the u n k n o w n o c c u r r e d at r a n d o m . From t h i s p r o b a b i l i t y the c o n f i d e n c e for e a c h p r e d i c t i o n i s c a l c u l a t e d . For e x a m p l e , i n the STIRS d a t a b a s e t h e p h e n y l s u b s t r u c t u r e i s f o u n d to b e p r e s e n t i n 28% of the c o m p o u n d s . S t a t i s t i c a l l y on the a v e r a g e t h i s s u b s t r u c t u r e w o u l d o c c u r i n 4 o f a n y 15 c o m p o u n d s i n t h e d a t a b a s e , i n c l u d i n g t h e t o p 15 c o m p o u n d s s e l e c t e d i n a S T I R S d a t a c l a s s . O n t h e o t h e r h a n d i f p h e n y l i s f o u n d i n 10 o f t h e 15 c o m p o u n d s , the probability that this occurred by chance i s only 1 i n 113, so that the confidence i n the p h e n y l p r e d i c t i o n i s >99%, or a f a l s e p o s i t i v e s v a l u e of < 1 % .

4

COMPUTER-ASSISTED STRUCTURE ELUCIDATION

Table I.

M a s s S p e c t r a l D a t a C l a s s e s U s e d i n STIRS

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch001

Data Class

D e s c r i p t i o n , maximum number of p e a k s

I

Ion Series

(14 a m u s e p a r a t i o n )

2-4

Characteristic ions

<100

2A

Four even-mass,

2B

Eight

47-102

3A

Seven

61-116

3B

Seven

89-158

4A

Six

117-200

4B

Six

159-(M -

5-6

four o d d - m a s s

Range of m a s s or m a s s l o s s

6-88

1)

+

Primary neutral l o s s e s

5A

Five

0-2, 15-20, 26-53

5B

Five

34-75

6A

Five

59-109 (MW

6B

Five

7 6 - 1 4 9 ( M W >250)

5C

Five

16-20, 30-38, 44-51, 59-65,72-76

6C

Five

26-28, 39-42, 52-56, 62-70, 80-84

7, 8

II

S e c o n d a r y n e u t r a l l o s s e s from most abundant o d d - m a s s (MF7) and e v e n - m a s s (MF8) l o s s O v e r a l l match factors

11.0

MF11.1 + MF11.2

11.1

2A + 2B + 3A + 3B + 4A + 4B

11.2

5A + 5 B + 6A

<65

>175)

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch001

1.

VENKATARAGHAVAN E T A L .

Structure

of Unknown

Mass Spectra

5

The s y s t e m h a s b e e n e x t e n s i v e l y t e s t e d for e a c h of the 179 s u b s t r u c t u r e s b y s e l e c t i n g 373 c o m p o u n d s a t r a n d o m from t h e d a t a b a s e (every 50th compound i n the Registry data) (20). If the d a t a s e t d i d n o t c o n t a i n a t l e a s t 30 c o m p o u n d s w i t h a p a r t i c u l a r s u b s t r u c t u r e , the required a d d i t i o n a l compounds were s e l e c t e d at r a n d o m t h a t c o n t a i n e d t h e s u b s t r u c t u r e . I f f e w e r t h a n 30 c o m pounds w i t h a g i v e n substructure w e r e a v a i l a b l e , a l l of them were selected. System performance i n each data c l a s s was evaluated by computing r e c a l l and r e l i a b i l i t y terms for e a c h s u b s t r u c t u r e . In contrast to equation 2 , the r e l i a b i l i t y term i n this c a s e i n c l u d e d a f a l s e p o s i t i v e factor, b e i n g set equal to RC/(RC + FP), s u c h that the v a l u e s reflect the s y s t e m performance averaged for c o m pounds c o n t a i n i n g and not c o n t a i n i n g the s u b s t r u c t u r e . This r e l i a b i l i t y term l e d to s u b s t a n t i a l c o n f u s i o n , s o that w e n o w f e e l that i t i s better to report performance of the s y s t e m i n terms of r e c a l l and f a l s e p o s i t i v e s (16), as d i s c u s s e d for P B M (equations 1 and 3). A n a l y s i s of the d a t a s h o w s that a l t h o u g h i n d i v i d u a l d a t a c l a s s e s are g o o d for s p e c i f i c s u b s t r u c t u r e i d e n t i f i c a t i o n , the b e s t p e r f o r m a n c e i s f o u n d i n t h e o v e r a l l m a t c h f a c t o r ( T a b l e I) r e s u l t s . T h i s i s due to the fact t h a t the o v e r a l l m a t c h factor d a t a c o m b i n e s the i n f o r m a t i o n d e r i v e d from the i n d i v i d u a l d a t a c l a s s e s . The overall match factor, M F 1 1 . 0 , w h i c h combines ion series, c h a r a c t e r i s t i c i o n s , and neutral l o s s data has been found to g i v e the most r e l i a b l e information on the different substructure p o s s i b i l i t i e s i n a n u n k n o w n c o m p o u n d . F o r t h e 179 s u b s t r u c t u r e s t e s t e d , t h e M F 1 1 . 0 g a v e a r e c a l l of 4 9 % a t 1.9% f a l s e p o s i t i v e l e v e l . A number of improvements have b e e n made to the c h a r a c t e r i s t i c i o n d a t a c l a s s e s (10) a n d t h e p r i m a r y n e u t r a l l o s s e s ( 1 1 ) ; the o v e r a l l m a t c h factors M F 1 1 . 1 and M F 1 1 . 2 h a v e b e e n found to g i v e a n a v e r a g e r e c a l l of 47% and 3 2 . 1 % , r e s p e c t i v e l y , at<2% false positives level. For the different substructures t e s t e d the r e c a l l v a l u e s varied widely. This i s due to the v a r i a t i o n i n the p r o m i n e n c e of mass s p e c t r a l c h a r a c t e r i s t i c s of different substructures that were tested. For example, complex substructures such as adamantyl ( R C = 6 7 % ) , b e n z p h e n a n t h r e n e ( R C = 89%) a n d o x a z o l e ( R C = 83%) have been found to g i v e h i g h r e c a l l v a l u e s whereas c o n s i d e r a b l y s i m p l e r s u b s t r u c t u r e s s u c h a s c a r b o n y l (RC = 31%) d i d n o t . This is not a l i m i t a t i o n on the c a p a b i l i t i e s of the s y s t e m but rather a c h a r a c t e r i s t i c of the m a s s s p e c t r a l b e h a v i o r of the different s u b s t r u c t u r e s . In the c a s e of a c a r b o n y l group there i s no d i s t i n c t peak a s s o c i a t e d w i t h the s u b s t r u c t u r e , but the v i s i b i l i t y of the group i n c r e a s e s i n c a r b o n y l - c o n t a i n i n g substructures s u c h as a c e t y l and b e n z o y l . Although the o v e r a l l " c h a r a c t e r i s t i c i o n s " m a t c h f a c t o r , M F 1 1 . 1 , has b e e n found to g i v e better r e c a l l v a l u e s for most s u b s t r u c t u r e s t h a n does M F 1 1 . 2 , the o v e r a l l n e u t r a l l o s s match f a c t o r , for c e r t a i n i n d i v i d u a l s u b s t r u c t u r e s the r e c a l l v a l u e s of M F 1 1 . 2 are far s u p e r i o r (11), c o m p l e m e n t i n g t h e i o n s e r i e s and c h a r a c t e r i s t i c ions d a t a . This effect i s apparent i n the

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch001

6

COMPUTER-ASSISTED S T R U C T U R E E L U C I D A T I O N

M F 1 1 . 0 r e s u l t s w h i c h combine the c a p a b i l i t i e s of a l l t h e s e d a t a classes. T h e 179 s u b s t r u c t u r e s a u t o m a t i c a l l y s o u g h t b y S T I R S w e r e selected b e c a u s e they appear i n the dictionary of frequently found s u b s t r u c t u r e s (19), not b e c a u s e of t h e i r m a s s s p e c t r a l b e h a v i o r . W e are currently d e v e l o p i n g a c o m p u t e r - a s s i s t e d procedure to m a k e e x t e n s i v e m a s s s p e c t r a l c o r r e l a t i o n s (21) o f t h e S T I R S d a t a base. This study s h o u l d p r o v i d e a set of s u b s t r u c t u r e s that p r o duce the most c h a r a c t e r i s t i c mass spectral d a t a , thus taking full a d v a n t a g e of the i n f o r m a t i o n a v a i l a b l e i n the d a t a b a s e . This p r o c e d u r e i n i t i a l l y s e l e c t s s i g n i f i c a n t p e a k s from e a c h s p e c t r u m u s i n g the c r i t e r i a d e v e l o p e d for the P B M s y s t e m . The p e a k s are then a s s i g n e d a l l p o s s i b l e elemental c o m p o s i t i o n s that are c o n s i s t e n t w i t h the m o l e c u l a r f o r m u l a of t h e p a r e n t c o m p o u n d . Each c o m p o s i t i o n i s then a s s i g n e d p o s s i b l e structures that are c o m patible w i t h parent structures by u s i n g a graph w a l k algorithm. Each structure is then assigned a score based on mass spectral fragmentation r u l e s . At t h i s s t a g e s t r u c t u r e s a n d c o m p o s i t i o n s w i l l be e x a m i n e d m a n u a l l y for c o n s i s t e n c y and c o r r e c t n e s s of assignment. The v a l i d a t e d d a t a w i l l then be tabulated i n terms of m a s s , e l e m e n t a l c o m p o s i t i o n s , a n d s u b s t r u c t u r e s a l s o i n d i c a t i n g their s i g n i f i c a n c e w i t h i n the data b a s e . E v e n w i t h t e s t i n g t h e 15 S T I R S - s e l e c t e d c o m p o u n d s f o r t h e 179 s u b s t r u c t u r e s , i t w o u l d b e h e l p f u l i f t h e c o m p u t e r c o u l d c o m pare t h e s e structures to identify their common m a x i m a l s u b s t r u c t u r e s . W e have written a computer program to a c c o m p l i s h this t a s k (22). T h i s i n v o l v e s t h e c o m p a r i s o n of two s t r u c t u r e s at a t i m e , c o d i n g them i n a W i s w e s s e r c o n n e c t i o n table; c o m p a t i b i l i ties b e t w e e n nodes of e a c h structure are e s t a b l i s h e d t a k i n g into a c c o u n t the c o n n e c t i v i t y r e l a t i o n s h i p of the n o d e s . The C o m p a t i b i l i t y Table generated by this algorithm is then examined to l o c a t e a l l the m a x i m a l common s u b s t r u c t u r e s . This process i s r e p e a t e d f o r a l l p a i r s o f s t r u c t u r e s i n t h e 15 c o m p o u n d s o f e a c h data c l a s s , and the most frequently found substructures are reported. The s u b s t r u c t u r e i n f o r m a t i o n d e r i v e d from the a b o v e procedures c o u l d then be u s e d i n the c o n s t r a i n e d structure g e n e r a t o r ( C O N G E N ) p r o g r a m , d e s c r i b e d b y C a r h a r t , et a l . , at t h i s s y m p o s i u m , to b u i l d a l l p o s s i b l e s t r u c t u r e s for the u n known u s i n g the molecular weight information. W e have a l s o d e s i g n e d a n a l g o r i t h m to determine the m o l e c u l a r w e i g h t from the mass s p e c t r a l d a t a e v e n w h e n the m o l e c u l a r i o n i s absent i n the spectrum. The r e t r i e v a l and i n t e r p r e t i v e s y s t e m s d e s c r i b e d here are d e s i g n e d to be u s e d as a n a i d to the i n t e r p r e t a t i o n of m a s s s p e c t r a l d a t a . If t h e s p e c t r u m o f t h e u n k n o w n i s m a t c h e d s u f f i c i e n t l y w e l l by a library spectrum, then the structure i d e n t i f i c a t i o n i s g r e a t l y s i m p l i f i e d . If a s p e c t r u m o f t h e u n k n o w n i s n o t i n t h e l i b r a r y , STIRS c a n s u g g e s t p o s s i b l e s u b s t r u c t u r e s t o s p e e d t h e e l u c i d a t i o n of t h e t o t a l s t r u c t u r e . B o t h t h e P B M a n d STIRS s y s t e m s are s m a l l enough to r u n on a laboratory computer s y s t e m ,

1.

VENKATARAGHAVAN E T A L .

Structure

of Unknown

Mass

Spectra

7

requiring under one minute to be run i n our D E C P D P - 1 1 / 4 5 system. The programs are a l s o a v a i l a b l e for o u t s i d e u s e r s through T Y M N E T from a n I B M - 3 7 0 / 1 6 8 c o m p u t e r at C o r n e l l U n i v e r s i t y (23). The i n c r e a s e d u s a g e of t h e s e c o m p u t e r - a s s i s t e d s t r u c t u r e i d e n t i f i c a t i o n procedures over the computer network i n the l a s t two years has a l r e a d y d e m o n s t r a t e d the v a l u e of s u c h s y s t e m s . The a n t i c i p a t e d g r o w t h i n the a p p l i c a t i o n of G C / M S to i m p o r t a n t p r o b l e m s suggests that further development i n this area i s e s s e n t i a l .

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch001

Applications P B M S p e c t r u m S u b t r a c t i o n . B r e n n e r a n d S u f f e t (24) h a v e r e c e n t l y reported c o m p a r a t i v e s t u d i e s of v a r i o u s computer s y s t e m s for m a s s s p e c t r a l i d e n t i f i c a t i o n as a p p l i e d to t r a c e o r g a n i c a n a l y s i s of n a t u r a l w a t e r s . The s p e c t r u m of a n u n k n o w n m i x t u r e w h i c h they determined to c o n t a i n b i s ( 2 - c h l o r o e t h y l ) e t h e r and C 3 ~ b e n z e n e isomers i s g i v e n i n Table II. B r e n n e r a n d S u f f e t r e p o r t (24) t h a t t h e B i e m a n n / M I T m a t c h i n g s y s t e m (18) i n d i c a t e d t h e p r e s e n c e o f C 3 ~ b e n z e n e s , a l t h o u g h w i t h a r e l a t i v e l y l o w p r o b a b i l i t y ( T a b l e II). S i m i l a r l y , STIRS o n l y i n d i c a t e d t h e a r o m a t i c c o n s t i t u e n t s , w i t h 11 o f t h e 15 c o m p o u n d s selected by the o v e r a l l match factor containing a benzene r i n g . O n l y o n e o f t h e t o p 15 c o m p o u n d s w a s d i c h l o r i n a t e d ; t h e p e r f o r m a n c e of STIRS i s g e n e r a l l y m u c h b e t t e r for t h e m a s s s p e c t r a of r e l a t i v e l y pure c o m p o u n d s . In c o n t r a s t to the r e s u l t s from the M I T a n d STIRS s y s t e m , the P B M e x a m i n a t i o n of t h e u n k n o w n s p e c t r u m f o u n d t h e o t h e r c o m p o n e n t , b i s ( 2 - c h l o r o e t h y l ) e t h e r ( T a b l e III); a p p a r e n t l y no C 3 - b e n z e n e s p e c t r a w e r e r e t r i e v e d b e c a u s e of the r e s t r i c t i o n s o n % c o n t a m i n a t i o n and % c o m p o n e n t (14). However, at the c o n c l u s i o n of the P B M r u n the b e s t - m a t c h i n g r e f e r e n c e spectrum i s a u t o m a t i c a l l y s u b t r a c t e d from the u n k n o w n s p e c t r u m , and w h e n t h i s w a s r u n , a l l of the b e s t - m a t c h i n g s p e c t r a found b y P B M w e r e of C 3 ~ b e n z e n e s ( T a b l e III). The b e s t - m a t c h i n g r e f e r e n c e s p e c t r u m of t h i s r u n i s a g a i n s u b t r a c t e d , and the r e s u l t i n g r e s i d u a l of the r e s i d u a l d o e s s h o w s i g n i f i c a n t p e a k s i n d i c a t i v e of one or more a d d i t i o n a l c o m p o n e n t s , s u c h a s a t m/e 5 7 , 7 1 , a n d 85 w h i c h d o n o t a r i s e f r o m e i t h e r t h e b i s ( 2 - c h l o r o e t h y l ) e t h e r or C 3 " b e n z e n e s , p l u s p e a k s w h i c h c o u l d a r i s e from t h e s e c o m p o n e n t s w h i c h h a v e not b e e n c o m p l e t e l y s u b t r a c t e d . The l a t t e r i n f o r m a t i o n a p p a r e n t l y i s o v e r l y c o n f u s i n g for P B M , as the c o m p o u n d s r e t r i e v e d o n r u n n i n g the r e s i d u a l of the r e s i d u a l are not p a r t i c u l a r l y l o g i c a l . The b e s t - m a t c h i n g c o m pound, with K = 5 4 * * and A K = 28, is 2-keto-3-methylvaleric acid methyl ester, and a l l retrieved spectra showed % contaminat i o n >50%. W e f i n d for a v a r i e t y of other u n k n o w n s p e c t r a that this is generally the c a s e ; thus P B M examination should be r e s t r i c t e d to j u s t t h e f i r s t r e s i d u a l s p e c t r u m . STIRS R e s u l t s o n a R i v e r W a t e r E x t r a c t . The s p e c t r u m s h o w n i n F i g u r e 1 w a s s u b m i t t e d t o STIRS a s a n u n k n o w n a t a t i m e

8

COMPUTER-ASSISTED STRUCTURE

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch001

Table II.

ELUCIDATION

U n k n o w n from the P h i l a d e l p h i a D r i n k i n g W a t e r S u p p l y

n/e

Rel. Abund.

27 29 30 31 34 38 39 40 42 41 43 44 45 49 50

53 33 1 5 2 2 17 8 9 30 30 2 3 3 5

m/e 51 52 53 55 56 57 58 59 61 62 63 64 65 66 69 70

Rel. Abund. 10 3 4 4 5 63 10 1 1 3 53 2 18 1 4 4

m/e

Rel. Abund.

71 74 75 77 78 79 84 85 86 89 91 92 93 94 95 98

7 1 1 18 5 10 2 33 2 1 7 1 60 2 17 2

m/e

Rel. Abund

99 100 102 103 104 105 106 111 115 117 119 120 121 127 142

1 1 1 7 2 100 13 1 2 2 5 34 3 1 3

M a t c h e s from B i e m a n n / M I T S e a r c h o n M S S S : Name 3-(2 ' - m e t h y l p h e n y l ) - p e n t a n e 4,6,8-Trimethyl-2-methoxy-l-azocine 1,3,5 -Trimethylbenzene 1,2,4-Trimethylbenzene 1,2,3-Trimethylbenzene Nezukone sec-Butylbenzene 1- B e n z o y l - 4 - m e t h y l p e n t a n e 1 -Methyl-2 -n-propylbenzene 2- Phenylhexane

Similarity Index 0.228 0.212 0.185 0.183 0.176 0.164 0.162 0.157 0.153 0.151

1.

VENKATARAGHAVAN E T A L .

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch001

Table H I . T a b l e II

Structure

% Contamination

K

b i s (2 - C h l o r o e t h y l ) e t h e r 2-(2- Chloroethoxy) ethanol bis(2 -Chloroethyl) ether b i s (2 - C h l o r o e t h y l ) e t h e r

62** 53*** 51+ 47+

First reference PBM:

Doto class 2A:/»/fr 6-88 o-HO- C H -S0 -/» - C H e-H0-C H —S0 —*-C H < J - H O - C H - S 0 - / - C H„ o-CH -C H ~C0-CH O-HO—C H -C0-0-R /T7-H0—C H —C0-CH RC-N-C H -N0"=CH ^-RO—C H -C0—H />-H0-C H -C H N 0 R00C-C H ~C0-0-*-C H RC0NH-C H -C0-0-CH 6

4

e

2

4

3

4

6

3

2

6

3

2

e

7

e

6

3

4

e

4

e

3

3

2

3

4

3

3

4

3

3

6

3

4

3

4

2

5

4

4

4

e

4

6

2

4

e

4

e

4

6

3

4

11 10 9 10 20

p-H0-C H -C0—0-C H p-H0-C H -C0-0-fl-C Hg />~H0-C H -C0—0-CH m—H0-C H -C0-0-CH tf-H0-C H -C0-CH /fj-H0-C H ~C0-e0-CH )-R m-H0-C H -C0-CH p -H0-C«H -C0-CH /n-H0-C«H -*C0-0H)C0R 7?-R-0—C H -CO-OH 6

4

4

73+ 72+ 72+ 71+ 61**+

Data class 3B:/»/* 89-158

7

5

4

6

37 46 45 50

% Component 53% 64% 53% 38%

24% 26% 29% 30%

spectrum subtracted, residual spectrum run o n

Isopropylbenzene Isopropylbenzene Isopropylbenzene I s opropy lb en zen e 1 -Methyl-2 -ethylbenzene

50-

9

Mass Spectra

P B M R e s u l t s o n U n k n o w n a n d R e s i d u a l Spectra from

Compound

lOO-i

of Unknown

3

s

7

6

34% 34% 34% 43% 36%

77% 91% 83% 72% 74%

Oata class 5: losses of 0-64 C4H3O-CO—0— n — CjH C H 0-C0-0-5 — C H C H NH-CO-O - * - C H R0C H - CO - 0 - n — C H HSCH -C0-0-/» - C H C H -C0-0-5-C3H CH -C0—0-CH NR CH — S-n—C H C0-0-CH 4

6

3

5

6

4

7

S

7

s

2

6

S

5

3

2

s

3

4

3

Neutral Losses STIRS

results for the mass spectrum

of n-propyl

7

2

Data class 5 "(neutral losses)"

1.

7

7

m/e

Figure

7

3

p-hydroxybenzoate

7

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch001

10

COMPUTER-ASSISTED STRUCTURE

ELUCIDATION

w h e n a r e f e r e n c e s p e c t r u m of t h i s c o m p o u n d , n - p r o p y l jD-hydroxyb e n z o a t e , w a s not i n the data b a s e . R e s u l t s f o r t h r e e o f t h e 15 d a t a c l a s s e s i l l u s t r a t e t h e " s e l f - t r a i n i n g " feature b y w h i c h STIRS i n d i c a t e s structural features of the u n k n o w n . D a t a c l a s s 2 A u t i l i z e s the l a r g e s t p e a k s i n t h e l o w m a s s r e g i o n of t h e s p e c t r u m (m/e 6 - 8 8 ) ; t h e s e f r a g m e n t i o n s a r e m o r e o f t e n f o r m e d b y s e c o n d ary r e a c t i o n s of h i g h e r energy r e q u i r e m e n t s , and so are i n d i c a t i v e of g r o s s , rather t h a n s p e c i f i c , s t r u c t u r a l f e a t u r e s . Thus a l l of the s p e c t r a found of h i g h e s t M F 2 A v a l u e s c o n t a i n e d a p h e n y l group, although the phenyl rings i n these compounds c o n t a i n a rather w i d e v a r i e t y of s u b s t i t u e n t s . The e x p e r i e n c e d m a s s s p e c t r o m e t r i s t p r o b a b l y w o u l d h a v e i n f e r r e d the p r e s e n c e of p h e n y l f r o m t h e " a r o m a t i c i o n s e r i e s " i n t h i s r e g i o n ; h o w e v e r , STIRS w a s not t r a i n e d s p e c i f i c a l l y to r e c o g n i z e t h e s e f e a t u r e s , but i n s t e a d i n d i c a t e d the p r e s e n c e of p h e n y l by f i n d i n g that s u c h c o m p o u n d s matched these data the most c l o s e l y . D a t a c l a s s 3B c o v e r s a h i g h e r m a s s r a n g e , w h o s e f r a g ment p e a k s s h o u l d be i n d i c a t i v e of more s p e c i f i c s t r u c t u r a l features. A g a i n a l l c o m p o u n d s of h i g h e s t M F 3 B v a l u e s c o n t a i n the p h e n y l g r o u p , but a l m o s t a l l of t h e m a l s o c o n t a i n a n a r y l h y d r o x y g r o u p (not ortho) a n d a c a r b o n y l . N o t e t h a t t h e l a t t e r i s contained i n carboxyl, ester, and keto functionalities; because STIRS i s d e s i g n e d t o p r o v i d e p o s i t i v e i n f o r m a t i o n , d a t a c l a s s 3B thus i n d i c a t e s the p r e s e n c e of H O - p h e n y l - C O - . D a t a c l a s s 5 employs " n e u t r a l l o s s " information, the differences i n mass b e t w e e n the observed fragment i o n and the m o l e c u l a r i o n , w h i c h i n t h i s c a s e i s a s s u m e d to b e m/e 1 8 0 . C l e a v a g e of the m o l e c u l a r i o n g i v e s two fragments, o n l y one of w h i c h holds the p o s i t i v e c h a r g e , and thus the n e u t r a l l o s t g e n e r a l l y c o n t a i n s the more e l e c t r o n e g a t i v e f u n c t i o n a l i t i e s . Illustrating t h i s , w h e n the m a s s e s r e p r e s e n t i n g the most common neutral l o s s e s of t h i s u n k n o w n w e r e matched a g a i n s t the w h o l e r e f e r e n c e f i l e , the h i g h e s t M F 5 v a l u e s w e r e found to be m a i n l y p r o p y l esters. To r e i t e r a t e , STIRS w a s not p r e p r o g r a m m e d to r e c o g n i z e p r o p y l e s t e r s f r o m t h e i r c o m m o n l o s s e s o f 4 1 , 4 2 , a n d 59 m a s s u n i t s ; STIRS i n e f f e c t t r a i n s i t s e l f to r e c o g n i z e t h e p r o p y l e s t e r f u n c t i o n a l i t y by f i n d i n g that t h e s e d a t a of the u n k n o w n w e r e matched best by propyl esters i n the f i l e . Note a l s o that the c o m pounds found by M F 5 d i d not c o n t a i n a p a r t i c u l a r l y s i g n i f i c a n t n u m b e r of p h e n y l g r o u p s ; t h e d i f f e r e n t d a t a c l a s s e s of STIRS h a v e been s e l e c t e d to be s e n s i t i v e to different f u n c t i o n a l i t i e s . STIRS h a s b e e n d e s i g n e d a s a n a i d to t h e i n t e r p r e t e r ; i f t h e i n t e r p r e t e r n o w a d d s up t h e m a s s of d i - s u b s t i t u t e d p h e n y l (76), h y d r o x y l (17), a n d p r o p y l e s t e r (87), he c a n n o t e t h a t t h e s u m c o r r e s p o n d s to the s u p p o s e d m o l e c u l a r w e i g h t , 1 8 0 , i n d i c a t i n g that a l l of the f u n c t i o n a l i t i e s of the u n k n o w n m o l e c u l e h a v e been i d e n t i f i e d by these three d a t a c l a s s e s of STIRS. STIRS: U n k n o w n Terpene. The m a s s s p e c t r u m of l2p-acetoxysandaracopimar-15-en-8p-, l l a - d i o l was examined

by

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch001

1.

VENKATARAGHAVAN E T A L .

Structure

of Unknown

Mass

Spectra

11

STIRS, o m i t t i n g a l l s p e c t r a of t h i s c o m p o u n d from the r e f e r e n c e file. The n i n e s t r u c t u r e s of h i g h e s t " o v e r a l l m a t c h f a c t o r " (MF11.0) v a l u e s are shown i n Figure 2. If t h e i d e n t i t y o f t h i s m o l e c u l e h a d b e e n t o t a l l y u n k n o w n to the i n t e r p r e t e r , t h e s e M F 1 1 . 0 s e l e c t i o n s s h o u l d h a v e i n d i c a t e d at l e a s t the g e n e r a l s t r u c t u r a l features of the m o l e c u l e to the interpreter. Thus a l l of the c o m p o u n d s of F i g u r e 2 h a v e e i t h e r three or four fused rings and a l l have the three f u s e d s i x membered r i n g s t h a t are a c t u a l l y p r e s e n t i n the u n k n o w n . The four t r i c y c l i c compounds c l o s e l y r e s e m b l e the correct structure i n h a v i n g m e t h y l g r o u p s i n t h e 4 , 4 , 1 0 , a n d 13 p o s i t i o n s , hydroxy at 8, and v i n y l at 1 3 . N o t e that three of the steroids contain a 5-hydroxy group, w h i c h c a n be v i e w e d as corresponding to the correct 8-hydroxy p o s i t i o n by " f l i p p i n g " the s t r u c t u r e s , w i t h their a c e t o x y groups then at l e a s t present i n the r i n g c o r r e s p o n d i n g to t h e r i n g c o n t a i n i n g t h e a c e t o x y group i n t h e u n k n o w n . The p r e s e n c e of h y d r o x y l a n d a c e t o x y g r o u p s are i n d i c a t e d b y the f a c t that e i g h t of the n i n e c o m p o u n d s c o n t a i n h y d r o x y l s a n d s e v e n c o n t a i n a c e t o x y g r o u p s ; o n l y t w o c o n t a i n more t h a n one h y d r o x y l g r o u p , w h i l e none c o n t a i n more t h a n one a c e t o x y . H o w e v e r , the compound does g i v e a m o l e c u l a r i o n , so that it s h o u l d be p o s s i b l e for the i n t e r p r e t e r to i n f e r c o r r e c t l y t h a t the u n k n o w n c o n t a i n s one acetoxy and two h y d r o x y l groups after d e d u c i n g the t r i c y c l i c s y s tem w i t h the other s u b s t i t u e n t s . A l s o , the steroid s e l e c t e d as the seventh compound has a 4 - g e m - d i m e t h y l group. For this u n k n o w n t h u s STIRS c a n g i v e f a i r c o n f i d e n c e i n a l l of t h e s t r u c t u r e a s s i g n m e n t s e x c e p t the p o s i t i o n of the a c e t o x y a n d one of the h y d r o x y g r o u p s ; there i s e v e n s o m e i n d i c a t i o n of t h e i r p o s i t i o n s , as i n the majority of s e l e c t e d s t r u c t u r e s of F i g u r e 2 t h e s e s u b stituents are o n the exterior r i n g b e a r i n g the bridgehead h y d r o x y l . P B M / S T I R S E x a m i n a t i o n of U n k n o w n S p e c t r a of F a t t y A c i d E s t e r s . In a n e a r l y c l a s s i c c a s e of n a t u r a l p r o d u c t s t r u c t u r e d e t e r m i n a t i o n b y m a s s s p e c t r o m e t r y (25) a c o m p o u n d i s o l a t e d a s the m e t h y l e s t e r from butterfat w a s i d e n t i f i e d to be m e t h y l 3 , 7 , 1 1 , 1 5 - t e t r a m e t h y l h e x a d e c a n o i c a c i d . T h e o r i g i n a l p u b l i s h e d (25) s p e c t r u m (omitted from the r e f e r e n c e file) w a s r u n t h r o u g h P B M a n d STIRS to g i v e the r e s u l t s s h o w n i n T a b l e I V . P B M c o r r e c t l y i d e n t i f i e d the compound as methyl p h y t a n o a t e , r e t r i e v i n g the two r e f e r e n c e s p e c t r a of t h i s compound i n the P B M reference file; note that the third s e l e c t i o n is a much p o o r e r m a t c h . The s u b s t r u c t u r e s i d e n t i f i e d b y STIRS M F 1 1 . 0 a n d 1 1 . 1 are c o r r e c t , although the a c e t a t e substructure i n d i c a t e d by M F 1 1 . 2 is not (Table IV). The b e s t - m a t c h i n g c o m p o u n d s found b y STIRS M F 1 1 . 0 a r e a l l m e t h y l e s t e r s of l o n g - c h a i n f a t t y a c i d s , and a l l but one has a methyl group i n the three p o s i t i o n . The p o s i t i o n s of the other m e t h y l groups w e r e not found b y STIRS, c o n s i s t e n t w i t h the rather s m a l l effect of s u c h methyl groups on the mass s p e c t r a .

COMPUTER-ASSISTED

OAc

STRUCTURE

ELUCIDATION

12/3 — Acetoxysandaracopimar— l 5 - e n - 8 / 3 , Ha — diol

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch001

(spectrum

MF

Figure

II.0

Best

2. Best-matching STIRS examination

not in file)

Matches:

compounds and their MF11.0 values found in the of 12j3-acetoxysandaracopimar-15-en-8p,li<x-diol

1.

VENKATARAGHAVAN E T A L .

Table IV.

Structure

Phytanic Acid Methyl Ester M e t h y l Phytanoate Isopropylidene batyl alcohol

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch001

13

Mass Spectra

U n k n o w n B u t t e r f a t M e t h y l E s t e r (not i n f i l e )

PBM;

STIRS:

of Unknown

Overall Match

MF11.0

K

AK

101+ 98+ 57**

0 2 4 3

14

COMPUTER-ASSISTED S T R U C T U R E

ELUCIDATION

As a further t e s t of t h e s e n s i t i v i t y of STIRS a n d P B M for s u c h structural d i f f e r e n c e s , the s p e c t r u m of m e t h y l 1 0 - m e t h y l nonadecanoate was run as an unknown without removing its s p e c trum from the r e f e r e n c e f i l e . Table V s h o w s , as e x p e c t e d , that Table V. file)

U n k n o w n , M e t h y l 10-methylnonadecanoate (spectrum i n

PBM:

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch001

Methyl 10-methylnonadecanoate Methyl n-eicosanoate Methyl 16-methylheptadecanoate Methyl 14-methylhexadecanoate

K

AK

159+ 102+ 84** 83***

0 10 28 29

STIRS, M F 1 1 . 0 :

- E o - , Methyl Methyl Methyl Methyl Methyl Methyl Methyl Methyl Methyl

-OCH

10-methylnonadecanoate 10-D-methyloctadecanoate 11-methylnonadecanoate n-docosanoate n-tricosanoate 17-methyloctadecanoate 16-methylheptadecanoate 13-methylpentadecanoate 15-methylhexadecanoate

3

919 844 774 749 745 734 733 733 72 6

t h i s s p e c t r u m i s t h e n r e t r i e v e d as the b e s t m a t c h for b o t h P B M a n d M F 1 1 . 0 of S T I R S . N o t e t h a t b o t h t h e P B M a n d STIRS r u n s h a v e r e t r i e v e d l o n g - c h a i n fatty a c i d methyl e s t e r s , most of w h i c h c o n t a i n a s i n g l e m e t h y l group s u b s t i t u t e d far from the ester f u n c t i o n a l i t y . L o c a t i n g t h e p o s i t i o n w i t h c o n f i d e n c e b y STIRS w o u l d r e q u i r e that the r e f e r e n c e f i l e c o n t a i n s a s u b s t a n t i a l number of s u c h 1 0 - m e t h y l d e r i v a t i v e s , as w e l l as that the e x a c t p o s i t i o n of the m e t h y l group p r o d u c e s s u b s t a n t i a l d i f f e r e n c e s i n the m a s s spectral behavior. The M S f r a g m e n t a t i o n - d i r e c t i n g c a p a b i l i t y of the c a r b o n y l group i s w e l l k n o w n , and i t s p o s i t i o n i n an a l k y l c h a i n s h o u l d have a m u c h greater effect o n the mass s p e c t r u m t h a n a m e t h y l g r o u p . T a b l e VI s h o w s the r e s u l t s of the P B M / S T I R S e x a m i n a t i o n of the m a s s s p e c t r u m of m e t h y l 3 - o x o n o n a d e c a n o a t e as a n u n k n o w n . N o w t h e f i v e b e s t m a t c h e s r e t r i e v e d b y STIRS M F 1 1 . 0 are a l l m e t h y l esters of l o n g - c h a i n fatty a c i d s w i t h a c a r b o n y l group i n the three p o s i t i o n .

1.

VENKATARAGHAVAN E T A L .

Table VI. file)

Structure

Unknown, Methyl

of Unknown

3-oxononadecanoate

PBM: Methyl 3-oxononadecanoate

15

Mass Spectra

(spectrum i n

K

AK

138+

0

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch001

STIRS, M F 1 1 . 0 :

Methyl Methyl Methyl Methyl Methyl

3-oxononadecanoate 3-oxoheptadecanoate 3-oxooctadecanoate 3-oxotetracosanoate 3-oxopentadecanoate

878 837 817 785 655

As a f i n a l e x a m p l e , the r e s u l t s of e x a m i n i n g the m a s s spectrum of g l y c e r y l t r i d o d e c a n o a t e as a n u n k n o w n are s h o w n i n Table VII. Although the unknown spectrum was not i n the f i l e , Table VII.

Unknown G l y c e r y l Tridodecanoate

PBM:

K

AK

120* 29***+

0 88

Glyceryl tridodecanoate Vinyl dodecanoate

(not i n f i l e )

STIRS, M F 1 1 . 0 : Glyceryl tridodecanoate Glyceryl tritetradecanoate G l y c e r y l - 2 - l a u r a t e - l , 3-distearate N-trifluoroacetyl-N-dodecyldodecanamide Glyceryl trioctadecanoate L - l ,2-dilauroylglycerylphosphoroylcholine G l y c e r y l - l - p a l m i t a t e - 2 , 3-distearate G l y c e r y l - 2 - l a u r a t e - l , 3-dipalmitate 1- S t e a r o - 2 , 3 - d i m y r i s t i n 2- Stearo-l, 3-diacetin Glyceryl trioctadecanoin G l y c e r o l - l , 3 - d i octadecanoate G l y c e r y l - 2 - m y r i s t a t e - l , 3-distearate

725 596 5 82 5 82 57 3 5 67 560 552 551 547 5 43 542 535

16

COMPUTER-ASSISTED STRUCTURE

ELUCIDATION

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch001

P B M r e t r i e v e d another s p e c t r u m of t h i s c o m p o u n d as a n e x c e l l e n t m a t c h , w i t h K and A K v a l u e s far s u p e r i o r to that of the o n l y other r e t r i e v e d s p e c t r u m , that of v i n y l d o d e c a n o a t e . The b e s t - m a t c h i n g c o m p o u n d s r e t r i e v e d b y STTRS M F 1 1 . 0 w e r e , w i t h o n l y o n e e x c e p t i o n , a l l g l y c e r y l esters c o n t a i n i n g fatty a c i d r e s i d u e s . The d o d e c a n o a t e (lauric) a c i d m o i e t y i s found i n f i v e of the eight b e s t matching c o m p o u n d s , w h i c h s h o u l d at l e a s t provide some help to the interpreter. P r a c t i c a l U s e of P B M a n d STTRS. O u r m a i n i n c e n t i v e of making these s y s t e m s a v a i l a b l e on the C o r n e l l computer o v e r the T Y M N E T s y s t e m (23) w a s t o g a i n f e e d b a c k o n t h e i r a p p l i c a t i o n t o r e a l u n k n o w n s . The authors w o u l d e s p e c i a l l y a p p r e c i a t e r e c e i v i n g s p e c t r a o n w h i c h P B M or STIRS a p p a r e n t l y d i d not g i v e s a t i s f a c t o r y r e s u l t s . It h a s b e e n o u r e x p e r i e n c e t h a t t h e b e s t w a y t o u n d e r s t a n d the u t i l i t y , a n d the s h o r t c o m i n g s , of P B M as a r e t r i e v a l s y s t e m , a n d of STIRS a s a n i n t e r p r e t i v e s y s t e m , i s to r u n s p e c t r a of compounds of d i r e c t i n t e r e s t to the p a r t i c u l a r u s e r . Because STIRS h a s b e e n d e s i g n e d a s a n a i d t o t h e i n t e r p r e t e r , t h e i n t e r preter s k n o w l e d g e of (and i n t e r e s t in) t h e s a m p l e a n d i t s c h e m i s try s h o u l d m a x i m i z e the i n f o r m a t i o n w h i c h c a n be g a i n e d from e x a m i n a t i o n of t h e STIRS r e s u l t s . U s i n g t h e s e r e s u l t s e f f i c i e n t l y does require experience. S u g g e s t i o n s a s to b e t t e r u s e of t h e s e r e s u l t s , or i m p r o v e m e n t s t o e i t h e r t h e P B M or STIRS s y s t e m , w i l l be welcomed by the authors. 1

Literature Cited 1. Waller, G. R., "Biochemical Applications of Mass Spectrometry", Wiley-Interscience, New York, 1972. 2. McFadden, W. H., "Techniques of Combined Gas Chromatography/Mass Spectrometry", Wiley-Interscience, New York, 1973. 3. Pesyna, G. M. and McLafferty, F. W., in "Determination of Organic Structures by Physical Methods", pp. 91-155, Nachod, F. C., Zuckerman, J. J., and Randall, E. W., Eds., Vol. 6, Academic Press, New York, 1976. 4. Isenhour, T. L., Kowalski, B. R., and Jurs, P. C., Critical Review Anal. Chem., (1974), 4, 1. 5. Smith, D. H., Buchanan, B. G., Engelmore, R. S., Adlercreutz, H., and Djerassi, C., J. Am. Chem. Soc., (1973), 95, 6087. 6. McLafferty, F. W., Hertel, R. H., and Villwock, R. D., Org. Mass Spectrom., (1974), 9, 690. 7. Pesyna, G. M., Venkataraghavan, R., Dayringer, H. E., and McLafferty, F. W., Anal. Chem., (1976), 48, 1362. 8. Kwok, K.-S., Venkataraghavan, R., and McLafferty, F. W., J. Am. Chem. Soc., (1973), 95, 4185.

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch001

I.

VENKATARAGHAVAN ET A L .

Structure

of Unknown

Mass Spectra

17

9. Dayringer, H. E., Pesyna, G. M., Venkataraghavan, R., and McLafferty, F. W., Org. Mass Spectrom., (1976), 11, 529. 10. Dayringer, H. E. and McLafferty, F. W., Org. Mass Spectrom., (1976), 11, 543. 11. Dayringer, H. E. and McLafferty, F. W., Org. Mass Spectrom., (1976), 11, 895. 12. Salton, G., "Automatic Information Organization and Retrieval", McGraw-Hill, New York, 1968. 13. Grotch, S. L., 17th Annual Conference on Mass Spectrometry, Dallas, May, 1969, p 459. 14. Pesyna, G. M., McLafferty, F. W., Venkataraghavan, R., and Dayringer, H. E., Anal. Chem., (1975), 47, 1161. 15. Abramson, F. P., Anal. Chem., (1975), 47, 45. 16. McLafferty, F. W., Anal. Chem., accepted. 17. Atwater, B. L., McLafferty, F. W., and Venkataraghavan, R., in preparation. 18. Hites, R. A. and Biemann, K., Adv. Mass Spectrom., (1968), 4, 37. 19. "ISI Chemical Substructure Dictionary", Institute for Scientific Information, Philadelphia, Pennsylvania, September, 1974. 20. Stenhagen, E., Abrahamsson, S., and McLafferty, F. W., "Registry of Mass Spectral Data", Wiley-Interscience, New York, 1974. 21. McLafferty, F. W., "Mass Spectral Correlations", Advances in Chemistry Series No. 40, American Chemical Society, Washington, 1963. 22. Cone, M. M., Venkataraghavan, R., and McLafferty, F. W., J. Am. Chem. Soc., submitted. 23. Office of Computer Services, Cornell University, Ithaca, New York 14853. 24. Brenner, L. and Suffet, I. H., Environ. Sci. Health, Part A, in press. 25. Ryhage, R. and Stenhagen, E., in "Mass Spectrometry of Organic Ions", p. 417, McLafferty, F. W., Ed., Academic Press, New York, 1963. Acknowledgment We are grateful to Professor I. H. Suffet of Drexel University for helpful discussions concerning the application of PBM and STIRS to actual problems, to the National Institutes of Health (grant 16609) and to the Environmental Protection Agency (grant R804509) for financial support of this work, and to the National Science Foundation for partial funding of the DEC PDP-11/45 computer used in this research.

2 Identification of the Components of Complex Mixtures by

GC-MS

J. E. BILLER, W. C. HERLIHY, and K. BIEMANN

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch002

Department of Chemistry, Massachusetts Institute of Technology, Cambridge,MA02139

It has been well-known for many years that Gas Chromatography-Mass Spectrometry is a powerful tool for the qualitative identification of the components of complex mixtures. It has also been well-established that the computer is neededinall aspects of this process beginning with the initial acquisition of the data from the laboratory instrument, the processing and "crunching" of the data, and presentation of the data to the human interpreter. In recent years, techniques to improve and enhance the data, as well as the many methods of interpretation and identification have also fallen almost totally into the province of the computer. Our laboratory has been intimately involved in the development and use of many of these techniques. Our needs are somewhat unique in the very large number of chemical problems from very diverse sources which require identifications or structure determinations. We currently acquire and analyze approximately one-half million spectra a year from such varied sources as drugs in body fluids, geochemical extracts, metabolism studies, organic synthetic studies, and many others. We have even had to extend the capability of our data enhancement and interpretive algorithms and routines to the analysis of spectra returned from the GC-MS instruments in the two Viking landers on the surface of Mars (1). The need to analyze and interpret such a large volume of data has many implications. The most important requirement is a fully automated system capable of maximum performance and efficiency from end-to-end. This means that the GC-MS instrumentation must produce the best quality data possible under the fast scan conditions necessary for reasonable resolution of complex mixtures. The man-machine interface must be simple and human-engineered to limit the probability of operator error; the computer-instrument interfaces (both data and control) have to be accurate and highly reliable. The basic processing of the data must be efficient, fast, economical and complete, and the presentation of such vast amounts of information must be accomplished 18

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch002

2.

BILLER ET AL.

Components

of Complex

Mixtures

by GC-MS

19

i n an e q u a l l y e f f i c i e n t , yet convenient form. Perhaps most imp o r t a n t l y (and c e r t a i n l y of the g r e a t e s t c o m p l e x i t y ) , the computer must be able to purge the s p e c t r a of background and other unavoidable i n t e r f e r e n c e s so t h a t i t can r e l i a b l y i n t e r p r e t the s p e c t r a to accomplish the primary task of the i d e n t i f i c a t i o n of the components of i n t e r e s t . In a d d i t i o n , the r o u t i n e analyses of such l a r g e amounts of data r e q u i r e p r a c t i c a l approaches to the design of r o u t i n e s to both enhance and i n t e r p r e t s p e c t r a . Very elegant and timeconsuming algorithms which improve the r e s u l t s by some s m a l l f r a c t i o n a l margin towards p e r f e c t i o n at the expense of e i t h e r l a r g e computer power or time are o b v i o u s l y not u s e f u l or a p p r o p r i a t e . In a s i m i l a r way, the approach to the p r e s e n t a t i o n of the data i n such huge q u a n t i t i e s n e c e s s a r i l y precludes impressive but imp r a c t i c a l i n t e r a c t i v e graphic systems which r e q u i r e i n o r d i n a t e amounts of both human and computer time. We f e e l that we have developed a p r a c t i c a l system a b l e to meet a l l of these needs. We would l i k e to b r i e f l y d e s c r i b e how t h i s has been accomplished, and i n a d d i t i o n , d e s c r i b e a new and powerful technique f o r the automated and i n t e l l i g e n t i d e n t i f i c a t i o n of o l i g o p e p t i d e s i n complex mixtures i n support of our work on p r o t e i n sequencing. A Comprehensive System f o r the A n a l y s i s of Complex M i x t u r e s As mentioned above, the f i r s t necessary component i n a balanced system f o r the i d e n t i f i c a t i o n of the c o n s t i t u e n t s of a complex mixture i s the source of the data. The GC-MS instrument must be designed and optimized f o r f a s t , continuous a c q u i s i t i o n of data over the e n t i r e GC a n a l y s i s . The scan f u n c t i o n must be accurate and r e p r o d u c i b l e , the data must be acquired w i t h h i g h s e n s i t i v i t y , low e l e c t r o n i c n o i s e , and wide dynamic range to accomodate the l a r g e c o n c e n t r a t i o n d i f f e r e n c e s common i n complex organic mixtures. This i s accomplished i n our l a b o r a t o r y on a h i g h l y modified H i t a c h i RMU-6L Mass Spectrometer i n t e r f a c e d to a P e r k i n Elmer 990 Gas Chromatograph v i a a Watson-Biemann f r i t t e d g l a s s separator (2). Almost a l l of the RMU-6L e l e c t r o n i c s have been replaced w i t h r e l i a b l e s o l i d - s t a t e c i r c u i t r y , and the magnet i s c o n t r o l l e d d i g i t a l l y to produce a f a s t , r e p r o d u c i b l e scan. The output of a s o l i d - s t a t e electrometer i s d i g i t i z e d at three separate g a i n l e v e l s to achieve a very accurate s i g n a l w i t h l a r g e dynamic range, high s i g n a l - t o - n o i s e r a t i o , and h i g h s e n s i t i v i t y (3). The system i s diagrammed i n F i g u r e 1. The data processing i n c l u d e s the normal m a s s - i n t e n s i t y r e d u c t i o n , and i n a d d i t i o n , the r o u t i n e generation of a l l mass chromatograms (3,4). To p r i n t or p l o t t h i s amount of i n f o r m a t i o n i n the conventional ways f o r even a s i n g l e sample would be very time-consuming and d i f f i c u l t . To s o l v e the data p r e s e n t a t i o n problem, a l l s p e c t r a and mass chromatograms are d i r e c t l y micro-

GAS

SCANNER UNIT

OPERATOR DIGITAL SCAN CONTROL/DISPLAY CONTROLLER CONSOLE

HITACHI RMU-6L ELECTRONICS CONSOLE

• SAMPLE-

CHROMATOGRAPH

t

Figure 1.

AUXILIARY PROCESSING UNIT

DIGITAL INPUT / OUTPUT

system

1800 DACS Basic GC-MS

IBM

1802 CPU

1851 MULTIPLEXER

configuration

ANALOG TO DIGITAL CONVERTER CORE MEMORY (32 K)

DATA CHANNEL CONTROLS (DMA)

DIGITAL SIGNAL CONDITIONING CONTROL AUXILIARY DIFFERENTIAL AMR, LOGIC AND FILTER, S/H AMR, PROG. PULSE MULTIPLEXER -GAIN ADJUSTMENT GENERATOR i J

ANALOG FM MAGNETIC TAPE

ELECR0N MULTIPLIER ELECTROMETER AMPLIFIER

MASS SPECTROMETER

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch002

TYPEWRITER

CARD READER AND PUNCH

LINE PRINTER

PLOTTER

DISKS (3)

DIGITAL TAPES (2)

CRT DISPLAY

s

bo o

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch002

2.

BILLER E T A L .

Components

of Complex

Mixtures

by

GC-MS

21

f i l m e d (5.,6). Thus, a l l of the i n f o r m a t i o n produced i n the course of the experiment i s permanently a v a i l a b l e t o the user independently of the computer. This approach i s very economical and more i m p o r t a n t l y , very e f f i c i e n t w i t h respect t o human and computer time and resources. To prepare the data f o r the s e v e r a l automated i n t e r p r e t i v e techniques a v a i l a b l e ( i n c l u d i n g both l i b r a r y searching and s e v e r a l more i n t e l l i g e n t and s p e c i a l i z e d i n t e r p r e t i v e a l g o r i t h m s f o r s p e c i f i c c l a s s e s of compounds), the technique of r e c o n s t r u c t i o n of the s p e c t r a was developed and i s being c o n s t a n t l y improved i n t h i s l a b o r a t o r y . Since t h i s technique has been described p r e v i o u s l y (_7), a b r i e f review of the concept w i l l be s u f f i c i e n t . S p e c t r a l data from a GC-MS a n a l y s i s i s acquired c o n t i n u a l l y , and a t a constant r a t e . Thus, the normal two dimensional view of the data and i t s i n f o r m a t i o n content (mass and i n t e n s i t y ) can be expanded t o three dimensions to i n c l u d e the time element. Components only p a r t i a l l y r e s o l v e d by the gas chromatograph w i l l normally show separate maxima i n the mass chromatograms characteri s t i c of those components a t the times t h e i r i n d i v i d u a l conc e n t r a t i o n s a r e g r e a t e s t . An enhanced s e t of s p e c t r a i s generated by performing a f u l l peak p r o f i l e a n a l y s i s on the e n t i r e mass chromatogram f o r every m/e value and then r e c o n s t r u c t i n g s p e c t r a at each scan by assembling only those m/e values which maximize at that scan. These r e c o n s t r u c t e d s p e c t r a a r e p r a c t i c a l l y f r e e of the c o n t r i b u t i o n s of unresolved companion substances, t a i l i n g f r a c t i o n s , column b l e e d , and other sources of i n t e r f e r e n c e . The Mass-Resolved Gas Chromatogram (the e q u i v a l e n t of a T o t a l I o n i z a t i o n P l o t generated from t h i s new s e t of r e c o n s t r u c t e d data) i s generated and the e n t i r e s e t of data i s m i c r o f i l m e d . The Mass-Resolved Gas Chromatogram i s now a w e l l - r e s o l v e d i n d i c a t i o n of the number, l o c a t i o n , and r e l a t i v e i n t e n s i t y of each of the components i n the mixture. The r e c o n s t r u c t e d s p e c t r a a r e r e l a t i v e l y f r e e of i n t e r f e r ences which o f t e n make the c o r r e c t i d e n t i f i c a t i o n of the v a r i o u s components more d i f f i c u l t i f not i m p o s s i b l e . Automated i d e n t i f i c a t i o n techniques such as l i b r a r y searching and others such as the a l g o r i t h m f o r the i n t e r p r e t a t i o n of peptide mixtures d e s c r i b e d below a r e g r e a t l y f a c i l i t a t e d , and the r e s u l t a n t r e l i a b i l i t y g r e a t l y improved. An I n t e r p r e t i v e A l g o r i t h m f o r Complex Peptide M i x t u r e s The determination of the amino a c i d sequence of p o l y peptides i s of great i n t e r e s t and has been shown to be amenable to a n a l y s i s by gas chromatography-mass spectrometry ( 8 ) . I n order to c l a r i f y the ensuing d i s c u s s i o n of an a l g o r i t h m to automate the i n t e r p r e t a t i o n of o l i g o p e p t i d e mass s p e c t r a , the peptide sequencing s t r a t e g y used i n t h i s l a b o r a t o r y w i l l be b r i e f l y d e s c r i b e d . A p o l y p e p t i d e , such as Subunit I of the sweet p r o t e i n M o n e l l i n ( 9 ) , i s p a r t i a l l y hydrolyzed w i t h weak

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch002

22

COMPUTER-ASSISTED S T R U C T U R E E L U C I D A T I O N

a c i d or enzymes to produce a complex mixture of d i - to pentapeptides. These peptides are d e r i v a t i z e d to enhance t h e i r v o l a t i l i t y (10) (Figure 2 ) , and i n j e c t e d i n t o a gas chromatograph-mass spectrometer system which r e s u l t s i n 200-300 mass s p e c t r a of the 15-50 o l i g o p e p t i d e s i n the mixture. One must now l o c a t e and i d e n t i f y a l l of these o l i g o p e p t i d e s and f i n a l l y reassemble them to determine the sequence of the o r i g i n a l p o l y peptide. Since the manual i d e n t i f i c a t i o n of a l l the components of these complex mixtures i s a time-consuming and d i f f i c u l t task, an automated procedure was needed. Searching the unknown s p e c t r a against a l i b r a r y of standard s p e c t r a i s not f e a s i b l e s i n c e a l i b r a r y of a l l p o s s i b l e d i - to pentapeptides would c o n t a i n approximately 3.4 m i l l i o n s p e c t r a . An i n t e r p r e t i v e a l g o r i t h m i s f e a s i b l e , however, s i n c e the p o l y aminoalcohol d e r i v a t i v e s (Figure 2) used i n t h i s l a b o r a t o r y d i s p l a y p r e d i c t a b l e fragmentation p a t t e r n s . By examining a l a r g e number of standard peptide s p e c t r a we have found that d i peptides always e x h i b i t an i n t e n s e Z l i o n ; t r i - to pentapeptides always show a prominent A2 i o n , and t e t r a - and pentapeptides always e x h i b i t an i n t e n s e A3 i o n . In a d d i t i o n to these fragment a t i o n r u l e s , the a l g o r i t h m makes use of the amino a c i d composition of the o r i g i n a l p o l y p e p t i d e , and the r e t e n t i o n index f o r each scan which i s c a l c u l a t e d by a p r e v i o u s l y described method (11). A l s o , as has been p r e v i o u s l y shown, (12) we can c a l c u l a t e the expected r e t e n t i o n index f o r any o l i g o p e p t i d e which i s a f u n c t i o n of i t s composition, but r e l a t i v e l y independent of the amino a c i d sequence. Based on t h i s i n f o r m a t i o n the a l g o r i t h m shown i n Table I was developed. I t should be noted that f o r the e n t r i e s i n the peptide l i s t i n steps 1 and 3, the order of the amino a c i d s i s not s i g n i f i c a n t . Thus, a l l t e t r a p e p t i d e s which have the composition (A,B2,C) w i l l be represented by a s i n g l e entry. In steps 4-7, however, the order of the amino a c i d s i n s p e c i f i c , so that these steps r e f e r to sequences and not j u s t combinations of amino a c i d s . Steps 1 and 3 are f i l t e r s based on the amino a c i d composition of the o r i g i n a l polypeptide and the r e t e n t i o n index of the unknown s p e c t r a . Step 4 i s a f i l t e r based on the r e l i a b l e A2 i o n ( Z l f o r d i p e p t i d e s ) . S i m i l a r l y , step 6 i s a f i l t e r based on the known presence of an A3 i o n i n the s p e c t r a of t e t r a - and pentapeptides. For the h y p o t h e t i c a l example shown i n Table I an A2 i o n i s assumed to have been found f o r the p a r t i a l sequences AB.. and BA.., and an A3 i o n i s assumed to have been found only f o r the sequences ABBC and BABC. The c a l c u l a t i o n i n step 7 i s complex and w i l l be presented i n d e t a i l elsewhere (13). I t should be noted that c o r r e c t i d e n t i f i c a t i o n of an unknown spectrum i s not dependent on the presence of a molecular i o n , nor any s p e c i f i c sequence i o n (except the r e l i a b l e A2, A3, and Z l as described above). This a l g o r i t h m was t e s t e d on the data from the a n a l y s i s of a mixture of o l i g o p e p t i d e s generated i n an a c i d h y d r o l y s i s e x p e r i -

BILLER ET A L .

Components

of Complex

Mixtures

by

GC-MS

o

Q

i 0

1

<

0=0

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch002

p.

a o

0=0

1 00

Q O

us

&-

CO ctf

CM

•H

CO

CO

•H a MCU p. e g (1) CD

o CM CM

Q

i

0=0 CM

1 •U

CM

1

<

CM

o

I U 8)

+ !25 CO CM Q

CO

C a l c u l a t e scores, save best f i n d s , and p r i n t r e s u l t s .

Check t e t r a - and pentapeptide sequences f o r A3 i o n (e.g. ABBC, BABC).

Generate f u l l sequences f o r p a r t i a l sequences f o r which the A2 i o n i s present (e.g. ABBC, ABCB, BABC, BACB).

For each peptide generate N-terminal p a r t i a l sequences (e.g. AB.., BA.., AC.., CA.., BC.., CB.., BB..) and check f o r the expected A2 i o n ( Z l f o r d i p e p t i d e s ) .

S e l e c t from the l i s t of peptides those w i t h r e t e n t i o n i n d i c e s s i m i l a r to unknown's r e t e n t i o n index (± 7%).

Enter mass spectrum and r e t e n t i o n index of next scan to be i n t e r p r e t e d .

2

Generate a l i s t of a l l d i - to pentapeptides w i t h amino a c i d composition compatible w i t h sample, and c a l c u l a t e t h e i r r e t e n t i o n i n d i c e s , (e.g. (A,B ,C); R.I. = 630 + a + 2b = c)

PEPTIDE INTERPRETIVE ALGORITHM

TABLE I

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch002

2.

BILLER ET A L .

Components

of Complex

Mixtures

by

GC-MS

25

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch002

merit. Eighteen peptides had been p r e v i o u s l y i d e n t i f i e d by an experienced human i n t e r p r e t e r . The a l g o r i t h m was able to c o r r e c t l y i d e n t i f y seventeen of the o l i g o p e p t i d e s . The peptide not c o r r e c t l y i d e n t i f i e d was found as the t h i r d most l i k e l y choice f o l l o w i n g two s i m i l a r s t r u c t u r e s . We have attempted i n the preceding d i s c u s s i o n to i l l u s t r a t e why i t i s necessary to develop a l l aspects of a balanced data a c q u i s i t i o n , p r o c e s s i n g , d i s p l a y , and i n t e r p r e t i v e system i n order to cope w i t h the problems inherent i n the a n a l y s i s of complex mixtures by GC-MS. Although good i n t e r p r e t i v e a l g o r i t h m s are very important i n attempts at automated s t r u c t u r e e l u c i d a t i o n , at l e a s t i n the case of mixture a n a l y s i s , they are a t best o n l y as good as the weakest l i n k i n the procedure which provides the data to be analyzed.

Literature Cited. 1. Biemann, K., Oro, J., Toulmin III, P., Orgel, L.E., Nier, A.O., Anderson, D.M., Simmonds, P.G., Flory, D., Diaz, A.V., Rushneck, D.R., Biller, J.E., Science (1976), 194, 72-76. 2. Watson, J.T., and Biemann, K., Analytical Chemistry (1964), 36, 1135. 3. Biller, J.E., Ph.D. Thesis, Massachusetts Institute of Technology, May 1972, and other unpublished work. 4. Hertz, H.S., Hites, R.A., and Biemann, K., Analytical Chemistry (1971), 43, 681. 5. Biller, J.E., Hertz, H.S., and Biemann, K., Paper F2, presented at the Nineteenth Annual Conference on Mass Spectrometry and Allied Topics, Atlanta, Georgia, May 1971. 6. Biller, J.E., Hertz, H.S., and Biemann, K., In preparation. 7. Biller, J.E., and Biemann, K., Analytical Letters (1974), 7(7), 518-528. 8. Biemann, K., in "Biomedical Applications of Mass Spectrometry", G. Waller (ed.) (1972), 405-428. 9. Hudson, G., and Biemann, K., Biochemical and Biophysical Res. Communications (1976), 71(1), 212-220. 10. Kelley, J.A., Nau, H., Forster, H.-J., and Biemann, K., (1975), Biomedical Mass Spectrometry, 2, 313-325. 11. Nau, H., and Biemann, K., Analytical Chemistry (1974), 46(3), 426-434. 12. Nau, H., and Biemann, K., Analytical Biochemistry (1976), 73, 139-153. 13. Herlihy, W.C., and Biemann, K., In preparation.

3 The NIH-EPA Chemical Information System G. W. A. MILNE National Institutes of Health, Bethesda, MD 20014

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch003

S. R. H E L L E R Environmental Protection Agency, Washington, DC 20460

The quantity of data associated with analytical chemistry has been expanding very rapidly during the last twenty years or so, but until recently, the efficient application of computers to this problem has been vitiated by the high costs of computer storage and computation. With the continuing improvement in computer technology and the steady decrease in computation costs, it has, in the last two years, become feasible to consider the development of a completely interactive chemical information system. Earlier searching systems avoided the cost of bulk storage by maintaining the data files on magnetic tape rather than disk. Tape is a very cheap form of storage but its use implies batch searching which is necessarily slow because tape is not susceptible to random access. Magnetic disks, on the other hand, are random access devices and the data stored on them can be searched very rapidly. Until recently, the cost of storage on disks has precluded their use for large data bases, but now it is becoming practical to consider this approach. Since this permits interactive computing, we have developed a chemical information system that uses disk exclusively for the storage of data. Interactive computing is a signficantly different process from batch, or off-line, computing and a different philosophy can be used in the design of programs for such work. A major problem that a chemist has in searching a chemical data base, is that the best questions to ask are often not known. An interactive system can provide the answer to a question immediately and this will enable the user to see the deficiences in the question and to frame a new query. In this way, there can be built a feedback loop in which the chemist acts as a transducer, "tuning" the query until the system reports precisely what is required. The NIH-EPA Chemical Information System (CIS), which is described in this chapter, has been designed around this general approach (1). 26

3.

M I L N E AND HELLER

NIH-EPA

Chemical

Information

System

27

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch003

System Design The CIS c o n s i s t s of a c o l l e c t i o n of chemical data bases together w i t h a b a t t e r y of programs f o r i n t e r a c t i v e s e a r c h i n g through these d i s k - s t o r e d data bases. I n a d d i t i o n , there are a s e r i e s of programs f o r the a n a l y s i s of data, e i t h e r t o reduce them t o a form s u i t a b l e f o r s e a r c h i n g purposes or as an end i n itself. The data bases t h a t a r e i n the CIS i n c l u d e f i l e s of mass s p e c t r a , carbon-13 n u c l e a r magnetic resonance s p e c t r a , x-ray d i f f r a c t i o n data f o r c r y s t a l s and powders, and s e v e r a l b i b l i o graphic data bases. The a n a l y t i c a l programs i n c l u d e a f a m i l y of s t a t i s t i c a l a n a l y s i s and mathematical m o d e l l i n g a l g o r i t h m s and programs f o r the c a l c u l a t i o n of i s o t o p i c enrichment from mass s p e c t r a l data, a n a l y s i s of nmr s p e c t r a and energy m i n i m i z a t i o n of conformational s t r u c t u r e s . a. A d d i t i o n of components t o the CIS. A general p r o t o c o l f o r updating of CIS components o r the a d d i t i o n t o the CIS of new components has been e s t a b l i s h e d and a schematic diagram of t h i s i s shown i n F i g u r e 1. In the f i r s t phase, a data base i s a c q u i r e d , i f necessary, from one of a v a r i e t y of sources. Some of the CIS data bases have been developed s p e c i f i c a l l y f o r the CIS, an example of t h i s being the mass s p e c t r a l data base (2). Other data bases, such as the Cambridge C r y s t a l F i l e (3)> a r e leased f o r use i n the CIS and s t i l l o t h e r s , such as the X-ray powder d i f f r a c t i o n f i l e ( 4 ) , a r e operated w i t h i n the CIS by t h e i r owners. Next, the necessary program development i s undertaken. I f the component i s one i n v o l v i n g s e a r c h i n g of a data base, some r e f o r m a t t i n g of the data base, s o r t i n g and i n v e r s i o n of f i l e s and so on, i s u s u a l l y r e q u i r e d , and t h i s i s c a r r i e d out on the NIH IBM 360-168, which i s w e l l - s u i t e d t o p r o c e s s i n g of l a r g e f i l e s of data. Once searchable f i l e s have been prepared, they are t r a n s f e r r e d t o the NIH PDP-10 computer which i s p r i m a r i l y a time-sharing computer, and the programs f o r s e a r c h i n g through the data bases are w r i t t e n . The a n a l y t i c a l , data base-independent programs of the CIS are u s u a l l y w r i t t e n e n t i r e l y on the PDP-10. Out of t h i s work there f i n a l l y emerges a p i l o t v e r s i o n of the component. The p i l o t v e r s i o n i s then allowed t o run on the NIH PDP-10 and access t o i t i s provided t o a s m a l l number of people who can l o g i n t o the NIH computer by telephone, u s i n g long d i s t a n c e c a l l s i f necessary. These users are provided w i t h f r e e computation and i n r e t u r n , they t e s t the component thoroughly f o r e r r o r s and d e f i c i e n c i e s . Such problems are reported t o the development team, which attempts t o d e a l w i t h them. Depending upon the s i z e and complexity of the component, t h i s t e s t i n g phase can l a s t as long as eighteen months. When t e s t i n g i s complete, the e n t i r e component i s exported to a networked PDP-10 i n the p r i v a t e s e c t o r and the v e r s i o n on

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch003

28

COMPUTER-ASSISTED S T R U C T U R E E L U C I D A T I O N

the NIH computer i s dispensed w i t h . The component i n the p r i v a t e s e c t o r i s a v a i l a b l e t o the general s c i e n t i f i c community and can be used on a f e e - f o r - s e r v i c e b a s i s . In t h i s phase, the government r e t a i n s no f i n a n c i a l i n t e r e s t i n the component; i t i s "managed" by a sponsor o u t s i d e the U.S. Government. The Department of Industry of the B r i t i s h Government, f o r example, maintains the Mass S p e c t r a l Search System on the network. Advice and c o n s u l t a t i o n between such sponsors and NIH/EPA personnel continues, but the U.S. Government does not s u b s i d i z e the r o u t i n e o p e r a t i o n of CIS components i n the p r i v a t e s e c t o r . In f a c t , v a r i o u s government agencies of the governmnt are a c t u a l l y users of the CIS and they pay according to t h e i r use of the system, l i k e any other user. Charges f o r use of these components must be designed to cover c o s t s , and i f the component a t t r a c t s i n s u f f i c i e n t use at these p r i c e s , then i t i s probably not v i a b l e and i t s sponsor need not continue to support it. b. Computers f a c i l i t i e s used by the CIS. Programs of the CIS have u s u a l l y been designed f o r use w i t h a DEC PDP-10 computer system. The reason f o r t h i s i s that the PDP-10 i s one of the b e t t e r time-sharing systems a v a i l a b l e and has been adopted by a number of commercial computer network companies as the main v e h i c l e f o r t h e i r networks. Transfer of a program from the NIH PDP-10 t o a network PDP-10 i s u s u a l l y s t r a i g h t f o r w a r d , and use of a networked computer i s favored because the a l t e r n a t i v e philosophy of e x p o r t i n g programs and data bases to l o c a l l y operated PDP-10 computers i s l e s s workable. This l a t t e r approach contains a number of d e f i c i e n c i e s t h a t are overcome by a network. Most important i n t h i s connection i s the f a c t that use of a networked machine means that data bases need only be stored once, a t the center of the network. A great d e a l of money i s thus saved because d u p l i c a t e storage i s not necessary. F u r t h e r , a s i n g l e copy of a data base i s easy to m a i n t a i n , whereas updating of a data base that r e s i d e s on many computers i s v i r t u a l l y i m p o s s i b l e . F i n a l l y , communications between systems, personnel, and users i s very simple i n a network environment, as i s monitoring of system performance. For these and other reasons, the p o l i c y of d i s s e m i n a t i n g the CIS v i a networked PDP-10 computers was adopted a t the outset and has proved to be q u i t e s u c c e s s f u l . A t y p i c a l U.S. network of t h i s s o r t has something under 100 nodes - i . e . , l o c a l telephone c a l l access i s a v a i l a b l e i n about 100 l o c a t i o n s . These are mainly i n the U.S., but a s u b s t a n t i a l number w i l l be found i n Europe. F u r t h e r , some computer networks are now themselves i n t e r f a c e d t o the T e l e x network, thus making t h e i r computer systems a v a i l a b l e worldwide. I r r e s p e c t i v e of one's l o c a t i o n , the cost of access i s somewhere between $7 and $15 per hour, depending upon the t r a n s m i s s i o n speed used and a l s o on the time of day. Networks u s u a l l y o f f e r 110, 300 and 1200 baud s e r v i c e and the response time of the system i s u s u a l l y n e g l i g i b l y s m a l l .

3.

M I L N E AND H E L L E R

NIH-EPA

Chemical

Information

System

29

The o n l y equipment t h a t i s r e q u i r e d to e s t a b l i s h access to a computer network, i s a telephone-coupled computer t e r m i n a l . Typewriter t e r m i n a l s are becoming very common and are a l s o becoming r e l a t i v e l y cheap. Such a t e r m i n a l can be purchased from a v a r i e t y of manufacturers f o r between $1,000 and $3,000 and i n g e n e r a l , w i l l operate a t 300 baud (30 characters/second). A cathode ray t e r m i n a l , capable of running at 1200 baud can be purchased f o r as l i t t l e as $2,000. Any equipment of t h i s s o r t can u s u a l l y be leased or purchased.

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch003

Components of the CIS a. Mass S p e c t r a l Search System. The Mass S p e c t r a l Search System (MSSS) i s the o l d e s t component of the CIS, having been developed i n 1971, and has been seen as a prototype f o r more recent components. Developed as a j o i n t e f f o r t between NIH, EPA, and the Mass Spectrometry Data Centre (MSDC) i n England, the c u r r e n t MSSS data base c o n t a i n s about 30,000 mass s p e c t r a r e p r e s e n t i n g the same number of compounds. This has been d e r i v e d from an a r c h i v a l f i l e c o n t a i n i n g some 60,000 s p e c t r a of the same 30,000 compounds C5)« Computer techniques have been employed to a s s i g n every spectrum a q u a l i t y index(6) and where d u p l i c a t e s p e c t r a appear i n the a r c h i v e f i l e , only the best spectrum i s used i n the working f i l e . A l l compounds i n the a r c h i v e have been assigned Chemical A b s t r a c t s S e r v i c e (CAS) r e g i s t r y number, a unique i d e n t i f i e r t h a t i s used to l o c a t e d u p l i c a t e e n t r i e s f o r the compounds, f i n d the compound i n other CIS f i l e s and p r o v i d e s t r u c t u r e and synonym lookup c a p a b i l i t i e s throughout the CIS. Searches through the MSSS data base can be c a r r i e d out i n a number of ways. With the mass spectrum of an unknown i n hand, the search can be conducted i n t e r a c t i v e l y , as i s shown i n F i g u r e 2. In t h i s search the user f i n d s t h a t 24 data base s p e c t r a have a base peak (minimum i n t e n s i t y 100%, maximum i n t e n s i t y 100%) at an m/e v a l u e of 344. When t h i s subset i s examined f o r s p e c t r a c o n t a i n i n g a peak a t m/e 326 w i t h i n t e n s i t y of l e s s than 10%, only 2 s p e c t r a are found. I f necessary, the search can be continued i n t h i s way u n t i l a manageable number of s p e c t r a are r e t r i e v e d as f u l f i l l i n g a l l the c r i t e r i a t h a t the user c i t e d . These answers can then be l i s t e d as i s shown i n F i g u r e 2. A l t e r n a t i v e l y , the f i l e can be examined f o r a l l occurrences of a s p e c i f i c molecular weight or a p a r t i a l or complete molecular formula. Combinations of these p r o p e r t i e s can a l s o be used i n searches. Thus, a l l compounds c o n t a i n i n g , f o r example, f i v e c h l o r i n e s and whose mass s p e c t r a have a base peak at a p a r t i c u l a r m/e v a l u e can be i d e n t i f i e d . In c o n t r a s t to these i n t e r a c t i v e searches, which are of l i t t l e appeal to those w i t h l a r g e numbers of searches to c a r r y out, there i s a v a i l a b l e a batch-type search which accepts the complete spectrum of the unknown and s e q u e n t i a l l y examines a l l s p e c t r a i n the f i l e to f i n d the best f i t s . A user's data system can be connected to the network f o r t h i s purpose and the unknown

30

COMPUTER-ASSISTED

STRUCTURE

ELUCIDATION

DATA

PILOT VERSION OF S E A R C H SYSTEM

COMMERCIAL VERSION OF S E A R C H SYSTEM

'EXPERT' U S E R S

G E N E R A L USER

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch003

PROGRAMS

-12 M O N T H S -

•3-15 MONTHS-

DCRT 370, PDP10

DCRT PDP10

Figure

USER

1.

NETWORKED

Protocol for adding a component

PDP10

to the CIS

RESPONSE:PEAK

TYPE PEAK,MIN INT,MAX INT CR TO EXIT/1 FOR ID#,REGN,QI,MW,MF AND NAME USER:344,100,100 # REFS 24

M/E PEAKS 344

NEXT REQUEST: 326,0,10 # REFS 2

M/E PEAKS 344 326

NEXT REQUEST: 1 ID# REGN Ql MW MF 23455 19594913 624 344 C22H3203 25330

18326153 974 386 C24H3404

Figure

2.

PEAK

NAME A-Nor-5.alpha.androstan-3-one, 17.beta.-hydr oxy-5-vinyl-, acetate (8CI) Podocarpa-8,11, 13-triene-3.beta.,12-diol, 13isopropyl-, di acetate (8CI)

search in the MSSS

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch003

3.

M I L N E AND HELLER

NIH-EPA

Chemical

Information

System

31

s p e c t r a can be down-loaded i n t o the network computer f o r use i n t h i s search, which can be c a r r i e d out a t once, o r , i f p r e f e r r e d , overnight a t 30% o f the c o s t . Once an i d e n t i f i c a t i o n has been made, the name and r e g i s t r y number o f the data base compound are reported t o the user. I f necessary, the data base spectrum can be l i s t e d o r , i f a CRT t e r m i n a l i s being used, p l o t t e d , t o f a c i l i t a t e d i r e c t comparison of the unknown and standard s p e c t r a . The MSSS has been g e n e r a l l y a v a i l a b l e through computer n e t works f o r s e v e r a l years and i s now c u r r e n t l y r e s i d e n t upon the ADP-Cyphernetics network where some 3,000 searches and 2,000 other t r a n s a c t i o n s , such as r e t r i e v a l s , are c a r r i e d out each month by the approximately 200 users. A l l searches i n the MSSS are t r a n s a c t i o n p r i c e d a t between $1 and $7 and i n a d d i t i o n t o these charges and the connect time charge, users must pay the annual s u b s c r i p t i o n fee o f $300. This fee i s used t o defray the annual d i s k storage charges which are p a i d i n advance by the sponsor of the MSSS, the Department o f Industry of the B r i t i s h Government. b. Carbon Nuclear Magnetic Resonance (CNMR) S p e c t r a l Search System. The data base t h a t i s used i n the CNMR search system c o n s i s t s c u r r e n t l y o f 4,400 CNMR s p e c t r a . As i n the case of the MSSS, every compound has a CAS r e g i s t r y number, and a l l exact d u p l i c a t e s have been removed from the f i l e . A s p e c i f i c compound may s t i l l appear i n t h i s f i l e more than once, however, because i t s CNMR spectrum may have been recorded i n d i f f e r e n t s o l v e n t s . The CNMR f i l e i s s t i l l s m a l l but i s growing a t a f a i r l y steady r a t e and should b e n e f i t c o n s i d e r a b l y from recent i n t e r n a t i o n a l agreements t o the e f f e c t t h a t a l l major c o m p i l a t i o n s of CNMR data w i l l , i n the f u t u r e , be pooled. Searching through t h i s data base, as i n the case of the MSSS, can be i n t e r a c t i v e o r not. I n the i n t e r a c t i v e search, a user e n t e r s a s h i f t , w i t h an acceptable d e v i a t i o n , and the s i n g l e frequency offresonance decoupled m u l t i p l i c i t y , i f t h a t i s known. The program r e p o r t s the number of f i l e s p e c t r a f i t t i n g one o r both of these c r i t e r i a . The names of the compounds whose s p e c t r a have been r e t r i e v e d can be l i s t e d , o r a l t e r n a t i v e l y , the l i s t can be reduced by the e n t r y of a second chemical s h i f t . A search f o r s p e c t r a of compounds having a s p e c i f i c complete o r p a r t i a l molecu l a r formula can a l s o be c a r r i e d out, but there i s no c a p a b i l i t y for searching on molecular weight, a parameter of l i t t l e relevance to CNMR spectroscopy. I f an i n t e r a c t i v e search i s not a p p r o p r i a t e t o the problem a t hand, a batch type of search through the data base u s i n g the techniques d e s c r i b e d by C l e r c e t a l . ( 7 ) i s a v a i l a b l e . To c a r r y out such a search, the user enters a l l the chemical s h i f t s from the unknown and s t a r t s the search. The e n t i r e unknown spectrum i s compared t o every e n t r y i n the f i l e and the best f i t s are noted and reported t o the user. This program searches f o r the absence of peaks i n a given r e g i o n as w e l l as f o r the presence of peaks

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch003

32

COMPUTER-ASSISTED STRUCTURE

ELUCIDATION

and thus has the c a p a b i l i t y of f i n d i n g those compounds which are s t r u c t u r a l l y s i m i l a r to the m a t e r i a l t h a t gave the unknown spectrum. When a search i s completed, the user i s provided w i t h the a c c e s s i o n numbers of s p e c t r a that f i t the input data. The names and CAS r e g i s t r y numbers of the compounds i n q u e s t i o n w i l l a l s o be given. I f more i n f o r m a t i o n i s r e q u i r e d , the complete entry f o r a given a c c e s s i o n number can be r e t r i e v e d . This i n c l u d e s a numbered s t r u c t u r a l formula, the name, molecular formula and r e g i s t r y number of the compound, experimental data p e r t a i n i n g to the spectrum and the e n t i r e spectrum, together w i t h s i n g l e frequency off-resonance decoupled m u l t i p l i c i t i e s and ( i f a v a i l a b l e ) r e l a t i v e l i n e i n t e n s i t i e s and assignments. T h i s CNMR search system r e c e n t l y has been made a v a i l a b l e on the ADP-Cyphernetics network. Searches are a l l t r a n s a c t i o n p r i c e d at $1-3. c. X-ray C r y s t a l l o g r a p h i c Search System. This i s a s e r i e s of search programs working a g a i n s t the Cambridge C r y s t a l F i l e ( 8 ) , a data base of some 15,000 e n t r i e s d e a l i n g w i t h p u b l i s h e d c r y s t a l l o g r a p h i c data mainly f o r organic compounds. The e n t r y f o r each compound contains the compound name, i t s molecular weight and r e g i s t r y number, the space group i n which i t c r y s t a l l i z e s and the parameters of the u n i t c e l l of the c r y s t a l . The f i l e may be searched on the b a s i s of any of these parameters as shown i n F i g u r e 3, which i s an example of a search f o r any compounds that c r y s t a l l i z e i n space group P 1 and have a molecular weight between 250 and 300. As can be seen from F i g u r e 3, there are 98 e n t r i e s w i t h the c o r r e c t space group ( s c r a t c h f i l e 1) and 867 w i t h a molecular weight w i t h i n the s p e c i f i e d range ( s c r a t c h f i l e 3). The i n t e r s e c t i o n of these f i l e s r e v e a l s t h a t only 3 compounds ( s c r a t c h f i l e 4) meet both s p e c i f i c a t i o n s , and the f i r s t of these compounds, c r y s t a l sequence number 4413, i s l i s t e d . A l l the compounds i n t h i s f i l e have been r e g i s t e r e d by the CAS and t h e i r connection t a b l e s have been merged i n t o the f i l e . This data base i s , t h e r e f o r e , searchable on a s t r u c t u r a l or subs t r u c t u r a l b a s i s , as are a l l the othe f i l e s of the CIS. Once an e n t r y of i n t e r e s t i n t h i s f i l e has been l o c a t e d by one of the search programs, i t s f i l e a c c e s s i o n number, the " c r y s t a l sequency number" can be used to r e t r i e v e the a p p r o p r i a t e l i t e r a t u r e reference or the s t r u c t u r e , or both. T h i s f i l e i s a v a i l a b l e f o r general use v i a the ADP-Cyphernetics network. C u r r e n t l y , the charging of options i n t h i s system i s not t r a n s a c t i o n - p r i c e d . Enough s t a t i s t i c s are now a v a i l a b l e to i n d i c a t e that a l l searches, other than s t r u c t u r a l searches, cost l e s s than $2.00 and t h a t the s t r u c t u r a l searches cost p o s s i b l y as much as $10.00. d. X-ray C r y s t a l Data R e t r i e v a l System. The N a t i o n a l Bureau of Standards (NBS) has c o l l e c t e d a f i l e of data p e r t a i n i n g to some

3.

M I L N E A N D HELLER

NIH-EPA

Chemical

Information

System

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch003

*SPGR SPACE GROUP SYMBOL > P 1 FILE = 1 REFERENCES = 98 ITEM = P 1 •SMOLS 1 TYPE MOLECULE WEIGHT RANGE >250,300

FILE = 3

MERGED REFERENCES = 867

•INTER 1 3 FILE = 4 INTERSECTED REFERENCES = 3 SOURCE FILES WERE: 1 3 *SSHOW 4 START WITH NTH REFERENCE (1)=1 SHOW EVERY NTH REFERENCE (1) =1 STRUCTURE 1 CRYSTAL SEQUENCE 4413 I

C

*

*

*

*

C**N»*C * »

*

*

c**o**c o

*

C

c* *c *

*

*

*

c**c Figure

3. Space group and molecular search in the Cambridge crystal file

weight

33

34

COMPUTER-ASSISTED S T R U C T U R E E L U C I D A T I O N

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch003

24,000 c r y s t a l s , i n c l u d i n g those i n the Cambridge f i l e described i n (c) above (9). The data i n the NBS f i l e i n c l u d e the c e l l parameters, the number of molecules, Z, i n the u n i t c e l l , the measured and c a l c u l a t e d d e n s i t i e s of the c r y s t a l and two determina t i v e r a t i o s , such as A/B and A/C. Every compound i n the f i l e i s i d e n t i f i e d by i t s name, molecular formula, and r e g i s t r y number, and the f i l e can be s t r u c t u r a l l y searched by the CIS s u b s t r u c t u r e search system which i s d e s c r i b e d below. Searches through t h i s data base f o r c r y s t a l s w i t h s p e c i f i c space groups, or d e n s i t i e s are p o s s i b l e and c r y s t a l s w i t h u n i t c e l l s of given dimensions can a l s o be found. I t i s hoped t h a t t h i s may prove to be a very r a p i d method of i d e n t i f y i n g compounds from the r e a d i l y measured c r y s t a l p r o p e r t i e s . e. X-ray Powder D i f f r a c t i o n R e t r i e v a l Program. Compounds t h a t f a i l to c r y s t a l l i z e may s t i l l be examined by X-ray d i f f r a c t i o n , because powders g i v e c h a r a c t e r i s t i c d i f f r a c t i o n p a t t e r n s . A c o l l e c t i o n of powder d i f f r a c t i o n p a t t e r n s proves to be a very e f f e c t i v e means by which to i d e n t i f y m a t e r i a l s and indeed, one of the very e a r l i e s t search systems i n chemical a n a l y s i s was based upon such data by Hanawalt (10) n e a r l y f o r t y years ago. The data base of some 27,000 powder d i f f r a c t i o n p a t t e r n s ( 1 1 ) t h a t i s used i n the CIS i s i n f a c t a d i r e c t descendant of that w i t h which Hanawalt c a r r i e d out h i s p i o n e e r i n g work. A problem that a r i s e s i n connection w i t h t h i s p a r t i c u l a r component stems from the f a c t t h a t powders, as opposed to c r y s t a l s , are f r e q u e n t l y impure. The p a t t e r n s t h a t are obtained e x p e r i m e n t a l l y , t h e r e f o r e , are o f t e n combinations of one or more f i l e e n t r i e s . A reverse searching program, t h a t examines the experimental data to see i f each e n t r y from the f i l e i s contained i n i t ( 1 2 ) , has been w r i t t e n and seems to cope w i t h t h i s p a r t i c u l a r d i f f i c u l t y . I t i s c u r r e n t l y running i n t e s t on the NIH PDP-10 and w i l l be made a v a i l a b l e to the s c i e n t i f i c community d u r i n g 1977. f. Substructure Search System (SSS). A l l the compounds i n the f i l e s of the CIS have been assigned a r e g i s t r y number by the CAS. The r e g i s t r y number i s a unique i d e n t i f i e r f o r that compound, and may be used to r e t r i e v e from the CAS Master R e g i s t r y , a l l the synonyms that the CAS has i d e n t i f i e d f o r the compound, these being names t h a t have been used f o r the compound, i n a d d i t i o n to the name used i n the CAS 9th C o l l e c t i v e Index. F u r t h e r , the r e g i s t r y number can be used to l o c a t e i n the CAS f i l e s , the connection t a b l e f o r the compound's s t r u c t u r e . This i s a two-dimensional record of a l l atoms i n the molecule together w i t h the atoms to which each i s bonded and the nature of the bonds (13). This connection t a b l e i s the b a s i s of the s u b s t r u c t u r e search component of the CIS (14). The purpose of the SSS i s to permit a search f o r a userd e f i n e d s t r u c t u r e or s u b s t r u c t u r e through data bases of the CIS. I f a s u b s t r u c t u r e i s found to be i n a CIS data base, then armed

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch003

3.

M I L N E AND HELLER

NIH-EPA

Chemical

Information

System

35

w i t h i t s r e g i s t r y number, the user can access that f i l e and l o c a t e the compound and hence i n s p e c t whatever data i s a v a i l a b l e f o r i t . As the f i r s t step i n t h i s process, the user must, of course, be a b l e to d e f i n e the s t r u c t u r e of i n t e r e s t to the computer. This i s done w i t h a f a m i l y of s t r u c t u r e generation programs which can, f o r example, c r e a t e a r i n g of a given s i z e , a c h a i n of a given l e n g t h , a fused r i n g system and so on. Branches, bonds and atoms can be added and the nature of bonds can be s p e c i f i e d . The element represented by any nodes can be d e f i n e d ; i n the absence of such d e f i n i t i o n , the atom i s presumed to be carbon. As the query s t r u c t u r e i s developed u s i n g these commands, the computer s t o r e s the growing connection t a b l e . I f the user wishes to view the c u r r e n t s t r u c t u r e a t any p o i n t , the d i s p l a y command (D) can be invoked. T h i s command, u s i n g the c u r r e n t connection t a b l e , generates a s t r u c t u r e diagram that can be p r i n t e d at a convent i o n a l terminal. When the a p p r o p r i a t e query s t r u c t u r e has been generated, a number of search options can be invoked to f i n d occurrences of t h i s query s t r u c t u r e i n the data base. The two most u s e f u l search options are the fragment probe and the r i n g probe. The fragment probe w i l l search through the assembled connection t a b l e s of the data base f o r a l l occurrences of a p a r t i c u l a r fragment, i . e . , a s p e c i f i c atom, together w i t h a l l i t s neighbors and bonds. The user may s p e c i f y p a r t i c u l a r fragments i n the query s t r u c t u r e which are thought to be f a i r l y unique and c h a r a c t e r i s t i c of that s t r u c t u r e . A l t e r n a t i v e l y , a search f o r every fragment i n the query s t r u c t u r e may be requested. The general form of a fragment probe i s as shown i n F i g u r e 4. The query s t r u c t u r e contains only one r e l a t i v e l y unique node, C3, and t h i s i s the one which i s sought i n the data base. I t i s found to occur 980 times and a temporary f i l e of j u s t those p a r t i c u l a r e n t r i e s i s s t o r e d as f i l e #2. T h i s can be accessed by the user e i t h e r f o r the purpose of l i s t i n g i t s contents, as i s shown i n the f i g u r e , or to i n t e r s e c t i t w i t h other s c r a t c h f i l e s . The r i n g probe search i s a search through the data base f o r a l l s t r u c t u r e s c o n t a i n i n g the same r i n g or r i n g s as the query s t r u c t u r e . A r i n g that i s considered to be an answer to such a query must be the same s i z e as that i n the query s t r u c t u r e . I t must a l s o c o n t a i n the same number of non-carbon atoms (heteroatoms) but the nature of the heteroatoms and the p o s i t i o n of any s u b s t i t u e n t s can be r e q u i r e d by the user to be the same or d i f f e r e n t to t h a t i n the query s t r u c t u r e . Thus w i t h a query s t r u c t u r e of p y r r o l e , the only "exact" answer i s p y r r o l e but the user may permit the r e t r i e v a l of "imbedded" answers which would i n c l u d e f u r a n and thiophene. S i m i l a r l y , o-xylene i t s e l f i s the only "exact" match f o r o-xylene, but m- and p-xylene would be considered as "imbedded" matches. An example of a r i n g probe search i s given i n F i g u r e 5. Here the query s t r u c t u r e i s a 3 , 4 - d i c h l o r o f u r a n , but imbedded matches f o r heteroatoms and s u b s t i t u t e n t s have been

COMPUTER-ASSISTED

STRUCTURE

ELUCIDATION

OPTION? D 5 6

4

1

3**7CL

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch003

2 OPTION? FPROB 3 TYPE E TO EXIT FROM A L L SEARCHES, T TO PROCEED TO NEXT FRAGMENT SEARCH FRAGMENT: 7CL*»**3C

2C

4C REQUIRED OCCURRANCES FOR HIT : 1 THIS FRAGMENT OCCURS IN 980 COMPOUNDS FILE =

2,

980 COMPOUNDS CONTAIN THIS FRAGMENT

OPTION? SSHOW 2 HOW MANY STRUCTURES (E TO EXIT) ? 1 STRUCTURE 1 CAS REGISTRY NUMBER

50293 C14H9C15

C

C

CL«C

C

C

C

C»»CL C

C*****C**C •

•

• • C

•

* CL*C»«CL «

•

•

•

C

CL Benzene, 1,1'- (2,2,2 -trichloroethylidene)bis [4-chloro- (9CI) Figure

4.

Fragment

probe in the substructure

search system

M I L N E AND H E L L E R

?

?

NIH-EPA

Chemical

Information

System

7CL

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch003

5+++++4 ? ? ? ? 10 3??6CL ? + ? + 2 OPTION? EXIM 3 4 OPTION? RPROB C?????0 ? ? ? ? ??C C ? ? ? ? C ? ? CONDITIONS OF SEARCH CHARACTERISTICS TO BE MATCHED TYPE OF MATCH TYPE OF RING OR NUCLEUS EXACT HETEROATOMS AT 1 EXACT HETEROATOMS ARE 0 IMBED SUBSTITUENTS AT 4 3 IMBED THIS RING/NUCLEUS OCCURS IN 304 COMPOUNDS FILE =

3,

304 COMPOUNDS CONTAIN THIS RING/NUCLEUS

OPTION? SHOW 3 HOW MANY STRUCTURES (E TO EXIT) ? TYPE E TO TERMINATE DISPLAY STRUCTURE 1 CAS REGISTRY NUMBER

50817 C6H806

0

+ + o»*«»»c 0«»C»»C"C

• •

*

+ *

o

C»«0

+

c

L-Ascorbic acid (8CI9CI) Figure

5.

Ring probe in the substructure

search

system

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch003

38

COMPUTER-ASSISTED S T R U C T U R E E L U C I D A T I O N

allowed and so the l i s t of 304 answers w i l l i n c l u d e any d i s u b s t i tuted p y r r o l e as w e l l as any d i s u b s t i t u t e d furan and so on. Higher s u b s t i t u t i o n w i l l a l s o be p e r m i t t e d . In a d d i t i o n t o these s t r u c t u r a l searches, there are a number of " s p e c i a l p r o p e r t i e s " searches that o f t e n prove to be very usef u l as a means of reducing a l a r g e l i s t of answers r e s u l t i n g from s t r u c t u r e searches. The s p e c i a l p r o p e r t i e s searches i n c l u d e searches f o r a s p e c i f i c molecular weight or range of molecular weights and a search f o r compounds c o n t a i n i n g a given number of r i n g s . Searches may a l s o be conducted f o r the molecular formula corresponding to the query s t r u c t u r e , or f o r a d i f f e r e n t u s e r d e f i n e d molecular formula. T h i s may be s p e c i f i e d completely or p a r t i a l l y and the number of atoms of any element may be entered e x a c t l y or as a p e r m i s s i b l e range. I f one's purpose i s to determine only the presence or absence i n a data base of a s p e c i f i c s t r u c t u r e , t h i s can be accomplished w i t h the search o p t i o n "IDENT". This program hash-encodes the query s t r u c t u r e connection t a b l e and searches through a f i l e of hash-encoded connection t a b l e f o r an exact match. The search, which i s very f a s t by s u b s t r u c t u r e search standards, has been designed s p e c i f i c a l l y f o r those users who, to comply w i t h the Toxic Substances C o n t r o l A c t ( 1 5 ) , have to determine the presence or absence of s p e c i f i c compounds i n Environmental P r o t e c t i o n Agency f i l e s . F i n a l l y , i f one has completed r i n g probe and fragment probe searches f o r a s p e c i f i c query s t r u c t u r e and i s s t i l l confronted w i t h a s i z e a b l e f i l e of compounds that s a t i s f y the c r i t e r i a that were nominated, a s u b s t r u c t u r e search through t h i s f i l e may be c a r r i e d out. T h i s i n v o l v e s an atom-by-atom, bond-by-bond compari s o n of every s t r u c t u r e . The s u b s t r u c t u r e search system i s c u r r e n t l y o p e r a t i n g on 19 f i l e s which are g i v e n i n Table 1. The whole system i s a v a i l a b l e f o r g e n e r a l use on the Tymshare computer network. A s u b s c r i p t i o n fee of $150 per year must be p a i d f o r use of the system and the o n l y other charges are the connect-time charge and the searching c o s t s which range upwards from $3.00-5.00 f o r an i d e n t i t y search. g. Mass Spectrometry L i t e r a t u r e Search System. The accumulated f i l e s of the Mass Spectrometry B u l l e t i n , a s e r i a l p u b l i c a t i o n of the Mass Spectrometry Data Centre, Aldermaston, England, have been made the b a s i s of an o n - l i n e search system. The B u l l e t i n , which s i n c e 1967, has c o l l e c t e d about 60,000 c i t a t i o n s to papers on mass spectrometry, may be searched i n t e r a c t i v e l y f o r a l l papers by given authors, a l l papers d e a l i n g w i t h one or more s p e c i f i c s u b j e c t s or w i t h one or more p a r t i c u l a r elements. In a d d i t i o n , c i t a t i o n s d e a l i n g w i t h general index terms may a l s o be r e t r i e v e d . Simple Boolean l o g i c i s a v a i l a b l e , and thus searches may be conducted f o r papers by Smith and Jones, or Smith but not Jones, and so on. C i t a t i o n s r e t r i e v e d may be l i m i t e d to s p e c i f i c p u b l i c a t i o n y e a r s , between 1967 and the

3.

MILNE AND HELLER

NIH-EPA

Chemical

Information

System

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch003

TABLE 1. COMPONENT FILES OF THE NIH-EPA SUBSTRUCTURE SEARCH SYSTEM. Cambridge (Xray) Crystal File. CPSC Chemicals in Consumer Products. EPA AEROS SOTDAT File. EPA Las Vegas Chemical Spill File. EPA Storage and Retrieval of Air Data. EPA Pesticide Standards. EPA STORET Water Data Base. EPA-FDA Pesticide Repository Standards. EPA Inactive Ingradients in Pesticides. EPA Oil and Hazardous Materials File. EPA Pollutants in Drinking Water. EPA Pesticides File. EROICA Thermodynamics Data File. Merck Index. NBS Gas Phase Proton Affinities. NBS Heats of Formation of Gaseous Ions. NBS Single Crystal File. NCI-SRI Industrial Chemicals File. NCI PHS-149 File of Carcinogens. NIMH File of Psychotropic Drugs. NIH-EPA Carbon-13 Nuclear Magnetic Resonance Search System. NIH-EPA Mass Spectral Search System. WHO International Non-proprietary Name File of Drugs.

39

40

COMPUTER-ASSISTED STRUCTURE

ELUCIDATION

present. The i n t e r a c t i v e nature of the search p r o v i d e s great c o n t r o l to the user. One can l e a r n w i t h i n a few minutes that w h i l e there are i n the B u l l e t i n , 463 papers d e a l i n g w i t h mass measurement f o r example, and 678 on chemical i o n i z a t i o n , only 8 r e p o r t on mass measurement i n chemical i o n i z a t i o n mass s p e c t r a . S i m i l a r l y , one can r a p i d l y d i s c o v e r that although there are 532 papers d e a l i n g w i t h carbon d i o x i d e , only 1 of these was presented at the 1975 NATO meeting i n B i a r r i t z . No numerical codes are used by the system. A search f o r a s p e c i f i c subject can be c a r r i e d out by e n t e r i n g the subject word i t s e l f . I f the word "mass i s entered, searches f o r 7 terms ( a l l those c o n t a i n i n g the fragment "mas", i . e . , mass s p e c t r a , mass d i s c r i m i n a t i o n , mass measurement and so on) are conducted and the user i s asked to s e l e c t the one of i n t e r e s t . In t h i s way, knowledge of the c o r r e c t subject words or of t h e i r c o r r e c t s p e l l i n g i s not necessary. The whole search system has been w r i t t e n f o r use on a PDP-10 computer and i s a component of the MSDC-NIH-EPA Mass S p e c t r a l Search System. As such, i t i s a c c e s s i b l e v i a the ADP-Cyphernetics computer network.

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch003

11

h. X-ray C r y s t a l L i t e r a t u r e R e t r i e v a l System. The data base used i n the X-ray c r y s t a l l o g r a p h i c search system described i n (c) above possesses complete l i t e r a t u r e r e f e r e n c e s to a l l e n t r i e s i n the f i l e ( 8 ) . This i n f o r m a t i o n has been made the b a s i s of a system f o r searching the l i t e r a t u r e p e r t a i n i n g to the X-ray d i f f r a c t i o n study of organic molecules. As i n the Mass Spectrometry B u l l e t i n Search System, i t i s p o s s i b l e to search f o r papers by a s p e c i f i c a u t h o r ( s ) , and papers that appeared i n given years i n given j o u r n a l s may a l s o be r e t r i e v e d . A d d i t i o n a l l y , papers may be l o c a t e d on the b a s i s of s p e c i f i c words appearing i n t h e i r t i t l e s . These words may be truncated by the user and so the fragment "ERO" w i l l r e t r i e v e papers w i t h the word "STEROID" on t h e i r t i t l e s or papers whose t i t l e s use the word "MEROQUININE". The system generates s c r a t c h f i l e s from searches, as i n the s u b s t r u c t u r e search system, and f i l e s can be i n t e r s e c t e d upon request w i t h "AND" or "NOT" operators. Thus one could, f o r example, r e t r i e v e a l l papers p u b l i s h e d i n Acta C r y s t a l l o g r a p h i c a s i n c e 1970 by A t k i n s , e x c l u d i n g s p e c i f i c a l l y those on c o r t i c o s t e r o i d s . Once a paper of i n t e r e s t has been i d e n t i f i e d , a l l the c r y s t a l l o g r a p h i c i n f o r m a t i o n i n that paper can be examined because the c r y s t a l s e r i a l number of the paper can be used i n the c r y s t a l l o g r a p h i c search system to r e t r i e v e that i n f o r m a t i o n . A l t e r n a t i v e l y , the CAS r e g i s t r y number of any p a r t i c u l a r compound can be used to r e t r i e v e any data of i n t e r e s t on that compound from other f i l e s of the CIS. The X-ray l i t e r a t u r e search system i s o p e r a t i n g on the ADPCyphernetics network as a component of the X-ray c r y s t a l l o g r a p h i c search system. Searches are not t r a n s a c t i o n p r i c e d but cost under $2.00 on average.

3.

M I L N E AND HELLER

NIH-EPA

Chemical

Information

System

41

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch003

i. Proton A f f i n i t y R e t r i e v a l Program. With the c u r r e n t h i g h l e v e l of i n t e r e s t i n chemical i o n i z a t i o n mass spectrometry, there i s a need f o r a r e l i a b l e f i l e of gas phase proton a f f i n i t i e s . No data base of t h i s s o r t has p r e v i o u s l y been assembled and f o r these reasons, the task of g a t h e r i n g and e v a l u a t i n g a l l p u b l i s h e d gas phase proton a f f i n i t i e s has been undertaken by Rosenstock and coworkers a t NBS. This f i l e (16), which has about 400 c r i t i c a l l y evaluated gas phase proton a f f i n i t i e s drawn from the open l i t e r a t u r e , can be searched on the b a s i s of compound type or of the proton a f f i n i t y v a l u e . I t w i l l be appended to the MSSS and the b i b l i o g r a p h i c component w i l l be merged w i t h the Mass Spectrometry B u l l e t i n Search System. j. NMR Spectrum A n a l y s i s Program. Many proton nmr s p e c t r a can be s a t i s f a c t o r i l y analyzed by hand, and such f i r s t order a n a l y s i s i s , i n these cases, a q u i t e s a t i s f a c t o r y way of a s s i g n i n g chemical s h i f t s and c o u p l i n g constants to the v a r i o u s n u c l e i i n v o l v e d . In c e r t a i n cases, however, s o - c a l l e d second order e f f e c t s become important and as a r e s u l t , more or fewer s p e c t r a l l i n e s than are i n d i c a t e d by f i r s t order c o n s i d e r a t i o n s w i l l r e s u l t . A way to analyze such s p e c t r a i s to estimate the v a r i o u s c o u p l i n g constants and chemical s h i f t s and then, u s i n g any of a v a r i e t y of standard computer programs (17), c a l c u l a t e the t h e o r e t i c a l spectrum corresponding to these v a l u e s . The c a l c u l a t e d spectrum can be compared to the observed spectrum and a new estimate of the data can be made. In t h i s way, by a s e r i e s of s u c c e s s i v e approximations, the c o r r e c t c o u p l i n g constants and chemical s h i f t s can be determined. The CIS component GINA ( G r a p h i c a l I n t e r a c t i v e NMR A n a l y s i s ) which i s based upon the programs developed by Johannesen et a l . ( 1 8 ) , permits these operations i n r e a l time i n an i n t e r a c t i v e f a s h i o n . The program i s designed f o r use w i t h a v e c t o r cathode ray tube t e r m i n a l upon which each new t h e o r e t i c a l spectrum can be d i s p l a y e d f o r comparison by the user w i t h the observed spectrum. The program has been a v a i l a b l e at NIH f o r over four years (19) and i s c u r r e n t l y being exported to a computer network i n the p r i v a t e s e c t o r . The cost of u s i n g the program i s not yet w e l l e s t a b l i s h e d because i t i s subject t o wide v a r i a t i o n s . k. Mathematical M o d e l l i n g System(MLAB). MLAB i s a program set, developed by Knott a t NIH (20) which can a s s i m i l a t e a f i l e of experimental data, such as a t i t r a t i o n curve and perform on i t any of a wide v a r i e t y of mathematical o p e r a t i o n s . Included amongst these are d i f f e r e n t i a l and i n t e g r a l c a l c u l u s , s t a t i s t i c a l a n a l y s i s (mean and standard d e v i a t i o n , curve and d i s t r i b u t i o n f i t t i n g and l i n e a r and n o n - l i n e a r r e g r e s s i o n a n a l y s i s ) . Output data can be presented i n any form, but the PDP-10-resident program i s e s p e c i a l l y powerful i n the area of g r a p h i c a l output.

42

COMPUTER-ASSISTED STRUCTURE

ELUCIDATION

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch003

Data can be d i s p l a y e d i n the form of two- or three-dimensional p l o t s which can be viewed and modified on a CRT t e r m i n a l p r i o r to pen-and-ink p l o t t i n g . This program set i s now a v a i l a b l e f o r general use on the ADP-Cyphernetics network. The cost of using MLAB i s based upon the computer resource u n i t s and depends, t h e r e f o r e , upon the type of work t h a t i s being done. 1. I s o t o p i c L a b e l I n c o r p o r a t i o n Determination (LABDET). Radioisotopes are p a r t i c u l a r l y w e l l - s u i t e d to l a b e l l i n g s t u d i e s because they can be very e a s i l y detected at very low l e v e l s . In recent years, however, there has been i n c r e a s i n g concern about the shortcomings of r a d i o i s o t o p e s i n medical research. Current standards, i n f a c t , take the p o s i t i o n that the use of r a d i o i s o t o p e s such as carbon-14 i n c h i l d r e n and women of c h i l d - b e a r i n g age i s precluded. Consequently, i t i s not p o s s i b l e to study the metabolism of drugs i n such p a t i e n t s using r a d i o i s o t o p e s , and t h i s leads to some d i f f i c u l t y because i t i s only i n such p a t i e n t groups that the metabolism of drugs, such as o r a l c o n t r a c e p t i v e s , i s of relevance. Much e f f o r t has gone i n t o s t u d i e s of the p o s s i b i l i t i e s of carbon-13 as a surrogate f o r carbon-14, and t h i s type of work a p p l i e s a l s o to problems i n v o l v i n g oxygen and n i t r o g e n which have s t a b l e i s o t o p e s , but no convenient r a d i o i s o t o p e s . Mass spectrome t r y i s the best general method f o r d e t e c t i o n and q u a n t i t a t i o n of s t a b l e i s o t o p e s i n molecules, but there a s e r i o u s problem i n v o l v e d i n i t s a p p l i c a t i o n i s that s t a b l e i s o t o p e s such as carbon-13, nitrogen-15 and oxygen-18 occur n a t u r a l l y as minor components of n a t u r a l elements. This i s most pronounced i n the case of carbon. N a t u r a l l y - o c c u r r i n g carbon i s about 99% C-12 but a s m a l l v a r i a b l e amount of a l l n a t u r a l carbon i s C-13. This creates a "background a g a i n s t which determinations of i s o t o p e l e v e l s i n l a b e l l e d compounds must be measured. The purpose of LABDET(21) i s to compare the mass spectrum of an u n l a b e l l e d compound w i t h t h a t of the same compound i s o t o p i c a l l y enriched. This i s u s u a l l y done u s i n g the molecular i o n r e g i o n of the s p e c t r a . The program c a l c u l a t e s an estimate of the l e v e l of i n c o r p o r a t i o n of i s o t o p e and then c a l c u l a t e s a t h e o r e t i c a l spectrum which can be compared w i t h the a c t u a l spectrum. The t h e o r e t i c a l spectrum i s then adjusted and a f u r t h e r comparison i s made, and i n t h i s way, the program proceeds through a predetermined number of i t e r a t i o n s , f i n a l l y c a l c u l a t i n g the c o r r e l a t i o n c o e f f i c i e n t between the observed spectrum and the best t h e o r e t i c a l spectrum. T h i s c a l c u l a t i o n i s not d i f f i c u l t so much as tedious and i f one must c a r r y i t out many times per day, use of the computer i s i n d i c a t e d . LABDET i s an o p t i o n w i t h i n MSSS on the ADP-Cyphern e t i c s network and i t s use c o s t s $2.00. 11

m. Conformational A n a l y s i s of Molecules i n S o l u t i o n . A problem of long standing i n chemistry has been to estimate the

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch003

3.

M I L N E AND HELLER

NIH-EPA

Chemical

Information

System

43

r e l a t i o n s h i p between the conformation of a molecule i n the c r y s t a l , as measured by X-ray methods, w i t h t h a t i n s o l u t i o n where b a r r i e r s to r o t a t i o n are g r e a t l y reduced. A s o p h i s t i c a t e d program set f o r Conformation A n a l y s i s of Molecules i n S o l u t i o n by E m p i r i c a l and Quantum-mechanical methods (CAMSEQ) has been developed f o r t h i s purpose by Hopfinger and coworkers (22) at Case Western Reserve U n i v e r s i t y . T h i s program can run i n batch or i n t e r a c t i v e l y . As i n p u t d a t a , i t r e q u i r e s the s t r u c t u r e of the compound and t h i s can be provided as a s e t of c o o r d i n a t e data from X-ray measurements, i t can be entered i n t e r a c t i v e l y i n the form of a connection t a b l e or the program can simply be provided w i t h a CAS r e g i s t r y number, and i f the corresponding connection t a b l e i s i n the f i l e s of the CIS, i t w i l l use t h a t . The f i r s t t a s k i s to generate the coordinate data corresponding to a p a r t i c u l a r compound. Then the f r e e energy of t h i s conformation i n s o l u t i o n i s c a l c u l a t e d , Next the program begins to change t o r s i o n angles s p e c i f i e d by the user i n the conformat i o n and w i t h each new conformation, a s t a t i s t i c a l thermodynamic p r o b a b i l i t y i s c a l c u l a t e d , based upon p o t e n t i a l ( s t e r i c , e l e c t r o s t a t i c , and t o r s i o n a l ) f u n c t i o n s and terms f o r the f r e e energy a s s o c i a t e d w i t h hydrogen-bonding, m o l e c u l e - s o l v e n t , and moleculed i p o l e i n t e r a c t i o n s . This program, i n i t s i n t e r a c t i v e v e r s i o n , can be run i n under 40K words of core, and work i s i n progress to export i t t o a commercial networked computer. Conclusions One of the f i r s t goals of the CIS was to produce a s e r i e s of searchable chemical data bases f o r use by working a n a l y t i c a l chemists w i t h no e s p e c i a l computer e x p e r t i s e . A second aim was to l i n k these data bases together so that the user need not be r e s t r i c t e d to a c o n s i d e r a t i o n o f , f o r example, only mass s p e c t r a l data. The v a r i o u s problems inherent i n these plans i n c l u d e d a c q u i s i t i o n of data bases, design of programs, d i s s e m i n a t i o n of the r e s u l t i n g system and l i n k i n g , v i a CAS r e g i s t r a t i o n numbers, of the v a r i o u s CIS components. These problems, as has been d e s c r i b e d above, have been s o l v e d c o n c e p t u a l l y and, to a l a r g e e x t e n t , p r a c t i c a l l y , and the CIS, as i t now stands, i s the r e s u l t . I t i s now p o s s i b l e , t h e r e f o r e , to review the system i n an e f f o r t to d e f i n e f u t u r e g o a l s , and a number of these seem f a i r l y c l e a r . Searches through more than one data base i n combination would be v e r y d e s i r a b l e . For example, one o f t e n possesses mass s p e c t r a l and nmr data f o r an unknown, and i t would very u s e f u l to be a b l e to i d e n t i f y any compounds t h a t match these data i n a s i n g l e search. In another development, i t i s expected t h a t the CONGEN program developed f o r the DENDRAL p r o j e c t (23) w i l l be merged i n t o CIS d u r i n g the coming year. This program, which generates s t r u c t u r e s corresponding to a s p e c i f i c e m p i r i c a l

COMPUTER-ASSISTED S T R U C T U R E E L U C I D A T I O N

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch003

44

formula, could be extremely u s e f u l i n a s t r a t e g y f o r s t r u c t u r e s o l v i n g u s i n g the CIS. I t i s not at a l l d i f f i c u l t to envisage s i t u a t i o n s i n which a reduced set of s t r u c t u r e s could be produced by CONGEN f o r c o n s i d e r a t i o n . Each s t r u c t u r e i n t u r n could be used as an i n p u t i n the s u b s t r u c t u r e search system, and the v a r i o u s compounds whose r e g i s t r y numbers are so r e t r i e v e d could be considered to be p o s s i b l e answers to the problem. Confirmation f o r any of them could then be sought i n the s p e c t r a l data bases, the r e g i s t r y number being a l l that i s necessary to l o c a t e and r e t r i e v e data. One can even speculate f u r t h e r to the day when s y n t h e t i c pathways to any l i k e l y candidates could be designed by the computer system which could e a s i l y add the very p r a c t i c a l touch of checking t h a t any s t a r t i n g m a t e r i a l s f o r such syntheses are commercially a v a i l a b l e at an a p p r o p r i a t e l y low c o s t ! In a d i f f e r e n t approach, the power of p a t t e r n r e c o g n i t i o n techniques could be assessed w i t h i n some of the very l a r g e f i l e s contained i n the CIS. This i s a very u s e f u l e x e r c i s e because there i s l i t t l e reported work of t h i s s o r t on l a r g e f i l e s and thus we have begun to explore the value of such methods i n h a n d l i n g the problem of i d e n t i f i c a t i o n of true unknowns such as water p o l l u t a n t s . Programs designed to t e s t mass s p e c t r a f o r the presence of the compound of oxygen or n i t r o g e n are c u r r e n t l y being t e s t e d (24) and t h e i r u t i l i t y as p r e f i l t e r s on mass s p e c t r a l data p r i o r to data base searching w i l l be t e s t e d as soon as feasible. In summary, i t i s f e l t t h a t progress to date w i t h the CIS has demonstrated economic f e a s i b i l i t y i n that a number of r e l a t i v e l y s t a b l e CIS components have now been i n the p r i v a t e s e c t o r f o r some time. The t e s t before us i s whether we can c a p i t a l i z e on t h i s to explore the new and e x c i t i n g p o s s i b i l i t i e s t h a t l i e ahead i n the area of s t r u c t u r e determination by computer.

Literature Cited 1. Heller, S. R., Milne, G. W. Α., and Feldmann, R. J. Science, (1977) 195, 253. 2. Heller, S. R., Fales, Η. Μ., and Milne, G. W. A. Org. Mass Spectrom., (1973) 7, 107; Heller, S. R., Koniver, D. Α., Fales, Η. Μ., and Milne, G. W. A. Anal. Chem. (1974) 46, 947; Heller, S. R., Feldmann, R. J., Fales, Η. Μ., and Milne, G. W. A. J. Chem. Soc., (1973) 13, 130; Heller, R. S., Milne, G. W. Α., and Heller, S. R. J. Chem. Inf. Comp. Sci. (1976) 16, 176. 3. This data base (ref. 8) is leased by NIH on behalf of the entire U.S. scientific community. 4. McCarthy, G. and Johnson, G. G., paper C3 presented as a part of the Proceedings of the American Crystallographic Associa tion meeting, State College, Pa., 1974.

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch003

3.

M I L N E AND HELLER

NIH-EPA

Chemical

Information

System

45

5. Heller, S. R., Milne, G. W. A., and Feldmann, R. J. J. Chem. Inform. Comp. Sci (1976) 16, 232. 6. This is carried out using an unpublished program developed by McLafferty and co-workers at Cornell University. 7. Schwarzenbach, R., Meili, J., Koenitzer, H., and Clerc, J. T. Org. Magn. Res (1976) 8, 11. 8. Kennard, O., Watson, D. G., and Town, W. G. J. Chem. Doc. (1972) 12, 14. 9. These data are available as NBS Tape #9 through the National Technical Information Service, Springfield, VA 22151. 10. Hanawalt, J. D., Rinn, H. W., and Frevel, L. K. Ind. Eng. Chem. (Anal.) (1938) 10, 457. 11. This file is a proprietary product of the Joint Committee on Powder Diffraction Standards, 1601 Park Lane, Swarthmore, PA 19801. 12. Abramson, F. P. Anal. Chem. (1975) 47, 45. 13. Chemical Abstracts Service Standard Distribution Format File, 1976. Chemical Abstracts Service, Columbus, Ohio 43210. 14. Feldmann, R.J.,Milne, G. W. A., Heller, S. R., Fein, A., Miller, J. A., and Koch, B. A. J. Chem. Inf. & Comp. Sci., (1977) in press. 15. Toxic Substances Control Act, Public Law 94-469, enacted October, 1976. 16. Hartmann, K., Lias, S., Ausloos, P.J.,and Rosenstock, H. M. Publication NBSIR 76-1061, July, 1976. 17. Castellano, S. and Bothner-By, A. A. J. Chem. Phys. (1964), 41, 3863; Swalen, J. D. and Reilly, C. A. J. Chem. Phys. (1965) 42, 440. 18. Johannesen, R. B., Ferretti, J. A., and Harris, R. K. J. Magn. Res. (1970) 3, 84. 19. Heller, S. R. and Jacobson, A. E. Anal. Chem. (1972) 44, 2219. 20. Knott, G. D., and Shrager, R. I. Assn. Comp. Machin. SIGGPAPH Notes 6 (1972) 138. 21. Hammer, C. F. Department of Chemistry, Georgetown University, Washington, DC, unpublished work. 22. Weintraub, H. J. R. and Hopfinger, A. J. Intnl. J. Quant. Chem. (1975) 9, 203. 23. Carhart, R. E., Smith, D. H., Brown, H. and Djerassi, C. J. Amer. Chem. Soc. (1975) 97, 5755. 24. Meisel, W., Jolley, M. and Heller, S. R., in preparation.

4 An Information Theoretical Approach to the Determination of the Secondary Structure of Globular Proteins JAMES A. DE HASETH and THOMAS L. ISENHOUR

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch004

Department of Chemistry 045A, University of North Carolina, Chapel Hill, NC 27514

Thedeterminationof three-dimensional structures of proteins directly from their amino acid sequences is a major problem in molecular biology. One facet of the overall three-dimensional structure, secondary structure, is in part described by alpha helical and beta sheet conformations. These conformations are difficult to determine in solution and even analysis by X-ray diffraction techniques is a long and complex task given that the appropriate crystals can be grown. For these reasons the development of a reliable method to aid in the assignment of secondary structure to proteins directly from the amino acid sequences would be useful. Several secondary structure prediction methods have been formulated (1-27) and schemes employing several of these methods (28-30) have been developed; however, none of these systems is capable of a high degree of correlation between observation and prediction. Information theory has not previously been well exploited in the assignment of protein secondary structure. The method described here makes use of third-order peptide approximations to the structure of the proteins. Once these third-order approximations have been formulated they are used to decode the amino acid sequence of a protein and assign secondary structure to the decoded fragments. In this manner alpha helices and beta sheets may be predicted within an entire protein sequence. Theory Information theory may be easily described in the context of the written English language. The illustration used here follows principles stated by Shannon and Weaver (31_). The first supposition of the illustration is that there are only twentyseven symbols in the written English language, specifically, twenty-six alphabetic characters and a blank. A zero-order approximation to the language is defined as having all symbols independent and equiprobable. An example of a message constructed under these criteria is as follows: 46

4.

D E H A S E T H AND ISENHOUR

Structure

of Globular

Proteins

47

PROJDQJPDNZZBUE CUXYRQVBENIJEWHDGEAIWBREGYDNQEKUZUWEKXTRJRQYWUZZBTWIJG WCUBONAOCM.

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch004

As can be seen i n t h i s r a t h e r small sample the twenty-seven symbols are f a i r l y evenly d i s t r i b u t e d , as would be expected. Each symbol has a 1/27 chance o f being s e l e c t e d and the informat i o n content of t h i s c o l l e c t i o n o f symbols, o r message, i s very low. The most the message can hope to convey i s a l i s t o f a l l p o s s i b l e symbols. A f u r t h e r c r i t e r i o n may be imposed on the approximation: that the symbols are not equiprobable but have p r o b a b i l i t i e s based on t h e i r frequencies o f occurrence i n the w r i t t e n E n g l i s h language. These new c r i t e r i a define a f i r s t - o r d e r approximation. An example i s as f o l l o w s : FSGEHTETMO HONT I SDHEAPC MBDNNIOL AGWEONEEYSEN ODLU OEY S. This message more c l o s e l y approximates the w r i t t e n E n g l i s h language than the z e r o - o r d e r approximation. The symbols E , 0 , T and N occur r a t h e r f r e q u e n t l y ; others such as P, C and W occur only once, whereas X, Z, J , K, Q and V are not represented at a l l in t h i s short message. These data are more c o n s i s t e n t with E n g l i s h and the message r e f l e c t s these f a c t s . Second-order approximations are constructed by s e l e c t i n g the f i r s t symbol using the f i r s t - o r d e r c r i t e r i a , the succeeding symbol i s then chosen by i t s frequency o f occurrence when i t i s preceded by the f i r s t symbol. The t h i r d i s chosen by i t s dependency upon the second symbol. This process i s repeated u n t i l the message i s terminated. The symbols are no longer independent but are biased by the preceding c h a r a c t e r . A second-order approximat i o n message has been c o n s t r u c t e d : INOMETHE COALT 0R0SIE 0 MY ARONANIF IVIVIF ENTHICHATOFATR. As can be seen two and three l e t t e r words and word fragments are formed and the message f o r the f i r s t time begins to have an E n g l i s h appearance. When the frequencies o f occurrence o f symbols are biased by the two immediately preceding symbols, a t h i r d - o r d e r approximation r e s u l t s . An example o f a t h i r d - o r d e r approximation message is: CIENCEMAT WE LINEREMPHYSION ATICESESITY PHAW. The word and word fragments are now l a r g e r than before; f o r example, CIENCE as i n s c i e n c e ; LINE; EMPH as i n emphasis; PHYS as in p h y s i c a l ; e t c . These processes can be continued u n t i l n t h order approximations are formed; however, i t i s more i n s t r u c t i v e at t h i s p oint to jump t o approximations t h a t use l a r g e r d i s c r e t e u n i t s than the twenty-seven symbols.

American Chemical Society Library 1155 16th St. N. W. Washington, D. C. 20036

48

COMPUTER-ASSISTED S T R U C T U R E E L U C I D A T I O N

A f i r s t - o r d e r word approximation message i s constructed by randomly s e l e c t i n g words dependent upon t h e i r frequencies of occurrence i n E n g l i s h . Here the words are independent but not equiprobable, for example: BEFORE THAT ISSUES POTENTIAL CREATION WHETHER THE THROUGHOUT IN OF TO OF FOLLOWING HIMSELF WHEREAS FOR OCCUPATIONS.

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch004

Two and three word phrases have been formed and although the message obviously has an English o r i g i n very l i t t l e information is conveyed. In a manner s i m i l a r to the second-order approximat i o n a second-order word approximation may be formed. As can be seen i n the f o l l o w i n g message multiword phrases r e s u l t . THE DOOR AND THEN HE THOUGHT OF THE MAN WILL PROTEST. As the order o f the word approximations increases sentences are formed and messages are conveyed. Protein sequences are analogous to the w r i t t e n E n g l i s h language to the extent that s p e c i f i c amino a c i d sequences form proteins that possess c e r t a i n b i o l o g i c a l a c t i v i t y . In t h i s sense a p r o t e i n represents a "message". The "symbols" o f the protein "message" are atoms o f carbon, hydrogen, oxygen, n i t r o gen and s u l f u r , as well as s i n g l e and double covalent bonds, and hydrogen bonds. These "symbols" form d i s c r e t e units or amino acids under the c o n s t r a i n t s of p r o t e i n chemistry. The amino acids are d i s c r e t e units as are words i n a w r i t t e n language; but the number o f these d i s c r e t e units (peptides) commonly found i n g l o b u l a r proteins t o t a l s only twenty. As was i l l u s t r a t e d above, a good representation o f w r i t t e n E n g l i s h language was made with a second order word approximation. Because proteins have an extremely small "vocabulary" (twenty amino acids) a low order peptide approximation may represent p r o t e i n messages to a high degree o f accuracy. The necessary c r i t e r i a f o r the formation of alpha h e l i c e s and beta sheets i n p r o t e i n sequences may be d e t e r mined using n t h - o r d e r peptide approximations. The n t h - o r d e r peptide approximations may i n turn be used to p r e d i c t secondary s t r u c t u r e i n unknown sequences. Experimental Two data bases have been used for t h i s study. The f i r s t i s the Protein Sequence Data Tape 76 c o n s i s t i n g o f 767 p r o t e i n sequences prepared by M. 0. Dayhoff at the National Biomedical Research Foundation (32). The second data base i s a subset o f the AMSOM (Atlas o f Macromolecular S t r u c t u r e on M i c r o f i c h e ) data base compiled at the National I n s t i t u t e s o f Health by R. J . Feldmann ( 3 3 ) . * The AMSOM subset contains f i f t y p r o t e i n AM5UM is a v a i l a b l e from: Tracor J i f c o , I n c . , AMSOM program, 1776 E . J e f f e r s o n S t . , R o c k v i l l e , MD 20852, USA. X

4.

DEHASETH AND ISENHOUR

Structure

of Globular

Proteins

49

sequences f o r which d e t a i l e d secondary s t r u c t u r e and p r o t e i n sequence are known. A l l computer programs were w r i t t e n i n FORTRAN and a l l computations were c a r r i e d out at the U n i v e r s i t y o f North C a r o l i n a Computation Center at Chapel H i l l on the IBM Model 360 S e r i e s 75 and IBM Model 370 S e r i e s 155 computers.

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch004

Procedure The basis o f the information t h e o r e t i c a l approach i s t h a t an n t h - o r d e r peptide approximation can be made to represent the p r o t e i n secondary s t r u c t u r e . To t e s t the v a l i d i t y o f t h i s premi s e frequencies o f occurrence of various n t h - o r d e r peptides were determined in the Protein Sequence Data Tape 76. The data base contains 76,094 c l a s s i f i a b l e p e p t i d e s . There i s a t o t a l o f 77,267 residues i n the data base; however, not a l l of these were useable i n t h i s study. Some o f the peptides are designated as "unknown" or a t y p i c a l (and are coded as UNK); others could not be d i s t i n g u i s h e d between a s p a r t i c a c i d and asparagine (coded as ASX); and s t i l l others could not be d i s t i n g u i s h e d between glutamic a c i d , glutamine and p y r r o l i d i n e - c a r b o x y l i c a c i d (coded as GLX). These three residue t y p e s , UNK, ASX and GLX are cons i d e r e d u n c l a s s i f i a b l e in t h i s study. As could be expected a l l p o s s i b l e twenty monopeptides are found to occur at l e a s t once in the data base; a l l 400 dipeptides also o c c u r . (The order o f a polypeptide i s defined by s t a r t i n g at the amino end and proceeding towards the carboxy end.) There are 8,000 p o s s i b l e t r i peptides and 89.60% are found at l e a s t one time i n the e n t i r e data base. I t i s b e l i e v e d that t h i s approaches a s t a t i s t i c a l l y v a l i d r e s u l t as there are only 8,000 p o s s i b l e t r i p e p t i d e combinations i n a data base o f 72,363 c l a s s i f i a b l e t r i p e p t i d e s , ( i . e . , they do not contain UNK, ASX o r GLX). Therefore i t would appear that approximately 10% of the p o s s i b l e t r i p e p t i d e s are r a r e l y o c c u r i n g combinations i n p r o t e i n b i o c h e m i s t r y . Tetrapeptides were also examined. From the 70,865 c l a s s i f i a b l e t e t r a p e p t i d e s i n the data base, only 21.55% o f the p o s s i b l e 160,000 t e t r a p e p t i d e s occur at l e a s t one time. This i s not a s t a t i s t i c a l l y v a l i d r e s u l t as there are more than twice as many p o s s i b l e t e t r a p e p t i d e s as there are a v a i l a b l e i n the data base. For t h i s reason the present work was l i m i t e d to the study o f tripeptides. I t i s true that as the order o f approximation i s i n c r e a s e d , the approximation more c l o s e l y approaches the t o t a l information content o f the message. However, i f higher than t h i r d - o r d e r peptide ( t r i p e p t i d e ) approximations are used, the approximations become s e r i o u s l y biased by the data base s i z e . Further i n v e s t i g a t i o n o f the Protein Sequence Data Tape 76 y i e l d e d that the most commonly o c c u r i n g t r i p e p t i d e i s p r o l i n e proline-glycine. This residue occurs almost e x c l u s i v e l y i n c o l l a g e n and k e r a t i n , both fibrous p r o t e i n s . As the study i n v o l v e d only g l o b u l a r p r o t e i n s the e n t i r e data base could not

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch004

50

COMPUTER-ASSISTED S T R U C T U R E

ELUCIDATION

be adequately used and a g l o b u l a r subset of the Protein Sequence Data Tape 76 was formed by removing a l l the fibrous p r o t e i n s . This subset data base contains 65,783 c l a s s i f i a b l e peptides and 62,511 c l a s s i f i a b l e t r i p e p t i d e s . This i s s t i l l a s t a t i s t i c a l l y v a l i d data c o l l e c t i o n f o r the 8,000 p o s s i b l e t r i p e p t i d e combinations. S t a t i s t i c s on the g l o b u l a r subset i n d i c a t e that 87.85% o f a l l p o s s i b l e t r i p e p t i d e s are present. Unfortunately, not a l l the conformations are known f o r a l l the p r o t e i n sequences i n the g l o b u l a r subset data base; t h e r e f o r e , t h i s data base could only be used f o r determining the s t a t i s t i c a l make-up o f t r i p e p t i d e s i n globular proteins. The AMSOM data base subset contains complete secondary s t r u c t u r e and amino a c i d sequence f o r f i f t y p r o t e i n s . There i s a t o t a l o f 9,702 c l a s s i f i a b l e peptides i n the subset and 9,364 classifiable tripeptides. This data subset i s not s u f f i c i e n t l y large to y i e l d a s t a t i s t i c a l l y v a l i d r e s u l t f o r t r i p e p t i d e occurrence; however, from data on the most frequently occuring t r i p e p t i d e s i t was a s c e r t a i n e d that the AMSOM subset i s an adequate representation o f the Protein Sequence Data Tape 76 g l o b u l a r subset. Once t h i s had been demonstrated a t h i r d - o r d e r peptide approximation t a b l e o f the AMSOM subset was c o n s t r u c t e d . A s i n g l e t a b l e to represent a l l the t r i p e p t i d e frequencies would i n v o l v e 8,000 x 27 (or 216,000) e n t r i e s . There are 8,000 p o s s i b l e t r i p e p t i d e s - each o f which may e x h i b i t twenty-seven d i f f e r e n t conformational s t a t e s . As t r i p e p t i d e s are s e l e c t e d from the p r o t e i n sequences each i n d i v i d u a l peptide can e x h i b i t e i t h e r a , 3 or random c o i l (R) conformation. A l i s t of a l l twenty-seven p o s s i b l e t r i p e p t i d e conformations may be found i n Table I. Table I. P o s s i b l e T r i p e p t i d e Conformations RRR

RRa

RR3

RaR

Raa

Ra3

R3R

R3a

R33

aRR

aRa

aR3

aaR

aaa

aa3

a3R

a3a

a33

3RR

3Ra

3R3

3aR

3aa

3a3

33R

33a

333

I t was hoped that the s i z e o f the t a b l e could be e f f e c t i v e l y reduced as computer storage of such a l a r g e t a b l e i s p r o h i b i t i v e even on a large machine. As an alpha h e l i x cannot be defined f o r any fewer than four r e s i d u e s , the four t r i p e p t i d e conformations that exclude any more than one alpha h e l i c a l peptide are f o r bidden conformations. These four conformations are RaR, R a 3 , 3aR and 3 a 3 . Although a s i n g l e beta sheet peptide i s defined i t was found that only the R3R t r i p e p t i d e conformation o c c u r s ; whereas the other t h r e e , R 3 a , a3R and a 3 a , do not. A l l remaining twenty t r i p e p t i d e conformations occur at l e a s t once i n the data base. The t a b l e had thus been reduced to 8,000 x 20 (or 160,000) entries.

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch004

4.

DEHASETH A N D ISENHOUR

Structure

of Globular

Proteins

51

Rather than use the t r i p e p t i d e frequency t a b l e i n i t s p r e s ent form i t was decided to reduce i t to 8,000 x 3 (or 24,000) entries. This was accomplished by determining the o v e r a l l alpha h e l i x , beta sheet and random c o i l tendencies o f every t r i p e p t i d e . For example, the various conformations of the t r i p e p t i d e a l a n i n e a s p a r t i c a c i d - a l a n i n e are shown i n Figure 1(a). The t r i p e p t i d e occurs e i g h t times i n the e n t i r e AMSOM subset: three times as RRR (or 37.5% o f the t i m e ) , twice as a a a (25.0%), twice as 333 (25.0%), and once as 3RR (12.5%). As t h i s t r i p e p t i d e has a s i n g l e alpha conformation, a a a , the t o t a l alpha c o n t r i b u t i o n of the t r i p e p t i d e i s 0.250. As a l a n i n e - a s p a r t i c a c i d - a l a n i n e has two beta conformations, 333 and 3RR, i t s t o t a l beta c o n t r i b u t i o n is 0.250 + 0.125 or 0.375. S i m i l a r l y the t o t a l random c o i l cont r i b u t i o n i s 0.375 + 0.125 or 0.500. The f i n a l t a b l e entry f o r t h i s t r i p e p t i d e i s shown i n Figure 1(b). In t h i s way the t r i peptide conformations are biased towards the most l i k e l y conforma t i o n o f the t r i p e p t i d e . An i l l u s t r a t i o n o f how t h i s information i s used to determine the s t r u c t u r e of a s i n g l e amino a c i d i n a p r o t e i n i s shown i n Figure 2. The pentapeptide g l u t a m i n e - a l a n i n e - a s p a r t i c a c i d a l a n i n e - a s p a r a g i n e was randomly e x t r a c t e d from a p r o t e i n sequence. There are three t r i p e p t i d e s in the pentapeptide that contain a s p a r t i c a c i d and these appear in the f i r s t column of Figure 2. The method o f p r e d i c t i o n i s to sum the a l p h a , beta and random c o i l c o n t r i b u t i o n s and use the maximum as an i n d i c a t o r f o r the conforma t i o n of the s i n g l e peptide under i n v e s t i g a t i o n . As can be seen from Figure 2 the maximum i s 1.700 f o r random c o i l conformation. I f the t r i p e p t i d e s are i n d i v i d u a l l y examined one o f the three c l e a r l y favors random c o i l , one favors alpha h e l i x , and the t h i r d cannot d i s t i n g u i s h between beta sheet and random c o i l . By t h i s method a s p a r t i c a c i d would always be p r e d i c t e d random c o i l i n t h i s p a r t i c u l a r pentapeptide, but should any o f the other four peptides be d i f f e r e n t , the p r e d i c t i o n may a l s o change. Reduction o f the t r i p e p t i d e frequency t a b l e from twenty to three conformations must introduce some e r r o r . I t has been found that approximately 78% o f a l l the c l a s s i f i a b l e t r i p e p t i d e s i n the AMSOM subset are e i t h e r a a a , 333, or RRR. Thus f o r t h i s 78% o f a l l conformations the reduction to a l p h a , beta or random c o i l c o n t r i b u t i o n s makes no d i f f e r e n c e i n t h i s assignment. An a d d i t i o n a l 21% of the t r i p e p t i d e s have two o f the peptides the same and one d i f f e r e n t . (A complete l i s t i s given i n Figure 3.) T h e r e f o r e , the assignment of o v e r a l l conformation w i t h i n t h i s 21% to a s i n g l e peptide i n a given t r i p e p t i d e i s c o r r e c t f o r an a d d i t i o n a l 14% of a l l conformations (2/3 x 21% = 14%). Consequently the o v e r a l l conformation approximation i s c o r r e c t f o r 92% (78% + 14%) o f a l l conformations i n the data base subset. Results and Discussion The

above p r e d i c t i o n scheme was performed on a l l

fifty

COMPUTER-ASSISTED S T R U C T U R E

52

(

ADA

a)

RRR

ctotct

g0J3_

gRR

0.375

0.250

0.250

0.125

(b)

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch004

ELUCIDATION

ADA

o

£

R

0.250

0.375

0.500 Journal of Biological Chemistry

Figure 1. Reduction of the table entry of alanine-aspartic acid-alanine from 20 potential conformations to three. Four of the potential 20 entries occur (a) and are reduced to three by summing all conformations that contain a-helices, all that contain ^-sheets, and all that contain random coils to give total a-tendency, total ^-tendency, and total random coil tendency (b). The codes A , D, A are the IUPAC-IUB single letter codes for the amino acids (34).

...

QADAN

... R

a

3_

QAD

0.0

1.000

1.000

ADA

0.250

0.375

0.500

DAN

0.800

0.0

0.200

1.050

1.375

1.700

Figure 2. The conformation for the amino acid, aspartic acid, is determined in the pentapeptide glutamine-alanine-aspartic acid-alanine—asparagine by summing the a, /?, and random coil tendencies of the three tripeptides that incorporate aspartic acid. In this pentapeptide aspartic acid is assigned a random coil configuration as the sum of the random coil tendencies is the maximum.

DEHASETH A N D ISENHOUR

4.

Structure

of Globular

Proteins

53

proteins t h a t comprised the AMSOM subset. Two c o r r e c t i v e procedures were added to the p r e d i c t i o n a l g o r i t h m . F i r s t , all single peptide alpha h e l i c e s were disallowed and converted to e i t h e r random c o i l , or beta sheet only i f preceded and succeeded by beta sheets. Second, a l l s i n g l e peptide beta sheets were disallowed except those preceded and succeeded by random c o i l p e p t i d e s . In order to measure how well the p r e d i c t i o n algorithm c o r r e l a t e d with the observed s t r u c t u r e , two parameters were computed f o r each prediction. The f i r s t o f these two parameters, R e c a l l , i s def i n e d as f o l l o w s :

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch004

Recall

=

Total Number o f Conformation P r e d i c t e d C o r r e c t l y Total Number o f Conformation Observed

That i s to s a y , the numerator i s the number o f peptides found to have the c o r r e c t conformation a f t e r p r e d i c t i o n . The denominator is the t o t a l number of peptides of the s p e c i f i c conformation observed in the p r o t e i n and known to be c o r r e c t . Recall does not account f o r o v e r - p r e d i c t i o n s . For example, i f a p r o t e i n i s known to contain only alpha h e l i c e s and random c o i l p r i o r to p r e d i c t i o n , the Recall can be forced to 100% by p r e d i c t i n g the e n t i r e sequence alpha h e l i c a l . The second parameter, P r e c i s i o n , i s a measure o f o v e r - p r e d i c t i o n , and i s defined as: p

. e

.

_ Total Number o f Conformation P r e d i c t e d C o r r e c t l y Total Number P r e d i c t e d to be in Conformation

The numerator i s the same as i n the R e c a l l , but the denominator is e q u i v a l e n t to the t o t a l number that were p r e d i c t e d by the algorithm to be i n the s p e c i f i c conformation. Hence, a P r e c i s i o n of 100% i n d i c a t e s no o v e r - p r e d i c t i o n , and as the o v e r - p r e d i c t i o n increases the P r e c i s i o n decreases. The r e s u l t s o f the c o r r e l a t i o n between p r e d i c t i o n and observation o f alpha h e l i c e s and beta sheets f o r a l l f i f t y proteins i n the AMSOM subset are l i s t e d i n Table I I . Table II shows that the p r e d i c t i v e algorithm i s q u i t e s u c cessful. The Recall f o r alpha and beta conformations i s genera l l y g r e a t e r than 75% and the P r e c i s i o n i s g e n e r a l l y g r e a t e r than 80%. The o v e r a l l r e s u l t s f o r the e n t i r e data base show that the Recall and P r e c i s i o n f o r alpha h e l i c e s are 82.6% and 93.7%, r e s p e c t i v e l y ; whereas, the same values f o r beta sheets are 83.6% and 90.6%, r e s p e c t i v e l y . One recent study (30), shown to be at l e a s t as accurate as those preceding i t , y i e l d e d an average Recall of 68.6% f o r alpha h e l i c e s and an average Recall o f 53.1% f o r beta sheets. P r e c i s i o n , per s e , was not c a l c u l a t e d . The information t h e o r e t i c a l approach y i e l d e d average R e c a l l s o f 79.3% and 83.7% f o r alpha h e l i c e s and beta s h e e t s , r e s p e c t i v e l y . These r e s u l t s would i n d i c a t e t h a t the information t h e o r e t i c a l approach i s more accurate than other techniques; however, these c o r r e l a t i o n s were determined using a small data set of only 9,364 c l a s s i f i a b l e

54

COMPUTER-ASSISTED S T R U C T U R E E L U C I D A T I O N

Table I I .

Recall and P r e c i s i o n a

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch004

P r o t e i n Name Lamprey Cyanmethemoglobin Horse Aquomethemoglobin Dimer Human Deoxyhemoglobin Dimer Marine Worm Carboxyhemoglobin Bovine Ferricytochrome b5 Tuna Ferrocytochrome c Tuna Ferricytochrome c " o u t e r " B a c t e r i a l Ferricytochrome c2 Bonito Ferrocytochrome c Sperm Whale Metmyoglobin B a c t e r i a l Rubredoxin B a c t e r i a l High P o t e n t i a l Iron P r o t e i n B a c t e r i a l Ferredoxin B a c t e r i a l Thioredoxin Marine Worm Hemerythrin Fungal F e r r i c h r y s i n S u b t i l i s i n BPN' Bovine Tosyl Alpha-Chymotrypsin A Bovine Chymotrypsinogen A Bovine Gamma-Chymotrypsin A Bovine T r y p s i n Porcine Tosyl E l a s t a s e B a c t e r i a l Serine Protease b Papain B a c t e r i a l Thermolysin Bovine Carboxypeptidase a Complex Bovine Carboxypeptidase b Bovine T r y p s i n I n h i b i t o r Dogfish A p o - l a c t a t e Dehydrogenase Lobster Glyceraldehyde-3-P Dehydrogenase "Red" Horse A l c o h o l Dehydrogenase Complex B a c t e r i a l O x i d i z e d Flavodoxin Bovine Ribonuclease S Complex B a c t e r i a l Nuclease Complex Mouse Immunoglobin A Fab "MCPC#603" Human Bence-Jones P r o t e i n Dimer "REI" Human Immunoglobin G Fab' "New" Human Bence-Jones Dimer "MCG" Porcine Adenylate Kinase Yeast Phosphoglycerate Kinase Jack Bean Concanavalin A Chicken Lysozyme

3

RePrecic a l l % sion %

RecalU

Precision %

85.5 82.5 92.1 88.8 80.8 88.5 88.5 89.4 89.1 83.5 n/a 55.6 n/a 85.7 77.8 n/a 75.6 87.5 53.9 100.0 72.7 100.0 75.0 83.3 79.0 86.1 84.0 72.3 76.2

100.0 96.9 97.7 96.3 100.0 97.9 97.9 95.5 87.2 98.1 n/a 71.4 n/a 92.3 95.5 n/a 100.0 100.0 87.5 78.8 80.0 84.6 81.8 92.6 98.9 99.0 96.7 88.9 95.2

n/a* n/a n/a n/a 95.8 n/a n/a n/a n/a n/a n/a 82.4 n/a 81.3 n/a n/a 83.8 93.9 93.0 93.0 87.2 86.1 79.2 71.9 84.9 84.4 88.9 100.0 88.6

n/a n/a n/a n/a 82.1 n/a n/a n/a n/a n/a n/a 100.0 n/a 86.7 n/a n/a 72.1 95.9 95.9 95.9 90.1 89.2 83.6 71.9 80.5 74.1 78.4 91.3 87.5

87.1 80.2 82.5 93.8 88.9 n/a n/a n/a n/a 91.5 29.0 n/a 67.6

88.0 93.9 90.4 100.0 78.1 n/a n/a n/a n/a 97.9 71.4 n/a 100.0

82.7 77.7 81.0 93.0 81.3 80.2 86.3 86.0 87.1 95.8 47.3 80.5 58.3

96.5 95.9 94.4 90.9 95.1 97.1 90.0 93.5 97.1 85.2 74.3 95.7 63.6

4.

D E H A S E T H A N D ISENHOUR

Table

II.

Structure

Recall

of Globular

and P r e c i s i o n

55

Proteins

(cont.)

a Precision %

Re call*

Precision %

Chicken T r i o s e Phosphate Isomerase (Monomer 1) Carp Calcium Binding P r o t e i n B Bovine Cu, Zn Superoxide Dismutase Human Carbonic Anhydrase B Human Carbonic Anhydrase C Human Prealbumin Dimer Sea Snake Neurotoxin Sea Snake Neurotoxin

86.3

95.2

76.0

92.7

82.3 n/a 78.7 83.0 87.5 n/a n/a

98.1 n/a 97.4 95.1 53.9 n/a n/a

100.0 74.1 85.1 79.1 81.0 93.8 64.5

81.8 92.3 93.8 93.6 97.1 75.0 90.9

Overall

82.6

93.7

83.6

90.6

P r o t e i n Name

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch004

3

RecalU

Results

*n/a = not a p p l i c a b l e ,

i.e.,

p r o t e i n d i d not e x h i b i t

RRR

conformation.

aR3 < 15

3Ra

78% 333

RRa

R33

aa3

3aa

RR3

aRR

a33

33R

Raa

aRa

3RR

33a

R3R

aaR

3R3

Figure

3.

Distribution

of the 20 occurring tions

- 21%

tripeptide

conforma-

56

COMPUTER-ASSISTED S T R U C T U R E E L U C I D A T I O N

tripeptides. It i s not known whether t h i s data base adequately represents a l l p r o t e i n sequences; hence, p r e d i c t i o n o f alpha h e l i c e s and beta sheets using these data may not y i e l d r e l i a b l e r e s u l t s f o r unknown p r o t e i n s t r u c t u r e s . This uncertainty may be resolved with a larger,more r e p r e s e n t a t i v e data base. Acknowledgement The authors would l i k e to thank R. J . Feldmann of the National I n s t i t u t e s o f Health for supplying the AMSOM data base subset. The support of the National Science Foundation i s g r a t e f u l l y acknowledged.

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch004

Literature Cited 1. Guzzo, A. V., Biophys. J. (1965), 5, 809-822. 2. Prothero, J. W., Biophys. J. (1966), 6, 367-370. 3. Periti, P. F., Quagliarotti, G., and Liquori, A. M., J. Mol. Biol. (1967), 24, 313-322. 4. Cook, D. A., J. Mol. Biol (1967), 29, 167-171. 5. Schiffer, M. and Edmundson, A. B., Biophys J. (1968), 8, 29-39. 6. Dunnill, P., Biophys J. (1968), 8, 865-875. 7. Low, B. W., Lovell, F. M. and Rudko, A. D., Proc. Natl. Acad. Sci. U.S. (1968), 60, 1519-1526. 8. Kotelchuck, D. and Scheraga, H. A., Proc. Natl. Acad. Sci. U.S. (1969), 62, 14-21. 9. Lewis, P. N., Go, N., Go, M., Kotelchuck, D. and Scheraga, H. A., Proc. Natl. Acad. Sci. U.S. (1970), 65, 810-815. 10. Ptitsyn, O. B. and Finkel'shtein, A. V., Dohl. Akad. Nauk. S.S.S.R. (1970), 195, 221-224. 11. Ptitsyn, O. B. and Finkel'shtein, A. V., Biofizika (U.S.S.R.), (1970), 15, 757-767. 12. Leberman, R., J. Mol. Biol. (1971 ), 55, 23-30. 13. Robson, B. and Pain, R. H., J. Mol. Biol. (1971), 58, 237259. 14. Lewis, P. N., Momany, F. A., and Scheraga, H. A., Proc. Natl. Acad. Sci. U.S. (1971), 68, 2293-2297. 15. Wu, T. T. and Kabat, E. A., Proc. Natl. Acad. Sci. U.S. (1971), 68, 1501-1506. 16. Finkel'shtein, A. V. and Ptitsyn, O. B., J. Mol. Biol. (1971), 62, 613-624. 17. Nagano, K., J. Mol. Biol (1973), 75, 401-421. 18. Kabat, E. A. and Wu, T. T., Proc. Natl. Acad. Sci. U.S. (1973), 70, 1473-1477. 19. Kabat, E. A. and Wu, T. T., Proc. Natl. Acad. Sci. U.S. (1974), 71, 4217-4220. 20. Nagano, K., J. Mol. Biol. (1974), 84, 337-372. 21. Burgess, A. W., Ponnuswamy, P. K. and Scheraga, H. A., Isr. J. Chem. (1974), 12, 239-286.

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch004

4.

D E H A S E T H A N D ISENHOUR

Structure

of Globular

Proteins

57

22. Chou, P. Y. and Fasman, G. D., Biochemistry (1974), 13, 211-222. 23. Chou, P. Y. and Fasman, G. D., Biochemistry (1974), 13, 222-245. 24. Lim, V. I., Biofizika (1974), 19, 366-378. 25. Lim, V. I., Biofizika (1974), 19, 562-575. 26. Lim, V. I., J. Mol. Biol. (1974), 88, 857-872. 27. Lim, V. I., J. Mol. Biol. (1974), 88, 873-894. 28. Schulz, G. E., Barry, C. D., Friedman, J., Chou, P. Y., Fasman, G. D., Finkel'shtein, A. V., Lim, V. I., Ptitsyn, O. B., Kabat, E. A., Wu, T. T., Levitt, M., Robson, B. and Nagano, K., Nature (1974), 250, 140-142. 29. Matthews, B. W., Biochim. Biophys. Acta. (1975), 405, 442451. 30. Argos, Patrick, Schwarz, James and Schwarz, John, Biochim. Biophys. Acta. (1976), 439, 261-273. 31. Shannon, C. E. and Weaver, W., "The Mathematical Theory of Communication," pp. 43-44, University of Illinois Press, Urbana, Illinois, 1972. 32. Dayhoff, M. O., "Atlas of Protein Sequence and Structure," vol. 5, National Biomedical Research Foundation, Washington D.C., 1972; vol. 5., suppl I, 1973; vol. 5., supp. 2, 1976. 33 Data base supplied on magnetic tape from Richard J. Feldmann, personal communication. 34. IUPAC-IUB Commission on Biochemical Nomenclature, J. Biol. Chem. (1968), 243, 3557-3559.

5 Computer-Assisted Structure Elucidation Using Automatically Acquired

C

13

NMR

Rules

GRETCHEN M. SCHWENZER and TOM M. MITCHELL

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch005

Department of Computer Science, Stanford University, Stanford, CA 94305

Carbon-13 nuclear magnetic resonance (CMR) has developed i n t o an important tool for the structural chemist. A CMR spectrum exhibits a wide range of shifts which have been shown to have a strong correlation with structure (1,2). A natural abundance CMR spectrum which i s fully proton decoupled c o n s i s t s of a number of sharp peaks which correspond to the resonance frequencies in an applied magnetic field of the various types of carbon atoms present. A C-13 shift is the amount an observed peak is shifted from that of a reference peak, u s u a l l y t e t r a m e t h y l s i l a n e (TMS). Molecular structure elucidation using CMR c o n s i s t s of establishing a set of r u l e s which summarize the CMR behavior for a set of compounds and then using the rules to identify unknown compounds. In the traditional approach to structure elucidation using CMR the chemist forms a set of empirical r u l e s by s o r t i n g through a large amount of data looking for correlations between structural arrangements i n the molecules and the observed C-13 shift. The total shift is then given as a function of these structural parameters. The functional form i s u s u a l l y chosen to be a linear combination of independent parameters. The optimized value of the c o e f f i c i e n t of each s t r u c t u r a l parameter i s obtained by a curve f i t t i n g procedure. This approach leaves the decisions of the s t r u c t u r a l parameter s e l e c t i o n and the s e l e c t i o n of a reasonable f u n c t i o n a l form to the chemist. In both cases the correct decisions are e a s i l y overlooked. The best known example of t h i s approach i s that of Lindeman and Adams Q ) . Although the r u l e form r e s u l t i n g from the parameter approach i s useful for p r e d i c t i n g spectra of a given structure i t cannot be used e f f i c i e n t l y for 58

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch005

5.

S C H W E N Z E R AND

MITCHELL

C

13

NMR

59

Rules

structure elucidation. O b s e r v i n g a s h i f t does not g i v e any information about what substructure i s present without constructing a set of candidate structures, calculating their shifts and then comparing the calculated shifts w i t h the observed spectrum. This procedure was followed i n the work of C a r h a r t and Djerassi which used the parameter s e t o b t a i n e d by Eggert and Djerassi f o r the acyclic a m i n e s (5.). However, this method does not reflect the actual t h i n k i n g p r o c e d u r e a c h e m i s t w o u l d use when i d e n t i f y i n g an unknown and i t suffers from i t s i n a b i l i t y t o be g e n e r a l i z e d . The s e l e c t i o n of e m p i r i c a l r u l e s and t h e algorithms f o r the generation of s u b s t r u c t u r e s are s p e c i f i c f o r each c l a s s o f compounds. To the chemist i t i s more l i k e l y t h a t the appearance of an o b s e r v e d shift or set of s h i f t s would suggest a structural arrangement. A second approach to s t r u c t u r e e l u c i d a t i o n i s to e s t a b l i s h a d a t a b a s e o f C-13 NMR s p e c t r a and t h e n use t h e d a t a b a s e t o do s t r u c t u r e e l u c i d a t i o n by s e a r c h i n g for peak patterns s i m i l a r to those i n t h e unknown spectrurn(6,7,8). A d a t a base which consists of the e n t i r e s p e c t r u m r e s u l t s i n s t o r a g e o f a l a r g e amount o f redundant information. Structure elucidation is accomplished by supplying the chemist with the molecules having p o r t i o n s of their spectra which are s i m i l a r t o t h a t o f t h e unknown. The c h e m i s t then uses these to suggest p a r t i a l s u b s t r u c t u r e s that are present in t h e unknown. T h i s m e t h o d has suffered from the n e c e s s i t y o f o b t a i n i n g l a r g e d a t a b a s e s and methods of a s s e m b l i n g the p a r t i a l s u b s t r u c t u r e s . We o f f e r an a l t e r n a t i v e to these approaches with the following procedure. First rule formation i s a c c o m p l i s h e d by a c o m p u t e r p r o g r a m w h i c h f o r m s a s e t o f e m p i r i c a l r u l e s by a s s o c i a t i n g an o b s e r v e d total shift with a s u b s t r u e t u r a l arrangement i n the molecule. A second program uses these rules for structure elucidation by assembling substruetural fragments s u g g e s t e d by t h e r u l e s . A sample r u l e c o n s t r u c t e d by the program from a s e t o f p a r a f f i n s and a c y c l i c a m i n e s i s CH *-CH -CH -CH 3

2

2

2

--->

14. Oppm

l^ppm.

The a s t e r i s k d e n o t e s t h e atom f o r which the shift is p r e d i c t e d . The p r e d i c t i o n i s g i v e n i n ppm down f i e l d from TMS. The important s u b s t r u c t u r a l arrangements given i n the r u l e are actually constructed by t h e program from a language of features supplied by t h e user. Molecular structure elucidation i s accomplished

60

COMPUTER-ASSISTED S T R U C T U R E E L U C I D A T I O N

by o b s e r v i n g a t o t a l s h i f t and f i n d i n g t h e r u l e s which are possible explanations f o r the shift. The rules selected postulate p a r t i a l substructures w h i c h m i g h t be i n the m o l e c u l e . These s u b s t r u c t u r e s are assembled to construct the f i n a l molecule. A description of the rule formation and structure elucidation programs a p p l i e d t o t h e p a r a f f i n s and a c y c l i c a m i n e s i s g i v e n i n the f o l l o w i n g s e c t i o n s . We b e l i e v e t h e a l g o r i t h m s u s e d are g e n e r a l enough to t r e a t w i d e l y d i f f e r e n t c l a s s e s of compounds. Rules generated for decalins, methyldecalins and h y d r o x y - s t e r o i d s a r e shown i n the third section.

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch005

Empirical

Rule

Formation

Rule G e n e r a t i o n . The r u l e g e n e r a t i o n p r o g r a m (.2.) must be s u p p l i e d a training set of known s t r u c t u r e s with t h e i r assigned spectra. A set of p r i m i t i v e terms which w i l l form the l a n g u a g e o f atom f e a t u r e s used to d e s c r i b e t h e a t o m s and b o n d s ( a t o m t y p e , number o f nonhydrogen neighbors, orientation of substituents, e t c . ) m u s t a l s o be supplied. These terms are combined to c o n s t r u c t s t r u c t u r a l fragments which imply a total shift. The chemist also sets two parameters which regulate the generality of the rules generated. MINIMUM-EXAMPLES is a parameter which s p e c i f i e s the minimum number o f d a t a p o i n t s w h i c h a r u l e must e x p l a i n w i t h i n the t r a i n i n g s e t . The o t h e r p a r a m e t e r , MAXIMUMRANGE, s p e c i f i e s t h e maximum a l l o w a b l e s h i f t range f o r a rule. I f the c h e m i s t wants o n l y the most g e n e r a l t r e n d s i n the d a t a he c a n require a larger number o f examples with moderately s i z e d s h i f t ranges. The f o r m a t o f t h e r u l e s g e n e r a t e d i s substructure

implies^

13

Q

s

h

i

f

t

r a n

ge.

I f the s u b s t r u c t u r e to the l e f t of the arrow w i t h i n some m o l e c u l e t h e n there i s a s h i f t range g i v e n to the r i g h t of the arrow. The i n T a b l e I was g e n e r a t e d on a c o m b i n e d s e t a m i n e s and p a r a f f i n s .

i s present w i t h i n the r u l e shown of a c y c l i c

5.

S C H W E N Z E R AND

MITCHELL

Table

C

NMR

I

Rule

13

61

Rules

Form

1 >

5-4-3-2-7

44.7ppm

^

g(3)$

44.9ppm

8

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch005

Node

1,5,7,8 2 3 4

Atom Type C C C C

Number o f n o n - h y d r o g e n Neighbors 5^

4 2 2

For the s u b s t r u c t u r e shown i n Table I w i t h the c o r r e s p o n d i n g atom f e a t u r e s f o r atom t y p e and number o f n o n - h y d r o g e n n e i g h b o r s atom number 3 w i l l h a v e a C-13 shift in the range 4 4 . 7 p p m to 4 4 . 9 p p m d o w n f i e l d from TMS. The r u l e search procedure i s shown i n Figure 1 . The search begins with the general seed rule C->-oo^g ^ oo ( w h e r e C may be any c a r b o n atom w i t h any atom prSperties and is the observed s h i f t ) and proceeds to expand t h i s r u l e by a d d i n g new a t o m s and atom f e a t u r e s t o t h e s u b s t r u c t u r e w h i c h w i l l n a r r o w t h e p r e d i c t e d range of s h i f t s . The s e e d r u l e i n Figure 1 is expanded by considering a l l possible v a l u e s of "number of n e i g h b o r s " of the central carbon. Each resulting level 1 s u b s t r u c t u r e i s expanded i n level 2 by adding either an "atom type" or "number of neighbors" s p e c i f i c a t i o n to e a c h atom one bond away from the c e n t r a l carbon. At each step o n l y a single atom f e a t u r e from the user s u p p l i e d list i s added. Each s u b s t r u c t u r e g e n e r a t e d i s a s s o c i a t e d w i t h a range o f C-13 s h i f t s . This range i s determined by s e a r c h i n g for o c c u r r e n c e s of the s u b s t r u c t u r e w i t h i n the t r a i n i n g set molecules. The s h i f t range associated w i t h the substructure i s the range of a l l occurrences of the s u b s t r u c t u r e i n the t r a i n i n g s e t . Each s u b s t r u c t u r e g e n e r a t e d i n the r u l e search i s e v a l u a t e d i n terms of the a s s o c i a t e d s h i f t range. If the shift range i s narrower than the range of the p a r e n t r u l e t h e n the added s p e c i f i c a t i o n i s considered to be u s e f u l and t h e search continues from the new substructure, o t h e r w i s e the path i s terminated. The

62

COMPUTER-ASSISTED

STRUCTURE

ELUCIDATION

C*-^-oo^S ^oo c

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch005

LEVEL I: SPECIFICATION^ OF NEIGHBORS

(paths to other descendents)

C*-*8.I^S <45.6

-C- — I7.9^S ^60.I

r

C

LEVEL 2= SPECIFICATION OF NODE TYPE OR NEIGHBORS

(paths to other descendents) C-C-N —38.7^8 <60.I C

C-N^28.8<S <45.6 C

C-C—8.KS <33.5

11.0$ 8 $ 45.6

r

r

C*X--*8.I<8 ^36.7 C

Figure 1. Partial schematic of the rule search. B values are approximate and are given in ppm downfield from TMS. (*) identifies the carbon atom to which the shift is assigned; (X) indicates any non-hydrogen atom. c

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch005

5.

S C H W E N Z E R AND

MITCHELL

C

13

NMR

Rules

63

program r u n s u n t i l a l l b r a n c h e s o f the s e a r c h have been explored. F o r a s u b s t r u c t u r e t o be a c c e p t e d as a f i n a l r u l e i t must s a t i s f y t h e c o n d i t i o n s o f MINIMUM-EXAMPLES and MAXIMUM-RANGE. Since the r u l e search procedure can result in s l i g h t l y d i f f e r e n t substructures being generated which c o v e r t h e same d a t a i n t h e t r a i n i n g s e t , the g e n e r a t e d r u l e set may be redundant. A l e s s redundant set of r u l e s i s c h o s e n by a s s i g n i n g a s c o r e t o e a c h r u l e . The score i s peaks/range w h e r e p e a k s i s t h e number o f d a t a peaks covered by t h e rule in the t r a i n i n g d a t a , and range i s the w i d t h of the s h i f t range f o r the r u l e . The r u l e w i t h t h e h i g h e s t s c o r e i s s e l e c t e d and a l l t h e data points explained by t h a t r u l e are removed. If unexplained data points remain, the score of the remaining rules are reevaluated and the procedure r e p e a t e d . The i n t e n t i s to s e l e c t the strongest rule during each iteration and weaken the rules with e v i d e n c e w h i c h o v e r l a p w i t h i t . The r e s u l t i s a s u b s e t of the strongest rules covering t h e same d a t a as t h e o r i g i n a l r u l e set The a l g o r i t h m u s e d t o g e n e r a t e t h e C-13 NMR rules i s s i m i l a r t o t h e a l g o r i t h m i n t h e Meta-DENDRAL p r o g r a m which generates empirical rules of molecular f r a g m e n t a t i o n f r o m mass s p e c t r a l d a t a ( 1 0 , 1 1 ) . For a s e t o f 22 p a r a f f i n s and 47 a c y c l i c amines w i t h a t o t a l o f 435 d a t a p e a k s a s e t o f 138 r u l e s were generated with the parameter settings MINIMUMEXAMPLES=2 and MAXIMUM-RANGE=2.0 ppm. Structure Selection. To test the information content of the r u l e s and t h u s t h e i r value for use i n structure elucidation, a structure selection test was designed. A program was w r i t t e n w h i c h u s e s the r u l e s t o p r e d i c t s p e c t r a f o r a s e t o f c a n d i d a t e m o l e c u l e s and compares the p r e d i c t e d s p e c t r a to that of an unknown. The c a n d i d a t e s are then ranked a c c o r d i n g to a "best" m a t c h c r i t e r i a t o t h e unknown s p e c t r u m . A s p e c t r u m i s p r e d i c t e d by a p p l y i n g t h e r u l e s t o a molecule, searching for p l a c e s the rule substructure fits i n t o the molecule. This i s done with a graph matching routine. When a match i s found the s h i f t range associated with the r u l e i s predicted f o r the associated carbon atom. Figure 2 is a example of predicting a spectrum. Each rule shown has a s u b s t r u c t u r e w i t h c o r r e s p o n d i n g atom f e a t u r e s t h a t maps i n t o the m o l e c u l e . O f t e n more t h a n one r u l e a p p l i e s t o a g i v e n atom to give d i f f e r e n t p r e d i c t i o n s . I f the predicted ranges are c o n s i s t e n t ( i . e . one of the predicted ranges is contained in the others) the

COMPUTER-ASSISTED

STRUCTURE

ELUCIDATION

?7 C

1- 2-?3-V 5 6 C

C

C

C

6

OBSERVED SPECTRUM 8.7

15.4

179

27.1

334

449

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch005

PREDICTED SPECTRUM (8.1 8.7)

(14.0 15.4) (17.9 243) (27.1

CARBON ATOM

29.7) (29.7 35.6) (312 343) (44.7

RULES WHICH APPLY SUBSTRUCTURE

TO CARBON ATOM

ATOM

ATOM

NODE

TYPE

C,

5-4-3-2-7

4

I

-4-3-2-

1-2-3

NUMBER OF

PREDICTION

NON-H NEIGHBORS

1

I

US)

1,5,7.8

C

>1

2

C

4

3

C

2

4

C

2

2

C

4

3

C

2

4

C

2

1,3

C

*1

2

C

2

1,3.7.8

C

£1

447«?S(3U449

41.1<S(3) $45.1

17.9sS(2)$56.9

3 C

8-2-1

3

7

Figure

2.

Partial

2

C

4

spectrum predictions for 3,3-dimethylhexane. for atom n in ppm downfield from TMS.

2fl7*S(2)*3&6

S(n) is the shift

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch005

5.

SCHWENZER

AND

MITCHELL

C

13

NMR

65

Rules

narrowest p r e d i c t e d range i s used. This i s i l l u s t r a t e d by the r u l e s which apply t o C4 i n Figure 2. The p r e d i c t e d r a n g e 44.7 t o 4 4 . 9 i s c o n t a i n e d i n t h e o t h e r two rule predictions thus i t i s selected as t h e prediction. T h i s m e t h o d c a n be r a t i o n a l i z e d since the actual s h i f t should fall i n t o a l l of the p r e d i c t e d r a n g e s and t h u s i n t o the narrowest. I f the p r e d i c t e d ranges overlap i n c o m p l e t e l y or are d i s j o i n t then the r a n g e s a r e merged t o a r r i v e a t a f i n a l p r e d i c t e d range f o r t h e c a r b o n atom. The p r e d i c t e d s p e c t r u m i s c o m p a r e d t o an unknown s p e c t r u m by a s s i g n i n g each atom's p r e d i c t e d range to t h e c l o s e s t o b s e r v e d s h i f t i n t h e unknown s p e c t r u m . In o r d e r t o be a v a l i d a s s i g n m e n t t h e number o f c a r b o n s i n the s t r u c t u r e l e s s t h e number o f o b s e r v e d s h i f t s must be greater than or e q u a l t o t h e number of m u l t i p l y assigned s h i f t s . I f the assignment does not s a t i s f y t h i s c r i t e r i o n t h e r e q u i r e d number o f m u l t i p l y a s s i g n e d observed s h i f t s are r e a s s i g n e d . The a t o m s which are c h o s e n t o be r e a s s i g n e d a r e t h o s e whose r e a s s i g n m e n t w i l l cause the s m a l l e s t change i n the c o m p a r i s o n s c o r e a s s i g n e d t o the match o f the p r e d i c t e d spectrum to the unknown s p e c t r u m . Results. A s e t o f r u l e s has been generated using a s u b s e t o f the p a r a f f i n d a t a from Lindeman and Adams (3.) and a s u b s e t o f t h e a c y c l i c a m i n e d a t a from Eggert and D j e r a s s i (^L) . M o l e c u l e s w i t h t h e e m p i r i c a l f o r m u l a C H and CgH^ N w e r e e x c l u d e d f r o m t h e t r a i n i n g s e t . The structure selection test was performed by generating a l l structural isomers with the e m p i r i c a l formulas Q ? Q (35 isomers) and C^H^ N (39 isomers). For each of the candidate isomers a s p e c t r u m was predicted. T h e r e were 24 C H spectra a v a i l a b l e from the work o f L i n d e m a n and Adams. The 35 p r e d i c t e d s p e c t r a were c o m p a r e d and r a n k e d a g a i n s t e a c h of these available spectra. The r e s u l t s of t h i s ranking for C H „ as w e l l as a s i m i l a r t e s t on C^H._N a r e shown i n 2 Q

C

H

p n

n

•thU i i .

6

15

66

COMPUTER-ASSISTED

Table

Empirical Formula

II.

Results

of Structure

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch005

C H N 6 15

35

39

ELUCIDATION

Selection

Number o f Candidates

Number o f Structures ^ t

C H 9 20

STRUCTURE

2

ni

a n k

L

" ^fh

th

20

3

1

24

24

24

8 11

2

1

11

11

Peak i n t e n s i t y i n f o r m a t i o n w h i c h g i v e s t h e number o f c a r b o n atoms corresponding t o an o b s e r v e d p e a k was not used. The use o f t h i s information would have r e s u l t e d i n t h e c o r r e c t a s s i g n m e n t f o r t h o s e w h i c h were poorly ranked. The f o r m o f t h e r u l e w h i c h i s p r o p o s e d h a s s e v e r a l advantages. E a c h r u l e h a s i t s own p r e d i c t e d e r r o r and the r u l e s e t c o n s i s t s of r u l e s of varying detail. In the parameter s e t approach t o r u l e f o r m a t i o n the e r r o r i s the s t a n d a r d d e v i a t i o n o f the t r a i n i n g s e t data from the f i t t e d curve. When a n a l y z i n g carbon atoms t h a t e x h i b i t magnetic nonequivalence ( r e s u l t i n g i n d i f f e r e n t c h e m i c a l s h i f t s f o r two i d e n t i c a l g r o u p s i n molecules having an a s y m m e t r i c c a r b o n atom) the parameter set a p p r o a c h w h i c h a t t e m p t s t o p r e d i c t t h e a r i t h m e t i c mean w i l l a l w a y s be i n error. The a d v a n t a g e in predicting total shifts over h y p o t h e s i z i n g partial contributions i s i n a v o i d i n g i n i t i a l b i a s e s as t o what c o n t r i b u t e s t o the shift and how these contributions are to be combined. Our p r o g r a m b y p a s s e s t h e b i a s o f assumed functional form and introduces only a weak bias c o n c e r n i n g w h i c h s t r u c t u r a l f e a t u r e s may be c o n s i d e r e d . The program can c o n s i d e r any s t r u c t u r e which c a n be c o n s t r u c t e d f r o m t h e atom f e a t u r e l a n g u a g e s u p p l i e d by the c h e m i s t . Another advantage of the r u l e format i s t h a t i t c a n be r e a d b a c k w a r d s , t h a t i s , t h e a p p e a r a n c e o f a p e a k i n an unknown s p e c t r u m i m p l i e s a structural feature. This property i s what d i s t i n g u i s h e s the e f f i c i e n c y o f t h i s r u l e f o r m o v e r o t h e r f o r m s when u s e d for structure e l u c i d a t i o n . The m e t h o d o f s t r u c t u r e s e l e c t i o n d e s c r i b e d a b o v e

5.

schwenzer

and mitchell

C

13

NMR

67

Rules

is inefficient when there are a large number o f s t r u c t u r a l isomers. Instead of applying this test to a l l s t r u c t u r a l i s o m e r s f o r a g i v e n e m p i r i c a l f o r m u l a we w i s h t o s e l e c t a s u b s e t o f l i k e l y c a n d i d a t e s t o be p u t through t h i s f i n a l ranking procedure. The a b i l i t y t o read the r u l e s backwards t o do m o l e c u l a r s t r u c t u r e e l u c i d a t i o n w i l l e n a b l e us t o a c h i e v e t h i s g o a l .

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch005

Structure

Elucidation

Structure Search. The i n f o r m a t i o n the chemist must supply to the structure elucidation program includes the empirical f o r m u l a o f t h e unknown and t h e observed spectrum. Two p a r a m e t e r s must a l s o be s e t by the chemist. The f i r s t p a r a m e t e r i s t h e number o f p l a u s i b l e s t r u c t u r e s w h i c h s h o u l d be f o u n d t o be r a n k e d by t h e s t r u c t u r e s e l e c t i o n p r o c e d u r e d e s c r i b e d e a r l i e r . The s e c o n d p a r a m e t e r i s t h e e r r o r r a n g e i n ppm w h i c h should be a s s i g n e d to the rules to account f o r d e f i c i e n c i e s i n the t r a i n i n g s e t , experimental error, solvent e f f e c t s , e t c . I n o r d e r t o make t h e g e n e r a t e d r u l e s more u s e f u l for structure elucidation additional information i s added to the rules. The f o r m of the rule used f o r s t r u c t u r e e l u c i d a t i o n i s shown i n T a b l e I I I . The r u l e now s a y s t h e o b s e r v a t i o n of a shift implies a partial substructure. T h i s i s t h e same r u l e a s shown i n Table I with the addition o f two new p r o p e r t i e s , support p r e d i c t i o n and s e c o n d a r y p r e d i c t i o n .

Table

I I I R u l e Form f o r S t r u c t u r e

Elucidation 1

I 44.7ppm ^ $ ( 3 ) $44.9ppm

> 5-4-3-2-7 8

Node

At om Type C C C C C C C

Number o f non-H Neighbors

Support Prediction

4

29.7^5(2)^35.

2 2

17.9«$(4)«56.

>1 >1

Secondary Prediction 27.1$S(1)434. 30.7^S< U33. 2

1 7 . 9 « S ( U 7 15.4^£(5)«24.3 4

27. 27.

2

6

1«S(7U34.9 US(8)434.9

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch005

68

COMPUTER-ASSISTED

STRUCTURE

ELUCIDATION

Once a s e t o f r u l e s h a s b e e n o b t a i n e d from t h e rule generation program, support predictions c a n be made. F o r e a c h r u l e an a t t e m p t i s made t o f i n d s u p p o r t predictions f o r a l l atoms i n the rule substructure e x c e p t t h e atom f o r w h i c h t h e r u l e i s d e f i n e d . For each r u l e ' s s u b s t r u c t u r e t h e r u l e s a r e a p p l i e d t o i t t o obtain predictions. Predictions made by t h e r u l e s e t f o r a n y atom i n t h e s u b s t r u c t u r e a r e t a b u l a t e d . The final support prediction f o r a p a r t i c u l a r atom i s o b t a i n e d by m e r g i n g t h e t a b u l a t e d p r e d i c t i o n s f o r that atom. Support p r e d i c t i o n s are r e a l l y main p r e d i c t i o n s merged into the r u l e . This i s done to introduce a d d i t i o n a l c o n s t r a i n t s upon t h e s e l e c t i o n o f t h e r u l e s e a r l y i n the search procedure. In the r u l e shown i n T a b l e I I I t h e r e were o t h e r r u l e s w h i c h g a v e p r e d i c t i o n s f o r n o d e s 2 and 4. Although support p r e d i c t i o n s cannot be o b t a i n e d for a l l atoms i n the rule substructure, predicted ranges are associated with a l l atoms using the f o l l o w i n g p r o c e d u r e . S e c o n d a r y p r e d i c t i o n s a r e made by f i n d i n g a l l places the r u l e substructure applies i n the t r a i n i n g s e t d a t a and t a b u l a t i n g t h e o b s e r v e d s h i f t s f o r t h e atoms i n t h e s u b s t r u c t u r e . F o r each atom i n the substructure the observed shifts found i n the training set a r e merged t o form the secondary predictions. The r e l i a b i l i t y o f a s e c o n d a r y p r e d i c t i o n is highly d e p e n d e n t on t h e v a r i e t y of molecular structures i n the t r a i n i n g s e t . Secondary p r e d i c t i o n s f o r atoms which a r e fewer b o n d s away f r o m the r u l e ' s p r e d i c t e d atom w i l l h a v e a s m a l l e r e x p e c t e d e r r o r than those more bonds away. The s h i f t ranges of the secondary predictions a r e broadened by a p r e d e f i n e d constant to account f o r the d e f i c i e n c i e s i n the training set. The a d d i t i o n of support and s e c o n d a r y predictions to the o r i g i n a l rules w i l l form t h e new rule set. The f i r s t s t e p i n t h e r u l e s e a r c h i s t o s e l e c t a subset of the rule s e t which w i l l a c t as p o s s i b l e explanations f o r t h e o b s e r v e d s p e c t r u m . The r u l e s w h i c h have s h i f t ranges consistent with the s h i f t s i n the observed spectrum are s e l e c t e d . Each r u l e selected i s checked f o r agreement w i t h t h e e m p i r i c a l f o r m u l a o f t h e unknown m o l e c u l e and f o r t h e p r e s e n c e o f o b s e r v e d p e a k s in the spectrum f o r a l l support predictions i n the rule. The c o n s t r a i n t t h a t t h e number o f c a r b o n s i n t h e structure less t h e number o f o b s e r v e d s h i f t s must be greater than or equal t o t h e number of multiply a s s i g n e d s h i f t s must a l s o be s a t i s f i e d . For the rule t o be a v a l i d possible explanation t h e r e must be an assignment o f t h e main and s u p p o r t p r e d i c t i o n s which

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch005

5.

S C H W E N Z E R AND

MITCHELL

C

13

NMR

Rules

69

d o e s n o t v i o l a t e t h i s c o n s t r a i n t . The s u b s e t of r u l e s selected by t h i s procedure from the set of p a r t i a l s u b s t r u c t u r e h y p o t h e s e s f o r the observed spectrum. The s t r u c t u r e s e a r c h p r o c e d u r e i s shown i n terms of a specific example in Figure 3. The observed spectrum i s f o r a m o l e c u l e w i t h the e m p i r i c a l formula CgH.g. The number of rules which are possible explanations for the observed s h i f t s are shown. The observed spectrum i n Figure 3 corresponding to 3 , 3 d i m e t h y l h e x a n e was a l s o shown i n F i g u r e 2 . In F i g u r e 2 o n l y t h e r u l e s w h i c h were c o r r e c t e x p l a n a t i o n s f o r the m o l e c u l e a r e shown. F o r e x a m p l e t h e r e was one c o r r e c t e x p l a n a t i o n f o r C3 e x p l a i n i n g the observed shift 33.4* In F i g u r e 3 there are e i g h t p o s s i b l e explanations for 3 3 . 4 of which only one i s c o r r e c t f o r this particular spectrum. An approximate lower bound to the number of possible ways the partial substructures may be a s s e m b l e d i s t h e p r o d u c t o f t h e number of e x p l a n a t i o n s given for the observed shifts. Although in this example t h i s lower bound i s greater than 1000, the c o n s t r a i n t s i m p o s e d by t h e o b s e r v e d s p e c t r u m , e m p i r i c a l f o r m u l a and r u l e s e t d i r e c t e d t h e s e a r c h t o t h e c o r r e c t structure after considering o n l y 12 p a t h s through the search t r e e . The s e a r c h s t r a t e g y shown i n F i g u r e 3 i s t h a t o f a depth f i r s t tree search. Solutions exist a t unknown locations in the tree. A set of heuristics or judgmental r u l e s are chosen to guide the search to the f i n a l s t r u c t u r e i n t h e most e f f i c i e n t m a n n e r . The subset of rules which are possible e x p l a n a t i o n s f o r the observed spectrum are ordered to s e l e c t t h e r u l e w h i c h has t h e g r e a t e s t c h a n c e of being i n t h e m o l e c u l e and w h i c h w i l l l e a d most r a p i d l y t o t h e f i n a l molecule. The h e u r i s t i c s which order the r u l e s include the quality of the main and support predictions. The q u a l i t y of a p r e d i c t i o n means t h e w i d t h o f t h e p r e d i c t i o n r a n g e and t h e c l o s e n e s s s o f t h e o b s e r v e d peak t o the r u l e p r e d i c t i o n r a n g e . The number of peaks e x p l a i n e d i n t h e t r a i n i n g s e t by t h e rule i s a l s o a f a c t o r i n o r d e r i n g the r u l e s . An a n a l o g y c a n be drawn between t h i s h e u r i s t i c and t h e s e l e c t i o n by t h e chemist of a s u b s t r u c t u r e which has f r e q u e n t l y been o b s e r v e d t o be p r e s e n t when a p a r t i c u l a r s h i f t occurs in a spectrum. Another heuristic is the number o f other e x p l a n a t i o n s a p a r t i c u l a r r u l e has. If there i s o n l y one e x p l a n a t i o n f o r an o b s e r v e d s h i f t i t i s v e r y likely t h a t the rule's substructure w i l l be i n the unknown m o l e c u l e . T h e s e c o n d i t i o n s c o u l d be c o n s i d e r e d

COMPUTER-ASSISTED

EMPIRCAL

FORMULA

C

OBSERVED SPECTRUM 8.7 15.4

STRUCTURE

8 18 H

17.9

27.1

33.4

34.9

44.9

6

4

NUMBER OF EXPLANATIONS FOR EACH SHIFT 1 4 5 9 8

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch005

C

.

1 p 8.1-8.7 ' 2 sup 312-34.9 1 2 »3 3 sec 33.4-35.5

ELUCIDATION

lV?3

1 SUP 8.1-8.7 2 P 31.2-34.9 3SEC 33.4-35.5

' 1P 8.1-8.7 ^T273 2 P 312-34.9 3 SEC 33.4-35.5 r

^4 C

6 Unsuccessful paths

5?3 7 6 C

3 p 29.7-35.6 4567 sec 23.8- 62.3

C

1 P 8.1-8.7 2 P 312-349 3 P 29.7-35.6 45,6 SEC 23.8-62.3 c- c- c- c- c 5 4 372 7

5 Unsuccessful paths

8

3 p 44.7- 44.9

4 sup 17.9-56.9 5 sec 15.4-243 178 sec 271- 34.9

C-C-C-C-C-C i C Figure 3. Structure search. Main (P), support (SUP), and secondary (SEC) predictions are given in ppm downfield from TMS. The candidate structures are shown in capitals; the rules selected to build on the candidate structure are shown in lower case.

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch005

5.

S C H W E N Z E R AND

MITCHELL

C

13

NMR

Rules

71

as contributions to the certainty that the rule s u b s t r u c t u r e i s i n the f i n a l m o l e c u l e . The final heuristic which o r d e r s the subset of rule explanations will d e p e n d on the width of the s e c o n d a r y and s u p p o r t p r e d i c t i o n s . Instead of being a c e r t a i n t y measure of the r u l e substructure's presence it will be a m e a s u r e o f e f f i c i e n c y of building with this particular substructure. The neighbors of the rule's predicted atom are examined as potential building sites. The subset of neighbors w h i c h have more t h a n one n o n - h y d r o g e n n e i g h b o r w i l l be t h e atoms on w h i c h c o n s t r u c t i o n w i l l t a k e p l a c e . The s u p p o r t and secondary p r e d i c t i o n s of t h e s e atoms s u g g e s t a subset of r u l e s from those s e l e c t e d as e x p l a n a t i o n s f o r t h e observed spectrum. This subset w i l l form e x p l a n a t i o n s for the s h i f t s of these atoms. The n a r r o w n e s s of the support or secondary p r e d i c t i o n determines t h e number of r u l e s which are p o s s i b l e explanations. The r u l e w h i c h has the narrowest prediction for a neighboring atom w i l l l e a d most r a p i d l y to the f i n a l s t r u c t u r e or the e l i m i n a t i o n o f t h e r u l e as a p o s s i b l e e x p l a n a t i o n . The c o m b i n a t i o n o f t h e s e h e u r i s t i c s w i l l o r d e r t h e set of r u l e e x p l a n a t i o n s , h o p e f u l l y p l a c i n g t h o s e which are the c o r r e c t e x p l a n a t i o n s w i t h the most p r o m i s i n g s t r u c t u r e s to b u i l d on e a r l y i n the l i s t . The r u l e c h o s e n w h i c h s a t i s f i e s t h e s e c o n d i t i o n s f o r thej e x a m p l e i s shown i n F i g u r e 3. The s u b s t r u c t u r e C 1 - C 2 - C 3 - h a s a s u p p o r t p r e d i c t i o n f o r C1, a m a i n p r e d i c t i o n f o r C2 and a s e c o n d a r y p r e d i c t i o n f o r C3Once a r u l e has been s e l e c t e d i t i s n e c e s s a r y t o c h o o s e an atom i n t h e r u l e t o b u i l d o n . Neighbors of a t o m s w i t h m a i n p r e d i c t i o n s and w h i c h do n o t have main predictions themselves w i l l be the atoms which are possible candidates for building. First the support p r e d i c t i o n s f o r t h e s e a t o m s a r e c o n s i d e r e d . I f any o f the support p r e d i c t i o n s are s u f f i c i e n t l y narrow that atom w i l l be c h o s e n as t h e c o n s t r u c t i o n site otherwise the secondary p r e d i c t i o n s are examined. The atom w i t h the narrowest secondary p r e d i c t i o n range i s chosen. For t h e e x a m p l e i n F i g u r e 3 atom C1 was c h o s e n f o r the site of construction. Rules are selected from the subset of r u l e e x p l a n a t i o n s which have s h i f t ranges c o n s i s t e n t w i t h the secondary or support p r e d i c t i o n of the atom c h o s e n . The s e t o f r u l e s s e l e c t e d i s ordered with t h e most likely rule f i r s t using the c r i t e r i o n stated previously. The e x a m p l e i n F i g u r e 3 c h o s e C1 t o b u i l d on and t h e r u l e e x p l a n a t i o n f o r C1 i s shown. Two r u l e s have now b e e n chosen. The process of selection of the second r u l e results i n an i n t i a l mapping between the two r u l e s u b s t r u c t u r e s . The atom

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch005

72

COMPUTER-ASSISTED STRUCTURE E L U C I D A T I O N

from the first rule which was selected as the construction site maps to the atom with the main p r e d i c t i o n i n the second r u l e . I t i s now necessary to find a l l substructures consistent with the i n i t i a l mapping t h a t can r e s u l t from the o v e r l a p o f the r u l e s substructures. A graph matching routine is used to find the overlaped substructures. The properties checked in overlapping are those which make up t h e language of atom features. For the p a r a f f i n s and acyclic amines p r o p e r t i e s checked are atom t y p e and number of non-hydrogen neighbors. If no possible overlap exists a new rule explanation is tried, i f there a r e no possible explanations for the s e l e c t e d atom t h e o r i g i n a l r u l e c h o s e n i s d i s c a r d e d and t h e n e x t choice i s t r i e d . Any c a n d i d a t e s t r u c t u r e s w h i c h r e s u l t f r o m t h e o v e r l a p must be c o n s i s t e n t w i t h the e m p i r i c a l f o r m u l a o f the unknown. The overlap process must also merge the p r e d i c t i o n s f o r t h e two rules. An atom f r o m one r u l e w h i c h has b e e n mapped i n t o an atom i n t h e other r u l e must h a v e a resultant prediction consistent with both p r e d i c t i o n s from the o r i g i n a l r u l e s . For instance the merging of two support predictions results in a p r e d i c t i o n range which i s the i n t e r s e c t i o n of the two p r e d i c t i o n s . F o r the c a n d i d a t e s u b s t r u c t u r e to be v a l i d there must be an o b s e r v e d s h i f t which i s consistent w i t h t h e new m a i n and s u p p o r t p r e d i c t i o n s . In a d d i t i o n t h e r e must be an a s s i g n m e n t o f o b s e r v e d s h i f t s to main and s u p p o r t p r e d i c t i o n s w h i c h s a t i s f i e s the c o n s t r a i n t o f t h e number o f m u l t i p l y a s s i g n e d s h i f t s . In F i g u r e 3 t h e o v e r l a p p i n g o f t h e f i r s t two r u l e s d o e s not r e s u l t i n any c h a n g e i n the s u b s t r u c t u r e but the p r e d i c t i o n s o f t h e two r u l e s a r e m e r g e d . The list of candidate structures which s u r v i v e t h e s e t e s t s have the same f o r m as t h e original rules. The d e c i s i o n as t o t h e l i k e l i h o o d of these candidate s t r u c t u r e s being i n the m o l e c u l e can be made on t h e basis of the candidate structure itself without combining the l i k e l i h o o d of the parent rules. The candidate s t r u c t u r e s are r a n k e d on the b a s i s of the q u a l i t y o f t h e m a i n and s u p p o r t p r e d i c t i o n s . The new subproblem is now identical to the o r i g i n a l p r o b l e m and t h e s t e p s o f s e l e c t i n g an atom t o b u i l d on, selecting rule explanations for t h a t atom, o v e r l a p p i n g t h e two s u b s t r u c t u r e s , m e r g i n g p r e d i c t i o n s , and the ranking of the resultant candidates are repeated. The p r o c e d u r e t e r m i n a t e s when the r e q u i r e d number of plausible molecules are found. Figure 3 shows the paths examined until the first plausible s t r u c t u r e was f o u n d w h i c h was t h e c o r r e c t solution for

5.

S C H W E N Z E R AND

MITCHELL

C

13

NMR

Rules

73

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch005

t h e o b s e r v e d s p e c t r u m . An u n s u c c e s s f u l p a t h i n F i g u r e 3 means t h a t t h e o v e r l a p p i n g o f a r u l e w i t h t h e c a n d i d a t e s t r u c t u r e f a i l e d on s t r u c t u r a l g r o u n d s o r t h e c a n d i d a t e structures generated were not consistent with the observed spectrum. If a rule f a i l s another rule i s t r i e d u n t i l a candidate s t r u c t u r e i s generated which i s consistent with the c o n s t r a i n t s of the observed s p e c t r u m and e m p i r i c a l f o r m u l a o f t h e unknown m o l e c u l e . Results. One p o s s i b l e search procedure has b e e n explained whose goal i s to obtain a few p l a u s i b l e s t r u c t u r e s which c a n t h e n be r a n k e d by the s t r u c t u r e s e l e c t i o n procedure described e a r l i e r . Other p o s s i b l e s e a r c h s t r a t e g i e s a r e b e i n g c o n s i d e r e d whose g o a l w o u l d be to obtain substantial sized s u b s t r u c t u r e s which would then be u s e d as starting points in a structure g e n e r a t i n g p r o g r a m s u c h as C O N G E N ( 1 2 ) . Handling

Stereochemistry

The work on the p a r a f f i n s and a c y c l i c amines requires only topological descriptors in the language o f atom features. Because of the dependence of C-13 shifts on stereochemical features (13,14) i t is necessary to have the facility to include stereochemical terms when they are required. Substituents placed on systems which have static c o n f o r m a t i o n s s u c h as t r a n s d e c a l i n and a n d r o s t a n e w i t h t r a n s r i n g f u s i o n s c a n be d e s c r i b e d i n d i s c r e t e terms. The t e r m s we s e l e c t e d d e s c r i b e the o r i e n t a t i o n on t h e r i n g o f t h e s u b s t i t u e n t as e i t h e r a x i a l or e q u a t o r i a l , and e i t h e r a l p h a o r b e t a . A s u b s t i t u e n t i s b e t a i n 10m e t h y 1 - t r a n s - d e c a l i n i f i t i s on t h e same s i d e of the r i n g as t h e m e t h y l g r o u p and a l p h a i f on the o p p o s i t e side of the r i n g from the methyl group. The r u l e g e n e r a t i o n program w i t h the e x t e n s i o n of the language to i n c l u d e t h e s e atom f e a t u r e s was r u n on a combined set of trans decalins, 10-methy1-trans-decalols and monohydroxylated androstanes with trans ring fusions s e l e c t e d f r o m t h e w o r k s o f G r o v e r and S t o t h e r s ( J J 1 ) and E g g e r t e t . a l . ( J _ 4 . ) . S i x t y r u l e s were g e n e r a t e d t o c o v e r t h e 249 data peaks o f 17 compounds. Samples of the r u l e s g e n e r a t e d a r e shown i n F i g u r e 4. The examination of these r u l e s w i l l show t h a t t h e y a r e u s e f u l f o r the c h e m i s t who w a n t s to study c o n t r i b u t i o n s to the t o t a l shift as well as for the structure elucidation p r o c e d u r e we h a v e o u t l i n e d . In F i g u r e 4 a r e shown two p a i r s o f r u l e s f o r a l p h a carbons. W i t h i n each p a i r of alpha carbon r u l e s there is little shift d i f f e r e n c e between the axial or

COMPUTER-ASSISTED STRUCTURE

Alpha

Carbon

\

C

/

Rules

\

• 70.0 $ S ( * k 70.5

\

C

I* 0 H

ELUCIDATION

/

\

• 71.8SSMS72.5

I*

eq

0

ax

H

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch005

C OH-C* ^

\ ^

I c

I

\

C ^ N ^

OH — > 66.9<S(^68.0

a

x

I

I

\

c

•

67.6$S(*k68.1

*

33.9<:S(*k34.1

c

Beta Carbon Rules

/

/

OH-C e

OH —

|

q

|

Gamma

> 35.6<:S(x)
a

X

C j

Carbon Rules

G

i*

i

V

t

r

I OH

/v

fax/

eq

C

?ax

i

i * » 20A$SW$20.5

r

V

A

• 16.9s«*)$17.1

I OH ax

Figure 4. Sample rules constructed from decalins and hydroxy steroids with trans ring fusions. (*) identifies the carbon atom to which the shift is assigned; is in ppm downfield from TMS.

5.

SCHWENZER AND M I T C H E L L

C NMR Rules

13

75

equatorial orientation of the hydroxyl substituent. However, t h e r e i s a d i f f e r e n c e between t h e p a i r s w h i c h could r e f l e c t t h e i m p o r t a n c e o f t h e number o f gamma carbons. The p a i r o f beta carbon rules show t h e equatorial substitution results i n a shift a t lower f i e l d s than the a x i a l substitution. The gamma r u l e s shown illustrate an a x i a l substituted hydroxy group will result i n a gamma carbon s h i f t at higher f i e l d s t h a n t h a t o f an e q u a t o r i a l s u b s t i t u t e d h y d r o x y g r o u p .

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch005

Conclusion Computer programs have been w r i t t e n t o generate empirical C-13 NMR rules and to do structure e l u c i d a t i o n using these r u l e s . Rules generated on a set o f p a r a f f i n s and a c y c l i c amines have s u c c e s s f u l l y i d e n t i f i e d t h e C-13 NMR s p e c t r a o f m o l e c u l e s n o t i n t h e t r a i n i n g set data. The f o r m o f t h e r u l e i s s u i t e d f o r efficient structure elucidation using an algorithm which assembles substructures s u g g e s t e d by t h e r u l e s a s explanations o f the observed s h i f t s . The i n t r o d u c t i o n of a l i m i t e d s e t o f s t e r e o c h e m i c a l terms t o t h e r u l e generation procedure demonstrated the f e a s i b i l i t y of e x t e n d i n g t h e method t o more c o m p l i c a t e d s y s t e m s . Acknowledgements. T h i s work was s u p p o r t e d by t h e N a t i o n a l I n s t i t u t e s o f H e a l t h under g r a n t s 5R24 0 0 6 1 2 07 a n d AM-17896-02 a n d by t h e A d v a n c e d Research P r o j e c t s A g e n c y u n d e r g r a n t DAHC 1 5 - 7 3 - C - 0 4 3 5 . Computer resources were provided b y t h e SUMEX facility at Stanford University under National I n s t i t u t e s o f H e a l t h g r a n t RR-00785. We a r e g r a t e f u l t o Jim McDonald, Bruce Buchanan, and C a r l D j e r a s s i for helpful discussions and W i l l i a m C. W h i t e f o r providing parts o f t h e Meta-DENDRAL program code f o r u s e i n t h i s work. Literature 1.

Cited

S t o t h e r s , J.B., "Carbon-13 NMR S p e c t r o s c o p y , " Academic P r e s s , New Y o r k , N . Y . 1972. 2. Levy, G . C . and G . L . N e l s o n , "Carbon-13 Nuclear Magnetic Resonance f o r Organic C h e m i s t s , " W i l e y - I n t e r s c i e n c e , New Y o r k , N . Y . 1972. 3. Lindeman, L . P . and J.Q. Adams, A n a l . Chem., (1971), 43,p. 1245. 4. C a r h a r t , R . and C . D j e r a s s i , J.Chem. S o c . , P e r k i n Trans., ( 1 9 7 3 ) , 2 , p . 1753. 5. E g g e r t , H. and C . D j e r a s s i , J. Amer. Chem. Soc. ( 1 9 7 3 ) , 9 5 , p . 3710.

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch005

76

COMPUTER-ASSISTED STRUCTURE

ELUCIDATION

6. Bremser, W., M. Klier, and E. Meyer, Org. Magn. Resonance, (1975),7,p. 97. 7. J e z i , B.A. and D.L. Dalrymple, Anal. Chem. (1975), 47,p. 203. 8. Schwarzenbach, J., J. Meili, H. K o n i t z e r , J . T . C l e r c , Org. Magn. Resonance, (1976),8,p. 11. 9. Mitchell, T.M. and G.M. Schwenzer, to be published 10. Buchanan, B.G., D.H. Smith, W.C. White, R.J. Gritter, E.A. Feigenbaum, J. Lederberg, and C. D j e r a s s i , J. Amer. Chem. Soc., (1976),98, p. 6168. 11. Buchanan, B.G., T.M. Mitchell, Proceedings of the Workshop on Pattern Directed Inference, Honolulu, Hawaii, (1977). 12. Carhart, R., D. Smith, H. Brown, and C. D j e r a s s i , J. Amer. Chem. Soc., (1975),97,p. 5755. 13. Grover, S.H. and J . B . Stothers, Can. J. Chem. (1974),52,p. 870. 14. Eggert, H., C. VanAntwerp, N. Bhacca, and C. D j e r a s s i , J. Org. Chem.,(1976),41.p. 71.

6 Computerized Structural Predictions from

C NMR

13

Spectra HENRY L. SURPRENANT and CHARLES N . REILLEY

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch006

Department of Chemistry, University of North Carolina, Chapel Hill, NC 27514

The wide chemical shift range and narrow line width of C NMR spectra combine to yield an apparently limitless number of spectra. For organic molecules, C spectra are unique finger-printing tools and in an effort to understand C chemical shift patterns, classes of compounds have been studied and spectral features correlated to structural changes within the class. These correlations are usually cast in the form of linear additivity parameters and have been largely successful for alkanes (1, 2) and monofunctional molecules, such as amines (3, 4) and alcohols (5). An unassisted human cannot effectively consider the enormous number of possibilities that exist for molecular structures and C spectra. A number of attempts have been made involving the use of computers to predict structures directly from C spectra. Burlingame, McPherson and Wilson (6) have correctly predicted the structures of seven acyclic saturated hydrocarbons up to empirical formula of C20H42. These compounds were not in the original data set used for the parameterization (2). Their procedure involved (a) nonredundantly generating codes for the structure of each geometrical isomer, (b) checking the code for the proper multiplicity (i.e., the correct number of primary, secondary, tertiary and quaternary carbons, information which can be obtained from an off-resonance decoupling experiment), (c) predicting the spectrum from the code using additivity parameters, and (d) comparing predicted and observed spectra, saving the best matches. Because their program could rapidly generate the structural codes, but was slow in computing the spectrum from that code, a multiplicity f i l t e r was employed to minimize computation time. An off-resonance decoupling experiment is particularly time consuming and i t s requirement is a major drawback of their approach. Although identification of seven structures is far from exhaustive proof of the u t i l i t y , success in these cases i s nevertheless 13

13

13

13

13

77

COMPUTER-ASSISTED STRUCTURE E L U C I D A T I O N

78

encouraging, p a r t i c u l a r l y i n l i g h t o f the s m a l l number of available spectra. A more e f f i c i e n t approach was i n v e s t i g a t e d by Carhart and D j e r a s s i (7_) i n the a n a l y s i s o f C s p e c t r a of amines. T h e i r program (a) enumerated the p o s s i b l e s u b s t r u c t u r e s t h a t could account f o r the observed chemical s h i f t o f each l i n e , (b) comb i n e d the s u b s t r u c t u r e s i n t o l a r g e r fragments, "pruning" i l l o g i c a l and mismatching fragments from the t r e e of p o s s i b l e s t r u c t u r e s , and (c) generated the spectrum f o r each p o s s i b l e s t r u c t u r e and compared these w i t h t h a t o f the unknown. A l l o f the 100 amine s p e c t r a s t u d i e d , w i t h the e x c e p t i o n o f s i x , were unambiguously i d e n t i f i e d , and even i n those s i x cases the c o r r e c t s t r u c t u r e was always the best or second best match. An i n t e r e s t i n g feature of t h e i r program was i t s a b i l i t y t o p r e d i c t s t r u c t u r e s from chemical s h i f t i n f o r m a t i o n a l o n e , not r e q u i r i n g knowledge of the number o f carbon atoms a s s o c i a t e d w i t h each peak. The C s t r u c t u r a l a n a l y s i s problem was examined by J e z l and Dalrymple (8_) using a c l a s s i c a l search system approach. While p o l y f u n c t i o n a l and c y c l i c s t r u c t u r e s can r e a d i l y be handled w i t h i n t h e i r search system, the a b i l i t y t o decipher s t r u c t u r e s of compounds not present i n t h e i r data f i l e i s , o f course, l i m i t e d . This paper summarizes a recent i n v e s t i g a t i o n (9_) o f the uniqueness o f C s p e c t r a and how u n c e r t a i n t i e s i n s p e c t r a l i n f o r m a t i o n a f f e c t the confidence i n the p r e d i c t e d s t r u c t u r e s . In a d d i t i o n , a method f o r h a n d l i n g f u n c t i o n a l groups as d e r i v a t i v e s of hydrocarbons has been developed, and a search based system f o r hydrocarbon, amine, a l c o h o l , and e t h e r molecules i s demonstrated. Four FORTRAN programs have been w r i t t e n t o accomplish these g o a l s . A l l programs were executed on a Modular Computer Systems Model 11/25 minicomputer. More program i n f o r m a t i o n i s a v a i l a b l e elsewhere (9_).

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch006

13

13

13

Spectra Generation and Uniqueness Tests A d d i t i v i t y Parameters. L i n e a r a d d i t i v i t y schemes have been used i n various forms i n many s c i e n t i f i c f i e l d s . The a p p l i c a t i o n t o C NMR chemical s h i f t s began w i t h Grant and P a u l s (1_) study o f s a t u r a t e d a c y c l i c hydrocarbons. T h e i r scheme showed s i g n i f i c a n t d e v i a t i o n s f o r h i g h l y branched a l k a n e s , and was m o d i f i e d by Lindeman and Adams (20 who a l s o expanded the observed data f i l e . This p a r a m e t e r i z a t i o n scheme has been used as a s t a r t i n g b a s i s f o r i n c l u d i n g the e f f e c t s of the presence o f f u n c t i o n a l groups such as amines (_3, MO. The standard e r r o r i n the hydrocarbon p a r a m e t e r i z a t i o n was o n l y 0.79 ppm i n d i c a t i n g a h i g h degree o f r e l i a b i l i t y o f the p r e d i c t e d s p e c t r a . This r e l i a b i l i t y and the l a r g e s i z e o f t h e i r o r i g i n a l data set were the primary reasons f o r the choice o f hydrocarbons f o r t h i s study. 13

T

6.

SURPRENANT AND REILLEY

Predictions

from

C NMR

13

Spectra

79

S t r u c t u r e Generation. The generation o f a l l t h e nonredundant g e o m e t r i c a l isomers f o r C H2n+2 hydrocarbons was accomplished using program ISOMER. To help e n v i s i o n the magnitude o f the problem an enumeration o f the number o f isomers vs. carbon number i s l i s t e d i n Table I . Because i t was not p o s s i b l e t o check every s t r u c t u r e code by hand, ISOMER was assumed t o execute p r o p e r l y because the t o t a l number o f isomers matched e x a c t l y w i t h previous enumerations (10). n

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch006

Table I .

Number o f Geometrical Isomers: C H2n+2 n

n

No. o f Isomers

n

No. o f Isomers

1 2 3 4 5 6 7 8 9 10

1 1 1 2 3 5 9 18 46 75

11 12 13 14 15 16 17 18 19 20

159 355 802 1,858 4,347 10,359 24,894 60,523 148,284 366,319

Figure 1 g i v e s a flow chart o f the o p e r a t i o n o f ISOMER. For a given number o f carbon atoms, ISOMER f i r s t generates a m u l t i p l i c i t y (number o f primary, secondary, t e r t i a r y and quaternary carbons) and checks the v a l i d i t y o f the m u l t i p l i c i t y . A l i n e n o t a t i o n i s generated f o r t h a t m u l t i p l i c i t y , and t h a t l i n e n o t a t i o n s t o r e d on a disk f i l e . The l i n e n o t a t i o n i s then m o d i f i e d attempting t o form new g e o m e t r i c a l isomers o f t h e same m u l t i p l i c i t y . Each nonredundant isomer i s a l s o s t o r e d on the disk f i l e . Because only f o u r types o f carbons occur i n a s a t u r a t e d hydrocarbon, only two b i t s are necessary t o encode each atom: Code 00 01 10 11

= = = =

Meaning 0 1 2 3

= = = =

CH ) = CH f o r f i r s t and l a s t CH CH( C(( 3

3

2

The parentheses are used t o e l i m i n a t e ambiguity o f s t r u c t u r e s around branches. A methyl group, which i s a chain t e r m i n a t o r , ends the most recent chain s t a r t e d , not the f i r s t c h a i n . This n o t a t i o n i s analogous t o an a r i t h m e t i c e x p r e s s i o n i n which the contents w i t h i n the innermost parentheses are evaluated

COMPUTER-ASSISTED STRUCTURE E L U C I D A T I O N

0

ENTER -#• OF CARBONS

./GENERATE "y

7_

MULTIPLICITY/^-

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch006

LAST MULTIPLICITY^—YES - ^ E X I ^

/

GENERATE LINE NOTATION, BRANCHES/ PUSHED TO RIGHT /

->(ST0RE LINE NOTATION)

7

REORDER ATOMS TO (A). RIGHT OF MOVED BRANCH

/ REORGANIZE NOTATION (B) TO MINIMUM NUMERICAL V A L U E / WITHOUT BREAKING BONDS /

Analytical Chemistry

ure 1. Flowchart for program ISOMER (9)

6.

SURPRENANT AND REILLEY

Predictions

from

C

13

NMR

Spectra

81

p r i o r t o any outer parentheses. For example, 3-methyl-4ethylheptane would be encoded as CH CH2CH2CH(CH2CH3)CH(CH3)CH2CH 3

0 1 1 2 1 0

2 0

3

1 0

Another r u l e i s t h a t the arrangement o f chains a f t e r a branch should give the lowest numerical code. For example, 3,4dimethyl-4-ethylheptane would be encoded: CH CH CH C((CH )CH2CH )CH(CH )CH CH 3

2

2

3

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch006

0 1 1 3 0

3

3

1 0

2 0

2

3

1 0

and not as CH CH CH C((CH(CH )CH CH )CH CH )CH 3

2

2

3

0 1 1 3 2 0

2

1 0

3

2

1 0

3

3

0

The s t r u c t u r e s are s t o r e d bit-compressed, 2 b i t s p e r atom, 8 atoms p e r 1 6 - b i t word. Spectra Generation. The l i n e n o t a t i o n s t r u c t u r e codes from ISOMER are r e t r i e v e d as w e l l as program SPECGEN which then computes t h e expected C s p e c t r a on the b a s i s o f Lindeman and Adams a d d i t i v i t y r e l a t i o n s h i p s (2). The flow chart f o r SPECGEN i s given i n Figure 2. Each l i n e n o t a t i o n i s r e a d , unpacked, and processed t o determine the number o f bonds which separate each atom from every other atom i n the molecule. On the b a s i s o f t h i s i n f o r m a t i o n , a r e l a t i v e l y s t r a i g h t f o r w a r d a l g o r i t h m can be c o n s t r u c t e d t o add up proper constants according t o the a d d i t i v i t y parameter r u l e s and thereby compute the p r e d i c t e d spectrum. A 1 ppm. bar graph showing the chemical s h i f t d i s t r i b u t i o n f o r the hydrocarbons C2 through i s presented i n Figure 3. The f i n e s t r u c t u r e observed i n the d i s t r i b u t i o n i s i n t e r e s t i n g and probably stems from r e g u l a r i t i e s inherent i n hydrocarbon molecular s t r u c t u r e s . While the chemical s h i f t s f o r methyl groups f a l l i n a d i f f e r e n t r e g i o n from those f o r methylene groups, the e x t e n s i v e overlap prevents d e f i n i t i v e d i f f e r e n t i a t i o n between the two simply on the b a s i s o f t h e i r chemical s h i f t s . 13

1

1 3

Uniqueness Tests. In t h i s study, a C spectrum was considered unique when, w i t h each o f i t s l i n e s d e v i a t i n g by a given amount from the p r e d i c t e d v a l u e , the proper s t r u c t u r e was s t i l l the best matching and non-unique otherwise. We found t h a t , i n almost every i n s t a n c e when the proper s t r u c t u r e was not the best matching, i t was the second best matching. The matching procedure was simple when the number o f l i n e s i n

COMPUTER-ASSISTED

STRUCTURE

0

READ DISC DIRECTORY

READ NUMBER OF CARBONS - ^ R E A D LINE NOTATION^

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch006

/ UNPACK 7 / L I N E NOTATION / CREATE POSITION MATRIX i.e. Atom i is 2 Bonds from Atom 3, etc. READ POSITION MATRIX AND PROCESS ACCORDING TO PARAMETERIZATION> RULES TO GIVE SPECTRUM

Analytical Chemistry

Figure

2.

Flowchart

for program

SPECGEN

(9)

ELUCIDATION

SURPRENANT A N D REILLEV

Predictions

from

C

NMR

13

Spectra

Total

15,000 #

of

carbons

60

50

40

30

20

10

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch006

CH

1

1

60

1

50

40

IM

30

3

10

20

0

CHo

1— 60

ML 50

40

30

20

10

CH

60

50

40

20

rri-rfTmfhhp 60

50

40 1 3

Figure

3.

Computed

C

Chemical

r~

~i

30

30 Shifts,

10

1

1

20

10

ppm v s .

2

16

r 0

TMS

distribution of the number of carbons having C chemical shifts, C —C 13

0

various

COMPUTER-ASSISTED

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch006

84

STRUCTURE

ELUCIDATION

the r e f e r e n c e and unknown s p e c t r a were equal. The chemical s h i f t s o f both the r e f e r e n c e and the unknown s p e c t r a were ordered from low t o h i g h , and the sum o f the squares o f the d e v i a t i o n s computed. When the r e f e r e n c e spectrum had more l i n e s than the unknown, the chemical s h i f t s were again o r d e r e d , but t o determine which o f the r e f e r e n c e l i n e s should be omitted i n the comparison c a l c u l a t i o n i n v o l v e d more t e s t i n g . To o b t a i n the minimum sum o f the squares o f the d e v i a t i o n s each combination o f r e f e r e n c e l i n e s was t e s t e d against the unknown spectrum. These f u n c t i o n s were implemented by the program COMPARE. The r e s u l t s o f comparisons o f the s p e c t r a l f i l e w i t h normal "unknown" s p e c t r a are l i s t e d i n Table I I . The "unknowns" were the p r e d i c t e d s p e c t r a f o r each isomer but where each l i n e o f the "unknown" spectrum was s h i f t e d by a g i v e n amount, A. The term "normal" r e f e r s t o those proton-decoupled C s p e c t r a w i t h a l l chemical s h i f t s and number o f carbons ( v i a i n t e n s i t i e s ) c o r r e c t l y determined. As i n d i c a t e d i n Table I I u n t i l the number o f carbon atoms becomes l a r g e o r a s i g n i f i c a n t e r r o r e x i s t s i n the chemical s h i f t , the c o r r e c t s t r u c t u r e was p r e d i c t e d . For l a r g e molecules, C13 and above, the number o f wrong matches i n c r e a s e d . Here, i t may be necessary t o perform an off-resonance decoupling experiment t o determine the m u l t i p l i c i t i e s . 1 3

13

Table I I . Normal C Spectra

No. o f Carbons No. o f Isomers O.lia 0.22 0.33 0.44 0.55 0.66 0.77 0.88 0.99 1.10 1.21 1.32 1.43 1.54 1.65 a

% Wrong Matches 9

10

11

12

13

14

35 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

75 0 0 0 0 0 0 0 0 0 0 0 1.3 1.3 2.6 2.6

159 0 0 0 0 0 0 0 0 0 0 1.9 2.5 4.4 6.9 8.8

355 0 0 0 0 0 0 0 0 0.3 0.6 1.7 2.5 5.9 11.0 20.6

802 0 0 0 0 0 0 0.2 0.5 1.4 2.5 4.5 9.7 15.8 26.7 37.9

1858 0 0 0 0 0 0.1 0.5 0.9 2.1 4.6 11.1 20.1 32.4 44.8 56.6

V e r t i c a l column shows A, d e v i a t i o n o f each l i n e , (ppm).

6.

SURPRENANT AND REILLEY

Predictions

from

C

13

NMR

85

Spectra

E x c e l l e n t confidence can be p l a c e d i n the p r e d i c t e d s t r u c t u r e , even f o r C13-C20 molecules, i f the m u l t i p l i c i t y i s known. This i s evidenced by the s t r i k i n g l y low percentage o f wrong matches, even f o r s p e c t r a whose chemical s h i f t s are q u i t e d e v i a n t ; see Table I I I .

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch006

Table I I I .

13

C

No. o f Carbons No. o f Isomers 0.11* 0.22 0.33 0.44 0.55 0.66 0.77 0.88 0.99 1.10 1.21 1.32 1.43 1.54 1.65 a

Spectra w i t h Known M u l t i p l i c i t i e s a Wronjg Matches 12

13

14

355 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

802 0 0 0 0 0 0 0 0 0 0 0 0 .12 0 .12 0 .12 0 .12

1858 0 0 0 0 0 0 0.05 0.05 0.05 0.05 0.11 0.16 0.16 0.22 0.48

V e r t i c a l column shows A, d e v i a t i o n o f each l i n e ,

ppm.

The values given i n Tables I I and I I I assume, however, that i n t e n s i t y i n f o r m a t i o n was s u f f i c i e n t l y accurate t o permit c o r r e c t determination o f the number o f carbons under each peak, and t h a t each peak i s observable. Because o f the v a r i a b i l i t y of the n u c l e a r Overhauser enhancement (NOE) and wide range o f r e l a x a t i o n t i m e s , i n a c c u r a t e area measurements are the r u l e and not the exception w i t h t y p i c a l C s p e c t r a . Indeed, carbons w i t h no attached protons are o f t e n not observed. Two a d d i t i o n a l uniqueness s t u d i e s were c a r r i e d out t o study the influence of inaccurate i n t e n s i t y information. As i n d i c a t e d i n Table IV where i t i s assumed t h a t quaternary carbons were not observed, confidence i n p r e d i c t i o n drops c o n s i d e r a b l y . Even more dramatic i s the drop i n confidence i f i n t e n s i t y i n f o r m a t i o n i s s u f f i c i e n t l y poor so t h a t the number o f carbons f o r a given peak cannot be assigned and thus l i n e s f o r e q u i v a l e n t carbons cannot be d i s t i n g u i s h e d from other l i n e s , Table V. These r e s u l t s p o i n t t o p o t e n t i a l problems a r i s i n g from the manner i n which s p e c t r a are obtained. High pulse r e p e t i t i o n r a t e s without e i t h e r gated decoupling o r 13

COMPUTER-ASSISTED

86

STRUCTURE

ELUCIDATION

p u r p o s e f u l a d d i t i o n o f r e l a x a t i o n reagents o f t e n reduce the i n t e n s i t y o f quaternary carbons and d i s t o r t areas f o r other carbons.

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch006

Table IV.

1 3

C Spectra w i t h Unobserved Quaternary Carbons

0. Wrong Matches 0 No. o f Carbons 12 8 9 10 11 No. o f 18 35 355 Isomers 75 159 O.lia 0 0 0 0 0 0.22 0 0 0 0 0 0.33 0 0 0 0 0 0.44 0 0 0 0 0 0.55 0.5 0 0 0 0 0.66 0 0 0.5 0 0 0.77 0 0 0.5 0 0 0.88 0 0.9 0 0 1 .1 0.99 0 0 5.5 0 3 .2 1.10 0 0 10.5 0 5 .4 0 0 15.0 0 5 .4 1.21 1.32 0 5.9 22.7 5.3 12 .9 1.43 0 5.9 7.9 30.5 22 .6 1.54 0 11.8 39.5 15.8 26 .9 1.65 0 17.6 52.7 21.0 31 .2 Number o f isomers ; c o n t a i n i n g quaternary carbon(s 7 17 38 93 220 a

13

14

802 0 0 0 0 0.2 0.4 1.1 3.4 10.4 19.0 29.6 42.8 52.1 61.3 69.8 ) 537

1858 0 0 0 0.2 0.5 1.5 4.1 7.1 17.8 30.3 44.7 57.5 67.8 75.9 81.6 1306

V e r t i c a l column shows A, d e v i a t i o n of each l i n e , ppm.

The conclusions are obvious. In order t o i d e n t i f y l a r g e molecules n e a r l y unambiguously, C s p e c t r a must be obtained without s a t u r a t i n g quaternary resonances, w i t h minimum NOE e f f e c t s , and perhaps w i t h off-resonance decoupling. I t i s a l s o a d v i s a b l e t o use the same temperature and s o l v e n t c o n d i t i o n s u t i l i z e d i n o b t a i n i n g the data base employed i n t h e parameterization. 1 3

Hydrocarbon

Based

1 3

C Search System 1 3

Program MATCH i s a hydrocarbon based C search system designed t o handle s a t u r a t e d a c y c l i c hydrocarbons and r e l a t e d monofunctional amines, a l c o h o l s , and e t h e r s . The number o f C peaks observed, the type o f f u n c t i o n a l group search d e s i r e d , and the observed chemical s h i f t s are the i n p u t s t o the program. Correct i n t e n s i t y i n f o r m a t i o n i s i m p l i e d by i n p u t t i n g the proper number o f resonances. MATCH outputs the t e n best s t r u c t u r e s and the assignment o f the spectrum. In Figure 4 the o u t l i n e o f the program i s d e p i c t e d . 1 3

SURPRENANT A N D REILLEY

Predictions

from

C

13

NMR

Spectra

(

E N T E R N NUMBER OF CARBONS ) FUNCTIONAL GROUP CODEJ f ENTER N (CHEMICAL SHIFTS ) V^OF UNKNOWN J

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch006

__^/READ STRUCTURE^ FROM FILE J /UNPACK STRUCTURE"^

GENERATE REFERENCE SPECTRUM / /COMPARE REFERENCE / A N D UNKNOWN SPECTRA

/

7

f OUTPUT TEN BEST "\ \STRUCTURES AND ASSIGNMENTS/ Analytical Chemistry

Figure

4. Flowchart

for program

MATCH

(9)

COMPUTER-ASSISTED S T R U C T U R E E L U C I D A T I O N

88

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch006

Table V.

No. o f Carbons No. o f Isomers O.lia 0.22 0.33 0.44 0.55 0.66 0.77 0.88 0.99 1.10 1.21 1.32 1.43 1.54 1.65 Number of a

Equivalent Carbons Treated as One Carbon % Wrong Matches 13

14

802 355 9 18 75 159 35 3.0 0 2.0 4.8 0 0 5.8 4.1 0 3.4 0 6.0 0 7.2 6.9 4.6 0 0 4.8 9.4 7.2 0 6.2 0 9.4 7.9 6.1 8.7 0 8.0 5.9 7.2 9.1 9.4 13.0 0 5.9 9.5 12.1 12.2 9.4 13.0 0 17.6 15.6 15.9 10.9 16.0 18.8 0 17.6 15.6 20.3 12.9 20.5 29.2 0 23.5 18.8 24.6 20.4 27.5 38.3 0 23.5 21.9 26.1 23.8 34.7 49.2 12.5 23.5 21.9 31.9 30.6 43.2 58.3 12.5 23.5 21.9 39.1 37.4 50.5 65.4 12.5 29.4 25.0 44.9 44.9 56.5 71.7 12.5 29.4 31.3 49.3 53.1 63.4 77.2 25.0 35.3 46.9 50.7 57.1 69.2 80.9 isomers c o n t a i n i n g equivalent: carbons 8 760 17 331 32 69 147

1858 1.8 2.6 4.2 7.3 9.1 15.2 23.8 36.8 49.2 60.0 67.9 75.8 81.6 86.0 88.8

7

8

9

10

11

12

1779

V e r t i c a l column shows A, d e v i a t i o n o f each l i n e , ppm.

MATCH reads the f i l e o f encoded hydrocarbon s t r u c t u r e s and computes the expected spectrum f o r each s t r u c t u r e , and then compares each spectrum with t h a t o f the unknown. The storage o f the compressed s t r u c t u r e codes, but not t h e a s s o c i a t e d s p e c t r a , saves s i g n i f i c a n t amounts o f storage without undue degradation o f searching speed. Storage o f the expected hydrocarbon s p e c t r a would double the searching speed f o r hydrocarbons, but would not change the speed f o r f u n c t i o n a l group searches. The hydrocarbons l i s t e d i n Table V I , i n a d d i t i o n t o t h e 69 hydrocarbons u t i l i z e d i n forming the parameters data (2_) were c o r r e c t l y i d e n t i f i e d by MATCH u s i n g t h e i r chemical s h i f t s and i n t e n s i t y i n f o r m a t i o n . To handle f u n c t i o n a l groups, MATCH simply searches t h a t s t r u c t u r e f i l e f o r hydrocarbons which i s one carbon l a r g e r than the molecule o f i n t e r e s t . Nonequivalent carbons i n the s t r u c t u r e s are then determined and the f u n c t i o n a l group r e p l a c e s one o f the nonequivalent carbons t o o b t a i n one new molecule. Because a given hydrocarbon isomer can have more than one nonequivalent carbon, more than one isomer o f the f u n c t i o n a l groups s u b s t i t u t e d molecule i s obtained from each hydrocarbon. F o r example, a l l the monofunctional C^s a l c o h o l i s o m e r i c s t r u c t u r e s can be d e r i v e d from the C^g hydrocarbon

6.

SURPRENANT AND REILLEY

Predictions

from

C NMR

13

89

Spectra

Table V I . Hydrocarbons I d e n t i f i e d by Program MATCH

Unknown Compound n-Decane 2,7-Dimethyloctane 4-n-Propylheptane 2,4,6-Trimethylheptane 2,8-Dimethylnonane 2,5,8-Trimethylnonane 2,9-Dimethyldecane n-Hexadecane 2,5,8,11-Tetramethyldodecane Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch006

Number o f Possible Isomers 75 75 75 75 159 355 355 10,359 10,359

Number o f Carbons 10 10 10 10 Cn C12 Cl2 Cl6 Cl6 c

c

c

c

f i l e , by r e p l a c i n g , one a t a t i m e , each nonequivalent methyl group w i t h a h y d r o x y l . S i m i l a r l y , f o r ethers nonequivalent methylenes are r e p l a c e d w i t h an oxygen. Some o f the f u n c t i o n a l groups t h a t can be handled i n t h i s manner are shown i n Table V I I . Once the i s o m e r i c s t r u c t u r e s are o b t a i n e d , the s p e c t r a can q u i c k l y be generated u s i n g the a d d i t i v i t y parameters f o r t h a t f u n c t i o n a l group ( 4 , 5 ) . Table V I I . Extension o f Hydrocarbon Scheme t o Cover F u n c t i o n a l Groups Base Hydrocarbon:

R"

R"

(

R

R-CH3

1

R-ICHI-R

R-{CJ-R

t

••(CHU-R' Functional Groups:

-NH2

-NH-

-OH

-0-

-C00H

-S-

R

><

R

-SH

R"

R

R

1

c

R<

c=o

-c=c-

R"

c= r c

T

R x

n

R v

N

-F,CI,Br,I

R

1 -N-

R S

F

c=c^

R

c=c " x

-C=N -NO2 The replacement o f a carbon w i t h a f u n c t i o n a l group i n c r e a s e s the number o f p o s s i b l e g e o m e t r i c a l isomers manyfold. Table V I I I l i s t s the enumeration o f the number o f monofunctional hydrocarbon isomers as a f u n c t i o n o f the number o f carbons and type o f f u n c t i o n a l group. While 10,359 isomers e x i s t f o r ^16^3^+3 "th corresponding number o f amine isomers f o r C15H33N e

COMPUTER-ASSISTED

90

STRUCTURE

ELUCIDATION

Table V I I I . Number o f Isomers Having P o t e n t i a l l y Nonequivalent Spectra

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch006

(Monofunctional Groups) No. o f Hydrocarbon Isomers 1 1 2 3 5 9 18 35 75 159 355 802 1858 4347 10359

). o f Lrbons 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

No. o f nonequivalent carbons CH

3

1 1 2 4 8 17 39 89 211 507 1238 3057 7639 19241 48863

CH

2

0 1 1 3 6 15 33 82 194 482 1188 2988 7527 19181 49056

CH 0 0 1 1 3 7 17 40 102 249 631 1594 4073 10442 26974

C 0 0 0 1 1 3 7 18 42 109 269 691 1759 4542 11732

i s 48,863 + 49,056 + 26,974 o r 124,893. With t h i s i n c r e a s e i n number o f isomers, a decrease i n confidence i n p r e d i c t e d s t r u c t u r e s i s expected. Approximately 50 amine s p e c t r a (3_) and 10 a l c o h o l s p e c t r a (5_) were a l l c o r r e c t l y i d e n t i f i e d by program MATCH o p e r a t i n g i n the f u n c t i o n a l group mode. For a C15 amine approximately 1/2 hour was r e q u i r e d f o r the i d e n t i f i c a t i o n , s e a r c h i n g through 124,893 amine s t r u c t u r e s and spectra.

Abstract On the basis of computer generated isomeric structures for acyclic saturated hydrocarbons and Lindeman and Adams' parameter values throughγ-carbons,a number of questions concerning C NMR spectra were addressed. The principle features included determination of the uniqueness of the spectra and the influence of uncertainties in the spectral information which can arise when multiplicities of individual carbons i s unknown (i.e. off-resonance data not available), when quaternary carbon signals are not observed (i.e. because of saturation), and when equivalent carbons are treated as one carbon. Methods of handling the presence of monofunctional groups and utilization of the above information to form the basis of a search system for hydrocarbon, amine, alcohol, and ether molecules are also considered. 13

6.

SURPRENANT A N D REILLEY

Predictions

from

C

13

NMR

Spectra

91

Acknowle dgement s The authors wish t o thank the N a t i o n a l Science Foundation and the N a t i o n a l I n s t i t u t e s o f Health f o r t h e i r f i n a n c i a l support o f t h i s study.

Literature 1. 2.

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch006

3. 4. 5. 6. 7. 8. 9. 10.

Cited

Grant, D. M. and E. G. Paul, J . Am. Chem. Soc. (1964) 86, 2984. Lindeman, L. P. and J . Q. Adams, Anal. Chem. (1971) 43, 1245. Eggert H. and C. Djerassi, J . Am. Chem. Soc. (1973) 95, 3710. Sarneski, J . E., H. L. Surprenant, F. K. Molen, and C. N. R e i l l e y , Anal. Chem. (1975) 47, 2116. Roberts, J. D., F. J . Weigert, J. I. Kroschwitz, and H. J . Reich, J . Am. Chem. Soc. (1970) 92, 1338. Burlingame, A. L., R. V. McPherson, and D. M. Wilson, Proc. Nat. Acad. S c i . , USA (1973) 70, 3419. Carhart, R. E. and C. Djerassi, J. Chem. Soc., Perkin Trans. 2 (1973) 1753. J e z l , B. A. and D. L. Dalrymple, Anal. Chem. (1975) 47, 203. Surprenant, H. L and C. N. R e i l l e y , Anal. Chem. i n press. Lederberg, J . , G. L. Sutherland, B. G. Buchanan, E. A. Feigenbaum, A. V. Robertson, A. M. Duffield, and C. Djerassi, J . Am. Chem. Soc. (1969) 91, 2973.

7 Interactive Structure Elucidation C. A. SHELLEY, H . B. WOODRUFF, C. R. SNELLING, and M . E. MUNK

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch007

Department of Chemistry, Arizona State University, Tempe, AZ 85281

The t o p i c of computer-assisted s t r u c t u r e e l u c i d a tion cuts across disciplinary boundaries as w e l l as the traditional boundaries w i t h i n the discipline of chemistry itself. As a result, scientists with v a r i e d backgrounds and i n t e r e s t s have been a t t r a c t e d t o it. We entered the area through the door marked "natural products chemists." I t was our own work in s t r u c t u r e e l u c i d a t i o n (1-8) that l e d us t o the computer and the belief that the process as p r a c t i c e d by the n a t u r a l products chemist is amenable t o computer modeling. It is q u i t e evident that our efforts bear the imprint of our background. While it is true that no two n a t u r a l products chemists p r a c t i c e the science and art of s t r u c t u r e e l u c i d a t i o n in e x a c t l y the same way, c e r t a i n common features may be discerned (Figure 1). Three i n t e g r a l components of the process are: 1. the reduction of chemical and p h y s i c a l data to t h e i r s t r u c t u r a l i m p l i c a t i o n s ; 2. the familiar partial s t r u c t u r e , an expression comprised of known s t r u c t u r a l fragments and unaccounted-for atoms that summarizes the status of the s t r u c t u r e problem a t any given stage; 3. the design of new experiments, guided by the p a r t i a l s t r u c t u r e , o r some o r a l l of the mol e c u l a r s t r u c t u r e s compatible with i t . The f i n a l s o l u t i o n o f the problem may be described as the c y c l i c process that leads t o the reduction of the number o f s t r u c t u r a l fragments and atoms i n the part i a l s t r u c t u r e t o one. D e s c r i p t i o n of CASE In developing a computer model of the s t r u c t u r e 92

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch007

7.

SHELLEY E T A L .

Interactive

Structure

93

Elucidation

e l u c i d a t i o n p r o c e s s as we e n v i s i o n e d i t , we f i r s t f o c u s s e d o u r a t t e n t i o n on two o f i t s major components: 1. t h e e x p a n s i o n o f a p a r t i a l s t r u c t u r e t o a l l m o l e c u l a r s t r u c t u r e s c o n s i s t e n t w i t h i t and any o t h e r i n f o r m a t i o n a v a i l a b l e t o t h e chemi s t , and 2. t h e r e d u c t i o n o f c h e m i c a l and s p e c t r o s c o p i c data to t h e i r s t r u c t u r a l i m p l i c a t i o n s . The c u r r e n t s t a t u s o f CASE, o u r acronym f o r c o m p u t e r - a s s i s t e d s t r u c t u r e e l u c i d a t i o n , i s summarized i n F i g u r e 2. CASE i s a network o f computer programs d e s i g n e d t o a c c e l e r a t e and make more r e l i a b l e t h e e n t i r e p r o c e s s o f s t r u c t u r e e l u c i d a t i o n . The system i s h i g h l y i n t e r a c t i v e and i s c o n t i n u a l l y e v o l v i n g . The t a s k o f r e d u c i n g c h e m i c a l and s p e c t r o s c o p i c d a t a t o s t r u c t u r a l i n f o r m a t i o n i s p r e s e n t l y s h a r e d by t h e c h e m i s t and t h e computer. An i n f r a r e d i n t e r p r e t e r designed s p e c i f i c a l l y f o r a p p l i c a t i o n to multifunct i o n a l i z e d m o l e c u l e s i s a t an advanced s t a g e o f development and f u l l y o p e r a t i o n a l interp r e t e r i s t o be o f v a l u e t o t h e n a t u r a l p r o d u c t s chemist i n s o l v i n g a c t u a l s t r u c t u r e e l u c i d a t i o n problems, s e v e r a l c r i t e r i a must be met. The program must be a b l e t o make d e c i s i o n s c o n c e r n i n g t h e p r e s e n c e o r absence o f a l a r g e number o f f u n c t i o n a l groups. Iti s i n s u f f i c i e n t s i m p l y t o d i s t i n g u i s h e s t e r s from nonesters. R a t h e r one s h o u l d be a b l e t o make a more s p e c i f i c d i s t i n c t i o n (e.g., s a t u r a t e d e s t e r s v s . unsaturated esters vs. lactones). In a d d i t i o n , t h e program must be a b l e t o i n t e r p r e t t h e r e l a t i v e l y complex s p e c t r a o f compounds found i n n a t u r e . Our program t e s t s f o r t h e p r e s e n c e o r absence o f 169 chemical f u n c t i o n a l i t i e s . I t has been t e s t e d on o v e r 500 s p e c t r a o f v a r y i n g c o m p l e x i t y w i t h a h i g h degree of success. I t i s an a r t i f i c i a l i n t e l l i g e n c e program t h a t attempts t o p a r a l l e l t h e c h e m i s t ' s r e a s o n i n g i n i n t e r p r e t i n g an i n f r a r e d spectrum as much as p o s s i b l e . The c h e m i s t uses an e m p i r i c a l approach t o i n t e r p r e t infrared spectra. A set of guidelines f o r i n t e r p r e t i n g i n f r a r e d s p e c t r a i s determined. These g u i d e l i n e s may r e s u l t from o b s e r v a t i o n o f a s u f f i c i e n t number o f s p e c t r a and/or from r e a d i n g t e x t b o o k s and l e a r n i n g from t h e o b s e r v a t i o n s o f o t h e r s . An i n i t i a l set o f g u i d e l i n e s f o r i d e n t i f y i n g saturated c a r b o x y l i c a c i d s might be t o l o o k f o r a b r o a d , medium t o s t r o n g peak c e n t e r e d around 3000 c m l , a s t r o n g c a r b o n y l peak near 1715 c m l , and a b r o a d , medium i n t e n s i t y peak i n the v i c i n i t y o f 920 cm" . Any compounds w i t h s p e c t r a which f o l l o w e d t h e s e g u i d e l i n e s would be i n t e r p r e t e d I

_

_

1

f

t

n

e

94

COMPUTER-ASSISTED

CHEMICAL AND SPECTROSCOPIC DATA

ELUCIDATION

MOLECULAR FORMULA

PARTIAL STRUCTURE

CHEMISTI

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch007

STRUCTURE

COMPATIBLE STRUCTURES

—>

UNIQUE STRUCTURE

EXPERIMENTAL DESIGN AND EXECUTION

Figure

1.

"Manual"

structure

elucidation

MOLECULAR FORMULA

COMPATIBLE

STRUCTURES

CHEMICAL DATA f AUTOMATED INTERPRETER/

Figure

2.

CASE

network

TRUNCATED LIST OF STRUCTURES

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch007

7.

SHELLEY

ET

AL.

Interactive

Structure

Elucidation

95

as c o n t a i n i n g a c a r b o x y l i c a c i d f u n c t i o n a l i t y . Simil a r i n i t i a l s e t s o f g u i d e l i n e s must be e s t a b l i s h e d f o r a l l other f u n c t i o n a l i t i e s . Next the program i s t e s t e d on some i n f r a r e d s p e c t r a . Anytime the program makes an e r r o n e o u s i n t e r p r e t a t i o n , i t i s n e c e s s a r y t o a l t e r the g u i d e l i n e s t o c o r r e c t the m i s t a k e . As l o n g as a c h e m i s t can do a b e t t e r j o b o f i n t e r p r e t i n g a spectrum than the program, then the program can be a l t e r e d by u s i n g t h i s new i n f o r m a t i o n so t h a t i t w i l l do a b e t t e r j o b o f i n t e r p r e t a t i o n i n the f u t u r e . T h i s a b i l i t y t o a l t e r the program t o c o r r e c t m i s t a k e s i s a major advantage o f a r t i f i c i a l i n t e l l i g e n c e programming. As w i t h o t h e r components o f CASE, the i n f r a r e d i n t e r preter i s continually evolving. The c o m p u t e r - d e r i v e d a n a l y s i s o f t h e i n f r a r e d spectrum can be reviewed by the c h e m i s t p r i o r t o a u t o m a t i c e n c o d i n g , or t h e c h e m i s t can be bypassed, w i t h t h e a u t o m a t i c a l l y encoded f u n c t i o n a l group i n f o r m a t i o n b e i n g s e n t d i r e c t l y t o the m o l e c u l e assembler. The development o f programs f o r the automated i n t e r p r e t a t i o n of other spectroscopic information i s a t an e a r l i e r s t a g e . A preliminary investigation u s i n g p a t t e r n r e c o g n i t i o n t e c h n i q u e s t o a i d i n the i n t e r p r e t a t i o n o f 13c-NMR s p e c t r a has y i e l d e d promising results. A t the p r e s e n t t i m e , much o f the r e maining i n t e r p r e t a t i o n of s p e c t r a l data i s chemistderived. CASE a l s o i n c o r p o r a t e s programs f o r the automated i n t e r p r e t a t i o n of chemical information. As an example, CASE a c c e p t s t h e number o f moles o f p e r i o d a t e consumed by a compound and uses the i n f o r m a t i o n t o c o n s t r a i n t h e m o l e c u l e assembler. Thus, o n l y m o l e c u l e s c o n s i s t e n t w i t h the p e r i o d a t e i n f o r m a t i o n a r e assembled. The m o l e c u l e assembler a c c e p t s b o t h the computerd e r i v e d and u s e r - d e r i v e d s t r u c t u r a l i n f o r m a t i o n , and, g i v e n the m o l e c u l a r f o r m u l a , c o n s t r u c t s a l l s t r u c t u r a l isomers c o m p a t i b l e w i t h the i n p u t . A nonredundant l i s t i n g o f the c o m p a t i b l e m o l e c u l e s i s p r e s e n t e d t o the c h e m i s t i n the c o n v e n t i o n a l s t r u c t u r a l language. The l i s t o f c o m p a t i b l e m o l e c u l e s u s u a l l y can be f u r t h e r t r u n c a t e d by comparing c e r t a i n p r e d i c t e d s p e c t r o s c o p i c p r o p e r t i e s f o r each m o l e c u l e c o n s t r u c t e d w i t h the o b s e r v e d s p e c t r o s c o p i c p r o p e r t i e s o f the unknown. The t a s k s o f p r e d i c t i n g , comparing, and r a n k i n g are a s s i g n e d t o the spectrum s i m u l a t o r . These programs a r e a l s o a t an e a r l y s t a g e of development. One o p e r a t i o n a l component c a l l e d PEAK w i l l be i l l u s trated later. G i v e n a m o l e c u l a r s t r u c t u r e , PEAK p r e d i c t s the number o f s i g n a l s e x p e c t e d i n the l^C-NMR spectrum.

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch007

96

COMPUTER-ASSISTED

STRUCTURE

ELUCIDATION

The m o l e c u l e assembler ( F i g u r e 3) i s unique i n approach and i n i t s s i m p l e s t form was d e s i g n e d t o expand e x h a u s t i v e l y the c o n v e n t i o n a l p a r t i a l s t r u c t u r e i n t o a l l s t r u c t u r a l isomers c o n s i s t e n t w i t h i t . The a l g o r i t h m on which the assembler i s based i s r e c u r s i v e . By c o n c e n t r a t i n g more on p a r t i a l s t r u c t u r e e x p a n s i o n than on m o l e c u l a r f o r m u l a e x p a n s i o n , g r e a t e r compactness and e f f i c i e n c y were a c h i e v e d . From the s t a r t i t was r e c o g n i z e d t h a t a b r o a d l y a p p l i c a b l e m o l e c u l e assembler must be c a p a b l e o f u t i l i z i n g more i n f o r m a t i o n than j u s t the c o n v e n t i o n a l p a r t i a l s t r u c t u r e , because the c h e m i s t g e n e r a l l y has more i n f o r m a t i o n than can be e x p r e s s e d by the conventional p a r t i a l structure. Thus, i n F i g u r e 3 the term " p a r t i a l s t r u c t u r e " i s used i n a b r o a d e r c o n t e x t . The p a r t i a l s t r u c t u r e i n c l u d e s the m o l e c u l a r f o r m u l a and c o m p u t e r - d e r i v e d and/ o r c h e m i s t - d e r i v e d s t r u c t u r a l fragments. Atoms i n t h e s e fragments must not d u p l i c a t e one a n o t h e r ; t h a t i s , t h e fragments must be n o n o v e r l a p p i n g . The p a r t i a l s t r u c t u r e a l s o i n c l u d e s supplementary i n f o r m a t i o n t h a t cannot be e x p r e s s e d i n terms o f n o n o v e r l a p p i n g s t r u c t u r a l fragments. Communication w i t h CASE i s a c h i e v e d by means t h a t mimic the n a t u r a l language o f the c h e m i s t . The molecu l a r f o r m u l a i s i n p u t i n s t a n d a r d format. The v e r s a t i l e l i n e a r code d e s i g n e d f o r s t r u c t u r a l fragment i n p u t i s i l l u s t r a t e d i n F i g u r e 4. Thus, l i n e s 1, 3, 5, 7 and 9 r e p r e s e n t r e s p e c t i v e l y , a 1 - h y d r o x y e t h y l group w i t h a s i n g l e r e s i d u a l v a l e n c e a t the 1 - p o s i t i o n , a c a r b o n y l group w i t h d o u b l y d e f i c i e n t carbon atom, a c a r b o n y l group j o i n e d t o oxygen w i t h v a l e n c e d e f i c i e n c i e s a t carbon and s i n g l e bonded oxygen, a c a r b o x y l group and a c y c l o h e x a n e r i n g w i t h each c a r b o n atom doubly valence d e f i c i e n t . In the l a t t e r example, note t h a t r i n g d e s i g n a t i o n i s a c h i e v e d by l a b e l i n g one carbon atom w i t h a number (1 i n t h i s case) and f o r m i n g a bond between a carbon f i v e atoms removed and t h e label. The l i n e a r code may be f u r t h e r e l a b o r a t e d by a d d i n g atom and fragment tags t o d e s c r i b e t h e l o c a l environment o f atoms i n a s t r u c t u r a l fragment w i t h o u t c o n c e r n f o r o v e r l a p p i n g atoms. F o r example, the v a l e n c e d e f i c i e n t carbon atom o f the 1 - h y d r o x y e t h y l group i s r e q u i r e d t o bond t o a methine c a r b o n by the a d d i t i o n o f t h e atom t a g ( l i n e 2 ) . S i n c e the i n f o r m a t i o n i s p r o v i d e d by means of t h e t a g , the c h e m i s t need not be concerned whether t h a t methine c a r b o n d u p l i c a t e s a s i m i l a r group i n a n o t h e r s t r u c t u r a l fragment.

7.

SHELLEY E T AL.

Interactive

Structure

97

Elucidation

MOLECULAR FORMULA NONREDUNDANT LIST OF COMPATIBLE STRUCTURES

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch007

STRUCTURAL FRAGMENTS

SUPPLEMENTARY INFORMATION

Figure

3.

Molecule

assembler

1. CH3-CH-0H

2. CH -CH-0H 3

3. 0=C L

0 = C

MOLECULAR FORMULA

5. 0 = C-0 6. 0 = C-0<=> 7. 0 = C-OH

STRUCTURAL FRAGMENTS

8. 0 = C-OH 9. 1:C-C-C-C-C-C-1 10. 1:C-C-C-C-C-C-1<#> MULTIPLE BONDS RINGS SUBSTRUCTURE CONTROL AUTOMATED CHEMICAL CONSTRAINTS AUTOMATED SPECTROSCOPIC CONSTRAINTS^

Figure

4.

SUPPLEMENTARY INFORMATION

Constraints

PARTIAL STRUCTURE

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch007

98

COMPUTER-ASSISTED S T R U C T U R E E L U C I D A T I O N

Statement 4 r e s t r i c t s the environment o f t h e c a r b o n y l c a r b o n atom w i t h t h r e e atom t a g s : p r o h i b i t s i t s p r e s e n c e as p a r t o f a 3-5 membered r i n g , i . e . , t h e c a r b o n y l group i s u n s t r a i n e d (R f o r r i n g , 03 f o r minimum r i n g s i z e , 05 f o r maximum r i n g s i z e , 0 f o r minimum number o f s u c h r i n g s , 0 f o r maximum number o f such r i n g s ) , and and r e q u i r e t h e c a r b o n y l t o be f l a n k e d by a methine and a methylene group, respectively. Statement 6 s p e c i f i e s a five-membered lactone w i t h u n s a t u r a t i o n c o n j u g a t e d t o t h e a l c o h o l oxygen. The environment o f t h e c a r b o n y l c a r b o n atom i s d i s c l o s e d by t h r e e atom t a g s : r e q u i r e s t h a t i t j o i n t o c a r b o n , d e s i g n a t e s t h a t i t must be p a r t o f a five-membered r i n g (the absence o f minimum and maximum v a l u e s f o r t h e number o f such groups i n f e r s a minimum o f 1) and <=00> p r e c l u d e s t h e p r e s e n c e o f a , 3 u n s a t u r a t i o n t o t h a t c a r b o n atom (= i s t h e double bond symbol, 0 f o r minimum number o f d o u b l e bonds, 0 f o r maximum number). The two atom t a g s f o r the a l c o h o l oxygen r e q u i r e t h a t i t bond t o c a r b o n () and the p r e s e n c e o f a , 3 - u n s a t u r a t i o n (<=>). Statement 8 r e p r e s e n t s a c a r b o x y l i c a c i d f u n c t i o n t h a t cannot b e a r a , 3 - u n s a t u r a t i o n . The symbol <#> i s a fragment t a g t h a t i n s t r u c t s t h e program t o f o r b i d the f o r m a t i o n o f i n t e r n a l bonds i n a fragment. Thus, because o f statement 10, no m o l e c u l a r s t r u c t u r e s cont a i n i n g m u l t i p l e bonds o r b r i d g e s i n t h e c y c l o h e x a n e r i n g w i l l be assembled. Some s t r u c t u r a l i n f o r m a t i o n i s n o t s p e c i f i c t o atoms or fragments and must be t r e a t e d s e p a r a t e l y . Thus, M u l t i p l e Bonds d e s i g n a t e s the number, e i t h e r e x a c t o r a range, and k i n d s o f m u l t i p l e bonds a l l o w e d ; Rings p e r m i t s t h e same c o n t r o l o v e r r i n g s . Substruct u r e C o n t r o l can be used t o r e q u i r e t h e p r e s e n c e o r absence o f any s p e c i f i e d s t r u c t u r a l fragment u s i n g t h e same l i n e a r code d e s c r i b e d . Automated C h e m i c a l and S p e c t r o s c o p i c C o n s t r a i n t s i n s t r u c t the m o l e c u l e assemb l e r to generate only those s t r u c t u r e s c o n s i s t e n t with the a p p l i e d c o n s t r a i n t s . F o r example, the a p p l i e d c o n s t r a i n t may be t h e number o f moles o f p e r i o d a t e t h a t an unknown compound consumes o r the number o f s i g n a l s i n t h e -^C-NMR spectrum o f t h e unknown. These a p p l i c a t i o n s w i l l be i l l u s t r a t e d l a t e r . The i n f o r m a t i o n c o n t a i n e d i n " p a r t i a l s t r u c t u r e " ( F i g u r e 3) i s used i n a way t h a t maximizes t h e e f f i c i e n c y o f the molecule assembler. The g o a l i s t o p r e c l u d e the assembly o f an i n v a l i d m o l e c u l e r a t h e r t h a n r e j e c t i t a f t e r assembly. Thus, i n most c a s e s , atom and fragment t a g s , and supplementary i n f o r m a t i o n

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch007

7.

SHELLEY E T A L .

Interactive

Structure

Elucidation

99

c o n s t r a i n the assembly p r o c e s s . In o t h e r c a s e s , retrospective searching i s required. The m o l e c u l e assembler i s c o n s t r a i n e d i n o t h e r ways as w e l l . 1. A u s e r - d e f i n e d l i b r a r y o f h i g h l y s t r a i n e d o r c h e m i c a l l y u n s t a b l e m o e i t i e s c o n s t r a i n s the assembly o f m o l e c u l e s c o n t a i n i n g t h e s e s t r u c t u r a l f e a t u r e s , £•£•/ a c y c l o p r o p a n o n e r i n g . 2. The assembly o f d u p l i c a t e s t r u c t u r e s i s minimized. A number o f t e c h n i q u e s a r e i n c o r p o r a t e d i n t o the program t o e l i m i n a t e duplicates prospectively. The most i m p o r t a n t t e c h n i q u e i n v o l v e s the p e r c e p t i o n o f t o p o l o g i c a l symmetry (11) a t each s t e p o f t h e assembly p r o c e s s . In t h i s way, o n l y one member o f a group o f t o p o l o g i c a l l y e q u i v a l e n t v a l e n c e - d e f i c i e n t atoms i s p e r m i t t e d t o i n i t i a t e bond f o r m a t i o n . In s p i t e o f t h e h e u r i s t i c s used, t h e t o t a l e x c l u s i o n o f d u p l i c a t e s cannot always be a s s u r e d . Those d u p l i c a t e s s t i l l formed can be e l i m i n a t e d retrospectively. A newly d e s i g n e d and h i g h l y e f f i c i e n t c a n o n i c a l naming a l g o r i t h m performs t h i s r e m a i n i n g t a s k . The a l g o r i t h m a l s o r e c o g n i z e s resonance forms and i n c l u d e s o n l y one member on t h e l i s t o f v a l i d s t r u c t u r e s . Applications In t h e d i s c u s s i o n o f a p p l i c a t i o n s , our purpose i s t o p r o v i d e an o v e r v i e w o f the system network, i t s i n t e r a c t i v e n a t u r e , and i t s scope. CASE was o r i g i n a l l y d e v e l o p e d on r e a l w o r l d problems under s t u d y i n our own l a b o r a t o r y and i n a c o u p l e o f o t h e r n a t u r a l products l a b o r a t o r i e s . In more r e c e n t y e a r s "simul a t e d " r e a l w o r l d problems, t h a t i s , problems t a k e n from t h e c h e m i c a l l i t e r a t u r e , have a l s o p l a y e d an important r o l e . Such problems, a l t h o u g h a d m i t t e d l y somewhat c o n t r i v e d , p r o v i d e d the n e c e s s a r y b r e a d t h o r depth e x a c t l y when needed i n t h e program development and t e s t i n g . S p e c i f i c k i n d s o f problem s i t u a t i o n s a r e d i f f i c u l t t o produce on demand w i t h slower moving r e a l w o r l d s t r u c t u r e problems. F a u l k n e r e t a l . (_12) r e c e n t l y i s o l a t e d a h a l o g e nated monoterpene from a sea hare and r e p o r t e d t h e s t r u c t u r e (A). The assignment was made i n p a r t on the c h e m i c a l and s p e c t r o s c o p i c d a t a r e p o r t e d i n the paper and i n p a r t by analogy t o a r e l a t e d known compound. The s t r u c t u r a l i n f o r m a t i o n d e r i v e d from the c h e m i c a l and s p e c t r o s c o p i c d a t a i s shown i n the I/O p r i n t o u t o f

100

COMPUTER-ASSISTED

STRUCTURE

ELUCIDATION

the i n i t i a l run o f t h e problem ( F i g u r e 5 ) . The mol e c u l a r f o r m u l a i s i n p u t as C 1 0 H 1 3 B 2 X 3 , where X, a s t a n d a r d h a l o g e n symbol, i s CI i n t h i s problem, and B i s u s e r - d e f i n e d as B r . S i n c e B i s n o t a s t a n d a r d mol e c u l a r f o r m u l a symbol, t h e computer asks t h e u s e r t o s p e c i f y i t s v a l e n c e . Next the n o n o v e r l a p p i n g f r a g ments are l i s t e d . In o r d e r they a r e : an a l l y l c h l o r i d e fragment w i t h a l l carbon atoms h a v i n g r e s i d u a l v a l e n c e , a v i n y l bromide u n i t , a bromomethylene u n i t , and two CH groups each a t t a c h e d t o a q u a t e r n a r y carbon. The s t r u c t u r e c o n s t r a i n i n g d e v i c e , S u b s t r u c t u r e C o n t r o l , i s used t o l i m i t t h e number o f CH groups t o t h e known v a l u e o f 2. A t t h i s p o i n t we a r e ready t o s t a r t the m o l e c u l e assembler. S i n c e we have no e s t i m a t e on t h e t o t a l number o f s t r u c t u r a l isomers c o n s i s t e n t w i t h t h i s i n p u t , we s e t some a r b i t r a r y l i m i t on m o l e c u l e assembly, 20 i n t h i s c a s e , t o make c e r t a i n t h a t the problem s t r a t e g y i s i n d e e d sound b e f o r e g i v i n g f r e e r e i n t o t h e m o l e c u l e assembler. The f i r s t 20 s t r u c t u r e s were examined and we found t h a t we c o u l d e a s i l y c o n s t r a i n the m o l e c u l e assembler w i t h 2 a d d i t i o n a l p i e c e s o f i n f o r m a t i o n because we saw a c o n j u g a t e d d i e n e i n some s t r u c t u r e s and a gem-dimethyl group i n some s t r u c t u r e s . One o f the a u t h o r s o f t h e paper, Dr. I r e l a n d , i n d i c a t e d the a v a i l a b i l i t y o f e v i d e n c e , not i n t h e paper, to e x c l u d e a c o n j u g a t e d d i e n e . In a d d i t i o n , the geminal methyl groups s h o u l d have been e x c l u d e d by us based on the p u b l i s h e d PMR i n f o r m a t i o n . These two c o n s t r a i n t s were added by c a l l i n g Substructure C o n t r o l (Figure 6). Sixteen structures were g e n e r a t e d . Of the 16, 8 had geminal c h l o r i n e atoms. I r e l a n d b e l i e v e d t h e s e s t r u c t u r e s t o be unl i k e l y on t h e b a s i s o f i n d i r e c t c h e m i c a l e v i d e n c e . The r e m a i n i n g 8 c a n d i d a t e s c o u l d be pruned t o 4 s t r u c t u r e s on t h e b a s i s o f some o f t h e mass s p e c t r o s c o p i c data. The s t r u c t u r e a s s i g n e d (A) i s one o f the f o u r . 3

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch007

3

In our own s t r u c t u r e work on the a n t i b i o t i c a c t i n o b o l i n (_5 ) , which i s a compound u n r e l a t e d i n s t r u c t u r a l type t o known a n t i b i o t i c s , an e a r l i e r v e r s i o n o f CASE was used. The s t r u c t u r e study was i n i t i a t e d by an e x a m i n a t i o n o f a d e g r a d a t i o n p r o d u c t ,

SHELLEY E T AL.

Interactive

Structure

Elucidation

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch007

T I T L E J FAULKNER J0Cv41. v2461 (1976) • E N T E R T H E MAXIMUM NUMBER OF S T R U C T U R E S TU B E G E N E R A T E D J MOLECULAR FORMULA• C 1 0 H 1 3 B 2 X 3 WHAT I S T H E V A L E N C E OF A B ATOM? t 1 FRAGMENT ( S ) t CH-CH-CH--X FRAGMENT(S)• CH«CH-B FRAGMENT(S)i CH2-B F R A G M E N T ( S ) : CH3 CH3 FRAGMENT(S): ? C O N S T R A I N T ( S ) J SUBSTR FRAGMENT(S): CH3 5 MINIMUM• 2 MAXIMUM: 2 CONSTRAINTS) : y COMMAND: G E N E R A T E

20

STRUCTURE'S G E N E R A T E D Figure

5

T I T L E J FAULKNER JOC.41*2461(1976)• M 0 L E C U L A R F 0 R MIJ I... A I C :l. 0 H 1.3 B 2 X 3 WHAT I S T H E V A L E N C E OF A B ATOM? S 1. F R A G M E N T ( S ) : CH=CH-CH-X F R A G M E N T ( S ) : CH=CH~B FRAGMENT(S)t CH2-B F R A G M E N T ( S ) J CH3 C H 3 < C H 0 > FRAGMENT(S ) : ? CONSTRAINT< S ) J SUBSTR FRAGMENT(S)J CH3 t MINIMUM: 2 MAXIMUM: 2 C O N S T R A I N T ( S ) : SUBSTR F R A G M E N T ( S ) : C=C - C=C ? MINIMUM: o MAXIMUM: o C O N S T R A I N T ( S ) : SUBSTR FRAGMENT(S): CH3-C-CH3 v MINIMUM: o MAXIMUM: o CONSTRAINT(S): y COMMAND: G E N E R A T E

1.6 S T R U C T U R E S G E N E R A T E D Figure 6

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch007

102

COMPUTER-ASSISTED S T R U C T U R E

ELUCIDATION

actinobolamine (B), c o n t a i n i n g 9 o f the o r i g i n a l 13 c a r b o n atoms ( 2_ ) . T h a t problem has been r e r u n on the c u r r e n t v e r s i o n o f CASE. The IR i n t e r p r e t e r program r e p o r t s the p r e s e n c e o f a l c o h o l and ketone w i t h h i g h c o n f i d e n c e l e v e l s , 4 and 3 ( F i g u r e 7 ) . Subclass information, with c o n f i dence l e v e l s o f 2, i s o f i n s u f f i c i e n t c e r t a i n t y f o r use by t h e m o l e c u l e assembler. The same i s t r u e o f o t h e r major c l a s s e s o f f u n c t i o n a l groups l i s t e d . Note t h a t the program c o n s i d e r s o n l y f u n c t i o n a l groups c o n s i s t e n t w i t h the m o l e c u l a r f o r m u l a . In a d d i t i o n , atoms o f f u n c t i o n a l groups r e c e i v i n g a c o n f i d e n c e l e v e l o f 4 a r e a u t o m a t i c a l l y s u b t r a c t e d from the m o l e c u l a r formula and a r e not c o n s i d e r e d f u r t h e r . The I/O p r i n t o u t shown i n F i g u r e 8 i n c l u d e s u n s t r a i n e d ketone c a r b o n y l , i . e . , not p a r t o f a 3-5 membered r i n g , f l a n k e d by a C b e a r i n g f o u r r e a d i l y exchangeable p r o t o n s , a secondary h y d r o x y l group, a secondary amine a t t a c h e d t o two d i f f e r e n t methine c a r b o n s , and a 1 - h y d r o x y e t h y l group. The M u l t i p l e Bond C o n s t r a i n t s DOUBLE and TRIPLE p r e c l u d e the g e n e r a t i o n o f double and t r i p l e bonds. Substructure C o n t r o l r e s t r i c t s assembly o f k e t a l and a m i n a l l i n k a g e s , and a l s o t h e number o f CH groups t o one. The consumption o f two moles o f p e r i o d a t e by a c t i n o bolamine i s i m p o r t a n t s t r u c t u r a l i n f o r m a t i o n , the s i g n i f i c a n c e o f which i s a u t o m a t i c a l l y c o n s i d e r e d by c a l l i n g PERIODATE and g i v i n g t h e molar u p t a k e . This i n f o r m a t i o n c o n s t r a i n s t h e m o l e c u l e assembler i t s e l f ; the s e a r c h f o r c o m p a t i b i l i t y i s not done r e t r o s p e c tively. CASE produced f i v e s t r u c t u r e s c o n s i s t e n t w i t h the a v a i l a b l e e v i d e n c e ( F i g u r e 9 ) . Armed w i t h the a s s u r ance t h a t no v a l i d s t r u c t u r e had been o v e r l o o k e d , an examination of these f i v e s t r u c t u r e s provided i n v a l u a b l e g u i d a n c e i n the d e s i g n o f the minimum number o f experiments t o a s s i g n the s t r u c t u r e o f a c t i n o b o l a m i n e correctly. A l l o f the e x p e r i m e n t s were s p e c t r o s c o p i c i n n a t u r e and l e d t o the c o r r e c t s t r u c t u r e (B). 3

OH

CHOH CH

3

(B)

7.

Interactive

SHELLEY E T AL.

Structure

Elucidation

H30 C13H20N206

•

C9H15N03

AC T I N O B O L I N

ACTINOBOLAMINE

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch007

IR INTERPRETER OUTPUT

EACH C L A S S HAS A CONFIDENCE L E V E L OF 0-4• 4 •••• D E F I N I T E L Y PRESENT 3 - HIGH P R O B A B I L I T Y 2 - MEDIUM P R O B A B I L I T Y 1 - LOW P R O B A B I L I T Y 0 ™ D E F I N I T E L Y ABSENT

CLASS 1 • ALCOHOL-

SUBCLASSES •--

4

2, PRIMARY 4 |

6. KETONE 8* LACTAM-

~

- 3

••-

2

10 • CARBAMATE -

3-A f B-A ' >B' ~

7. SATURATED

- -

9. 5 MEMBER W/0

2 1 1 . PRIMARY

2 2 2 . SATURATED

2 4 . ETHER 2 7 . ACETAL 2 8 . KETAL

— —

—

2

5* SEC• I N RING

- 2

NH-~

2

2 16* TERTIARY

17. C=C(NON~AROMATIC) 2 1 8 . CHR=CR2 19. METHYL— --- 2 2 0 . GEM DIMETHYL

2 3 . PYRROLE

•—

1

2 15* SECONDARY

2 1 . NITRO GROUP

3* 2-A9B-

- 2 1 2 . SECONDARY

1 3 , TERTIARY 14* AMINE-

-—

2

- 1. 1 2

2 1 2 5 . SATURATED - 1

1 2 6 . UNSATURATED-

1. Figure

7

104

COMPUTER-ASSISTED S T R U C T U R E E L U C I D A T I O N

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch007

T I T L E: A C T I N 0 B 0 L AMINE MOLECULAR FORMULA * C 9 H 1 5 N 1 0 3 0=C < R 0 3 0 5 0 0 > < c I-1222 > FRAGMENT(S) FRAGMENT(S) 0H NLKCILI 2 2 > FRAGMENT(S) F R A G M E N T ( S ) :CH3-CH--0H FRAGMENT(S)J 9 C O N S T R A I N T ( S ) : DOUBLE CONSTRAINTS) : TRIPLE C O N S T R A I N T S ) : SUBSTR F R A G M E N T ( S ) : O-C-0 9 MINIMUM: 0 MAXIMUM: o JBSTR CONSTRAINT(S) 0~C -N 9 FRAGMENTS) MINIMUM: 0 MAXIMUM: o C O N S T R A I N T S ) : SUBSTR F R A G M E N T S ) : CH3 9 MINIMUM: i MAXIMUM: I PERIODATE CONSTRAINTS) CONSTRAINTS) t 5 COMMAND: G E N E R A T E

Figure

8

5 STRUCTURES

GENERATED

QH

Figure

9.

Program

CASE-draw,

Arizona

State University,

actinobolamine

7.

SHELLEY E T

AL.

Interactive

Structure

C o r o n a t i n e i s a t o x i n produced by a m i c r o o r g a n i s m of the Pseudomonas genus. I t s s t r u c t u r e was r e p o r t e d e a r l y t h i s y e a r (13). A d e g r a d a t i o n p r o d u c t , c o r o n a f a c i c a c i d , p l a y e d a key r o l e , and a l t h o u g h some chemi c a l and s p e c t r o s c o p i c d a t a and t h e i r s t r u c t u r a l s i g n i f i c a n c e a r e r e p o r t e d i n the paper, the f i n a l d e t e r m i n a t i o n o f the d e g r a d a t i o n p r o d u c t was by x - r a y . We d e c i d e d t o see how c l o s e t o the a c t u a l s t r u c t u r e the r e p o r t e d c h e m i c a l and s p e c t r o s c o p i c i n f o r m a t i o n would have taken the a u t h o r s . The computer i n p u t i s shown i n F i g u r e 10. Coronafacic acid i s C i 2 H i 0 . I t contains a c y c l o p e n t a n o n e r i n g w i t h t h r e e r e a d i l y exchangeable hydrogen atoms, an u n s t r a i n e d a , $ - u n s a t u r a t e d c a r b o x y l i c a c i d moeity and an e t h y l group t h a t i s not p a r t o f a p r o p y l group. There a r e no a d d i t i o n a l m u l t i p l e bonds and o n l y a s i n g l e methyl group. CASE assembled 88 s t r u c t u r e s , t h u s , the c h e m i c a l and s p e c t r o s c o p i c e v i d e n c e b r o u g h t the a u t h o r s t o w i t h i n 88 s t r u c t u r e s o f the c o r r e c t one. One l a s t s i m p l e , but i n f o r m a t i v e example i l l u s t r a t e s one a p p l i c a t i o n o f the spectrum s i m u l a t o r ( F i g u r e 11). The monoterpene c i n e o l e , C i o H i 0 , was r e c e n t l y examined by l^C-NMR. o f f - r e s o n a n c e and broad-band p r o t o n d e c o u p l e d s p e c t r a r e v e a l q u a t e r n a r y carbon b e a r i n g e t h e r oxygen, a t l e a s t one methine carbon, two methylene carbons and two methyl c a r b o n s , and no u n s a t u r a t e d c a r b o n s . The 1 C-NMR e v i d e n c e i s c o m p a t i b l e w i t h 458 s t r u c t u r a l isomers a c c o r d i n g t o CASE. I f PEAK i s c a l l e d ( F i g u r e 12), the number o f CNMR s i g n a l s expected f o r each o f the 458 compounds i s predicted. Those s t r u c t u r e s not c o n f o r m i n g t o the observed number, 7 i n t h i s c a s e , are r e j e c t e d . In t h i s way the l i s t o f 458 s t r u c t u r e s i s pruned t o 38. Of the 38 s t r u c t u r e s , o n l y 5 conform t o the i s o p r e n e r u l e . Peak p r e d i c t i o n i s based on m o l e c u l a r t o p o l o g y , b u t the d e t e r m i n a t i o n o f c l a s s e q u i v a l e n c e i n t h i s case c o n s i d e r s o n l y n e i g h b o r i n g atoms no more than t h r e e bonds removed. S i n c e a p e r f e c t match between p r e d i c t i o n and o b s e r v a t i o n cannot be expected f o r each and e v e r y s t r u c t u r e examined by PEAK, the p r u n i n g s t e p of PEAK can compare the a c t u a l number of observed s i g n a l s t o a range o f p r e d i c t e d v a l u e s , g e n e r a l l y the a c t u a l number p l u s o r minus one. Thus, i f PEAK i s s e t a t 7 w i t h a range o f p l u s o r minus one, the l i s t o f 458 s t r u c t u r e s i s reduced t o 144. Of the 144, o n l y 19 comply w i t h the i s o p r e n e r u l e . 6

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch007

105

Elucidation

3

8

3

1 J

computer-assisted structure elucidation

T I T L E : CORONAFACIC ACID M O L E C U L A R FORMULA: C 1 2 H 1 A 0 3 FRAGMENT S ) : 1:C(=0)-CH-C~C-CH2-1<#> FRAGMENT ( S ) •* C H < R 0 3 0 5 0 0 X H 0 0 > = C < H 0 0 > - C < = 0 ) -OH FRAGMENT(S)t CH3-CH2 FRAGMENTS) : $ C O N S T R A I N T S ) : DOUBLE 0 0 C O N S T R A I N T S ) : SUBSTR FRAGMENT(S)* CH3 i MINIMUM: 1 MAXIMUM: i CONSTRAINTS) * 9 COMMAND: G E N E R A T E

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch007

8 8 STRUCTURES GENERATED Figure 10

T I T L E : CINEOLE MOLECULAR FORMULA: C1.OH 1 S01 F R A G M E N T S ) : C 0-C F R A G M E N T ( S ) J CH FRAGMENT < S ) * CH2 CH2 F R A G M E N T S ) * CH3 CH3 v C O N S T R A I N T ( S ) : DOUBLE 0 0 CONSTRAINTS): TRIPLE 0 0 CONSTRAINT(S)J * COMMAND: G E N E R A T E

Figure

11

458 S T R U C T U R E S G E N E R A T E D

T I T L E : CINEOLE M O L E C U L A R FORMULA: C10H1B01. F R A G M E N T ( S ) : C-0-C F R A G M E N T ( S ) J CH FRAGMENT ( S ) : CH2 CH2 COMMAND: G E N E R A T E

Figure

12

38

STRUCTURES GENERATED

7.

SHELLEY E T A L .

Interactive

Structure

Elucidation

107

Summary In summary CASE i s a h i g h l y i n t e r a c t i v e network o f computer programs f o r r e l i a b l y and e f f i c i e n t l y a s s i s t i n g the chemist i n t h e c o n v e r s i o n o f chemical and s p e c t r o s c o p i c d a t a t o m o l e c u l a r s t r u c t u r e . Comm u n i c a t i o n i s i n t h e c o n v e n t i o n a l language o f t h e c h e m i s t and program e x e c u t i o n i s s u f f i c i e n t l y r a p i d to make problem s o l v i n g a h i g h l y c o n v e r s a t i o n a l p r o cess. CASE i s d e s i g n e d t o grow and expand, and we a r e c o n f i d e n t i t w i l l be more p o w e r f u l tomorrow than i t i s today.

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch007

Literature 1.

2. 3.

4. 5.

6. 7. 8. 9. 10. 11. 12. 13.

Cited

S t e v e n s , C a l v i n L., Taylor, K . G r a n t , Munk, Morton E., M a r s h a l l , W. S., Noll, K l a u s , Shah, G . D., Shah, L . G . and U z u , K., J. Med. Chem. (1965), 8, 1. Munk, Morton E., Sodano, C h a r l e s S . , McLean, Robert L. and Haskell, Theodore H., J. Am. Chem. Soc. (1967), 89, 4158. Munk, Morton E., N e l s o n , Denny B., A n t o s z , F r e d e r i c k J., H e r a l d , Jr., D e l b e r t L . and H a s k e l l , Theodore H., J. Am. Chem. S o c . (1968), 90, 1087. N e l s o n , D. B., Munk, M. E., Gash, K . B . and H e r a l d , Jr., D. L., J. O r g . Chem. (1969), 34, 3800. A n t o s z , F . J., N e l s o n , D. B., H e r a l d , Jr., D. L. and Munk, M. E., J. Am. Chem. S o c . (1970), 92, 4933. N e l s o n , D. B . and Munk, M. E., J. O r g . Chem. (1970), 35, 3832. N e l s o n , D. B . and Munk, M. E., J. O r g . Chem. (1971), 36, 3456. Bognar, R . , Sztaricskai, F., Munk, M. E. and Tamas, J., J. O r g . Chem. (1974), 39, 2971. Woodruff, H . B . and Munk, M. E., J. O r g . Chem. (1977), 42, 0000. Woodruff, H . B . and Munk, M. E., Anal. Chim. A c t a / Computer T e c h n i q u e s and O p t i m i z a t i o n , i n p r e s s . S h e l l e y , C . A . and Munk, M. E., J. Chem. I n f . Comput. Sci. (1977), 17, 0000. Ireland, C., Stallard, M. O., F a u l k n e r , D. J., Finer, J. and C l a r d y , J., J. O r g . Chem. (1976), 41, 2461. Ichihara, A., Shiraishi, K., S a t o , H., Sakamura, S., N i s h i y a m a , K., S a k a i , R . , F u r u s a k i , A . and Matsumoto, T., J. Amer. Chem. S o c . (1977), 99, 636.

8 C H E M I C S : A Computer Program System for Structure Elucidation of Organic Compounds TOHRU YAMASAKI, HIDETSUGU ABE, YOSHIHIRO KUDO, and SHIN-ICHI SASAKI

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch008

Miyagi University of Education, Aoba, Sendai 980 Japan

There have been many articles concerned with computer programs for structure elucidation of organic compounds by analyzing chemical spectra. The methodologies and the techniques employed for this purpose can be classified into two categories, one i s the identification of unknown compounds by the retrieval method of f i l e d spectra (1,2) i s carried out and the other is the generation of structural formula based on the analytical results of spectral data and other chemical evidence (3,4,5). As reported previously, our integrated computer system for structure elucidation of organic compounds named CHEMICS stands mainly on the latter methodology (6). IR and H NMR spectral data of an organic compound are analyzed and plausible structural formula consistent with the analytical results are generated. Since generation of correct structure i s the major premise of this system, rather ample allowance for elucidation of partial structures is made during data analysis. Thus, an excessive number of candidate structures (informational homologues) are generated upon occasion. In order to prevent this undesirable situation, two different strategies are considered to be practical. They are; 1) Application of the f i l e retrieval method as a complement to the data analysis, and 2) introduction of other kinds of information sources and/or improvement of the spectral data analysis more precisely. The former solution has been already actualized as CHEMICS-F as shown in Fig. 1 (7). For the latter strategy, several t r i a l s have been made at our laboratory, for example, quantitative analysis of IR spectra(£0 , spectral simulation of NMR( 1

108

YAMASAKI

ET AL.

Structure

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch008

/

of Organic

M o l . F o r m . , NMR, IR, MS, UV

DATA FILE

SEARCH

Match ing Resul

Compounds

/

/

Plausible

109

/

ANALYSIS

'components'

STRUCTURE

GENERATOR

Candidate

Structure

/

/

Matching Result

/

OUTPUT

/

Figure

1.

Plausible Structure

Block diagram of CHEMICS-F.

/

Dashed arrow means off-line

routine.

110

COMPUTER-ASSISTED

STRUCTURE

ELUCIDATION

ALACON)(9), a n a l y s i s o f n u c l e a r double resonance d a t a (1H{1H}, NMDR)(10) and p r e d i c t i o n o f NMR s p e c t r a ( 11) . In t h i s paper we d e s c r i b e i n c o r p o r a t i o n o f NMR s p e c t r a l a n a l y s i s i n t o CHEMICS t o extend i t s capabilities. General

feature of 13Q NMR s p e c t r a l

data

analysis

R e c e n t l y , 13C NMR s p e c t r o s c o p y has been e f f e c t i v e l y employed f o r s t r u c t u r e e l u c i d a t i o n o f o r g a n i c compounds. Here we i n t e n d t o i n t r o d u c e t h e s p e c t r a l d a t a as a new i n f o r m a t i o n s o u r c e because o f i t s gene r a l l y a p p l i c a b l e nature. The e n t i r e system i s shown i n F i g . 2. The program f o r a n a l y s i s o f C NMR s p e c t r a ( ASSINC) i s composed o f t h e f o l l o w i n g f o u r elements as shown i n F i g . 2. a) DATA INPUT b) PRIMARY ANALYSIS c) SECONDARY ANALYSIS d) CHEMICAL SHIFT TABLE The i d e a o f ASSINC i s much t h e same as t h a t o f 1'H NMR d a t a a n a l y s i s o f t h e system CHEMICS (ASSIN) (6), i n which knowledge o b t a i n e d by a n a l y z i n g s p e c t r a l d a t a o f unknown compounds i s r e p r e s e n t e d as a group o f subs t r u c t u r e s named 'components . A c c o r d i n g t o t h i s i d e a , 189 k i n d s o f 'components' a r e p r e v i o u s l y d e f i n e d f o r the ASSINC as shown (part i a l l y ) i n T a b l e I , i n s t e a d o f t h e 179 'components' f o r the former e d i t i o n . Each 'component' i s d e f i n e d by i t s a d j a c e n t atoms and/or f u n c t i o n a l groups bonded with i t .

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch008

l 3

1

DATA INPUT. Input d a t a f o r 13c NMR d a t a a n a l y s i s c o n s i s t o f p o s i t i o n s and i n t e n s i t i e s o f e v e r y s i g n a l and t h e i r m u l t i p l i c i t i e s . We use the example o f s t r u c t u r e 1, C9H14O, whose spectrum i s shown i n F i g . 3. Both c a r d and paper t a p e image d a t a a r e O^x^vy acceptable. Even i f the m u l t i p l i c i t i e s T j a r e n o t a v a i l a b l e , the ASSINC can a n a l y z e t h e r e s t o f t h e d a t a and w i l l o f f e r u s a b l e I answers f o r s u c c e s s i v e r o u t i n e s . But i n such a c a s e , some a m b i g u i t i e s c o u l d not be 1 avoidable. PRIMARY ANALYSIS. The b l o c k diagram o f the p r i m a r y a n a l y s i s r o u t i n e i s shown i n F i g . 2. As shown i n this f i g u r e , i t c o n s i s t s o f two major p a r t s . One i s a l l o c a t i o n o f carbons t o each s p e c t r a l s i g n a l and t h e o t h e r i s e x a m i n a t i o n o f t h e p r e s e n c e o f 'components'.

8.

YAMASAKI E T A L .

Structure

of Organic

Data

111

Compounds

analysis

IR,

lH

NMR

!3c

NMR

Spectrum

off-resonance

/

multiplicity

>

primary

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch008

Allocation

Selection

||

of

of

carbon

'components'

Making

set

of

by

analysis

atoms

chemical

shift

table

'components

secondary

analysis

Structure

Figure

NO

POSITION(ppm)

2.

Flow

chart of C

INTENSITY

13

generation

NMR

spectral

data

analysis

MULTIPLICITY

1

24.4

1679

Q

2

28.3

4549

Q

3

33.5

895

4

45.2

2380

5

50.8

2119

6

125.4

2494

7

159.9

1084

199.2

861

S T T D S S

Figure

3.

C

13

NMR data of pound 1

com-

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch008

112

COMPUTER-ASSISTED S T R U C T U R E E L U C I D A T I O N

A l l o c a t i o n o f c a r b o n s . The f i r s t s t e p o f t h e p r i m a r y a n a l y s i s i s t h e a l l o c a t i o n o f t h e p r o p e r number o f carbons t o each s i g n a l . However, i t must be emphasized t h a t t h e p r o c e s s i s n o t aimed a t o b t a i n i n g the e x p l i c i t s o l u t i o n f o r a l l c a s e s , b u t g a t h e r i n g as much u s e f u l i n f o r m a t i o n as p o s s i b l e . I t i s well known t h a t s i g n a l i n t e n s i t i e s a r e n o t always p r o p o r t i o n a l t o t h e carbon numbers c o n t r i b u t e d t o t h e s i g n a l s i n 13Q NMR s p e c t r a , m a i n l y because o f t h e p r e s ence o f n u c l e a r Overhauser e f f e c t ( N O E ) ( 1 2 ) . However, i t can be assumed t h e s i g n a l i n t e n s i t i e s o f p r o t o n a t e d carbons a r e p r o p o r t i o n a l t o the amount o f carbons because o f t h e i r almost complete enhancement a c c o r d i n g to t h e NOE. The a l l o c a t i o n o f carbon numbers i s based on t h i s assumption. The b l o c k diagram o f t h e r o u t i n e f o r t h e a l l o c a t i o n o f carbons i s shown i n F i g . 4. By u t i l i z i n g t h e m u l t i p l i c i t y d a t a , the i n p u t s i g n a l s a r e c l a s s i f i e d i n t o two c a t e g o r i e s , namely, s i g n a l s a s s i g n e d t o p r o t o n a t e d carbons and t h o s e which are a s s i g n e d t o n o n - p r o t o n a t e d c a r b o n s . Allocation of carbons f o r t h e s i g n a l s i s performed s e p a r a t e l y f o r each c a t e g o r y . At f i r s t , t h e a l l o c a t i o n i s t r i e d f o r s i g n a l s assigned t o protonated carbons. Then t h e amount o f carbons (AOC) c o r r e s p o n d e d t o t h i s c a t e g o r y i s l i m i t e d i n t h e range o f R^ t o R2 d e f i n e d by e q u a t i o n ( 1 ) . , , , , w h o l e c a r b o n numbers^ V of the molecule )

(

1

/number o f s i g n a l s \ f _ \ V p r o t o n a t e d carbon / n Q n

(1) 2

_/number o f s i g n a l s a s s i g n e d \ ^ t o p r o t o n a t e d carbons /

A f t e r e s t i m a t i o n o f t h e AOC, t h e number o f carbons f o r v a l u e o f each s i g n a l (CNS) i s e v a l u a t e d by means of t h e e q u a t i o n (2) and a s e t o f t h e CNS v a l u e s i s o b t a i n e d w i t h r e s p e c t t o each AOC v a l u e . However, i f any one o f t h e CNS v a l u e i n t h e s e t i s g r e a t e r than 0.3 and l e s s than 0.7, t h e s e t i s abandoned t o a v o i d an error. AOS CNSi = I N T i / I

(INT) j * AOC

CNSi: carbon number a l l o c a t e d t o s i g n a l " i " INTi: i n t e n s i t y of signal " i " AOC : amount o f c o r r e s p o n d i n g carbons

(2)

YAMASAKI E T AL.

Structure

of Organic

c

D for evaluation for

of

AOC

protonated

of

CNS

estimation

of

CNS

to

category

range

calculation sets

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch008

Compounds

e a c h AOC

sets

for

evaluat ion

non-protonated

category

o f AOC

c o r r e s ponded

to

p r o t o r a t e d AOC

i estimat ion

Figure

4.

of

CNS

Procedure for the allocation to each signal

of

carbons

114

COMPUTER-ASSISTED

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch008

AOS

: amount o f c o r r e s p o n d i n g

STRUCTURE

ELUCIDATION

signals

The a l l o c a t i o n p r o c e s s f o r t h e s i g n a l s a s s i g n e d t o n o n - p r o t o n a t e d carbon i s t h e f o l l o w i n g s t e p . At t h i s s t a g e , t h e AOC v a l u e i s e s t i m a t e d i n t h e b a s i s o f r e m a i n i n g carbons which a r e n o t consumed a t p r e c e e d i n g stage. As t h e r e s u l t o f s o l v i n g t h e e q u a t i o n (2) , the s e t s o f CNS v a l u e s w h i c h c o r r e s p o n d t o non-protona t e d carbons a r e o b t a i n e d . Here, i t i s assumed t h a t the weakest i n t e n s i t y o f t h e s i g n a l i s s h a r e d w i t h a u n i t number(1,2,3,...) o f c a r b o n s . Consequently, a l l o c a t e d numbers, namely, a s e t o f e n t i r e CNS i s a c q u i r e d f o r each i n p u t s i g n a l . I f t h e r e i s more than one s o l u t i o n f o r t h i s problem, any one o f them c o u l d be chosen as a c o r r e c t s e t o f a l l o c a t e d numbers t o t h e signals. The a p p l i c a t i o n o f t h e p r o c e d u r e t o t h e spectrum o f compound 1 i s d e s c r i b e d below. The i n p u t s i g n a l s shown i n F i g . 3 a r e c a l s s i f i e d i n t o e i t h e r p r o t o n a t e d o r n o n - p r o t o n a t e d c a t e g o r y where s i g n a l s number 1,2,4, 5 and 6 a r e grouped i n t o t h e former and 3,7 and 8 a r e grouped i n t o t h e l a t t e r . Through t h e p r o c e d u r e o f p r o t o n a t e d c a t e g o r y t h e AOC i s a p p r a i s e d as 5 and 6 because i s c a l c u l a t e d as 6 ( 9 - 3 ) and R2 i s e q u a l t o 5. The c o r r e s p o n d i n g s e t s o f t h e CNS a r e shown below where each i n t e g e r v a l u e e n c l o s e d by p a r e n t h e s i s i s a l l o c a t e d number o f c a r b o n s . signal number

1

2

4

5

6

A0C=5

0.63 (*)

1.72 (2)

0.90 (1)

0.80 (1)

0.94 (1)

AOC=6

0.76 (1)

2.06 (2)

1.08 (1)

0.96 (1)

1.13 (1)

S i n c e i t i s i m p o s s i b l e t o a l l o c a t e carbons t o s i g n a l number 1 a t t h e f i r s t s e t , t h i s s e t i s abandoned. T h e r e f o r e o n l y one s o l u t i o n i s d e r i v e d from the case where t h e AOC i s e q u a l t o 6. A t the f o l l o w i n g s t a g e , t h e AOC f o r n o n - p r o t o n a t e d c a t e g o r y i s f i x ed t o 3, and so each r e s i d u a l s i g n a l must be a l l o c a t e d t o one carbon i n d i v i d u a l l y . The f i n a l r e s u l t o f a l l o c a t e d number i s as f o l l o w s : signal number allocated number

1 1

2 2

3 1

4 1

5 1

6 1

7 1

8 1

8.

YAMASAKI

ET AL.

Structure

of Organic

Compounds

115 1

E x a m i n a t i o n o f t h e p r e s e n c e o f 'components . Now we have c o n f i r m e d two k i n d s o f i n f o r m a t i o n about a g i v e n C NMR s p e c t r a l d a t a . They a r e t h e amount o f carbons a s s i g n e d t o each s i g n a l and n a t u r e o f carbons (protonated o r non-protonated). By c o n s i d e r i n g t h e i n f o r m a t i o n , t h e p o s s i b l e p r e s e n c e o f each 'component' i s examined and t h o s e which a r e i n c o n s i s t e n t w i t h t h e i n f o r m a t i o n a r e abandoned. The p r e s e n c e o f each 'components' i s judged t o be a p p r o p r i a t e by i t s c h e m i c a l s h i f t range ( r e f e r t o T a b l e I ) , i n o t h e r words, i f t h e r e a r e no s i g n a l s w i t h i n a c h e m i c a l s h i f t range c o r r e s p o n d i n g t o a 'component', i t i s judged t o be n o t p r e s e n t i n a sample compound. As shown i n F i g . 5, twenty-nine components s u r v i v e for compound 1., through t h e p r i m a r y a n a l y s i s . The r e s u l t o f t h e p r i m a r y a n a l y s i s i s r e p r e s e n t e d by the m a t r i x named NM m a t r i x , i n which each row i s c o r r e sponding t o a s u r v i v e d 'component' and each column t o each s i g n a l o f t h e g i v e n 13c NMR spectrum. Each mat r i x element i n d i c a t e s maximum number o f t h e carbons f o r 'component' a s s i g n e d t o t h e c o r r e s p o n d i n g s i g n a l . Those elements w i t h v a l u e -1 i n d i c a t e c o r r e s p o n d i n g 'components' were n o t a s s i g n e d t o t h e s i g n a l s .

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch008

1 3

SECONDARY ANALYSIS. A t t h e f i r s t s t e p o f t h i s r o u t i n e , a s e t o f 'components' which i s c o n s i s t e n t w i t h t h e m o l e c u l a r f o r m u l a i s s e l e c t e d from s u r v i v e d 'components'. One o f t h e f i v e s e t s which was f i n a l l y g e n e r a t e d f o r compound 1_ i s shown i n F i g . 6. As d e s c r i b e d b e f o r e , each o f t h e s i g n a l s i s t r e a t e d as i f i t were independent o f t h e o t h e r s and the 'components' which can be a s s i g n e d t o a t l e a s t one s i g n a l s u r v i v e w i t h o u t any f u r t h e r e x a m i n a t i o n a t t h e primary a n a l y s i s . However, i t i s n e c e s s a r y t o examine whether t h e s e t i s c o n s i s t e n t w i t h t h e g i v e n spectrum o r n o t , i n o t h e r words, each o f a l l 'components' o f t h e s e t s h o u l d be c o n f i r m e d whether they a r e f u l l y consistent w i t h t h e i n p u t spectrum w i t h n e i t h e r excess n o r d e f i ciency. To make t h i s e x a m i n a t i o n , t h e s e l e c t i v e NM m a t r i x i s made f o r t h e s e t by e x t r a c t i n g t h e rows c o r r e s p o n d i n g t o s e l e c t e d 'components' from NM mateix shown i n F i g . 5. T h i s s e l e c t i v e m a t r i x i s shown i n F i g . 6. As shown i n F i g . 7 , t h i s m a t r i x N i s c o n v e r t e d i n t o another m a t r i x X by s u b s t i t u t i n g t h e p o s i t i v e elements by v a r i a b l e s ( x ^ j ) and t h e n e g a t i v e elements by z e r o s . A s e t o f simultaneous l i n e a r equations i s made from X and two c o n s t n a t v e c t o r s C and D, r e p r e -

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch008

COMPUTER-ASSISTED STRUCTURE ELUCIDATION

NO

CMP

SUB/STRUCTURE

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29

10 11 12 14 17 33 38 40 106 107 108 109 118 143 144 145 146 153 172 173 174 175 177 182 184 185 186 187 188

GEM-DI M E T H Y L - ( D ) GEM-DI M E T H Y L - ( T ) GEM-DI M E T H Y L - ( C ) CH3-CO(Y) (T) CH3-COCH3(D) CH3COCD) CH3CO(C) -CH2(C)(K) -CH2(C)(D) -CH2(C)(T) -CH2(C)(C) -CH=<0LEFIN> :C= < 0 L E F I N > =C= =C= FURAN(O) -0-CO(C)(D) -CO(C)(T) -CO(C)(C)

NM

o=c= Y Y C C C C

(0) (C) (Y) (K) (D) (T) (C)

c

SAMPLE

MATRIX

1 1 1 1 1 1 1 1 -1 -1 -1 -1 -1 -1 -1 -1

2 2 2 2 2 2 -1 2 -1 -1 -1 -1 -1 -1 -1 -1

-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

-1 -1 -1 -1 -1 -1 -1 -1 1 1 1 1 -1 -1 -1 -1

-1 -1 -1 -1 -1 -1 -1 -1 1 1 1 1 -1 -1 -1 -1

-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1

-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1

-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 1 1

-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

-1 -1 -1 -1 -1 -1 1 1 1 1 1

-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

-1 -1 -1 -1 1 1 -1 -1 -1 -1 -1

1 1 1 1 -1 -1 -1 -1 -1 -1 -1

X

1 J O B END

Figure

5.

Survived

components

of compound analysis

1 through

C

13

NMR

data

8.

YAMASAKI E T A L .

Structure

of Organic

number 1

12

GEM-DIMETHYL-

(C)

2

33

CH 3

number

of

of

selective

carbons

'components'

117

Compounds

2#

NM m a t r

-1

-

(D)

-1

-

1

3

106

-CH -

(C)(K)

4

107

-CH -

(C)(D)

5

118

-CH=

<0LEFIN>

-1

6

143

-C=

<0LEFIN>

-1

-

7

172

-CO-

(C)(D)

-1

-

8

188

-c-

(C)

-1

-

1

1

2

2

1

( 1 2

-

1

allocation

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch008

Only methyl

carbon

Figure

f

is

considered

Selective

6.

components

1

2 -1 -1 -1 -1 -1 -1 >

1

2 -1 -1 -1 -1 -1 -1

-1 -1 -1

1

1 -1 -1 -1

-1 -1 -1

1

1 -1 -1 -1

-1 -1 -1 -1 -1

-1 -1 -1 -1 -1 -1

s

1 -1

a

)

a

n

1

1

1

1

1

1

1 )

d:

( 1

of

1

0 0 0 0 0 0

0 0 0 0 0 0 X 0 0 0 0 0 0 83 0

0 0 0 0\ 0 0 0 0 0 0 0 X 0 0 0 0 0 0 0 0 67 0 0 0 0 78 0 0 0 0 )

2

1

1

x:

0 0 0 0 0 0

1

group

for the fifth set of compound

1 -1 -1 -1 -1

1

( 2

gem-dimethyl

1 -1 -1

-1 -1 -1 -1 -1 -1 -1 -1 -1

in

1

number

carbons #

1

X

1

X

X

1

1

1 )

d

N' X' O D mean s e l e c t i v e NM m a t r i x , s e l e c t i v e NM m a t r i x r e p l a c e d by XJLj, m o d i f i e d 'component' v e c t o r a n d a l l o c a t i o n v e c t o r , r e s p e c t i v e l y .

r x •i = c ^ b)

I

•

X

representation o f simultaneous v e c t o r having e i g h t elements.

Figure

7.

Representation

of

=

D

linear

equations

simultaneous linear pound 1

equations

where I means u n i t row

for

the

fifth

set of

com-

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch008

118

COMPUTER-ASSISTED

STRUCTURE

ELUCIDATION

s e n t i n g carbon numbers i n t h e 'components' and a l l o c a t e d carbon numbers, r e s p e c t i v e l y . The number o f e q u a t i o n s i s t h e number o f 'components' i n t h e s e t plus that of the s i g n a l s . The e q u a t i o n s have a r e s t r i c t i o n , t h a t t h e v a r i a b l e x^- s h o u l d n o t exceed t h e range between z e r o and t h e v a l u e o f the c o r r e s p o n d i n g s e l e c t i v e m a t r i x element. To s o l v e t h e s e s i m u l t a n e o u s e q u a t i o n s i s t h e major function of t h i s routine. When no s o l u t i o n i s o b t a i n e d , the s e t i s judged t o be i n a p p r o p r i a t e one, and when a s o l u t i o n i s g i v e n , the s e t i s s e n t t o t h e f o l l o w i n g r o u t i n e (the s t r u c ture generator). At the f i n a l stage o f the s p e c t r a l a n a l y s i s , f i v e s e t s o f components which a r e g e n e r a t e d from twentyn i n e components a r e s e l e c t e d as p l a u s i b l e ones f o r compound 1. F i v e s e t s a r e shown as f o l l o w s , numera l i n p a r e n t h e s i s e x p r e s s e s number o f t h e component; NO.

1

10 (1), 38 (1), 107 (1), 109 (1), 118 (1), 143 (1), 189 (1)

NO.

2

10 (1), 40 (1), 106 (1), 107(1), 118 (1), 143(1), 189 (1)

NO.

3

12 (1), 38(1), 107 (2), 118(1), 143 (1), 189 (1),

NO.

4

10 (1), 33 (1), 106 (1), 109 (1), 118 (1), 143 (1), 172 (1), 189(1)

NO.

5

12 (1), 33 (1), 106 (1), 107 (1), 118 (1), 143 (1), 172 (1), 189 (1)

The o v e r a l l p r o c e s s t h a t 189 components a r e r e duced i n t o 29 by means o f t h e e x a m i n a t i o n o f m o l e c u l a r f o r m u l a f o l l o w e d by t h e s u c c e s s i v e a n a l y s e s o f IR, 1H NMR and C NMR i s shown i n F i g . 8. In f i g . 8, num e r a l s 10 8, 105, 59 and 29 i n p a r e n t h e s e s i n d i c a t e t h e amounts o f s u r v i v e d components by s u c c e s s i v e restrict i o n s o f m o l e c u l a r f o r m u l a , IR, J-H NMR and 13c NMR, respectively. Only f i v e s e t s o f components u n c o n t r a d i c t o r y w i t h m o l e c u l a r f o r m u l a and g i v e n NMR spectrum a r e p i c k e d up from t h e s e twelve components. Finally, the s t r u c t u r e generator(L3) i s a p p l i e d t o generate the s t r u c t u r e s from each s e t o f components so t h a t 3, 1, 2, 3 and 3 s t r u c t u r e ( s ) produced f o r s e t s , 1, 2, 3, 4 and 5. These s t r u c t u r e s a r e shown i n F i g . 9 as i n f o r m a t i o n a l homologues f o r t h e i n p u t m o l e c u l a r f o r m u l a and c h e m i c a l s p e c t r a . The u n d e r l i n e d one i s t h e s t r u c t u r e o f t h e compound 1. 1 3

PREPARATION OF CHEMICAL SHIFT TABLE. A c h e m i c a l s h i f t ranges f o r a s i g n a l o f a 'component' was d e t e r mined f o r the a n a l y s i s d e s c r i b e d i n the previous

YAMASAKI E T A L .

Structure

of

Organic

119

Compounds

189 COMPONENTS Molecular Formula ( C

9

H

1 4

^

0 )

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 21 22 23 24 25 27 29 30 31 32 33 34 38 39 40 41 42 43 44 45 46 49 52 53 55 57 58 59 60 61 67 71 76 79 80 82 84 85 86 87 88 90 92 93 99 100 101 102 104 105 106 107 108 109 110 113 114 115 116 117 118 126 127 136 137 138 139 140 141 142 143 144 145 146 153 165 172 173 174 175 176 177 182 183 184 185 186 187 188 (108)

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch008

IR S p e c t r a l Data 1 2 3 4 5 6 7 8 9 10 11 12 13 22 23 24 25 27 29 30 31 32 33 34 38 39 49 52 53 55 57 58 59 60 61 67 71 76 79 90 92 93 99 100 101 102 104 105 106 107 108 109 137 138 139 140 141 142 143 144 145 146 153 165 172 184 185 186 187 188 (105)

14 40 80 110 173

15 16 17 18 19 21 41 42 43 44 45 46 82 84 85 86 87 88 114 115 116 117 118 136 174 175 176 177 182 183

iH NMR S p e c t r a l Data 7 9 10 11 12 13 14 15 16 17 18 33 34 38 39 40 41 43 44 45 46 67 71 76 79 80 85 86 87 88 104 106 107 108 109 116 117 118 141 142 143 144 145 146 153 165 172 173 174 175 176 177 182 183 184 185 186 187 188 (59) 1 3

C NMR S p e c t r a l Data

5»»

10 11 12 14 17 33 38 40 106 107 108 109 118 143 144 145 146 153 172 173 174 175 177 182 184 185 186187 188 (29)

selected 'components' 10 12 33 38 40 106 107 109 118 143 172 189

generated structures

set o f • components' #2 #4 #3

#1 1 0 0 1 0 0 1 1 1 1 0 1

( Figure

8.

Feature

1 0 0 0 1 1 1 0 1 1 0 1

0 1 0 1 0 0 2 0 1 1 0 1

1 0 1 0 0 1 0 1 1 1 1 1

0 1 1 0 0 1 1 0 1 1 1 1

1

2

3

3

1

3

#5

1 1 )

of reducing the number of components analyses of compound 1

through

consecutive

120

COMPUTER-ASSISTED

STRUCTURE

ELUCIDATION

s e c t i o n i n t h e f o l l o w i n g way. The components which c o n t a i n carbon atoms a r e 177 o u t o f e n t i r e 189. F o r t h o s e 'components , t h e i r c h e m i c a l s h i f t v a l u e s i n v a r i o u s k i n d s o f compounds were c o l l e c t e d from s e v e r a l s o u r c e s ( 1 4 , 1 5 , 1 6 ) . The c o l l e c t e d d a t a f o r 'component no.25 o f m e t h y l c a r b o n s , as an example, a r e shown i n F i g . 10. By u s i n g t h e s e d a t a , the c h e m i c a l s h i f t range f o r t h e 'component' i s o b t a i n e d as f o l l o w s . 1

1

1

1

i.

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch008

ii.

An assumed r e g i o n o f the mean v a l u e ( y ) i s c a l c u l a t e d by means o f common s t a t i s t i c a l procedure. An a r b i t r a r y v a l u e up.

(y')

i n the r e g i o n i s p i c k e d

iii.

The s t a n d a r d d e v i a t i o n ( a ) lated.

iv.

Whether a l l t h e c o l l e c t e d d a t a f o r the 'component' a r e w i t h i n t h e range between TT' - 3 a t o I T ' + 3 a i s examined.

v.

y~' i s c a l c u -

f o r the

I f n o t , the TT' i s updated and p r o c e d u r e s i i i and i v a r e r e p e a t e d , i f i t i s , the v a l u e s I T ' - 3 a and y ' + 3 a a r e determined as the upper and lower l i m i t s o f the s h i f t o f the 'component' r e spectively. -

The assumed r e g i o n o f mean v a l u e o f component 25 was c a l c u l a t e d as 19.15 - 24.48ppm based on v a r i o u s k i n d s o f d a t a s o u r c e s as shown i n F i g . 10. Here, an a p p a r e n t mean v a l u e o f t h e s e c o l l e c t e d d a t a i s 21.8ppm and t h i s i s an i n i t i a l v a l u e o f y"' . Some d a t a o f samples a r e o f t e n o u t o f the normal G a u s s i a n d i s t r i b u t i o n , t h e r e f o r e s t a n d a r d d e v i a t i o n has t o be c o n s i d e r e d s e p a r a t e l y i n h i g h e r magnetic f i e l d ( a ) and lower magnetic f i e l d ( C L ) compared w i t h y ' , f o r d e t e r m i n a t i o n o f the s t a n d a r d d e v i a t i o n f o r y . The y ' i s renewed by ' f l i p - f l o p ' u n t i l l y ' - 3 a and y* + 3 a can i n c l u d e the whole sampling d a t a . In case o f component 25, mean v a l u e i s f i n a l l y found o u t t o be 21.4ppm, when a =2.05 and a =1.39. The upper and lower l i m i t s o f the s h i f t determined a c c o r d i n g t o t h i s manner i s 15.21 - 25.53ppm which i s r e g i s t e r e d i n T a b l e I. T h i s p r o c e d u r e i s a p p l i e d t o a l l 'components' and the c h e m i c a l s h i f t t a b l e i s o b t a i n e d as shown i n T a b l e I. H

1

H

R

Result and

The

1

L

L

Discussion

r e s u l t o b t a i n e d f o r twenty two

compounds by

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch008

8.

YAMASAKI E T A L .

Structure

of Organic

121

Compounds

obtained shift

chemical

range

assumed r e g i o n ^

mean

of

value

-

"

CH

> ( C ) 3

iteration sample

•

mean

I

15.0

Figure

1 '

10.

' —

1

— I —

1

20.0

Estimation

—

1

—

1

—

1

—

1

—

1

—

1

1—

t i m e s = 46

amount

value

=

21.4

1

25.0

chemical

of C NMR chemical shift range of component 13

= 32

shift

#25

122

COMPUTER-ASSISTED

Table I

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch008

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 #

ELUCIDATION

Components and t h e i r appearance range of T3c NMR chemical

NO

STRUCTURE

shift

COMPONENT TERT-BUTYLTERT-BUTYLTERT-BUTYLTERT-BUTYLTERT-BUTYLTERT-BUTYLGEM-DIMETHYLGEM-DIMETHYLGEM-DIMETHYLGEM-DIMETHYLGEM-DIMETHYLGEM-DIMETHYLCH3-COCH3-COCH3-COCH3-COCH3-COCH3-COISO-PROPYLISO-PROPYLISO-PROPYLISO-PROPYLISO-PROPYLISO-PROPYLISO-PROPYLCH30CH30CH30CH30CH30CH30CH3CH3CH3CH3C0CH3C0CH3COCH3COCH3COCH3CO-

SHIFT (0># (Y) (K) (D) (T) (C) CO) (Y) (K) CD) (T) (C) (0) (Y) (K) (D) (T) (C) (0) (A) (Y) CIO CD) CT) CO CO) CY) CK) CD) CT) CO CY) CD) CT) CO) CY) CO CD) CT) CO

RANGE (ppm)

26,02 *«•* 31.13 24.47 * * * * 33,57 2 5 . 4 8 *•»» 3 4 , 0 4 2 8 . 2 3 * * * * 36.78 25.48 **** 34.04 23.65 * * « * 32.97 27.42 * * * * 32.95 10.72 36.27 36.27 10.12 14.72 36.27 36.27 10,12 6.80 * * * * 3 2 . 6 1 4,58 * * * * 3 2 . 0 1 4 . 5 8 *#•* 3 2 , 0 1 5.25 ***» 1 5 . 5 0 10.43 21.53 4.58 32.01 9.92 12.97 25.83 15.09 16.63 25.83 25.45 20.95 15,09 **** 23.87 16.33 25.83 15.09 25.83 15.21 25.53 52.88 61.61 54.59 57,92 50.34 52.53 56.68 * * « * 61.51 52.88 61.51 60.60 49.95 7.26 26.10 7.06 • ••• 3 3 . 0 8 -2.49 **** 8.49 19.81 * * * * 23.39 22.95 31.79 22,95 33.92 22.22 28,15 8.49 -2,49 2 0 . 8 0 #*»» 3 0 . 0 1

means t h e a d j a c e n t atom o r f u n c t i o n a l g r o u p , t h e y a r e , s a t u r a t e d oxygen (O), a r o m a t i c c a r b o n ( Y ) , c a r b o n y l carbon(K), o l e f i n i c carbon(D), a c e t y l e n i c carbon(T), and s a t u r a t e d c a r b o n ( C ) , r e s p e c t i v e l y .

8.

YAMASAKI E T A L . Table

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch008

no. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 *1 22

II

Structure

Results obtained f o r

a-Methyltetrahydrofuran p-Quinone 2-Methylpentane 3-Methylpentane 2,3-Dimethylbutane 3-Heptanone 2-Heptanone m-Xylene E t h y l benzene Cyclohexylacetate 2-0ctanol Coumarine Isophorone Diisobutylketone n-Nonanol Dicyclopentadiene Verbenone Camphor n-Decanol 2-Cyclohexylcyclohexanone 3-Ionone Methyl m y r i s t a t e

III

Results

several

5 6 6 6 6 7 7 8 8 8 8 9 9 9 9 10 10 10 10 12 13 15

obtained

compounds by CHEMICS

10 4 14 14 14 14 14 10 10 14 18 6 14 18 20 12 14 16 22 20 20 30

by

123

Compounds

through

molecular C H

compound

Table

of Organic

number o f I H _ I R , I H ' N M R " through I R , H N M R , analysis analysis 1

1 2 1 1 1 2 1 21 5 1 1 116 12 1 1 41 42 75 1 147 481 1

10 589 3 3 4 3 4 40 5 161 38 834 1895 30 24 1729 53274< 3253 50 2109 57827< 4767

utilizing

various

combinations

of

information

sources

compound

C

H

analytical

0

mode*

number o f ' i n f o r m a t i o n a l homologues'

P

3

6

C 7

3-Heptanone

14

3 2

P+C

1

C+0

2

P+C+0

38

P 2-0ctanol

8

18

41

C P+C

1

13 1

C+0

1

P+C+0

1895

P Isophorone

*

P, C,

9

and 0

off-resonance

mean

14

12

P+C C+0 P+C+0

1

an a n a l y s i s

spectra,

27

C

of

iH

respectively.

27

12

NMR,

13c

NMR

and/or

1 3

CNMR

124

COMPUTER-ASSISTED

STRUCTURE

ELUCIDATION

means o f t h e o l d system and new system a r e p r e s e n t e d i n Table II. The c o r r e c t s t r u c t u r e has been always gene r a t e d among t h e p l a u s i b l e s t r u c t u r e s . The numbers o f i n f o r m a t i o n a l homologues o b t a i n e d by means o f ASSINC a r e reduced t o 19.7 p e r c e n t ( i n s i m p l e average) o r 3.1 p e r c e n t ( i n weighted average) o f those o b t a i n e d by means o f t h e o l d system where o n l y IR and NMR d a t a were a n a l y z e d . As a r e s u l t o f t h e a d d i t i o n o f C NMR s p e c t r a l d a t a a n a l y s i s , i t becomes p o s s i b l e t o d e c r e a s e remarkably t h e numbers o f i n f o r m a t i o n a l homologues. F o r example, t h e number was reduced from 4767 t o one f o r compound 20_ as shown i n T a b l e I I . T a b l e III shows t h e number o f i n f o r m a t i o n a l homologues o f s e v e r a l compounds o b t a i n e d by u t i l i z i n g v a r i o u s c o m b i n a t i o n s o f i n f o r m a t i o n s o u r c e s , namely, 1H NMR, 13c NMR, l H NMR p l u s 13c NMR, 13c NMR p l u s i t s o f f - r e s o n a n c e d a t a , and 1H NMR p l u s 13c NMR p l u s o f f resonance d a t a . As shown i n t h i s t a b l e , t h e number o f i n f o r m a t i o n a l homologues and t h e number o f t h e component s e t s a r e b o t h d e c r e a s e d i n a c c o r d a n c e w i t h t h e a d d i t i o n o f new i n f o r m a t i o n s o u r c e s . In c o n c l u s i o n , t h e number o f t h e ' i n f o r m a t i o n a l homologues' and t h e 'component' s e t s a r e s a t i s f a c t o r i l y reduced by c o n s e c u t i v e a n a l y s e s . As mentioned above, the e f f o r t s t o reduce t h e e x c e s s i v e 'components' b e a r good f r u i t s , i . e . , t h e number o f t h e produced s e t s a r e l e s s than t e n f o r a l l c a s e s . Therefore, the informat i o n about t h e c o n e c t i v i t i e s between a l l t h e 'components ' i n a s e t become i m p o r t a n t d a t a t o i n c l u d e i n a f u t u r e system. That k i n d o f i n f o r m a t i o n w i l l work e f f e c t i v e l y t o r e duce t h e e x c e s s i v e ' i n f o r m a t i o n a l homologues' and n u c l e a r magnetic resonance t e c h n i q u e s w i l l g i v e such information.

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch008

1 3

T h i s work was s u p p o r t e d i n p a r t by a S c i e n t i f i c R e s e a r c h Grant from t h e M i n i s t r y o f E d u c a t i o n , Japan.

(1) (2) (3) (4) (5) (6)

Literature Cited Schwarzenbach,R.,Meili,J.,Koenitzer,H. and Clerc, J.T., Org. Mag. Resonance, (1976),8,11 Bremser,W.,Klier,M. and Meyer,E., ibid, (1975), 7,97 Carhart,R.E.,Smith,D.H.,Brown,H. and Djerassi,C., J . Am. Chem. Soc., (1975), 97, 5755 Beech,G.,Jones,R.T. and M i l l e r , K . , Anal. Chem., (1976), 46, 714 Gray,N.A.B., ibid, (1975), 47, 2426 Sasaki,S. et a l , Mikrochimica Acta(Wien), (1971), 726

8.

YAMASAKI E T A L .

(7)

(8) (9) (10) (11) (12)

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch008

(13) (14) (15) (16)

Structure

of Organic

Compounds

125

S a s a k i , S . , CHEMICS-F in " I n f o r m a t i o n C h e m i s t r y " , p227, The U n i v e r s i t y o f Tokyo P r e s s , T o k y o , 1975, and the detail o f CHEMICS-F will be r e p o r t e d i n the near f u t u r e . M i y a s h i t a , Y . and Sasaki,S., Jpn.Chem.Soc. Meeting, (1975), I, 174 Y a m a s a k i , T . and Sasaki,S., Jpn. Anal., (1975),213 unpublished Ochiai,S.,Hirota,Y.,Kudo,Y. and Sasaki,S., Jpn. Anal., (1973), 22, 399 Stother,J.B., "Carbon-13 NMR S p e c t r o s c o p y " , Academic P r e s s , New Y o r k , 1972 K u d o , Y . and Sasaki,S., J.Chem. I n f . Comput. Sci., (1976), 16, 43 B e a c h , L . B . , "API 44 S e l e c t e d 13CNMR S p e c t r a l Data" API Research P r o j e c t 44 Publication, T e x a s , 1975 N u c l e a r M a g n e t i c Resonance S p e c t r a l S e a r c h System, NIH/EPA, USA J o h n s o n , L . F . and J a n k o w s k i , W . C . , "Carbon-13 NMR S p e c t r a " , W i l e y I n t e r s c i e n c e , New Y o r k , 1972

9 Computer Assistance for the Structural Chemist

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch009

RAYMOND E. CARHART—Department of Computer Science, Stanford University, Stanford, CA 94305 TOMAS H. VARKONY—Department of Chemistry, Stanford University, Stanford, CA 94305 DENNIS H. SMITH—Department of Genetics, Stanford University, Stanford, CA 94305 Elucidation of unknown molecular structures can be thought of as a process of systematic posing, testing and rejection of hypotheses. Each hypothesis i s of course a partial or complete structure which i s evaluated in light of available evidence. Chemists perform these tasks quite well if not completely systematically. We seek computer programs which emulate processes of reasoning about chemical structure to save time, to stimulate the chemist's thinking about an unknown and to guarantee that no plausible alternatives have been overlooked. There are several areas of structure elucidation which are amenable to some degree of computer assistance. Computer techniques are employed routinely to aid in collection and preliminary analysis of data from several types of spectrometers. Applications of problem solving programs to more sophisticated analysis of molecular structure are more recent. For example, several reports have appeared describing computer programs for assisting chemists in the task of constructing plausible candidate structures for unknown compounds (1-3). These programs have matured to the point where successful applications to real-world structural problems have been demonstrated (1,4). The structure generating capabilities of these programs f u l f i l l only the generate phase of the "plangenerate-test" paradigm of heuristic search (5). Much work remains to be done on the structure generators to make them simpler to use and to reduce them i n size, complexity and execution time. However, this work i s primarily developmental. In this report we focus on new research on "planning" and "testing" ("predicting"), and illustrate how computer techniques can provide assistance in these areas also. We define planning to include i n i t i a l interpretation of chemical and spectroscopic data and the translation of results into fragments of molecular structure which can be used manually, or together with a structure generator, to generate structures (2, £.-12 and other papers i n this symposium) . Computer assistance can be provided to the chemist during 126

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch009

9.

CARHART E T AL.

Computer

Assistance

for the Structural

Chemist

127

planning i n s e v e r a l ways. We d i s c u s s i n the subsequent s e c t i o n ways i n which our s t r u c t u r e generator, CONGEN, i s being modified to plan more e f f i c i e n t s t r u c t u r e generation by t r a n s l a t i o n o f s t r u c t u r a l data input t o the program. We f e e l that such i n t e l l i g e n t use o f s t r u c t u r a l data as c o n s t r a i n t s i s e s s e n t i a l during s t r u c t u r e generation. In most problems, however, i n i t i a l data and p r e l i m i n a r y a n a l y s i s are i n s u f f i c i e n t t o determine a s t r u c t u r e uniquely; many, perhaps hundreds, of candidate s t r u c t u r e s may remain. Therefore, we a r e a l s o pursuing development o f computer methods to examine or evaluate l a r g e s e t s o f candidate structures to a s s i s t the chemist i n focussing on the c o r r e c t s t r u c t u r e . These methods i n v o l v e : a) t e s t i n g o f s t r u c t u r e s with c o n s t r a i n t s ; and b) p r e d i c t i o n o f r e s u l t s o f manipulation o f the s t r u c t u r e s which can be matched against a c t u a l l a b o r a t o r y measurements. The goal i n both methods i s t o r e j e c t i m p l a u s i b l e candidate s t r u c t u r e s . We d e s c r i b e recent r e s u l t s i n a subsequent s e c t i o n . We f e e l such computer-aided f a c i l i t i e s w i l l be u s e f u l i n d i s c r i m i n a t i n g among a l a r g e set o f candidate s t r u c t u r e s . .11 Planning; v i a C o n s t r a i n t I n t e r p r e t a t i o n We have p r e v i o u s l y described the CONGEN program f o r c o n s t r u c t i n g s t r u c t u r e s under c o n s t r a i n t s ( 3 ) . The program has an extensive r e p e r t o i r e of c o n s t r a i n t s "which prevent the generating algorithm from producing undesired p a r t i a l or complete s t r u c t u r e s ( 1 3 ) . I t has a r i c h language f o r d e f i n i n g p a r t i a l s t r u c t u r e s (substructures) i n c l u d i n g s e v e r a l terms d e s c r i b i n g atom and bond p r o p e r t i e s , r i n g s , c h a i n s , and so f o r t h . The language allows r e p r e s e n t a t i o n o f a r o m a t i c i t y , atoms of indeterminate, or r e s t r i c t e d , i d e n t i t y and chains and r i n g s of v a r i a b l e numbers o f atoms. CONGEN i s an i n t e r a c t i v e program with a h e l p f u l user i n t e r f a c e with extensive e r r o r checking to avoid program crashes. I t i s u s e f u l , and used by a nationwide community o f persons v i a a computer network, even though i t only f u l f i l l s the r o l e o f a s t r u c t u r e generator. I t depends on user inputs t o determine i t s knowledge o f chemistry and, u n t i l r e c e n t l y , u t i l i z e d t h i s i n f o r m a t i o n much as i t was r e c e i v e d . This i s not the most e f f i c i e n t way t o solve problems. Much needs t o be known about the program to use i t i n the most e f f i c i e n t way f o r a given problem. In m o d i f i c a t i o n s and extensions t o CONGEN (below) we are s t r i v i n g to d i v o r c e the chemist from the a l g o r i t h m s . We seek t o solve problems e f f i c i e n t l y , beginning with i n f o r m a t i o n supplied by the chemist to the program i n ways which are i n t u i t i v e to him/her, independent o f the program. Other workers have discussed aspects o f data i n t e r p r e t a t i o n at some l e n g t h (2, .6-12). We a r e d i r e c t i n g our a t t e n t i o n t o the problem o f how to t r a n s l a t e and u t i l i z e the s t r u c t u r a l i n f o r m a t i o n provided by these i n t e r p r e t a t i o n s . A l l s t r u c t u r a l

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch009

128

COMPUTER-ASSISTED S T R U C T U R E

ELUCIDATION

i n f o r m a t i o n about an unknown can be viewed as c o n s t r a i n t s . Such c o n s t r a i n t s may be p o s i t i v e statements involving substructures or r i n g systems which must be present, or negative statements f o r b i d d i n g the presence o f other s u b s t r u c t u r e s or r i n g systems. There i s sometimes a s i g n i f i c a n t conceptual gap between the i n t u i t i v e chemical phrasing o f a CONGEN problem and the phrasing f o r most e f f i c i e n t problem s o l v i n g by the program. There are u s u a l l y many ways o f d e f i n i n g a given problem and d i f f e r e n t d e f i n i t i o n s can place widely d i f f e r e n t demands upon the program. We have a c o n t i n u i n g i n t e r e s t i n reducing t h i s conceptual gap by making CONGEN r e s p o n s i b l e f o r rephrasing a problem i n an e f f i c i e n t way, thus f r e e i n g the chemist t o concentrate upon the chemical, r a t h e r than the a l g o r i t h m i c , aspects o f a given problem. We have r e c e n t l y described some i n i t i a l e f f o r t s toward automatic rephrasing o f problems, e.g., i m p l i c a t i o n s o f ranges of hydrogen atoms on a given atom ( 1 3 ) . We have so f a r t r e a t e d i n d e t a i l only the problem o f t r a n s l a t i o n o f GOODLIST (Y3) c o n s t r a i n t s ( r e q u i r e d s t r u c t u r a l f e a t u r e s ) . Although this i s o n l y a part o f the c o n s t r a i n t s i n t e r p r e t a t i o n problem, the r e s u l t i n g a l g o r i t h m i s a s i g n i f i c a n t step forward i n development of our s t r u c t u r e generating c a p a b i l i t i e s f o r reasons discussed below. A. GOODLIST I n t e r p r e t a t i o n . C o n s t r u c t i v e Substructure Search. A l l c u r r e n t s t r u c t u r e generation programs, i n c l u d i n g u n t i l now CONGEN, have a s e r i o u s l i m i t a t i o n . Substructures used as b u i l d i n g b l o c k s , or "superatoms" (3) are r e q u i r e d to be nonoverlapping i . e . , they must have no ""atoms i n common. In many s t r u c t u r a l problems s e v e r a l l a r g e fragments o f the s t r u c t u r e may be known, but the overlaps among them are not known. The chemist ( o r the program ( 2 ) ) must decide what s e t o f fragments cannot overlap and use those as superatoms. Larger fragments which may overlap are t e s t e d a t the end o f s t r u c t u r e generation by a graph-matching procedure. This d i s t i n c t i o n i s c o n c e p t u a l l y i n e l e g a n t and p u z z l i n g to chemists who wish t o use CONGEN. Lack of understanding o f the d i s t i n c t i o n can lead t o t e r r i b l e inefficiencies. I n e f f i c i e n c y a r i s e s i n two ways. A set of s m a l l superatoms w i l l y i e l d many more f i n a l s t r u c t u r e s than a set o f l a r g e superatoms. A great many o f the s t r u c t u r e s produced by the former s e t w i l l be r e j e c t e d on p o s t - t e s t i n g (graph-matching) w i t h the l a r g e r GOODLIST fragments. These computations r e q u i r e c o n s i d e r a b l e time. But chemists must contend w i t h the problem o f o v e r l a p p i n g s u b s t r u c t u r e s i n t h e i r manual e x p l o r a t i o n o f s t r u c t u r a l p o s s i b i l i t i e s . By examining how such problems are solved manually we have developed a new procedure which employs both the e f f i c i e n c y o f non-overlapping superatoms and the e f f i c i e n c y o f i n c o r p o r a t i n g GOODLIST i n f o r m a t i o n a t the beginning o f the process o f s t r u c t u r e generation ( t h u s , "planning") r a t h e r than the end. In s h o r t , we have developed a general s o l u t i o n t o the problem o f c o n s t r u c t i n g

9.

CARHART E T A L .

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch009

structures degree.

from

Computer

fragments

Assistance

which

for the Structural

overlap

t o an

Chemist

129

unspecified

1) An Example. A recent problem s t u d i e d using CONGEN i n v o l v e d the s t r u c t u r a l i n f o r m a t i o n summarized i n Figure 1. The bonds with u n s p e c i f i e d t e r m i n i i n CEMB are f r e e valences. A l l atoms (excepting seven remaining hydrogens) i n the e m p i r i c a l formula C^^H^O^ are included i n CEMB. The problem c o n s i s t s o f f i n d i n g a l l ways o f a l l o c a t i n g three new bonds among the free valences i n the superatom CEMB such t h a t the three i n d i c a t e d s u b s t r u c t u r e s (Figure 1) are present i n the f i n a l molecules. There are perhaps 10,000 unique a l l o c a t i o n s o f those three new bonds, but only seven pass the GOODLIST t e s t s . Using GOODLIST as a p o s t - t e s t o n l y , CONGEN would generate a l l 10,000 and d i s c a r d n e a r l y a l l o f them, a process which would have been so lengthy t h a t i t was never completed. Yet most chemists given t h i s i n f o r m a t i o n could f a i r l y quickly w r i t e down the seven solutions. I t i s clear c o n c e p t u a l l y how the problem i s solved manually. I t i s obvious that there are only three places i n CEMB where the f i r s t GOODLIST item (Figure 1) can f i t , because there a r e only three methyl groups on the periphery o f CEMB which f i t the s t r u c t u r a l c r i t e r i a o f the s u b s t r u c t u r e . For each o f these matchings, there are four ways o f matching the second s u b s t r u c t u r e , and so f o r t h . In t h i s case, some matchings l e a d t o c o n s t r u c t i o n o f new bonds. In the general case when there are s e v e r a l superatoms and remaining atoms, matchings may r e q u i r e c o n s t r u c t i n g new bonds and s t r u c t u r a l u n i t s , by u t i l i z i n g parts o f e x i s t i n g superatoms or remaining atoms. In t h i s example, CONGEN, using the method o u t l i n e d below, solves the problem i n much the same way as described above f o r the manual method. Rather than generating and t e s t i n g thousands of undesired s t r u c t u r e s , i t q u i c k l y a r r i v e s at the seven s o l u t i o n s by i n c o r p o r a t i n g the GOODLIST i n f o r m a t i o n from the very beginning. 2) The Method. We have developed a method, as an extension to CONGEN, which emulates the manual method. I t matches GOODLIST items t o the i n i t i a l problem f o r m u l a t i o n , which may be anything from the raw e m p i r i c a l formula t o a l i s t o f nonoverlapping superatoms. When new bonds or atoms are r e q u i r e d t o complete the matching, they are constructed from other segments (atoms, superatoms) o f the problem by forming new bonds. In order t o i n c o r p o r a t e a GOODLIST s u b s t r u c t u r e i t i s necessary t o f i n d a l l unique ways that the given s u b s t r u c t u r e can be created using parts o f the e x i s t i n g b u i l d i n g blocks (atoms and superatoms). Figure 2 shows s c h e m a t i c a l l y , together w i t h an example, some o f the ways t h i s c o n s t r u c t i o n might occur: a) by bonding together two (or more) e x i s t i n g superatoms t o c r e a t e one l a r g e r one; b) by bonding a d d i t i o n a l atoms t o a superatom to

130

COMPUTER-ASSISTED

STRUCTURE

ELUCIDATION

Cemb: H c

r

3

J.

,

CH

3

CH CH OH 3

N

2

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch009

GOODLIST: CH3-C=CH-CH2-

Figure 1. Structural information available for an unknown cembranolide. The superatom "CEMB" was inferred from a variety of data and inferences derived from related, co-occurring compounds. The GOODLIST items were inferred from additional spectroscopic and chemical data.

SUPERATOMS

ATOMS

CH3-C=CH-CHI I

ATOMS

SUPERATOMS

CONGE PROBLEM

, 2 H

+

G00DLIST

+ —

ENTRY

•CH CH

ccc 2

2

"CH2CH2CH2

CONSTRUCTIVE SUBSTRUCTURE SEARCH

CCC

NEW

cc

CONGEN PROBLEMS

O Q

0

0 0

Figure 2. A schematic and an example of ways to construct a GOODLIST item from a collection of atoms and superatoms. Each way results in larger superatoms in new problems to be carried on to the next step in structure generation, for example, incorporation of another GOODLIST item.

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch009

9.

CARHART ET AL.

Computer

Assistance

for

the

Structural

Chemist

c r e a t e a l a r g e r one; and c) by c o n s t r u c t i n g a copy of the s u b s t r u c t u r e from s i n g l e atoms, c r e a t i n g a new superatom. We c a l l t h i s method " c o n s t r u c t i v e s u b s t r u c t u r e search" f o r obvious reasons. The procedure i s stepwise. I t begins with f i n d i n g a l l ways to c o n s t r u c t the f i r s t GOODLIST item. Each o f these ways i s a new problem with the same or l a r g e r superatoms (e*g., Figure 2 ) . Each new problem i s t r e a t e d i n a depth f i r s t generation scheme by c o n s i d e r i n g the next GOODLIST item, and proceeding until a l l GOODLIST items are accounted f o r or u n t i l the next GOODLIST item cannot be b u i l t from the current set of superatoms and atoms. Depending on the problem, the f i n a l set of s o l u t i o n s may be f i n a l s t r u c t u r e s , as i n the previous example ( F i g u r e 1), or may be s e v e r a l new s e t s of superatoms and atoms f o r each of which CONGEN can be used to c o n s t r u c t f i n a l s t r u c t u r e s . In the l a t t e r i n s t a n c e , a l l GOODLIST items are guaranteed to be present so that time-consuming p o s t - t e s t i n g f o r t h e i r e x i s t e n c e i s not r e q u i r e d . D e t a i l s of the a l g o r i t h m are beyond the scope of t h i s p r e s e n t a t i o n and w i l l be discussed s e p a r a t e l y . B r i e f l y , the a l g o r i t h m i s derived from the CONGEN graph-matching r o u t i n e (1_3) w i t h the a d d i t i o n a l f e a t u r e that as i t searches f o r the s u b s t r u c t u r e i t i s allowed to create new bonds (up to the l i m i t of a v a i l a b l e new bonds i n the o r i g i n a l CONGEN problem) whenever they are necessary f o r the search to proceed. New bonds may be used to form m u l t i p l e bonds or r i n g s w i t h i n a superatom. A l t e r n a t i v e l y , use of new bonds to form new connection to atoms outside the superatom leads to extended superatoms (Figure 2 ) . During the search, f u l l account i s taken of the t o p o l o g i c a l symmetry of the superatoms i n the o r i g i n a l problem so t h a t f i t t i n g s which are redundant with respect to these symmetries are avoided. The p o t e n t i a l equivalence of GOODLIST items i s not c u r r e n t l y considered; d u p l i c a t e problems or f i n a l s t r u c t u r e s are removed i n subsequent steps. Most CONGEN problems c o n t a i n one or more GOODLIST items which can be t r e a t e d with our method. The algorithm i s c u r r e n t l y being t e s t e d to understand i t s scope and l i m i t a t i o n s . Although i n t e g r a t e d i n t o CONGEN, i t does not yet have an appropriate user i n t e r f a c e so i s not yet a v a i l a b l e as p a r t of our v e r s i o n a v a i l a b l e over the network. 3) A Second Example. The s t r u c t u r a l i n f o r m a t i o n f o r a second example i s summarized i n Figure 3. This i n f o r m a t i o n i s formulated from data presented i n a previous d i s c u s s i o n of t h i s s t r u c t u r e (Ijl) . In t h i s example, the i n i t i a l set o f nonoverlapping s t r u c t u r a l components c o n s i s t e d o f the off-resonance decoupled 13c NMR data plus the atom Z ( F i g u r e 1). Every bond connecting atoms i n the f i n a l s t r u c t u r e must be used to connect these components. To generate s t r u c t u r e s from t h i s i n f o r m a t i o n alone, with subsequent t e s t i n g f o r the presence of S1- S4 ( F i g u r e 3) would be e f f e c t i v e l y i m p o s s i b l e . The GOODLIST items -SI -S4 can overlap s i g n i f i c a n t l y . The problem was a l s o solved

132

COMPUTER-ASSISTED S T R U C T U R E E L U C I D A T I O N

by reducing the s u b s t r u c t u r e s S1-S4 t o non-overlapping components and generating structures under extensive constraints. But t h i s s t r u c t u r a l i n f o r m a t i o n can be used d i r e c t l y and e f f i c i e n t l y i n our new method, and CONGEN, using constructive substructure search, arrives quickly at r e p r e s e n t a t i o n s o f the two p o s s i b l e s o l u t i o n s , J_ and 2, F i g u r e 4. I f one assumes t h a t the two amide f u n c t i o n a l i t i e s S2 must be completely d i s j o i n t , the s t r u c t u r e 2, proposed p r e v i o u s l y (14) i s the o n l y s o l u t i o n , excepting s t r u c t u r a l v a r i a t i o n i n Z (see c a p t i o n t o Figure 3 ) .

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch009

II.

T e s t i n g v i a PRUNE, SURVEY, REACT and MSPRED

From an a l g o r i t h m i c s t a n d p o i n t , CONGEN i s s u c c e s s f u l i f i t can, i n a reasonable amount o f time and without exhausting storage r e s o u r c e s , produce a l i s t o f candidate s t r u c t u r e s s a t i s f y i n g the chemist's c o n s t r a i n t s . However, t h i s l i s t i s o f t e n q u i t e l a r g e , perhaps s e v e r a l hundred s t r u c t u r e s , and from an a n a l y t i c a l standpoint the problem may be f a r from complete. I t remains f o r the chemist t o d i s c r i m i n a t e among the c a n d i d a t e s , e v e n t u a l l y reducing the p o s s i b i l i t i e s t o j u s t one s t r u c t u r e . We are studying s e v e r a l ways t o provide computer a s s i s t a n c e i n examining and further constraining lists of structural candidates. We have implemented s e v e r a l types o f t e s t s designed to help d i s c r i m i n a t e among candidate s t r u c t u r e s * A. D i r e c t Tests. Two t e s t s are d i r e c t l y s t r u c t u r a l f e a t u r e s o f the candidates.

related

to

1. PRUNE. The chemist can f u r t h e r reduce a s e t o f s t r u c t u r e s by "pruning" (J3) the s e t with new s t r u c t u r a l information. This subsequent t e s t i n g has been described i n d e t a i l p r e v i o u s l y (3,13). 2. SURVEY. The second type o f d i r e c t t e s t allows surveying the s e t o f candidate s t r u c t u r e s using a l i b r a r y o f predefined structural features. The f u n c t i o n s o f SURVEY and examples o f a p p l i c a t i o n s are summarized i n Figure 5. The examples are chemical areas where we have u t i l i z e d SURVEY i n our own research. An extensive l i b r a r y o f f u n c t i o n a l groups i s maintained i n order that s e t s o f s t r u c t u r e s may be surveyed for unlikely f u n c t i o n a l groupings. Such groupings are o f t e n overlooked i n the i n i t i a l d e f i n i t i o n o f a problem p r e c i s e l y because they are unlikely. But CONGEN w i l l c o n s t r u c t them u n l e s s c o n s t r a i n t s forbid i t . SURVEY a c t s as a reminder o f such groupings. I t provides a c l a s s i f i c a t i o n o f s t r u c t u r e s by f u n c t i o n a l groups, any set o f which can be kept or pruned away a t the chemist's discretion. Other l i b r a r i e s of structural f e a t u r e s are maintained, i n c l u d i n g , f o r example, l i b r a r i e s o f mono- and

9.

CARHART E T A L .

Computer

Assistance

for the Structural

133

Chemist

o

II -C-NH-

S1

S2

-CH=CH—CH—CH=CH—CH—CH2-

I

I

CH

3

S3

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch009

x 2

c=c—z

Figure 3. Structure information for a dehydrotryptophan derivative. The atom Z represents an aromatic ring with several other substituents. The variety of possible substitution patterns on the aromatic ring lends considerable structural variety to this problem beyond our brief presentation. Emperical formula: C J/ N 0 Z. C13 NMR: CH X 3, CH X 2, CH X 12, C X 9. 2C

2

S4

S

28

2

Figure 4. Two structural possibilities for the compound whose structural information is summarized in Figure 3

FUNCTION:

AIDS

IN

PERCEPTION OF ANY OF A

PRE-SPECIFIED SET FEATURES IN

OF STRUCTURAL

A GROUP OF

STRUCTURAL CANDIDATES.

E.G.

A)

FUNCTIONAL GROUPS

B)

TERPENOID SKELETONS

C)

AMINO ACID SKELETONS

Figure 5. Function and examples of application of SURVEY, a subprogram of CONGEN, used to test candidate structures

3

134

COMPUTER-ASSISTED STRUCTURE E L U C I D A T I O N

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch009

sesquiterpane s k e l e t o n s . We have u t i l i z e d these l i b r a r i e s i n s t u d i e s of terpenoid c y c l i z a t i o n and rearrangement (1_5) i n order to l o c a t e known skeletons and e l u c i d a t e t h e i r pathways o f formation. A l i b r a r y of amino a c i d skeletons i s used to t e s t s t r u c t u r a l candidates which may be conjugates of organic a c i d s w i t h amino a c i d s i n s t u d i e s of organic c o n s t i t u e n t s o f human body f l u i d s . Any set of f e a t u r e s or s t r u c t u r e s can be defined and used by SURVEY. B. I n d i r e c t Tests. There are two i n f o r m a t i v e sources of data about s t r u c t u r a l candidates which cannot always be phrased as d i r e c t t e s t s on the candidates themselves: 1) s t r u c t u r a l f e a t u r e s observed i n chemical r e a c t i o n s ; and 2) empirical s p e c t r o s c o p i c measurements on the unknown which cannot be i n t e r p r e t e d unambiguously i n p r e c i s e s t r u c t u r a l terms* The program REACT addresses the f i r s t problem while MSPRED i s concerned with the second i n the context of mass spectrometric observations. 1) REACT. The REACT program has two b a s i c g o a l s : 1) to provide the chemist with a computer-based language f o r d e f i n i n g graph transformations and applying them to s t r u c t u r e s , thus simulating chemical reactions; and 2) to keep t r a c k a u t o m a t i c a l l y of the i n t e r r e l a t i o n s h i p s among s t r u c t u r e s i n a complex sequence of r e a c t i o n s so t h a t whenever s t r u c t u r a l claims are made about any product, the i m p l i c a t i o n s of these claims on s t r u c t u r e s at other steps i n the sequence can be t r a c e d . The f i r s t v e r s i o n of the REACT program has been discussed p r e v i o u s l y (J6). Based on our experience and s e v e r a l d i f f i c u l t i e s with r e p r e s e n t a t i o n of the network of r e a c t i o n s and associated s t r u c t u r e s , we have r e c e n t l y completed a second v e r s i o n with s e v e r a l new f e a t u r e s . The goals we set during the w r i t i n g o f t h i s version and for near-term f u t u r e developments are summarized i n Figure 6 . The program has been separated from CONGEN f o r purposes of e f f i c i e n c y (the combination i s too l a r g e a program) and because c e r t a i n f u n c t i o n s l i k e graph matching have a s l i g h t l y d i f f e r e n t meaning when a p p l i e d to r e a c t i o n s v s . structure generation. EDITREACT, the r e a c t i o n - e d i t i n g language, has been extended to a l l o w the user to d e f i n e subgraph c o n s t r a i n t s which apply r e l a t i v e to a p o t e n t i a l r e a c t i o n s i t e r a t h e r than to the molecule as a whole. For example, i n the present v e r s i o n of REACT, we can say e i t h e r that a hydroxyl group (OH), i f present anywhere i n the reactant molecule, would i n h i b i t the r e a c t i o n , or that such i n h i b i t i o n would take place only i f the OH group i s adjacent to the r e a c t i o n s i t e . Such s i t e - s p e c i f i c c o n s t r a i n t s , a p p l i e d e i t h e r before or a f t e r the transformation (i.e., r e a c t i o n ) has been c a r r i e d out on the s i t e , are c r i t i c a l to the d e t a i l e d d e s c r i p t i o n of r e a l chemical r e a c t i o n s . The i n c l u s i o n of t h i s f a c i l i t y i n REACT s u b s t a n t i a l l y i n c r e a s e s i t s usefulness

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch009

9.

CARHART ET AL.

Computer

Assistance

for

the

Structural

Chemist

135

i n r e a l - w o r l d chemical problems. The control structure for REACT (Figure 6, 3) has undergone major r e v i s i o n . In the initial implementation, a set of products a r i s i n g from the a p p l i c a t i o n of a given r e a c t i o n to a given s t a r t i n g s t r u c t u r e could be subjected to a m u l t i - l e v e l c l a s s i f i c a t i o n which grouped the products based upon user-defined substructural constraints. Each of these c l a s s e s had an a s s o c i a t e d minimum and maximum number, r e p r e s e n t i n g the numbers o f products which were allowed to be members of the c l a s s . Any s t a r t i n g m a t e r i a l s whose products could not s a t i s f y these c o n d i t i o n s were removed from the l i s t of candidates. S t r u c t u r e s i n any c l a s s could be f u r t h e r r e a c t e d , t h e i r products c l a s s i f i e d , and so on. This treatment of bookkeeping was sufficient for s t a t i n g many chemical problems. For example, suppose a chemist knew t h a t a p a r t i c u l a r r e a c t i o n on an unknown compound y i e l d e d two carbonyl compounds ( i . e . , c o n t a i n i n g C=0), at l e a s t one of which was an e s t e r (R-O-C=0(R')). He could d e f i n e a product c l a s s CARBONYL using the C=0 s u b s t r u c t u r e with a minimum and maximum of two products. He could then d e f i n e a sub-class o f CARBONYL c a l l e d ESTERS using the substructure R-0-C=0(R') w i t h a minimum of one and a maximum of two products. The program would a u t o m a t i c a l l y use t h i s i n f o r m a t i o n to e l i m i n a t e candidate starting structures which could not give the i n d i c a t e d product d i s t r i b u t i o n w i t h the given r e a c t i o n . There are chemical problems, though, f o r which the above scheme i s too r i g i d . For example, suppose a r e a c t i o n g i v e s s e v e r a l products two of which are i s o l a t e d and l a b e l e d dA and dB. Suppose that only a s m a l l amount of dA i s a v a i l a b l e so only mass s p e c t r o s c o p i c measurements are p r a c t i c a l and that a deuterium-exchange experiment shows that dA has two exchangeable protons (say, e i t h e r N-H or 0-H). Presume that dB shows a strong carbonyl absorption i n the IR. Now, _dA might a l s o c o n t a i n a carbonyl group, but t h a t was never determined, and n e i t h e r was the number of exchangeable protons i n dB, which c o u l d be two. No matter how one attempts to use the abovedescribed c l a s s i f i c a t i o n system, one cannot express t h i s information accurately. Our new approach i s designed to express chemical i n f o r m a t i o n to REACT i n a much more n a t u r a l sequence which p a r a l l e l s the experimental steps. Current o p e r a t i o n s , which include functional commands and descriptive terms, are summarized i n Figure 7. Thus, a r e a c t i o n i s c a r r i e d out by using the command REACT and a p p l y i n g a named r e a c t i o n to a set of s t r u c t u r e s ( i n i t i a l l y STRUCS ( F i g u r e 7 ) , but subsequently to the contents o f any named product f l a s k ) . Products are assigned to a named f l a s k . The f i r s t experimental step a f t e r a r e a c t i o n i s u s u a l l y the separation and p u r i f i c a t i o n of products. An analogous step i s i n c l u d e d i n REACT, i n which the separation amounts to d e f i n i n g a number of l a b e l l e d " f l a s k s " each of which i s u l t i m a t e l y to c o n t a i n a s p e c i f i e d number ( u s u a l l y 1) of the products. As experimental data are gathered on each r e a l

136

COMPUTER-ASSISTED REACTION CHEMISTRY

1.

STRUCTURE

ELUCIDATION

DEVELOPMENTS

SEPARATION FROM CONGEN - COMMUNICATION VIA FILES OF STRUCTURES.

2.

ADDING CONSTRAINTS - SITE - AND TRANSFORM - S P E C I F I C

3.

CONTROL STRUCTURE -

4.

RAMIFICATION

A.

ESTABLISH RELATIONSHIPS AMONG PRODUCTS AND REACTANTS

B.

DEAL PROPERLY

WITH RANGES OF NUMBERS OF PRODUCTS

INTERACTION - DEVELOP MANIPULATION COMMANDS WHICH PARALLEL LABORATORY OPERATIONS, SEPARATE VARIOUS

INTO FLASKS, FLASKS,

E.G.,

TEST CONTENTS OF

INCOMPLETE

SEPARATIONS,

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch009

ETC. 5.

REPRESENTATION

6.

PROSPECTIVE

OF REACTIONS

DETECTION OF DUPLICATE PRODUCTS BASED ON

SYMMETRY PROPERTIES OF: B)

Figure

A)

STARTING MATERIAL; AND

TRANSFORMATION.

6.

Major

milestones REACT

in the development program

of the

FUNCTIONS

EDITREACT REACT, MREACT SEPARATE PRUNE

TERMS 1, E, NAMES;

Figure 7. Functional commands and descriptive terms used in REACT. EDITREACT allows reaction definition, REACT and MREACT carry out reactions, SEPARATE selects numbers of and flasks for products, PRUNE tests the content of specified flasks for given structural features. Reactions may be carried out as one-step (1), exhaustive (EX), or as equilibrium (E) reactions. STRUCS is the initial list of candidate structures.

EX STEPS;

STRUCS; PRODUCTS;

FLASKS;

LABELS

CONSTRAINTS

DURING REACTION ON STARTING

MATERIALS

SITE-SPECIFIC TRANSFORM-SPECIFIC ON PRODUCTS POST-REACTION NUMBERS OF PRODUCTS AT ANY LEVEL SUBSTRUCTURAL CONSTRAINTS APPLIED TO PRODUCTS (PRUNING) AT ANY LEVEL

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch009

9.

CARHART E T A L .

Computer

Assistance

for

the

Structural

Chemist

137

product, corresponding s u b s t r u c t u r a l c o n s t r a i n t s are attached to the corresponding f l a s k i n the program. As each such a s s e r t i o n i s made, the bookkeeping mechanism v e r i f i e s t h a t , f o r a set of r e a c t i o n products from a given s t a r t i n g m a t e r i a l , there i s at l e a s t one way to d i s t r i b u t e them among the f l a s k s such that each product s a t i s f i e s the c o n s t r a i n t s f o r i t s f l a s k . As an example, consider the problem described above w i t h two products dA and dB. Assume that an o x i d a t i v e cleavage r e a c t i o n w i t h appropriate p r o t e c t i o n had been a p p l i e d to a set of candidate s t r u c t u r e s , y i e l d i n g f o r each s t r u c t u r e a set of products placed i n a f l a s k l a b e l l e d ? ( i l l u s t r a t e d s c h e m a t i c a l l y i n Figure 8 ) . Separation y i e l d e d two i d e n t i f i a b l e products and perhaps o t h e r s , placed i n the i n d i c a t e d f l a s k s A, B and C (Figure 8 ) . Data acquired on dA and dB were summarized above. The task of determining l e g a l assignments o f products to i n d i v i d u a l f l a s k s i s given to an a l l o c a t i o n mechanism. The o p e r a t i o n of the a l l o c a t o r w i l l be described i n a subsequent p u b l i c a t i o n d e s c r i b i n g REACT i n more d e t a i l . However, we can illustrate the problem conceptually with the (purely h y p o t h e t i c a l ) s t r u c t u r e s 2rL i F i g u r e 9 . Assuming that 2rl are candidate s t r u c t u r e s f o r an unknown, the statements that the compound dA, i n f l a s k A, must have two exchangeable hydrogens and dB i n f l a s k B, i s a ketone ( F i g u r e 8) are c o n s t r a i n t s on which products o f the r e a c t i o n can be i n which f l a s k . Initial separation r e j e c t s 2. i product, d]_ , because only one product i s obtained and at l e a s t two were r e q u i r e d ( F i g . 9 ) . Cleavage of j£-£ y i e l d s two products d2. However, n e i t h e r product obtained from U s a ketone (both are aldehydes), nor does e i t h e r product possess two exchangeable hydrogens. Therefore n e i t h e r product can be i n e i t h e r f l a s k and k i s r e j e c t e d . Compound £ y i e l d s one product, d±, which i s a ketone and possesses two exchangeable hydrogens. d1_ can be i n e i t h e r f l a s k A or B. But the other product 62, obeys n e i t h e r c o n s t r a i n t , can be i n n e i t h e r f l a s k and £ i s r e j e c t e d . Compound 6. y i e l d s two products which can be assigned to f l a s k s A and B unambiguously. Compound J. y i e l d s two products each o f which possess both a keto group and two exchangeable hydrogens. These c o n s t r a i n t s are i n s u f f i c i e n t f o r unambiguous a l l o c a t i o n and e i t h e r s t r u c t u r e may be i n e i t h e r f l a s k . In the general case, r e a c t i o n s may be c a r r i e d out i n any c o n s i s t e n t sequence, whether or not the contents o f a f l a s k are uniquely s p e c i f i e d . REACT keeps account o f the f a c t that a r e a c t i o n may be applied to a f l a s k whose contents are known but not assigned to a s i n g l e s t r u c t u r e . M u l t i p l e r e a c t i o n s to any level may be c a r r i e d out. The a l l o c a t i o n scheme and r a m i f i c a t i o n mechanism t r a n s l a t e statements about products at any l e v e l to determine the i n f l u e n c e of each statement on the contents o f each f l a s k f o r every s t r u c t u r a l candidate. n

a n d

An

Example. We

t

s

have r e c e n t l y

reexamined

the c o n s t r a i n t s

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch009

COMPUTER-ASSISTED

A)

C/

B)

C*

A

B

HAS

2 EXCHANGEABLE

A KETONE BY

STRUCTURE

H'S

IR

Figure 8. Schematic of reaction which yields at tory, both placed in flask separation of P into

the computer representation of a least two products in the laboraP. Flasks A, B, and C result from from two to many products.

ELUCIDATION

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch009

9.

CARHART E T A L .

Computer

Assistance

for

the

Structural

Chemist

139

derived from a p p l i c a t i o n s of a dehydration r e a c t i o n to the s t r u c t u r e of p a l u s t r o l . This problem was studied p r i o r to the e x i s t e n c e of REACT, so that c o n s t r a i n t s derived from measurements on the products were t r a n s l a t e d manually i n t o a s u b s t r u c t u r a l c o n s t r a i n t on the s t r u c t u r a l c a n d i d a t e s . The data are summarized i n Figure 10. Three products were i s o l a t e d , one of which possessed only a v i n y l methyl group (1H NMR), 8, a second which possessed only a v i n y l hydrogen, and a t h i r d product with n e i t h e r a v i n y l hydrogen nor a v i n y l methyl group, 10. The substructure derived from those data was JJ_, and pruning with t h i s c o n s t r a i n t reduced a set o f 88 s t r u c t u r e s to 22 ( i ) . This c o n s t r a i n t i s incomplete, however, because s u b s t r u c t u r e JJ_ does not prevent s t r u c t u r e s i n which a methyl group i s connected to the methine carbon which does not already bear a methyl group. A more l o g i c a l way to do t h i s problem i s to c a r r y out the r e a c t i o n , s e p a r a t i o n and t e s t i n g with c o n s t r a i n t s i n REACT and l e t the a l l o c a t i o n and r a m i f i c a t i o n mechanisms decide which s t r u c t u r e s are l e g a l and which are n o t . Doing the problem i n t h i s way, using substructure 8., £ and J_0 as c o n s t r a i n t s on the contents of the i n d i c a t e d f l a s k s (Figure 10), reduces the set of 88 s t r u c t u r e s t o 14 r a t h e r than 22 because now a l l i m p l i c a t i o n s of the c o n s t r a i n t s are u t i l i z e d p r e c i s e l y . Major tasks mentioned i n Figure 6 which remain uncompleted i n c l u d e the problem d e a l i n g with i n c o m p l e t e l y separated mixtures and a general treatment of symmetry p r o p e r t i e s o f s t r u c t u r e s and transforms. The former problem i s d i f f i c u l t because of the c o m b i n a t o r i a l complexity of the a l l o c a t i o n scheme when one cannot assume a s i n g l e s t r u c t u r e per f l a s k (as mentioned p r e v i o u s l y , however, many p o s s i b l e i d e n t i t i e s f o r that s i n g l e s t r u c t u r e are handled p r o p e r l y ) . The l a t t e r problem r e q u i r e s only implementation of procedures f o r d e t e c t i o n of symmetry p r o p e r t i e s o f graphs; procedures which e x i s t i n parts o f the CONGEN program. 2) MSPRED. Mass spectrometry i s an important t o o l i n organic a n a l y t i c a l chemistry. When no other s t r u c t u r a l i n f o r m a t i o n i s a v a i l a b l e mass spectrometry i s used as a standalone method f o r s t r u c t u r a l i n f e r e n c e . In combination with other a n a l y t i c a l methods i t i s a l s o extremely u s e f u l f o r post t e s t i n g p o s s i b l e candidate s t r u c t u r e s . Mass s p e c t r a l data may be used i n s t r u c t u r a l s t u d i e s i n s e v e r a l ways. For example, we can c r e a t e a fragmentation theory based on examination of s e t s o f known s t r u c t u r e s and t h e i r a s s o c i a t e d mass s p e c t r a . Or, assuming that peaks i n the mass spectrum o f an unknown o r i g i n a t e from unrearranged molecular i o n s , we can propose p o s s i b l e s t r u c t u r e s by combining the fragments together under the guidance o f a fragmentation theory. When the s t r u c t u r e i s g i v e n we can p r e d i c t a mass spectrum which w i l l obey the r u l e s o f a fragmentation theory. A l l of these

140

COMPUTER-ASSISTED

_

.1

I

r-

I

STRUCTURE

I

2

Cf

-7

ELUCIDATION

I

cf

^1

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch009

OH REJECT ON

REJECT ON

SEPARATE

PRUNE

PRUNE

(ONLY

(FAILS A

UFd -d

REJECT ON I

PRODUCT)

Figure

9.

AND

O.K.

O.K.

0

Rd

2

A

B)

Candidate structures and results of flask assignment by the under constraints on the contents of each flask (see Figure 8)

DEHYDRATION

REACT

MIN =

3,

MAX =

3

0 VINYL H

1 VINYL H

0 VINYL H

1 VINYL

0 VINYL

0 VINYL

CH3

\

r

N

c

>=c
9

8 Tetrahedron

separation, products

-CHHO-i-CH-CH*3 11 CH ?

n

CH3

A

c

Figure 10. Flask assignment, and constraints on the dehydration of palustrol(4)

allocator

c V

CH3

/

N

10

c

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch009

9.

CARHART ET A L .

Computer

Assistance

for

the

Structural

Chemist

141

o p e r a t i o n s can, i n p r i n c i p l e be t r a n s l a t e d i n t o a set of i n s t r u c t i o n s i n a computer program. In our DENDRAL programs we have demonstrated mass s p e c t r a l theory formation when given known s t r u c t u r e s and the r e l a t e d mass s p e c t r a ( V7 ,18) . A set of fragmentation r u l e s obtained t h i s way were used f o r studying the fragmentation of estrogens (Vp . Applying these c l a s s s p e c i f i c fragmentation r u l e s , helped us to a l l o c a t e the p o s i t i o n of s u b s t i t u e n t s i n unknown e s t r o g e n i c s t e r o i d s ( 1 9 ) . When d e a l i n g with a new compound (from a c l a s s f o r which mass s p e c t r a l fragmentation r u l e s are inadequate) f o r which the o n l y a v a i l a b l e i n f o r m a t i o n i s i t s mass spectrum, we have to use d i f f e r e n t approaches. I f we assume t h a t the fragmentation processes i n s i d e the mass spectrometer do not i n v o l v e s i g n i f i c a n t s t r u c t u r a l rearrangements, we can attempt to use the mass s p e c t r a l data f o r p l a n n i n g . The s t r u c t u r e s o f d i f f e r e n t i o n s which are obtained are c l o s e l y r e l a t e d to the s t r u c t u r e of the o r i g i n a l compound. We can t r e a t the observed i o n s as composition ( f o r high r e s o l u t i o n mass spectra) or mass ( f o r low r e s o l u t i o n ) nodes connected together i n a l l p o s s i b l e ways consistent w i t h the molecular formula or weight. The connections are determined by the k i n d o f fragmentation theory we use. We have a p r e l i m i n a r y v e r s i o n of a program which performs w e l l f o r simple t e s t cases but i s not yet e f f i c i e n t enough f o r complicated molecules. Another use of the mass s p e c t r a l data i s i n t e s t i n g or predicting. Since CONGEN and REACT y i e l d l i s t s o f candidate s t r u c t u r e s which obey s p e c t r o s c o p i c and chemical c o n s t r a i n t s , we can prune f u r t h e r the l i s t of s t r u c t u r e s u s i n g mass s p e c t r a l i n f o r m a t i o n . MSPRED i s a program which i s a combination of a p r e d i c t o r which uses a theory of mass spectrometry to p r e d i c t the s p e c t r a of candidate s t r u c t u r e s , and an e v a l u a t i o n f u n c t i o n which compares the p r e d i c t i o n s w i t h the observed spectrum of the unknown, a s s i g n i n g a g o o d n e s - o f - f i t score to each candidate. The candidates are sorted based upon how w e l l the p r e d i c t e d s p e c t r a match the observed. We have examined d i f f e r e n t t h e o r i e s f o r p r e d i c t i o n ; the most u s e f u l are summarized below. a) H a l f - o r d e r Theory. This theory assumes t h a t every bond i n the molecule can be broken under c e r t a i n c o n s t r a i n t s . In p r e d i c t i n g the spectrum, MSPRED e x p l o r e s a l l p o s s i b l e cleavages of the molecule w i t h i n these general (user-defined) c o n s t r a i n t s . C o n s t r a i n t s are s i m i l a r to those discussed p r e v i o u s l y (1J>1§)> and i n c l u d e l i m i t a t i o n of the number of bonds broken and the number of steps i n a process, the p r o x i m i t y of p a i r s o f cleaved bonds ( i . e . whether or not two adjacent bonds can break i n a g i v e n process) the m u l t i p l i c i t y or a r o m a t i c i t y of each cleaved bond, the allowed hydrogen atoms t r a n s f e r e d from or i n t o the charged fragment and the n e u t r a l fragments which can be l o s t . The program c a l c u l a t e s the composition and the mass o f the fragment which can be obtained i n a fragmentation process. The

COMPUTER-ASSISTED STRUCTURE E L U C I D A T I O N

142

program then combines these r e s u l t s i n t o a p r e d i c t e d mass spectrum with peaks of uniform i n t e n s i t y . The best p r e d i c t i v e t h e o r i e s of mass spectrometry are l i m i t e d to f a m i l i e s of c l o s e l y r e l a t e d s t r u c t u r e s ( i . e . , c l a s s s p e c i f i c t h e o r i e s ) . However, given the wide v a r i e t y of s t r u c t u r a l types which can be produced by CONGEN and REACT, i t i s necessary f o r MSPRED to use t h i s very general model of mass s p e c t r a l fragmentation.

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch009

E v a l u a t i o n and Ranking. For t h i s general approach we decided to use an e v a l u a t i o n f u n c t i o n which takes i n t o account that peaks at high m/e values and high i n t e n s i t y have more d i a g n o s t i c value then peaks i n the low m/e region of the spectrum with low i n t e n s i t y . The simplest form of v a r i o u s e v a l u a t i o n f u n c t i o n s we have used i s given i n Equation 1. SCORE = M _I j

i

1

where M^ i s the mass of a peak present i n both the p r e d i c t e d and observed s p e c t r a . I i s the i n t e n s i t y of the ( c o r r e c t l y predicted) peak i n the observed spectrum. We expect the h a l f - o r d e r theory to be o v e r l y complete i n the sense t h a t , when a p p l i e d to the c o r r e c t s t r u c t u r e f o r an unknown, i t w i l l doubtless p r e d i c t many p l a u s i b l e fragments which are not observed. This simply r e f l e c t s the f a c t that the "break everything" approach to mass spectrometry is a considerable o v e r s i m p l i f i c a t i o n . Thus the e v a l u a t i o n f u n c t i o n does not p e n a l i z e f o r p r e d i c t e d but unobserved peaks. What we do expect, though, i s that a l a r g e number of the observed peaks, p a r t i c u l a r l y the intense ones, w i l l have p l a u s i b l e explanations w i t h respect to the c o r r e c t s t r u c t u r e . Thus a "reward" i s given to every observed peak which i s c o r r e c t l y p r e d i c t e d (Equation i

1).

Example: Our i n t e r e s t i n mass s p e c t r a of s t e r o i d s l e d us to examine a c l a s s of mono-keto androstanes as a t e s t case. We obtained the high r e s o l u t i o n mass s p e c t r a f o r 10 of the 11 p o s s i b l e mono-ketoandrostanes. These 11 s t r u c t u r e s were our l i s t of candidate s t r u c t u r e s . We p r e d i c t e d the high r e s o l u t i o n spectra f o r each of the 11 s t r u c t u r e s using the h a l f - o r d e r theory, and then ranked them against each of the 10 observed s p e c t r a . The r e s u l t s are summarized i n Table I . In most of the cases i n Table I the c o r r e c t s t r u c t u r e was ranked f i r s t and i n the remaining i t was ranked second. The h a l f - o r d e r theory i s i n s u f f i c i e n t to d i f f e r e n t i a t e among monoketo androstanes when the keto group i s l o c a t e d i n one of the 4 p o s s i b l e p o s i t i o n s i n r i n g A or among s t r u c t u r e s which are d i f f e r e n t i n the l o c a t i o n of the keto s u b s t i t u e n t i n Ring D. MSPRED i s q u i t e new and we have not yet had s u f f i c i e n t experience with i t to evaluate i t s o v e r a l l u s e f u l n e s s . We are now doing a systematic study of v a r i o u s c l a s s e s of compounds by

9.

CARHART E T A L .

Computer

Assistance for the Structural

Chemist

143

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch009

ranking the spectrum o f a known s t r u c t u r e against a CONGEN or REACT generated l i s t o f s t r u c t u r e s which contains the c o r r e c t s t r u c t u r e among s e v e r a l which are c l o s e l y r e l a t e d . In most o f the t e s t cases ( i n c l u d i n g low and high r e s o l u t i o n mass s p e c t r a l data) the c o r r e c t s t r u c t u r e was ranked among the upper ten percent o f the s t r u c t u r e s . We are o p t i m i s t i c that the r e s u l t s of ranking based on the h a l f - o r d e r theory can be used as a p r e l i m i n a r y f i l t e r t o d i v i d e a set o f candidate s t r u c t u r e s i n t o two p o r t i o n s , one o f which has an extremely high p r o b a b i l i t y of c o n t a i n i n g the c o r r e c t s t r u c t u r e . To t h i s set o f top-ranked s t r u c t u r e s we can apply a more d e t a i l e d fragmentation theory t o make s p c i f i c p r e d i c t i o n s . We are developing computer-assisted methods to make use o f a rule-based theory. Table I . Ranking o f Mono-Ketoandrostanes Based on the Half-Order Theory. a

STRUCTURE (Keto Position)

RANKING

7 16 11 3 17 6 12 15 1 4

2 1 2 1 2 1 1 1 1 1

STRUCTURES WITH THE SAME SCORE

BETTER RANKED STRUCTURES

6 17,15 12 1,2,4 15,16

2,3,4,1

17,16 2,3,4 1,2,3

a) C o n s t r a i n t s used i n p r e d i c t i n g spectra i n c l u d e d Bonds/Step = 2, //Steps/Process = 1, H Transfers (-2 - 1 0 1 2 ) . 2) Rule-based theory. When the candidate s t r u c t u r e i s known to belong to a p r e v i o u s l y i n v e s t i g a t e d c l a s s o f compounds, then we can use a d d i t i o n a l information t o p r e d i c t a more p r e c i s e mass spectrum. This information i s i n the form o f s p e c i f i c fragmentation r u l e s . These r u l e s can be used to make p r e d i c t i o n s o f occurrence o f ions t o supplement or supplant the p r e d i c t i o n s o f the h a l f - o r d e r theory. We can a l s o use i n f o r m a t i o n about the frequency o f c o r r e c t a p p l i c a t i o n o f the r u l e i n the set o f compounds from which the r u l e s were developed. We c a l l t h i s information the confidence f a c t o r associated with the r u l e . Other important i n f o r m a t i o n i s the i n t e n s i t y range associated with the peaks which are p r e d i c t e d by a r u l e . We have the c a p a b i l i t y of p r e d i c t i n g mass s p e c t r a using a rule-based theory and have found that by t h i s approach we can

144

COMPUTER-ASSISTED S T R U C T U R E E L U C I D A T I O N

p r e d i c t a more accurate spectrum and get a b e t t e r ranking then w i t h the h a l f - o r d e r theory. D e t a i l s o f the methods and r e s u l t s of MSPRED using both rule-based and h a l f - o r d e r theory w i l l be presented s e p a r a t e l y *

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch009

Summary We have i l l u s t r a t e d a number o f approaches t o extend the concept o f computer-assisted s t r u c t u r e e l u c i d a t i o n beyond t h a t of simple s t r u c t u r e g e n e r a t i o n . We have i l l u s t r a t e d how chemical i n f o r m a t i o n together w i t h a computer program can a s s i s t chemists i n both planning p r i o r t o s t r u c t u r e generation and, subsequently, t e s t i n g o f candidates. In work described here, the chemist plays an i n t e g r a l part i n e f f e c t i v e use o f the problem-solving t o o l s we provide i n the form o f i n t e r a c t i v e programs* Literature

Cited

1. Munk, Μ. Ε., Sodano, C. S . , McLean, R. L., and H a s k e l l , H., J. Amer. Chem. Soc. (1967), 89, 4158. 2. S a s a k i , S . , Kudo, Y., O c h i a i , S . , and Abe, H . , Mikrochim. A c t a , (1971), 726. 3. Carhart, R. E., Smith, D. H., Brown, H., and D j e r a s s i , C., J. Amer. Chem. Soc. (1975), 97, 5755. 4. Cheer, C . , Smith, D. H . , D j e r a s s i , C . , Tursch, B., Braekman, J. C . , and Daloze, D . , Tetrahedron (1976), 2, 1807. 5. Feigenbaum, Ε. Α . , i n "Information Processing 68". North Holland P u b l i s h i n g C o . , Amsterdam, 1968. 6. Kwok, K.-S, Venkataraghavan, R . , and M c L a f f e r t y , F . W., J. Amer. Chem. Soc. (1973), 95, 4185. 7. Hertz, H . S . , H i t e s , R. Α . , and Biemann, Κ . , A n a l . Chem. (1971), 43, 681. 8. Yamasaki, T . , and S a s a k i , S . , J p n . A n a l . (1975), 213. 9. J e z l , Β. Α., and Dalrymple, D. L., A n a l . Chem. (1975), 47, 203. 10. Schwarzenbach, R., Meili, J., K o e n i t z e r , H . , and C l e r c , J. T . , Org. Mag. Resonance (1976), 8, 11. 11. Smith, D. H., Buchanan, B . G., Engelmore, R. S . , D u f f i e l d , Α. Μ., Yeo, Α., Feigenbaum, Ε. Α . , Lederberg, J., and D j e r a s s i , C . , J. Amer. Chem. Soc. (1972), 94, 5962. 12. Gray, Ν. Α. Β., A n a l . Chem. (1975), 4 7 , 2926. 13. C a r h a r t , R. E . , and Smith, D. H . , Computers and Chemistry (1976), 1, 79. 14. Gatti, G., Cardillo, R., Fuganti, C . , and Ghiringhelli, D . , Chem. Commun. (1976), 435. 15. Smith, D. H., and C a r h a r t , R. E., Tetrahedron (1976), 3 2 , 2513. 16. Varkony, T. H., C a r h a r t , R. E., and Smith, D. H . , i n "Computer-Assisted Organic S y n t h e s i s , " W. T. Wipke, E d . , T.

9.

CARHART E T A L .

Computer

Assistance for the Structural

Chemist

145

American Chemical Society, Washington, D. C. in press. 17. Smith, D. H., Buchanan, B. G., White, W. C., Feigenbaum, E. A., Lederberg, J., and Djerassi, C., Tetrahedron (1973), 29, 3117. 18. Buchanan, B. G., Smith, D. H., White, W. C., Gritter, R., Feigenbaum, E. A., Lederberg, J., and Djerassi, C., J. Amer. Chem. Soc. (1976),98,6168. 19. Smith, D. H., Buchanan, B. G., Engelmore, R. S., Adelcreutz, H., and Djerassi, C., J. Amer. Chem. Soc. (1973), 95, 6078.

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ch009

Ac kn o wl ed gme n t We wish to thank the National Institutes of Health, (RR 00612 and GM 20832), and the National Aeronautics and Space Administration (NGR 05-020-004) for their support of this research; and for the NIH support of the SUMEX computer f a c i l i t y (RR 00785) on which the CONGEN program i s developed, maintained and made available to the nationwide community of users.

INDEX

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ix001

A A 2 a n d A 3 ions 22 A c e t o x y groups 11 12/?-Acetoxy sandaracopimar-15-en8j8,lL*-diol 12 Actinobolamine 103 Actinobolin 100 A c y c l i c amines 59 A c y c l i c saturated hydrocarbons 77 Adamantyl 5 A d d i t i v i t y parameters 78 ALACON 110 Alanine-aspartic acid-alanine 52 A l g o r i t h m for c o m p l e x p e p t i d e mixtures, interpretive 21,24 Algorithm, predictive 53 A l l o c a t o r , candidate structures a n d flask assignment b y 140 Amines 78 A m i n o a c i d sequence of p o l y p e p t i d e s 21 AMSOM 48 A n a l y s i s (es) of c o m p o u n d , consecutive 119 of complex mixtures, comprehensive system for the 19 PRIMARY 110 program, N M R spectrum 41 SECONDARY 115 A n a l y t i c a l programs 27 Androstane 73 A O C ( a l l o c a t i o n of carbons) 112 ASSINC 110 Aspartic acid-alanine, alanine52 A s p a r t i c a c i d pentapeptide 52

B B a t c h - t y p e search 29,31 Benzphenanthrene 5 Best-matching compounds and their M F 11.0 values 12 B e t a sheets 46 Bis (2-chloroethyl) ether 7 Butterfat m e t h y l ester 13

C C -benzene 3

13

C (Continued) NMR c h e m i c a l shift range of component 121 d a t a analysis, s u r v i v e d c o m ponents of c o m p o u n d 1 through 116 data of c o m p o u n d Ill rules, automatically a c q u i r e d 58 spectra, c o m p u t e r i z e d structural predictions f r o m 77 spectra, data base of 59 spectral data analysis, general feature 110 spectrum analysis p r o g r a m 41 search system, h y d r o c a r b o n - b a s e d .. 86 spectra w i t h k n o w n multiplicities 85 normal 84 w i t h u n o b s e r v e d quaternary carbons 86 Carbon(s) allocation of ( A O C ) 112,113 h a v i n g C c h e m i c a l shifts 83 C spectra w i t h u n o b s e r v e d quaternary 86 nuclear magnetic resonance ( C N M R ) spectral search system 31 rules, a l p h a 73 for each signal, n u m b e r of ( C N S ) .. 112 treated as one carbon, e q u i v a l e n t . . . . 88 Carbonyl 5 ESTERS 135 group 96 C a r b o x y l i c acids 93 CASE network 94 description of 92 -draw, program 103 Cembranolide 130 Chemical i n f o r m a t i o n system ( C I S ) , NIH-EPA 26 a d d i t i o n of components to 27,30 components of 29 computers facilities u s e d b y 28 shift range of component, C N M R 121 shift table 118

1 3

7

1 3

1 3

1 3

C

c h e m i c a l shifts, carbons h a v i n g

83

COMPUTER-ASSISTED S T R U C T U R E E L U C I D A T I O N

148

C H E M I C S : Computer program system for structure e l u c i d a t i o n of organic c o m p o u n d s 108 - F , b l o c k d i a g r a m of 109 C h e m i s t s , n a t u r a l products 92 Cineole 105 COMPARE 84 C o m p a t i b i l i t y table 6 Component(s) of c o m p l e x mixtures b y G C - M S , identification of the 18 of c o m p o u n d s 1 t h r o u g h C N M R d a t a analysis, s u r v i v e d 116 selective 117 sets 124 Computer assistance for the structural chemist 126 assisted structure e l u c i d a t i o n 93 using automatically acquired C N M R rules 58 f r o m C N M R spectra 77 f o r organic c o m p o u n d s (CHEMICS) 108 of u n k n o w n mass spectra 1 facilities used b y the C I S 28 representation of a reaction 138 C o n f i d e n c e index ( K ) 2 C o n f o r m a t i o n a l analysis of molecules i n solution 42 graph-matching routine 131 structure generator 127 C o n s t r a i n e d structure generator ( C O N G E N ) program 6, 7 3 , 1 2 7 Constraints 97 interpretation, p l a n n i n g v i a 127 of p a l u s t r o l 140 Coronafacic acid 105 Coronatine 105 Crystal d a t a r e t r i e v a l system, x-ray 32 file, C a m b r i d g e 33 literature r e t r i e v a l system, x-ray 40 C r y s t a l l o g r a p h i c search system, x-ray 32 Cyclohexane ring 96

D e c a l i n , trans D e c a l i n s w i t h trans r i n g fusions, sample rules constructed f r o m .... D e h y d r a t i o n products of p a l u s t r o l D e h y d r o t r y p t o p h a n derivative 3,4-Dichlorofuran D i f f r a c t i o n r e t r i e v a l p r o g r a m , x-ray powder 3,4-Dimethyl-4-ethylheptane 3,3-Dimethylhexane D i s k - s t o r e d data bases

73 74 140 133 35 34 81 64 27

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ix001

1 3

E EDITREACT E m p i r i c a l rule formation E P A c h e m i c a l i n f o r m a t i o n system (CIS), N I H E q u a t i o n s , simultaneous linear Evaluation

134,136 60 26 117 142

1 3

F

1 3

D Data analysis, general feature of C spectral analysis, s u r v i v e d components of compounds 1 through C N M R base of C N M R spectra bases, disk-stored classes u s e d i n S T I R S , mass spectral INPUT processing 1 3

1 3

1 3

110 116 59 27 4 110 19

F a l s e positives ( F P ) F a t t y a c i d esters F l a s k assignment b y allocator F l a s k assignment of p a l u s t r o l FORTRAN F r a g m e n t p r o b e i n substrcuture search system F u n c t i o n a l groups, extension of h y d r o c a r b o n scheme to cover

2 11 140 140 78 36 89

G G C - M S , i d e n t i f i c a t i o n of the c o m ponents of complex mixtures b y .. 18 G C - M S systems configuration, basic .. 20 G l o b u l a r proteins, secondary structure of 46 G l y c e r y l tridodecanoate 15 G O O D L I S T interpretation 128 G O O D L I S T i t e m , construction of 130 G r a p h - m a t c h i n g routine, C O N G E N .. 131

H H a l f - o r d e r theory 141 r a n k i n g of mono-ketoandrostanes based o n the 143 Helices, alpha 46 Heuristics 69,126 Homologs, informational 124 Hydrocarbon(s) a c y c l i c saturated 77 based C search system 86 identified b y p r o g r a m M A T C H 89 scheme to cover f u n c t i o n a l groups, extension of 89 1 3

149

INDEX

M a t h e m a t i c a l m o d e l i n g system (MLAB) 41 MAXIMUM-RANGE 60 m/e values 21,142 M e t a - D E N D R A L program 63 Methine group 98 I M e t h y l ester, butterfat 13 IDENT 38 3-Methyl-4-ethylheptane 81 I n f o r m a t i o n theoretical a p p r o a c h to M e t h y l 10-methylnonadecanoate 14 the determination of secondary M e t h y l 3-oxonoadecanoate 15 M e t h y l 3,7-11,15-tetramethyl structure of g l o b u l a r proteins 46 hexadecanoic a c i d 11 I n f o r m a t i o n a l homologs 124 10-Methyl-trans-decalin 73 Interactive computing 26 M e t h y l e n e g r o u p 98 searches 2 9 , 3 1 M F 11.0 values i n S T I R S examination 12 MINIMUM-EXAMPLES 60 structure e l u c i d a t i o n 92 Interpretive a n d r e t r i e v a l system, Mixtures self-training ( S T I R S ) 1,3 analysis of complex ... 19 b y G C - M S , identification of the Ion components of c o m p l e x 18 A2 22 interpretive a l g o r i t h m for complex A3 22 Zl 22 peptide 21 I S O M E R ... ..... .79,80 M o l e c u l a r w e i g h t search i n C a m b r i d g e crystal file 33 Isomers, geometrical 79 M o l e c u l e assembler 96, 97 Isomers h a v i n g potentially n o n e q u i v a lent spectra, n u m b e r of 90 M o l e c u l e s i n solution, c o n f o r m a t i o n a l Isotopic l a b e l i n c o r p o r a t i o n deteranalysis of 42 mination ( L A B D E T ) 42 M o n e l l i n 21 Mono-ketoandrostanes 143 M o n o t e r p e n e f r o m a sea hare, K halogenated 99 K (confidence i n d e x ) 2 MREACT 136 MSPRED 132,139

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ix001

H y d r o x y steroids w i t h trans r i n g fusions 1-Hydroxyethyl group H y d r o x y l groups

74 96 11

L L A B D E T (isotopic l a b e l i n c o r p o r a t i o n determination) L i t e r a t u r e retrieval system, x-ray L i t e r a t u r e search system, mass spectrometry

N 42 38

M

M a n u a l structure e l u c i d a t i o n Mass spectra, computer-assisted structure identification of u n k n o w n spectral data classes used i n S T I R S spectral search system ( M S S S ) 29, spectrometry literature search system s p e c t r u m of n - p r o p y l p - h y d r o x y benzoate MATCH flowchart for p r o g r a m hydrocarbons identified b y p r o g r a m M a t c h factor ( M F ) M a t c h i n g system ( P B M ) , retrieval p r o b a b i l i t y based

N a t u r a l products chemists Networked machine N I H - E P A chemical information system ( C I S ) NMDR N M R , C (see C NMR) N u c l e a r Overhauser effect ( N O E ) 1 3

94

92 28 26 110

1 3

.85,112

O 1 4 30 38 9 86 87 89 3 1

O r g a n i c compounds, structure e l u c i d a t i o n of O v e r l a p process Oxazole

108 72 5

P Palustrol Paraffins Parameter a p p r o a c h P B M system (retrieval p r o b a b i l i t y based m a t c h i n g )

140 59 58 1

COMPUTER-ASSISTED STRUCTURE ELUCIDATION

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ix001

150

P B M system (Continued) e x a m i n a t i o n of u n k n o w n spectra of fatty a c i d esters 11 p r a c t i c a l use of 16 results o n u n k n o w n a n d r e s i d u a l spectra 9 s p e c t r u m subtraction 7 PEAK 95,105 Peak search i n M S S S 30 Peptide(s) approximations to the structure of the proteins, t h i r d - o r d e r 46 interpretive a l g o r i t h m 24 mixtures, interpretive a l g o r i t h m for complex 21 nth-order 49 Periodate 95 Phenyl 3 P l a n n i n g v i a constraint interpretation 127 P o l y a m i n o a l c o h o l derivative of a tetrapeptide, trimethylsilylated 23 P o l y p e p t i d e s , a m i n o a c i d sequence of 21 Precision 53-55 P r o b a b i l i t y based m a t c h i n g system, r e t r i e v a l (see P B M ) Probability weighting 1 P r o p y l ester 10 n - P r o p y l p-hydroxybenzoate 9 P r o t e i n sequence data tape 76 48 Proteins, secondary structure of globular 46 P r o t o n affinity retrieval p r o g r a m 41 PRUNE 132,136

R REACT 132,134,136 Recall ( R C ) 2,53-55 R e d u c t i o n of table entry of a l a n i n e aspartic a c i d - a l a n i n e 52 Reliability ( R L ) 2 Retention index 22 Retrieval p r o b a b i l i t y based m a t c h i n g system (see P B M ) p r o g r a m , p r o t o n affinity 41 p r o g r a m , x-ray p o w d e r d i f f r a c t i o n .. 34 system self-training 2 x-ray crystal data 32 x-ray crystal literature 40 Reverse search strategy 2 Ring cyclohexane 96 fusions, decalins a n d h y d r o x y steroids w i t h trans 74 p r o b e i n substructure search system 37 Rule(s) -based theory 143

Rule(s) (Continued) computer-assisted structure e l u c i d a t i o n u s i n g automatically acquired C N M R constructed f r o m decalins a n d h y d r o x y steroids w i t h trans r i n g fusions, sample form for structure e l u c i d a t i o n formation, empirical generation search seed 1 3

58

74 61 67 60 60 62 61

S Scan f u n c t i o n 19 Sea hare, halogenated monoterpene from a 99 Search (es) batch-type 29,31 i n C a m b r i d g e crystal file, space group a n d m o l e c u l a r w e i g h t .... 33 constructive substructure 128 heuristic 126 interactive 29, 31 rule 62 special properties 38 strategy, reverse 2 structure 67, 70 system c a r b o n nuclear m a g n e t i c resonance ( C N M R ) spectral 31 fragment p r o b e i n substructure .. 36 hydrocarbon-based C 86 mass spectrometry literature 38 ( M S S S ) mass spectral 29,30 ( S S S ) substructure 34 x-ray crystallographic 32 Seed r u l e 61 Selective components 117 Self-training interpretive a n d r e t r i e v a l system (see S T I R S ) SEPARATE 136 Sheets, b e t a 46 Shift ranges 60 of component, C N M R c h e m i c a l .. 121 Shift table, p r e p a r a t i o n of c h e m i c a l . . . . 118 Solution, c o n f o r m a t i o n a l analysis of molecules i n 42 Space g r o u p search i n C a m b r i d g e crystal file 33 S p e c i a l properties searches 38 Spectra of fatty a c i d esters, P B M / S T I R S examination of u n k n o w n 11 generation 81 a n d uniqueness tests 78 1 3

1 3

151

INDEX

Spectra

(Continued)

Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0054.ix001

PBM results on unknown and residual 9 reconstruction of the 21 Spectrum predictions for 3,3-dimethylhexane, partial 64 simulator 95 subtraction, PBM 7 procedure 2 SPECGEN 81,82 Stereochemistry, handling 73 Steroids with trans ring fusions, hydroxy 74 STIRS (self-training interpretive and retrieval system) 1,3 examination of 12/?-acetoxy san-

Structure(s) (Continued) of the proteins, third-order peptide approximations to the 46 search 67,70 selection 63 results of 66 Substructure control 100 identification 3 search, constructive 128 search system (SSS) 34,36,37 Subtraction procedure, spectrum 2 Superatoms 128 SURVEY 132,133 Survived components of compounds 1 through C NMR data analysis .. 116 13

daracopimar-15-en-8/?,lla-diol 12 T examination of unknown spectra of Terminals, typewriter 29 fatty acid ester 11 Terpenes in STIRS, unknown 10 mass spectral data classes used in .. 4 practical use of PBM 16 Testing via PRUNE, SURVEY, REACT, and MSPRED 132 results for the mass spectrum of 132 n-propyl p-hydroxybenzoate .... 9 Tests, direct 134 results on a river water extract 7 Tests, indirect unknown terpene 10 Tetrapeptide, trimethylsilylated polyaminoalcohol derivative of a 23 Structural chemist, computer assistance for the 126 Tripeptide conformations, possible . . . 5 0 , 5 5 Typewriter terminals 29 information available for unknown cembranolide 130 U possibilities 133 predictions from C NMR spectra, Uniqueness tests 78, 81 computerized 77 Unsaturation, a,/?98 Structure(s) application of SURVEY to test 133 W assignment by allocator candidate .. 140 Water extract, STIRS results on a elucidation 67 river 7 interactive 92 manual 94 Water supply, unknown from the Philadelphia drinking 8 of organic compounds 108 3 rule form for 67 Wiswesser line notation (WLN) using automatically acquired C X NMR rules, computerassisted 58 X-ray generated from each set of crystal data retrieval system 32 compounds 121 crystal literature retrieval system .... 40 generation 79 crystallographic search system 32 generator, CONGEN 127 powder diffraction retrieval of globular proteins, secondary 46 program 34 identification of unknown mass o-Xylene 35 spectra, computer-assisted 1 information for a dehydroZ tryptophan derivative 133 22 partial 96 Z l ion 13

13