COMPUTATIONAL BIOLOGY: NEW RESEARCH
ALONA S. RUSSE EDITOR
Nova Science Publishers, Inc. New York
Copyright © 2009 by Nova Science Publishers, Inc.
All rights reserved. No part of this book may be reproduced, stored in a retrieval system or transmitted in any form or by any means: electronic, electrostatic, magnetic, tape, mechanical photocopying, recording or otherwise without the written permission of the Publisher.

For permission to use material from this book please contact us:
Telephone 631-231-7269; Fax 631-231-8175
Web Site: http://www.novapublishers.com

NOTICE TO THE READER
The Publisher has taken reasonable care in the preparation of this book, but makes no expressed or implied warranty of any kind and assumes no responsibility for any errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of information contained in this book. The Publisher shall not be liable for any special, consequential, or exemplary damages resulting, in whole or in part, from the readers' use of, or reliance upon, this material. Any parts of this book based on government reports are so indicated and copyright is claimed for those parts to the extent applicable to compilations of such works. Independent verification should be sought for any data, advice or recommendations contained in this book. In addition, no responsibility is assumed by the publisher for any injury and/or damage to persons or property arising from any methods, products, instructions, ideas or otherwise contained in this publication. This publication is designed to provide accurate and authoritative information with regard to the subject matter covered herein. It is sold with the clear understanding that the Publisher is not engaged in rendering legal or any other professional services. If legal or any other expert assistance is required, the services of a competent person should be sought. FROM A DECLARATION OF PARTICIPANTS JOINTLY ADOPTED BY A COMMITTEE OF THE AMERICAN BAR ASSOCIATION AND A COMMITTEE OF PUBLISHERS.

LIBRARY OF CONGRESS CATALOGING-IN-PUBLICATION DATA
Russe, Alona S.
Computational biology : new research / Alona S. Russe.
p. cm.
ISBN 978-1-60876-545-4 (E-Book)
1. Computational biology. I. Title.
QH324.2.R87 2008
570.285--dc22
2008035979
Published by Nova Science Publishers, Inc.
New York
CONTENTS

Preface

Expert Commentary
Expressed Sequence Tags in Cancer Genomics
Vincent Navratil and Abdel Aouacheria

Short Commentaries

Short Commentary A
Protein Bioinformatics for Drug Discovery: Concavity Druggability and Antibody Druggability
Hiroki Shirai, Kenji Mizuguchi, Daisuke Kuroda, Haruki Nakamura, Shinji Soga, Masato Kobori and Noriaki Hirayama

Short Commentary B
A Linkage Disequilibrium-Based Statistical Approach to Discovering Interactions Among SNP Alleles at Multiple Loci Contributing to Human Skin Pigmentation Variation
Sumiko Anno and Takashi Abe

Short Commentary C
Sufficient Conditions for Exact Penalty in Constrained Optimization on Complete Metric Spaces
Alexander J. Zaslavski

Short Commentary D
How to Create a Computational Medicine Study
Viroj Wiwanitkit

Short Commentary E
Identifying Related Cancer Types
C. D. Bajdik, Z. Abanto, J. J. Spinelli, A. R. Brooks-Wilson and R. P. Gallagher

Research and Review Studies

Chapter 1
Sample Size Calculation and Power in Genomics Studies
Danh V. Nguyen, Damla Şentürk, Danielle J. Harvey and Chin-Shang Li

Chapter 2
Coupling Computational and Experimental Analysis for the Prediction of Transcription Factor E2F Regulatory Elements in the Human Gene Promoter
Kenichi Yoshida

Chapter 3
Solving a Stochastic Generalized Assignment Problem with Branch and Price
David P. Morton, Jonathan F. Bard and Yong Min Wang

Chapter 4
Reconstruction and Analysis of Large-Scale Phylogenetic Data: Challenges and Opportunities
Toni Gabaldón, Marina Marcet-Houben and Jaime Huerta-Cepas

Chapter 5
Chromatin Fiber: 30 Years of Models
Julien Mozziconacci and Christophe Lavelle

Chapter 6
Fast Modelling of Protein Structures Through Multi-level Contact Maps
Davide Baù, Ian Walsh, Gianluca Pollastri and Alessandro Vullo

Chapter 7
Coarse-Grained Structural Model of Protein Molecules
Kilho Eom and Sungsoo Na

Chapter 8
Differentiating Superficial and Advanced Urothelial Bladder Carcinomas Based on Gene Expression Profiles Analyzed Using Self-Organizing Maps
Phei Lang Chang, Ke Hung Tsui, Tzu Hao Wang, Chien Lun Chen and Sheng Hui Lee

Chapter 9
Full Sibling Reconstruction in Wild Populations from Microsatellite Genetic Markers
Mary V. Ashley, Tanya Y. Berger-Wolf, Isabel C. Caballero, Wanpracha Chaovalitwongse, Bhaskar DasGupta and Saad I. Sheikh

Chapter 10
Recent Issues and Computational Approaches for Developing Prognostic Gene Signatures from Gene Expression Data
Seon-Young Kim and Hyun Ju Kim

Chapter 11
Comparison of Φ-Values and Folding Time Predictions by Using Monte-Carlo and Dynamic Programming Approaches
Oxana V. Galzitskaya and Sergiy O. Garbuzynskiy

Chapter 12
Computational Methods for Protein Structural Class Prediction
Susan Costantini and Angelo M. Facchiano

Chapter 13
Fundamentals of Natural Computation in Living Systems
Abir U. Igamberdiev

Chapter 14
Extraction of Position-Sensitive Promoter Constituents
Yoshiharu Y. Yamamoto and Junichi Obokata

Chapter 15
Deconvolution of Positional Scanning Synthetic Combinatorial Libraries: Mathematical Models and Bioinformatics Approaches
Yingdong Zhao and Richard Simon

Chapter 16
Scripting of Molecular Structure Viewer for Data Analysis Using Lua Language Interpreter
Yutaka Ueno

Chapter 17
Computational Medicine Research in Hematology: A Study on Hemoglobin and Prothrombin Disorders
Viroj Wiwanitkit

Index
PREFACE

Expert Commentary - Expressed sequence tag (EST) databases are a well-established and continuously growing source for studying gene expression, alternative splicing, genome sequences, gene-associated polymorphisms and sequence homologies through bioinformatic approaches. Here, the authors examine recent efforts to identify and characterize cancer genes and tumor markers using ESTs. Limitations of EST mining strategies and directions for future research are also summarized.

Short Commentary A - The field of protein bioinformatics analyzes the sequence and structure of proteins; it plays a critical role in the discovery of small therapeutic agents as well as protein drugs. Here the authors present their recent progress in this field, including the new concepts of concavity druggability and antibody druggability, which are expected to raise the drug discovery success ratio.

Short Commentary B - Linkage disequilibrium (LD), the nonrandom association of alleles from different loci, can provide valuable information about the structure of human genome haplotypes. Because haplotype-based methods offer a powerful approach for disease gene mapping, this information may facilitate studies of the association between genomic variation and human traits. Single nucleotide polymorphism (SNP) alleles at multiple loci produce an LD pattern resulting from gene–gene interactions that can provide a foundation for developing statistics to detect other such interactions. Although several studies have used LD to address the role of gene interactions in various phenotypes and complex diseases, the current lack of formal statistics and the potential importance of data resulting from this research motivated the authors to develop LD-based statistics. The authors chose to examine skin pigmentation because it is a complex trait, and SNP alleles at multiple loci may play a role in determining normal variation in skin pigmentation. The main purpose of this chapter is to outline the development of LD-based statistics for detecting interactions among SNP alleles at multiple loci that contribute to variation in human skin pigmentation. To accomplish this, the authors developed a general theory to study LD patterns in gene-interaction trait models. They then developed a definition of gene interaction and a measure of interactions among SNP alleles at multiple loci contributing to the trait in the framework of LD analysis.

Short Commentary C - In this paper, the authors use the penalty approach to study two constrained minimization problems on complete metric spaces. A penalty function is said to have the generalized exact penalty property if there is a penalty coefficient for which approximate solutions of the unconstrained penalized problem are close enough to
approximate solutions of the corresponding constrained problem. In this paper they establish sufficient conditions for the generalized exact penalty property.

Short Commentary D - With the advent of computational research, several applications in science can be seen; applications in medicine are also documented. A computational medicine study can help answer a complicated medical query within a short period. How to create a computational medicine study is a common question from beginners. In this article, the author describes the steps for creating computational medicine research. Briefly, a process as simple as that used for in vivo and in vitro research can be followed. Setting up a conceptual framework based on a literature review is the first necessary step. Next, a suitable database and a tool for manipulating it must be selected. Simulation based on the designed framework can then help one reach the answer. These steps must be followed thoroughly to complete computational medicine research.

Short Commentary E - Background: Human cancer is often classified according to the anatomic site at which it occurs, and researchers are often taught that these cancer types are actually a spectrum of disease. A review in 2000 (Hanahan and Weinberg, Cell 2000, 100:57-70) reported that all cancers share six characteristics: (1) self-sufficiency in growth signaling, (2) the ability to ignore external anti-growth signals, (3) the ability to avoid apoptosis, (4) sustained angiogenesis, (5) the capacity for limitless reproduction and (6) the ability to invade tissue and spread to other anatomic sites. The authors' goal was to identify related cancer types using different observational strategies. Methods: The authors employed one method that used text-mining of online information about genes and disease. A second method used medical records of patients in British Columbia who were diagnosed with multiple cancer types between 1970 and 2004. A third method correlated Canadian provincial cancer rates for various cancer types. Results: Several pairs of related cancer types were identified using each method, although no pair was identified by all three strategies. The pairs of cancer types lung/bladder and lung/kidney were both identified by the text-mining and correlation studies. Esophageal cancer and melanoma were identified as related cancer types by both the analysis of patients with multiple primary cancers and the correlation study. Discussion: If cancer types are related, patients with one cancer might increase surveillance for other related cancer types, and drugs that are effective for treating one cancer might be successfully adapted for the related cancer types.

Chapter 1 - High-throughput laboratory measurement technologies, including microarrays for genomics and proteomics, are now typically used in biomedical studies ranging from animal models to human clinical trials. These methods, such as gene expression microarrays, aim to capture the global expression patterns of thousands of genes or proteins simultaneously. A common feature is the high dimensionality of the resulting data. Post-study analytical challenges involve methods to extract the meaningful information from the millions of data points. However, there is also a need to develop systematic approaches to planning such studies. In this work, the authors provide a synthesis of available methods and current trends in sample size and power analysis for genomics studies. Their emphasis will be on clarifying the assumptions of the available methods as well as their applicability in practice, including the assumption of independent gene expression. The authors also emphasize emerging sample size design methods that focus on the false discovery rate (FDR) rather than the traditional familywise error rate as a criterion.

Chapter 2 - Completion of human genome sequencing has provided us with opportunities to understand the molecular complexity of the human body. The transcriptional
regulatory circuits of gene expression are among the most promising matters to be resolved by exploring the human genome sequence. The authors have been interested in human cell fate regulated by the transcription factor E2F. To accelerate the investigation, the authors need a strategy that can efficiently identify E2F target genes. Basically, their approach is to combine computational and experimental analysis. Annotated gene expression profiles deposited in public databases and knowledge accumulated in the published literature are a treasure-house of candidate E2F target genes. Next, promoter regions defined from information on the transcriptional start site can be used for motif searches for E2F regulatory elements. Finally, the set of predicted E2F regulatory elements is tested by molecular biological and biochemical assays. In this chapter, the author gives a basic introduction to this combined computational and experimental strategy for the prediction of transcription factor E2F regulatory elements in the human gene promoter. In addition, recent progress in unraveling E2F functions achieved by genome-wide approaches is discussed.

Chapter 3 - In this chapter, the authors investigate the generalized assignment problem with the objective of finding a minimum-cost assignment of jobs to agents subject to capacity constraints. A complicating feature of the model is that the coefficients for resource consumption and capacity are random. The problem is formulated as a stochastic integer program with a penalty term associated with the violation of the resource constraints and is solved with a branch-and-price algorithm that combines column generation with branch and bound. To speed convergence, a stabilization procedure is included. The performance of the methodology was tested on four classes of randomly generated instances. The principal results showed that the value of the stochastic solution (VSS), i.e., the gap between the stochastic solution and the expected value solution, was 35.5% on average. At the root node of the search tree, it was found that the linear programming relaxation of the master problem associated with column generation provided a much tighter lower bound than the relaxation of the original constraint-based formulation. In fact, two thirds of the test problems evidenced no gap between the optimal integer solution and the relaxed master problem solution. Additional testing showed that (1) initializing the master problem with a feasible solution outperforms the classical big-M approach; (2) SOS type 1 branching is superior to single-variable branching; and (3) variable fixing based on reduced costs provides only a slight decrease in runtimes.

Chapter 4 - The analysis of the evolutionary relationships among biological sequences, known as phylogenetics, constitutes one of the most powerful tools of computational biology. Besides its classical use to ascertain the evolution of a group of species, phylogenetics has many other applications, such as predicting the function of a protein and detecting genes under specific selective constraints. The advent of the genome era has brought about the possibility of extending such analyses to larger sets comprising thousands of sequences from complete genomes. The use of whole genomes, rather than reduced sets of genes or proteins, opens the door to a wide range of new possibilities. On the other hand, it poses many conceptual and technical challenges that require the development of new algorithms to interpret and manipulate large-scale phylogenetic data. Here the authors survey recent progress in the development of automated pipelines to reconstruct and analyze large collections of phylogenetic trees and provide some examples of how they have been used to address important biological questions.
Chapter 5 - A thorough understanding of the electrostatic, elastic and topological behaviour of DNA has provided some relevant mechanistic insights into the regulation of genetic expression. Although this approach has proved valuable for the study of many biological processes, it is limited by the simple description level represented by DNA. Indeed, genomic DNA in eukaryotic cells is basically divided into chromosomes, each consisting of a single huge chromosomal fiber, hierarchically supercoiled. Since this organisation plays a critical role in all processes involved in DNA metabolism, tremendous efforts have been made to build relevant models of chromatin structure and dynamics. Namely, by shifting from a DNA (as a simple molecular polyelectrolyte) point of view to a chromatin (as a polymorphic supramolecular nucleoprotein complex) point of view, we should move towards a more efficient mechanistic framework in which the control of genetic expression and other DNA metabolism processes can be interpreted. This review gives a historical overview of the progress made in this field during the last 30 years and discusses what the most challenging outcomes are now.

Chapter 6 - The authors present an algorithm to reconstruct protein Cα traces from 4-class distance maps and benchmark it on a non-redundant set of 258 proteins of length between 51 and 200 residues. They first represent proteins as contact maps and show that, even when exact maps are available, often only low-quality models can be obtained. The authors then adopt a more powerful simplification of distance maps: multi-class contact maps. They show that reconstructions based on 4-class native maps are significantly better than those from binary maps. Furthermore, the authors build two predictors of 4-class maps based on recursive neural networks: one ab initio, relying on the sequence and on evolutionary information, and one in which homology information is provided as a further input, showing that even very low sequence similarity to PDB templates yields more accurate maps than the ab initio predictor. They reconstruct Cα traces based on both ab initio and homology-based 4-class map predictions. The authors show that homology-based predictions are generally more accurate than ab initio ones even when homology is dubious.

Chapter 7 - Understanding protein mechanics is a prerequisite for gaining insight into a protein's biological functions, since most proteins perform their functions through the structural deformations known as conformational changes. Such conformational changes have been computationally delineated by atomistic simulations, although the mechanics of large protein structures remains computationally inaccessible to atomistic simulations such as molecular dynamics. In the last decade, normal mode analysis with coarse-grained modeling of protein structures has become a computational alternative to atomistic simulations for understanding large protein mechanics. In this review, the authors delineate the current state of the art in coarse-grained modeling of proteins for normal mode analysis. Specifically, pioneering coarse-grained models such as the Gō model and the elastic network model, as well as recently developed coarse-grained elastic network models, are summarized and discussed for understanding large protein mechanics.

Chapter 8 - The aim of this study was to differentiate between superficial and advanced bladder cancers by analyzing the gene expression profiles of these tumors using self-organizing maps (SOMs). The authors also used the GoMiner software for the biological interpretation of 473 interesting genes. Materials and Methods: Between December 2003 and November 2004, 17 patients with urothelial bladder cancers who were admitted to the Chang Gung Memorial Hospital for
transurethral resection of the tumor were included in this study. The gene expression data comprised 7400 cDNAs in 17 arrays. The software GeneCluster 2.0 was used for analyzing the gene expression data using SOMs. The authors used a 2-cluster SOM to automatically cluster a set of 17 tissue samples into superficial and advanced bladder cancers based on the gene expression patterns. The authors also used the GoMiner software for the biological interpretation of the top 473 interesting genes. Results: Patients included 11 males and 6 females. Pathological studies confirmed the presence of superficial tumors in 9 patients and advanced tumors in 8 patients. Of the 7400 genes analyzed, 473 genes showed significant changes in their expression: 268 were up-regulated and 205 were down-regulated. Using the top 473 genes, SOMs were used to differentiate between the gene expression patterns of superficial and advanced bladder cancer tissue samples. The patient tissue samples were clustered into 2 groups, namely superficial and advanced bladder cancers, comprising 10 and 7 samples, respectively. Only one patient tissue sample with advanced bladder cancer was clustered into the superficial bladder cancer group. This analysis had a high accuracy rate of 94% (16/17). The top 473 genes were also classified into biologically coherent categories by the GoMiner software. The results revealed that 452, 435, and 452 genes were associated with biological processes, cellular components, and molecular functions, respectively. Conclusion: Based on these results, the authors believe that superficial and advanced urothelial bladder cancers can be differentiated by their gene expression profiles analyzed by SOMs. The SOM method may be used in microarray data analysis to distinguish tumor stages and predict clinical outcomes. Genes that are uniquely expressed in either stage of bladder cancer can be considered possible candidate biomarkers.

Chapter 9 - New technologies for collecting genotypic data from natural populations open the possibility of investigating many fundamental biological phenomena, including behavior, mating systems, heritabilities of adaptive traits, kin selection, and dispersal patterns. The power and potential of genotypic information often rest in the ability to reconstruct genealogical relationships among individuals. These relationships include parentage, full and half-sibships, and higher-order aspects of pedigrees. Some areas of genealogical inference, such as parentage, have been studied extensively. Although methods for pedigree inference and kinship analysis exist, most make assumptions that do not hold for wild populations of animals and plants. In this chapter, the authors focus on the full sibling relationship and first review existing methods for full sibship reconstruction from microsatellite genetic markers. The authors then describe their new combinatorial methods for sibling reconstruction based on simple Mendelian laws, including an extension that works even in the presence of errors in the data. They also describe a generic consensus method for combining sibling reconstruction results from other methods. They present an experimental comparison of the best existing approaches on both biological and simulated data. The authors discuss the relative merits and drawbacks of existing methods and suggest a practical approach for reconstructing sibling relationships in wild populations.

Chapter 10 - Microarray gene expression profiling, which monitors the expression of tens of thousands of genes simultaneously, is a promising tool for developing prognostic markers for cancer patients. Many researchers have applied microarray gene expression profiling to develop better prognostic markers and have demonstrated promising results in many types of cancer. Unfortunately, there are concerns regarding the premature clinical use of newly developed prognostic gene signatures, as problems associated with their application
remain unresolved, diminishing the reliability of their intended results. This review first discusses these presently unsolved problems in the development of prognostic gene signatures. Recent computational approaches to circumventing these problems are then presented, and the authors discuss these approaches in the categorized framework of mechanism-derived bottom-up approaches, meta-analytic approaches, integrative approaches that combine genomic and clinical data, and subtype-specific analysis approaches. The authors believe that recent bioinformatics approaches, which integrate rapidly accumulating genomic, clinical, and other forms of data, will help overcome current problems and realize the successful application of prognostic gene signatures in personalized medicine.

Chapter 11 - The authors calculate folding times and explore the transition-state ensembles for ten proteins with known experimental data at the point of thermodynamic equilibrium between the unfolded and native states, using a Monte Carlo Gō model and a dynamic programming approach in which each residue is considered to be either folded as in the native state or completely disordered. The order of events in folding simulations has been explored in detail for each of the proteins. The folding times for the ten proteins that reach the native state within a limit of 10^8 Monte Carlo steps correlate well with experimentally measured folding times at the mid-transition point (correlation coefficient 0.71). A lower correlation was obtained with the dynamic programming approach (correlation coefficient 0.53). Moreover, Φ-values calculated from the Monte Carlo simulations for the ten proteins correlate with experimental data (correlation coefficient 0.41) at practically the same level as Φ-values calculated with the dynamic programming approach (correlation coefficient 0.48). The model provides good predictions of folding nuclei for proteins whose 3D structures have been determined by X-ray crystallography, and exhibits more limited success for proteins whose structures have been determined by NMR.

Chapter 12 - The structural class of a given protein represents the first level in the hierarchical structural classification. Its knowledge starts the progressive identification of the next levels, which allows one to relate the protein to a family in evolutionary as well as functional terms. A number of computational methods have been proposed to predict the structural class from primary sequences. Most of the prediction methods use simple sequence representations such as composition vectors and polypeptide composition, or more advanced representations that combine physico-chemical properties and sequence composition. Moreover, different classification algorithms, including neural networks, rough sets and logistic regression, and complex classification models, such as ensembles, bagging and boosting, have recently been used. However, the accuracy of all these methods is strongly affected by sequence similarity. Some algorithms were tested on small datasets with a high percentage of sequence identity, which results in overestimated prediction accuracy. On the other hand, low-similarity sequences pose a substantial challenge. The main aim of this paper is to present the state of the art in this field, describing some methods developed in recent years for the prediction of the protein structural class and underlining the need to use protein datasets of varying similarity and new testing procedures in order to correctly evaluate the quality and accuracy of new prediction methods.

Chapter 13 - The computational process is based on the fundamental semiotic activity linking mathematical equations to the materialized physical world. Its limits are defined by the set of imposed physical values constituting the background structure of the Universe.
Computation becomes a direct consequence of semiosis in the case where the arbitrariness of semiotic signs is strictly defined in the semiotic context. This results in the emergence of an internal formal structure of the system that can be modeled and computed. The authors consider formalization of Peircean semiotics in the framework of the Peirce algebra as a prerequisite for understanding the fundamentals of computation. In reproducible semiotic structures such as biological entities, the factor that makes these systems "closed to efficient causation" (in Robert Rosen's sense) is a basic element of the Peirce algebra that provides semantic closure to the system via introduction of the parameter of organizational invariance. Approaches to defining this factor as a set of dual components, related both to relations and to sets, are discussed within the semiotic interpretation of quantum mechanics, where actualization is explained as a signification process within the network of quantum measurements. In this framework, enzymes are molecular automata, the set of which maintains the highly ordered, robust coherent state of the quantum computer, and the genome concatenates error-correcting codes into a single reflective set that can be described by the Peirce algebra. Biological evolution can be viewed as a functional unfolding of the organizational invariance constraints in which new limits of iteration emerge, possessing criteria of perfection and having selective values.

Chapter 14 - Extraction of functional sequences from promoters has been achieved with alignment-based motif search algorithms, but recently several new approaches have been developed. In this review, the authors introduce a methodology called LDSS (Local Distribution of Short Sequences) analysis. This approach evaluates the distribution profiles of short sequences along the promoter region, and sequences that preferentially appear at specific promoter regions are extracted. Application of this strategy to human, mouse, Drosophila, C. elegans, Arabidopsis, rice and yeast successfully extracted both well-known promoter elements and novel putative elements. The method is so sensitive that various kinds of minor elements can be detected simultaneously by analyzing all the promoters of a genome as one batch. However, LDSS analysis does not detect all the elements involved in transcriptional regulation; position-insensitive elements are beyond its scope. Because the analysis requires no microarray data, it can be applied to a wide range of species beyond the model species.

Chapter 15 - Combinatorial peptide library technology has proven to be a powerful approach to both T-cell epitope determination and the analysis of TCR specificity and degeneracy. During the past ten years, the authors have developed mathematical models and bioinformatics approaches for the deconvolution of positional scanning synthetic combinatorial libraries (PS-SCL). PS-SCL are composed of trillions of peptides systematically arranged in mixtures with defined amino acids at each position. Starting from mathematical model building to deconvolute the spectrum of PS-SCL, the authors proposed a biometrical approach using a score matrix to systematically search protein databases for putative antigenic peptide candidates. The authors then evaluated the assumption of independent contributions of the amino acid side chains in peptides and applied more sophisticated machine learning algorithms to improve prediction accuracy based on synthesized peptide data. Finally, they implemented the above approach in a web-based tool for searching protein databases for T-cell epitopes based on experimental data from PS-SCL; the website employs a strong statistical analysis package, a relational database and Java applets. The authors' work has provided a sound basis for PS-SCL data analysis and has proven efficient and successful for identifying target antigens and highly active peptide mimics.
Chapter 16 - To improve the flexibility and extensibility of application programs, a method to support scripting by embedding a Lua language interpreter is described. Using this approach, variations of input data and parameters for calculations are supported by a script file without rewriting the application program. For instance, script information can be supplied to the program, and internal data can be exported to a script for extended calculations. This chapter summarizes the basic framework for embedding this scripting language to interact with existing code using the application programming interface provided by Lua. The method was applied to the molecular structure viewer program MOSBY to support additional visualizations and calculations from atomic coordinate data by script programs. Atomic structure data in the original C structure were mapped to a Lua script by using a mechanism called "metamethods" in Lua. In addition, the table data type in Lua provides a simple database useful for configuring molecular graphics. The design of a "domain-specific language" in biocomputing is discussed with reference to other scripting languages.

Chapter 17 - At present, the third wave of medical experiments, in silico or computational simulation, is accepted as a powerful tool to drive the medical community into the new post-genomics phase. Computational biology research is an important facet of bioinformatics. In this article, the author presents the concept of, and shares experience in, computational hematology research. Cases on hemoglobin and prothrombin disorders are demonstrated. Briefly, computational research can help in understanding the genome, proteome and expression of hemoglobin and prothrombin disorders.
In: Computational Biology: New Research Editor: Alona S. Russe
ISBN: 978-1-60692-040-4 © 2009 Nova Science Publishers, Inc.
Expert Commentary
EXPRESSED SEQUENCE TAGS IN CANCER GENOMICS

Vincent Navratil 1 and Abdel Aouacheria 2

1 INRA, Université de Lyon 1, Ecole Nationale Vétérinaire de Lyon, UMR754, IFR 128 Biosciences, Lyon, F-69007, France
2 Apoptosis and Oncogenesis Laboratory, IBCP - Institut de Biologie et Chimie des Protéines, UMR 5086 CNRS/UCBL - Université Claude Bernard Lyon 1, IFR128, 7 passage du Vercors, Lyon, F-69367, France
Abstract

Expressed sequence tag (EST) databases are a well-established and continuously growing source for studying gene expression, alternative splicing, genome sequences, gene-associated polymorphisms and sequence homologies through bioinformatic approaches. Here, we examine recent efforts to identify and characterize cancer genes and tumor markers using ESTs. Limitations of EST mining strategies and directions for future research are also summarized.
Introduction

The purpose of this short commentary is to present a snapshot of the current use of Expressed Sequence Tags (ESTs) in modern cancer research. ESTs are short (usually about 200-600 bp), single-pass DNA sequences, each corresponding to a fragment of a complementary DNA (cDNA) molecule that may be expressed in a cell at a particular time. ESTs reflect the transcriptome of the material from which they were generated. To date (April 2008), the human subset of the EST database (dbEST) contains more than 8 million sequences - having doubled in size since 2002, see Figure 1 - representing more than 100 different tissues and cell types. As exemplified herein, groups of genes or mutations that are characteristic of cancer cells (and may also be driving the disease process) can be identified using EST-based approaches. The preferred approach is to identify biomarkers from tissue-wide or genome-wide screening and then focus on a limited subset of candidates for further validation or in-depth investigation.
Figure 1. Number of human Expressed sequence tags (ESTs) from dbEST (1992-2008). This plot shows the rapid growth of human EST data.
Gene expression

Since EST clone frequency is roughly proportional to the corresponding gene expression level in a given tissue, ESTs are useful for profiling genes expressed in various tissues, cell types or developmental stages. Based on EST frequencies from different cDNA libraries, one of the many interesting applications of the publicly available EST databases is therefore 'expression profiling', i.e. identifying the various mRNAs expressed in one or more biological samples (from cancer and non-cancer tissues). A key benefit of EST profiling techniques is to allow 'computer-based differential display' (also referred to as 'digital differential display', 'in silico subtraction' or 'electronic northern'). During the past decade, this procedure was used by several groups (Dahl et al., 2005; Grutzmann et al., 2003; Schmitt et al., 1999) and international initiatives like CGAP (the Cancer Genome Anatomy Project) (Lal et al., 1999) to identify transcripts preferentially expressed or repressed in the tumor context by comparing selected cancerous libraries (present in dbEST) against 'control' libraries. Care should be taken in such comparisons to use non-normalized, non-subtracted and reasonably sized EST libraries to prevent artifactual gene expression profiles. Mindful of these considerations, analysis of EST data has proven to be an effective method of identifying and characterizing genes expressed in a variety of malignancies including prostate, breast, ovarian, colorectal and gastric cancer (Asmann et al., 2002; Chakrabarti et al., 2002; Dahl et al., 2005; Kim et al., 2004; Lu et al., 2006; Shen et al., 2005), as well as in the tumour endothelium (Herbert et al.,
2008). A number of attempts have also been made to apply in silico transcriptomics to genome-wide and multi-tissue screening of cancer genes (Aouacheria et al., 2006; Baranova et al., 2001; Brentani et al., 2003; Campagne and Skrabanek, 2006; Helftenbein et al., 2008; Scheurle et al., 2000). Other reports focused on the discovery of novel splice forms in tumor cells that are distinct from the predominant forms in normal tissues (Xu and Lee, 2003) or on cancer-specific transposable element fusion transcripts (Kim et al., 2007). Given the increasing availability of transcribed sequences (Figure 1), we predict growth in the number of EST mining projects for cancer research.
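To make the count-based logic of digital differential display concrete, the following minimal Python sketch (our own, not taken from any of the tools cited above; the gene symbols and tag counts are invented) flags genes whose EST tag frequency differs between pooled cancer and normal libraries using Fisher's exact test. Published pipelines rely on related count statistics and apply multiple-testing corrections, which this toy example omits.

```python
# A minimal sketch of 'electronic northern' / digital differential display,
# assuming pooled EST tag counts per gene. Gene names and counts are invented.
from scipy.stats import fisher_exact

def digital_differential_display(gene_counts, cancer_total, normal_total, alpha=0.05):
    """gene_counts maps gene -> (tag count in cancer pool, tag count in normal pool).
    Returns genes whose relative EST frequency differs between the two pools."""
    hits = []
    for gene, (c, n) in gene_counts.items():
        # 2x2 contingency table: this gene's tags vs. all other tags in each pool
        table = [[c, cancer_total - c], [n, normal_total - n]]
        _, p = fisher_exact(table)
        if p < alpha:
            direction = "up" if c / cancer_total > n / normal_total else "down"
            hits.append((gene, direction, p))
    return sorted(hits, key=lambda h: h[2])

# Hypothetical tag counts drawn from two library pools of 50,000 tags each
counts = {"MYC": (42, 7), "GAPDH": (120, 118), "BCL2L10": (1, 15)}
print(digital_differential_display(counts, cancer_total=50000, normal_total=50000))
```

In practice, the same comparison would be run over every UniGene cluster, which is why control of the false discovery rate matters far more than in this three-gene toy.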
SNPs

EST resources can also be exploited to reveal genetic variation within genes. A Single Nucleotide Polymorphism (SNP) is a DNA sequence variation occurring when a single nucleotide - adenine (A), thymine (T), cytosine (C) or guanine (G) - in the genome differs between individuals, or between homologous chromosomes within an individual. SNPs are the most frequent variation in the human genome, occurring every ~2000 bp throughout the genome (Sachidanandam et al., 2001). SNPs can be identified directly from alignments of ESTs sequenced from different alleles (Huntley et al., 2006; Picoult-Newberg et al., 1999). The in silico procedure identifies SNPs where a discrepancy occurs at the same base call in multiple EST sequences, assuming that redundant discrepancies represent actual SNPs rather than simple sequencing errors. Using EST sequences from cancer EST libraries and from normal tissue EST libraries, genetic variants associated with cancer (i.e. those that are statistically over-represented in ESTs derived from cancerous libraries) can be detected (Buetow et al., 1999; Clifford et al., 2000; Irizarry et al., 2000). In a recent series of genome-wide in silico analyses, ESTs have been used to predict SNPs related to the cancer phenotype (Aouacheria et al., 2005; Qiu et al., 2004), including SNPs located in microRNAs (Yu et al., 2007) or in the 5'/3' untranslated regions of transcripts (Aouacheria et al., 2007) that exhibit an aberrant allele frequency in tumours. EST sequence analysis has also tentatively detected genes with a signature of positive selection in tumours (genes with a significant excess of non-synonymous over synonymous substitutions in cancer ESTs compared to normal ESTs) (Babenko et al., 2006). Although ESTs could also be mined for small insertion/deletion events (indels), no study has yet reported candidate cancer-associated indels in human EST data.
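The redundancy criterion lends itself to a simple illustration. The sketch below (our own, under the simplifying assumption of a gap-padded, equal-length EST alignment; the alignment and threshold are illustrative) reports a column as a candidate SNP only when two or more distinct bases are each supported by a minimum number of independent reads, so isolated discrepancies, which are likely sequencing errors, are ignored. Production tools additionally exploit base-call quality scores and library provenance.

```python
# A toy sketch of redundancy-based SNP calling from an EST alignment.
from collections import Counter

def candidate_snps(aligned_ests, min_support=2):
    """aligned_ests: equal-length EST strings ('-' marks gaps / no coverage).
    Yields (column, {base: read count}) for putative polymorphic positions."""
    for pos in range(len(aligned_ests[0])):
        bases = Counter(seq[pos] for seq in aligned_ests if seq[pos] in "ACGT")
        supported = {b: k for b, k in bases.items() if k >= min_support}
        if len(supported) >= 2:  # redundant discrepancy -> candidate SNP
            yield pos, supported

ests = ["ACGTAC", "ACGTAC", "ACATAC", "ACATAC", "ACGTCC"]  # toy alignment
for pos, alleles in candidate_snps(ests):
    print(f"candidate SNP at column {pos}: {alleles}")
```

On the toy alignment, only the third column is reported (G supported by three reads, A by two); the single discrepant C in the last read is discarded as a presumed sequencing error.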
Databases

A number of bioinformatics databases and web servers have been developed for performing digital expression analysis across normal and cancer tissues based on EST data. Aside from the NCI Cancer Genome Anatomy Project (http://cgap.nci.nih.gov), there are many websites with searchable databases, including GeneHub-GEPIS (Zhang et al., 2007), DigiNorthern (Wang and Liang, 2003), Digital Differential Display (DDD) (http://www.ncbi.nlm.nih.gov/UniGene/info_ddd.html), cDNA Digital Gene Expression Displayer (DGED) (http://cgap.nci.nih.gov/Tissues/GXS) and xProfiler (http://cgap.nci.nih.gov/Tissues/xProfiler). Candidate SNPs identified in human EST data can be picked up
using CASCAD (Guryev et al., 2005), QualitySNP (Tang et al., 2006), HaploSNPer (Tang et al., 2008) and DigiPINS (Navratil et al., 2008); the latter tool additionally provides access to polymorphisms related to cancer.
Advantages and Limitations

As they take advantage of pre-existing sequence resources generated for gene discovery rather than marker discovery, EST projects are extremely cost-efficient. Furthermore, ESTs are easily produced and, since they represent coding sequences, they directly identify genes of interest. Another advantage of the EST-based approach is that it can identify common sequence variations within the human genome and, increasingly, uncommon sequence variations as well, as the number of cDNA libraries continues to grow. Moreover, ESTs provide the opportunity to detect genetic variation and to profile expression for nearly all genes, either annotated or entirely novel (i.e. predicted) (Krukovskaja et al., 2005), in a quantitative and straightforward way. General limitations associated with the use of EST databases include poor sequencing depth of the libraries, differences in library sizes, and uncertainty concerning the origin of the samples. One related problem is that many cell types are often pooled together during the preparation of EST libraries. As more EST data become available, it will be possible to perform expression profiling in more detailed tissue subtypes. ESTs (as single-pass reads) are also prone to sequencing errors. Moreover, all EST-based strategies are biased towards the 3' ends of transcripts and towards highly expressed genes. Despite the obvious benefits of cost and time savings, it is clear that laboratory techniques (such as real-time RT-PCR, western blotting and site-directed mutagenesis) are required to validate computational predictions. The results obtained through the use of ESTs may also be compared to results from other experimental platforms such as SAGE and microarrays, despite their own limitations and biases. Thus, EST libraries should be considered a starting point for detecting differential expression and DNA sequence polymorphisms.
Conclusions

After candidate genes or SNPs have been identified via computational methods from the ocean of available sequence data, the next step is to use bioinformatics to further down-select the data. One route is to use functional annotations of the candidate set to dissect biological pathways, functional groups and molecular modules of interest. As we move toward such a "systems biology" approach in cancer genomics, the integration of logical databases such as GO, KEGG, REACTOME and OMIM will be of vital importance. The clear goal is not only to identify biomarkers, but also to identify biologically relevant patterns in the data that can serve to elucidate the underlying molecular mechanisms and pathways of cancer. Ultimately, an understanding of the molecular behaviour of tumours would aid their molecular classification and also provide a basis for future patient-specific therapeutic approaches.
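As one illustration of this annotation-driven down-selection step, the sketch below scores functional categories for over-representation in a candidate list with a one-sided hypergeometric test. Everything in it (gene symbols, GO terms, background size) is hypothetical; real analyses use curated GO/KEGG annotations and correct for multiple testing.

```python
# Illustrative sketch of functional-category enrichment over a candidate set.
from scipy.stats import hypergeom

def term_enrichment(candidates, annotations, background_size):
    """annotations: {term: set of background genes carrying that term}.
    Returns (term, p-value) pairs, most enriched first."""
    cand = set(candidates)
    scored = []
    for term, genes in annotations.items():
        k = len(cand & genes)
        if k == 0:
            continue
        # P(X >= k) when drawing len(cand) genes from the background
        p = hypergeom.sf(k - 1, background_size, len(genes), len(cand))
        scored.append((term, p))
    return sorted(scored, key=lambda t: t[1])

annotations = {"GO:0006915 apoptotic process": {"BCL2", "BAX", "CASP3"},
               "GO:0007049 cell cycle": {"CCND1", "CDK4", "TP53"}}
print(term_enrichment(["BAX", "CASP3", "TP53"], annotations, background_size=20000))
```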
Acknowledgement

This work was supported by grants from La Ligue Contre le Cancer (Comités de la Drôme et du Rhône).
References

Aouacheria, A., Navratil, V., Barthelaix, A., Mouchiroud, D. and Gautier, C. (2006) Bioinformatic screening of human ESTs for differentially expressed genes in normal and tumor tissues. BMC Genomics, 7, 94.
Aouacheria, A., Navratil, V., Lopez-Perez, R., Gutierrez, N.C., Churkin, A., Barash, D., Mouchiroud, D. and Gautier, C. (2007) In silico whole-genome screening for cancer-related single-nucleotide polymorphisms located in human mRNA untranslated regions. BMC Genomics, 8, 2.
Aouacheria, A., Navratil, V., Wen, W., Jiang, M., Mouchiroud, D., Gautier, C., Gouy, M. and Zhang, M. (2005) In silico whole-genome scanning of cancer-associated nonsynonymous SNPs and molecular characterization of a dynein light chain tumour variant. Oncogene, 24, 6133-6142.
Asmann, Y.W., Kosari, F., Wang, K., Cheville, J.C. and Vasmatzis, G. (2002) Identification of differentially expressed genes in normal and malignant prostate by electronic profiling of expressed sequence tags. Cancer Res, 62, 3308-3314.
Babenko, V.N., Basu, M.K., Kondrashov, F.A., Rogozin, I.B. and Koonin, E.V. (2006) Signs of positive selection of somatic mutations in human cancers detected by EST sequence analysis. BMC Cancer, 6, 36.
Baranova, A.V., Lobashev, A.V., Ivanov, D.V., Krukovskaya, L.L., Yankovsky, N.K. and Kozlov, A.P. (2001) In silico screening for tumour-specific expressed sequences in human genome. FEBS Lett, 508, 143-148.
Brentani, H., Caballero, O.L., Camargo, A.A., da Silva, A.M., da Silva, W.A., Jr., Dias Neto, E., Grivet, M., Gruber, A., Guimaraes, P.E., Hide, W., Iseli, C., Jongeneel, C.V., Kelso, J., Nagai, M.A., Ojopi, E.P., Osorio, E.C., Reis, E.M., Riggins, G.J., Simpson, A.J., de Souza, S., Stevenson, B.J., Strausberg, R.L., Tajara, E.H., Verjovski-Almeida, S., Acencio, M.L., Bengtson, M.H., Bettoni, F., Bodmer, W.F., Briones, M.R., Camargo, L.P., Cavenee, W., Cerutti, J.M., Coelho Andrade, L.E., Costa dos Santos, P.C., Ramos Costa, M.C., da Silva, I.T., Estecio, M.R., Sa Ferreira, K., Furnari, F.B., Faria, M., Jr., Galante, P.A., Guimaraes, G.S., Holanda, A.J., Kimura, E.T., Leerkes, M.R., Lu, X., Maciel, R.M., Martins, E.A., Massirer, K.B., Melo, A.S., Mestriner, C.A., Miracca, E.C., Miranda, L.L., Nobrega, F.G., Oliveira, P.S., Paquola, A.C., Pandolfi, J.R., Campos Pardini, M.I., Passetti, F., Quackenbush, J., Schnabel, B., Sogayar, M.C., Souza, J.E., Valentini, S.R., Zaiats, A.C., Amaral, E.J., Arnaldi, L.A., de Araujo, A.G., de Bessa, S.A., Bicknell, D.C., Ribeiro de Camaro, M.E., Carraro, D.M., Carrer, H., Carvalho, A.F., Colin, C., Costa, F., Curcio, C., Guerreiro da Silva, I.D., Pereira da Silva, N., Dellamano, M., El-Dorry, H., Espreafico, E.M., Scattone Ferreira, A.J., Ayres Ferreira, C., Fortes, M.A., Gama, A.H., Giannella-Neto, D., Giannella, M.L., Giorgi, R.R., Goldman, G.H., Goldman, M.H., Hackel, C., Ho, P.L., Kimura, E.M., Kowalski, L.P., Krieger, J.E., Leite, L.C., Lopes, A., Luna, A.M., Mackay, A., Mari, S.K., Marques,
A.A., Martins, W.K., Montagnini, A., Mourao Neto, M., Nascimento, A.L., Neville, A.M., Nobrega, M.P., O'Hare, M.J., Otsuka, A.Y., Ruas de Melo, A.I., Paco-Larson, M.L., Guimaraes Pereira, G., Pesquero, J.B., Pessoa, J.G., Rahal, P., Rainho, C.A., Rodrigues, V., Rogatto, S.R., Romano, C.M., Romeiro, J.G., Rossi, B.M., Rusticci, M., Guerra de Sa, R., Sant' Anna, S.C., Sarmazo, M.L., Silva, T.C., Soares, F.A., Sonati Mde, F., de Freitas Sousa, J., Queiroz, D., Valente, V., Vettore, A.L., Villanova, F.E., Zago, M.A. and Zalcberg, H. (2003) The generation and utilization of a cancer-oriented representation of the human transcriptome by using expressed sequence tags. Proc Natl Acad Sci U S A, 100, 13418-13423.
Buetow, K.H., Edmonson, M.N. and Cassidy, A.B. (1999) Reliable identification of large numbers of candidate SNPs from public EST data. Nat Genet, 21, 323-325.
Campagne, F. and Skrabanek, L. (2006) Mining expressed sequence tags identifies cancer markers of clinical interest. BMC Bioinformatics, 7, 481.
Chakrabarti, R., Robles, L.D., Gibson, J. and Muroski, M. (2002) Profiling of differential expression of messenger RNA in normal, benign, and metastatic prostate cell lines. Cancer Genet Cytogenet, 139, 115-125.
Clifford, R., Edmonson, M., Hu, Y., Nguyen, C., Scherpbier, T. and Buetow, K.H. (2000) Expression-based genetic/physical maps of single-nucleotide polymorphisms identified by the cancer genome anatomy project. Genome Res, 10, 1259-1265.
Dahl, E., Sadr-Nabavi, A., Klopocki, E., Betz, B., Grube, S., Kreutzfeld, R., Himmelfarb, M., An, H.X., Gelling, S., Klaman, I., Hinzmann, B., Kristiansen, G., Grutzmann, R., Kuner, R., Petschke, B., Rhiem, K., Wiechen, K., Sers, C., Wiestler, O., Schneider, A., Hofler, H., Nahrig, J., Dietel, M., Schafer, R., Rosenthal, A., Schmutzler, R., Durst, M., Meindl, A. and Niederacher, D. (2005) Systematic identification and molecular characterization of genes differentially expressed in breast and ovarian cancer. J Pathol, 205, 21-28.
Grutzmann, R., Pilarsky, C., Staub, E., Schmitt, A.O., Foerder, M., Specht, T., Hinzmann, B., Dahl, E., Alldinger, I., Rosenthal, A., Ockert, D. and Saeger, H.D. (2003) Systematic isolation of genes differentially expressed in normal and cancerous tissue of the pancreas. Pancreatology, 3, 169-178.
Guryev, V., Berezikov, E. and Cuppen, E. (2005) CASCAD: a database of annotated candidate single nucleotide polymorphisms associated with expressed sequences. BMC Genomics, 6, 10.
Helftenbein, G., Koslowski, M., Dhaene, K., Seitz, G., Sahin, U. and Tureci, O. (2008) In silico strategy for detection of target candidates for antibody therapy of solid tumors. Gene, 414, 76-84.
Herbert, J.M., Stekel, D., Sanderson, S., Heath, V.L. and Bicknell, R. (2008) A novel method of differential gene expression analysis using multiple cDNA libraries applied to the identification of tumour endothelial genes. BMC Genomics, 9, 153.
Huntley, D., Baldo, A., Johri, S. and Sergot, M. (2006) SEAN: SNP prediction and display program utilizing EST sequence clusters. Bioinformatics, 22, 495-496.
Irizarry, K., Kustanovich, V., Li, C., Brown, N., Nelson, S., Wong, W. and Lee, C.J. (2000) Genome-wide analysis of single-nucleotide polymorphisms in human expressed sequences. Nat Genet, 26, 233-236.
Kim, D.S., Huh, J.W. and Kim, H.S. (2007) Transposable elements in human cancers by genome-wide EST alignment. Genes Genet Syst, 82, 145-156.
Kim, N.S., Hahn, Y., Oh, J.H., Lee, J.Y., Oh, K.J., Kim, J.M., Park, H.S., Kim, S., Song, K.S., Rho, S.M., Yoo, H.S. and Kim, Y.S. (2004) Gene cataloging and expression profiling in human gastric cancer cells by expressed sequence tags. Genomics, 83, 1024-1045.
Krukovskaja, L.L., Baranova, A., Tyezelova, T., Polev, D. and Kozlov, A.P. (2005) Experimental study of human expressed sequences newly identified in silico as tumor specific. Tumour Biol, 26, 17-24.
Lal, A., Lash, A.E., Altschul, S.F., Velculescu, V., Zhang, L., McLendon, R.E., Marra, M.A., Prange, C., Morin, P.J., Polyak, K., Papadopoulos, N., Vogelstein, B., Kinzler, K.W., Strausberg, R.L. and Riggins, G.J. (1999) A public database for gene expression in human cancers. Cancer Res, 59, 5403-5407.
Lu, B., Xu, J., Lai, M., Zhang, H. and Chen, J. (2006) A transcriptome anatomy of human colorectal cancers. BMC Cancer, 6, 40.
Navratil, V., Penel, S., Delmotte, S., Mouchiroud, D., Gautier, C. and Aouacheria, A. (2008) DigiPINS: A database for vertebrate exonic single nucleotide polymorphisms and its application to cancer association studies. Biochimie, 90, 563-569.
Picoult-Newberg, L., Ideker, T.E., Pohl, M.G., Taylor, S.L., Donaldson, M.A., Nickerson, D.A. and Boyce-Jacino, M. (1999) Mining SNPs from EST databases. Genome Res, 9, 167-174.
Qiu, P., Wang, L., Kostich, M., Ding, W., Simon, J.S. and Greene, J.R. (2004) Genome wide in silico SNP-tumor association analysis. BMC Cancer, 4, 4.
Sachidanandam, R., Weissman, D., Schmidt, S.C., Kakol, J.M., Stein, L.D., Marth, G., Sherry, S., Mullikin, J.C., Mortimore, B.J., Willey, D.L., Hunt, S.E., Cole, C.G., Coggill, P.C., Rice, C.M., Ning, Z., Rogers, J., Bentley, D.R., Kwok, P.Y., Mardis, E.R., Yeh, R.T., Schultz, B., Cook, L., Davenport, R., Dante, M., Fulton, L., Hillier, L., Waterston, R.H., McPherson, J.D., Gilman, B., Schaffner, S., Van Etten, W.J., Reich, D., Higgins, J., Daly, M.J., Blumenstiel, B., Baldwin, J., Stange-Thomann, N., Zody, M.C., Linton, L., Lander, E.S. and Altshuler, D. (2001) A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature, 409, 928-933.
Scheurle, D., DeYoung, M.P., Binninger, D.M., Page, H., Jahanzeb, M. and Narayanan, R. (2000) Cancer gene discovery using digital differential display. Cancer Res, 60, 4037-4043.
Schmitt, A.O., Specht, T., Beckmann, G., Dahl, E., Pilarsky, C.P., Hinzmann, B. and Rosenthal, A. (1999) Exhaustive mining of EST libraries for genes differentially expressed in normal and tumour tissues. Nucleic Acids Res, 27, 4251-4260.
Shen, D., He, J. and Chang, H.R. (2005) In silico identification of breast cancer genes by combined multiple high throughput analyses. Int J Mol Med, 15, 205-212.
Tang, J., Leunissen, J.A., Voorrips, R.E., van der Linden, C.G. and Vosman, B. (2008) HaploSNPer: a web-based allele and SNP detection tool. BMC Genet, 9, 23.
Tang, J., Vosman, B., Voorrips, R.E., van der Linden, C.G. and Leunissen, J.A. (2006) QualitySNP: a pipeline for detecting single nucleotide polymorphisms and insertions/deletions in EST data from diploid and polyploid species. BMC Bioinformatics, 7, 438.
Wang, J. and Liang, P. (2003) DigiNorthern, digital expression analysis of query genes based on ESTs. Bioinformatics, 19, 653-654.
Xu, Q. and Lee, C. (2003) Discovery of novel splice forms and functional analysis of cancer-specific alternative splicing in human expressed sequences. Nucleic Acids Res, 31, 5635-5643.
Yu, Z., Li, Z., Jolicoeur, N., Zhang, L., Fortin, Y., Wang, E., Wu, M. and Shen, S.H. (2007) Aberrant allele frequencies of the SNPs located in microRNA target sites are potentially associated with human cancers. Nucleic Acids Res, 35, 4535-4541.
Zhang, Y., Luoh, S.M., Hon, L.S., Baertsch, R., Wood, W.I. and Zhang, Z. (2007) GeneHub-GEPIS: digital expression profiling for normal and cancer tissues based on an integrated gene database. Nucleic Acids Res, 35, W152-158.
SHORT COMMENTARIES
In: Computational Biology: New Research Editor: Alona S. Russe
ISBN: 978-1-60692-040-4 © 2009 Nova Science Publishers, Inc.
Short Commentary A
PROTEIN BIOINFORMATICS FOR DRUG DISCOVERY: CONCAVITY DRUGGABILITY AND ANTIBODY DRUGGABILITY

Hiroki Shirai 1,*, Kenji Mizuguchi 2, Daisuke Kuroda 3,4, Haruki Nakamura 3, Shinji Soga 1, Masato Kobori 1 and Noriaki Hirayama 5

1 Advanced Genomics, Molecular Medicine Research Laboratories, Astellas Pharma Inc., 21 Miyukigaoka, Tsukuba, Ibaraki 305-8585, Japan
2 National Institute of Biomedical Innovation, 7-6-8 Asagi, Saito, Ibaraki-city, Osaka 567-0085, Japan
3 Institute for Protein Research, Osaka University, 1-3 Yamadaoka, Suita-city, Osaka 565-0871, Japan
4 Graduate School of Frontier Biosciences, Osaka University, 1-3 Yamadaoka, Suita-city, Osaka 565-0871, Japan
5 Tokai University School of Medicine, 143 Shimokasuya, Isehara, Kanagawa 259-1143, Japan
Abstract

The field of protein bioinformatics analyzes the sequence and structure of proteins; it plays a critical role in the discovery of small therapeutic agents as well as protein drugs. Here we present our recent progress in this field, including the new concepts of concavity druggability and antibody druggability, which are expected to raise the drug discovery success ratio.
Key words: concavity druggability, antibody druggability, protein bioinformatics, drug discovery
Introduction

Currently, two in silico technologies, bioinformatics and computational chemistry, are effectively used for drug discovery[1-6]. Bioinformatics plays various roles at multiple stages of drug discovery: target discovery, target validation, building a screening system, and analysis of the mechanisms of diseases as well as those of compounds[1-3]. It also contributes to the generation of protein drugs and small therapeutic agents. It comprises four components: information technology (IT), information science, theoretical biology, and theoretical biochemistry, although the boundaries between them are obscure. The IT aspect of bioinformatics includes the construction and maintenance of an information infrastructure (hardware, software, network systems, and an in-house database). Various kinds of biological data are collected and entered into the in-house database to yield a user-friendly interface and easily accessible data. The information science aspect of bioinformatics is the analysis of microarray and other biological data using sophisticated data-mining approaches. Text mining, which extracts useful information from the literature, falls into this category. The theoretical biology aspect of bioinformatics, sometimes called pathway bioinformatics, is used to determine which genes are involved in physiological functions. The theoretical biochemistry aspect of bioinformatics, which is almost the same as protein bioinformatics, includes the analysis of protein sequence or structure in order to elucidate molecular function. In addition, protein bioinformatics naturally plays a critical role in protein drug discovery. Despite the importance of protein bioinformatics, inefficient information-sharing between industrial and academic researchers in this field could prevent the development of more useful tools for real drug discovery. The current status of protein bioinformatics and problems therewith are summarized and addressed below.

Just ten years ago, there were still significant difficulties with generating a lead compound via docking studies between compounds and a target protein. The improvement of software and hardware, as well as the rapid increase in protein/compound interaction data over the past decade, has enabled us to obtain lead compounds in many projects. In addition to docking studies, other technologies involving computational chemistry effectively contribute to drug discovery at various stages[4-6]. The structure-activity relationship is used for lead optimization, ADME and toxicity prediction for lead evaluation, and chemoinformatics for clustering and the rational selection of compounds. More importantly, the generation of the "druggability of small compounds" concept and its suitability as an indicator for lead compound evaluation changed the general flow of drug discovery. However, despite efforts based on computational chemistry, compounds are often dropped for various reasons. Researchers try to reduce this risk by evaluating ADME and toxicity at an earlier stage, increasing the variety of lead compounds, and optimizing as many of them as possible. However, it should be noted that the use of computational chemistry alone would reduce only some specific risks and generate a limited variety of lead compounds. A combination approach, such as with bioinformatics, is more desirable, as it is expected to reduce different types of risks while increasing lead compound variety.
Bioinformatics and computational chemistry are currently used independently for drug discovery; however, if used in combination, a synergistic effect is expected. Protein
bioinformatics is considered to be a way to combine these two in silico technologies; thus, it plays a key role in the creation of an integrated solution, which may improve the next-generation drug discovery success ratio. Effective usage of sequence and structure analysis could reduce side effects. This matter is described in detail in the next section. In addition, we recently proposed a new concept, concavity druggability, which is expected to increase the variety of lead compounds (described in section II).
I) Protein Sequence and Structure Analysis

The efficiency of protein bioinformatics in drug discovery is illustrated below through our work with the guanidino-group modifying enzyme (GME) superfamily[7-9]. The GME superfamily consists of enzymes (GMEs) that catalyze the modification of guanidino groups[7]. This superfamily includes many key metabolic enzymes, some of which are already recognized as attractive drug targets. GMEs adopt a unique tertiary structure known as the α/β propeller, which can accommodate a diverse set of sequences; the amino-acid sequence identities among GMEs range from 8% to 23%. Amidinotransferase (AT) was the first enzyme of this superfamily to have its structure determined. It catalyzes the transfer of an amidino group from the amino acid arginine to another substrate. The crystal structure of AT revealed a barrel-shaped fold with a cavity on one side where the substrate arginine binds. The chemical reaction is catalyzed by a Cys-His-Asp catalytic triad at the bottom of this cavity.

Our first project dealing with the protein bioinformatics of GME was to propose the existence of the GME superfamily itself[8]. Before our proposal, despite the functional similarity, the low sequence similarities among the enzymes and the lack of structural information except for AT had long prevented researchers from recognizing their evolutionary relatedness (homology). FUGUE is a sophisticated fold-recognition program developed by one of the authors (Kenji Mizuguchi)[10]. Using FUGUE, we predicted three other enzyme families to be homologous to AT and to share similar catalytic mechanisms[8]: dimethylarginine dimethylaminohydrolase (DDAH), arginine deiminase (ADI), and Porphyromonas gingivalis peptidyl-arginine deiminase (PPAD), together with related hypothetical proteins showing weak similarity to PPAD (which we called PPADH). After our prediction, the structures of DDAH, ADI and PPADH were determined by X-ray crystallography, confirming our hypothesis[11-13]. The second GME project was the prediction of succinylarginine dihydrolase (AstB) as a new member of the GME superfamily[9]. After this prediction, the structure of AstB was determined by X-ray crystallography, which confirmed our hypothesis[14].

The useful information obtained from these two drug discovery projects is summarized as follows: i) structures useful in docking studies for finding compounds that block GMEs can be predicted; ii) the catalytic residues of DDAH, ADI, PPAD, PPADH, and AstB; this information is valuable for validating a protein as a drug target as well as for setting up an assay system; iii) identification of PPADH proteins as new drug targets; they would have some enzymatic activity in common with GMEs and produce ammonia, and proteins that generate ammonia sometimes act as virulence factors in pathogenic microorganisms, so the PPADH from Helicobacter pylori might act as such by generating ammonia from the abundant peptides in the stomach; iv) the catalytic mechanisms of DDAH, ADI, PPAD and AstB. These
catalytic mechanisms are valuable for the design of chemical compounds at the lead optimization step. The sequence analysis aspect of protein bioinformatics could thus extract these pieces of valuable information prior to structure determination by X-ray crystallography.

Mammalian peptidyl-arginine deiminase (PAD) is a metallo-enzyme that had been considered unrelated to GME until the crystal structure of human PAD was determined[15]. The structure of PAD is composed of several domains, one of which is the α/β propeller. The overall structural similarity, the conservation of catalytic residues, and the similarity of function confirmed that it is another member of the GME superfamily. Its relatedness could not be predicted prior to structural determination; it is therefore necessary to improve the sensitivity of fold recognition tools.

After many GME structures had been determined, we conducted a third GME project[7]. Structural superposition and structure-based alignment identified new key GME residues involved in catalysis and substrate binding. We found that conserved guanidino-carboxyl interactions are used in two different ways: acidic residues in the catalytic site form hydrogen bonds to the substrate guanidino group, and enzyme Arg residues at several key positions recognize the carboxyl group of the substrate and fix its orientation. Based on this observation, we proposed rules for classifying GME sequences and predicting their molecular function from the conservation of the key acidic and Arg residues (a schematic sketch of this rule-based classification is given at the end of this section). The useful predictive tools generated by this third drug discovery project are summarized below: i) the extracted information on GME commonality and diversity allows for the design of a compound with high selectivity; ii) the proposed rules allow the function of unannotated genes in pathogenic organisms to be predicted and new drug targets to be proposed. Importantly, protein bioinformatics can be used to extract valuable information even after structural determination by X-ray crystallography.

Generally, if a compound binds to its target protein, it may also bind to homologues of the target, which would eventually generate a side effect. For this reason, we usually screen not only for the target protein, but also counter-screen for its homologues to select compounds with both high affinity and selectivity. Selecting the homologues to be checked is crucial, but there is some controversy over whether it is necessary to check remote homologues. Since remote homologues are so different, some researchers insist that only close homologues need to be checked, while others insist that all should be. We feel that the answer depends on the features of the superfamily to which the target protein belongs. For example, the GME superfamily comprises enzymes capable of a diverse array of reactions (transferase, hydrolase, and dihydrolase), but whose substrates are the same or similar to each other (arginine or related ligands). Despite the very low sequence similarity among GMEs, they are considered to bind to the same or similar compounds. Thus, if finding a compound that blocks a certain GME with high selectivity is desired, the selectivity against remote homologues should also be checked. In contrast, the acyl-CoA N-acyltransferase (NAT) superfamily comprises enzymes whose substrates are diverse, but whose reactions are the same (acyl transfer)[9]. For this reason, NATs are not considered to easily bind to the same or similar compounds.
Thus, if a compound that blocks a certain NAT with high selectivity is desired, checking the selectivity of remote homologues is not likely to be necessary. Using protein bioinformatics is expected to reduce the number of different types of side effects that must be tackled with computational chemistry.
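To make the rule-based classification proposed above concrete, the following minimal Python sketch checks whether the key acidic and Arg residues are conserved in an aligned GME sequence. The alignment columns and residue requirements used here are hypothetical placeholders chosen for illustration; they are not the positions derived in [7].

def annotate_gme(aligned_seq):
    # Hypothetical alignment columns (0-based) for the two kinds of
    # conserved guanidino-carboxyl interactions described in the text.
    catalytic_acidic = [12, 87]   # expected: Asp (D) or Glu (E)
    substrate_arg = [145, 200]    # expected: Arg (R)
    acidic_ok = all(aligned_seq[i] in "DE" for i in catalytic_acidic)
    arg_ok = all(aligned_seq[i] == "R" for i in substrate_arg)
    if acidic_ok and arg_ok:
        return "putative GME: catalytic and carboxyl-recognizing residues conserved"
    if acidic_ok:
        return "possible GME: only the catalytic acidic residues are conserved"
    return "no GME signature detected"

In practice the query sequence would first be aligned to the superfamily profile (for example, with a fold-recognition tool such as FUGUE) so that the key columns are comparable across the low-identity GME sequences.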
II) Concavity Druggability

Using protein bioinformatics, we proposed a new concept, "concavity druggability", and developed a method to evaluate druggability from the amino acids present in a concavity on the protein surface[16,17]. The specific binding of a ligand to its target protein is the key to drug action. The surface of a target protein usually possesses multiple concavities where small molecules may bind; however, each ligand binds preferentially to a specific concavity. We defined a druggable concavity as a concave surface where drug-like molecules are highly inclined to bind. Drug targets that can be modulated by a small-molecule drug need to be identified for each disease phenotype. Thus, finding a druggable concavity in disease-related proteins is a crucial step towards validating new drug targets. It is also important for increasing the chance of obtaining a drug-like compound and for understanding the function of a given protein. Since the binding site of a drug is considered to be highly specific to its molecular characteristics, the binding concavity must have a distinct character significantly different from those of similar concavities on protein targets. One of the authors (Noriaki Hirayama) recently established a profile for determining the "drug-likeness" of a compound, which comprises multiple molecular descriptors to determine how much like a drug the molecule is[18]. We evaluated the drug-likeness of ligands whose structures had been determined in complex with proteins deposited in the Protein Data Bank (PDB). The amino acid compositions around the binding sites for drug-like ligands in well-qualified X-ray structures were analyzed in detail. The analysis revealed a remarkable propensity for the presence of specific amino acids at the binding site of each drug-like compound. From these data, a simple discrimination index called the Propensity for Ligand Binding (PLB) was developed, which allows the druggability of concavities on a given protein surface to be evaluated. Importantly, the PLB index can be used to identify druggable concavities in homology models. Use of the PLB index to find new druggable concavities would increase the variety of lead compounds beyond what computational chemistry alone would provide.
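The published PLB parameters are not reproduced here, but the general shape of such an index can be sketched in a few lines of Python: score each concavity by the propensities of the amino acids lining it and rank the concavities. The propensity values below are illustrative placeholders, not the parameters from [16,17].

# Illustrative, made-up propensities (higher = more often observed at
# drug-like ligand binding sites); not the published PLB values.
PROPENSITY = {
    "TRP": 2.0, "PHE": 1.5, "TYR": 1.3, "MET": 1.2, "LEU": 1.1, "ILE": 1.1,
    "CYS": 1.0, "HIS": 1.0, "VAL": 0.9, "ALA": 0.8, "THR": 0.8, "SER": 0.7,
    "ARG": 0.7, "GLN": 0.7, "ASN": 0.6, "LYS": 0.5, "GLY": 0.5, "GLU": 0.5,
    "ASP": 0.4, "PRO": 0.3,
}

def plb_like_score(pocket_residues):
    # Mean propensity of the residues lining one concavity.
    values = [PROPENSITY[res.upper()] for res in pocket_residues]
    return sum(values) / len(values)

# Rank two hypothetical pockets and keep the more druggable one.
pockets = {"site1": ["TRP", "LEU", "PHE", "SER"],
           "site2": ["ASP", "LYS", "GLY", "SER"]}
best = max(pockets, key=lambda name: plb_like_score(pockets[name]))
print(best, plb_like_score(pockets[best]))

Because such a score depends only on amino acid composition, the same calculation can be applied to pockets detected on homology models, which is what makes the PLB approach attractive for targets without experimental structures.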
III) Antibody Modeling and Druggability

Protein bioinformatics contributes to the generation of protein drugs as well as small compound drugs. Recently, more antibody drugs are expected to be developed because of their low toxicity and high efficiency. Establishing high-affinity binding is crucial for expanding detection limits, extending dissociation half-times, decreasing drug dosage, and increasing drug efficacy[19]. Affinity maturation of antibodies in vivo often fails to generate antibody drugs of the targeted potency, which makes it necessary to perform further affinity maturation in vitro using directed evolution or rational design via protein bioinformatics. For rational design, a tertiary structural model of an antigen-binding site needs to be constructed from its amino acid sequence; the efficiency of in silico design depends on the accuracy of the model. An antigen-binding site is composed of six complementarity-determining regions (CDRs). Five of these CDRs (CDR-L1, L2, L3, H1, and H2) have a limited number of canonical structures; however, the third CDR of the heavy chain (CDR-H3) shows substantial diversity in length, sequence, and structure. In addition, CDR-H3 sometimes changes its conformation depending on the existence of an antigen. Importantly, it
lies in the centre of the antigen-binding site and generally plays a dominant role in antigen recognition. Thus, building a model of CDR-H3 is the most important and the most difficult step. After examining the qualified crystal structures of antibodies, we proposed empirical rules for predicting the structural features of CDR-H3 from its amino-acid sequence (H3-rules)[20,21]. Essentially the same proposal was also made by a different group[22]. We recently revised the rules (H3-Rules 2007) in light of the analysis of newly determined structures[23]. Multicanonical molecular dynamics simulation is one of the algorithms for enhanced conformational sampling[24]; we found it quite useful for building an accurate model and capturing the structural variety of CDR-H3s[25,26]. At present, the combined use of the revised H3-rules and multicanonical molecular dynamics simulation provides the best approach for modeling the antigen-binding site. However, it remains difficult to predict the structure of long CDR-H3s, the non-canonical structures of the other five CDRs, and the dimerization angle of the H and L chains. Thus, the methods of antibody model building need to be improved. Current drug discovery practice uses the druggability of small compounds as an indicator of lead compound potential; however, a similarly rational approach has not yet been established for antibody evaluation. We recently proposed a new concept, "antibody druggability"[23]. We examined the CDR-H3 structures of 12 antibody drugs and found common structural features. Antibody druggability could be effectively applied to rational antibody design and selection.
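One frequently cited element of the H3-rules is that a salt bridge between Arg or Lys at heavy-chain position 94 and Asp at position 101 favors the "kinked" base conformation of CDR-H3. The toy classifier below encodes only this single condition and should be read as a simplified paraphrase for illustration; the published rules [20,21,23] contain additional conditions and exceptions.

def h3_base_conformation(res94, res101):
    # Kinked base is favored when Arg/Lys at H94 can pair with Asp at H101;
    # otherwise the extended form is predicted (simplified rule only).
    if res94 in ("R", "K") and res101 == "D":
        return "kinked"
    return "extended"

print(h3_base_conformation("R", "D"))  # -> kinked
print(h3_base_conformation("A", "D"))  # -> extended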
Conclusion

Although protein bioinformatics plays a critical role in drug discovery, inefficient information-sharing between industrial and academic researchers in this field could prevent the development of more useful tools for real drug discovery. In this review, our recent progress in protein bioinformatics was presented, and the current status of this field, along with associated problems that may be experienced by pharmaceutical companies, was summarized. It is our hope that more useful tools will be developed in this field to increase the success ratio of drug discovery even further.
References
[1] Chen YP, and Chen F. (2008) Identifying targets for drug discovery using bioinformatics. Expert Opin Ther Targets. 12:383-389.
[2] Mizuguchi K. (2004) Fold recognition for drug discovery. Drug Discovery Today: Targets. 3:18-23.
[3] Yan Q. (2008) The integration of personalized and systems medicine: bioinformatics support for pharmacogenomics and drug discovery. Methods Mol Biol. 448:1-19.
[4] Mohan CG, Gandhi T, Garg D, and Shinde R. (2007) Computer-assisted methods in chemical toxicity prediction. Mini Rev Med Chem. 7:499-507.
[5] Tropsha A, and Golbraikh A. (2007) Predictive QSAR modeling workflow, model applicability domains, and virtual screening. Curr Pharm Des. 13:3494-3504.
[6] Taft CA, Da Silva VB, and Da Silva CH. (2008) Current topics in computer-aided drug design. J Pharm Sci. 97:1089-1098.
[7] Shirai H, Mokrab Y, and Mizuguchi K. (2006) The guanidino-group modifying enzymes: structural basis for their diversity and commonality. Proteins. 64:1010-1023.
[8] Shirai H, Blundell TL, and Mizuguchi K. (2001) A novel superfamily of enzymes that catalyze the modification of guanidino groups. Trends Biochem Sci. 26:465-468.
[9] Shirai H, and Mizuguchi K. (2003) Prediction of the structure and function of AstA and AstB, the first two enzymes of the arginine succinyltransferase pathway of arginine catabolism. FEBS Lett. 555:505-510.
[10] Shi J, Blundell TL, and Mizuguchi K. (2001) FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J Mol Biol. 310:243-257.
[11] Murray-Rust J, Leiper J, McAlister M, Phelan J, Tilley S, Santa-Maria J, Vallance P, and McDonald N. (2001) Structural insights into the hydrolysis of cellular nitric oxide synthase inhibitors by dimethylarginine dimethylaminohydrolase. Nat Struct Biol. 8:679-683.
[12] Das K, Butler GH, Kwiatkowski V, Clark AD Jr, Yadav P, and Arnold E. (2004) Crystal structures of arginine deiminase with covalent reaction intermediates: implications for catalytic mechanism. Structure. 12:657-667.
[13] Galkin A, Kulakova L, Sarikaya E, Lim K, Howard A, and Herzberg O. (2004) Structural insight into arginine degradation by arginine deiminase, an antibacterial and parasite drug target. J Biol Chem. 279:14001-14008.
[14] Tocilj A, Schrag JD, Li Y, Schneider BL, Reitzer L, Matte A, and Cygler M. (2005) Crystal structure of N-succinylarginine dihydrolase, AstB, bound to substrate and product, an enzyme from the arginine catabolic pathway of Escherichia coli. J Biol Chem. 280:15800-15808.
[15] Arita K, Hashimoto H, Shimizu T, Nakashima K, Yamada M, and Sato M. (2004) Structural basis for Ca(2+)-induced activation of human PAD4. Nat Struct Mol Biol. 11:777-783.
[16] Soga S, Shirai H, Kobori M, and Hirayama N. (2007) Use of amino acid composition to predict ligand-binding sites. J Chem Inf Model. 47:400-406.
[17] Soga S, Shirai H, Kobori M, and Hirayama N. (2007) Identification of the druggable concavity in homology models using the PLB index. J Chem Inf Model. 47:2287-2292.
[18] Horio K, Muta H, Goto J, and Hirayama N. (2007) A simple method to improve the odds in finding 'lead-like' compounds from chemical libraries. Chem Pharm Bull. 55:980-984.
[19] Lippow SM, Wittrup KD, and Tidor B. (2007) Computational design of antibody-affinity improvement beyond in vivo maturation. Nat Biotechnol. 25:1171-1176.
[20] Shirai H, Kidera A, and Nakamura H. (1996) Structural classification of CDR-H3 in antibodies. FEBS Lett. 399:1-8.
[21] Shirai H, Kidera A, and Nakamura H. (1999) H3-rules: identification of CDR-H3 structures in antibodies. FEBS Lett. 455:188-197.
[22] Morea V, Tramontano A, Rustici M, Chothia C, and Lesk AM. (1998) Conformations of the third hypervariable region in the VH domain of immunoglobulins. J Mol Biol. 275:269-294.
[23] Kuroda D, Shirai H, Kobori M, and Nakamura H. (2008) Structural classification of CDR-H3 revisited: a lesson in antibody modeling. Proteins, in press.
[24] Nakajima N, Nakamura H, and Kidera A. (1997) Multicanonical ensemble generated by molecular dynamics simulation for enhanced conformational sampling of peptides. J Phys Chem B. 101:817-824.
[25] Shirai H, Nakajima N, Higo J, Kidera A, and Nakamura H. (1998) Conformational sampling of CDR-H3 in antibodies by multicanonical molecular dynamics simulation. J Mol Biol. 278:481-496.
[26] Kim ST, Shirai H, Nakajima N, Higo J, and Nakamura H. (1999) Enhanced conformational diversity search of CDR-H3 in antibodies: role of the first CDR-H3 residue. Proteins. 37:683-696.
In: Computational Biology: New Research Editor: Alona S. Russe
ISBN: 978-1-60692-040-4 © 2009 Nova Science Publishers, Inc.
Short Commentary B
A LINKAGE DISEQUILIBRIUM-BASED STATISTICAL APPROACH TO DISCOVERING INTERACTIONS AMONG SNP ALLELES AT MULTIPLE LOCI CONTRIBUTING TO HUMAN SKIN PIGMENTATION VARIATION
Sumiko Anno1 and Takashi Abe2
1: School of Engineering, Shibaura Institute of Technology
2: School of Bio-Science, Nagahama Institute of Bio-Science and Technology
Abstract Linkage disequilibrium (LD), the nonrandom association of alleles from different loci, can provide valuable information about the structure of human genome haplotypes. Because haplotype-based methods offer a powerful approach for disease gene mapping, this information may facilitate studies of the association between genomic variation and human traits. Single nucleotide polymorphism (SNP) alleles at multiple loci produce an LD pattern resulting from gene–gene interactions that can provide a foundation for developing statistics to detect other such interactions. Although several studies have used LD to address the role of gene interactions in various phenotypes and complex diseases, the current lack of formal statistics and the potential importance of data resulting from this research have motivated us to develop LD-based statistics. We chose to examine skin pigmentation because it is a complex trait, and SNP alleles at multiple loci may play a role in determining normal variation in skin pigmentation. The main purpose of this chapter is to outline the development of LD-based statistics for detecting interactions among SNP alleles at multiple loci that contribute to variation in human skin pigmentation. To accomplish this, we developed a general theory to study LD patterns in gene-interaction trait models. We then developed a definition of gene interaction and a measure of interactions among SNP alleles at multiple loci contributing to the trait in the framework of LD analysis.
Introduction

Industrial activity has increased the rate at which ozone is being depleted and has led to higher levels of exposure to ultraviolet (UV) rays, conditions to which humans have not had time to adapt. To properly adjust to heightened UV exposure, it will be necessary to understand how the environment exerts pressures and effects on portions of the genome encoding complex human phenotypes such as pigmentation [1]. Over the course of their evolution, humans have adapted to complicated and challenging environments by evolving new traits and abilities. Human skin color variation, regulated by melanin expression, is an adaptation to different levels of environmental UV exposure. For example, people indigenous to Northern Europe have pale skin, while people indigenous to Africa have dark skin. At lower latitudes, melanin production is increased to protect against high levels of UV irradiation, while at higher latitudes, melanin production is decreased, which allows the body to synthesize more vitamin D. Increased vitamin D synthesis provides a variety of health benefits, including protection against rickets (osteomalacia) [1-3].

Although variations in human skin color are known to occur in connection with environmental factors such as UV radiation, the genetic background of human skin color is still unclear. Several studies of skin pigmentation have reported polymorphisms associated with skin color at various loci such as melanocortin 1 receptor (MC1R), oculocutaneous albinism II (OCA2), and agouti signaling protein (ASIP). Human skin color variation is thought to be controlled by interactions among alleles at multiple loci known to contain single nucleotide polymorphisms (SNPs) [4-11]. SNP alleles at multiple loci produce a linkage disequilibrium (LD) pattern resulting from gene-gene interactions; such patterns can provide a foundation for developing statistics to detect similar interactions. Although several studies have used LD to address gene interactions, the current lack of formal statistics and the potential of this information have motivated us to develop LD-based statistics. Because pigmentation is a complex trait, SNP alleles at multiple loci may play a role in determining normal variation in skin pigmentation.

The main purpose of this chapter is to develop LD-based statistics for detecting interactions among SNP alleles at multiple loci that contribute to human skin pigmentation variation, in order to clarify the molecular basis of the genetic background of human skin color. To accomplish this, we developed a general theory to study LD patterns using gene-interaction trait models. We then developed a definition of gene interactions and a measure of interactions contributing to the trait among SNP alleles at multiple loci within the framework of LD analysis.
Materials and Methods

Samples were collected from 122 Caucasoid participants in Toledo, Ohio and 100 Mongoloid participants in Japan. Sample collection was conducted in accordance with a protocol approved by the Human Subjects Research Committee of the Shibaura Institute of Technology, Japan. Participants gave informed consent for the collection of buccal samples, which were anonymously coded [12].

We extracted DNA from the buccal samples using the ISOHAIR kit (NIPPON GENE COMPANY, Chiyoda-ku, Tokyo, Japan). To provide sufficient genomic DNA for SNP
genotyping, we amplified whole genomic DNA using the REPLI-g Kit (QIAGEN, Chuo-ku, Tokyo, Japan). Next, the ASIP, tyrosinase-related protein 1 (TYRP1), tyrosinase (TYR), MC1R, OCA2, microphthalmia-associated transcription factor (MITF), and myosin VA (MYO5A) genes were selected as candidate genes for human skin pigmentation [4-11, 13-17]. Twenty SNPs in the loci of the candidate genes that had been registered in the dbSNP database [18] were selected: rs819136, rs1129414, rs2075508, rs10960756, rs3793976, rs2298458, rs3212363, rs1805008, rs3212371, rs2279727, rs4778182, rs1800419, rs2311843, rs1800414, rs1800404, rs7623610, rs704246, rs16964944, rs1724577, and rs4776053.

PCR was performed to amplify the regions containing the SNPs of interest within the DNA samples. The PCR products were electrophoresed on 2% agarose gels to verify that the expected single-band product was generated. The verified PCR products were purified with ExoSAP-IT (Amersham Pharmacia Biotech, Piscataway, New Jersey, USA). The allele discrimination assay consisted of PCR amplification of multiple SNP alleles at a particular locus using specific primers with tags differing in molecular weight. For this assay, the purified PCR products were combined with two hemi-nested allele-specific primers and two universally tagged Masscode oligonucleotide primers (Qiagen). Each tag was covalently attached to the 5′ end of an oligonucleotide primer via a photolabile linker. Following PCR amplification, the SNP-specific PCR products were passed through a QIAquick 96 silica-based filter membrane to remove unincorporated tagged primers. The filtered PCR products were exposed to a 254-nm mercury lamp to cleave the incorporated tags, and the tags were analyzed using an optimized Agilent 1100 single quadrupole mass spectrometer. The presence of a particular tag indicated the presence of the corresponding SNP allele in the genomic DNA sample.

Genotype data were reported in a comma-delimited flat-file format that contained the SNP and sample identifiers for each allele detected. Alleles were reported using a binary nomenclature in which 1 represented wild-type alleles and 2 represented variant alleles. Each SNP genotype was classified into three types: wild-type homozygous, variant-type homozygous, and heterozygous. Thus, a homozygous wild-type genotype was designated 1,1, and a heterozygous genotype was designated 1,2 [19-20]. The following analyses were conducted with the data obtained from the SNP genotyping of the Caucasoid and Mongoloid participants.
Genotype and allele frequencies for the 20 SNPs in the two populations The genotype and allele frequencies of the 20 SNPs observed in the Caucasoid (n = 122) and Mongoloid (n = 100) participants were calculated to determine racial differences.
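As a small illustration of this step, the allele frequencies for one SNP follow directly from the genotype counts under the 1/2 coding described above. The counts in the example call are made up and are not the study data.

def allele_frequencies(n11, n12, n22):
    # Genotype counts: n11 = wild-type homozygous (1,1),
    # n12 = heterozygous (1,2), n22 = variant homozygous (2,2).
    total_alleles = 2 * (n11 + n12 + n22)
    f1 = (2 * n11 + n12) / total_alleles  # frequency of allele 1
    return f1, 1.0 - f1

print(allele_frequencies(70, 40, 12))  # hypothetical counts for 122 subjects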
Cluster analysis

We conducted cluster analysis for genetic differentiation of the SNP genotyping results by condensing the genotype assignment for each SNP allele into a single numeric value as follows: homozygous wild type 1,1 = 0, heterozygous 1,2 = 0.5, and homozygous variant 2,2 = 1. An unweighted pair group method with arithmetic mean (UPGMA) dendrogram was constructed from the genotyping data using the Euclidean distance. UPGMA is one of the simplest and most commonly used hierarchical clustering algorithms. As input, it receives a
set of components and a distance matrix, which contains pairwise distances between all components, and constructs a hierarchical dendrogram from this set.
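A minimal sketch of this procedure in Python, assuming SciPy is available: 'average' linkage on a Euclidean distance matrix is exactly UPGMA. The random genotype matrix below is a stand-in for the real 0/0.5/1-coded data.

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
# 10 hypothetical individuals x 20 SNPs, condensed genotype values
genotypes = rng.choice([0.0, 0.5, 1.0], size=(10, 20))

distances = pdist(genotypes, metric="euclidean")  # pairwise distances
tree = linkage(distances, method="average")       # UPGMA linkage
dendrogram(tree, no_plot=True)                    # or render with matplotlib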
Linkage disequilibrium (LD) generated by gene-gene interactions contributes to differences between the two racial groups

To examine the contribution of nonrandom associations of SNP alleles at multiple loci to racial differences, we examined the associations between the 20 SNP alleles at various candidate-gene loci in the genome. LD serves as a measure of gene-gene interactions among unlinked loci [21]. LD is the association between the qualitative random variables corresponding to SNP alleles at different polymorphic sites, not necessarily on the same chromosome [22-23]. LD provides important gene-mapping information when used as a tool for fine mapping of complex disease genes and also in proposed genome-wide association studies. LD is also of interest because of what it may reveal about the evolution of populations.

The concept of LD used in this chapter is as defined by Richard Lewontin in one of the earliest measures of disequilibrium to be proposed (symbolized by D) [24]. D quantifies disequilibrium as the difference between the observed frequency of a two-locus haplotype and the frequency it would be expected to show if the alleles were segregating at random. Consider two markers with alleles A, a and B, b. Their haplotype frequencies can be written f_AB, f_Ab, f_aB, and f_ab. The departure from random association can be measured by

D = f_AB f_ab − f_Ab f_aB.

Measures of LD are defined as standardized values of D. Two common such measures are

R² = D² / [(f_AB + f_Ab)(f_AB + f_aB)(f_aB + f_ab)(f_Ab + f_ab)]

and D′ = D/D_max, where D_max is min{(f_AB + f_Ab)(f_Ab + f_ab), (f_AB + f_aB)(f_aB + f_ab)} when the numerator is positive, and min{(f_AB + f_Ab)(f_AB + f_aB), (f_Ab + f_ab)(f_aB + f_ab)} otherwise. The case of D′ = 1 is known as complete LD. Values of D′ < 1 indicate that the complete ancestral LD has been disrupted; the magnitude of values of D′ < 1 has no clear interpretation. The definition of R² can be understood by considering the alleles as realizations of quantitative random variables (with values 0 and 1), between which we can calculate a correlation coefficient. LD can be analyzed with software such as the EH program [25], Haploview [26], the R statistical software [27], and others.

To examine the contribution of nonrandom associations of SNP alleles at multiple loci to racial differences, we calculated the LD statistic and significance levels for all possible SNP allele pairs. Significance levels were calculated using a χ² test on the two-by-two table of haplotype frequencies; the P value of LD was determined with a χ² test, with statistical significance set at 0.05. Combinations of SNP alleles at multiple loci under LD were jointly tested for association with Caucasoid or Mongoloid race by performing a χ² test for independence. Only data that followed the conditions of Hardy-Weinberg equilibrium were used in the analysis.
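The sketch below computes D, D′, and R² for one marker pair exactly as defined above, together with a χ² p-value for the two-by-two haplotype table; it uses the standard identity χ² = n·R² for n observed haplotypes. The frequencies in the example call are illustrative and are not values from this study.

from scipy.stats import chi2

def ld_stats(fAB, fAb, faB, fab, n_haplotypes):
    pA, pa = fAB + fAb, faB + fab   # marginal allele frequencies, locus 1
    pB, pb = fAB + faB, fAb + fab   # marginal allele frequencies, locus 2
    D = fAB * fab - fAb * faB
    r2 = D * D / (pA * pa * pB * pb)
    d_max = min(pA * pb, pa * pB) if D > 0 else min(pA * pB, pa * pb)
    d_prime = abs(D) / d_max
    chi_sq = n_haplotypes * r2      # Pearson chi-square for the 2x2 table
    p_value = chi2.sf(chi_sq, df=1)
    return D, d_prime, r2, p_value

print(ld_stats(0.5, 0.1, 0.1, 0.3, n_haplotypes=244))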
Results

Table 1 shows the genotype and allele frequencies for the 20 SNPs in the Caucasoid and Mongoloid groups. We also observed some allele frequencies in the Caucasoid samples that differed from those in the Mongoloid samples (Figure 1). The 20 SNPs that we used in this study were
assumed to contribute to racial differences in skin color. Cluster analysis showed that each racial group formed a separate cluster except for one Mongoloid participant, who fell into the Caucasoid cluster [12].

Table 1. Genotypes and allele frequencies for 20 SNPs in the Caucasoid and Mongoloid populations
The allele combination rs1800419-C/rs1800414-G/rs1800404-G was associated with the Mongoloid group (p = 5.39 × 10−20). These alleles are found in OCA2 on chromosome 15 and formed a haplotype [12]. OCA2 controls the transport of tyrosine, a precursor of melanin, into the melanosome for melanin synthesis. The allele combination rs2311843-C/rs1800404-A/rs4776053-C was associated with the Caucasoid group (p = 5.51 × 10−33). These alleles are found in OCA2 and MYO5A on chromosome 15 and formed a haplotype [12]. MYO5A functions in vesicle transport, and mutations in this gene confer a lighter skin color due to defects in actin-based pigment granule transport within melanocytes. The pigmentation variation is believed to be due to abnormal distribution of melanosomes along the dendrites of melanocytes [28].

There were significant differences in the allele combinations (i.e., haplotypes) between the two racial groups. The allele combination rs2311843-C/rs1800404-A/rs4776053-C was associated with the Caucasoid group. One allele of this combination, rs4776053 in MYO5A, was associated only with the Caucasoid group, while the other alleles, rs2311843 and rs1800404 in OCA2, were found in both the Mongoloid and the Caucasoid groups. Thus, rs4776053, associated with the Caucasoid group only, could be considered to
confer lighter skin color. The rs4776053 allele varies as C/T. The allele frequencies for this SNP were 83.2%/16.8% (C/T) in the Caucasoid group and 76.0%/24.0% (C/T) in the Mongoloid group. The higher frequency of the C variant in the Caucasoid group indicates that rs4776053-C could be a SNP allele that confers lighter skin color. This result suggests that the lighter skin pigmentation observed in Caucasoid populations is the result of positive selection on different loci in different human populations [4, 12].
Figure 1. Comparison of allele frequencies in Mongoloid and Caucasoid populations. Allele frequencies of Mongoloid samples are plotted against allele frequencies of Caucasoid samples. Note: only SNPs that vary substantially between the populations are labeled.
Conclusion

The results of the LD analysis of the Caucasoid and Mongoloid groups show that the SNP alleles at multiple loci that contribute to racial differences are on the same chromosome and are likely to form a haplotype [12]. Every gene has variable SNPs, and these may constitute haplotypes; some haplotypes are present in all populations, while some are population specific [29]. The haplotypes associated with skin color differences between Caucasoid and Mongoloid persons are population-specific haplotypes. For most genes, haplotypes represent an opportunity for functional adaptation and diversification [29]. The haplotypes identified in this study are most likely the result of adaptation to different UV ray intensities. This study adds to the growing evidence of genetic variability with regard to skin color in different geographically isolated populations [12].
Several studies have searched the Perlegen and HapMap datasets for signatures of selection using nucleotide diversity or similar measures [30-31]. While these studies have concentrated on SNP diversity, LD-based measures may also be used to search for positive-selection signatures [32-33]. Our results verify that LD-based approaches can be used to identify regions with the highest variation and are a powerful means of detecting active selection prior to allelic fixation or selection acting on existing genetic variation [12, 34]. Confirmation of these findings requires further study involving more ethnic groups, such as a Negroid population, to analyze the associations between SNP alleles at multiple loci and racial differences in skin color. Clarifying these associations will not only elucidate an interesting physiological trait, but may also provide a model or test system for gene discovery in other polygenic traits (such as complex diseases) that have greater environmental sources of variation [6].

There is a clear justification for adopting an evolutionary approach to exploring the genetics of human pigmentation. Such studies have already revealed the trait to have a highly dynamic and complex evolutionary past and have pointed to the molecular mechanisms underlying phenotypic variability [6]. This was demonstrated most recently by the identification of the selection signatures of the functionally important SLC24A5 and POMC genes [35-36]. The identification and analysis of additional genes involved in human skin pigmentation, along with functional characterization of the allelic variants at the candidate loci presented here, will help to clarify the nature and extent of skin pigmentation adaptation in human populations [4, 12].
References [1] Anno, S; Abe, T; Sairyo, K; Kudo, S; Yamamoto, T; Ogata, K; Goel, VK. Interactions Between SNP Alleles at Multiple Loci and Variation in Skin Pigmentation in 122 Caucasians. Evolutionary Bioinformatics Online. 2007; 3: 169-178. [2] Rouzaud, F; Kadekaro, AL; Abdel-Malek, ZA; Hearing, VJ. MC1R and the response of melanocytes to ultraviolet radiation. Mutat Res. 2005; 571: 133-152. [3] Jablonski, NG; Chaplin, G. The evolution of human skin coloration. J Hum Evol. 2000; 39: 57-106. [4] Myles, S; Somel, M; Tang, K; Kelso, J; Stoneking, M. Identifying genes underlying skin pigmentation differences among human populations. Hum Genet. 2007; 120: 613-621. [5] Izagirre, N; García, I; Junquera, C; de la Rúa, C; Alonso, S. A scan for signatures of positive selection in candidate loci for skin pigmentation in humans. Mol Biol Evol. 2006; 23(9): 1697-1706. [6] McEvoy, B; Beleza, S; Shriver, MD. The genetic architecture of normal variation in human pigmentation: an evolutionary perspective and model. Hum Mol Genet. 2006; 15(2): R176-181. [7] Bonilla, C; Boxill, LA; Donald, SA; Williams, T; Sylvester, N; Parra, EJ; Dios, S; Norton, HL; Shriver, MD; Kittles, RA. The 8818G allele of the agouti signaling protein (ASIP) gene is ancestral and is associated with darker skin color in African Americans. Hum Genet. 2005; 116: 402-406. [8] Makova, K; Norton, H. Worldwide polymorphism at the MC1R locus and normal pigmentation variation in humans. Peptides. 2005; 26: 1901-1908.
[9] Naysmith, L; Waterston, K; Ha, T; Flanagan, N; Bisset, Y; Ray, A; Wakamatsu, K; Ito, S; Rees, JL. Quantitative measures of the effect of the melanocortin 1 receptor on human pigmentary status. J Invest Dermatol. 2004; 122(2): 423-428.
[10] Ancans, J; Flanagan, N; Hoogduijn, MJ; Thody, AJ. P-locus is a target for the melanogenic effects of MC-1R signaling: a possible control point for facultative pigmentation. Ann N Y Acad Sci. 2003; 994: 373-377.
[11] Rees, JL. Genetics of hair and skin color. Annu Rev Genet. 2003; 37: 67-90.
[12] Anno, S; Abe, T; Yamamoto, T. Interactions between SNP Alleles at Multiple Loci Contribute to Skin Color Differences between Caucasoid and Mongoloid Subjects. Int J Biol Sci. 2008; 4: 81-86.
[13] Tadokoro, T; Yamaguchi, Y; Batzer, J; Coelho, SG; Zmudzka, BZ; Miller, SA; Wolber, R; Beer, JZ; Hearing, VJ. Mechanisms of Skin Tanning in Different Racial/Ethnic Groups in Response to Ultraviolet Radiation. J Invest Dermatol. 2005; 124: 1326-1332.
[14] Bonilla, C; Shriver, MD; Parra, EJ; Jones, A; Fernández, JR. Ancestral proportions and their association with skin pigmentation and bone mineral density in Puerto Rican women from New York city. Hum Genet. 2004; 115: 57-68.
[15] Shriver, MD; Parra, EJ; Dios, S; Bonilla, C; Norton, H; Jovel, C; Pfaff, C; Jones, C; Massac, A; Cameron, N; Baron, A; Jackson, T; Argyropoulos, G; Jin, L; Hoggart, CJ; McKeigue, PM; Kittles, RA. Skin pigmentation, biogeographical ancestry, and admixture mapping. Hum Genet. 2003; 112: 387-399.
[16] Hoggart, CJ; Parra, EJ; Shriver, MD; Bonilla, C; Kittles, RA; Clayton, DG; McKeigue, PM. Control of confounding of genetic associations in stratified populations. Am J Hum Genet. 2003; 72: 1492-1504.
[17] Sturm, RA; Teasdale, RD; Box, NF. Human pigmentation genes: identification, structure and consequences of polymorphic variation. Gene. 2001; 277: 49-62.
[18] [Internet] National Library of Medicine: Searchable NCBI site for Single Nucleotide Polymorphisms. http://www.ncbi.nlm.nih.gov/projects/SNP/
[19] Kokoris, M; Dix, K; Moynihan, K; Mathis, J; Erwin, B; Grass, P; Hines, B; Duesterhoeft, A. High-throughput SNP genotyping with the Masscode system. Mol Diagn. 2000; 5(4): 329-340.
[20] Ogata, K; Ikeda, S; Ando, E. QIAGEN Genomics Inc. A study of SNP genotyping using Masscode™ technology. Shimadzu Hyoka. 2002; 58(3/4): 125-129 [In Japanese].
[21] Zhao, J; Jin, L; Xiong, M. Test for interaction between two unlinked loci. Am J Hum Genet. 2006; 79(5): 831-845.
[22] Jorde, LB. Linkage Disequilibrium and the Search for Complex Disease Genes. Genome Res. 2000; 10: 1435-1444.
[23] Pritchard, JK; Przeworski, M. Linkage Disequilibrium in Humans: Models and Data. Am J Hum Genet. 2001; 69: 1-14.
[24] Lewontin, RC. The interaction of selection and linkage. I. General considerations; heterotic models. Genetics. 1964; 49: 49-67.
[25] Terwilliger, J; Ott, J. Handbook of Human Genetic Linkage. Johns Hopkins University Press, Baltimore. 1994.
[26] Barrett, JC; Fry, B; Maller, J; Daly, MJ. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005; 21: 263-265.
[27] Warnes, G; Leisch, F. Genetics: Population genetics [computer program]. R package version 1.2.0. 2005. Available: http://cran.r-project.org/src/contrib/PACKAGES.html.
[28] Libby, RT; Lillo, C; Kitamoto, J; Williams, DS; Steel, KP. Myosin Va is required for normal photoreceptor synaptic activity. J Cell Sci. 2004; 117: 4509-4515.
[29] Stephens, JC; Schneider, JA; Tanguay, DA; Choi, J; Acharya, T; Stanley, SE; Jiang, R; Messer, CJ; Chew, A; Han, JH; Duan, J; Carr, JL; Lee, MS; Koshy, B; Kumar, AM; Zhang, G; Newell, WR; Windemuth, A; Xu, C; Kalbfleisch, TS; Shaner, SL; Arnold, K; Schulz, V; Drysdale, CM; Nandabalan, K; Judson, RS; Ruano, G; Vovis, GF. Haplotype variation and linkage disequilibrium in 313 human genes. Science. 2001; 293: 489-493.
[30] Eberle, MA; Rieder, MJ; Kruglyak, L; Nickerson, DA. Allele Frequency Matching Between SNPs Reveals an Excess of Linkage Disequilibrium in Genic Regions of the Human Genome. PLoS Genetics. 2006; 2: e142.
[31] Nielsen, R; Williamson, S; Kim, Y; Hubisz, MJ; Clark, AG; Bustamante, C. Genomic scans for selective sweeps using SNP data. Genome Res. 2005; 15: 1566-1575.
[32] Voight, BF; Kudaravalli, S; Wen, X; Pritchard, JK. A map of recent positive selection in the human genome. PLoS Biol. 2006; 4: e72.
[33] Sabeti, PC; Reich, DE; Higgins, JM; Levine, HZ; Richter, DJ; Schaffner, SF; Gabriel, SB; Platko, JV; Patterson, NJ; McDonald, GJ; Ackerman, HC; Campbell, SJ; Altshuler, D; Cooper, R; Kwiatkowski, D; Ward, R; Lander, ES. Detecting recent positive selection in the human genome from haplotype structure. Nature. 2002; 419: 832-837.
[34] Przeworski, M; Coop, G; Wall, JD. The signature of positive selection on standing genetic variation. Evol Int J Org Evol. 2005; 59: 2312-2323.
[35] Lamason, RL; Mohideen, MA; Mest, JR; Wong, AC; Norton, HL; Aros, MC; Jurynec, MJ; Mao, X; Humphreville, VR; Humbert, JE; Sinha, S; Moore, JL; Jagadeeswaran, P; Zhao, W; Ning, G; Makalowska, I; McKeigue, PM; O'donnell, D; Kittles, R; Parra, EJ; Mangini, NJ; Grunwald, DJ; Shriver, MD; Canfield, VA; Cheng, KC. SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and humans. Science. 2005; 310: 1782-1786.
[36] Millington, GWM. Pro-opiomelanocortin (POMC): the cutaneous roles of its melanocortin products and receptors. Clin Exp Dermatol. 2006; 31: 407-412.
In: Computational Biology: New Research Editor: Alona S. Russe
ISBN: 978-1-60692-040-4 © 2009 Nova Science Publishers, Inc.
Short Commentary C
SUFFICIENT CONDITIONS FOR EXACT PENALTY IN CONSTRAINED OPTIMIZATION ON COMPLETE METRIC SPACES
Alexander J. Zaslavski*
Department of Mathematics, The Technion-Israel Institute of Technology, 32000 Haifa, Israel
*: E-mail address: [email protected]
Abstract In this paper we use the penalty approach in order to study two constrained minimization problems on complete metric spaces. A penalty function is said to have the generalized exact penalty property if there is a penalty coefficient for which approximate solutions of the unconstrained penalized problem are close enough to approximate solutions of the corresponding constrained problem. In this paper we establish sufficient conditions for the generalized exact penalty property.
1. Introduction
Penalty methods are an important and useful tool in constrained optimization. See, for example, [1-4, 6-11] and the references mentioned there. The notion of exact penalization was introduced by Eremin [6] and Zangwill [7] for use in the development of algorithms for nonlinear constrained optimization. Since that time, exact penalty functions have continued to play a key role in the theory of mathematical programming. For discussions and various applications of exact penalization to constrained optimization problems, see [1, 2, 4].

In this paper we use the penalty approach in order to study constrained minimization problems with lower semicontinuous constraints in complete metric spaces. A penalty function is said to have the exact penalty property [1, 2, 4] if there is a penalty coefficient for which a solution of an unconstrained penalized problem is a solution of the corresponding constrained problem. We study two constrained minimization problems with lower semicontinuous objective functions. The first problem is an equality-constrained problem in a complete metric space with a lower semicontinuous constraint function, and the second is an inequality-constrained problem in a complete metric space with a lower semicontinuous constraint function.

In [8] we considered these two problems in a Banach space with locally Lipschitzian constraint and objective functions and established a very simple sufficient condition for the exact penalty property. In particular, the problem f(x) → min subject to g(x) = c possesses the exact penalty property if the real number c is not a critical value of the function g; in other words, if the set g^{-1}(c) does not contain a critical point of g. Usually the exact penalty property is related to calmness of the perturbed constraint function. In [8-11] and here we use assumptions of a different nature, which are not difficult to verify. Note that in [8] we used the following notion of a critical point of a Lipschitzian function: a point z is a critical point of the function g if 0 ∈ ∂g(z), where ∂g(z) is Clarke's generalized gradient of g at z [3]. In [11] we extended the results of [8] to the equality-constrained problem and the inequality-constrained problem in an arbitrary complete metric space with locally Lipschitz constraint and objective functions. In the present paper we extend the results of [11] to the equality-constrained problem and the inequality-constrained problem in a complete metric space with lower semicontinuous constraint and objective functions, assuming that all closed bounded subsets of the complete metric space are compact.

We use the convention that ∞ + c = ∞ for any real number c and that the sum over an empty set is zero. Let (X, ρ) be a metric space and let U be a nonempty open subset of X. A function f : U → R^1 is called Lipschitzian if there exists c > 0 such that |f(x_1) − f(x_2)| ≤ c ρ(x_1, x_2) for each x_1, x_2 ∈ U.

In order to study our minimization problems we need the notion of a critical point of a Lipschitz function introduced in [11]. Let a function f : U → R^1 be Lipschitzian. For each x ∈ U define

Ξ_f(x) = lim sup_{y→x} [ lim inf_{z→y, z≠y} (f(z) − f(y)) (ρ(z, y))^{-1} ].    (1.1)

Clearly, Ξ_f(x) is well-defined for all x ∈ U. A point x ∈ U is called a critical point of f if Ξ_f(x) ≥ 0 [11]. A real number c ∈ R^1 is called a critical value of f on U if there is a critical point x ∈ U of f such that f(x) = c [11].

For each x ∈ X and each r > 0 set

B(x, r) = {y ∈ X : ρ(x, y) ≤ r}.    (1.2)

A set D ⊂ X is called bounded if there exist x ∈ X and r > 0 such that D ⊂ B(x, r). In [11, Proposition 1.1] it was pointed out that the following proposition holds.

Proposition 1.1. A point x ∈ U is a critical point of f if and only if there exist a sequence {x_k}_{k=1}^∞ ⊂ U and a sequence {r_k}_{k=1}^∞ ⊂ (0, 1) such that ρ(x_k, x) → 0 as k → ∞ and, for each integer k ≥ 1, B(x_k, r_k) ⊂ U and f(z) ≥ f(x_k) − ρ(z, x_k) k^{-1} for all z ∈ B(x_k, r_k).

In view of definition (1.1) the following proposition holds.

Proposition 1.2 [11, Proposition 1.2]. Let x ∈ U, {x_i}_{i=1}^∞ ⊂ U and let lim_{i→∞} ρ(x_i, x) = 0. Then Ξ_f(x) ≥ lim sup_{i→∞} Ξ_f(x_i).

Corollary 1.1 [11, Corollary 1.1]. Let x ∈ U, {x_i}_{i=1}^∞ ⊂ U, lim_{i→∞} ρ(x_i, x) = 0 and let lim sup_{i→∞} Ξ_f(x_i) ≥ 0. Then x is a critical point of f.

The analog of the notion of a critical point of f introduced above was used in [8] in the case when X is a Banach space. We now compare the notion of a critical point introduced above with the notion of a critical point used in [8] when X is a Banach space. Assume that (X, ||·||) is a Banach space, (X*, ||·||_*) is its dual space and that ρ(x, y) = ||x − y||, x, y ∈ X. Let f : U → R^1 be a Lipschitzian function defined on a nonempty open subset U of X. For each x ∈ U let

f°(x, h) = lim sup_{t→0+, y→x} [f(y + th) − f(y)]/t,  h ∈ X,

be the Clarke generalized derivative of f at the point x [3], let

∂f(x) = {l ∈ X* : f°(x, h) ≥ l(h) for all h ∈ X}

be Clarke's generalized gradient of f at x [3], and set

Ξ̃_f(x) = inf{f°(x, h) : h ∈ X and ||h|| = 1}

[8].

Proposition 1.3 [11, Proposition 1.3]. For each x ∈ U, Ξ̃_f(x) ≥ Ξ_f(x).

In [8] we say that x ∈ U is a critical point of f if Ξ̃_f(x) ≥ 0 and that a real number c is a critical value of f if there is a critical point x ∈ U of f such that f(x) = c. Proposition 1.3 implies that if x ∈ U is a critical point of f according to the definition given in this paper, then x is also a critical point of f in the sense of the definition given in [8]. Likewise, if c ∈ R^1 is a critical value of f according to the definition given in this paper, then c is also a critical value of f in the sense of the definition given in [8].
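As a quick sanity check of definition (1.1) (our own illustration, not an example from the paper), take X = R^1 with ρ(x, y) = |x − y| and the identity map g(x) = x. For any y and any z ≠ y to the left of y, (g(z) − g(y))/ρ(z, y) = (z − y)/|z − y| = −1, so the inner lim inf in (1.1) equals −1 at every y, and therefore

Ξ_g(x) = lim sup_{y→x} [ lim inf_{z→y, z≠y} (g(z) − g(y)) (ρ(z, y))^{-1} ] = −1 < 0 for every x ∈ R^1.

The same computation applies to −g, so g has no critical points and no critical values in the sense defined above. Consequently every level set g^{-1}(c) satisfies assumption (A2) of Theorem 2.1 below, and for any objective f meeting the remaining hypotheses the equality constraint g(x) = c admits the generalized exact penalty property.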
2. The main results
Let (X, ρ) be a complete metric space. For each function f : X → R^1 ∪ {∞} and each nonempty set A ⊂ X put inf(f) = inf{f(z) : z ∈ X} and inf(f; A) = inf{f(z) : z ∈ A}. For each x ∈ X and each B ⊂ X put ρ(x, B) = inf{ρ(x, y) : y ∈ B}. We assume that any nonempty closed bounded subset of (X, ρ) is compact. Fix θ ∈ X. Let g : X → R^1 ∪ {∞} be a lower semicontinuous function, let c ∈ R^1, and let f : X → R^1 ∪ {∞} be a lower semicontinuous function which is bounded from below, is not identically ∞, and satisfies

lim_{ρ(x,θ)→∞} f(x) = ∞.    (2.1)

We consider the equality-constrained minimization problem

f(x) → min, x ∈ g^{-1}(c),    (P_e)

where g^{-1}(c) ≠ ∅ and g is a finite-valued continuous function, and the inequality-constrained minimization problem

f(x) → min, x ∈ g^{-1}((−∞, c]),    (P_i)

where g^{-1}((−∞, c]) ≠ ∅. With these two problems we associate the corresponding families of unconstrained minimization problems

f(x) + λ|g(x) − c| → min, x ∈ X,    (P_{e,λ})

and

f(x) + λ max{g(x) − c, 0} → min, x ∈ X,    (P_{i,λ})

where λ > 0. The following two theorems, which are proved in Section 3, are the main results of the paper.

Theorem 2.1. Assume that g^{-1}(c) ≠ ∅, the function g is finite-valued and continuous, inf(f; g^{-1}(c)) < ∞, and that the following assumptions hold:

(A1) for each x ∈ g^{-1}(c) satisfying f(x) = inf(f; g^{-1}(c)) there is Δ_x > 0 such that the restrictions of the functions f and g to the ball B(x, Δ_x) are finite-valued and Lipschitz;

(A2) if x ∈ g^{-1}(c) satisfies f(x) = inf(f; g^{-1}(c)), then x is not a critical point of the function g and is not a critical point of the function −g.
Then there exists Λ > 0 such that for each ε > 0 there exists δ ∈ (0, ε) such that the following assertion holds: if λ ≥ Λ and if x ∈ X satisfies

f(x) + λ|g(x) − c| ≤ inf{f(z) + λ|g(z) − c| : z ∈ X} + δ,

then there is y ∈ g^{-1}(c) such that ρ(x, y) ≤ ε and f(y) ≤ inf(f; g^{-1}(c)) + δ.

Theorem 2.2. Assume that g^{-1}((−∞, c]) ≠ ∅, inf(f; g^{-1}((−∞, c])) < ∞, and that the following assumptions hold:

(A3) for each x ∈ g^{-1}((−∞, c]) satisfying f(x) = inf(f; g^{-1}((−∞, c])) there is Δ_x > 0 such that the restrictions of the functions f and g to the ball B(x, Δ_x) are finite-valued and Lipschitz;

(A4) if x ∈ g^{-1}((−∞, c]) satisfies f(x) = inf(f; g^{-1}((−∞, c])), then x is not a critical point of the function g.

Then there exists Λ > 0 such that for each ε > 0 there exists δ ∈ (0, ε) such that the following assertion holds: if λ ≥ Λ and if x ∈ X satisfies

f(x) + λ max{g(x) − c, 0} ≤ inf{f(z) + λ max{g(z) − c, 0} : z ∈ X} + δ,

then there is y ∈ g^{-1}((−∞, c]) such that ρ(x, y) ≤ ε and f(y) ≤ inf(f; g^{-1}((−∞, c])) + δ.

Theorems 2.1 and 2.2 imply the following result.

Theorem 2.3. 1. Assume that g^{-1}(c) ≠ ∅, the function g is finite-valued and continuous, inf(f; g^{-1}(c)) < ∞, and that assumptions (A1) and (A2) hold. Then there exists Λ > 0 such that for each λ ≥ Λ and each sequence {x_i}_{i=1}^∞ ⊂ X which satisfies

lim_{i→∞} [f(x_i) + λ|g(x_i) − c|] = inf{f(z) + λ|g(z) − c| : z ∈ X}

there exists a sequence {y_i}_{i=1}^∞ ⊂ g^{-1}(c) such that

lim_{i→∞} f(y_i) = inf(f; g^{-1}(c)) and lim_{i→∞} ρ(y_i, x_i) = 0.
2. Assume that g −1 ((−∞, c]) 6= ∅, inf(f ; g −1((−∞, c])) < ∞
34
Alexander J. Zaslavski
and that the assumptions (A3) and (A4) hold. Then there exists Λ > 0 such that for each λ ≥ Λ and each sequence {xi }∞ i=1 ⊂ X which satisfies lim [f (xi) + λ max{g(xi) − c, 0})] = inf{f (z) + λ max{g(z) − c, 0} : z ∈ X}
i→∞
−1 there exists a sequence {yi }∞ i=1 ⊂ g ((−∞, c]) such that
lim f (yi ) = inf(f ; g −1((−∞, c])), lim ρ(yi , xi) = 0.
i→∞
3.
i→∞
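Before the proofs, the exact-penalty phenomenon described by Theorems 2.1-2.3 can be illustrated numerically. The sketch below is a minimal Python example (assuming NumPy and SciPy are available; the choice of f, g, c and of the starting point is purely illustrative and not part of the chapter). It takes X = R² with the Euclidean metric, f(x) = (x₁ − 2)² + (x₂ − 1)² and g(x) = x₁ + x₂ with c = 1, whose constrained minimizer is (1, 0); the multiplier calculus suggests the penalty becomes exact once λ exceeds 2, and indeed the computed minimizers of (Pe,λ) land on g⁻¹(c) for λ ≥ 2.

    # Minimal numerical illustration of exact penalty (illustrative data).
    # For lambda below the exact-penalty threshold the penalized minimizer
    # is infeasible; once lambda is large enough it coincides with the
    # constrained minimizer (1, 0).
    import numpy as np
    from scipy.optimize import minimize

    def f(x):
        return (x[0] - 2.0) ** 2 + (x[1] - 1.0) ** 2

    def g(x):
        return x[0] + x[1]

    c = 1.0
    for lam in [0.5, 1.0, 2.0, 5.0, 50.0]:
        # Objective of the unconstrained problem (Pe,lambda).
        psi = lambda x, lam=lam: f(x) + lam * abs(g(x) - c)
        res = minimize(psi, x0=np.zeros(2), method="Nelder-Mead",
                       options={"xatol": 1e-10, "fatol": 1e-10})
        print(f"lambda = {lam:5.1f}: x = {np.round(res.x, 4)}, "
              f"|g(x) - c| = {abs(g(res.x) - c):.2e}")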
3. Proofs of Theorems 2.1 and 2.2
We prove Theorems 2.1 and 2.2 simultaneously. Put

A = g⁻¹(c)   (3.1)

in the case of Theorem 2.1 and

A = g⁻¹((−∞, c])   (3.2)

in the case of Theorem 2.2. Clearly, A is a nonempty closed set. For λ > 0 put

ψ_λ(x) = f(x) + λ|g(x) − c|, x ∈ X   (3.3)

in the case of Theorem 2.1 and

ψ_λ(x) = f(x) + λ max{g(x) − c, 0}, x ∈ X   (3.4)

in the case of Theorem 2.2. Clearly, for any λ > 0, ψ_λ : X → R¹ ∪ {∞} is a lower semicontinuous function which is bounded from below and satisfies inf(ψ_λ) < ∞. We show that there exists a positive number Λ such that the following property holds:

(P1) for each ε > 0 there exists δ ∈ (0, ε) such that for each λ ≥ Λ and each x ∈ X satisfying ψ_λ(x) ≤ inf(ψ_λ) + δ there is y ∈ A ∩ B(x, ε) such that ψ_λ(y) ≤ ψ_λ(x).

It is not difficult to see that the existence of a positive number Λ for which (P1) holds implies the validity of Theorems 2.1 and 2.2. Let us assume that there is no positive number Λ such that (P1) holds. Then for each natural number k there exist

ε_k ∈ (0, 1), λ_k ≥ k, x_k ∈ X   (3.5)

such that

ψ_{λ_k}(x_k) ≤ inf(ψ_{λ_k}) + 2⁻¹ε_k k⁻²   (3.6)

and

{z ∈ A ∩ B(x_k, ε_k) : ψ_{λ_k}(z) ≤ ψ_{λ_k}(x_k)} = ∅.   (3.7)
Let k be a natural number. It follows from (3.6) and Ekeland's variational principle [5] that there is y_k ∈ X such that

ψ_{λ_k}(y_k) ≤ ψ_{λ_k}(x_k),   (3.8)

ρ(y_k, x_k) ≤ (2k)⁻¹ε_k,   (3.9)

ψ_{λ_k}(y_k) ≤ ψ_{λ_k}(z) + k⁻¹ρ(z, y_k) for all z ∈ X.   (3.10)

By (3.7), (3.8) and (3.9),

y_k ∉ A for all natural numbers k.   (3.11)
In the case of Theorem 2.2 it follows from (3.2) and (3.11) that

g(y_k) > c for all natural numbers k.   (3.12)

In the case of Theorem 2.1 it follows from (3.1) and (3.11) that for any natural number k, either g(y_k) > c or g(y_k) < c. Extracting a subsequence and re-indexing, and replacing g by −g and c by −c if necessary, we may assume without loss of generality that (3.12) holds for all natural numbers k in the case of Theorem 2.1 too. By (3.3), (3.4), (3.5), (3.6) and (3.8), for all natural numbers k,

f(y_k) ≤ ψ_{λ_k}(y_k) ≤ ψ_{λ_k}(x_k) ≤ inf(ψ_{λ_k}) + 2⁻¹ε_k k⁻² ≤ inf(f; A) + 2⁻¹k⁻².   (3.13)
In view of (3.13) and (2.1), the sequence {y_k}_{k=1}^∞ is bounded. Extracting a subsequence and re-indexing, we may assume without loss of generality that there exists

y∗ = lim_{k→∞} y_k in (X, ρ).   (3.14)
Let us show that y∗ ∈ A. By (3.12), (3.3), (3.4), (3.8), (3.13), (3.5) and (3.6), for each integer k ≥ 1,

0 < λ_k(g(y_k) − c) = ψ_{λ_k}(y_k) − f(y_k) ≤ ψ_{λ_k}(y_k) − inf(f) ≤ ψ_{λ_k}(x_k) − inf(f) ≤ inf(ψ_{λ_k}) + 1 − inf(f) ≤ inf(f; A) + 1 − inf(f).

Together with (3.5), which gives λ_k ≥ k → ∞, this implies that

lim_{k→∞} |g(y_k) − c| = 0.   (3.15)
Relations (3.15), (3.14), (3.1) and (3.2) imply that y∗ ∈ A. In view of (3.13), (3.14) and the lower semicontinuity of f,

f(y∗) ≤ inf(f; A).   (3.16)

Since y∗ ∈ A, we also have f(y∗) ≥ inf(f; A). Combined with (3.16) this implies that

f(y∗) = inf(f; A).   (3.17)
By (3.16), (3.17), (3.1), (3.2), (A1) and (A3) there is Δ > 0 such that the restrictions of f and g to B(y∗, Δ) are finite-valued and Lipschitz. Thus there is L_0 > 1 such that

|f(y_1) − f(y_2)|, |g(y_1) − g(y_2)| ≤ L_0 ρ(y_1, y_2) for all y_1, y_2 ∈ B(y∗, Δ).   (3.18)

By (3.14) there is a natural number k_0 such that

ρ(y∗, y_k) < Δ/2 for all integers k ≥ k_0.   (3.19)
Let k ≥ k_0 be an integer. In view of (3.12), (3.18) and (3.19) there is a number

Δ_1 ∈ (0, Δ/4)   (3.20)

such that

g(z) > c for all z ∈ B(y_k, Δ_1).   (3.21)
It follows from (3.10), (3.8), (3.6), (3.3), (3.4), (3.21), (3.20), (3.19), (3.18) and (3.5) that for each z ∈ B(y_k, Δ_1),

−k⁻¹ρ(z, y_k) ≤ ψ_{λ_k}(z) − ψ_{λ_k}(y_k) = f(z) − f(y_k) + λ_k(g(z) − g(y_k))

and

g(z) − g(y_k) ≥ −λ_k⁻¹k⁻¹ρ(z, y_k) + λ_k⁻¹(f(y_k) − f(z)) ≥ −k⁻²ρ(z, y_k) − k⁻¹L_0 ρ(z, y_k).

Combined with (1.1) this implies that Ξ_g(y_k) ≥ −(k⁻² + k⁻¹L_0). This implies that

lim inf_{k→∞} Ξ_g(y_k) ≥ 0.   (3.22)

By (3.22), (3.14), (3.18) and Proposition 1.2,

Ξ_g(y∗) ≥ 0.   (3.23)
Relations (3.16), (3.17), (3.23), (3.1) and (3.2) contradict (A2) in the case of Theorem 2.1 and contradict (A4) in the case of Theorem 2.2. The contradiction we have reached proves that there exists Λ > 0 such that (P1) holds. This completes the proofs of Theorems 2.1 and 2.2.
References

[1] D. Boukari and A. V. Fiacco, Survey of penalty, exact-penalty and multiplier methods from 1968 to 1993, Optimization 32, 301-334 (1995).
[2] J. V. Burke, An exact penalization viewpoint of constrained optimization, SIAM J. Control Optim. 29, 968-998 (1991).
[3] F. H. Clarke, Optimization and Nonsmooth Analysis, Wiley Interscience (1983).
[4] G. Di Pillo and L. Grippo, Exact penalty functions in constrained optimization, SIAM J. Control Optim. 27, 1333-1360 (1989).
[5] I. Ekeland, On the variational principle, J. Math. Anal. Appl. 47, 324-353 (1974).
[6] I. I. Eremin, The penalty method in convex programming, Soviet Math. Dokl. 8, 459-462 (1966).
[7] W. I. Zangwill, Nonlinear programming via penalty functions, Management Sci. 13, 344-358 (1967).
[8] A. J. Zaslavski, A sufficient condition for exact penalty in constrained optimization, SIAM Journal on Optimization 16, 250-262 (2005).
[9] A. J. Zaslavski, Existence of exact penalty for optimization problems with mixed constraints in Banach spaces, J. Math. Anal. Appl. 324, 669-681 (2006).
[10] A. J. Zaslavski, Existence of exact penalty for constrained optimization problems in Hilbert spaces, Nonlinear Analysis 67, 238-248 (2007).
[11] A. J. Zaslavski, Existence of exact penalty for constrained optimization problems in metric spaces, Set-Valued Analysis 15, 223-237 (2007).
In: Computational Biology: New Research Editor: Alona S. Russe
ISBN: 978-1-60692-040-4 © 2009 Nova Science Publishers, Inc.
Short Commentary D
HOW TO CREATE A COMPUTATIONAL MEDICINE STUDY

Viroj Wiwanitkit
Wiwanitkit House, Bangkok, Thailand 10160; Visiting Professor, Hainan Medical College, China
Abstract

With the advent of computational research, several applications in science can be seen, and applications in medicine are also documented. A computational medicine study can help answer a complicated medical question within a short period. How to create a computational medicine study is a common question from beginners. In this article, the author describes the steps for creating computational medicine research. Briefly, a process similar to that used for simple in vivo and in vitro research can be followed. The setting up of a conceptual framework based on a literature review is the first necessary step. Next, the proper database and manipulation tools must be selected. Simulation based on the designed framework can then help one reach the answer. These steps must be followed thoroughly to complete computational medicine research.
Introduction

With the advent of computational research, several applications in science can be seen, and applications in medicine are also documented. Computational biology is now expected to command a leading role in drug discovery and disease characterization [1, 2]. A computational medicine study can help answer a complicated medical question within a short period; these forces have moved much of life sciences research almost completely into the computational domain [2]. How to create a computational medicine study is a common question from beginners. Importantly, educational training in computational medicine has been limited to students enrolled in life sciences curricula, yet many of the skills needed to succeed in biomedical informatics involve or augment training in information technology curricula [2]. In this article, the author describes the steps for creating computational medicine research.
Research Design: The First Step

It is accepted that science is based on evidence. Proof or verification is necessary in science. Although computational biology is a new science, it still follows the basic scientific principles of other sciences. Briefly, a process similar to that used for simple in vivo and in vitro research can be followed. Setting up the question has to be the primary step before beginning any other activities. How to set up a good question is a hard step for any beginner. The best way is to base the question on collected evidence. There are many ways to collect evidence, such as the following:

1. Primary data collection: This is the primary way, with the data personally collected by the researcher conducting the scientific project. The data may be collected via surveys or several other techniques. However, the main pitfall of this approach is its time-consuming nature: it requires a lot of time to collect enough data to be statistically sufficient for further analysis.

2. Secondary data collection: This is a faster way, which makes use of others' work as a baseline for the generation of ideas. However, there is a main problem with this technique: the reliability of the data source. The primary data in the literature must be verified before being generalized into the study question. Nevertheless, this technique is presently widely used because of its convenience.

The setting up of a conceptual framework based on a literature review is the first necessary step. There are several tools for a literature search, but the two most famous databases are PubMed (www.ncbi.nlm.nih.gov/pubmed) and Scopus (www.scopus.com). PubMed was developed by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), located at the U.S. National Institutes of Health (NIH). It provides access to citations from the biomedical literature. Scopus is a newer database that covers more literature than PubMed, including both medical and non-medical journals. A short sketch of a programmatic PubMed search follows this section. After searching the data and obtaining the complete necessary documents, one can set up the research question. The conceptual framework must be set because it will be the guideline for further processes; without a clear concept, the research may go in the wrong direction. The scope of the study must be clearly defined: one must know what will be done before actually doing it. The design of computational medicine research can be either prospective or retrospective. Data mining is the best example of a retrospective design; computational prediction is the best example of a prospective one.
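For instance, PubMed can be queried programmatically through NCBI's E-utilities service. The following minimal Python sketch (assuming network access and the third-party requests library; the search term is illustrative) retrieves the PubMed identifiers matching a query:

    # Minimal PubMed search via NCBI E-utilities (esearch).
    # Assumes the "requests" library; the query term is illustrative.
    import requests

    BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = {
        "db": "pubmed",
        "term": "computational medicine[Title/Abstract]",
        "retmax": 10,
        "retmode": "json",
    }
    reply = requests.get(BASE, params=params, timeout=30).json()
    result = reply["esearchresult"]
    print("Total matching citations:", result["count"])
    print("First PubMed IDs:", result["idlist"])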
Database and Tool for Manipulation

After the research question and conceptual framework have been set up completely, the next step is to select the equipment with which to attack the research question. Unlike in vivo or in vitro studies, in silico computational medicine research uses the computer as its equipment. Computers, which are built on mathematical principles, have become necessary tools for present-day scientific research, and in computational medicine the role of the computer is central. There are two main groups of computer facilities for
computational medicine research. First is the database, which is the collection of data required for further analysis. Second is the tool, which is any interactive computational program that can be used for mathematical prediction or for the simulation of input data or phenomena. There are a number of available databases at present. Some are publicly accessible; others are private. The use of a public database in medicine can enhance the quality and effectiveness of patient care and the health of the public. Renschler noted that, with the introduction of electronic publishing and the availability of full-text databases, publications can be retrieved and downloaded anywhere and at any time [3]. In addition, study groups for practice-based learning can prepare themselves for discussions of their problems or of simulated cases systematically provided by central organizations with experts using information technology [3]. Renschler also mentioned that a pilot study showed great interest in the application of information technology: 80% of the responding colleagues showed interest in occasional or regular use of medical or non-medical full-text databases, preferably using their own computers [3]. However, Keller noted that variations in the medical practice of private doctors could be due to differences in the data they received from public databases [4]. Therefore, it is necessary to learn and know many databases in order to choose the best one. Some examples are described below, and a short sketch of programmatic database access follows this list.

1. Swiss-Prot protein knowledgebase [5]

The Swiss-Prot protein knowledgebase (http://www.expasy.org/sprot/ and http://www.ebi.ac.uk/swissprot/) connects amino acid sequences with the current knowledge in the life sciences [5, 6]. Each protein entry provides an interdisciplinary overview of relevant information by bringing together experimental results, computed features and sometimes even contradictory conclusions [5, 6]. The Swiss-Prot protein knowledgebase provides manually annotated entries for all species, but concentrates on the annotation of entries from model organisms to ensure the presence of high-quality annotation of representative members of all protein families [5]. The Expert Protein Analysis System (ExPASy) Web site might help to identify and reveal the function of proteins [5].

2. UniProt [7, 8]

UniProt provides a comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive cross-references and query interfaces [7, 8]. The central database has two sections, corresponding to the familiar Swiss-Prot (fully manually curated entries) and TrEMBL (enriched with automated classification, annotation and extensive cross-references) [7, 8]. For convenient sequence searches, UniProt also provides several non-redundant sequence databases [7, 8]. The UniProt NREF (UniRef) databases provide representative subsets of the knowledgebase suitable for efficient searching. The comprehensive UniProt Archive (UniParc) is updated daily from many public source databases. The UniProt databases can be accessed online (http://www.uniprot.org) [7, 8].

3. Gene Ontology Annotation Database [9]

The Gene Ontology Annotation (GOA) database (http://www.ebi.ac.uk/GOA) aims to provide high-quality electronic and manual annotations to the UniProt Knowledgebase
(Swiss-Prot, TrEMBL and PIR-PSD) using the standardized vocabulary of the Gene Ontology (GO) [9]. As a supplementary archive of GO annotation, GOA promotes a high level of integration of the knowledge represented in UniProt with other databases [9]. GOA provides annotated entries for nearly 60,000 species (GOA-SPTr) and is the largest and most comprehensive open-source contributor of annotations to the GO Consortium annotation effort [9].
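As an illustration of programmatic access to such a database, the following minimal Python sketch retrieves one protein record from UniProt's REST interface (the rest.uniprot.org endpoint postdates this text and is given here as an assumed, current example; the accession P69905, human hemoglobin subunit alpha, is illustrative):

    # Fetch a UniProt entry in FASTA format via the UniProt REST API.
    # Assumes the "requests" library; accession and endpoint are illustrative.
    import requests

    accession = "P69905"  # human hemoglobin subunit alpha
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"
    fasta = requests.get(url, timeout=30).text

    header, *seq_lines = fasta.strip().splitlines()
    sequence = "".join(seq_lines)
    print(header)
    print("Sequence length:", len(sequence))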
There are also many useful tools for computational medicine research. Some of the most widely used are presented below.

1. Pegasys [10]

This package includes numerous tools for pair-wise and multiple sequence alignment, ab initio gene prediction, RNA gene detection, masking of repetitive sequences in genomic DNA, as well as filters for database formatting and processing raw output from various analysis tools [10]. It enables biologists and bioinformaticians to create and manage sequence analysis workflows [10].

2. GeneViTo [11]

This tool is a Java-based computer application that serves as a workbench for genome-wide analysis through visual interaction [11]. The application deals with various experimental information concerning both DNA and protein sequences (derived from public sequence databases or proprietary data sources) and meta-data obtained by various prediction algorithms, classification schemes or user-defined features. Interaction with a Graphical User Interface (GUI) allows easy extraction of genomic and proteomic data referring to the sequence itself, sequence features, or general structural and functional features [11].

3. BioBuilder [12]

This tool is a Zope-based software tool developed to facilitate the intuitive creation of protein databases [12]. Protein data can be entered and annotated through web forms, along with the flexibility to add customized annotation features to protein entries [12]. A built-in review system permits a global team of scientists to coordinate their annotation efforts [12].

4. DBParser [13]

This tool rapidly culls, merges, and compares sequence search engine results from multiple LC-MS/MS peptide analyses [13]. It employs the principle of parsimony to consolidate redundant protein assignments and derive the most concise set of proteins consistent with all of the assigned peptide sequences observed in an experiment or series of experiments [13].

5. UniHI [14]

UniHI provides researchers with a flexible integrated tool for finding and using comprehensive information about the human interactome [14]. UniHI is available at http://www.mdc-berlin.de/unihi [14]. At present, it is based on 10 major interaction maps
derived by computational and experimental methods. It includes more than 150,000 distinct interactions between more than 17,000 unique human proteins [14].
Simulation Experiment

Simulation is the heart of computational medicine research. There are many kinds of simulation. The main types of simulation in experimental medicine are:

1. Interaction

Interaction simulation focuses on the phenomenon that occurs after the interaction of two molecules. There are several techniques to reach this result. One of the most famous techniques is molecular docking. Docking involves the development of computer algorithms that evaluate the binding modes of putative ligands in receptor sites [15]. This technique can be used for designing combinations between molecules. Over the past year there have been some interesting and significant advances in computer-based ligand-protein docking techniques and related rational drug-design tools, including flexible ligand docking and better estimation of binding free energies and solvation energies [16]. There are many techniques for molecular docking. An interesting computational molecular technique is PatchDock [17], which can be used for modeling recombination. PatchDock performs molecular docking based on shape complementarity principles [17]. The input is two molecules of any type: proteins, DNA, peptides, drugs [17]. The output can be further processed into a three-dimensional (3D) molecular structure using the Swiss PDB Viewer (GlaxoSmithKline R&D and the Swiss Institute of Bioinformatics). The properties and geometry of the derived complex can also be studied with the Swiss PDB Viewer. Another interesting technique is pathway mapping. This technique is based on systems biology: it makes use of pathway identification and the creation of a new overall summative pathway.

2. Mutation

Mutation simulation focuses on the phenomenon that occurs after changes, either minor increases or decreases, within molecules; a short sketch of a point-mutation simulation follows this section. The basics of mutating rest on the knowledge of the coding for nucleic acids and amino acids. Simulated manipulation of the wild-type code can easily be performed, and the mutant can be further used. There are several techniques to reach this result. One of the most famous relies on ontology. Gene ontology is the new "-logy" for this purpose: it is a scientific term used to describe the biology of a gene product in any organism. It also describes the molecular functions of gene products, their placement in and as cellular components, and their participation in biological processes [18]. Since much of biology works by applying prior knowledge to an unknown entity, a set of axioms that will elicit knowledge, together with the complex biological data stored in bioinformatics databases, is necessary [19]. These often require added knowledge to specify and constrain the values held in a database, and one way of capturing knowledge within bioinformatics applications and databases is through the use of ontologies [19]. At the beginning of this century, the Gene Ontology (GO) Consortium was founded. The
aim of the GO Consortium is to provide a framework for both the description and the organization of such information [20].
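As a minimal sketch of the mutation-simulation idea (assuming the Biopython library; the sequence is the first nine codons of the human beta-globin coding sequence, and the substituted base mimics the well-known sickle-cell Glu→Val change), a point mutation can be introduced and translated in silico:

    # In silico point mutation: mutate one base and compare protein products.
    # Assumes Biopython; the sequence and mutation site are illustrative.
    from Bio.Seq import Seq

    wild_type = Seq("ATGGTGCACCTGACTCCTGAGGAGAAG")  # first codons of human HBB
    site = 19                                       # A -> T inside codon 7
    mutant = wild_type[:site] + "T" + wild_type[site + 1:]

    print("Wild type:", wild_type.translate())   # MVHLTPEEK
    print("Mutant   :", mutant.translate())      # MVHLTPVEK (Glu -> Val)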
References

[1] Pons T, Montero LA, Febles JP. Computational biology in Cuba: an opportunity to promote science in a developing country. PLoS Comput Biol. 2007 Nov;3(11):e227.
[2] Kane MD, Brewer JL. An information technology emphasis in biomedical informatics education. J Biomed Inform. 2007 Feb;40(1):67-72.
[3] Renschler HE. Rational continuing medical education. Schweiz. Rundsch. Med. Prax. 1991;(19):515-23.
[4] Keller RB. Public data and private doctors: Maine tackles treatment variations. J. State Gov. 1991;64(3):83-6.
[5] Schneider M, Tognolli M, Bairoch A. The Swiss-Prot protein knowledgebase and ExPASy: providing the plant community with high quality proteomic data and tools. Plant Physiol Biochem. 2004 Dec;42(12):1013-21.
[6] Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, Pilbout S, Schneider M. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003 Jan 1;31(1):365-70.
[7] Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS. UniProt: the Universal Protein knowledgebase. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D115-9.
[8] Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS. The Universal Protein Resource (UniProt). Nucleic Acids Res. 2005 Jan 1;33(Database issue):D154-9.
[9] Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R. The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D262-6.
[10] Shah SP, He DY, Sawkins JN, Druce JC, Quon G, Lett D, Zheng GX, Xu T, Ouellette BF. Pegasys: software for executing and integrating analyses of biological sequences. BMC Bioinformatics. 2004;5(1):40.
[11] Parkinson J, Anthony A, Wasmuth J, Schmid R, Hedley A, Blaxter M. PartiGene—constructing partial genomes. Bioinformatics. 2004;20(9):1398-404.
[12] Kousthub PS, Deshpande N, Shanker K, Pandey A. BioBuilder as a database development and functional annotation platform for proteins. BMC Bioinformatics. 2004;5(1):43.
[13] Yang X, Dondeti V, Dezube R, Maynard DM, Geer LY, Epstein J, Chen X, Markey SP, Kowalak JA. DBParser: web-based software for shotgun proteomic data analyses. J Proteome Res. 2004;3(5):1002-8.
[14] Chaurasia G, Iqbal Y, Hänig C, Herzel H, Wanker EE, Futschik ME. UniHI: an entry gate to the human protein interactome. Nucleic Acids Res. 2007 Jan;35(Database issue):D590-4.
[15] Jones G, Willett P. Docking small-molecule ligands into active sites. Curr Opin Biotechnol. 1995 Dec;6(6):652-6.
[16] Lybrand TP. Ligand-protein docking and rational drug design. Curr Opin Struct Biol. 1995 Apr;5(2):224-8.
[17] Schneidman-Duhovny D, Inbar Y, Polak V, Shatsky M, Halperin I, Benyamini H, Barzilai A, Dror O, Haspel N, Nussinov R, Wolfson HJ. Taking geometry to its edge: fast unbound rigid (and hinge-bent) docking. Proteins. 2003;52:107-12.
[18] Stevens R, Goble CA, Bechhofer S. Ontology-based knowledge representation for bioinformatics. Brief Bioinform. 2000;1(4):398-414.
[19] Ashburner M, Lewis S. On ontologies for biologists: the Gene Ontology--untangling the web. Novartis Found Symp. 2002;247:66-80.
[20] Takai T, Takagi T. Introduction to gene ontology. Tanpakushitsu Kakusan Koso. 2003;48(1):79-85.
In: Computational Biology: New Research Editor: Alona S. Russe
ISBN: 978-1-60692-040-4 © 2009 Nova Science Publishers, Inc.
Short Commentary E
IDENTIFYING RELATED CANCER TYPES

C. D. Bajdik¹,², Z. Abanto¹, J. J. Spinelli¹,², A. R. Brooks-Wilson¹,³,⁴ and R. P. Gallagher¹,²

¹ Cancer Control Research Program, BC Cancer Agency
² School of Population and Public Health, University of British Columbia
³ Canada's Michael Smith Genome Science Centre, BC Cancer Agency
⁴ Department of Medical Genetics, University of British Columbia
Abstract

Background: Human cancer is often classified according to the anatomic site at which it occurs, and researchers are often taught that these cancer types are actually a spectrum of disease. A review in 2000 (Hanahan and Weinberg; Cell 2000 100:57-70) reported that all cancers share six characteristics: (1) self-sufficiency in growth signaling, (2) the ability to ignore external anti-growth signals, (3) the ability to avoid apoptosis, (4) sustained angiogenesis, (5) the capacity for limitless reproduction and (6) the ability to invade tissue and spread to other anatomic sites. Our goal was to identify related cancer types using different observational strategies.

Methods: We employed one method that used text-mining of online information about genes and disease. A second method used medical records of patients in British Columbia who were diagnosed with multiple cancer types between 1970 and 2004. A third method correlated Canadian provincial cancer rates for various cancer types.

Results: Several pairs of related cancer types were identified using each method, although no pair was identified by all three strategies. The pairs of cancer types lung/bladder and lung/kidney were both identified by the text-mining and correlation studies. Esophageal cancer and melanoma were identified as related cancer types by both the analysis of patients with multiple primary cancers and the correlation study.

Discussion: If cancer types are related, patients with one cancer might increase surveillance for other related cancer types, and drugs that are effective for treating one cancer might be successfully adapted for the related cancer types.
Introduction Cancer is often classified according to the site at which it occurs, and different cancer types are sometimes considered a spectrum of disease. This classification system closely corresponds to systems for diagnosing and treating cancer. Stomach cancer is often diagnosed
by a gastroenterologist, whereas skin cancer is often diagnosed by a dermatologist. The healthcare professionals, and sometimes entire clinics, that treat and diagnose stomach cancer patients are often different from those who treat and diagnose melanoma. Despite the benefits that come from considering different cancer types as different diseases, there are several things that all cancer types have in common. In their landmark review, Hanahan and Weinberg summarized six characteristics that all cancers share: (1) self-sufficiency in growth signaling, (2) the ability to ignore external anti-growth signals, (3) the ability to avoid apoptosis, (4) sustained angiogenesis, (5) the capacity for limitless reproduction and (6) the ability to invade tissue and spread to other anatomic sites. (Hanahan and Weinberg 2000) We suspect that some cancer types are related in additional ways. The identification of related cancer types could lead to better therapies for treating a cancer based on the success of a therapy for a related cancer type. The identification of related cancers also might lead to improved surveillance strategies in cancer patients and their families. Finally, the identification of related cancer types is expected to provide insight regarding cancer etiology, and therefore might ultimately lead to prevention measures. Our goal was to identify related cancer types using different observational strategies. This paper describes three analyses to identify related cancer types: (Study 1) text-mining information about genetic factors, (Study 2) summarizing medical records observations of people with multiple primary cancers, and (Study 3) correlating cancer rates in nine Canadian provinces.
Methods

Cancer types were defined according to the categories specified by the National Cancer Institute of Canada in their 2007 publication of Canadian cancer rates. (CCS/NCIC 2007) The 22 categories are somewhat arbitrary and are defined in Table 1. Those categories are based on anatomic sites, and defined using the topography and cell histology codes described by the International Classification of Diseases for Oncology system. (Fritz et al 2000) We excluded the cancer type "liver" because the original versions of Study 1 and Study 2 (described below) did not consider liver cancer. We performed three studies that used different strategies to identify related cancer types. Each study used observational or publicly-accessible data. We provide a list of related cancer pairs identified by each study, and compare the results. The methods for each study are briefly described below. Each study used slightly different criteria to determine whether a cancer pair was significant, but we consistently relied on two-sided confidence intervals and p-values, and did not correct for multiple testing. Readers are advised to consult the original publications from each study for further information.
Table 1. Definitions of 23 cancer types according to anatomic site and cell histology codes of the International Classification of Diseases for Oncology, Third Edition (ICD-O-3) (Fritz et al 2000). Cancer type definitions are taken from the 2007 National Cancer Institute of Canada summary (CCS/NCIC 2007).
* excluding cervix
Study 1: Text-mining

Online Mendelian Inheritance in Man (OMIM; www.ncbi.nlm.nih.gov/omim) is a computerized database of information about genes and heritable traits in human populations, based on information reported in the scientific literature. (Hamosh et al 2002) We developed an automated text-mining system for OMIM to identify genetically-related cancers, and
developed the computer program CGMIM to search for entries in OMIM that are related to one or more cancer types. (Bajdik et al 2005) The software considers all cancer types in Table 1, but did not separate Hodgkin and non-Hodgkin lymphomas because it is difficult to distinguish them in text-mining analyses. For pairs of cancer types, CGMIM generates a table with rows and columns for each cancer type, and cells containing the number of OMIM gene entries that mention an association with those cancers. If several OMIM entries mention one type of cancer, and several entries mention another type of cancer, then some entries will mention both types of cancer by chance. If the mention of different cancers occurred at random, the expected number of genes in OMIM that mention two specific types of cancer can be estimated as the total number of genes related to cancer, multiplied by the probabilities that an OMIM entry mentions each individual cancer type. The latter probabilities are estimated as the proportion of genes in OMIM that are related to each cancer type. CGMIM results are posted regularly and the source code is available from the BC Cancer Research Centre website (http://www.bccrc.ca/ccr/CGMIM). An approximate two-tailed 95% confidence interval for the ratio of observed cases (O) to expected cases (E) is O/E ± (1.96/√E). We defined a pair of cancer types to be significantly-related if the number of observed cases exceeded the number of expected cases, and the confidence interval for their ratio excluded 1. CGMIM results from March 31, 2008 were used to identify pairs of cancer types that are significantly-related for Study 1.
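Using the counts reported in the Results below (41 esophagus-related genes, 138 stomach-related genes and 21 genes mentioning both, out of 2147 cancer-related genes), the expected co-mention count and the O/E confidence interval just described can be computed as in the following minimal Python sketch (illustrative, not part of the original analysis code):

    # Expected number of OMIM entries mentioning two cancer types under
    # independence, and the approximate 95% CI for O/E described above.
    # Counts are taken from the Study 1 results (esophagus/stomach example).
    import math

    m_total = 2147   # genes related to any cancer
    n_a = 41         # genes mentioning esophageal cancer
    n_b = 138        # genes mentioning stomach cancer
    observed = 21    # genes mentioning both

    expected = m_total * (n_a / m_total) * (n_b / m_total)  # = n_a*n_b/m_total
    ratio = observed / expected
    half_width = 1.96 / math.sqrt(expected)

    print(f"Expected = {expected:.2f}, O/E = {ratio:.1f}")
    print(f"95% CI for O/E: ({ratio - half_width:.1f}, {ratio + half_width:.1f})")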
Study 2: Analysis of medical records from patients with multiple primary cancers

We considered an analysis of people diagnosed with multiple cancer types in British Columbia, Canada between 1970 and 2004. (Bajdik et al 2006) The analysis used data recorded in the BC Cancer Registry (BCCR) and considered all of the cancer types in Table 1. In people with two or more cancer types, the probability of a specific type was determined as the number of diagnoses for that cancer type divided by the total number of cancer diagnoses. If two types of cancer occur independently of one another, then the probability that someone will develop both cancers by chance is the product of the individual probabilities for each type. The expected number of people with both cancers is the number of people at risk multiplied by the separate probabilities for each cancer. An approximate two-tailed 95% confidence interval for the ratio of observed cases (O) to expected cases (E) is O/E ± (1.96/√E). For Study 2, we defined a pair of cancer types to be significantly-related if the number of observed cases exceeded the number of expected cases, and the 95% confidence interval for their ratio excluded 1.
Study 3: Correlation of regional cancer incidence rates

We considered correlations between the 2007 Canadian provincial rates for the incidence of various cancer types. The analysis used annual age-standardized incidence rates per 100,000 as reported by the National Cancer Institute of Canada. (CCS/NCIC 2007) Data was reported for 16 female cancer types and 15 male cancer types from Table 1. We did not use reported data from the province of Newfoundland and Labrador because those rates are likely to be
underestimated. (CCS/NCIC 2007) We did not use the reported prostate cancer incidence rate for the province of Quebec because of the same problem. (CCS/NCIC 2007) For females and males separately, we considered the Pearson correlation for each pair of cancer types. For Study 3, we defined a pair of cancer types to be significantly-related if the Pearson correlation was positive and the two-tailed p-value was less than 0.05.
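The correlation test used in Study 3 corresponds to the following minimal Python sketch (assuming SciPy; the provincial rate vectors are invented placeholders, one value per province, and not the published 2007 rates):

    # Pearson correlation between provincial incidence rates of two cancer
    # types. Assumes SciPy; the rate vectors are synthetic placeholders,
    # one value per province, not the published 2007 rates.
    from scipy.stats import pearsonr

    lung_rates    = [62.1, 58.4, 70.3, 66.0, 55.2, 64.8, 61.5, 68.9]
    bladder_rates = [18.2, 16.9, 21.4, 19.8, 15.1, 19.5, 17.7, 20.6]

    r, p_value = pearsonr(lung_rates, bladder_rates)
    print(f"Pearson r = {r:.2f}, two-tailed p = {p_value:.4f}")
    # Declared significantly related if r > 0 and p < 0.05.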
Results

Study 1: Text-mining

On March 31, 2008, CGMIM identified 2147 genes related to cancer. There were 38 pairs of cancer types with significantly more related genes than would be expected (Table 2). For example, there were 41 genes related to cancer of the esophagus and 138 genes related to cancer of the stomach. Assuming the cancers are independent, about three gene entries in
Table 2. Pairs of cancer types for which there are significantly more related genes than would be expected if the cancer types were independent. Results are based on text-mining of genetic information about cancer as reported in Online Mendelian Inheritance in Man (OMIM; www.ncbi.nlm.nih.gov/omim accessed March 31, 2008).
* excluding cervix
OMIM should mention both. In reality, there were 21 gene entries in OMIM that mention both stomach and esophageal cancer, and the seven-fold discrepancy indicates that these cancer types may be genetically related. The order of anatomic sites for each pair reported in
Table 2, and the order of pairs in Table 2, is alphabetic. The "groups" of pairs in the table are used only to ease reading; they do not imply how often a site is mentioned, the degree of supporting evidence for a related pair, or the importance of a particular site or pair. Recall that this study did not distinguish Hodgkin and non-Hodgkin lymphoma, and could not identify related pairs for males and females separately. A related pair of cancer types identified by this study was that of cancers affecting the ovary and the prostate. It is unlikely that anyone is diagnosed with both of these cancers, but the types might be related through a factor that affects both sexes (e.g., hormone exposure).
Study 2: Analysis of medical records for patients with multiple primary cancers

In the BCCR data, there were records for 28,159 people with multiple primary cancers diagnosed from 1970 to 2004, including 1,492 people with between three and seven diagnoses. There was only one pair of cancer types that occurred significantly more often than expected among females, and eight pairs of cancer types that occurred significantly more often than expected among males (Table 3). As in the previous table, the order of anatomic sites in each pair and the order of pairs in Table 3 is alphabetic. The order does not imply the degree of supporting evidence for a related pair nor the site's or pair's importance.

Table 3. Pairs of cancer types among females and males for which significantly more people were diagnosed with both cancers than expected. Results are based on analysis of people in British Columbia diagnosed with more than one type of primary cancer between 1970 and 2004 (Bajdik et al 2006).
Study 3: Correlation of regional cancer incidence rates

The analysis of 2007 cancer incidence rates for Canadian provinces suggested several pairs of related cancer types (Table 4). The analysis indicates two significantly-related pairs of cancer types among females, and nine significantly-related pairs of cancer types among males. In addition to evidence of related pairs, stomach cancer and melanoma incidence rates were negatively correlated for females (p < 0.05). As in Tables 2 and 3, the order of anatomic sites
in each pair and the order of pairs in Table 4 is alphabetic. The order does not imply the degree of supporting evidence for a related pair nor the site's or pair's importance.

Table 4. Pairs of cancer types among females and males for which there was a significant positive correlation of age-adjusted incidence rates in Canadian provinces during 2007.
Comparative findings
Figure 1. Pairs of significantly-related cancer types identified by three different studies.
A summary of the related cancer types identified by the three studies is presented in Figure 1. The two-way table in that figure lists each of the 22 cancer types from Table 1 in each row and each column. A distinct pattern is used in the table cells to denote a pair of cancer types identified by each of the three studies. Overlaid patterns are used if more than one study identified the corresponding pair of cancer types. Only three pairs of cancer types were identified by more than one analysis, and no pair was identified by all three methods. The pairs of lung/bladder and lung/kidney cancer were identified by both the text-mining (i.e. Study 1) and ecologic correlation (i.e. Study 3) analyses. The pair of esophageal cancer and melanoma was identified by both the BCCR multiple primary (i.e. Study 2) and ecologic correlation (i.e. Study 3) analyses.
Conclusion

We reported the findings from three strategies for identifying related cancer types. Several pairs of cancer types were identified by each method, but only three pairs were identified by two different strategies: lung cancer paired with bladder cancer, lung cancer paired with kidney cancer, and esophageal cancer paired with melanoma. No related cancer types were identified by all three strategies. We did not adjust our statistical analyses for multiple hypothesis-testing because of the studies' exploratory intent.

Cancer types might be related because of factors that affect both of them. For example, tobacco is a major risk factor for many types of cancer. (Thun and Hedley 2006) There are also genes that are associated with familial cancer syndromes involving multiple cancer types. (Lindor et al 2006) For example, Hereditary Breast and Ovarian Cancer (HBOC) syndrome involves increased risks of many different cancers. Occupational exposures can also affect someone's risk for several different types of cancer. (Siemiatycki et al 2006) Factors associated with some cancers are likely to be ingested, factors associated with other cancers are likely to be inhaled, and factors associated with still others are more likely to involve topical exposure. However, a factor like soot can have several possible routes of exposure. Another potential explanation for a related pair is that treatment for one cancer type might affect some other cancer type. Treatment for esophageal cancer involving radiation might affect someone's melanoma risk. Finally, someone can be diagnosed with multiple cancer types because disease surveillance might change following his or her first cancer diagnosis. A good example is men with bladder cancer who are later diagnosed with prostate cancer because regular follow-up involving urological examinations uncovers asymptomatic prostate disease.

Our first study involved text-mining an encyclopedia of information about human genes and their associated traits. Details of the method are provided in our original paper. (Bajdik et al 2005) The findings from the analysis provide evidence of genetically-related cancer types. This strategy used an international source of information comprising findings from epidemiological, biological, genetic and other research. In addition, the encyclopedia's information is updated daily, and software to run the analysis is available free of charge. There are disadvantages to the text-mining strategy. Negative findings about the relationship between a gene and a cancer type can be misinterpreted as evidence of a relationship (e.g., "The gene ___ is not related to ___ cancer."). Further, the use of unsupervised text-mining
meant we could not distinguish between Hodgkin and non-Hodgkin lymphoma, and prohibited us from identifying gender-specific related pairs.

Our second study examined the medical records of patients in a population who were diagnosed with more than one primary cancer type. Details of the method are provided in our original paper. (Bajdik et al 2006) That analysis suggested nine pairs of primary cancers that might be related. The method's main benefit is that it uses a large population-based source of cancer diagnosis information covering 35 years. Its disadvantages stem from the same features: the medical charts used in the analysis do not represent an international group of patients, and the data do not include non-human observations.

Our third study considered the correlation of published incidence rates for various cancer types in nine Canadian provinces. This analysis had the advantage of using rates for human cancer diagnoses only, but did not use individual patient records to identify related cancer types. Thus the strategy might indicate a relationship between cancer types A and B even though no one in the observed populations was diagnosed with both A and B. This is a disadvantage of ecological correlation studies in general.

Hanahan and Weinberg (2000) suggested that all cancers share several traits. While every cancer might be different, the similarities of various cancer types indicate that we can benefit from studying them together. The treatment of a cancer type might be improved if drugs that are effective in treating a related cancer type are used. In addition, someone who is the survivor of a cancer might be advised to increase their subsequent surveillance for related cancer types. Increased surveillance might likewise benefit a survivor's family members. Our intention in this chapter was not to test the relatedness of specific cancer types, but rather to illustrate different methodologies that identify related cancer types. More simply, the methods presented here are intended for hypothesis generation and not hypothesis testing.
Acknowledgment

CDB and ARB are supported by Scholar Awards from the Michael Smith Foundation for Health Research.
References

[1] Bajdik CD, Abanto ZU, Spinelli JJ, Brooks-Wilson A, Gallagher RP (2006) Identifying related cancer types based on their incidence among people with multiple cancers. Emerging Themes in Epidemiology 3:17
[2] Bajdik CD, Kuo B, Rusaw S, Jones S, Brooks-Wilson A (2005) CGMIM: automated text-mining of Online Mendelian Inheritance in Man (OMIM) to identify genetically-associated cancers and candidate genes. BMC Bioinformatics 6:78
[3] Canadian Cancer Society / National Cancer Institute of Canada (2007) Canadian Cancer Statistics. Toronto
[4] Fritz A, Percy C, Jack A, Shanmugaratnam K, Sobin L, Parkin DM, Whelan S (2000) International Classification of Diseases for Oncology. Third Edition. World Health Organization.
[5] Hamosh A, Scott AF, Amberger J, Bocchini C, Valle D, McKusick VA (2002) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Research 30:52-55
[6] Hanahan D, Weinberg RA (2000) The hallmarks of cancer. Cell 100:57-70
[7] Lindor NM, Lindor CJ, Greene MH (2006) Hereditary neoplastic syndromes. In Cancer Epidemiology and Prevention, 3rd Edition. Edited by Schottenfeld D, Fraumeni JF Jr. New York: Oxford University Press; 2006:562-76
[8] Siemiatycki J, Richardson L, Boffetta P (2006) Occupation. In Cancer Epidemiology and Prevention, 3rd Edition. Edited by Schottenfeld D, Fraumeni JF Jr. New York: Oxford University Press; 2006:322-54
[9] Thun MJ, Hedley SJ (2006) Tobacco. In Cancer Epidemiology and Prevention, 3rd Edition. Edited by Schottenfeld D, Fraumeni JF Jr. New York: Oxford University Press; 2006:217-42
RESEARCH AND REVIEW STUDIES
In: Computational Biology: New Research Editor: Alona S. Russe
ISBN: 978-1-60692-040-4 c 2009 Nova Science Publishers, Inc.
Chapter 1
SAMPLE SIZE CALCULATION AND POWER IN GENOMICS STUDIES

Danh V. Nguyen¹,* Damla Şentürk², Danielle J. Harvey¹ and Chin-Shang Li¹

¹ Division of Biostatistics, University of California, Davis, California, USA
² Department of Statistics, Pennsylvania State University, University Park, Pennsylvania, USA
High-throughput laboratory measurement technologies, including microarrays for genomics and proteomics, are now typically used in biomedical studies ranging from animal models to human clinical trials. These methods, such as gene expression microarrays, aim to capture the global expression patterns of thousands of genes or proteins simultaneously. A common feature is the high-dimensionality of the resulting data. Post-study analytical challenges involve extracting meaningful information from the millions of data points. However, there is also a need to develop systematic approaches to planning such studies. In this work, we provide a synthesis of the available methods and current trends in sample size and power analysis for genomics studies. Our emphasis is on clarifying the assumptions of the available methods as well as their applicability in practice, including the assumption of independent gene expression. We also emphasize emerging sample size design methods that focus on the false discovery rate (FDR) rather than the traditional family-wise error rate as a criterion.
Key words: differential gene expression; DNA microarray studies; false discovery rate (FDR); high-dimensional data; measurement error; multiple hypothesis testing; p-value; sample size; power; simulation.
1. Introduction
DNA microarray (array) technologies introduced in the mid-1990's, including spotted cDNA (Schena et al., 1995), spotted oligonucleotide (e.g. Ramakrishnan et al., 2002) and high-density Affymetrix oligonucleotide arrays (Lockhart et al., 1996; Lipshutz et al.,
* Correspondence: Danh V. Nguyen; Phone: (530) 754-6510; E-mail: [email protected]
1999), have seen explosive use broadly throughout the sciences. These array technologies allow for genome-wide monitoring of biological processes by measuring thousands of gene expressions simultaneously, or the expression of the whole repertoire of genes of an organism. See Nguyen et al. (2002) for a more comprehensive introduction to DNA array technologies, including technical and analytical aspects; Churchill (2002), Naidoo et al. (2005) and Simon et al. (2002) for general reviews of experimental design considerations; Lee et al. (2005) for emphasis on toxicogenomics; and Kerr and Churchill (2001; 2001b) and Kerr et al. (2001) for experimental design and analysis. The resulting high-throughput data pose analytical/methodological challenges, particularly with respect to computational aspects, due to the millions of data points across all samples as well as to the high-dimensionality of the data (i.e. the number of genes evaluated per sample).

Until the introduction of high-throughput technologies such as DNA arrays, the discipline of statistical science was nearly solely preoccupied with low-dimensional data and with theoretical interests focused on properties where the sample size n is large (approaches infinity). Experiments utilizing DNA arrays typically involve fewer than n = 100 samples, and n can appropriately be viewed as fixed. However, the addition of the scientific paradigm of "global" or "system-wide" monitoring of complex biological processes has led to recent interest in statistical science in studying methodological properties with the sample size n fixed while allowing the number of variables m (e.g., genes) to become large (e.g., see Kosorok and Ma, 2007). Another popular approach to the analysis of gene expression is empirical Bayes (Efron, 2007; Smyth, 2004; Efron et al., 2001).

In this work we provide an up-to-date synthesis and analysis of the current methodologies for sample size calculation and power analysis at the planning/initial stage of DNA microarray experiments. We emphasize the examination of the assumptions associated with each group of methods and their practical implications for real data. Determining the required sample size to achieve a given level of power to detect a scientifically meaningful difference between the expression of a gene under two distinct conditions (e.g. diseased versus non-diseased states) or among a series of experimental conditions is an important factor in designing DNA array experiments. This falls under the aim of searching for differentially expressed (DE; up- or down-regulated) genes among experimental conditions/groups. Other experiments may involve finding genes that can discriminate or predict a clinical outcome, in addition to known (traditional) prognostic variables (e.g. see Alizadeh et al., 2000; Sorlie et al., 2001; van't Veer et al., 2002; Tibshirani and Efron, 2002; Nguyen and Rocke, 2002b). Issues of sample size (and power) are similar in both (or any) types of experimental studies, although experiments to identify DE genes are the predominant type of study in practice, because microarrays are ideally suited as exploratory or screening tools. Thus, follow-up validation studies, using real-time RT-PCR for instance, are needed to confirm the DE genes identified (Chuaqui et al. 2002; Davidson et al. 2004). Before introducing the technical aspects of sample size and power calculation, we provide some examples of array studies to illustrate the diversity of the types of exploratory studies for detecting DE genes.
• In a study to examine the mechanisms by which n-3 polyunsaturated fatty acids (PUFAs) decrease colon tumor formation in a rat model, Davidson et al. (2004) used a 3 × 2 × 2 factorial design where animals were randomly assigned to three dietary regimens (corn oil/n-6 PUFA, fish oil/n-3 PUFA, or olive oil/n-9 monounsaturated FA), two treatments (injection of carcinogen or saline) and two time points (12 hours and 10 weeks after first injection). The two time points were chosen to correspond to the initiation and promotional stages of tumor development in order to assess the chemopreventive effects of n-3 PUFA. In addition, for some animals, multiple arrays (replications) hybridized with the same biological source in the 3 × 2 × 2 groups were used. The primary interest here is to compare n-3 PUFA with other dietary fats (e.g. n-3 PUFA versus n-6 PUFA, or n-3 PUFA versus the other dietary fats combined, etc.). Because of the repeated measurements, these comparisons of interest were based, for each of 9,685 genes (probes), on p-values from fitting a linear mixed model (Pinheiro and Bates, 2000) to account for within-animal correlation. For example, Figure 1 is a histogram of the 9,685 p-values. It is estimated that for these data about 76% of the genes are non-differentially expressed. It would be of interest at the planning stage of this study, or a similar future study, to determine the sample size needed to detect genes with effect size |µ_2j − µ_1j|/σ_j with 85% power, where µ_2j and µ_1j are the mean expression levels of gene j in the n-3 PUFA group and in the other dietary fat groups combined (n-6 PUFA plus n-9 monounsaturated FA), respectively, and σ_j² is the variability in expression of gene j. We will further explore power in the context of multiple testing, as illustrated by this example and in DNA array applications generally, where m hypothesis tests are performed corresponding to the m genes measured over n_k samples/replicates in k groups (k = 1, . . . , K). Clearly, the distribution of the effect sizes (i.e. the gene-specific mean and variance parameters, µ_kj and σ_j²) needs to be (1) assumed or (2) modelled parametrically/nonparametrically, and ideally with pilot data.

Figure 1. Distribution of p-values for n-3 PUFA vs n-6 PUFA comparison in the diet/colon cancer data.
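An estimate such as "about 76% of the genes are non-differentially expressed" can be read off the p-value histogram; one common way to compute it is Storey-type estimation of the null proportion π₀ from the p-values above a threshold λ. The minimal Python sketch below (with simulated p-values standing in for the real data, which are not reproduced here) illustrates the idea:

    # Estimate pi0, the proportion of non-differentially expressed genes,
    # from a vector of p-values: p-values above lambda are assumed to come
    # almost entirely from null genes, which are uniform on (0, 1).
    # The p-values below are simulated stand-ins (true pi0 = 0.76).
    import numpy as np

    rng = np.random.default_rng(0)
    m = 9685
    m0 = int(0.76 * m)                       # null genes: uniform p-values
    p_null = rng.uniform(0.0, 1.0, size=m0)
    p_alt = rng.beta(0.5, 8.0, size=m - m0)  # DE genes: p-values near 0
    p = np.concatenate([p_null, p_alt])

    lam = 0.5
    pi0_hat = np.mean(p > lam) / (1.0 - lam)
    print(f"Estimated pi0 = {pi0_hat:.3f}")  # close to 0.76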
Many examples of applications involving two primary comparison groups (K = 2) can be found in the literature. These include the identification of genes differentially expressed between breast cancer individuals with mutation in the BRCA1 versus BRCA2 gene (Hedenfalk et al., 2001), acute myeloid and lymphoblastic leukemia (Golub et al., 1999), and cells in cancer and normal tissues (Alon et al., 1999). Other studies of differential gene expression include studies of the effects of interferon-alphas in humans (Ji et al., 2003), the proliferation of Pre-B cells in vitro (Yang et al., 2003), and biliary tract carcinoma (Hansel et al., 2003), among others. Numerous experimental designs and analysis methods for detecting DE have been proposed, including Kerr and Churchill (2001), Kerr et al. (2001), Lee et al. (2002), Tusher et al. (2001), and Smyth (2004), among others. See Glonek and Solomon (2004) for a detailed discussion of time course designs as well as factorial designs for cDNA array experiments.

We note that although the methodologies described here were developed for high-throughput transcriptomic (genomics) data from DNA microarrays, they are applicable to other high-throughput technologies or assays. These include proteomic arrays, such as the class of spotted protein arrays initially introduced by Haab et al. (2001), as well as two-dimensional fluorescence difference gel electrophoresis (e.g. see Sharma et al., 2005; Gharbi et al., 2002; Fodor et al., 2005) and 2D-E with an internal standard (Wheelock et al., 2006). However, applications in proteomics and metabolomics predominantly use mass spectrometry (MS) and NMR spectroscopy. Many issues of design and analysis of DNA array data are also applicable to these technologies, which likewise result in high-dimensional data in the form of peak heights, peak areas, or binned spectral values, analogous to spot/gene intensities for DNA array data. Rocke (2004) provides an overview of the connection in design and analysis issues between DNA arrays and mass spectrometry and nuclear magnetic resonance (NMR) spectroscopy. Examples of methodology and applications in proteomic mass spectrometry data and metabolomic NMR data can be seen in Purohit and Rocke (2003) and Purohit et al. (2004). For an example of an application in glycomics (glucans and glycoproteins) using high-resolution MS, see Ye et al. (2007). In addition to high-dimensional "-omics" type data (transcriptomic, proteomic, glycomic, metabolomic, etc.), applications in brain imaging using MRI (magnetic resonance imaging), functional MRI, and diffusion tensor imaging (DTI) also result in high-dimensional data. For a comprehensive description of these imaging technologies, see Toga and Mazziotta (2002), and for analytical approaches see Friston et al. (2007). These techniques measure brain morphology, functional activity, and the diffusion of water molecules along fiber tracts (Basser, 1995; Basser and Pierpaoli, 1996), respectively. High-dimensional data result from these (and other) imaging modalities in that at each of the thousands of voxels or volume elements (comprising the volume of a brain region or the whole brain) a (voxel-based) measurement (from MRI) or a diffusion tensor (3 × 3 positive definite matrix) is obtained. Statistical analysis is generally performed at the voxel level across the whole brain or a region of interest (Friston et al., 2007). Thus, voxel-based analysis also requires accounting for multiple comparisons, as in genomics analysis of DE.
The false discovery rate (FDR) can be used in voxel-based analysis as an alternative measure of error control (Genovese et al., 2002; Schwartzman et al., 2005) to random field theory for the multiplicity problem. Concepts of sample size and power in genomics, or variations of them such as those based on FDR control, are applicable to imaging studies involving voxel-based analysis. The remainder of the paper is organized as follows. In Section 2 we summarize the background and preliminaries for the sample size methods described in the subsequent
sections (3-6). More specifically, we provide a review of basic sample size and power calculations in the classical setting of a single hypothesis test (e.g. expression of a single gene), the multiple testing problem, and the basics of FDR theory that will be used. In Sections 3-6 we describe various approaches to sample size determination, with emphasis on methods that target FDR control, although we also give some consideration to methods that target control of the classical family-wise error rate (FWER) in Section 6. Throughout, we highlight in our discussion of these methodologies the similarities and differences, as well as the assumptions and their implications in practice. These methods of sample size calculation can be grouped into main categories according to (1) the type of error control (e.g. FDR or FWER), (2) the computational method (e.g. analytical, analytical approximation, or simulation-based), and (3) the underlying assumptions, or a combination of these. In Section 7, we provide references to additional relevant works as well as those relating to classification studies. We conclude with a brief discussion in Section 8.
2. Preliminaries

2.1. Sample size calculation for a single hypothesis test: basic concepts
In this section, we briefly review the classical sample size calculation corresponding to a single hypothesis test and define the basic relevant concepts and notation. We then consider relevant concepts associated with multiple hypothesis testing in genomics, including results for FDR used in sample size calculations that target FDR control. Let X_i be the expression level (intensity) of a gene in individual i, for i = 1, 2, . . . , n individuals. Suppose that one is interested in testing the hypothesis that the mean expression level is at some level µ_0 versus an alternative level µ_1, H_0: µ = µ_0 versus H_1: µ = µ_1. If the observed data, {x_i}_{i=1}^n, are not compatible with H_0, then it is rejected. Table 1 shows the possible outcomes of this hypothesis testing procedure and the two types of errors that can be made (Type I and Type II). Specifically, a Type I error refers to rejecting the null hypothesis H_0 when it is true (Pr(Type I error) = α), and a Type II error refers to accepting H_0 when in fact the alternative hypothesis H_1 is true (Pr(Type II error) = β). The standard approach to hypothesis testing is to specify an acceptable (low) level of Type I error (e.g. α = 0.05 or α = 0.01) as the significance level of the test, and determine the rejection region Γ_α (rejection/decision rule) to minimize the Type II error β. The power of the test is 1 − Pr(Type II error) = 1 − β.

Table 1. Possible outcomes of testing a single hypothesis.
                        Accept H0           Reject H0
Null H0 true            Correct decision    Type I error
Alternative H1 true     Type II error       Correct decision
Consider a simple example where X_i ∼ N(µ, σ^2) with σ^2 = 1 known, and we wish to test H_0: µ = 0 (µ_0) versus H_1: µ = 1 (µ_1 > µ_0) based on n = 16 observations at level
α = 0.05. The test statistic is based on the sample mean,

T = (X̄ − µ_0)/(σ/√n),    (1)

which equals 4X̄ in this example. Under H_0, T ∼ N(0, 1) and one rejects H_0 if T ≥ z_{1−α}, where z_{1−α} is the (1 − α)th quantile of the standard normal distribution; i.e. Φ(z_{1−α}) = 1 − α. Note that Pr(Type I error) = Pr(reject H_0 | H_0 true) = Pr(T > z_{1−α}) = 1 − Φ(z_{1−α}) = α. Suppose that we observe x̄ = 0.64, so our observed test statistic is t = 4(0.64) = 2.56. The p-value for a test statistic is the probability of observing a test statistic (T), under H_0, as extreme or more extreme than what was observed (i.e. more extreme than t = 2.56). There are various ways to compute this probability. When the sampling distribution of T is known, as in this case, it can be obtained directly as p = Pr(T > 2.56) = 0.0052. If the assumed probability model for the data (the X_i's) leads to a test statistic T whose sampling distribution is unknown, the p-value can be obtained using simulation, for instance. This approach is useful when the form of T is particularly complex, with unknown exact or approximate sampling distribution. In this example (although not needed), based on B = 10000 simulations, p = B^{−1} Σ_{b=1}^{B} I{t_b > 2.56} = 0.0048, where t_b is the observed test statistic based on the bth simulated data set. Throughout, we denote the indicator function I(A) or I{A} for event A to be I(A) = 1 if A is true and I(A) = 0 otherwise. In the two-sample problem, such as comparing the gene expression levels among n_1 diseased and n_2 non-diseased individuals, the two-sample t-statistic, namely T = (X̄_1 − X̄_2)/s, can be used to test H_0: µ_1 = µ_2 versus H_1: µ_1 ≠ µ_2. Here X̄_k (k = 1, 2) is the sample mean of group k and s = {(s_1^2/n_1) + (s_2^2/n_2)}^{1/2}, where s_k^2 is the sample variance for group k. In this case, the distributional assumption on T (or equivalently on the underlying data X_{ki}'s) can be avoided in computing the p-value by using random permutations of the data and repeatedly computing the test statistic T. Denoting by t_b the observed test statistic for the bth permutation of the data, an estimate of the p-value is p = B^{−1} Σ_{b=1}^{B} I{|t_b| > |t|}, where t is the observed test statistic based on the original (unpermuted) data. The power function of a test of the null hypothesis H_0, denoted by h(n, θ), is a function of the sample size n and of a parameter of interest, denoted by θ. In the above one-sample problem, θ = µ, the mean, for instance. When comparing competing test statistics, it is appropriate to fix n and compare their respective power. For a sample size calculation to plan/design a study at a desired level of power, h(n, θ) is fixed at a given power and the root of the function can then be solved for to obtain the required n. Therefore, a sample size n can be obtained to achieve a desired level of power, typically set at 80% or higher in clinical studies, with a fixed Type I error probability α. Continuing with the one-sample problem described above, we have h(n, ∆) = Pr(T > z_{1−α} | H_1) = Pr(T* > z_{1−α} + √n δ/σ) = 1 − Φ(z_{1−α} + √n δ/σ), where T* ∼ N(0, 1), δ = µ_0 − µ_1, and δ/σ is the (standardized) effect size. For example, to obtain the sample size required to detect an effect size ∆ (e.g. µ_1 = 1, µ_0 = 0, σ = 1, ∆ = 1) with 90% power (β = 0.1), a solution to

h(n, ∆) = 0.9
(2)
is needed to find n. Equation (2) is the same as Φ(z_{1−α} + √n δ/σ) = β, where β = 0.1. Thus, z_{1−α} + √n δ/σ = z_β = −z_{1−β}, and so n = σ^2(z_{1−α} + z_{1−β})^2/δ^2. For example, with ∆ = 1 and Type I error α = 0.05, the sample size needed to achieve 90% power is n = (z_{0.95} + z_{0.9})^2 = (1.645 + 1.282)^2 = 8.57 ≈ 9 samples. We note that the power function h(n, ∆) = 1 − Φ(z_{1−α} + √n δ/σ) in this case can be obtained explicitly since the distribution of the test statistic T = √n(X̄ − µ_0)/σ is known due to the normality assumption for the data model. However, when the normal model is not assumed, the power function h(n, ∆) is approximate for large n. Whether a model assumption is sensible will depend on the specific application. For gene expression data, the distribution of raw expression values is typically not symmetric, so such a methodological assumption can be justified only after a suitable transformation of the original/raw expression data, such as a logarithm. Note also that we have considered an example with a one-sided alternative hypothesis (e.g. H_1: µ_1 > µ_0). For a two-sided test of H_0: µ = µ_0 versus H_1: µ ≠ µ_0, the null hypothesis H_0 is rejected for the extremes T ≤ −z_{1−α/2} or T ≥ z_{1−α/2}, and the power function is h(n, ∆) = 1 − Φ(z_{1−α/2} + √n δ/σ) + Φ(−z_{1−α/2} + √n δ/σ), where δ = µ_0 − µ, and therefore n = σ^2(z_{1−α/2} + z_{1−β})^2/δ^2. In the case of two-group comparisons based on the data {(X_{1i}, X_{2j}), i = 1, . . ., n_1, j = 1, . . . , n_2}, the tests of H_0: µ_1 = µ_2 versus H_1: µ_1 ≠ µ_2, H_1: µ_1 > µ_2, or H_1: µ_1 < µ_2 correspond to alternative hypotheses of differential expression, up-regulation, or down-regulation relative to condition 1, respectively. For the test of no differential expression under the normal distribution assumption (with equal known group variances σ_1^2 = σ_2^2 = σ^2), the sample size per group is n = 2σ^2(z_{1−α/2} + z_{1−β})^2/δ^2. As mentioned earlier, when h(n, ∆) (or a function of it) cannot be solved analytically to obtain the sample size, numerical methods can be used. Alternatively, (Monte Carlo) simulation can be utilized, based on an explicit probability model, to generate the data and to evaluate the resulting power as a function of n for a given test statistic of interest. The advantage of a simulation approach is that a more complex model, which may better reflect the real data, can be used for sample size planning. The main consideration in using a simulation model is then to assess/justify the model used. As is usually the case, pilot data or data from previous similar studies are helpful in making this assessment. Simulation models that use the real (e.g. pilot) data aim to address this issue.
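To make the normal-theory formulas above concrete, the following minimal sketch computes these sample sizes in Python; the function names and interface are ours (not from the text), and `scipy` is assumed to be available.

```python
from scipy.stats import norm

def n_one_sample(delta, sigma=1.0, alpha=0.05, power=0.90, two_sided=False):
    # n = sigma^2 (z_{1-a} + z_{1-beta})^2 / delta^2, with a = alpha/2 if two-sided
    a = alpha / 2 if two_sided else alpha
    z = norm.ppf(1 - a) + norm.ppf(power)
    return (sigma * z / delta) ** 2

def n_two_sample_per_group(delta, sigma=1.0, alpha=0.05, power=0.90):
    # per-group n for equal known variances: n = 2 sigma^2 (z_{1-alpha/2} + z_{1-beta})^2 / delta^2
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * (sigma * z / delta) ** 2

# Reproduces the worked example: effect size 1, one-sided alpha = 0.05, 90% power
print(n_one_sample(1.0))             # 8.56..., i.e. about 9 samples
print(n_two_sample_per_group(1.0))   # about 21 samples per group
```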
2.2. Introduction to multiple testing and false discovery rate
We next provide an introduction to multiple hypothesis testing and the relevant concepts, including results on the false discovery rate (FDR). Because of the large number of hypothesis tests carried out in genomics expression data, the use of FDR as a measure of error
has been popular, and it is particularly useful in exploratory studies, such as those using microarrays. A common application in microarray studies is to identify differentially expressed genes among two or more groups (experimental conditions), as illustrated in the Introduction with the example of colon cancer initiation and progression in the rat. Assuming that each array hybridization (sample) is from a unique animal, let {(X_{1i1}, . . ., X_{1im})}_{i=1}^{n_1} be the m gene expression levels for animals in group 1 (e.g. animals on the n-3 PUFA diet) and similarly let {(X_{2i1}, . . ., X_{2im})}_{i=1}^{n_2} be the m levels for animals in group 2 (e.g. animals on the n-6 PUFA diet). If one is interested in identifying genes over-expressed in the n-3 PUFA group, the corresponding hypothesis for gene j is H_{0j}: µ_{1j} = µ_{2j} versus H_{1j}: µ_{1j} > µ_{2j}, for j = 1, . . ., m. These m hypotheses are typically dependent in practice, as in array data, although independence is often assumed as an approximation. As we will describe later, some methods for sample size incorporate simple dependence structures that are tractable. Simulation studies can also be used to assess the assumption of independence as well as specific dependence structures. When testing these m hypotheses, the possible outcomes are summarized in Table 2. Among the m hypothesis tests (m genes), m_0 and m_1 denote the number of truly unexpressed and truly expressed genes, respectively. The number of rejections ("discoveries") is R and the number of non-rejections is W (= m − R). Note that only R, W and m are observable from Table 2. The proportion of genes truly unexpressed is denoted by π_0 ≡ m_0/m. From Table 2, note that

V/m_0 = proportion of genes declared significant among genes actually not DE
S/m_1 = proportion of genes declared significant among genes actually DE
1 − V/m_0 = U/m_0 = TN/m_0 = specificity
V/R = proportion of false discoveries
S/R = proportion of true discoveries
U/W = TN/(TN + FN) = negative predictive value
(U + S)/m = accuracy.
Table 2. Possible outcomes of testing m hypotheses. The proportion of true null hypotheses is π_0 ≡ m_0/m and FDR = E[(V/R) I{R > 0}].
                                     Accept (Declare unexpressed)        Reject (Declare expressed)        Total
Null true (Truly unexpressed)        U (True Negative/TN)                V (Error I, False Positive/FP)    m0
Alternative true (Truly expressed)   T (Error II, False Negative/FN)     S (True Positive/TP)              m1
Total                                W                                   R                                 m
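For illustration, the summary quantities listed above can be computed directly from the Table 2 counts; the small sketch below does so, with a hypothetical function name and interface of our choosing.

```python
def multiple_testing_summary(U, V, T, S):
    """Summary quantities from the Table 2 counts (U, V, T, S): a sketch."""
    m0, m1 = U + V, T + S              # true nulls and true alternatives
    R, W = V + S, U + T                # rejections and non-rejections
    return {
        "pi0": m0 / (m0 + m1),
        "sensitivity": S / m1,                         # S/m1
        "specificity": U / m0,                         # 1 - V/m0
        "false discovery proportion": V / max(R, 1),   # V/R
        "negative predictive value": U / max(W, 1),    # TN/(TN + FN)
        "accuracy": (U + S) / (m0 + m1),               # (U + S)/m
    }

print(multiple_testing_summary(U=9200, V=50, T=250, S=500))
```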
The conventional criterion for controlling error in multiple testing is to control the family-wise error rate (FWER), Pr(V ≥ 1). This is the probability of making at least one false positive error. Testing each of the m hypotheses at the comparison-wise error rate (CWER) of α (e.g. α = 0.05) does not control (guarantee) the FWER at level α. The simplest method to control the FWER at level α is to test each hypothesis at the CWER α* = α/m, which is the well-known Bonferroni correction. This approach is not reasonable when m is large (for array data m ∼ 1000-50000), since it would require testing each hypothesis at a level α* that is too small. For reasonable levels of CWER, Pr(V ≥ 1) is large for m of the order typical of array experiments. Improvements on the Bonferroni correction that still control the FWER, based on multi-step procedures, were proposed by Holm (1979) and Hochberg (1988), although (1) dependency of the test statistics is not used and (2) the gain is modest, particularly for m of the order of microarray applications. An alternative criterion that is widely used in genomics studies for error control is the false discovery rate (FDR). Although the FDR was proposed by Benjamini and Hochberg (1995) in a different context of application, it has been found to be useful for genomics, imaging and other high-dimensional data. In the context of multiple testing, as summarized in Table 2, the FDR is the expected proportion of false discoveries among the R discoveries or rejections. More precisely, the FDR (Benjamini and Hochberg, 1995, herein BH) is defined as

FDR = E[(V/R) I{R > 0}] = E[V/R | R > 0] Pr(R > 0).    (3)

Denoting the ordered observed p-values by p_(1), . . . , p_(m), the BH FDR controlling procedure is to find

k̂_BH = max{j : p_(j) ≤ (j/m)α}    (4)

and reject the hypotheses corresponding to p_(1), . . . , p_(k̂_BH). The pre-specified FDR target control level is α ∈ (0, 1). It was shown by BH that FDR ≤ π_0 α ≤ α for the FDR controlling procedure (4), where the last inequality follows since π_0 ≤ 1. The FDR controlling procedure (4) was later shown by Finner and Roters (2001) to control the FDR at exactly level π_0 α. A refinement by adaptive control was proposed in Benjamini and Hochberg (2000). Thus, procedure (4) is conservative, and improved power results from incorporating an estimate of π_0 into the FDR controlling procedure (4). This was recognized in the context of genomics data, and new methods that aim to estimate π_0 for FDR estimation were proposed by Storey (2002; 2003) and Storey and Tibshirani (2003), although the improvement due to estimating π_0 (or m_0) in multiple testing had been recognized earlier (e.g. Schweder and Spjøtvoll, 1982). The effect of estimating π_0 on the improvement in power to detect DE was examined by Nguyen (2004a,b, 2005), among others. Storey (2002) introduced an alternative, direct approach to the BH sequential FDR procedure (4); see also Storey (2003) and Storey et al. (2004). This approach involves first estimating π_0. We briefly review this estimation procedure, assuming independent test statistics (or equivalently p-values). Large p-values, say p_i > λ, 0 < λ < 1, suggest that the observed data are more compatible with true null hypotheses; i.e. more consistent with H_{0j} for gene j. Estimation of π_0 can be based on the set of large p-values falling into the upper interval (λ, 1]. Also, note that if no genes are differentially expressed then the null p-values are uniformly
distributed, denoted P ∼ U(0, 1), and Pr{P ∈ (λ, 1]} = Pr(P > λ) = 1 − λ. Therefore, the expected number of null p-values falling into the interval (λ, 1] is (1 − λ)m_0. If the number of null p-values in (λ, 1], namely #{Null p_j > λ}, were known, an unbiased estimate of π_0 would be

π̂_0(UB) = #{Null p_j > λ} / {m(1 − λ)},    (5)
since E[π̂_0(UB)] = m_0/m = π_0. Because the numerator in (5) is not observable, replacing it with #{p_j > λ}, an observable quantity, leads to a conservatively biased estimate of π_0:

π̂_0(λ) = #{p_j > λ} / {m(1 − λ)} = W(λ) / {m(1 − λ)}.    (6)

The estimator π̂_0(λ) is conservatively biased since #{p_j > λ} = #{Null p_j > λ} + #{Alt. p_j > λ} ≥ #{Null p_j > λ} and, thus, E[π̂_0(λ)] ≥ E[π̂_0(UB)] = π_0. Note that as λ approaches 1, #{p_j > λ} consists mostly of truly null p-values, and therefore the bias decreases. At the same time, due to the resulting small interval of length 1 − λ (as λ → 1), the variance increases. Thus, λ serves as a tuning parameter that balances bias and variance. Storey (2002) and Storey and Tibshirani (2003) proposed the following automatic algorithm for choosing λ to minimize the mean squared error of π̂_0(λ), where the optimal estimator of π_0 is defined as π̂_0(OPT) ≡ lim_{λ→1} π̂_0(λ):

1. For each λ_k ∈ R = {0, 0.01, 0.02, . . ., 0.95}, compute π̂_0(λ_k).

2. Fit a natural cubic spline f̂ with 3 degrees of freedom through the data points {λ_k, π̂_0(λ_k)}_{k=1}^{96}, weighting each point by 1 − λ_k.

3. Estimate π_0 by π̂_0(OPT) = f̂(1).

For example, application to the colon cancer data gives the estimated proportion of non-differentially expressed genes between the n-3 PUFA and n-6 PUFA enriched diets as π̂_0(OPT) = 0.758. Thus, the FDR controlling procedure incorporating π̂_0(λ) is

k̂_λ = max{j : p_(j) ≤ (j/m)(α/π̂_0(λ))}.    (7)
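As an illustration of the π_0 algorithm and of procedure (7), the sketch below follows steps 1-3 above, except that the weighted natural cubic spline is approximated by a weighted cubic polynomial fit (our simplification, not the authors' exact smoother); setting pi0 = 1 recovers the BH procedure (4).

```python
import numpy as np

def estimate_pi0(pvalues):
    """Smoother-based estimate of pi0 (Storey-Tibshirani style): a sketch.

    The weighted 3-df natural cubic spline of the original algorithm is
    approximated here by a weighted cubic polynomial, an assumption of
    this sketch, then extrapolated to lambda = 1.
    """
    p = np.asarray(pvalues)
    m = p.size
    lambdas = np.arange(0.0, 0.96, 0.01)             # {0, 0.01, ..., 0.95}
    pi0_grid = np.array([(p > l).sum() / (m * (1.0 - l)) for l in lambdas])
    coefs = np.polyfit(lambdas, pi0_grid, deg=3, w=1.0 - lambdas)
    return float(np.clip(np.polyval(coefs, 1.0), 0.0, 1.0))

def adaptive_bh(pvalues, alpha=0.05, pi0=None):
    """Procedure (7): reject the k smallest p-values, where
    k_hat = max{j : p_(j) <= (j/m)(alpha/pi0_hat)}; pi0 = 1 gives BH (4)."""
    p = np.asarray(pvalues)
    m = p.size
    pi0 = estimate_pi0(p) if pi0 is None else pi0
    order = np.argsort(p)
    ok = p[order] <= (np.arange(1, m + 1) / m) * (alpha / max(pi0, 1e-8))
    reject = np.zeros(m, dtype=bool)
    if ok.any():
        reject[order[: ok.nonzero()[0].max() + 1]] = True
    return reject

# Example with a simulated mixture of null (uniform) and small p-values
rng = np.random.default_rng(0)
pvals = np.concatenate([rng.uniform(size=7500), rng.beta(0.5, 10, size=2500)])
print(estimate_pi0(pvals))            # roughly 0.75 for this mixture
print(adaptive_bh(pvals).sum())       # number of discoveries
```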
Note that taking π̂_0(λ) ≡ 1 gives the BH procedure (4). In addition to the estimation of the proportion of null hypotheses, the following basic results on direct estimation of the FDR (due to Storey, 2002; Storey et al., 2004) are useful for sample size calculations that aim to control FDR (described in subsequent sections). Storey (2002) introduced the positive FDR (pFDR),

pFDR = E[V/R | R > 0] = FDR / Pr(R > 0),    (8)

and showed that

pFDR(γ) = π_0 Pr(P ≤ γ | H = 0) / Pr(P ≤ γ) = π_0 γ / Pr(P ≤ γ)    (9)
for independent p-values, where γ is the rejection threshold corresponding to a fixed rejection region [0, γ], and H = I(alternative is true) (Storey, 2003; Theorem 1). The denominator, Pr(P ≤ γ), can be estimated by P̂r(P ≤ γ) = R(γ)/m, where R(γ) = #{p_i ≤ γ}. Since pFDR is conditioned on R(γ) > 0 and Pr(R(γ) > 0) ≥ 1 − (1 − γ)^m ≡ P̂r(R > 0), Storey (2002) proposed the following conservatively biased estimator of pFDR:

p̂FDR_λ(γ) = π̂_0(λ)γ / {P̂r(P ≤ γ) P̂r(R > 0)} = W(λ)γ / [(1 − λ){R(γ) ∨ 1}{1 − (1 − γ)^m}],    (10)
where R ∨ 1 = max{R, 1}. Thus, by dropping the estimate of Pr(R > 0) from (10), we obtain a direct estimator of the FDR:

F̂DR_λ(γ) = W(λ)γ / [(1 − λ){R(γ) ∨ 1}].    (11)
The estimate (11) is conservative in the sense that E[F̂DR_λ(γ)] ≥ FDR(γ) for all γ and π_0 (Storey, 2002; Theorem 2). As we will describe in Section 4, sample size calculation methods use the result (9) and the relation between FDR and pFDR when the number of hypotheses m is large. Because the lower bound Pr(R(γ) > 0) ≥ 1 − (1 − γ)^m ≈ 1 for m large, as in array data, from definition (8) we have pFDR ≈ FDR. It is also convenient to express result (9) equivalently in terms of test statistics instead of p-values. In terms of the independent test statistics T_1, . . . , T_m and rejection region Γ = [0, γ], (9) is equivalent to

pFDR(Γ) = π_0 Pr(T ∈ Γ | H = 0) / Pr(T ∈ Γ) = Pr(H = 0 | T ∈ Γ),    (12)
with Pr(T ∈ Γ) = π_0 Pr(T ∈ Γ | H = 0) + π_1 Pr(T ∈ Γ | H = 1) (Storey, 2002; Theorem 1). Thus, to control FDR at level α (when pFDR ≈ FDR), (12) is set to be less than or equal to α, i.e.

{α/(1 − α)} (π_1/π_0) ≥ Pr(T ∈ Γ | H = 0) / Pr(T ∈ Γ | H = 1).    (13)

We note here that although the BH method for FDR control was originally shown to hold for independent p-values, it has also been shown to hold for certain classes of dependency, such as positive regression dependency (Benjamini and Yekutieli, 2001). Estimation of the FDR under dependence is also proposed in Storey and Tibshirani (2003).
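The direct estimator (11) is simple to compute from the observed p-values; a minimal sketch follows, with λ = 0.5 as an illustrative default of our choosing (not a recommendation from the text).

```python
import numpy as np

def fdr_hat(pvalues, gamma, lam=0.5):
    """Direct FDR estimate (11): W(lambda)*gamma / ((1 - lambda) * max(R(gamma), 1)),
    with W(lambda) = #{p_j > lambda} and R(gamma) = #{p_j <= gamma}."""
    p = np.asarray(pvalues)
    W = (p > lam).sum()
    R = (p <= gamma).sum()
    return W * gamma / ((1.0 - lam) * max(R, 1))
```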
3. General simulation-based approaches to sample size and power planning
In the past decade, an unprecedented amount of transcriptomic data from microarray experiments has been generated, ranging from laboratory to clinical studies and spanning various genomes. Human array data are particularly abundant, available from individual investigator databases that are open to the public and also from more formal public repositories, including the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/). These data can be selected and combined to form pilot
data to estimate model parameters (e.g. variability) used in defining effect sizes for sample size and power analysis. The abundance of "pilot" data provides a platform for considering more complex models for gene expression on which the determination of sample size and power can be based (e.g. see Page et al., 2006). The availability of data, coupled with modest computational requirements, makes simulation-based approaches to sample size and power planning feasible. Consider the case of a two-group comparison with pilot data {(X_{ki1}, . . . , X_{kim})}_{i=1}^{n_k} for groups k = 1, 2. One can consider a parametric model, such as the normal distribution, for processed gene expression with gene-specific parameters based on the pilot data. For instance, in the case of independent gene expression, Li et al. (2005) considered X_{kij} ∼ N(0, σ̂_j^2) for the m_0 unexpressed genes (j ∈ M_0). For DE genes, the mean difference for the experimental group relative to the control (reference) group, for instance, can be assumed to be in units of the standard deviation from the pilot data: X_{2ij} ∼ N(δσ̂_j, σ̂_j^2), for j ∈ M_1. The set of DE genes, namely M_1, can be selected at random from the pilot data. Data sets can be generated from this parametric simulation model that allow for gene-specific effects. Note that if the gene-specific parameters are not obtained from the pilot data, the data can be generated to be within a target range of effect sizes. The BH or Storey and Tibshirani (ST) FDR controlling procedure can then be applied to each of the simulated data sets to estimate power and sample size. This approach may be generalized to comparisons among more than two groups as well as to other hypothesis tests of interest, including linear regression models or mixed-effects regression models. For example, in the colon cancer example in the Introduction, gene-specific fixed effects parameters as well as variance components from the pilot data can be used as input parameters into the normal model for a linear mixed-effects model to simulate expression data. An important challenge in sample size calculations (as well as in other methods for the analysis of gene expression data) is the generation/simulation of expression data that reflects the characteristics of real expression data. A key step is to relax the artificial assumption of independence and to incorporate into the data generation model the correlation structure of genomics data. The transcriptional state of the repertoire of genes under a given condition is a coordinated, complex process. Dependence among gene expression, whether among individual genes or within clusters/families of related functional members, has been shown in numerous microarray experiments. The important issue of incorporating the dependence structure of gene expression data, addressed by Li et al. (2005), arises in the simulation of gene expression data for two-group comparisons. Their procedure involves three parts: (1) removing the potential (observed) differences between groups in the pilot data; (2) resampling from the data; and (3) randomly selecting DE genes by adding a mean difference between the two groups, δσ̂_j, to each resampled data set. Although mean differences between the two groups are removed, the correlation structure is retained. More precisely, these steps are as follows:

1. Let x_{k1}, . . ., x_{kn_k} be the original m-vectors of expression data for groups k = 1, 2. Remove systematic differences in gene expression between groups not due to noise by taking x*_{ki} ≡ x_{ki} − x̄_k + z̄, i = 1, . . . , n_k, where x̄_k = (x̄_{k1}, . . . , x̄_{km})^T is the vector of average expression for the m genes in group k and z̄ is the vector of overall averages over the two groups combined.
2. New expression data for group k are then generated by repeatedly resampling (with replacement) from {x*_{ki}}_{i=1}^{n_k}.

3. DE genes can then be (randomly) selected and a mean difference between the two groups, δσ̂_j, is assigned. The number of DE genes is set at m_1.

Suppose that steps (2)-(3) are repeated N_D times (e.g. N_D = 1000), resulting in N_D simulated data sets. The BH and ST FDR procedures are then applied to each generated data set, which retains the original correlation structure, from which power and sample size can be determined. Because π_0, the proportion of null (non-DE) genes, is typically unknown, its effect on sample size and power should be assessed in the simulation; this is done by simply varying m_1 in step (3), i.e. repeating the simulation for various values of m_1. Nguyen et al. (2007) also proposed a simulation-based approach to determine sample size and power using the ST FDR controlling algorithm, where the model for gene expression incorporates both additive and multiplicative measurement errors,

x_k = µ_k e^η + ε,    k = 1, . . ., K,    (14)
where x_k is the observed (gene expression) intensity measurement and µ_k is the true (unknown) gene expression (Rocke and Durbin, 2001) in group k. For two comparison groups, K = 2. The gene expression measurement error model (14) has been widely adopted and provides a reasonable approximation to empirical data (see, for example, Rocke and Durbin, 2001; Zien et al., 2003; Huber et al., 2002, and references therein). In the above error model, the additive and multiplicative measurement errors are

ε ∼ N(0, σ_ε^2)  and  η ∼ N(0, σ_η^2).    (15)
These terms represent, respectively, the error associated with genes that are not expressed or are expressed at low levels, and the multiplicative (proportional) error for genes expressed at high levels. Model (14) is a two-component error model which approximates a constant standard deviation for low expression levels and a constant coefficient of variation for higher expression levels. More specifically, the following lognormal model of gene expression for µ_k is adopted (Zien et al., 2003):

µ_k = µ*_k e^β,    k = 1, . . ., K,    (16)
where β ∼ N(0, σ^2), µ*_k is the mean gene expression level in group k, and the parameter σ represents the standard deviation of the biological variability. The family of lognormal distributions has been used as a model for gene expression (Nguyen and Rocke, 2004; Zien et al., 2003; Nguyen, 2004a,b; Konishi, 2004, among others). See Limpert et al. (2001) for a general introduction to the use of the lognormal distribution in the sciences. For two comparison groups, the fold ratio of expression between groups 1 and 2 for gene j is θ_j ≡ µ*_{1j}/µ*_{2j}. The signal-to-noise ratio, averaged over groups 1 and 2, is √(µ*_{1j} µ*_{2j})/σ_ε. Simulated data allowed for varying levels of fold changes in Nguyen et al. (2007). Alternatively, with pilot data, the effect sizes can be directly estimated from the data, as in the work of Li et al. (2005) described above. The measurement error parameters (σ_ε^2 and σ_η^2) can generally be estimated based on an array with technical replicates (e.g. Bartosiewicz et al. (2000), Stuart et al. (2001), and Lemon et al. (2002), among others). The following procedure is then applied to the N_D simulated data sets:
Figure 2. Expected power and FDR control. (Left) The power surface P(n, π_0) as a function of sample size n and π_0. (Right) The expected FDR control, set at level α = 0.05. Adapted from Nguyen et al. (2007).
1. Specify the desired FDR control level α ∈ (0, 1).

2. Compute the test statistics, denoted by t_1, . . . , t_m, and their corresponding p-values, p_1, . . . , p_m, for the null and alternative hypotheses H_{0j} and H_{1j}, for genes j = 1, . . ., m.

3. Apply the ST FDR controlling method.

4. Generate the power surface, P(n, π_0), for determining power and sample size. The power corresponding to each sample size n and each π_0, the proportion of truly non-DE genes, is obtained as N_D^{−1} Σ_{l=1}^{N_D} (s_l/m_1). It is the average proportion of true alternative hypotheses correctly identified (discovered), averaged over the N_D data sets (simulations).

For example, Figure 2 displays the estimated power and FDR control for the case of two comparison groups under the measurement error model above. Due to the complexity of the model, possibly with a general dependence structure, it is important to check the FDR control from the simulation. The average proportion of false rejections, averaged over the N_D simulations, is obtained as N_D^{−1} Σ_{l=1}^{N_D} v_l/max{r_l, 1} (right plot in Figure 2). Li et al. (2005) provide a careful study of FDR control in the data-based simulation model and found that the actual FDR is under-controlled with correlated gene expression. Because FDR controlling procedures (e.g. BH) have been shown to hold under independence and only some special dependence structures, an examination of FDR control is needed, as general dependence is the case with real array data.
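A minimal version of this simulation loop is sketched below. It generates two-group data under a simple independent normal model (standing in for the resampling and measurement-error models of Li et al. (2005) and Nguyen et al. (2007)), applies the BH procedure (the ST version would substitute α/π̂_0(λ)), and averages s_l/m_1 and v_l/max{r_l, 1} over the N_D replications. All function and parameter names here are our own.

```python
import numpy as np
from scipy.stats import ttest_ind

def bh_mask(p, alpha):
    # BH step-up: reject the k smallest p-values, k = max{j : p_(j) <= (j/m) alpha}
    m = p.size
    order = np.argsort(p)
    ok = p[order] <= alpha * np.arange(1, m + 1) / m
    rej = np.zeros(m, dtype=bool)
    if ok.any():
        rej[order[: ok.nonzero()[0].max() + 1]] = True
    return rej

def simulate_power_fdr(n_per_group, m=2000, m1=100, delta=1.0, sigma=1.0,
                       alpha=0.05, n_sims=200, seed=0):
    """Average power and realized FDR over n_sims simulated data sets: a sketch."""
    rng = np.random.default_rng(seed)
    de = np.zeros(m, dtype=bool)
    de[:m1] = True                      # which genes are truly DE
    power_sum = fdr_sum = 0.0
    for _ in range(n_sims):
        x1 = rng.normal(0.0, sigma, size=(n_per_group, m))
        x2 = rng.normal(0.0, sigma, size=(n_per_group, m))
        x2[:, de] += delta * sigma      # mean shift for the DE genes
        p = ttest_ind(x1, x2, axis=0).pvalue
        rej = bh_mask(p, alpha)
        r, s = rej.sum(), (rej & de).sum()
        power_sum += s / m1             # s_l / m_1
        fdr_sum += (r - s) / max(r, 1)  # v_l / max{r_l, 1}
    return power_sum / n_sims, fdr_sum / n_sims

# Varying n (and m1, hence pi0) traces out a power surface like Figure 2
print(simulate_power_fdr(n_per_group=10))
```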
4. Methods assuming independence in gene expression
Many sample size calculation methods assume independent test statistics. These approaches are based on the result of Storey (2002, Theorem 1), where all tests are assumed to be
independent and identically distributed and the null hypothesis H_0 is true with probability π_0. This is the result described by (12), i.e.

pFDR(Γ) = π_0 Pr(T ∈ Γ | H = 0) / Pr(T ∈ Γ) = Pr(H = 0 | T ∈ Γ).    (17)
As described in Section 2.2, Pr(R > 0) ≈ 1 in microarray applications because the probability of no significant discoveries among the thousands of genes monitored is close to zero. Thus, pFDR(Γ) ≈ FDR(Γ) and result (17) can be used to determine the sample size. These approaches, utilizing the simplifying assumption of independent and identical tests, were developed by Liu and Hwang (2007) and Jung (2005). More precisely, controlling FDR at level α, i.e. FDR ≤ α, is equivalent to

∆ ≥ Pr(T ∈ Γ | H = 0) / Pr(T ∈ Γ | H = 1),    (18)
where ∆ = {α/(1 − α)}(π_1/π_0). Note that α, the desired level of FDR control, is to be specified by the investigator, and the level of control will depend on the type of experiment. The proportion of null genes π_0 is best estimated when pilot data are available; otherwise its specification is also required. A rejection region Γ is chosen to satisfy the boundary of (18). Consider the two-group comparison based on the t-statistic for gene j, namely T_j = (X̄_{1j} − X̄_{2j}) / √{s_j^2 (n_1^{−1} + n_2^{−1})}, to test H_{0j}: δ_j = 0 versus H_{1j}: δ_j ≠ 0, where δ_j = µ_{1j} − µ_{2j}, s_j^2 = (n − 2)^{−1}{(n_1 − 1)s_1^2 + (n_2 − 1)s_2^2}, and n_k (k = 1, 2) are the group sample sizes with n = n_1 + n_2. The null hypothesis H_{0j} is rejected when |T_j| > c_j, for a threshold constant c_j. Thus, application of (18) to the two-group comparison gives

∆ = Pr(|T_j| > c_j | H = 0) / Pr(|T_j| > c_j | H = 1),    (19)
where c_j is the critical value corresponding to gene j. When H_{0j} holds (gene j is not DE), Pr(|T_j| > c_j | H = 0) = 2t(n − 2; −c_j), where t(d; c) denotes the cumulative distribution function of the central t-distribution with d degrees of freedom (DF). Under the alternative hypothesis H_{1j}, T_j ∼ t(n − 2, θ_j), the non-central t-distribution with non-centrality parameter θ_j = δ_j / {σ_j √(n_1^{−1} + n_2^{−1})}. Thus, the denominator in (19) is Pr(|T_j| > c_j | H = 1) = 1 − t(n − 2, θ_j; c_j) + t(n − 2, θ_j; −c_j). Note that for (normalized and) log (base 2) transformed data, a two-fold change for gene j corresponds to δ_j = 1. Also, although not of much practical interest, δ_j = δ and σ_j = σ correspond to the case where the differential expression and the variability in gene expression are identical for each gene. Clearly, other test statistics T with known sampling distributions can be used in (18) for other designs or hypothesis tests. This includes the multi-group comparison using the F-test, as described in Liu and Hwang (2007). For instance, in a (simple single) loop design (see e.g. Yang and Speed (2003), Smyth (2004), and references therein) with three groups/treatments (G1 → G2 → G3 → G1), the design matrix

X = [  1   0
      -1   1
       0  -1 ]
compares group 1 to 2 and group 2 to 3, based on a single set of experiments with three arrays (e.g. in a two-color cDNA array). Liu and Hwang (2007) proposed to determine the number of sets of slides for a given design to control the FDR based on a linear model for each set i (i = 1, . . ., n),

Y_{ij} = Xβ_j + ε_{ij},    j = 1, . . . , m.    (20)
Here β_j is a vector of size p for gene j, Y_{ij} is the vector of expression measures of gene j over the ith replicate set, and ε_{ij} is the corresponding random error. A null hypothesis test of interest pertaining to gene j can be formulated as H_{0j}: Lβ_j = 0 versus H_{1j}: Lβ_j ≠ 0, where L is an r × p (r ≤ p) matrix of coefficients consisting of the contrasts/comparisons of interest (see, e.g., Seber, 1977). Assuming that the error ε_{ij} is normally distributed with variance σ^2, the F-test of H_{0j} is T_j = {r^{−1}(Lβ̂_j)^T [L(X^T X)^{−1}L^T/n]^{−1}(Lβ̂_j)} / {Σ_{i=1}^n ε̂_{ij}^T ε̂_{ij}/d(n)}, where ε̂_{ij} = Y_{ij} − Xβ̂_j, d(n) is the degrees of freedom depending on the sample size n, and β̂_j = Σ_{i=1}^n (X^T X)^{−1}X^T Y_{ij}/n is the least squares estimator of β_j. Under the null hypothesis H_{0j}: Lβ_j = 0, the test statistic T_j is distributed as a central F-distribution with r and d(n) DF, denoted by F(r, d(n)). Under the alternative hypothesis H_{1j}: Lβ_j ≠ 0, T_j is distributed as a non-central F-distribution, denoted F(r, d(n), λ), where λ = (Lβ_j)^T Σ^{−1}(Lβ_j) and Σ = σ^2 L(X^T X)^{−1}L^T/n. To determine the sample size n, the probabilities of rejection under the null and alternative hypotheses, based on the central and non-central F-distributions respectively, are used to solve (18):

∆ = Pr(T_j > c_j | H = 0) / Pr(T_j > c_j | H = 1) = {1 − F(r, d(n); c_j)} / {1 − F(r, d(n), λ; c_j)}.    (21)
Here, F(df_1, df_2, λ; c) denotes the cumulative distribution function of the F-distribution. Numerical integration methods are generally needed to solve (17). To accommodate different effect sizes (e.g. δ_j) and gene-specific characteristics (e.g. σ_j), Liu and Hwang (2007) suggested modelling the distribution of the gene-specific parameters parametrically or nonparametrically (e.g. f̂(δ_j, σ_j)) from pilot data. Sensitivity of these proposed methods to dependence in gene expression has not been assessed. Under the independence assumption, Jung (2005) provided a similar approach to sample size calculation controlling the FDR for the two-group comparison. The method is based on the relation

FDR(γ) = m_0 γ / {m_0 γ + Σ_{j∈M_1} Pr(P_j ≤ γ | H_{1j})},    (22)

(Storey, 2002; see (12) in Section 2.2), where M_1 is the set of truly DE genes. Jung (2005) considered the two-sample comparison test statistic for gene j, T_j = (X̄_{1j} − X̄_{2j}) / {s_j √(n_1^{−1} + n_2^{−1})}. Although not a justifiable assumption in genomics applications, assuming that the sample size is large (n_k large for k = 1, 2), T_j ∼ N(0, 1) under H_{0j}: δ_j = 0. This assumption, together with the assumption of equal/constant effect size, leads to a simple closed-form solution for sample size based on (22). For the more realistic cases, numerical methods are needed to obtain a solution for the sample size. For illustration, consider the one-sided test (e.g. up- or down-regulation) with alternative H_{1j}: δ_j > 0. Then
under the alternative, T_j ∼ N(δ_j √(n a_1 a_2)/σ_j, 1), where a_k = n_k/n. Thus,

FDR(γ) = m_0 γ / {m_0 γ + Σ_{j∈M_1} [1 − Φ(z_{1−γ} − δ_j √(n a_1 a_2)/σ_j)]}.    (23)
Note that this is a specific application of (18). The expected number of true rejections (discoveries) at the single-test significance level γ is

E{S(γ)} = Σ_{j∈M_1} [1 − Φ(z_{1−γ} − δ_j √(n a_1 a_2)/σ_j)],    (24)

and a common measure of power for multiple testing is the expected proportion of true discoveries, E(S/m_1). Thus, given the specified expected number of true discoveries that one wants to detect, say E{S(γ)} = s (s ≤ m_1) at the CWER level γ, and the FDR control set at level α, (23) becomes α = m_0 γ/(m_0 γ + s), which gives the modified single-gene significance level (Type I error) of γ* = (s/m_0){α/(1 − α)} needed to detect s discoveries on average with the FDR controlled at level α. Thus, upon substituting the new γ* and the desired expected number of true discoveries into (24), the sample size needed can be obtained by solving h(n, ∆) = 0, where

h(n, ∆) = Σ_{j∈M_1} [1 − Φ(z_{1−γ*} − δ_j √(n a_1 a_2)/σ_j)] − s    (25)
and ∆ = (δ_1/σ_1, . . . , δ_m/σ_m). Jung (2005) proposed using the bisection method to solve h(n, ∆) = 0; for the unrealistic case of equal effect sizes for all genes (δ_j/σ_j = δ/σ for all j = 1, . . . , m), n = {σ^2(z_{1−γ*} + z_{1−β*})^2/(a_1 a_2 δ^2)} + 1, where 1 − β* = s/m_1. For two-sided tests of no differential expression for each gene, γ* and δ_j in (25) are replaced by γ*/2 and |δ_j|. To avoid the large-sample assumption (n_k → ∞), Jung also considered replacing the normal distribution by the non-central t-distribution to compute Pr(P_j ≤ γ | H_{1j}) in (22), as was done in Liu and Hwang (2007). (See the earlier discussion.) As discussed earlier, the gene-specific effect sizes can be easily estimated based on similar pilot data in most cases. However, the effect of dependence in gene expression on sample size and power estimation is important and has been recognized. There have been some approaches to incorporate the dependence structure, and we discuss some of these next.
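Jung's (2005) calculation is easy to implement; below is a sketch under the normal approximation, solving (25) by bisection, with a hypothetical numerical example of our choosing (m = 10000 genes, m_1 = 200 DE genes, target s = 160, i.e. average power 0.8).

```python
import numpy as np
from scipy.stats import norm

def jung_sample_size(effect_sizes, m, s, fdr_alpha=0.05, a1=0.5, a2=0.5):
    """Total sample size n for FDR control following Jung (2005): a sketch.

    effect_sizes: array of delta_j / sigma_j for the m1 truly DE genes;
    m: total number of genes; s: target expected number of true discoveries.
    Uses gamma* = (s/m0) * alpha/(1 - alpha) and solves h(n, Delta) = 0
    in (25) by bisection, under the normal approximation of the text.
    """
    effect_sizes = np.asarray(effect_sizes, dtype=float)
    m0 = m - effect_sizes.size
    gamma_star = (s / m0) * fdr_alpha / (1.0 - fdr_alpha)
    z_gamma = norm.ppf(1.0 - gamma_star)

    def h(n):
        # expected true discoveries at level gamma*, minus the target s; eq. (25)
        power = 1.0 - norm.cdf(z_gamma - effect_sizes * np.sqrt(n * a1 * a2))
        return power.sum() - s

    lo = hi = 2.0
    while h(hi) < 0:        # expand until the target power is bracketed
        hi *= 2.0
    while hi - lo > 0.5:    # bisection on the total sample size n
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid) < 0 else (lo, mid)
    return int(np.ceil(hi))

print(jung_sample_size(np.ones(200), m=10000, s=160))  # about 64 in total
```

For equal effect sizes, the result agrees with the closed form n = σ^2(z_{1−γ*} + z_{1−β*})^2/(a_1 a_2 δ^2) + 1 given above.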
5. Methods accounting for dependence in gene expression
A method for sample size calculation based on fixing the expected number of false positives E(V), using a small individual CWER (1/1000 or 2/1000), was proposed by Tsai et al. (2005) and Wang and Chen (2004). First, consider the case of independent gene expression and denote the outcome of the test for gene j by S_j; then the number of true discoveries S = Σ_{j=1}^{m_1} S_j ∼ Bin(m_1, 1 − β), where 1 − β is the probability that a truly DE gene is declared significant. (Note that this is assumed to be constant from gene to gene.) Tsai et al. defined the family-wise power of identifying s out of m_1 truly DE genes, for a given CWER level α, as the probability of at least s true discoveries,

φ ≡ Pr(S ≥ s) = Σ_{l=s}^{m_1} \binom{m_1}{l} (1 − β)^l β^{m_1−l}.    (26)
The sample size formula for testing no differential expression of a single gene is

n = 2σ^2(z_{1−α/2} + z_{1−β})^2 / δ^2.    (27)
For multiple genes with equal effect size, one can see from (26) that even for a very low probability (i.e. family-wise power) of detecting all m_1 differentially expressed genes, one would need the comparison-wise power 1 − β to be close to one. Thus, Wang and Chen (2004) and Tsai et al. (2005) suggested the following modification. Consider the sensitivity measure λ = s/m_1. To detect at least the fraction λ of truly DE genes at family-wise power level φ_λ, one can solve for the required comparison-wise power 1 − β from (26), with s = [m_1 λ], the largest integer not exceeding m_1 λ. Then, given the solution for 1 − β and given α and a fixed effect size δ, the sample size can be obtained from (27). Other criteria besides the sensitivity, including the proportion of true discoveries (s/r) or the accuracy {(m_0 − v + s)/m}, can be used for λ as well. An equal effect size for each gene is needed for the proposed method (i.e. for expression (26)), although this is clearly not a realistic assumption. Dependence between genes can be modelled naively by making the simplifying assumption that DE genes are equally correlated, i.e. Corr(S_i, S_j) = θ. The quantity of interest S, a sum of Bernoulli variates, is then modelled by a beta-binomial distribution, and the family-wise power φ = Pr(S ≥ s) is obtained by summing the individual probabilities from s to m_1. This is simply (26) with the binomial distribution replaced by the beta-binomial distribution. The issue of sample size calculation under dependence adjustment is taken up in more detail by Shao and Tseng (2007). In addition to the need to adjust for dependence in gene expression, the resulting dependence among the test statistics also affects the measure of power. More precisely, under dependence the commonly used measure of power is an "average power", r_1, the expected proportion of true discoveries among all m_1 true alternatives,

r_1 = E[S/m_1] = 1 − β̄.    (28)
Under dependence, the variability of S is greater than when the test statistics are independent, and the achieved proportion of true discoveries is more variable. Thus, Shao and Tseng (2007), among others, considered determining the sample size to achieve a given probability that at least a fraction r_1 of the truly DE genes are identified. More precisely, this probability is

Ψ = Pr(S/m_1 ≥ r_1)    (29)
and the overall power is specified by the pair (Ψ, r_1). Shao and Tseng (2007) considered the following approach to incorporate dependence into sample size calculations using the overall power criterion for the two-sample comparison. Let R ≡ R(γ) = Σ_{j=1}^m R_j(γ) and denote the correlation between R_i ≡ R_i(γ) and R_j ≡ R_j(γ) by θ_R^{ij} = Corr(R_i, R_j). Also, let S_j = I(p_j ≤ γ), S = Σ_{j=1}^{m_1} S_j, and let ρ_S^{ij} be the correlation between the (normal) test statistics for genes i and j. Using the result in Ahn and Chen (1995), the correlation θ_S^{ij} = Corr(S_i, S_j) can be expressed as a function of the CDF of a standard bivariate normal distribution, ρ_S^{ij}, β_i and β_j. Also, define the average correlation among {S_j}_{j=1}^{m_1} as θ̄_S = {m_1(m_1 − 1)}^{−1} Σ_{i≠j} θ_S^{ij}. The variance of S, σ_S^2, can be obtained as a function of the β_j and the θ_S^{ij}, and under the assumption of equal effect sizes (δ_j all equal, β_j = β = β̄), σ_S^2 = m_1 β̄(1 − β̄)[1 + θ̄_S(m_1 − 1)]. The calculation of sample size (and power) is based on a normal approximation for the distribution of S, depending on the parameters m_1, β̄ and σ_S^2. That is,

S ∼ N(m_1(1 − β̄), σ_S^2)    (30)
approximately, for "small dependent blocks in the arrays" (see Shao and Tseng (2007), based on Billingsley (1968)). Also, under the assumption that both m_0 and m_1 are large, we can obtain the per-gene comparison level γ as γ = {α/(1 − α)}(m_1/m_0)(1 − β̄) for sample size calculations (where α is the specified FDR control level). For example, with equal effect sizes (δ_j equal, β_j = β), for average power r_1 and an overall power of 0.5, Ψ = Pr(S ≥ m_1 r_1) = 0.5 does not depend on σ_S^2. The sample size (for one-sided alternatives) is n = σ^2(z_{1−γ} + z_{1−β})^2/(a_1 a_2 δ^2). When the overall power Ψ > 0.5, which is the case of interest in practice, β̄ can be obtained from the normal approximation (30), and σ_S^2 needs to be estimated, based on pilot data for instance. See Shao and Tseng (2007) for details on this estimation as well as for the case of unequal effect sizes and extensions to two-sided alternative hypotheses. We briefly note here that the approximation (30) may require assessment, as may the assumption that m_1 is large. For array experiments, m_1 may not be large, often only 1-5% of the genes. Clear exceptions are experiments that include conditions in which cells are subject to broad changes, such as exposure to a carcinogen or irradiation.
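The family-wise power (26), its beta-binomial version, and the overall power Ψ under the normal approximation (30) are all straightforward to compute; a sketch follows. The beta-binomial (a, b) values come from matching the mean 1 − β and the intraclass correlation θ = 1/(a + b + 1), which is our parametrization of the equal-correlation assumption rather than a formula from the text.

```python
import numpy as np
from scipy.stats import binom, betabinom, norm

def familywise_power(m1, power_per_gene, s, theta=0.0):
    """phi = Pr(S >= s), as in (26): binomial if theta = 0, else beta-binomial
    under the equal-correlation assumption Corr(S_i, S_j) = theta."""
    p = power_per_gene
    if theta <= 0.0:
        return binom.sf(s - 1, m1, p)          # S ~ Bin(m1, 1 - beta)
    a = p * (1.0 - theta) / theta              # moment matching (assumed)
    b = (1.0 - p) * (1.0 - theta) / theta
    return betabinom.sf(s - 1, m1, a, b)

def overall_power_normal(m1, beta_bar, theta_bar, r1):
    """Psi = Pr(S/m1 >= r1) under the normal approximation (30):
    S ~ N(m1(1 - beta_bar), m1 beta_bar (1 - beta_bar)[1 + theta_bar (m1 - 1)])."""
    mean = m1 * (1.0 - beta_bar)
    var = m1 * beta_bar * (1.0 - beta_bar) * (1.0 + theta_bar * (m1 - 1))
    return norm.sf((m1 * r1 - mean) / np.sqrt(var))

# Correlation inflates the spread of S: same mean, lower Pr(S >= s)
print(familywise_power(100, 0.8, 75, theta=0.0))
print(familywise_power(100, 0.8, 75, theta=0.1))
print(overall_power_normal(100, beta_bar=0.2, theta_bar=0.1, r1=0.75))
```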
6. Methods for controlling family-wise error
Microarray experiments are exploratory in nature, examining thousands of gene expression levels simultaneously. Thus, a less strict criterion, like the FDR, is more suitable for the exploratory search for DE genes. See Nguyen (2004a,b, 2005) for more discussion. However, some methods for controlling the FWER have been proposed, based on the step-down p-value adjustment work of Westfall and Young (1993). See also Dudoit et al. (2003). We briefly review here the study of Jung, Bang and Young (JBY, 2005) for sample size determination in the two-group comparison setting, with emphasis on the assumptions of the methodologies. We further note that such methods are rarely used, at least for the initial stage of microarray experiments. However, these methods may be suitable for follow-up microarray experiments, where a smaller subset of candidate genes is studied further and a stricter criterion, like the FWER, may be more appropriate. Again, let T_j be the t-statistic for gene j testing H_{0j}: δ_j = 0 versus H_{1j}: δ_j > 0 for j = 1, . . . , m. A one-sided test is considered here for convenience. The null hypothesis is rejected if T_j > c, for some critical value c, and the FWER is defined as

α = Pr(T_1 > c or T_2 > c, . . . , or T_m > c | H_0) = Pr(max_{1≤j≤m} T_j > c | H_0),    (31)
where H_0: δ_j = 0 for all j (i.e. H_0 = ∩_{j=1}^m H_{0j}). The Bonferroni adjustment takes c = c_α = t_{n−2, α/m}, the upper α/m-quantile of the t-distribution with n − 2 DF. JBY considered estimating the distribution of W = max_{1≤j≤m} T_j under H_0 using permutation.
The adjusted p-value is defined as Pr(max_{1≤j'≤m} T_{j'} ≥ t_j | H_0), where t_j is the observed t-statistic for T_j from the original data. The single-step procedure for estimating the adjusted p-value, based on B (random) permutations of the original data, is

p̃_j = B^{−1} Σ_{b=1}^{B} I{w_b ≥ t_j},    j = 1, . . ., m,    (32)

where w_b = max_{1≤j≤m} t_j^{(b)} and t_j^{(b)} is the t-statistic computed from the bth permutation of the original data (b = 1, . . ., B). All hypotheses with p̃_j < α are rejected. (For two-sided tests, t_j is replaced by |t_j|.) The step-down procedure for adjusting the observed p-values, as proposed by Westfall and Young (1993) and adopted by Dudoit et al. (2003), is provided in JBY (see Algorithm 2, p. 160). They proposed an algorithm to find the sample size by solving the power equation h(n) = 1 − β, where h(n) = Pr(max_{1≤j≤m}(e_j + δ_j √(n a_1 a_2)/σ_j) > c_α), (e_1, . . . , e_m) ∼ N(0, R), and R denotes an m × m correlation matrix. This is based on JBY's result that (T_1, . . . , T_m) has approximately the same distribution as (e_1, . . . , e_m) ∼ N(0, R) under the null hypothesis, and as (e_j + δ_j √(n a_1 a_2)/σ_j)_{j=1,...,m} under the alternative hypothesis, when the n_k's are large. Because the distributions of max_{1≤j≤m} e_j and max_{1≤j≤m}(e_j + δ_j √(n a_1 a_2)/σ_j) are unknown, solving h(n) = 1 − β is non-trivial, and JBY proposed using simulation. However, to be able to simulate the needed distribution, further simplifying assumptions are made, including equal correlation between genes (e.g. block compound symmetry). Such correlation structures are at best poor models for the observed data. In addition, the distributional result on which the power function h(n) is based derives from a large-sample assumption, which is difficult to justify for gene expression data, where the sample size is typically small and fixed. Although computationally more intensive, direct simulation of the underlying model with parameters estimated from pilot data (see Section 3) can be applied in this setting to relax many of the simplifying, but poorly justified, assumptions.
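The single-step maxT adjustment (32) is straightforward to implement by permutation; a sketch for one-sided tests follows, using pooled-variance t-statistics as in the text (the function name and interface are ours).

```python
import numpy as np

def single_step_maxt(x1, x2, B=1000, seed=0):
    """Single-step maxT adjusted p-values via permutation, as in (32): a sketch.

    x1: (n1, m) and x2: (n2, m) expression matrices for the two groups.
    Returns p~_j = B^{-1} sum_b I{w_b >= t_j}, where w_b is the maximum
    t-statistic over genes for the bth permuted data set.
    """
    rng = np.random.default_rng(seed)
    x = np.vstack([x1, x2])
    n1 = x1.shape[0]

    def tstats(a, b):
        # pooled-variance two-sample t-statistics, vectorized over genes
        na, nb = a.shape[0], b.shape[0]
        sp2 = ((na - 1) * a.var(axis=0, ddof=1)
               + (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
        return (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(sp2 * (1/na + 1/nb))

    t_obs = tstats(x1, x2)
    w = np.empty(B)
    for b in range(B):
        perm = rng.permutation(x.shape[0])
        w[b] = tstats(x[perm[:n1]], x[perm[n1:]]).max()   # w_b = max_j t_j^(b)
    return (w[:, None] >= t_obs[None, :]).mean(axis=0)    # p~_j for each gene
```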
7. Other approaches and relevant literature
One of the early works on power and sample size considerations for DNA microarray studies is that of Lee and Whitmore (2002), in which various error controls were discussed in the context of multiple testing, including FDR, along with the use of E(S)/m_1, the expected proportion of truly expressed genes detected, as a measure of power. Pawitan et al. (2005) considered a mixture model for the distribution of the t-statistics and emphasized the need to also consider the sensitivity/false negative rate in conjunction with the false discovery rate. Pan et al. (2002) and Lin and Le (2002) also used a mixture model approach. Pounds and Cheng (2005) considered the problem of k-group comparison, also using the FDR criterion; see also Hu et al. (2005). Yang et al. (2003) used a conservative bound on the FDR for calculating sample size and considered this issue in the context of serial time point experiments (within groups), based on a mixed model where time and treatment are fixed effects and subject is a single random effect. Tibshirani (2006) proposed a simple approach where the gene score from SAM (significance analysis of microarrays, Tusher et al., 2001) is used and the null distribution of the scores is estimated based on permutation. Muller et al. (2004) considered sample
size calculation based on a hierarchical Bayes model under a decision-theoretic framework. Black and Doerge (2002) considered the number of replicate spots within the array required for detecting significant fold change; see also Lee et al. (2000). Wei et al. (2004) examined factors affecting size and power and compared the relative sample size requirements of human and inbred animal studies to detect fold changes. Although we have focused on sample size and power considerations for the identification of DE genes, some work has been done on sample size determination for classification studies as well. One area of application is classification among tumor tissue/cell types or between normal and cancer cells (e.g. see Alizadeh et al., 2000; Golub et al., 1999; Alon et al., 1999; Nguyen and Rocke, 2002a,b,c, 2004). Sample size requirements for designing classification studies can be found in Hwang et al. (2002), Mukherjee et al. (2003), and Dobbin and Simon (2007). Hua et al. (2005) examined the optimal number of features, as a function of sample size, for various classifiers.
8. Discussion
In recent years, there has been increased interest in design issues, including sample size and power, at the initial planning stage of genomics studies. Because of the large number of probes monitored in microarrays or other high-throughput assays, the FDR has been the preferred measure of error, and much research has been devoted to methods for determining sample size/power to control FDR and FDR-related criteria. As we have emphasized here, various assumptions are made for mathematical convenience, and their consequences need better assessment. A key assumption is the independence of gene expression. Under dependence, the effect on FDR control of assuming independence is not negligible and should be carefully assessed and quantified. This also applies to models that over-simplify the dependence structure by assuming, for instance, a single common correlation among all genes. In these cases, it is informative to compare their performance/sensitivity to the case of a more general dependence structure, which can be based on the observed dependence structures across many real data sets. Through such studies, assumptions made for mathematical convenience that are not critical for modelling real expression data can also be identified. Finally, we note that the sample size and power results depend critically on the proportion of null genes π_0. Depending on the type of genomics experiment or the area of application (e.g. high-dimensional data in MS or MRI studies), a suitable range of π_0 should be chosen. Pilot data or previous similar studies are informative for determining the range of π_0 of interest in practice. Sensitivity to the assumptions, such as independence or a specific type of dependence structure among genes, needs to be assessed over the appropriate range of π_0.
Acknowledgment

Support for this work includes the National Institutes of Health (NIH) grants UL1RR024922, RL1AG032119 and RL1AG032115, National Institute of Child Health and Human Development grant HD036071, NIEHS grant P01-ES011269-06, and grant UL1 RR024146 from the National Center for Research Resources, a component of NIH.
References

Ahn, H., and Chen, J.J. (1995). Generation of over-dispersed and under-dispersed binomial variates. J. Comput. Graphical Statist., 4, 55-64.
Alizadeh, A.A., Eisen, M.B., Davis, R.E., et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503-511.
Alon, U., Barkai, N., Notterman, D.A., Gish, K., et al. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA, 96, 6745-6750.
Bartosiewicz, M., Trounstine, M., Barker, D., et al. (2000). Development of a toxicological gene array and quantitative assessment of this technology. Archives of Biochemistry and Biophysics, 376, 66-73.
Basser, P.J. (1995). Inferring microstructural features and the physiological state of tissues from diffusion-weighted images. NMR Biomed, 8, 333-344.
Basser, P.J., and Pierpaoli, C. (1996). Microstructural and physiological features of tissues elucidated by quantitative-diffusion-tensor MRI. J Magn Reson B, 111, 209-219.
Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57, 289-300.
Benjamini, Y., and Hochberg, Y. (2000). On the adaptive control of the false discovery rate in multiple testing with independent statistics. J. Educ. Behav. Statist., 25, 60-83.
Benjamini, Y., and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29, 1165-1188.
Billingsley, P. (1968). Convergence of Probability Measures. Wiley, New York.
Black, M.A., and Doerge, R.W. (2002). Calculation of the minimum number of replicate spots required for detection of significant gene expression fold change in microarray experiments. Bioinformatics, 18, 1609-1616.
Churchill, G.A. (2002). Fundamentals of experimental design for cDNA microarrays. Nature Genetics, 32, 490-495.
Chuaqui, R.F., Bonner, R.F., Best, C.J., et al. (2002). Post-analysis follow-up and validation of microarray experiments. Nature Genetics, 32 Supplement, 509-514.
Davidson, L.A., Nguyen, D.V., Hokanson, R.M., Callaway, E.S., et al. (2004). Chemopreventive n-3 polyunsaturated fatty acids reprogram genetic signatures during colon cancer initiation and progression in the rat. Cancer Research, 64, 6797-6804.
Dobbin, K., and Simon, R. (2005). Sample size determination in microarray experiments for class comparison and prognostic classification. Biostatistics, 6, 27-38.
Dobbin, K.K., and Simon, R.M. (2007). Sample size planning for developing classifiers using high-dimensional DNA microarray data. Biostatistics, 8, 101-117.
Dudoit, S., Shaffer, J.P., and Boldrick, J.C. (2003). Multiple hypothesis testing in microarray experiments. Statistical Science, 18, 71-103.
Efron, B. (2007). Size, power and false discovery rates. The Annals of Statistics, 35, 1351-1377.
Efron, B., Tibshirani, R., Storey, J.D., and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association, 96, 1151-1160.
Finner, H., and Roters, M. (2001). On the false discovery rate and expected type I errors. Biometrical Journal, 43, 985-1005.
Friston, K.J., Ashburner, J.T., Kiebel, S.J., Nichols, T.E., and Penny, W.D. (2007). Statistical Parametric Mapping: The Analysis of Functional Brain Images. Academic Press, San Diego.
Fodor, I.K., Nelson, D.O., Alegria-Hartman, M., et al. (2005). Statistical challenges in the analysis of two-dimensional difference gel electrophoresis experiments using DeCyder. Bioinformatics, 21, 3733-3740.
Gharbi, S., Gaffney, P., Yang, A., et al. (2002). Evaluation of two-dimensional differential gel electrophoresis for proteomic expression analysis of a model breast cancer cell system. Molecular and Cellular Proteomics, 1, 91-98.
Genovese, C.R., Lazar, N.A., and Nichols, T. (2002). Thresholding of statistical maps in functional neuroimaging using the false discovery rate. NeuroImage, 15, 870-878.
Glonek, G.F., and Solomon, P.J. (2004). Factorial and time course designs for cDNA microarray experiments. Biostatistics, 5, 89-111.
Golub, T.R., Slonim, D.K., Tamayo, P., et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531-537.
Haab, B.B., Dunham, M.J., and Brown, P.O. (2001). Protein microarrays for highly parallel detection and quantitation of specific proteins and antibodies in complex solutions. Genome Biology, 2(2): research0004.1-0004.13.
Hansel, D.E., Rahman, A., Hidalgo, M., et al. (2003). Identification of novel cellular targets in biliary tract cancers using global gene expression technology. American Journal of Pathology, 163, 217-229.
Hedenfalk, I., Duggan, D., Chen, Y., et al. (2001). Gene-expression profiles in hereditary breast cancer. The New England Journal of Medicine, 344, 539-548.
Holm, S. (1979). A simple sequentially rejective multiple testing procedure. Scandinavian Journal of Statistics, 6, 65-70.
82
Danh V. Nguyen, Damla S¸ent¨urk, Danielle J. Harvey and Chin-Shang Li
Hochberg, Y. (1998). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800-802. Hu, J., Zou, F., and Wright, F.A. (2005). Practical FDR-based sample size calculations in microarray experiments. Bioinformatics, 21, 3264-3272. Hua, J., Xiong, Z., Lowey, J., Suh, E., Dougherty, E.R. (2005). Optimal number of features as a function of sample size for various classification rules. Bioinformatics, 21, 15091515. Huber, W., von Heydebreck, A., Sultmann, H., Poustka, A., and Vingron, M. (2002). Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics, 18, S96-S104. Hwang, D., Schmitt, W.A., Stephanopoulos, G., Stephanopoulos, G. (2002). Determination of minimum sample size and discriminatory expression patterns in microarray data. Bioinformatics, 18, 1184-1193. Ji, X., Cheung, R., Cooper, S., Li, Q., Greenberg, H.B., He, X.S., (2003). Interferon alfa regulated gene expression in patients initiating interferon treatment for chronic hepatitis C. Hepatology, 37, 610-621. Jung, S.H. (2005). Sample size for FDR-control in microarray data analysis. Bioinformatics, 21, 3097-3104. Jung, S.H., Bang, H., and Young, S. (2005). Sample size calculation for multiple testing in microarray data analysis. Biostatistics, 6, 157-169. Kerr, M.K., and Churchill, G.A. (2001). Experimental design issues for gene expression microarrays. Biostatistics, 2, 183-201. Kerr, M.K. and Churchill, G.A., (2001b). Statistical design and analysis of gene expression microarrays. Genetical Research, 77, 123-128. Kerr, M.K., Martin, M., and Churchill, G.A. (2001). Analysis of variance for gene expression microarray data. Journal of Computational Biology , 7, 819-837. Konishi, T. (2004). Three-parameter lognormal distribution ubiguitously found in cDNA microarray data and its application to parametric data treatment. BMC Bioinformatics, 5, 5. Kosorok, M.R. and Ma, S. (2007). Marginal asymptotics for the “large p, small n” paradigm: With applications to microarray data Ann. Statist., 35, 1456-1486. Lee, K.M., Kim, J.H., Kang, D. (2005). Design issues in toxicogenomics using DNA microarray experiments. Toxicology and Applied Pharmacology , 207, S200-2208. Lee, M-LT., and Whitmore, G.A. (2002). Power and sample size for DNA microarray studies. Statistics in Medicine, 21, 3543-3570.
Sample Size Calculation and Power in Genomics Studies
83
Lee, M-LT., Lu, W., Whitmore, G.A., and Beier, D. (2002) Models for microarray gene expression data. Journal of Biopharmaceutical Statistics , 21, 1-19. Lee, M-LT., Kuo, F.C., Whitmore, G.A., and Sklar, J.L. (2000). Importance of replication in microarray gene expression studies: Statistical methods and evidence from repetitive cDNA hybridizations. Proceedings of the National Academy of Sciences , 97, 9834-9939. Lemon, W.J., Palatini, J.J., Krahe, R., and Wright, F.A. (2002). Theoretical and experimental comparisons of gene expression indexes for oligonucleotide arrays. Bioinformatics, 18, 1470-1476. Li, S.S., Bigler, J., Lampe, J.W., Potter, J.D., Feng, Z. (2005). FDR-controlling testing procedures and sample size determination for microarrays. Stat Med., 24, 2267-2280. Limpert, E., Stahel, W.A., and Abbt, M. (2001). Log-normal distributions across the sciences: Keys and clues. BioScience, 5, 341-352. Lin, W.J, and Le, C.T. (2002). How many replicates of arrays are required to detect gene expression changes in microarray experiments? A mixture model approach, Genome Biology, 3(5):research 0022.1 - 0022.10. Lipshutz, R.J., Fodor, S.P.A., Gingeras, T.R. and Lockhart, D.J. (1999). High density synthetic oligonucleotide arrays. Nature Genetics, 21, 20-24. Liu, P., and Hwang, J.T. (2007). Quick calculation for sample size while controlling false discovery rate with application to microarray analysis. Bioinformatics, 23, 739-746. Lockhart, D.J., Dong, H., Byrne, M.C. et al. (1996). Expression of monitoring by hybridization to high-density oligonucleotide arrays. Nature Biotechnology, 14, 16751680. Mukherjee, S., Tamayo, P., Rogers, S. et al. (2003). Estimating dataset size requirements for classifying DNA microarray data. J Comput Biol., 10, 119-142. Muller, P., Parmigiani, G., Robert, C., Rousseau, J. (2004). Optimal Sample Size for Multiple Testing: The Case of Gene Expression Microarrays. J. Am. Statis. Assoc., 99, 990-1001. Naidoo, S., Denby, K., Berger, D.K. (2005). Microarray experiments: considerations for experimental design. South African Journal of Science , 101, 347-353. Nguyen, D.V. (2004a). On estimating the proportion of true null hypotheses for false discovery rate controlling procedures in exploratory DNA microarray studies. Computational Statistics and Data Analysis , 47, 611-637. Nguyen, D.V. (2004b). A comparison of direct and sequential false discovery rate algorithms: computational experiments for exploratory DNA microarray studes. Computing Science Statistics , 36, 1-15.
84
Danh V. Nguyen, Damla S¸ent¨urk, Danielle J. Harvey and Chin-Shang Li
Nguyen, D.V. (2005). A unified computational framework to compare direct and sequential false discovery rate algorithms for exploratory DNA microarray studies. Journal of Data Science, 3, 331-352. Nguyen, D.V., Liu, H., and Senturk, D. (2007). A general FDR-based computational framework for sample size planning in microarray studies, In Pham, T., Yan, H., and Crane, D. (eds), Advanced Computational Methods for Biocomputing and Bioimaging. Nova Science Publishers, New York, p.55-69. Nguyen, D.V., Arpat, A.B., Wang, N., and Carroll, R.J. (2002). DNA Microarray experiments: biological and technological aspects. Biometrics, 58, 701-717. Nguyen, D.V. and Rocke, D.M. (2002a). Classification of acute leukemia based on DNA microarray gene expressions using partial least squares. In Lin, S.M and Johnson, K.F. (eds), Methods of Microarray Data Analysis. Kluwer Academic Publishers, Dordrecht, 109-124 Nguyen, D.V. and Rocke, D.M. (2002b). Tumor classification by partial least squares using microarray gene expression data. Bioinformatics, 18, 39-50. Nguyen, D.V. and Rocke, D.M. (2002c). Multi-class cancer classification via partial least squares using gene expression profiles. Bioinformatics, 18, 1216-1226. Nguyen, D.V. and Rocke, D.M. (2004). On partial least squares dimension reduction for microarray-based classification: a simulation study. Computational Statistics and Data Analysis, 46, 407-425. Page, G.P., Edwards, J.W., Gadbury, G.L. et al. (2006). The PowerAtlas: a power and sample size atlas for microarray experimental design and research BMC Bioinformatics, 7, 84. Pan, W., Lin, J., and Le, C.T. (2002). How many replicates of arrays are required to detect gene expression changes in microarray experiments? A mixture model approach. Genome Biology, 3, 1-10. Pawitan, Y., Michiels, S., Koscielny, S., Gusnato, A., and Ploner, A. (2005). False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics, 21, 3017-3024. Pinehiro, J.C. and Bates, D.M. (2000). Mixed-Effects Models in S and S-PLUS . Springer Verlag, New York. Pounds, S. and Cheng, C. (2005). Sample size determination for the false discovery rate. Bioinformatics, 21, 4263-4271. Purohit, P.V. and Rocke, DM. (2003). Discriminant models for high-throughput proteomics mass spectrometer data. Proteomics, 3, 1699-1703. Purohit, P.V., Rocke, D.M., Viant, M.R., and Woodruff, D.L. (2004). Discrimination models using variance-stabilizing transformation of metabolomic NMR data. Omics, 8, 118-130.
Sample Size Calculation and Power in Genomics Studies
85
Reiner, A., Yekutieli, D., and Benjamini, Y. (2003). Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics, 19, 368-375. Ramakrishnan, R., Dorris, D., Lublinsky, A. et al. (2002). An assessment of Motorola CodeLinkTM microarray performance for gene expression proling applications. Nucleic Acids Research, 30, e30. Rocke, D.M. (2004). Design and analysis of experiments with high-throughput biological assay data. Cell and Developmental Biology, 15, 703-713. Rocke, D.M. and Durbin, B. (2001). A model for measurement error for gene expression arrays. Journal of Computational Biology , 8, 557-569. Schena, M., Shalon, D., Davis, R. W., and Brown, P. O. (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270, 467-470. Schwartzman, A, Dougherty, RF, and Taylor, JE (2005). Cross-subject comparison of principal diffusion direction maps. Magnetic Resonance in Medicine, 53, 1423-1431. Schweder, T. and Spjøvtoll, E. (1982). Plot of p-values to evaluate many tests simultaneously. Biometrika, 69, 493-502. Seber, G.A.F. (1977). Linear Regression Analysis . John Wiley & Sons, New York. Shao, Y. and Tseng, C.H. (2007). Sample size calculation with dependence adjustment for FDR-control in microarray studies. Stat Med., 26, 4219-4237. Sharma, K., Lee, S., Han, S. et al. (2005). Two-dimensional fluorescence difference gel electrophoresis analysis of the urine proteome in human diabetic nephropathy. Proteomics, 5, 2648-2655. Simon, R., Radmacher, M.D., Dobbin, K. (2002). Design of studies using DNA microarrays. Genetic Epidemiology, 23, 21-36. Smyth, G.K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Application in Genetics and Molecular Biology, 3, Article 1. Sørlie, T., Perou, C.M., Tibshirani, R. et al. (2001). Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci USA, 98, 10869-10874. Storey, J.D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society Series B , 64, 479-498. Storey, J.D. (2003). The positive false discovery rate: A Bayesian interpretation and the q-value. Annals of Statistics , 31, 2013-2031.
86
Danh V. Nguyen, Damla S¸ent¨urk, Danielle J. Harvey and Chin-Shang Li
Storey, J.D., Taylor, J.E., and Siegmund, D. (2004). Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rates: A unified approach. Journal of the Royal Statistical Society Series B , 66, 187-205. Storey, J.D., and Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences , 100, 9440-9445. Stuart, O., Bush, T., and Nigam, K. (2001). Changes in global gene expression patterns during development of and maturation of rat kidney. Proceedings of the National Academy of Sciences, 98, 5649-5654. Tibshirani, R. (2006). A simple method for assessing sample sizes in microarray experiments. BMC Bioinformatics, 7, 106 Tibshirani, R. and Efron, B. (2002). Pre-validation and inference in microarrays. Statistical Applications in Genetics and Molecular Biology , 1, Article 1. Toga, A.W. and Mazziotta, J.C. (2002). Brain Mapping: The Methods. Second Edition, Academic Press, San Diego. Tsai, C.A., Wang, S.J., Chen, D.T., Chen, J.J. (2005). Sample size for gene expression microarray experiments. Bioinformatics, 21, 1502-1508. Tusher, V.G., Tibshirani, R., and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences, 98, 5116-5121. van’t Veer, L.J., Dai, H., van de Vijver, M.J. et al. (2002). Gene expression proling predicts clinical outcome of breast cancer. Nature, 415, 530-536. Wang S.J., and Chen, J.J. (2004). Sample size for identifying differentially expressed genes in microarray experiments. Journal of Computational Biology , 11, 714-726. Wei, C., Li, J., and Bumgartner, R. (2004). Sample size for detecting differentially expressed genes in microarray experiments. BMC Bioinformatics, 5, 1-10. Westfall, P.H. and Young, S.S. (1993). Resampling-Based Multiple Testing: Examples and Methods for P-value Adjustment. John Wiley and Sons, New York. Wheelock, A.M., Morin, D., Bartosiewicz, M., Buckpitt, A.R. (2006). Use of a fluorescent internal protein standard to achieve quantitative two-dimensional gel electrophoresis. Proteomics, 6, 1385-1398. Yang, M.C.K., Yang, J.J., McIndoe, R.A., and She, J.X. (2003). Microarray experimental design: power and sample size considerations. Physiol Genomics, 16, 24-28. Yang, Y., Hoh, J., Broger, C., Neeb, M., Edington, J., Lindpaintner, K., Ott, J. (2003). Statistical methods for analyzing microarray feature data with replications. Journal of Computational Biology , 10, 157-169.
Sample Size Calculation and Power in Genomics Studies
87
Yang, Y.H., Speed, T. (2003). Design and analysis of comparative microarray experiments. In Statistical Analysis of Gene Expression Microarray Data . Chapman and Hall/CRC press, p.51. Ye, J., Liu, H., Kirmiz, C., Lebrilla, C.B. and Rocke, D.M. (2007). On the analysis of glycomics mass spectrometry data via the regularized area under the ROC curve. BMC Bioinformatics, 8, 477. Zien, A., Fluck, J., Zimmer, R., and Lengauer, T. (2003). Microarrays: how many do you need? Journal of Computational Biology , 10, 653-667.
In: Computational Biology: New Research Editor: Alona S. Russe
ISBN: 978-1-60692-040-4 © 2009 Nova Science Publishers, Inc.
Chapter 2
COUPLING COMPUTATIONAL AND EXPERIMENTAL ANALYSIS FOR THE PREDICTION OF TRANSCRIPTION FACTOR E2F REGULATORY ELEMENTS IN THE HUMAN GENE PROMOTER Kenichi Yoshida∗ Department of Life Sciences, Meiji University, 1-1-1 Higashimita, Tama-ku, Kawasaki, Kanagawa 214-8571, Japan
Abstract
Completion of human genome sequencing has provided us with opportunities to understand the molecular complexity of the human body. The transcriptional regulatory circuits governing gene expression are among the most promising problems to be resolved by exploring the human genome sequence. We have been interested in human cell fate as regulated by the transcription factor E2F. To accelerate this investigation, we need to develop a strategy that can efficiently identify E2F target genes. Basically, our approach is to combine computational and experimental analysis. Annotated gene expression profiles deposited in public databases and knowledge accumulated in the published literature are a treasure-house of candidate E2F target genes. Next, the promoter region, defined from the information on the transcriptional start site, can be used for motif searching of E2F regulatory elements. Finally, the set of predicted E2F regulatory elements is tested by molecular biological and biochemical assays. In this chapter, I give a basic introduction to our recent strategy of computational and experimental analysis for the prediction of transcription factor E2F regulatory elements in the human gene promoter. In addition, recent progress in unraveling E2F functions achieved by genome-wide approaches is discussed.
∗ Correspondence: Kenichi Yoshida, Department of Life Sciences, Meiji University, 1-1-1 Higashimita, Tama-ku, Kawasaki, Kanagawa 214-8571, Japan. Tel. & Fax.: +81-44-934-7107. E-mail: [email protected]
Introduction
The human genome sequence can provide us with huge volumes of information on gene structure, such as coding and promoter regions. At the same time, however, recent high-throughput experimental technologies have produced such a plethora of complex information that understanding it has become a major problem. To overcome this, computational biology, which can create new value from DNA sequences or expression data derived from genome-wide experiments, has evolved. Deeper genome annotation within the ENCODE regions, which span 1% of the human genome sequence, makes it possible to assess the accuracy of computationally created information [Guigó et al., 2006; Gerstein et al., 2007]. Currently, our approach is simply to identify transcription factor E2F target genes by bioinformatics and then set out to elucidate the E2F target genes in a functional manner in the cellular context. E2F plays pivotal and unique roles in cell cycle regulation as well as in carcinogenesis, differentiation, and development [Blais and Dynlacht, 2004; Bracken et al., 2004; Yamasaki and Pagano, 2004; Korenjak and Brehm, 2005]. Therefore, a state-of-the-art strategy combining computational and experimental approaches to identify E2F target genes could be applied to other cell cycle regulators, with the aim of eventually unraveling the complexity of gene regulation.
1. DNA Microarray and Bioinformatics
Data on global gene expression changes monitored by DNA microarray have been produced in large amounts, and technologically the DNA microarray itself has become a classic tool. Generally, the transcriptional control network model has been better established for yeast cell cycle regulation than for mammalian cell cycle regulation [Futcher, 2002]. So far, several studies have reported on E2F target genes and their function by analyzing microarray data in mammalian cells [Ishida et al., 2001; Ma et al., 2002; Stanelle et al., 2002; Huang et al., 2003; Vernell et al., 2003; Young et al., 2003; Black et al., 2005; Jamshidi-Parsian et al., 2005]. Basically, E2F over-expression by conditional regulation, or knock-down facilitated through the use of short interfering RNA (siRNA), was employed to prepare mRNAs to be tested with the DNA microarray technique. As E2F is negatively controlled by association with the retinoblastoma (pRb) pocket binding protein [Korenjak and Brehm, 2005], some reports preferred siRNA-mediated gene silencing of pRb to activate E2F function. Collecting unique patterns of expression changes can be applied for prognostic purposes in certain illnesses, and can be effective in the selection of suitable treatment strategies in the clinic. Toward this purpose, one might ask whether the E2F target genes identified by DNA microarray are real targets or not. Complicating matters, accumulating bodies of evidence suggest a pivotal role for microRNAs (miRNAs) in human carcinogenesis as unique oncogenes or tumor suppressors [Kent and Mendell, 2006; Zhang et al., 2007]. Microarray analysis revealed down-regulation of the E2F pathway by miR-34a [Tazawa et al., 2007], indicating that miRNA function must be taken into account for E2F target genes. Indeed, miR-17-5p and miR-20a, both of which are transcriptional targets of the c-Myc oncogene, negatively regulated E2F [O'Donnell et al., 2005]. Endogenous E2F1, E2F2, and E2F3 could directly bind to the promoter of the miR-17-92 cluster, and miR-20a, a member of the miR-17-92 cluster, modulated the translation of the E2F2 and E2F3 mRNAs
[Sylvestre et al., 2007]. The E2F3 protein was also confirmed to be down-regulated by miR-210 [Giannakakis et al., 2008]. Therefore, we need to know the E2F target miRNAs and, in turn, which miRNAs can regulate E2F activity. For this purpose, the adenoviruses Ad-Control, an empty vector, and Ad-E2F1, containing the E2F1 cDNA, were used. For the infection, A549 human lung carcinoma cells were infected by adding the adenoviral vectors at a MOI (multiplicity of infection) of 100 plaque-forming units per cell. Cells were collected after 24 hours of virus infection, and total RNA containing miRNA was recovered with a miRNeasy Mini Kit (Qiagen, Valencia, CA). The miRNA processing and hybridization to a human miRNA microarray, which contains 470 mature miRNAs, as well as data acquisition and analyses, were performed according to the manufacturer's miRNA microarray system protocol Version 1.0, April 2007 (Agilent Technologies, Santa Clara, CA) [Wang et al., 2007]. Surprisingly, no miRNAs were up-regulated in Ad-E2F1-infected A549 cells, whereas only 10 miRNAs, including miR-202 (0.57-fold), miR-330 (0.59-fold), miR-501 (0.62-fold), miR-509 (0.63-fold), miR-601 (0.66-fold), miR-575 (0.67-fold), miR-636 (0.69-fold), miR-149 (0.72-fold), miR-610 (0.73-fold), and miR-583 (0.75-fold), were down-regulated in Ad-E2F1-infected cells compared to Ad-Control-infected cells. We do not have any clues about the functional relationship between E2F and these miRNAs at present.
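The selection step in such a screen reduces to simple fold-change filtering of the normalized array signals. The following sketch is illustrative only: the expression values and the 0.8-fold down-regulation cut-off are hypothetical stand-ins, not the exact thresholds used with the Agilent platform described above.

```python
# Illustrative fold-change screen for miRNA array data.
# Signals and the 0.8-fold down-regulation cut-off are hypothetical.

def fold_changes(treated, control):
    """Return miRNA -> treated/control signal ratio."""
    return {m: treated[m] / control[m] for m in treated if m in control}

def split_regulated(fc, down_cutoff=0.8, up_cutoff=1.25):
    down = {m: r for m, r in fc.items() if r <= down_cutoff}
    up = {m: r for m, r in fc.items() if r >= up_cutoff}
    return down, up

# Hypothetical normalized signals for three miRNAs.
control = {"miR-202": 100.0, "miR-330": 80.0, "miR-17-5p": 50.0}
treated = {"miR-202": 57.0, "miR-330": 47.2, "miR-17-5p": 51.0}

down, up = split_regulated(fold_changes(treated, control))
for mirna, ratio in sorted(down.items(), key=lambda kv: kv[1]):
    print(f"{mirna}: {ratio:.2f}-fold (down-regulated in Ad-E2F1 cells)")
```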
2. Computational Prediction of E2F Binding Site Locations
As described above, one can easily access many candidate genes potentially regulated by E2F. Tracing gene lists published in the literature is a reliable method. Another way is to search public databases. The Gene Expression Omnibus (GEO) at the National Center for Biotechnology Information (NCBI) is the most popular public database storing DNA microarray data (http://www.ncbi.nlm.nih.gov/geo/) [Barrett and Edgar, 2006; Edgar and Barrett, 2006; Barrett et al., 2007]. In addition, we have been collecting E2F target genes by over- or down-regulating E2F expression in human cells [Goto et al., 2006]. Although computational methods that can predict the transcriptional start site are available [Zhang, 2007], we searched for the transcriptional start sites of candidate genes at the NCBI Map Viewer (http://www.ncbi.nlm.nih.gov/projects/mapview/) [Wheeler et al., 2007]. The core promoter is often defined as the minimal required sequence, approximately 80~100 base pairs (bp) surrounding the transcriptional start site, which can drive a reporter gene at a basal level. On the other hand, the proximal promoter can be defined as 250~1,000 bp upstream of the core promoter. We usually focus on the 700 bp upstream and 300 bp downstream of the transcriptional start site, -700/+300, where the transcriptional start site, designated +1, is forwarded to a transcription factor binding site search. In silico identification of E2F binding site locations is accomplished by searching assembled collections of experimentally defined transcription factor binding sequences. To the best of our knowledge, the largest and most commonly used matrix library collection is the TRANSFAC database for eukaryotic transcription factors [Stormo, 2000; Matys et al., 2003; Matys et al., 2006]. The binding specificity of transcription factors is usually obtained from SELEX (Systematic Evolution of Ligands by Exponential Enrichment) and chromatin immunoprecipitation (ChIP)-on-chip experiments. Several sets of optimized matrix cut-off values are built into the TRANSFAC database to provide a variety of search modes of different stringencies. The user simply inputs DNA sequences of interest into the system, together with matrices and default or user-defined cut-off values. We normally use a cut-off score of 85 for E2F binding site prediction. For more accuracy, it is recommended to perform a double check with another system, such as MatInspector [Cartharius et al., 2005].
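Conceptually, matrix-based prediction of this kind slides a position weight matrix (PWM) along the promoter and keeps windows whose similarity score clears the cut-off. The sketch below is a minimal illustration with a hypothetical four-position matrix and a TRANSFAC-like 0-100 similarity scale; it is not the TRANSFAC algorithm itself, and real E2F matrices are longer and come from the licensed database.

```python
# Toy position weight matrix (PWM) scan with a 0-100 similarity score.
# The 4-bp matrix is hypothetical; real E2F matrices come from TRANSFAC.

PWM = [  # one dict of base frequencies per matrix position
    {"A": 0.03, "C": 0.05, "G": 0.90, "T": 0.02},
    {"A": 0.05, "C": 0.80, "G": 0.10, "T": 0.05},
    {"A": 0.03, "C": 0.10, "G": 0.85, "T": 0.02},
    {"A": 0.05, "C": 0.70, "G": 0.20, "T": 0.05},
]

BEST = sum(max(col.values()) for col in PWM)
WORST = sum(min(col.values()) for col in PWM)

def similarity(window):
    """Scale the summed base frequencies to 0-100, TRANSFAC-style."""
    raw = sum(col[base] for col, base in zip(PWM, window))
    return 100.0 * (raw - WORST) / (BEST - WORST)

def scan(sequence, cutoff=85.0):
    """Yield (0-based offset, site, score) for windows above the cut-off."""
    w = len(PWM)
    for i in range(len(sequence) - w + 1):
        s = similarity(sequence[i:i + w])
        if s >= cutoff:
            yield i, sequence[i:i + w], s

# In practice the input would be the -700/+300 promoter sequence.
for i, site, s in scan("ATGCGCATTTGCGCAT"):
    print(f"offset {i}: {site} scores {s:.1f}")
```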
3. Experimental Verification of E2F Binding Site Locations
A few essential biological assays, such as the gel shift or ChIP assay, exist for experimentally testing whether the computationally identified sequences are really bound by transcription factors. Before this stage, we normally check the identified putative E2F regulatory sequences with a luciferase-based reporter assay. Briefly, a promoter fragment is cloned upstream of the firefly luciferase reporter gene in pGL3-Basic (Promega, Madison, WI). pRL-TK (Promega), a plasmid that contains the Renilla luciferase gene under the cytomegalovirus promoter, is utilized as an internal control to normalize for transfection efficiency. A Dual Luciferase Reporter Assay Kit (Promega) is used for the luciferase reporter assay, and light intensity is quantified in a luminescence reader (GloMax 20/20 Luminometer, Promega). In our experience, gene promoters that include predicted E2F regulatory elements sometimes fail to be up-regulated by E2F1 co-expression in human cell culture. For example, a TRANSFAC search revealed that the upstream region of the human histone acetyltransferase MYST2 (a histone acetyltransferase binding to ORC1; hereafter we call this gene HBO1) possesses a putative E2F regulatory element (cut-off score 86). Moreover, HBO1 was selected as a candidate E2F target gene because HBO1 has been shown to bind to the DNA replication factors ORC1 and MCM2 [Iizuka and Stillman, 1999; Burke et al., 2001]. DNA replication factors are known as classic targets of E2F [Ren et al., 2002]. Indeed, ORC1, MCM5, and MCM6 have been shown to be regulated by E2F1 [Ohtani et al., 1996; Ohtani et al., 1999]. These pieces of evidence, favored by bioinformatics, strongly suggested that HBO1 is potentially a novel target of E2F. To make sure, we cloned the -270/-1 region of the HBO1 sequence (RefSeq Accession NM_007067), including the predicted E2F regulatory element, into the pGL3-Basic vector upstream of the luciferase coding region, and analyzed the promoter activity of the vector, with firefly luciferase as the reporter gene, transiently transfected into cultured human cells. Co-expression of the E2F1 expression vector with pGL3-HBO1 -270/-1 failed to induce the reporter gene activity under the control of the putative HBO1 promoter (unpublished data). Therefore, this step is very important when determining whether candidate genes are really worth forwarding to the next experimental steps such as RT-PCR, gel shift, and ChIP assays.
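The normalization logic behind the dual-luciferase readout can be stated in a few lines. In the sketch below, the luminometer readings are hypothetical; relative promoter activity is the firefly/Renilla ratio, and induction is the ratio of normalized activities with and without E2F1 co-expression.

```python
# Dual-luciferase normalization with hypothetical luminometer readings.
# firefly: reporter driven by the cloned promoter fragment (pGL3 construct)
# renilla: pRL-TK internal control for transfection efficiency

from statistics import mean

def normalized_activity(firefly, renilla):
    """Mean firefly/Renilla ratio across replicate wells."""
    return mean(f / r for f, r in zip(firefly, renilla))

# Hypothetical triplicates (relative light units).
basal     = normalized_activity([1200, 1100, 1250], [500, 480, 510])
plus_e2f1 = normalized_activity([1300, 1150, 1280], [490, 505, 500])

fold = plus_e2f1 / basal
print(f"fold activation by E2F1 co-expression: {fold:.2f}")
# A value near 1, as observed for the HBO1 -270/-1 construct described
# above, argues against the predicted element being functional in this assay.
```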
4. ChIP-on-Chip
Recent progress in genomic microarray technology has allowed researchers to identify the exact locations on chromatin bound by transcription factors. The combination of chromatin immunoprecipitation followed by hybridization to tiled arrays (ChIP-on-chip or ChIP-chip) is one of the most powerful applications of the genomic microarray [Lieb, 2003; Ren and Dynlacht, 2004; Blais and Dynlacht, 2005a]. Because cell cycle regulation is strictly controlled by a small set of transcription factors, it is ideally suited to analysis by ChIP-on-chip [Blais and Dynlacht, 2005b; Wu et al., 2007]. Among cell cycle regulators acting at the transcriptional level, E2F has unique roles not only in G1 to S phase progression but also in DNA replication, DNA damage repair, differentiation, and development [Cam and Dynlacht, 2003; Bracken et al., 2004]. Therefore, ChIP-on-chip analysis has been employed to identify E2F target genes [Blais and Dynlacht, 2004]. Among ~1,200 genes expressed during cell cycle entry, it was found that the promoters of 127 genes were bound by E2F4, partly in common with E2F1, in human primary fibroblasts. Remarkably, this experiment revealed that E2F could regulate genes involved in chromatin assembly/condensation, chromosome segregation, and the mitotic spindle checkpoint [Ren et al., 2002]. Using a human CpG microarray, ChIP analysis revealed that E2F4 could bind to 68 unique targets, involving genes encoding proteins involved in DNA repair or recombination [Weinmann et al., 2002]. Another report using the ChIP approach argued that E2F and nuclear respiratory factor-1 cooperatively regulate E2F target genes, particularly for mitochondrial biogenesis and metabolism [Cam et al., 2004]. ChIP-on-chip analyses of approximately 24,000 promoters indicated that more than 20% of the promoters were bound by E2F1 [Bieda et al., 2006]. Interestingly, within 30 mega-bp of the human genome, more than 80% of E2F1 binding sites are estimated to be located within core promoters, and 50% of the E2F1 binding sites overlap with transcription start sites [Bieda et al., 2006]. Recently, the binding patterns of E2F1, E2F4, and E2F6 were assayed by ChIP-on-chip in normal and cancerous cells [Xu et al., 2007], demonstrating that the three members share common target promoters located within 2 kilo-bp of the transcription start sites of the target genes. Epigenetics should also be considered to account for the spatial regulation of chromatin by E2F. It is recognized that silent and active chromatin loci in the eukaryotic genome, namely the balance between methylation and acetylation of histone H3 lysine 9, are important in E2F-dependent promoter regulation [Nicolas et al., 2003].
5. Systems Biology
The circadian rhythm is a set of biological rhythms with a periodicity of around 24 hours. Systems biology approaches have partly succeeded in establishing that mammalian circadian clocks consist of complex integrated feedback loops of transcription factors [Hayes et al., 2005; Kronauer et al., 2007]. Interestingly, a recent finding in cyanobacteria strongly indicates that circadian clocks are regulated through transcriptional-translational feedback regulation, especially through phosphorylation of a key clock protein [Iwasaki et al., 2002; Naef, 2005]. The same tendency holds for cell cycle regulation. The cell cycle is also a set of biological rhythms with a periodicity of around 24 hours, and it involves not only complex integrated feedback loops of transcription factors but also protein-protein interactions. E2F1 itself is known to be modified at the post-translational level. For instance, E2F1 is phosphorylated in response to DNA damage [Stevens et al., 2003]. Stabilized E2F1 can switch its transcriptional targets from cell cycle progression to apoptosis genes in response to DNA damage [Pediconi et al., 2003]. The cell cycle database (http://www.itb.cnr.it/cellcycle/) is a useful bioinformatics tool [Alfieri et al., 2008] to help with understanding cell cycle gene ontology. A recent data-driven mathematical approach to understanding G1 cell cycle progression revealed that cyclin E/Cdk2 activation is independent of cyclin D/Cdk4/6 in mammalian cells [Haberichter et al., 2007]. Cyclin D/Cdk4/6 and cyclin E/Cdk2 complexes are upstream regulators of E2F; therefore, a data-driven mathematical approach could be applicable to understanding the transcriptional-translational feedback regulation, as well as the redundant E2F members, in the context of cell cycle regulation. Apparently, E2F cannot stand alone in cell cycle regulation. The crossover and information exchange among E2F, p53, and c-Myc in the regulation of the cell cycle and carcinogenesis has been well documented [Matsumura et al., 2003; Stanelle and Pützer, 2006]. A systems biology approach demonstrated that p53-mediated transcriptional repression of several target genes is dependent on the activities of E2F [Tabach et al., 2005]. E2F1 is known to be involved in both the cell cycle and apoptosis [La Thangue, 2003; Bell and Ryan, 2004]. How these functions are separately exerted at the molecular level, however, remains uncertain [Knezevic and Brash, 2004]. In addition to feedback regulation by transcription factors, apoptosis induced by E2F1 is roughly divided into p53-dependent and p53-independent types [Stanelle and Pützer, 2006]. These issues are among the most challenging to be solved by a systems biology approach in the near future.
Conclusion
In the future, transcriptional complexity will be further unraveled through close scrutiny enabled by methodological improvements. At present, we can summarize transcriptional complexity at the unicellular level, but eventually it has to be unraveled at the tissue or multicellular organism level. Understanding the full spectrum of gene regulatory networks will allow us to add or withdraw certain factors from well-established systems. This type of simulation could be a powerful tool to predict or simulate the pathological conditions of gene regulatory malformation frequently seen in cancerous or diseased cells. Stimuli or genes that can affect the robustness of networks could serve as better tools to treat human disease.
Acknowledgement This work was supported in part by a Grant-in-Aid from the Ministry of Education, Culture, Sports, Science and Technology in Japan (MEXT).
References
Alfieri R, Merelli I, Mosca E, Milanesi L: The cell cycle DB: a systems biology approach to cell cycle analysis. Nucleic Acids Res 2008, 36:D641-645. Barrett T, Edgar R: Mining microarray data at NCBI's Gene Expression Omnibus (GEO). Methods Mol Biol 2006, 338:175-190. Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Edgar R: NCBI GEO: mining tens of millions of expression profiles—database and tools update. Nucleic Acids Res 2007, 35:D760-765. Bell LA, Ryan KM: Life and death decisions by E2F-1. Cell Death Differ 2004, 11:137-142. Bieda M, Xu X, Singer MA, Green R, Farnham PJ: Unbiased location analysis of E2F1-binding sites suggests a widespread role for E2F1 in the human genome. Genome Res 2006, 16:595-605.
Black EP, Hallstrom T, Dressman HK, West M, Nevins JR: Distinctions in the specificity of E2F function revealed by gene expression signatures. Proc Natl Acad Sci USA 2005, 102:15948-15953. Blais A, Dynlacht BD: Hitting their targets: an emerging picture of E2F and cell cycle control. Curr Opin Genet Dev 2004, 14:527-532. Blais A, Dynlacht BD: Constructing transcriptional regulatory networks. Genes Dev 2005a, 19:1499-1511. Blais A, Dynlacht BD: Devising transcriptional regulatory networks operating during the cell cycle and differentiation using ChIP-on-chip. Chromosome Res 2005b, 13:275-288. Bracken AP, Ciro M, Cocito A, Helin K: E2F target genes: unraveling the biology. Trends Biochem Sci 2004, 29:409-417. Burke TW, Cook JG, Asano M, Nevins JR: Replication factors MCM2 and ORC1 interact with the histone acetyltransferase HBO1. J Biol Chem 2001, 276:15397-15408. Cam H, Dynlacht BD: Emerging roles for E2F: beyond the G1/S transition and DNA replication. Cancer Cell 2003, 3:311-316. Cam H, Balciunaite E, Blais A, Spektor A, Scarpulla RC, Young R, Kluger Y, Dynlacht BD: A common set of gene regulatory networks links metabolism and growth inhibition. Mol Cell 2004, 16:399-411. Cartharius K, Frech K, Grote K, Klocke B, Haltmeier M, Klingenhoff A, Frisch M, Bayerlein M, Werner T: MatInspector and beyond: promoter analysis based on transcription factor binding sites. Bioinformatics 2005, 21:2933-2942. Edgar R, Barrett T: NCBI GEO standards and services for microarray data. Nat Biotechnol 2006, 24:1471-1472. Futcher B: Transcriptional regulatory networks and the yeast cell cycle. Curr Opin Cell Biol 2002, 14:676-683. Gerstein MB, Bruce C, Rozowsky JS, Zheng D, Du J, Korbel JO, Emanuelsson O, Zhang ZD, Weissman S, Snyder M: What is a gene, post-ENCODE? History and updated definition. Genome Res 2007, 17:669-681. Giannakakis A, Sandaltzopoulos R, Greshock J, Liang S, Huang J, Hasegawa K, Li C, O'Brien-Jenkins A, Katsaros D, Weber BL, et al: miR-210 links hypoxia with cell cycle regulation and is deleted in human epithelial ovarian cancer. Cancer Biol Ther 2008, 7:252-261. Goto Y, Hayashi R, Kang D, Yoshida K: Acute loss of transcription factor E2F1 induces mitochondrial biogenesis in HeLa cells. J Cell Physiol 2006, 209:923-934. Guigó R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, et al: EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol 2006, 7:Suppl 1:S2.1-31. Haberichter T, Mädge B, Christopher RA, Yoshioka N, Dhiman A, Miller R, Gendelman R, Aksenov SV, Khalil IG, Dowdy SF: A systems biology dynamical model of mammalian G1 cell cycle progression. Mol Syst Biol 2007, 3:84. Hayes KR, Baggs JE, Hogenesch JB: Circadian clocks are seeing the systems biology light. Genome Biol 2005, 6:219. Huang E, Ishida S, Pittman J, Dressman H, Bild A, Kloos M, D'Amico M, Pestell RG, West M, Nevins JR: Gene expression phenotypic models that predict the activity of oncogenic pathways. Nat Genet 2003, 34:226-230.
Iizuka M, Stillman B: Histone acetyltransferase HBO1 interacts with the ORC1 subunit of the human initiator protein. J Biol Chem 1999, 274:23027-23034. Ishida S, Huang E, Zuzan H, Spang R, Leone G, West M, Nevins JR: Role for E2F in control of both DNA replication and mitotic functions as revealed from DNA microarray analysis. Mol Cell Biol 2001, 21:4684-4699. Iwasaki H, Nishiwaki T, Kitayama Y, Nakajima M, Kondo T: KaiA-stimulated KaiC phosphorylation in circadian timing loops in cyanobacteria. Proc Natl Acad Sci USA 2002, 99:15788-15793. Jamshidi-Parsian A, Dong Y, Zheng X, Zhou HS, Zacharias W, McMasters KM: Gene expression profiling of E2F-1-induced apoptosis. Gene 2005, 344:67-77. Kent OA, Mendell JT: A small piece in the cancer puzzle: microRNAs as tumor suppressors and oncogenes. Oncogene 2006, 25:6188-6196. Knezevic D, Brash DE: Role of E2F1 in apoptosis: a case study in feedback loops. Cell Cycle 2004, 3:729-732. Korenjak M, Brehm A: E2F-Rb complexes regulating transcription of genes important for differentiation and development. Curr Opin Genet Dev 2005, 15:520-527. Kronauer RE, Gunzelmann G, Van Dongen HP, Doyle FJr, Klerman EB: Uncovering physiologic mechanisms of circadian rhythms and sleep/wake regulation through mathematical modeling. J Biol Rhythms 2007, 22:233-245. La Thangue NB: The yin and yang of E2F-1: balancing life and death. Nat Cell Biol 2003, 5:587-589. Lieb JD: Genome-wide mapping of protein-DNA interactions by chromatin immunoprecipitation and DNA microarray hybridization. Methods Mol Biol 2003, 224:99-109. Ma Y, Croxton R, Moorer RLJ, Cress WD: Identification of novel E2F1-regulated genes by microarray. Arch Biochem Biophys 2002, 399:212-224. Matsumura I, Tanaka H, Kanakura Y: E2F1 and c-Myc in cell growth and death. Cell Cycle 2003, 2:333-338. Matys V, Fricke E, Geffers R, Gössling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV, et al: TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res 2003, 31:374-378. Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, et al: TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res 2006, 34:D108-110. Naef F: Circadian clocks go in vitro: purely post-translational oscillators in cyanobacteria. Mol Syst Biol 2005, 1:2005.0019. Nicolas E, Roumillac C, Trouche D: Balance between acetylation and methylation of histone H3 lysine 9 on the E2F-responsive dihydrofolate reductase promoter. Mol Cell Biol 2003, 23:1614-1622. O'Donnell KA, Wentzel EA, Zeller KI, Dang CV, Mendell JT: c-Myc-regulated microRNAs modulate E2F1 expression. Nature 2005, 435:839-843. Ohtani K, DeGregori J, Leone G, Herendeen DR, Kelly TJ, Nevins JR: Expression of the HsOrc1 gene, a human ORC1 homolog, is regulated by cell proliferation via the E2F transcription factor. Mol Cell Biol 1996, 16:6977-6984.
Ohtani K, Iwanaga R, Nakamura M, Ikeda M, Yabuta N, Tsuruga H, Nojima H: Cell growth-regulated expression of mammalian MCM5 and MCM6 genes mediated by the transcription factor E2F. Oncogene 1999, 18:2299-2309. Pediconi N, Ianari A, Costanzo A, Belloni L, Gallo R, Cimino L, Porcellini A, Screpanti I, Balsano C, Alesse E, et al: Differential regulation of E2F1 apoptotic target genes in response to DNA damage. Nat Cell Biol 2003, 5:552-558. Ren B, Cam H, Takahashi Y, Volkert T, Terragni J, Young RA, Dynlacht BD: E2F integrates cell cycle progression with DNA repair, replication, and G(2)/M checkpoints. Genes Dev 2002, 16:245-256. Ren B, Dynlacht BD: Use of chromatin immunoprecipitation assays in genome-wide location analysis of mammalian transcription factors. Methods Enzymol 2004, 376:304-315. Stanelle J, Stiewe T, Theseling CC, Peter M, Pützer BM: Gene expression changes in response to E2F1 activation. Nucleic Acids Res 2002, 30:1859-1867. Stanelle J, Pützer BM: E2F1-induced apoptosis: turning killers into therapeutics. Trends Mol Med 2006, 12:177-185. Stevens C, Smith L, La Thangue NB: Chk2 activates E2F-1 in response to DNA damage. Nat Cell Biol 2003, 5:401-409. Stormo GD: DNA binding sites: representation and discovery. Bioinformatics 2000, 16:16-23. Sylvestre Y, De Guire V, Querido E, Mukhopadhyay UK, Bourdeau V, Major F, Ferbeyre G, Chartrand P: An E2F/miR-20a autoregulatory feedback loop. J Biol Chem 2007, 282:2135-2143. Tabach Y, Milyavsky M, Shats I, Brosh R, Zuk O, Yitzhaky A, Mantovani R, Domany E, Rotter V, Pilpel Y: The promoters of human cell cycle genes integrate signals from two tumor suppressive pathways during cellular transformation. Mol Syst Biol 2005, 1:2005.0022. Tazawa H, Tsuchiya N, Izumiya M, Nakagama H: Tumor-suppressive miR-34a induces senescence-like growth arrest through modulation of the E2F pathway in human colon cancer cells. Proc Natl Acad Sci USA 2007, 104:15472-15477. Vernell R, Helin K, Müller H: Identification of target genes of the p16INK4A-pRB-E2F pathway. J Biol Chem 2003, 278:46124-46137. Wang H, Ach RA, Curry B: Direct and sensitive miRNA profiling from low-input total RNA. RNA 2007, 13:151-159. Weinmann AS, Yan PS, Oberley MJ, Huang TH, Farnham PJ: Isolating human transcription factor targets by coupling chromatin immunoprecipitation and CpG island microarray analysis. Genes Dev 2002, 16:235-244. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, et al: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2007, 35:D5-12. Wu WS, Li WH, Chen BS: Identifying regulatory targets of cell cycle transcription factors using gene expression and ChIP-chip data. BMC Bioinformatics 2007, 8:188. Xu X, Bieda M, Jin VX, Rabinovich A, Oberley MJ, Green R, Farnham PJ: A comprehensive ChIP-chip analysis of E2F1, E2F4, and E2F6 in normal and tumor cells reveals interchangeable roles of E2F family members. Genome Res 2007, 17:1550-1561. Yamasaki L, Pagano M: Cell cycle, proteolysis and cancer. Curr Opin Cell Biol 2004, 16:623-628.
Young AP, Nagarajan R, Longmore GD: Mechanisms of transcriptional regulation by Rb-E2F segregate by biological pathway. Oncogene 2003, 22:7209-7217. Zhang B, Pan X, Cobb GP, Anderson TA: microRNAs as oncogenes and tumor suppressors. Dev Biol 2007, 302:1-12. Zhang MQ: Computational analyses of eukaryotic promoters. BMC Bioinformatics 2007, 8:Suppl 6:S3.
In: Computational Biology: New Research Editor: Alona S. Russe
ISBN 978-1-60692-040-4 © 2009 Nova Science Publishers, Inc.
Chapter 3
SOLVING A STOCHASTIC GENERALIZED ASSIGNMENT PROBLEM WITH BRANCH AND PRICE
David P. Morton∗ Graduate Program in Operations Research, Department of Mechanical Engineering, The University of Texas, Austin, TX 78712-0292, 512-471-4104 USA
Jonathan F. Bard† Graduate Program in Operations Research, Department of Mechanical Engineering, The University of Texas, Austin, TX 78712-0292, 512-471-3076 USA
Yong Min Wang‡ American Airlines, 333 Amon Carter Blvd. MD 5358, Fort Worth, TX 76155-2664 USA
Abstract
In this chapter, we investigate the generalized assignment problem with the objective of finding a minimum-cost assignment of jobs to agents subject to capacity constraints. A complicating feature of the model is that the coefficients for resource consumption and capacity are random. The problem is formulated as a stochastic integer program with a penalty term associated with the violation of the resource constraints and is solved with a branch-and-price algorithm that combines column generation with branch and bound. To speed convergence, a stabilization procedure is included. The performance of the methodology was tested on four classes of randomly-generated instances. The principal results showed that the value of the stochastic solution (VSS), i.e., the gap between the stochastic solution and the expected value solution, was 35.5% on average. At the root node of the search tree, it was found that the linear programming relaxation of the master problem associated with column
∗E-mail address: [email protected]
†E-mail address: [email protected]. Corresponding author.
‡E-mail address: [email protected]
generation provided a much tighter lower bound than the relaxation of the original constraint-based formulation. In fact, two thirds of the test problems evidenced no gap between the optimal integer solution and the relaxed master problem solution. Additional testing showed that (1) initializing the master problem with a feasible solution outperforms the classical big-M approach; (2) SOS type 1 branching is superior to single-variable branching; and (3) variable fixing based on reduced costs provides only a slight decrease in runtimes.
Key words: stochastic integer programming, generalized assignment problem, branch and price
1. Introduction
The generalized assignment problem (GAP) is a classical example of a difficult combinatorial optimization problem that has received considerable attention over the years due to its widespread application. In many instances, it appears as a substructure in more complicated models, including routing problems [11], facility location models [21], and computer networking applications [2]. In the deterministic version of the problem, the objective is to assign jobs to agents at minimum cost, subject to capacity constraints. Despite its simplicity, the GAP is strongly NP-hard [18] and can only be solved to optimality for a few hundred agents and jobs. Recently, the GAP has been extended to capture uncertainty, often an important factor in real-world applications [1, 25, 26]. A stochastic GAP can arise when an agent's capacity and/or coefficients for resource consumption are known only imprecisely. The purpose of this research is to investigate a stochastic GAP in which both of these sets of coefficients are random. To find solutions, we have developed a branch-and-price algorithm that makes use of a stabilization procedure to speed convergence. Experimental results are presented for instances with up to 20 agents, 60 jobs, and 100 scenarios. In the next section, the relevant literature is reviewed. Section 3 introduces a constraint-based stochastic integer programming formulation for the stochastic GAP, which is reformulated as a set partitioning problem in Section 4 as the first step in column generation. Our branch-and-price algorithm is described in Section 5, followed by our computational experience in Section 6. Insights gained from model development and testing are summarized in Section 7.
2. Literature Review
Most exact methods for solving the GAP are based on branch and bound (B&B), beginning with the work of Ross and Soland [20]. Improved relaxations were developed by Fisher et al. [11] and Martello and Toth [17]. In both of these papers, the job assignment constraints are relaxed and placed in the objective function as a penalty term giving rise to a Lagrangian formulation, which is solved with a dual-ascent procedure. The approach proposed in [11] is extended in [14] by allowing temporary primal infeasibility and by adding surrogate constraints whenever such violations occur. However, the performance of the corresponding
algorithm was seen to degrade as the ratio of the number of agents to the number of jobs grew. Often, exact solutions are not necessary and near-optimal solutions that can be obtained quickly suffice. Most such heuristics for the GAP are based on its linear programming (LP) relaxation. In [27], for example, the LP relaxation is repeatedly solved and decision variables taking 0 or 1 values are fixed. In the implementation, the fixing can occur with a probability less than one. Narciso and Lorena [19] describe a second heuristic rooted in Lagrangian relaxation that makes use of a subgradient algorithm to improve the multipliers. Relative to the deterministic GAP, stochastic variants have received much less attention. Random yes-no demand for individual jobs is modeled via Bernoulli random variables for resource consumption in [1]. In this stochastic program, job-agent assignments are made in the first stage, and after the resource-consumption coefficients are realized, reassignments are made in the second stage to deal with overloaded agents. When the resource-consumption coefficients are independent of the agents, the combinatorial reassignment problem is totally unimodular. This convexity property allows the authors to employ an L-shaped decomposition method with binary first-stage variables and continuous second-stage variables. (The L-shaped method [28] generates cutting planes in a manner similar to Benders' decomposition.) The standard deterministic GAP involves a set of agents, each with a single resource-consumption constraint, but is extended to agents with multiple resources in [12]. A stochastic variant of the multi-resource GAP is developed in [26], where the resource-consumption coefficients are deterministic but the agents' resource capacities are random. Like the model in [1], job-agent assignments are made prior to the realization of the random parameters. To correct any violations of the capacity constraints that may result, three recourse alternatives are discussed: (i) penalize the magnitude of the total violation, (ii) penalize the number of violated resource constraints, or (iii) cancel jobs (and incur a penalty) to satisfy the resource constraints. Bimodal, exponential, and normally-distributed resource-capacity distributions were considered. Lagrangian relaxation bounds were iteratively tightened and used with a B&B algorithm to solve these models. A stochastic GAP model is presented by Spoerl and Wood [25] in which the resource-consumption coefficients are independent and normally distributed. The normality assumption is exploited in developing an equivalent deterministic model. Under the additional assumption that the resource-consumption coefficients for an agent all have a common mean-to-variance ratio, a smaller deterministic equivalent model is derived. Dantzig-Wolfe decomposition [6, 7] exploits specially-structured linear programs using a reformulation with exponentially many columns. These columns are iteratively generated and added to a master program as needed by solving a so-called pricing subproblem to identify columns with attractive reduced costs. More generally, a column-generation (CG) method is a way to approach a mathematical program with an excessive number of columns. In some integer programs like the GAP, CG is applied to a reformulation with exponentially many columns because more compact formulations have weaker LP relaxations. (In other circumstances, such formulations may be the only model available.)
Often, CG is applied only at the root node of the B&B tree for the purpose of obtaining a tight initial LP bound. When the approach is applied at all nodes of the search tree, the full methodology is called branch and price (B&P) [4, 29]. In recent decades, B&P has been
used with considerable success on a variety of problems, including the GAP [23], facility location problems [24], vehicle routing problems [8], and scheduling problems [3, 9]. Effective CG depends on being able to solve the pricing subproblem quickly and on the overall algorithm converging in a reasonable number of iterations. However, since its early application to cutting-stock problems [13], CG has had a reputation of slow convergence, which is attributable in part to multiple optimal dual solutions of the master problem. In response, stabilization methods have been developed that attempt to limit the distance between the dual solutions from one iteration to the next. The primary means of accomplishing this is by slightly relaxing the master problem constraints and penalizing “infeasibilities.” This has the effect of placing bounds on the dual variables. These bounds and associated penalties are typically updated dynamically [10]. Other approaches resolve multiple optimal dual solutions by using an interior point method to solve the master problem [22] or by using a weighted average of previously generated solutions and the current dual solution [30]. For more detail, see [15]. In this chapter, we consider a stochastic GAP that has random resource capacities and random resource-consumption coefficients. We penalize, in expectation, a weighted sum of the magnitudes of resource-constraint violations and then develop a B&P algorithm that includes a stabilization feature along with several other computational enhancements. To the best of our knowledge, no version of the stochastic GAP has been solved by B&P in the literature, although we note that the possibility was discussed in [24].
3. Mathematical Formulation
We begin by presenting the standard model of the deterministic GAP. Let $i \in I$ index the set of agents and $j \in J$ index the set of jobs. The goal of the GAP is to find a minimum-cost assignment, using binary decision variables $x_{ij}$, of all jobs $j \in J$ to agents $i \in I$. The problem can be formulated as follows:

$$\min_{x} \; \sum_{i \in I} \sum_{j \in J} c_{ij} x_{ij} \tag{1a}$$

$$\text{s.t.} \quad \sum_{i \in I} x_{ij} = 1, \quad j \in J \tag{1b}$$

$$\sum_{j \in J} d_{ij} x_{ij} \le b_i, \quad i \in I \tag{1c}$$

$$x_{ij} \in \{0, 1\}, \quad i \in I, \; j \in J \tag{1d}$$
where $c_{ij}$ is the cost of assigning job $j$ to agent $i$, $b_i$ is the capacity of agent $i$, and $d_{ij}$ is the amount of that capacity consumed when job $j$ is assigned to agent $i$. Constraints (1b) and (1d) ensure that each job is assigned to exactly one agent, and constraints (1c) ensure that the resource capacity of agent $i$ is obeyed. The objective function (1a) minimizes the total cost of assigning all jobs to agents.

Now let us consider uncertainty in the resource-capacity and resource-consumption coefficients. Let $\tilde\xi_i = \big( (\tilde d_{ij})_{j \in J}, \tilde b_i \big)$ denote the random vector of coefficients associated with agent $i$, and let $\xi_i^\omega = \big( (d_{ij}^\omega)_{j \in J}, b_i^\omega \big)$ be its realizations indexed over the sample space, $\omega \in \Omega_i$, where it is assumed that $|\Omega_i|$ is finite. Let $p_i^\omega = P(\tilde\xi_i = \xi_i^\omega)$, $\omega \in \Omega_i$, be the corresponding probability mass function, and let $\tilde\xi = (\tilde\xi_i)_{i \in I}$ be the vector of all the random coefficients. As we describe below, however, our model's objective function is separable in the subvectors $\tilde\xi_i$, $i \in I$, and hence the dependency structure among these subvectors is irrelevant. Restated, an optimal solution to our model is optimal for all dependency structures between these subvectors.

As with all the stochastic GAP models referenced in the previous section [1, 25, 26], our model first assigns jobs to agents subject to constraints (1b) and (1d) and with costs as indicated in (1a). Then we observe a realization of $\tilde\xi$ and penalize the sum of the magnitudes of violations in constraints (1c) over all agents with respective unit penalties $q_i \ge 0$. We add to the costs already incurred in the objective function (1a) the expected value of this penalty function, given as follows:

$$E_{\tilde\xi} \sum_{i \in I} q_i \left( \sum_{j \in J} \tilde d_{ij} x_{ij} - \tilde b_i \right)^{+} = \sum_{i \in I} q_i \, E_{\tilde\xi_i} \left( \sum_{j \in J} \tilde d_{ij} x_{ij} - \tilde b_i \right)^{+} = \sum_{i \in I} q_i \sum_{\omega \in \Omega_i} p_i^\omega \left( \sum_{j \in J} d_{ij}^\omega x_{ij} - b_i^\omega \right)^{+} \tag{2}$$
Here, $(\cdot)^{+} = \max(\cdot, 0)$. Linearizing the "max" terms with the help of the continuous variables $y_i^\omega$, which denote the magnitudes of the constraint violations, leads to

$$z^* = \min_{x, y} \; \sum_{i \in I} \sum_{j \in J} c_{ij} x_{ij} + \sum_{i \in I} q_i \sum_{\omega \in \Omega_i} p_i^\omega y_i^\omega \tag{3a}$$

$$\text{s.t.} \quad \sum_{i \in I} x_{ij} = 1, \quad j \in J \tag{3b}$$

$$\sum_{j \in J} d_{ij}^\omega x_{ij} - y_i^\omega \le b_i^\omega, \quad \omega \in \Omega_i, \; i \in I \tag{3c}$$

$$x_{ij} \in \{0, 1\}, \quad i \in I, \; j \in J \tag{3d}$$

$$y_i^\omega \ge 0, \quad \omega \in \Omega_i, \; i \in I \tag{3e}$$
The objective function in (3a) minimizes the job assignment costs as in (1a) plus the expected cost of violating the resource-capacity constraints. Constraints (3c) and (3e) achieve the desired linearization of (2), while constraints (3b) and (3d) are identical to (1b) and (1d).
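The recourse term in (2), which reappears as the second summation in (3a), can be evaluated directly for any fixed assignment. The sketch below does exactly that on a hypothetical instance with two agents, two jobs, and two equally likely scenarios per agent; it is a reading aid for the formula, not part of the algorithm.

```python
# Expected capacity-violation penalty of equation (2) for a fixed assignment x.
# All data below are hypothetical.

def expected_penalty(x, d, b, p, q):
    """x[i][j] in {0,1}; d[i][w][j], b[i][w], p[i][w] are agent i's scenarios w."""
    total = 0.0
    for i in range(len(q)):
        for w in range(len(p[i])):
            load = sum(d[i][w][j] * x[i][j] for j in range(len(x[i])))
            total += q[i] * p[i][w] * max(load - b[i][w], 0.0)  # (.)^+ term
    return total

x = [[1, 0], [0, 1]]                      # job 0 -> agent 0, job 1 -> agent 1
d = [[[3, 2], [5, 2]], [[2, 4], [2, 6]]]  # d[i][w][j]
b = [[4, 4], [5, 5]]                      # b[i][w]
p = [[0.5, 0.5], [0.5, 0.5]]              # two equally likely scenarios per agent
q = [10.0, 10.0]                          # unit penalties q_i

# Each agent violates its capacity by 1 in its second scenario:
# penalty = 10*0.5*1 + 10*0.5*1 = 10.0
print(expected_penalty(x, d, b, p, q))
```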
4. Reformulating the Stochastic GAP
We begin this section by reformulating model (3) so that it has a column orientation. The motivation for doing so is that the LP relaxation of the constraint-based formulation (3) gives weak lower bounds on $z^*$, resulting in excessive runtimes when B&B is applied directly. (We revisit this issue in Section 6.) After reformulating model (3), we describe how to obtain upper and lower bounds on $z^*$ that can be used to reduce the size of the search tree. Finally, we augment our reformulation to help stabilize the CG iterations within the B&P algorithm.
4.1. Formulation
Let $x_i = (x_{ij})_{j \in J}$ denote the vector of job assignments to agent $i$, and let $x_i^k$, $k \in K_i$, index all $2^{|J|}$ such assignments, ranging from agent $i$ having no assignments to being assigned all jobs from $J$. Given assignment $x_i^k$ and scenario $\omega \in \Omega_i$, let $y_i^{\omega k} = \left( \sum_{j \in J} d_{ij}^\omega x_{ij}^k - b_i^\omega \right)^{+}$. With $y_i^k = \big( y_i^{\omega k} \big)_{\omega \in \Omega_i}$, we have the pairs $(x_i^k, y_i^k)$ over $k \in K_i$, $i \in I$, representing all feasible solutions to (3). Defining $c_i^k = \sum_{j \in J} c_{ij} x_{ij}^k + q_i \sum_{\omega \in \Omega_i} p_i^\omega y_i^{\omega k}$, i.e., the expected cost of assignment $x_i^k$, we can reformulate model (3) as follows:

$$z^* = \min_{\lambda} \; \sum_{i \in I} \sum_{k \in K_i} c_i^k \lambda_i^k \tag{4a}$$

$$\text{s.t.} \quad \sum_{i \in I} \sum_{k \in K_i} x_{ij}^k \lambda_i^k = 1, \quad j \in J \tag{4b}$$

$$\sum_{k \in K_i} \lambda_i^k = 1, \quad i \in I \tag{4c}$$

$$\lambda_i^k \in \{0, 1\}, \quad k \in K_i, \; i \in I \tag{4d}$$
Constraints (4c) and (4d) ensure exactly one set of jobs is assigned to each agent and the objective function gives the expected cost of that assignment. Constraint (4b) is then equivalent to (3b), i.e., each job is done once. Problem (4) is called the full master problem (MP). Of course, it is neither practical nor desirable to explicitly enumerate all feasible assignments of Ki , i ∈ I. Instead we start with modest-sized subsets Ki0 ⊂ Ki, i ∈ I, that have the property that each job j ∈ J can be covered by at least one of the agents, i.e., constraints (4b) - (4d) are feasible when Ki is replaced by Ki0. This leads to the so-called restricted master problem (RMP). z¯ = min λ
s.t.
XX i∈I
k∈Ki0
i∈I
k∈Ki0
XX X
cki λki
(5a)
xkij λki = 1,
λki = 1,
j∈J
i∈I
(5b) (5c)
k∈Ki0
ˆ = Let λ
ˆk λ i
k∈Ki0 , i∈I
λki ∈ {0, 1},
k ∈ Ki0, i ∈ I
be a feasible solution to the LP relaxation of model (5).
ˆij = Then (ˆ xij , yˆiω )ω∈Ωi ,i∈I,j∈J is a feasible solution to the LP relaxation of (3), where x P P k k ω ωk k ˆ ˆ ˆi = k∈K 0 yi λi . The first stage decision is binary, and the following k∈Ki0 xij λi and y i ˆ and x proposition characterizes the relationship between λ ˆ. ˆ = (λ ˆ k )k∈K 0 ,i∈I be an optimal solution to the LP Proposition 1. (Savelsbergh [23]) Let λ i i ˆ k is fractional for some i, then there must be a j such that x ˆij = relaxation of (5). If λ i P k k ˆ k∈K 0 xij λi is fractional. i
The proof of this proposition in [23] is for the deterministic GAP, but hinges on the assumption that there are no duplicate columns in the (restricted) master problem. This assumption is valid for our (restricted) master problem, so Savelsbergh's proof carries over directly to (5). Consider the LP relaxation of (5) and let $\pi_j$, $j \in J$, and $\alpha_i$, $i \in I$, be optimal dual variables associated with constraints (5b) and (5c), respectively. The reduced cost for $\lambda_i^k$ is then

$$\bar c_i^k = c_i^k - \sum_{j\in J} \pi_j x_{ij}^{k} - \alpha_i = \sum_{j\in J} (c_{ij} - \pi_j)\, x_{ij}^{k} + q_i \sum_{\omega\in\Omega_i} p_i^{\omega} y_i^{\omega k} - \alpha_i$$

The optimal dual multipliers from the RMP are defined over the problem with columns $K_i^0$, $i \in I$. Consider the problem of solving $\min_{k\in K_i} \bar c_i^k$, i.e., finding a column for agent $i$ with the smallest reduced cost over the set $K_i$. Such a column can be found by solving the following pricing problem for agent $i$.
$$v_i = \min_{x_i, y_i} \sum_{j\in J} (c_{ij} - \pi_j)\, x_{ij} + q_i \sum_{\omega\in\Omega_i} p_i^{\omega} y_i^{\omega} \qquad (6)$$
$$\text{s.t.}\quad \sum_{j\in J} d_{ij}^{\omega} x_{ij} - y_i^{\omega} \le b_i^{\omega}, \qquad \omega \in \Omega_i$$
$$x_{ij} \in \{0,1\},\ j \in J; \qquad y_i^{\omega} \ge 0,\ \omega \in \Omega_i$$

Let $(\hat x_i, \hat y_i)$ solve (6), where $\hat x_i = (\hat x_{ij})_{j\in J}$ and $\hat y_i = (\hat y_i^{\omega})_{\omega\in\Omega_i}$. If $v_i - \alpha_i \equiv \min_{k\in K_i} \bar c_i^k < 0$, then we add the column $\hat x_i$ to the RMP with objective function coefficient $\hat c_i = \sum_{j\in J} c_{ij} \hat x_{ij} + q_i \sum_{\omega\in\Omega_i} p_i^{\omega} \hat y_i^{\omega}$.
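Because each $y_i^{\omega}$ is determined by $x_i$ at optimality, the pricing problem (6) can, for very small $|J|$ only, be solved exactly by enumerating all $2^{|J|}$ job subsets. The sketch below is ours and is meant only to make the reduced-cost computation concrete; a practical implementation would instead solve (6) as a small integer program.

```python
from itertools import product

def price_agent(c, pi, d, b, p, q):
    """Brute-force pricing for one agent: returns (v_i, best x_i).

    c[j] : assignment costs c_ij for this agent
    pi[j]: dual prices on the job-coverage constraints
    d[w][j], b[w], p[w]: scenario consumptions, capacities, probabilities
    q    : unit penalty q_i for this agent
    """
    n = len(c)
    best_val, best_x = float("inf"), None
    for x in product((0, 1), repeat=n):          # all 2^n candidate columns
        val = sum((c[j] - pi[j]) * x[j] for j in range(n))
        for w in range(len(p)):                  # expected violation penalty
            load = sum(d[w][j] * x[j] for j in range(n))
            val += q * p[w] * max(load - b[w], 0.0)
        if val < best_val:
            best_val, best_x = val, x
    return best_val, best_x
```

With the data of Table 1 (Section 5.5) and the example's initial duals $\pi = (3197, 1482, 0, 0, 0)$, this routine returns the assignment $(1, 1, 0, 0, 0)$ for agent 1, which matches the column $x^2$ generated there.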
4.2. Lower and Upper Bounds
If we solve the integer-constrained RMP as described above and the corresponding solution has optimal value $\bar z$, then this provides an upper bound on the optimal value of the original problem, i.e., $\bar z \ge z^*$. This follows because any integer-feasible solution to the RMP, $\hat\lambda = (\hat\lambda_i^k)_{k\in K_i^0,\, i\in I}$, can be extended to a feasible solution to the full MP by setting $\hat\lambda_i^k = 0$ for $k \in K_i \setminus K_i^0$, i.e., for those columns not in the RMP.
The lower bound we use follows a standard construction in column-generation methods. The LP relaxation of the full MP provides a lower bound on $z^*$, as does any feasible solution to its dual. Let $(\pi_j)_{j\in J}$ and $(\alpha_i)_{i\in I}$ be an optimal solution to the dual of the RMP's LP relaxation, and let $v_i$, $i \in I$, be defined as in (6). We can then show that $(\pi_j)_{j\in J}$ and $(\alpha_i)_{i\in I}$ are feasible to the dual of the relaxed full MP. This solution is feasible to the dual constraints for $k \in K_i \setminus K_i^0$ by the way $v_i$, $i \in I$, is defined in (6). See, for example, [5, 6] for detailed arguments along these lines. We summarize this result in the following theorem.

Theorem 1. Let $q_i \ge 0$, $i \in I$, and let $z^*$ be the optimal value of model (3), or equivalently of model (4). Let $x_i^k$, $k \in K_i^0$, $i \in I$, be such that the RMP (5) is feasible, and let $(\pi_j)_{j\in J}$ and $(\alpha_i)_{i\in I}$ be an optimal solution to the dual of model (5)'s LP relaxation. Let $v_i$, $i \in I$, be defined as in (6). Then, using $\pi = (\pi_j)_{j\in J}$ as dual variables on constraints (4b) and $\alpha = (\alpha_i)_{i\in I}$ as dual variables on constraints (4c), we have that $(\pi, \alpha)$ is feasible to the dual of the LP relaxation of (4) and hence

$$\underline z \equiv \sum_{j\in J} \pi_j + \sum_{i\in I} v_i \le z^* \qquad (7)$$
The lower bound of Theorem 1 is useful because it allows us to terminate the column-generation procedure without having to ensure $v_i - \alpha_i \ge 0$ for all $i \in I$, and it is also valuable in fathoming nodes in the branch-and-bound tree. The lower bound of Theorem 1 is (locally) valid at any node in the search tree, provided the associated RMP is feasible.
4.3. Column Generation
The B&P algorithm consists of two parts. In the first part, an optimal solution to the LP relaxation of the master program is obtained using column generation. Initially, this LP relaxation is that of model (4), but as the iterations progress, some assignment decisions are fixed via branching, so the LP changes. The column-generation component of the algorithm is summarized in Figure 1. Step 0 populates $K_i^0$ with an initial set of columns. This can be replaced with a big-M method as discussed presently. We compare the relative performance of these two approaches in Section 6. As indicated, the CG procedure can alternatively be terminated with a near-optimal solution using a relative tolerance of $\varepsilon \ge 0$ by employing the lower and upper bounds of the previous section. Specifically, we terminate in Step 3 if $\bar z - \underline z \le \varepsilon \cdot \min\{|\bar z|, |\underline z|\}$. The second part of the algorithm is described in Section 5.
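Both the bound (7) and this relative-tolerance test are simple to compute once the duals and pricing values are in hand; a small sketch of ours, with illustrative names and the tolerance as an assumed default:

```python
def cg_lower_bound(pi, v):
    """Lower bound (7): sum of the job duals plus the pricing values v_i."""
    return sum(pi) + sum(v)

def cg_can_stop(z_upper, z_lower, eps=1e-4):
    """Early-termination test of Section 4.3:
    stop when z_bar - z_lb <= eps * min(|z_bar|, |z_lb|)."""
    return z_upper - z_lower <= eps * min(abs(z_upper), abs(z_lower))
```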
4.4. Stabilization
Column generation methods often reach a near-optimal solution relatively quickly but then experience slow convergence to an optimum. Oscillation in the dual variables of the RMP's LP relaxation, which are used to form the pricing subproblem, has been observed to cause this behavior. A simple idea to prevent such oscillation is to bound the dual values [16, 22, 30]. We follow the approach of [10], in which dual restrictions are achieved through a perturbation of the primal constraints.
Procedure CG

Step 0. Let $x_i^k = (x_{ij}^{k})_{j\in J}$, $k \in K_i^0$, $i \in I$, denote an initial set of possible assignments for each agent. Let $y_i^{\omega k} = \bigl(\sum_{j\in J} d_{ij}^{\omega} x_{ij}^{k} - b_i^{\omega}\bigr)^{+}$, $\omega \in \Omega_i$, $k \in K_i^0$, $i \in I$, and $c_i^k = \sum_{j\in J} c_{ij} x_{ij}^{k} + q_i \sum_{\omega\in\Omega_i} p_i^{\omega} y_i^{\omega k}$, $k \in K_i^0$, $i \in I$. Initialize RMP (5) with columns $(x_i^k, c_i^k)$, $k \in K_i^0$, $i \in I$.

Step 1. Solve the LP relaxation of the RMP (5) to obtain optimal value $z_{LP}$, primal solution $\lambda$, and dual solution $(\pi, \alpha)$.

Step 2. For $i \in I$, solve the pricing problem (6) to obtain $v_i$, $\hat x_i$ and $\hat y_i$.

Step 3. If $v_i \ge \alpha_i$ for all $i \in I$, stop and report $z_{LP}$ and $x^*$, where $x_{ij}^* = \sum_{k\in K_i^0} x_{ij}^{k} \lambda_i^k$, $i \in I$, $j \in J$.

Step 4. For $i \in I$, if $v_i < \alpha_i$ then add an element to index set $K_i^0$ for the new column $(\hat x_i, \hat c_i)$, where $\hat c_i = \sum_{j\in J} c_{ij} \hat x_{ij} + q_i \sum_{\omega\in\Omega_i} p_i^{\omega} \hat y_i^{\omega}$. Go to Step 1.

Figure 1. Column generation procedure.
In particular, we replace the LP relaxation of the RMP (5) with the following.
$$\min_{\lambda, u^+, u^-} \sum_{i\in I}\sum_{k\in K_i^0} c_i^k \lambda_i^k + \sum_{j\in J} \kappa_j^+ u_j^+ - \sum_{j\in J} \kappa_j^- u_j^- \qquad (8a)$$
$$\text{s.t.}\quad \sum_{i\in I}\sum_{k\in K_i^0} x_{ij}^{k} \lambda_i^k + u_j^+ - u_j^- = 1, \qquad j \in J \qquad (8b)$$
$$\sum_{k\in K_i^0} \lambda_i^k = 1, \qquad i \in I \qquad (8c)$$
$$u_j^+ \le \epsilon_j^+, \qquad j \in J \qquad (8d)$$
$$u_j^- \le \epsilon_j^-, \qquad j \in J \qquad (8e)$$
$$\lambda_i^k,\ u_j^+,\ u_j^- \ge 0, \qquad k \in K_i^0,\ i \in I,\ j \in J \qquad (8f)$$
This model modifies constraint (5b), allowing infeasibilities that are limited by $\epsilon_j^+$ and $\epsilon_j^-$ for each job $j \in J$. Such violations are penalized in the objective function using the parameters
$\kappa_j^+$ and $\kappa_j^-$, $j \in J$. The dual of (8) provides insight into the stability issue and is given by
$$\max_{\pi, \alpha, \delta^+, \delta^-} \sum_{j\in J} \pi_j + \sum_{i\in I} \alpha_i - \sum_{j\in J} \bigl(\epsilon_j^+ \delta_j^+ + \epsilon_j^- \delta_j^-\bigr) \qquad (9a)$$
$$\text{s.t.}\quad \sum_{j\in J} x_{ij}^{k} \pi_j + \alpha_i \le c_i^k, \qquad k \in K_i^0,\ i \in I \qquad (9b)$$
$$\pi_j - \delta_j^+ \le \kappa_j^+, \qquad j \in J \qquad (9c)$$
$$-\pi_j - \delta_j^- \le -\kappa_j^-, \qquad j \in J \qquad (9d)$$
$$\delta_j^+,\ \delta_j^- \ge 0, \qquad j \in J \qquad (9e)$$
where $\pi = (\pi_j)_{j\in J}$ and $\alpha = (\alpha_i)_{i\in I}$ are, respectively, the dual variables associated with constraints (8b) and (8c), and $\delta^+ = (\delta_j^+)_{j\in J}$ and $\delta^- = (\delta_j^-)_{j\in J}$ are the (negatives of the) dual variables associated with constraints (8d) and (8e). Constraints (9c) and (9d) can be rewritten as

$$\kappa_j^- - \delta_j^- \le \pi_j \le \kappa_j^+ + \delta_j^+.$$
Thus choosing $\pi$ outside the (trust) region $[\kappa^-, \kappa^+]$ is possible by choosing $\delta^+$ or $\delta^-$ to be positive, but such deviations are penalized in model (9)'s objective function using $\epsilon^+$ and $\epsilon^-$. These parameters are updated dynamically. A natural choice for $\kappa^+$ and $\kappa^-$ is the dual solution from the previous iteration, i.e., $\kappa^+ = \kappa^- = \pi$. If the dual variables $\pi$ used in the pricing subproblems fail to produce new columns for the RMP, then we decrease $\epsilon^+$ and $\epsilon^-$; otherwise, we increase them. The master problem modified for dual stabilization (8) does not produce the type of lower bound described in Section 4.2. Of course, when $\epsilon^+ = \epsilon^- = 0$, model (8) reduces to the LP relaxation of RMP (5), and then the lower bound is valid. When employing stabilization, we replace Step 3 in the column generation procedure of Figure 1 with the following step, where $r < 1$ is a predefined constant.

Step 3. If $v_i \ge \alpha_i$ for all $i \in I$ then:
If $\epsilon_j^+ = \epsilon_j^- = 0$ for all $j \in J$, then stop and report $z_{LP}$ and $x^*$, where $x_{ij}^* = \sum_{k\in K_i^0} x_{ij}^{k} \lambda_i^k$, $i \in I$, $j \in J$.
Otherwise, update $\epsilon_j^+ \leftarrow r\epsilon_j^+$, $\epsilon_j^- \leftarrow r\epsilon_j^-$ and $\kappa_j^+ \leftarrow \pi_j$, $\kappa_j^- \leftarrow \pi_j$, for $j \in J$, where $\pi$ is the dual solution obtained at Step 1. Go to Step 1.

In the implementation, when $\epsilon_j^+$ and $\epsilon_j^-$ are sufficiently small, they are set to zero.
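The update embedded in the modified Step 3 can be sketched as follows. This is our illustration; the values $r = 0.1$ and the 0.001 snap-to-zero threshold are those reported in Section 6, and all names are assumptions.

```python
def stabilization_update(eps_plus, eps_minus, pi, r=0.1, zero_tol=1e-3):
    """Shrink the trust-region widths by r < 1, snap sufficiently small
    widths to zero, and recenter kappa^+ = kappa^- at the current duals pi."""
    new_plus = [0.0 if r * e < zero_tol else r * e for e in eps_plus]
    new_minus = [0.0 if r * e < zero_tol else r * e for e in eps_minus]
    kappa = list(pi)  # kappa_j^+ = kappa_j^- = pi_j
    return new_plus, new_minus, kappa, kappa
```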
5. Branch-and-Price Algorithm
Our goal is to solve the integer-constrained stochastic GAP defined by model (3). The column generation procedure of the previous section solves the LP relaxation of the reformulated model (4). While this will be shown in the next section to provide a tighter lower
bound on $z^*$ than the LP relaxation of (3), its solution may still be fractional. Applying B&B to the RMP with the columns associated with $K_i^0$, $i \in I$, that were obtained at the end of the CG procedure, however, does not ensure that we will find an optimal solution to the original problem, because some of the columns in $K_i \setminus K_i^0$, $i \in I$, could be required. This can be discovered only by generating further columns after having branched on one or more variables. In other words, we must use column generation at each node in the search tree to ensure that an optimal solution to the original model (3) is found. Typically, we prefer to branch on the original decision variables in model (3), i.e., the $x$-variables, rather than those of (4), i.e., the $\lambda$-variables. Branching, of course, affects the pricing subproblems. Specifically, for some fixed $i_1 \in I$ and $j_1 \in J$, if $x_{i_1 j_1} = 0$ through branching, then we simply remove job $j_1$ from the set $J$ when forming agent $i_1$'s pricing problem (6). If $x_{i_1 j_1} = 1$, then we also remove job $j_1$ from $J$, but in addition we replace the resource values $b_{i_1}^{\omega}$, $\omega \in \Omega_{i_1}$, with $b_{i_1}^{\omega} - d_{i_1 j_1}^{\omega}$, $\omega \in \Omega_{i_1}$, in (6). These issues are further described below, followed by the details of the B&P algorithm.
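To make the effect of branching on the pricing data concrete, here is a small sketch of ours that applies a set of fixed assignment decisions to one agent's pricing problem; the data layout is illustrative.

```python
def restrict_pricing(jobs, b, d, fixed):
    """Apply branching decisions to one agent's pricing problem (6).

    jobs : list of job indices still priced for this agent
    b[w] : scenario capacities; d[w][j]: scenario consumptions
    fixed: list of (j1, val) pairs; val = 0 forbids job j1, val = 1
           assigns it (and consumes capacity in every scenario)
    """
    removed = {j1 for j1, _ in fixed}
    jobs = [j for j in jobs if j not in removed]   # drop j1 in both cases
    b = list(b)
    for j1, val in fixed:
        if val == 1:
            for w in range(len(b)):
                b[w] -= d[w][j1]                   # b^w <- b^w - d^w_{i1 j1}
    return jobs, b
```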
5.1. Master Problem
In order to carry out column generation at a node in the search tree, we must begin with a feasible RMP in the form of (5), albeit with additional restrictions due to branching. At the root node, a feasible solution is obtained in the following manner: we divide the set of jobs $J$ among the agents using the assignment costs $c_{ij}$, $i \in I$, $j \in J$. Specifically, we partition the elements of $J$ by forming a column for each agent in which $\hat x_{ij} = 1$ if $i \in \arg\min_{i\in I} c_{ij}$ and $\hat x_{ij} = 0$ otherwise. To ensure a valid partition (and to preclude the same job being assigned to two agents), we break ties lexicographically, i.e., if multiple agents for job $j$ have cost $\min_{i\in I} c_{ij}$, then the job is assigned to the agent with the smallest index $i$. The objective function coefficient for agent $i$'s column is computed in the usual manner, $\hat c_i = \sum_{j\in J} c_{ij} \hat x_{ij} + q_i \sum_{\omega\in\Omega_i} p_i^{\omega} \bigl(\sum_{j\in J} d_{ij}^{\omega} \hat x_{ij} - b_i^{\omega}\bigr)^{+}$. After branching restrictions have been added, we can again come across an infeasible initial RMP, this despite the fact that the associated full MP with branching restrictions is feasible provided each job can be performed by some agent. Therefore, we carry out a similar procedure, if necessary, i.e., we partition unassigned jobs among the agents eligible to perform them. An alternative to this approach is a big-M method in which an artificial agent is introduced for each job; the artificial agent can perform only that job, but at a very high cost.
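The root-node initialization just described, each job to its cheapest agent with lexicographic tie-breaking, is only a few lines of code; a sketch of ours with illustrative names:

```python
def initial_partition(c):
    """One initial 0-1 column per agent: job j goes to the agent with the
    smallest c[i][j], ties broken by the smallest agent index."""
    m, n = len(c), len(c[0])
    cols = [[0] * n for _ in range(m)]
    for j in range(n):
        i_best = min(range(m), key=lambda i: (c[i][j], i))  # lexicographic ties
        cols[i_best][j] = 1
    return cols
```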
5.2. Branching
We consider two branching strategies. The first is the standard approach of branching on a single variable $x_{ij}$ for some $i$ and $j$. The second branches on a subset of variables associated with type 1 special ordered sets (SOS). To branch on $x_{ij}$ we form descendent nodes in which we respectively fix $x_{ij} = 1$ and $x_{ij} = 0$. The $x_{ij}$ whose value is closest to $\frac{1}{2}$ is chosen for branching. Setting $x_{ij} = 1$ implies that job $j$ is assigned to agent $i$, while $x_{ij} = 0$ forbids agent $i$ from doing job $j$. This rule affects the $\lambda$-variables in the LP relaxation of the RMP (5) as follows: to forbid the assignment of $j$ to $i$, we set $\lambda_i^k = 0$ for all $k \in K_i^0$ with $x_{ij}^{k} = 1$. To assign $j$ to $i$, we set $\lambda_i^k = 0$ for all $k \in K_i^0$ with $x_{ij}^{k} = 0$, and we set $\lambda_{i'}^{k} = 0$ for all $i' \ne i$, $k \in K_{i'}^0$ with $x_{i'j}^{k} = 1$.
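A sketch of ours of the most-fractional selection rule, operating on the recovered assignment values:

```python
def most_fractional(x_hat):
    """Return the (i, j) whose x_hat[i][j] is closest to 1/2, where
    x_hat[i][j] = sum_k x_ij^k * lambda_i^k; None if x_hat is integral."""
    best, best_gap = None, 0.5
    for i, row in enumerate(x_hat):
        for j, val in enumerate(row):
            gap = abs(val - 0.5)
            if 0.0 < val < 1.0 and gap <= best_gap:
                best, best_gap = (i, j), gap
    return best
```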
Of course, setting $\lambda_i^k = 0$ amounts to simply removing the corresponding column from the formulation. Our second strategy relies on the fact that, for an arbitrary subset of agents $I' \subseteq I$ and a fixed job $j \in J$, we have the logical constraint $\sum_{i\in I'} x_{ij} \le 1$, since at most one agent $i \in I'$ will be assigned job $j$. This allows us to use SOS branching, where two descendent nodes are created. The first node has $\sum_{i\in I'} x_{ij} = 0$ and the second has $\sum_{i\in \bar I'} x_{ij} = 0$, where $\bar I' = I \setminus I'$. We choose $I'$ a priori so that $|I'| \approx |I|/2$ in an attempt to achieve "balance" in the LPs associated with the descendent nodes. We perform SOS branching on the job $j$ such that $\sum_{i\in I'} x_{ij}$ is closest to 0.5. In our testing, we compare two approaches for selecting the next node to explore in the search tree, based on depth-first search and best-bound search. When depth-first search is employed, the node with the most fractional variables is explored first.
5.3. Fathoming
When using branch and bound to solve an integer program, we fathom a node when one of the following conditions is satisfied: (i) the current node is infeasible, (ii) the current node's lower bound is greater than or equal to the incumbent upper bound, or (iii) the node produces an integer solution. These three rules apply in the B&P setting with the understanding that a node corresponds to the full MP under the given branching restriction. Restated, in general, a rule does not apply when speaking of the RMP at a node. We have previously mentioned that infeasibility of the RMP does not imply infeasibility of the full MP at a node, so rule (i) rarely applies in our setting. (It applies only if there is a job that all agents are forbidden to do, due to branching restrictions.) Rule (ii) can be applied to the RMP at a node, i.e., if the lower bound $\underline z$ of Theorem 1, adapted to an LP relaxation with branching constraints, exceeds $\bar z$, the global upper bound, the node can be fathomed. In general we cannot fathom a node when the RMP has an integer solution but has columns with negative reduced costs. Of course, if the objective function value happens to be sufficiently close to the lower bound, we can terminate.
5.4. Outline of Algorithm
The following notation is used to describe the second part of the B&P algorithm.

$P$: the LP relaxation of the RMP (5)
$SP_i$: the pricing subproblem (6) for agent $i$
$N$: the list that contains branching constraints
$v(C)$: 0 if the branching constraint $C$ is not yet explored and 1 otherwise
$\bar C$: the counter branching constraint of $C$
BRANCH($N$): subroutine that chooses a branching constraint from the list $N$ and returns its location in the list. The depth-first search returns the last location in the list, while the best-bound search returns the location that has the smallest objective function value.
CHOOSENODE(): subroutine that chooses a branching variable. Since branching is performed on the original variables, it returns the $(i, j)$ such that $x_{ij} = \sum_{k\in K_i^0} x_{ij}^{k} \lambda_i^k$ is closest to 0.5 for single-variable branching with depth-first search. For best-bound search, it returns the $(i, j)$ whose node has the lowest bound.
Procedure BP

Initialization:
  Perform Procedure CG to find $\underline z$ and corresponding solution $\lambda$
  if all $\lambda$'s are integer then stop
  Obtain $(i, j)$ = CHOOSENODE() and add a branching constraint $C = \{x_{ij} = 0\}$ to the list $N$
  Set $v(C) = 0$, $\bar z = \infty$, $\underline z = -\infty$
Do while ($N \ne \emptyset$)
  $k$ = BRANCH($N$); $v(C_k) = 1$
  Fix variables in $P$ and $SP_i$, $i \in I$, according to the branching constraint $C_k$
  Do while (true)
    Solve $P$ to find its objective value $z$ and primal ($\lambda$) and dual ($\pi, \alpha$) solutions
    if all $\lambda_i^k$'s are integer then
      if $z < \bar z$ then $\bar z \leftarrow z$ and $(\hat x, \hat y) \leftarrow \Bigl( \bigl(\sum_{k\in K_i^0} x_{ij}^{k} \lambda_i^k\bigr)_{i\in I,\, j\in J},\ \bigl(\sum_{k\in K_i^0} y_i^{\omega k} \lambda_i^k\bigr)_{\omega\in\Omega_i,\, i\in I} \Bigr)$
    end if
    for $i \in I$:
      Solve $SP_i$ with $\pi$ to obtain its objective value $v_i$ and a solution $(x_i, y_i) = \bigl((x_{ij})_{j\in J}, (y_i^{\omega})_{\omega\in\Omega_i}\bigr)$
      if $v_i - \alpha_i < 0$ then add a column to $P$ with $x_i$ and $\hat c_i$, where $\hat c_i = \sum_{j\in J} c_{ij} x_{ij} + q_i \sum_{\omega\in\Omega_i} p_i^{\omega} y_i^{\omega}$
    end for
    $\underline z \leftarrow z + \sum_{i\in I} (v_i - \alpha_i)$
    if $\underline z \ge \bar z$ then do FATHOM($C_k$, $N$) and break
    if no column is added for all $i \in I$ then break
  End do
  Obtain $(i, j)$ = CHOOSENODE() and add a branching constraint $C = \{x_{ij} = 0\}$ to the list $N$
  Set $v(C) = 0$
End do
Report $z^* = \bar z$ and $(\hat x, \hat y)$

Figure 2. Overview of branch-and-price procedure.
The algorithm is summarized in Figure 2 for single-variable branching. For SOS branching, the two code segments that state "add a branching constraint $C = \{x_{ij} = 0\}$" should have the branching constraint replaced by either $C = \{\sum_{i\in I'} x_{ij} = 0\}$ or $C = \{\sum_{i\in \bar I'} x_{ij} = 0\}$, depending on the situation. The stabilization procedure can be applied to all the nodes in the search tree, but empirically its effect is significant only at the root node for our problem. The subroutine FATHOM() in Figure 2 removes the branching constraints that have already been explored and is described in Figure 3.
Subroutine FATHOM($C_k$, $N$)
  Do while (true)
    Restore the bounds for the variables that are fixed according to $C_k$ in $P$ and $SP_i$, $i \in I$
    $N \leftarrow N \setminus \{C_k\}$
    if $v(C_k) = 0$ then break
    $k \leftarrow k - 1$
    if the list $N$ is empty then return
  End do
  Add the constraint $\bar C_k$ to the list $N$
  return

Figure 3. Implementation of the fathoming subroutine.
5.5. Example
The B&P procedure in Figure 2 is illustrated with a 2-agent, 5-job, 1-scenario problem. Table 1 contains all input data. Because only one scenario is used, the index $\omega$ is omitted. Also, to keep things simple, the computations reflect single-variable branching with depth-first search.

Table 1. Input data for 2-agent, 5-job, 1-scenario example.

i    b_i    d_i1  d_i2  d_i3  d_i4  d_i5    c_i1  c_i2  c_i3  c_i4  c_i5    q_i
1    121     94     1    56    67    85      13   112    57    39    20     25
2     85      8    77    64    21    43     110    30    52    81    73     25
Initialization. The following feasible solutions are used to form the initial RMP:

$$x^1 = \begin{pmatrix} 1 & 0 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 & 0 \end{pmatrix}, \quad y^1 = \begin{pmatrix} 125 \\ 56 \end{pmatrix}, \quad \hat c^1 = \begin{pmatrix} 3197 \\ 1482 \end{pmatrix}$$

The dual solution of the LP relaxation of the initial RMP is $\pi = (3197, 1482, 0, 0, 0)$ and $\alpha = (0, 0)$. Next, we solve the pricing problems (6) with fixed $\pi$ and $\alpha$, yielding the following solutions for each $i$:

$$x^2 = \begin{pmatrix} 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 \end{pmatrix}, \quad y^2 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \hat c^2 = \begin{pmatrix} 125 \\ 140 \end{pmatrix}$$
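The column costs displayed above can be checked directly against the Table 1 data; the following short script of ours (illustrative only) mirrors the single-scenario cost computation $\hat c_i = \sum_j c_{ij} x_{ij} + q_i y_i$:

```python
# Data from Table 1 (single scenario, so the index w is dropped).
b = [121, 85]
d = [[94, 1, 56, 67, 85], [8, 77, 64, 21, 43]]
c = [[13, 112, 57, 39, 20], [110, 30, 52, 81, 73]]
q = [25, 25]

def column_cost(i, x_i):
    """Return (c_hat, y) for agent i carrying the 0-1 assignment x_i."""
    load = sum(d[i][j] * x_i[j] for j in range(5))
    y = max(load - b[i], 0)
    return sum(c[i][j] * x_i[j] for j in range(5)) + q[i] * y, y

print(column_cost(0, [1, 0, 0, 1, 1]))  # (3197, 125): agent 1 column of x^1
print(column_cost(1, [0, 1, 1, 0, 0]))  # (1482, 56) : agent 2 column of x^1
print(column_cost(0, [1, 1, 0, 0, 0]))  # (125, 0)   : agent 1 column of x^2
print(column_cost(1, [1, 1, 0, 0, 0]))  # (140, 0)   : agent 2 column of x^2
```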
We add $x^2$ as columns to the RMP. The sequence of solving the master and pricing problems is repeated until no column is generated by the pricing problems. At the end of the initialization step, the following RMP is obtained.

$$\begin{aligned} \min_{\lambda}\;& 3197\lambda_1^1 + 125\lambda_1^2 + 795\lambda_1^3 + 283\lambda_1^4 + 714\lambda_1^5 + 834\lambda_1^6 + 3334\lambda_1^7 + 132\lambda_1^8 \\ &+ 1482\lambda_2^1 + 140\lambda_2^2 + 162\lambda_2^3 + 133\lambda_2^4 + 675\lambda_2^5 + 154\lambda_2^6 + 264\lambda_2^7 + 73\lambda_2^8 \end{aligned} \qquad (10a)$$
$$\text{s.t.}\quad \lambda_1^1 + \lambda_1^2 + \lambda_1^3 + \lambda_1^7 + \lambda_2^2 + \lambda_2^3 + \lambda_2^7 = 1 \qquad (10b)$$
$$\lambda_1^2 + \lambda_1^4 + \lambda_1^5 + \lambda_1^7 + \lambda_1^8 + \lambda_2^1 + \lambda_2^2 = 1 \qquad (10c)$$
$$\lambda_1^3 + \lambda_1^4 + \lambda_1^5 + \lambda_2^1 + \lambda_2^3 + \lambda_2^4 + \lambda_2^5 = 1 \qquad (10d)$$
$$\lambda_1^1 + \lambda_1^4 + \lambda_1^6 + \lambda_1^7 + \lambda_2^4 + \lambda_2^6 + \lambda_2^7 = 1 \qquad (10e)$$
$$\lambda_1^1 + \lambda_1^5 + \lambda_1^6 + \lambda_1^7 + \lambda_1^8 + \lambda_2^5 + \lambda_2^6 + \lambda_2^7 + \lambda_2^8 = 1 \qquad (10f)$$
$$\lambda_1^1 + \lambda_1^2 + \lambda_1^3 + \lambda_1^4 + \lambda_1^5 + \lambda_1^6 + \lambda_1^7 + \lambda_1^8 = 1 \qquad (10g)$$
$$\lambda_2^1 + \lambda_2^2 + \lambda_2^3 + \lambda_2^4 + \lambda_2^5 + \lambda_2^6 + \lambda_2^7 + \lambda_2^8 = 1 \qquad (10h)$$
$$\lambda_i^k \ge 0, \qquad i = 1, 2,\ k = 1, \ldots, 8 \qquad (10i)$$

Here constraints (10b)-(10f) cover jobs 1-5, in turn, and (10g)-(10h) are the convexity constraints for the two agents.
Branch-and-Price Step. The RMP (10) produces the fractional solution $\lambda_1^4 = \lambda_1^8 = \lambda_2^3 = \lambda_2^7 = 0.5$, and we choose the most fractional assignment $\hat x_{13} = 0.5$, where $\hat x_{ij}$ is obtained from $\sum_{k\in K_i^0} x_{ij}^{k} \lambda_i^k$. We add the branching constraint $\{x_{13} = 0\}$ to the list $N$, giving $N = \{x_{13} = 0\}$, and set $v(\{x_{13} = 0\}) = 0$.

Node 1. Choose the constraint $\{x_{13} = 0\}$ and set $v(\{x_{13} = 0\}) = 1$. To enforce this constraint in the RMP, consider constraint (10d) and fix the variables $\lambda_1^3 = \lambda_1^4 = \lambda_1^5 = 0$. Add $\{x_{13} = 0\}$ to all pricing problems.

Solve RMP: The LP relaxation of the new RMP has optimal value $z = 1880.5$, primal solution $\lambda_1^7 = \lambda_1^8 = \lambda_2^3 = \lambda_2^4 = 0.5$, and dual solution $\pi = (1615.5, 2935.5, 3107.0, 1586.5, 1622.5)$ and $\alpha = (-4426.0, -4560.5)$.

Solve the pricing problems: The pricing problems yield optimal values $v_1 = -5173.5$ and $v_2 = -7321.0$. Since $v_i < \alpha_i$ for both agents, we add the following columns to the RMP and obtain the indicated lower bound:

$$x^9 = \begin{pmatrix} 0 & 1 & 0 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \end{pmatrix}, \quad y^9 = \begin{pmatrix} 32.0 \\ 128.0 \end{pmatrix}, \quad \hat c^9 = \begin{pmatrix} 971 \\ 3546 \end{pmatrix}$$
$$\underline z = 1880.5 + (-5173.5 + 4426.0) + (-7321.0 + 4560.5) = -1627.5$$

Solve RMP: The LP relaxation of the new RMP has optimal value $z = 1133.0$, primal solution $\lambda_1^9 = \lambda_2^3 = 1$, and dual solution $\pi = (1721.3, 137.0, 3145.0, 1012.7, 1554.7)$
and $\alpha = (-1733.3, -4024.7)$. Because we have an integer solution, we can update $\bar z$ to $\bar z = 1133.0$.

Solve the pricing problems: The pricing problems yield optimal values $v_1 = -1793.0$ and $v_2 = -5842.7$. Since $v_i < \alpha_i$ for both agents, add the following columns to the RMP, and obtain the following lower bound:

$$x^{10} = \begin{pmatrix} 1 & 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 1 & 1 \end{pmatrix}, \quad y^{10} = \begin{pmatrix} 59.0 \\ 51.0 \end{pmatrix}, \quad \hat c^{10} = \begin{pmatrix} 1620 \\ 1591 \end{pmatrix}$$
$$\underline z = 1133.0 + (-1793.0 + 1733.3) + (-5842.7 + 4024.7) = -744.7$$

Solve RMP: The LP relaxation of the new RMP has optimal value $z = 990.5$, primal solution $\lambda_1^2 = \lambda_1^8 = \lambda_2^4 = \lambda_2^{10} = 0.5$, and dual solution $\pi = (725.5, -5.5, 1327.0, 696.5, 732.5)$ and $\alpha = (-595.0, -1890.5)$.

Solve the pricing problems: The pricing problems yield optimal values $v_1 = -712.5$ and $v_2 = -2306.0$. Add the following columns to the RMP, and obtain the following lower bound:

$$x^{11} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 \end{pmatrix}, \quad y^{11} = \begin{pmatrix} 0.0 \\ 8.0 \end{pmatrix}, \quad \hat c^{11} = \begin{pmatrix} 13 \\ 443 \end{pmatrix}$$
$$\underline z = 990.5 + (-712.5 + 595.0) + (-2306.0 + 1890.5) = 457.5$$

Solve RMP: The LP relaxation of the new RMP has optimal value $z = 575.0$, primal solution $\lambda_1^8 = \lambda_2^{11} = 1$, and dual solution $\pi = (725.5, 112.0, 1327.0, 696.5, 732.5)$ and $\alpha = (-712.5, -1890.5)$. Since we have an integer solution and $z < \bar z$, update $\bar z = 575.0$.

Solve the pricing problems: The pricing problems yield optimal values $v_1 = -712.5$ and $v_2 = -2306.0$. Add the solution from the second pricing problem as a column to the RMP, and obtain the following lower bound:

$$x_2^{12} = (1, 0, 1, 1, 0), \quad y_2^{12} = 8.0, \quad \hat c_2^{12} = 443$$
$$\underline z = 575.0 + (-712.5 + 712.5) + (-2306.0 + 1890.5) = 159.5$$

Solve RMP: The LP relaxation of the new RMP has optimal value $z = 575.0$, primal solution $\lambda_1^8 = \lambda_2^{11} = 1$, and dual solution $\pi = (310.0, 112.0, 496.0, 281.0, 317.0)$ and $\alpha = (-297.0, -644.0)$.

Solve the pricing problems: The pricing problems yield optimal values $v_1 = -297.0$ and $v_2 = -644.0$. Since $v_i - \alpha_i = 0$ for both agents and the RMP has an integer solution, we fathom the node. Remove the restrictions $\lambda_1^3 = \lambda_1^4 = \lambda_1^5 = 0$ from the RMP and $x_{13} = 0$ from the pricing problems. Remove $\{x_{13} = 0\}$ from the list $N$, add $\{x_{13} = 1\}$ to the list $N$, and set $v(\{x_{13} = 1\}) = 0$.

Node 2. Choose the constraint $\{x_{13} = 1\}$ and set $v(\{x_{13} = 1\}) = 1$. To enforce this constraint in the RMP, again consider constraint (10d) and fix the variables $\lambda_1^1 = \lambda_1^2 = \lambda_1^6 = \lambda_1^7 = \lambda_1^8 = 0$ and $\lambda_2^1 = \lambda_2^3 = \lambda_2^4 = \lambda_2^5 = 0$. Add $\{x_{13} = 1\}$ to all pricing problems.
Solve RMP: The LP relaxation of the RMP has optimal value $z = 704.0$, primal solution $\lambda_1^3 = 0.25$, $\lambda_1^4 = 0.5$, $\lambda_1^5 = 0.25$, $\lambda_2^2 = 0.25$, $\lambda_2^7 = 0.5$, $\lambda_2^8 = 0.25$, and dual solution $\pi = (348.0, -7.0, 1690.0, 157.0, 274.0)$ and $\alpha = (-1243.0, -201.0)$.

Solve the pricing problems: The pricing problems yield optimal values $v_1 = -1633.0$ and $v_2 = -439.0$. Since $v_i < \alpha_i$ for both agents, add the following columns to the RMP, and obtain the following lower bound:

$$x_1^{12} = (0, 0, 1, 0, 0), \quad y_1^{12} = 0.0, \quad \hat c_1^{12} = 57$$
$$x_2^{13} = (1, 0, 0, 0, 1), \quad y_2^{13} = 0.0, \quad \hat c_2^{13} = 183$$
$$\underline z = 704.0 + (-1633.0 + 1243.0) + (-439.0 + 201.0) = 76.0$$

Solve RMP: The LP relaxation of the new RMP has optimal value $z = 466.0$, primal solution $\lambda_1^4 = \lambda_2^{13} = 1$, and dual solution $\pi = (110.0, 226.0, 1452.0, 0.0, 269.0)$ and $\alpha = (-1395.0, -196.0)$. Update $\bar z = 466$.

Solve the pricing problems: The pricing problems yield objective values $v_1 = -1509.0$ and $v_2 = -196.0$. Add the solution from the first pricing problem as a column to the RMP, and obtain the following lower bound:

$$x_1^{13} = (0, 1, 1, 0, 0), \quad y_1^{13} = 0.0, \quad \hat c_1^{13} = 169$$
$$\underline z = 466.0 + (-1509.0 + 1395.0) + (-196.0 + 196.0) = 352.0$$

Solve RMP: The LP relaxation of the new RMP has optimal value $z = 433.0$, primal solution $\lambda_1^{13} = \lambda_2^7 = 1$, and dual solution $\pi = (110.0, 112.0, 1452.0, 114.0, 122.0)$ and $\alpha = (-1395.0, -82.0)$. Update $\bar z = 433$.

Solve the pricing problems: The pricing problems yield optimal values $v_1 = -1420.0$ and $v_2 = -82.0$. Add the solution from the first pricing problem as a column to the RMP, and obtain the following lower bound:

$$x_1^{14} = (0, 0, 1, 1, 0), \quad y_1^{14} = 2.0, \quad \hat c_1^{14} = 146$$
$$\underline z = 433.0 + (-1420.0 + 1395.0) + (-82.0 + 82.0) = 408.0$$

Solve RMP: The LP relaxation of the new RMP has optimal value $z = 433.0$, primal solution $\lambda_1^{13} = \lambda_2^7 = 1$, and dual solution $\pi = (110.0, 137.0, 1452.0, 114.0, 147.0)$ and $\alpha = (-1420.0, -107.0)$.

Solve the pricing problems: The pricing problems yield optimal values $v_1 = -1420.0$ and $v_2 = -107.0$. Since $v_i - \alpha_i = 0$ for both agents, we fathom the node. Remove the restrictions $\lambda_1^1 = \lambda_1^2 = \lambda_1^6 = \lambda_1^7 = \lambda_1^8 = 0$ and $\lambda_2^1 = \lambda_2^3 = \lambda_2^4 = \lambda_2^5 = 0$ from the RMP and $x_{13} = 1$ from the pricing problems. Remove $\{x_{13} = 1\}$ from the list $N$. Since the list $N$ is empty, the algorithm terminates. The objective function value at node 2 is better than that at node 1, so we have $z^* = 433$ with solution $\lambda_1^{13} = \lambda_2^7 = 1$ and all remaining $\lambda = 0$. In terms of the original problem variables, the solution is

$$x^* = \begin{pmatrix} 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 1 \end{pmatrix}, \qquad y^* = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$
6. Computational Results
The B&P algorithm was implemented in Java using a PC running SuSE Linux with dual 1.8 GHz CPUs and 1 GB of memory for the computations. All integer and linear programs were solved with CPLEX 9.0 using a relative tolerance of 0.0001 for the IPs. For the testing, four classes of problems were randomly generated based on the rules for constructing deterministic GAP instances described in [17]. In particular, with $m = |I|$ and $n = |J|$, we have the following four classes:

A. The values of $c_{ij}$ and $d_{ij}$ are drawn from discrete uniform distributions on $\{10, \ldots, 25\}$ and $\{5, \ldots, 25\}$, respectively; $q_i = 30$ and $b_i = 9\frac{n}{m} + 0.4 \max_{1\le l\le m} \sum_{j\in J^*(l)} d_{lj}$, where $J^*(l) = \{j \mid l = \arg\min_{1\le r\le m} c_{rj}\}$, that is, $J^*(l)$ is the set of jobs that are best assigned to agent $l$ when capacity is not considered.

B. Same as A for $c_{ij}$, $d_{ij}$ and $q_i$; $b_i$ is 0.7 of the value of $b_i$ in A.

C. Same as A for $c_{ij}$, $d_{ij}$ and $q_i$; $b_i = 0.8 \sum_{j\in J} d_{ij}/m$.

D. Same as C for $b_i$; $d_{ij}$ is an integer drawn from a uniform distribution on $\{1, \ldots, 100\}$, $q_i = 125$, and $c_{ij} = 100 - d_{ij} + k$, where $k$ is an integer drawn from a uniform distribution on $\{1, \ldots, 21\}$.

In each case, 10 scenarios for $d_{ij}^{\omega}$ and $b_i^{\omega}$ were generated independently from uniform distributions on $d_{ij} \pm 20\%$ and $b_i \pm 20\%$, respectively. The number of agents $m$ was taken from the set $\{5, 10, 20\}$ and the number of jobs $n$ from the set $\{30, 50\}$. All instances were solved with the B&P algorithm and with CPLEX applied to the deterministic equivalent in order to gauge the effectiveness of the proposed approach. The name of each problem is composed of four fields (see Table 2): the first is a letter that stands for the problem class defined above; the remaining three refer to the numbers of agents, jobs and scenarios, in turn. Three objective function values are reported in Table 2: EV denotes the optimal objective function value obtained when we replace the stochastic elements with their expected values and solve the associated single-scenario problem. With $x^{EV}$ denoting the solution to the EV problem, EEV is given by

$$EEV = \sum_{i\in I}\sum_{j\in J} c_{ij} x_{ij}^{EV} + \sum_{i\in I} q_i \sum_{\omega\in\Omega_i} p_i^{\omega} \Bigl(\sum_{j\in J} d_{ij}^{\omega} x_{ij}^{EV} - b_i^{\omega}\Bigr)^{+}$$
That is, EEV is the objective function value of the stochastic GAP model evaluated at the suboptimal solution obtained by solving the EV problem. Finally, $RP = z^*$ is the optimal value of model (3). In addition to these objective function values, Table 2 shows the values of the stochastic solutions, i.e., $VSS = 100 \cdot (EEV - RP)/RP$. For all problems, the VSS values are higher than 5%. For example, for A.5.30.10, VSS is 46.56%, which means that we save an expected cost of 46.56% by solving the stochastic model rather than solving the deterministic problem using mean values. Note that the EV values are at most RP. This is ensured by Jensen's inequality, since the function in Eq. (2) is convex in the random parameters for a fixed value of the assignment decisions $x$.
Jensen's inequality provides another lower bound that we can use to fathom nodes during branch and price, provided we first solve the single-scenario integer program corresponding to EV. Sometimes the Jensen bound is weak, but other times the EV values are just as large as the RP values; e.g., for the cases of A.20.50.10 and B.10.50.10. This can occur even when VSS is quite large.
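For readers who wish to reproduce instances of this kind, the following sketch generates a class-A-style problem with sampled scenarios. It is our reconstruction of the recipe above (the capacity formula follows [17] as restated in this section), and every name in it is illustrative, not the authors' code.

```python
import random

def generate_class_A(m, n, n_scen, rel=0.20, seed=0):
    """Class-A-style stochastic GAP instance: nominal c, d, b, q plus
    n_scen scenarios drawn uniformly within +-rel of the nominal d and b."""
    rng = random.Random(seed)
    c = [[rng.randint(10, 25) for _ in range(n)] for _ in range(m)]
    d = [[rng.randint(5, 25) for _ in range(n)] for _ in range(m)]
    load = [0.0] * m                      # sum of d_lj over J*(l)
    for j in range(n):
        l = min(range(m), key=lambda i: c[i][j])   # cheapest agent for job j
        load[l] += d[l][j]
    b = [9 * n / m + 0.4 * max(load)] * m
    q = [30] * m
    d_scen = [[[d[i][j] * rng.uniform(1 - rel, 1 + rel) for j in range(n)]
               for _ in range(n_scen)] for i in range(m)]
    b_scen = [[b[i] * rng.uniform(1 - rel, 1 + rel) for _ in range(n_scen)]
              for i in range(m)]
    return c, d, b, q, d_scen, b_scen
```

The VSS entries in Table 2 then follow from the definition above; e.g., for A.5.30.10, $100 \cdot (532.0 - 363.0)/363.0 \approx 46.56\%$.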
Table 2. Objective function values for instances from four stochastic GAP problem classes

Problem        VSS (%)      EV       EEV        RP
A.5.30.10       46.56     362.0     532.0     363.0
A.5.50.10       93.04     617.0    1196.8     620.0
A.10.30.10     111.26     324.0     690.8     327.0
A.10.50.10      28.35     554.0     714.9     557.0
A.20.30.10      12.12     308.0     346.5     309.0
A.20.50.10      78.48     513.0     915.6     513.0
B.5.30.10       37.04     331.0     456.3     333.0
B.5.50.10       40.23     606.0     856.8     611.0
B.10.30.10       6.21     325.0     348.4     328.0
B.10.50.10      49.59     551.0     824.2     551.0
B.20.30.10      16.55     305.0     357.8     307.0
B.20.50.10      43.62     516.0     741.1     516.0
C.5.30.10        5.42     384.0     414.0     392.7
C.5.50.10       15.23     591.9     690.2     599.0
C.10.30.10      35.29     349.0     488.4     361.0
C.10.50.10      15.14     571.0     662.3     575.2
C.20.30.10      32.64     322.5     458.0     345.3
C.20.50.10      46.62     535.4     880.0     600.2
D.5.30.10        9.53    3062.3    3366.0    3073.0
D.5.50.10       16.85    5045.0    5916.0    5063.0
D.10.30.10      38.32    3010.9    4201.0    3037.1
D.10.50.10      34.05    4960.5    6688.4    4989.6
D.20.30.10      26.46    2992.2    3814.1    3016.0
D.20.50.10      36.63    4939.7    6865.3    4973.0
Table 3 reports the runtime in seconds for the B&P algorithm starting the computations with either initial solutions or the big-M method. For this comparison only the problem at the root node is solved. The results indicate that even though the quality of the initial solution was not good, it is better to start with any feasible solution than to use the big-M method, particularly on the larger instances. The latter approach can spend a significant amount of time finding a feasible solution. For the remainder of our analysis, the initial-solution method is used.

Table 3. Runtime comparisons for two initialization methods (sec.)

Problem        Initial solution    Big-M method
A.5.30.10             9.4               10.7
A.5.50.10            68.5               52.8
A.10.30.10            9.4               16.0
A.10.50.10           41.1               72.6
A.20.30.10            8.1               44.8
A.20.50.10           27.2              140.1
B.5.30.10            14.2               18.2
B.5.50.10            80.0              144.9
B.10.30.10            7.2               21.0
B.10.50.10           32.2              140.0
B.20.30.10            5.4               42.1
B.20.50.10           19.0              144.5
C.5.30.10            14.8               16.3
C.5.50.10            69.1               90.8
C.10.30.10            6.8               20.9
C.10.50.10           33.8               89.9
C.20.30.10            8.6               45.0
C.20.50.10           24.2              173.8
D.5.30.10            14.5               25.7
D.5.50.10           123.9              174.6
D.10.30.10            8.2               21.4
D.10.50.10           42.7              105.8
D.20.30.10           10.3               43.9
D.20.50.10           26.6              260.1
In the next experiment, the performance of our branching strategies and node-selection strategies is compared without using stabilization. Table 4 shows the results of the two branching strategies with best-bound node selection, and Table 5 does the same for depth-first search. In the tables, the problem labels are in the first column; for each strategy, the remaining columns contain the CPU time in seconds, the number of nodes explored in the search tree, the node at which the optimal solution was found, and the number of columns added during the column-generation phase. When depth-first search outperforms the best-bound search, it does so by a modest amount, but when the opposite holds the differences are often significant. Restated, relatively speaking, the distribution of runtimes for depth-first search seems to have a "long tail" in that sometimes the computational effort is much larger. When comparing SOS branching with single-variable branching, the same result holds. With the exception of D.10.50.10, when the number of B&B nodes was large, SOS branching tended to perform better. Note that some instances were solved by the column generation procedure alone, i.e., the number of nodes is 0. In this case, their computational times are essentially identical, regardless of the branching and node-selection strategy; see, for example, A.10.50.10. In what follows, we use the SOS-branching method with the best-bound search.

The results with stabilization are contained in Table 6. In the computations, the parameters $\epsilon_j^+$ and $\epsilon_j^-$ were initialized to 0.1, and a value of $r = 0.1$ was used when no column was added (see the modified Step 3 at the end of Section 4.4). When $\epsilon_j^+$ and $\epsilon_j^-$ dropped below 0.001, they were set to zero. The current dual solutions were used for the initial values of $\kappa_j^+$ and $\kappa_j^-$, which were updated when no columns were added. Updating them every iteration is a second possibility, but from our empirical results this approach did not prove to be effective. Comparing this table with the SOS-branching results in Table 4, we see that with stabilization, runtimes decreased in all but one instance (A.5.50.10). Without stabilization, the runtimes averaged 65% longer. The computational savings were due primarily to the reduced number of columns that were generated.

In the final experiment, we compared the performance of the B&P algorithm (under the initial-solution method, SOS branching, the best-bound search for node selection, and the stabilization procedure) with the results obtained by directly solving the deterministic equivalent (DEQ) model (3) with CPLEX. Results are first presented for the 24 test problems used above with 10 scenarios each. We then solve an additional set of problems, also from the classes A, B, C and D, but this time with a greater number of scenarios, ranging from 10 to 100. Table 7 presents the results for the LP gaps obtained from the LP relaxation at the root node, and the runtimes associated with solving the IPs directly (labeled DEQ) and with B&P. The LP gap was calculated by dividing the difference between the optimal LP value at the root node and the optimal value of the IP by the latter and multiplying by 100%, i.e., $LP = \frac{z_{IP} - z_{LP}}{z_{IP}} \times 100\%$. From the table we see that, on average, the LP_DEQ gap was 1.21% and the LP_BP gap was 0.03%, and that in 15 out of 24 instances the latter gap was zero to two significant digits.
Solution times averaged 67.0 seconds for the DEQ, excluding problem instances C.20.50.10 and D.20.50.10, which could not be solved by CPLEX within a 2-hour time limit, and 29.6 seconds for the B&P algorithm. The above results reflect an average over 10 scenarios for various numbers of agents and jobs. In Tables 8 and 9, we instead vary the number of scenarios from 10, 20, . . . , 100.
Table 4. Comparison of branching strategies without stabilization (best-bound search)

                        SOS branching                     Single-variable branching
Problem        Time    No.     Optimal  No.        Time    No.     Optimal  No.
               (sec.)  nodes   node     columns    (sec.)  nodes   node     columns
A.5.30.10        9.5      2       1        540       14.1      2       1       509
A.5.50.10       44.9     16       6       1305      216.3     10       3      1945
A.10.30.10      10.3      2       1        535       13.4      2       1       489
A.10.50.10      28.8      0       0       1069       26.1      0       0      1069
A.20.30.10       8.5      0       0        459        6.5      0       0      4593
A.20.50.10      34.4      0       0       1269       30.1      0       0      1269
B.5.30.10       12.7      2       1        642       12.0      2       2       557
B.5.50.10       78.8      4       1       1699      138.5      4       4      1616
B.10.30.10       9.0      2       1        482        8.5      2       2       450
B.10.50.10      37.2      0       0       1292       34.8      0       0      1292
B.20.30.10      11.6      2       1        581        9.8      6       4       578
B.20.50.10      27.4      2       1       1032       30.6      2       1      1008
C.5.30.10       35.1     28       9        836       36.7     24      21       669
C.5.50.10       99.4      4       2       1701     1707.9    152     152      5291
C.10.30.10      13.5      2       2        550       10.8      2       1       514
C.10.50.10      32.7      0       0       1124       30.3      0       0      1124
C.20.30.10      17.7      6       2        699       14.5      2       2       655
C.20.50.10      38.8      0       0       1328       35.3      0       0      1328
D.5.30.10       35.9     18       9        879       53.3     60      36      1138
D.5.50.10      332.6     70      38       2891      794.3    142     125      4984
D.10.30.10      16.5      0       0        788       15.5      0       0       788
D.10.50.10      91.8      6       3       1849       71.8      4       2      1754
D.20.30.10      32.8      0       0        659       30.5      0       0      1308
D.20.50.10     135.8      4       2       2670      133.4     14      14      2725
For each problem (e.g., A.20.60.70), we generated 10 random problem instances and report the average computational effort. Also, we used common random numbers in forming the problems. For example, A.20.60.20 simply adds 10 scenarios to the A.20.60.10 problem.
Table 5. Comparison of branching strategies without stabilization (depth-first search)

                        SOS branching                     Single-variable branching
Problem        Time    No.     Optimal  No.        Time    No.     Optimal  No.
               (sec.)  nodes   node     columns    (sec.)  nodes   node     columns
A.5.30.10       15.3      6       4        726       15.0      6       4       766
A.5.50.10      272.8     34      22       3312      228.6     28       6      3027
A.10.30.10       8.7      2       1        520       14.4     20       1       634
A.10.50.10      28.1      0       0       1069       27.9      0       0      1069
A.20.30.10       6.9      0       0        459        6.8      0       0       459
A.20.50.10      32.3      0       0       1269       32.2      0       0      1269
B.5.30.10       12.4      2       2        639       13.2      2       2       651
B.5.50.10      119.8      8       2       2093      147.3     10       3      2393
B.10.30.10      10.6      4       4        523        8.8      4       2       506
B.10.50.10      36.8      0       0       1292       36.8      0       0      1292
B.20.30.10      12.5      6       2        621       10.2      4       2       598
B.20.50.10      35.8     18       1       1132       31.9     14       3      1138
C.5.30.10       56.1     48      45       1072       40.8     30      29       969
C.5.50.10      593.7     60      60       4621     1800.1    178     178      7654
C.10.30.10      12.8      2       2        550       11.9      2       2       544
C.10.50.10      32.4      0       0       1124       33.9      0       0      1124
C.20.30.10      17.4     10      10        708       15.4      6       6       702
C.20.50.10      38.0      0       0       1328       38.1      0       0      1328
D.5.30.10       39.6     26       5        942       70.9     52      41      1351
D.5.50.10      489.6     70      51       3605     1083.3    162     153      5770
D.10.30.10      16.6      0       0        788       15.8      0       0       788
D.10.50.10     180.5     34      34       2317       78.6      4       2      1765
D.20.30.10      32.6      0       0        659       31.5      0       0      1308
D.20.50.10     136.7     10       8       2692      153.1     12      11      2760
In Table 8, the problem data were generated using a uniform distribution of ±20%, i.e., $d_{ij} \pm 20\%$ and $b_i \pm 20\%$, while in Table 9, ±50% was used for the realizations. No changes were made to the B&P components.
Table 6. Results with stabilization

Problem        Time    No.     Optimal  No.
               (sec.)  nodes   node     columns
A.5.30.10        7.8      2       1        419
A.5.50.10       61.1     10       5       1398
A.10.30.10       7.4      2       1        369
A.10.50.10      20.3      2       1        754
A.20.30.10       6.9      0       0        291
A.20.50.10      17.2      0       0        691
B.5.30.10        6.7      0       0        351
B.5.50.10       48.1      6       1       1183
B.10.30.10       6.5      0       0        335
B.10.50.10      17.3      0       0        700
B.20.30.10       8.6      2       1        376
B.20.50.10      15.3      0       0        612
C.5.30.10       20.6     24       9        836
C.5.50.10       69.6      4       2       1701
C.10.30.10       7.8      2       1        550
C.10.50.10      21.3      0       0        706
C.20.30.10      13.5      6       3        699
C.20.50.10      20.8      0       0        740
D.5.30.10       21.2     14       9        591
D.5.50.10      238.3     72      36       2340
D.10.30.10      10.6      0       0        448
D.10.50.10      43.0      6       3        968
D.20.30.10      13.7      0       0        589
D.20.50.10      48.2      4       2       1077
In both tables, the first column in the DEQ section gives the number of instances out of 10 that exceeded the two-hour limit in CPLEX. Those instances were excluded from the average runtime calculations. As can be seen, there are many DEQ instances that could not be solved within two hours, particularly for the problems with a greater number of scenarios. However, the B&P algorithm solved all instances within minutes. Not surprisingly, the results in Table 9, where the randomness has greater variability, tend to be more difficult
Table 7. Comparison of LP gaps and runtimes

                    Gap (%)               Time (sec.)
Problem        LP_DEQ    LP_BP         DEQ       B&P
A.5.30.10        0.25     0.00          0.7       6.4
A.5.50.10        0.25     0.00          0.9      53.6
A.10.30.10       0.77     0.00          0.8       6.5
A.10.50.10       0.60     0.00          1.0      16.1
A.20.30.10       0.32     0.00          0.9       5.5
A.20.50.10       0.11     0.00          1.0      14.9
B.5.30.10        0.24     0.00          0.8       6.4
B.5.50.10        0.34     0.04          0.9      51.9
B.10.30.10       1.00     0.00          0.8       5.6
B.10.50.10       0.00     0.00          0.9      16.7
B.20.30.10       0.95     0.00          0.9       6.6
B.20.50.10       0.00     0.00          0.9      12.9
C.5.30.10        2.11     0.29          3.3      18.5
C.5.50.10        1.05     0.02          3.1      45.5
C.10.30.10       4.53     0.08        212.9       6.7
C.10.50.10       1.67     0.00         94.4      20.8
C.20.30.10       7.33     0.06         46.1      12.6
C.20.50.10       2.68     0.00           –       18.3
D.5.30.10        0.54     0.11          3.1      28.7
D.5.50.10        0.21     0.04         38.2     108.5
D.10.30.10       1.18     0.00          3.1       9.2
D.10.50.10       0.37     0.01       1058.4     179.9
D.20.30.10       1.60     0.00          1.1      11.1
D.20.50.10       0.89     0.02           –       48.2
computationally, although problem instances of type D are an interesting exception.
7. Summary and Conclusions
The generalized assignment problem has been widely studied due to its usefulness and the fact that it often appears as a substructure in more complicated models. In this study, a column-generation approach was applied and a branch-and-price algorithm developed to solve a stochastic GAP in which the resource-capacity and resource-consumption coefficients are modeled as random variables. The performance of the algorithms was demonstrated in terms of computational effort for stochastic variants of a class of test problems from the literature.
Table 8. Runtimes for various scenarios (±20%)

                        DEQ                          B&P
Problem        No. instances  Time (sec.)    Time (sec.)  No. nodes  No. columns
A.20.60.10           1            0.4            38.3         5         595
A.20.60.20           0            0.6            54.1        12         633
A.20.60.30           0            0.8            58.1         7         688
A.20.60.40           0            0.9            62.3         5         777
A.20.60.50           0            1.2            66.1         7         616
A.20.60.60           0            1.4            63.8         3         419
A.20.60.70           0            1.6            89.4        10         716
A.20.60.80           0            1.7            88.5         5         682
A.20.60.90           0            1.8            86.4         3         611
A.20.60.100          1            2.0           103.3         7         696
B.20.50.10           0           82.3            34.7         3         476
B.20.50.20           1          146.0            49.6         7         435
B.20.50.30           0          505.2            59.8         6         339
B.20.50.40           1          175.1            55.3         3         488
B.20.50.50           0           40.4            60.0         2         491
B.20.50.60           1           49.2            65.3         2         417
B.20.50.70           0          111.0            66.3         2         490
B.20.50.80           1          138.0            71.0         3         414
B.20.50.90           1          222.0            80.8         5         348
B.20.50.100          0           94.6            86.6         5         357
C.20.30.10           1         1221.3            16.6         1         202
C.20.30.20           0         1573.8            18.8         1         151
C.20.30.30           0         1364.8            23.1         2         275
C.20.30.40           0         1747.5            22.6         1         106
C.20.30.50           0         1897.1            25.6         2         217
C.20.30.60           0         1888.3            26.9         2         270
C.20.30.70           0         2163.7            29.4         2         263
C.20.30.80           0         2128.0            31.6         1         171
C.20.30.90           0         2670.6            35.6         2         220
C.20.30.100          1         3324.3            36.3         2         217
D.20.30.10           1          505.9            38.1        11         610
D.20.30.20           0          577.2            45.2        11         492
D.20.30.30           0          570.3            48.5         8         624
D.20.30.40           0         1079.6            54.5         8         554
D.20.30.50           0         1594.9            69.3         9         693
D.20.30.60           0         1809.5            76.9        10         419
D.20.30.70           0         2092.0            94.2        11         624
D.20.30.80           0         2713.8            95.4        12         496
D.20.30.90           0         3744.8           119.7        18         502
D.20.30.100          1         3499.6           138.9        20         570
Table 9. Runtimes for various scenarios (±50%)

                        DEQ                          B&P
Problem        No. instances  Time (sec.)    Time (sec.)  No. nodes  No. columns
A.20.60.10           1            1.0            41.6         2         717
A.20.60.20           0           14.8            46.2         1         363
A.20.60.30           1          837.6            70.1         5         774
A.20.60.40           0          476.1            67.3         2         388
A.20.60.50           1          822.7            80.4         2         558
A.20.60.60           1         1084.7           104.1         4         685
A.20.60.70           1         1516.0           109.3         6         433
A.20.60.80           2         2006.9           118.3         6         619
A.20.60.90           2         1973.2           122.9         5         513
A.20.60.100          2         1968.5           142.3         6         625
B.20.50.10           0           67.3            27.2         1         238
B.20.50.20           3         2198.3            36.4         3         471
B.20.50.30           3         2259.3            48.6         6         574
B.20.50.40           3         2613.0            59.8         8         417
B.20.50.50           3         2902.9            66.2         7         491
B.20.50.60           5         4299.7            73.0         7         578
B.20.50.70           3         3373.4            65.8         2         387
B.20.50.80           4         3314.8            78.7         4         326
B.20.50.90           2         2892.4            86.3         3         426
B.20.50.100          3         3117.5            95.3         4         498
C.20.30.10          10         7249.7            21.9         6         408
C.20.30.20           6         5663.7            18.7         4         120
C.20.30.30           8         6524.3            21.3         1         196
C.20.30.40           9         7038.3            24.1         1         110
C.20.30.50           9         6938.0            28.4         1         107
C.20.30.60           7         6527.4            32.3         1         152
C.20.30.70           9         7012.0            39.2         2         156
C.20.30.80           9         7093.1            37.8         1         145
C.20.30.90          10         7213.9            41.9         1         164
C.20.30.100         10         7222.9            45.4         1         205
D.20.30.10           0          111.8            25.5         2         269
D.20.30.20           0           22.0            32.1         2         263
D.20.30.30           0           13.2            42.1         6         400
D.20.30.40           0           20.7            41.9         4         396
D.20.30.50           0           10.3            44.3         2         461
D.20.30.60           0           15.3            48.3         3         325
D.20.30.70           0           17.7            49.8         2         258
D.20.30.80           0           16.6            50.4         1         266
D.20.30.90           0           19.1            55.1         1         320
D.20.30.100          0           19.0            70.0         4         397
Our general conclusion from the research is that stochastic solutions can improve the objective function value by at least 5%, and by 50% for some problems when compared
to the mean-value solution. In the computations, using initial feasible columns was more effective than using a big-M method to construct an initial feasible master problem. Additional experiments showed that the B&P formulation provided much tighter LP bounds than the conventional LP relaxation bounds, and that the use of stabilization greatly improved convergence rates. All but one of the 24 problems had reduced runtimes under stabilization, averaging 65% longer without it. With respect to the enumeration scheme, branching on a subset of variables rather than on a single variable proved to be a slightly better strategy. We tried, but didn’t report, reduced cost fixing as it did not provide a substantial improvement on our test problems. Finally, we can say with confidence that the B&P algorithm solved the test problems more quickly than simply trying to solve the deterministic equivalent formulation directly. In the literature, it has been reported that solutions to the deterministic GAP can be found by CPLEX in reasonable time when the problem size is modest. However, the runtimes for the stochastic GAP increase significantly for the same number of jobs and agents, even when B&P is applied.
Acknowledgments This research was partially supported by the National Science Foundation through grant CMMI-0653916.
References

[1] M. Albareda-Sambola, M. H. van der Vlerk, and E. Fernández. Exact solutions to a class of stochastic generalized assignment problems. European Journal of Operational Research, 173:465–487, 2006.

[2] V. Balachandran. An integer generalized transportation model for optimal job assignment in computer networks. Operations Research, 24(4):742–759, 1976.

[3] J. F. Bard and S. Rojanasoonthon. A branch-and-price algorithm for parallel machine scheduling with time windows and job priorities. Naval Research Logistics, 53(1):24–44, 2006.

[4] C. Barnhart, E. L. Johnson, G. L. Nemhauser, M. W. P. Savelsbergh, and P. H. Vance. Branch-and-price: Column generation for solving huge integer programs. Operations Research, 46(3):316–329, 1998.

[5] D. Bertsimas and J. N. Tsitsiklis. Introduction to Linear Optimization. Athena Scientific, Belmont, MA, 1997.

[6] G. B. Dantzig and P. Wolfe. Decomposition principle for linear programs. Operations Research, 8(1):101–111, 1960.

[7] G. B. Dantzig and P. Wolfe. The decomposition algorithm for linear programs. Econometrica, 29:767–778, 1961.
[8] M. Desrochers, J. Desrosiers, and M. M. Solomon. A new optimization algorithm for the vehicle routing problem with time windows. Operations Research, 40(2):342–354, 1992.

[9] M. Desrochers and F. Soumis. A column generation approach to the urban transit crew scheduling problem. Transportation Science, 23(1):1–13, 1989.

[10] O. du Merle, D. Villeneuve, J. Desrosiers, and P. Hansen. Stabilized column generation. Discrete Mathematics, 194(1–3):229–237, 1999.

[11] M. L. Fisher, R. Jaikumar, and L. N. van Wassenhove. A multiplier adjustment method for the generalized assignment problem. Management Science, 32(9):1095–1103, 1986.

[12] B. Gavish and H. Pirkul. Algorithms for the multi-resource generalized assignment problem. Management Science, 37(6):695–713, 1991.

[13] P. C. Gilmore and R. E. Gomory. A linear programming approach to the cutting stock problem–Part II. Operations Research, 11(6):863–888, 1963.

[14] M. Guignard and M. B. Rosenwein. An improved dual based algorithm for the generalized assignment problem. Operations Research, 37(4):658–663, 1989.

[15] M. E. Lübbecke and J. Desrosiers. Selected topics in column generation. Operations Research, 53(6):1007–1023, 2005.

[16] R. E. Marsten, W. W. Hogan, and J. W. Blankenship. The boxstep method for large-scale optimization. Operations Research, 23(3):389–405, 1975.

[17] S. Martello and P. Toth. An algorithm for the generalized assignment problem. In J. P. Brans, editor, Operations Research '81, pages 589–603. North-Holland, Amsterdam, 1981.

[18] S. Martello and P. Toth. Knapsack Problems: Algorithms and Computer Implementations. John Wiley & Sons, Chichester, UK, 1994.

[19] M. G. Narciso and L. A. N. Lorena. Lagrangean/surrogate relaxation for generalized assignment problems. European Journal of Operational Research, 114(1):165–177, 1999.

[20] G. T. Ross and R. M. Soland. A branch-and-bound algorithm for the generalized assignment problem. Mathematical Programming, 8(1):91–103, 1975.

[21] G. T. Ross and R. M. Soland. Modeling facility location problems as generalized assignment problems. Management Science, 24(3):345–357, 1977.

[22] L. M. Rousseau, M. Gendreau, and D. Feillet. Interior point stabilization for column generation. Working paper, Université d'Avignon, Cedex 9, France, 2003.

[23] M. Savelsbergh. A branch-and-price algorithm for the generalized assignment problem. Operations Research, 45(6):831–841, 1997.
[24] E. F. Silva and R. K. Wood. Solving a class of stochastic mixed-integer programs with branch and price. Mathematical Programming, 108:395–418, 2006.

[25] D. R. Spoerl and R. K. Wood. A stochastic generalized assignment problem. Working paper, Naval Postgraduate School, Monterey, CA, 2006.

[26] B. Tokas, J. W. Yen, and Z. B. Zabinsky. Addressing capacity uncertainty in resource-constrained assignment problems. Computers & Operations Research, 33:724–745, 2006.

[27] M. Trick. A linear relaxation heuristic for the generalized assignment problem. Naval Research Logistics, 39(1):137–151, 1992.

[28] R. M. Van Slyke and R. J.-B. Wets. L-shaped linear programs with application to optimal control and stochastic programming. SIAM Journal of Applied Mathematics, 17(4):638–663, 1969.

[29] F. Vanderbeck and L. A. Wolsey. An exact algorithm for IP column generation. Operations Research Letters, 19(4):151–159, 1996.

[30] P. Wentges. Weighted Dantzig-Wolfe decomposition for linear mixed-integer programming. International Transactions on Operational Research, 4(2):151–162, 1997.
In: Computational Biology: New Research Editor: Alona S. Russe
ISBN: 978-1-60692-040-4 © 2009 Nova Science Publishers, Inc.
Chapter 4
RECONSTRUCTION AND ANALYSIS OF LARGE-SCALE PHYLOGENETIC DATA, CHALLENGES AND OPPORTUNITIES Toni Gabaldón1, Marina Marcet-Houben and Jaime Huerta-Cepas Bioinformatics Department, Centro de Investigación Príncipe Felipe, Avda. Autopista del Saler 16, 46012 Valencia, Spain
Abstract

The analysis of the evolutionary relationships among biological sequences, known as phylogenetics, constitutes one of the most powerful tools of computational biology. Besides its classical use to ascertain the evolution of a group of species, phylogenetics has many other applications, such as the prediction of the function of a protein and the detection of genes under specific selective constraints. The advent of the genome era has brought about the possibility of extending such analyses to larger sets comprising thousands of sequences from complete genomes. The use of whole genomes, rather than that of reduced sets of genes or proteins, opens the door to a wide range of new possibilities. On the other hand, however, it poses many conceptual and technical challenges that require the development of new algorithms to interpret and manipulate large-scale phylogenetic data. Here we survey recent progress in the development of automated pipelines to reconstruct and analyze large collections of phylogenetic trees and provide some examples of how they have been used to address important biological questions.
1. Introduction

Phylogenetics was initially defined as the study of the evolutionary relatedness of groups of organisms. This was formerly done by comparing morphological characters from different organisms that were thought to be homologous, i.e., they derived from a common ancestral trait. Later, the possibility of sequencing biological molecules such as proteins and genes
1. To whom correspondence should be addressed. E-mail: [email protected]. Telephone: 34 96 328 96 80; Fax: +34 96 328 97 01.
represented a revolution in the field of phylogenetics, since it made possible not only the use of molecular sequences to ascertain the evolution of organisms but also the study of the evolution of the genes and proteins themselves. Nowadays, phylogenetics is a broad field that has many different applications. Besides its classical use to reconstruct the evolution of a group of species, phylogenetics can help, for example, to predict the function of a gene, to establish correspondences among genes in different organisms, or to detect genes under specific selective constraints [1]. In recent years, phylogenetics has been facing a new challenge. The availability of complete genomic sequences from a growing number of organisms has paved the way to move from the evolutionary analysis of single protein families (phylogenetics) to that of complete genomes and proteomes (phylogenomics). To achieve this transition, new tools have been developed, ranging from the combination of the evolutionary information from many genes into a single phylogeny [2] to the large-scale reconstruction of thousands of phylogenetic trees in an automatic way [1, 3]. Such scaling up of phylogenetic analyses now constitutes an emerging field within bioinformatics. With the number of fully sequenced genomes currently approaching one thousand, and with a doubling time of roughly 15 months, the availability of sequence information is overwhelming. As a result, there is a constant need to develop new algorithms to reconstruct and interpret phylogenetic data. A typical phylogenetic analysis comprises several steps, usually including the selection of putative homologous sequences, the reconstruction of a reliable multiple sequence alignment, and the search for a phylogenetic tree that represents the evolutionary relationships of the sequences involved. A plethora of algorithms and computer programs have been developed to assist in each one of these phases and in the interpretation of the resulting phylogenetic trees. This computerization of the phylogenetic analysis is a complex task that faces three major challenges. First, the need for large computational resources in terms of computing time or memory imposes a constant need to develop faster algorithms and use increasingly complex computational settings. Second, the set of genes encoded in a genome is far from homogeneous, and a correct phylogenetic analysis may imply using specific evolutionary models or parameters for each type of gene. Selecting these parameters automatically requires the implementation of sophisticated heuristics. Finally, and perhaps most important, interpreting this type of complex data poses many difficulties and requires the development of novel algorithms, tools, forms of representing the data, and even new semantics and concepts. This chapter summarizes the main challenges and opportunities that are associated with the application of phylogenetics over large sequence datasets, such as those derived from whole genome sequences. To provide the reader with sufficient background, we will start with a brief description of the different parts of a standard phylogenetic pipeline and then discuss the technical challenges that arise when these techniques have to be applied over large datasets. Next, we will discuss the difficulties associated with the interpretation of the results from large-scale phylogenetic analyses.
Finally, we will illustrate the use of this type of analyses by providing some examples of the application of phylogenomics to tackle different biological questions.
2. The Standard Phylogenetic Analysis Pipeline

Although different studies may present particularities depending on the question addressed, most phylogenetic analyses do have a central methodological core that can be represented as a pipeline in which different analyses are applied in an ordered manner (Figure 1).
Figure 1. Schematic representation of a general phylogenetic pipeline. Each sequence from the seed genome is used to retrieve its candidate homologs in the rest of the species. Each group of sequences is aligned using a multiple alignment program, and the resulting alignment is trimmed to retain only the most informative columns. Finally, the alignments derived from every seed sequence are used to infer high-quality phylogenies, including evolutionary model testing and the estimation of gamma parameters. Iteration of this pipeline over every protein encoded in a genome can be used to derive a phylome.
This pipeline starts with a search for evolutionarily related sequences that will be included in the analysis. These sequences are then aligned to establish the correspondences between equivalent residues. Finally, this information is used to infer the evolutionary relationships between the sequences in the alignment, which are usually represented in the form of a hierarchical tree. Here, we will briefly describe the main components of a standard
phylogenetic pipeline, mentioning some of the most popular methods and programs (see Table 1 for useful on-line resources).
2.1. Homology Search
A phylogenetic analysis only makes sense in the context of sequences that are related through evolution. Therefore, a crucial first step consists of a search for putative homologous sequences. This homology relationship is usually inferred from the level of sequence similarity found between the sequences considered. Although theoretically it is not necessarily so, it is generally assumed that significantly high levels of sequence similarity between two sequences are indicative of common ancestry. This search for similar sequences can be performed by using local-alignment algorithms such as Smith-Waterman [4] or BLAST [5] to search public sequence databases such as those at NCBI (http://www.ncbi.nlm.nih.gov) or Ensembl (http://www.ensembl.org). Alternatively, locally stored databases can facilitate massive sequence searches. Similarity searches are not free of caveats. For instance, evolutionarily distant homologs are hard to detect, and high levels of similarity restricted to small portions of the sequences (e.g., protein domains) can mislead homology predictions. In order to minimize the effect of such problems, thresholds on the statistical significance of the alignments (e-values) and on the proportion of the query sequence that can be aligned with the hit should be used to filter out spurious hits. Profile-based or iterative searches such as those implemented in PSI-BLAST [6] or HMMER [7] increase the sensitivity of the similarity searches and serve to detect distantly related homologs. Another problem, which is not related to the search algorithm but rather to the database, is the level of sequence redundancy. In fact, even the so-called "non-redundant" versions of existing databases usually contain different sequences for the same proteins. The reason for this is that the automatic methods devised to get rid of redundant proteins are generally based on similarity thresholds. Some mutant versions and, most commonly, splice variants of the same protein might not be detected and will therefore remain in the database. The effect of this on the phylogenetic analyses should be evaluated, since it can lead to wrong conclusions, such as an over-estimation of the number of recent duplications. An operational solution to the problem of alternative splice forms is to include only the longest variant in the phylogenetic analysis.
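By way of illustration, the following Python sketch applies such filters to hits parsed from a BLAST tabular report; the column layout (BLAST's default tabular output) and the threshold values are assumptions chosen for the example rather than recommended settings.

```python
def filter_blast_hits(tab_file, query_lengths, max_evalue=1e-5, min_coverage=0.5):
    """Yield BLAST tabular hits passing e-value and query-coverage filters.

    tab_file: path to a tabular BLAST report (qseqid, sseqid, pident, length,
    mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore).
    query_lengths: dict mapping query identifiers to sequence lengths.
    """
    with open(tab_file) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            query, hit = fields[0], fields[1]
            qstart, qend = int(fields[6]), int(fields[7])
            evalue = float(fields[10])
            # Fraction of the query covered by the local alignment.
            coverage = (qend - qstart + 1) / float(query_lengths[query])
            if evalue <= max_evalue and coverage >= min_coverage:
                yield query, hit, evalue, coverage
```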
2.2. Multiple Sequence Alignments
Phylogenetic reconstruction relies on the correct identification of homologous residues in the different sequences involved. This is usually achieved by a multiple sequence alignment (MSA) phase that aims at placing homologous residues on top of each other by maximizing certain scores. These scores are usually computed from residue similarity matrices and a set of gap penalties. Multiple sequence alignment is a computationally intensive, NP-complete problem and, therefore, practical computations must rely on heuristic approaches to find (nearly) optimal solutions. Such challenges have led to the
development of numerous computer programs and algorithms for MSA [8]. The most popular programs include, among many others, ClustalW [9], T-Coffee [10] and MUSCLE [11]. MSA faces the problem that, even if statistically or mathematically optimal solutions are found, these do not necessarily coincide with the biologically optimal alignment. This, together with the fact that any heuristic may introduce errors in the final alignments, results in limited accuracies even for the best algorithms. Benchmark analyses that use sets of 3D structure-based, manually curated alignments such as BAliBASE [12] provide maximal accuracy values of 80-90%. To avoid using unreliable alignment regions, MSAs are usually trimmed by eliminating columns from regions that show low residue conservation or a high proportion of gaps. This was traditionally done manually, which hampers the reproducibility of the analysis, but nowadays programs such as Gblocks [13] or trimAl [14] can be used to trim the alignments automatically.
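In practice, the alignment and trimming steps are usually chained by driving the external programs from a script. The minimal sketch below assumes MUSCLE 3.x and trimAl are installed and uses their command-line flags as documented for those versions (flags differ in later MUSCLE releases); it illustrates the wiring, not a complete pipeline.

```python
import subprocess

def align_and_trim(fasta_in, aln_out, trimmed_out):
    """Align a FASTA file with MUSCLE, then trim the alignment with trimAl."""
    # MUSCLE 3.x syntax: -in/-out (newer versions use a different interface).
    subprocess.run(["muscle", "-in", fasta_in, "-out", aln_out], check=True)
    # trimAl's automated mode chooses the trimming parameters from the data.
    subprocess.run(["trimal", "-in", aln_out, "-out", trimmed_out,
                    "-automated1"], check=True)
```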
2.3. Phylogenetic Reconstruction
Once the sequences are aligned, a phylogenetic tree can be reconstructed based on the positional homology information contained in the alignment. There are three major approaches to phylogenetic estimation, namely distance methods, parsimony and statistical approaches such as Maximum Likelihood and Bayesian inference. We will provide a brief overview focusing on the suitability of each of these methods for large-scale implementations.
2.3.1. Maximum Parsimony
One of the classic methods to infer phylogenetic trees is Maximum Parsimony (MP), in which the preferred tree is the one that implies the minimal number of character changes along its branches. The characters can be any attribute that varies among the different elements included in the analysis. Such attributes can be morphological, physiological or even behavioral, but in the era of molecular phylogenetics the characters usually correspond to nucleotides or amino acids in a biological sequence. In an exhaustive MP approach, all possible tree topologies are evaluated in terms of the number of character changes needed to explain the data, and the one with the fewest changes is chosen as the preferred phylogeny. Computing all possible character-change scenarios over a large number of sequences is computationally expensive and, since the number of trees to evaluate grows exponentially with the number of sequences included, MP approaches over very large datasets are not usually feasible. Besides these high computational demands, MP suffers from other drawbacks, such as the difficulty of dealing with multiple substitutions or homoplasy. For all these reasons the use of MP is not common in large-scale analyses.
2.3.2. Distance Methods
Distance methods are among the most popular tree construction methods. They are by far the fastest approach to build phylogenies and provide reasonable accuracies in terms of topology [15]. Therefore, they have long been the method of choice to reconstruct phylogenies of large groups of taxa or to conduct bootstrap calculations. The main disadvantage of distance methods, however, is the possibility of getting stuck in poor local
minima. To reconstruct a phylogeny using a distance-based method, the evolutionary distances between all pairs of sequences involved must first be computed. One way to do this is to approximate the evolutionary distance by calculating the percentage of non-identical sites between the two sequences. This approach is fast, but it usually under-estimates the distances between distantly related sequences, due to the fact that multiple substitutions might have occurred at the same site. Therefore, distances are usually estimated from amino acid substitution matrices and then corrected to account for multiple substitutions. Once a distance matrix is obtained, several approaches can be followed to actually reconstruct the phylogenetic tree. Clustering methods such as UPGMA (Unweighted Pair-Group Method using Arithmetic means) [16] or Minimum Evolution were the first to be designed, but the Neighbour-Joining (NJ) algorithm [17] is nowadays the most extensively used. This algorithm reconstructs the tree by clustering neighboring sequences in a stepwise manner. At each step, multiple topologies are examined and the one with minimal branch lengths is chosen; for large datasets it is only able to examine a small proportion of the total number of possible topologies. NJ is known to be statistically consistent; therefore, if correct pair-wise distances are used, it reconstructs the true tree. Usually, however, distance estimation is prone to statistical errors, compromising the accuracy of the resulting tree.
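To make the distance-correction step concrete, the sketch below computes the observed proportion of differing sites between two aligned protein sequences and applies a simple Poisson correction, d = -ln(1 - p); this is one of the simplest corrections for multiple substitutions, shown here as an illustration in place of the matrix-based corrections used in real pipelines.

```python
import math

def corrected_distance(seq1, seq2):
    """Poisson-corrected distance between two aligned protein sequences."""
    # Compare only columns where neither sequence has a gap.
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    p = sum(1 for a, b in pairs if a != b) / float(len(pairs))
    if p >= 1.0:
        return float("inf")  # saturated: the correction is undefined
    return -math.log(1.0 - p)
```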
2.3.3. Statistical Methods: Maximum Likelihood and Bayesian Analyses
Statistical methods evaluate the appropriateness of different trees within a specific statistical framework. In brief, given an evolutionary model, they calculate the likelihood (or probability) that a given tree would have produced the observed sequence alignment. The evolutionary model is a set of probabilities for residue substitutions, a distribution of evolutionary rates and any other parameter that can be used to describe how a given set of sequences may have evolved. The main difference between Maximum Likelihood (ML) and Bayesian Inference (BI) methods is their specific statistical framework. While ML computes the likelihood that a given tree would have produced the alignment, in BI the posterior probabilities of the trees, conditional on that alignment, are considered by using Bayes' theorem. Since both approaches are NP-hard problems, performing exhaustive searches of the topology space is not possible even for moderately sized datasets, and different heuristics are applied. Most ML implementations use hill-climbing algorithms to search the tree space: changes in the topology are introduced at each step, and the likelihood of the resulting tree is evaluated; the algorithm stops when no rearrangement further improves the likelihood. Significant progress in ML computation has been achieved with the release of fast and accurate programs such as PhyML [18] and RAxML [19]. On the Bayesian side, the program MrBayes [20] is perhaps the most popular. It implements a Markov Chain Monte Carlo method to sample from the posterior probability distribution over the phylogenetic tree space.
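As a toy illustration of the likelihood principle, the following sketch evaluates the log-likelihood of two aligned DNA sequences as a function of the branch length separating them under the Jukes-Cantor model, and picks the best value by a crude grid search; the model, the uniform base frequencies and the grid resolution are deliberate simplifications compared with the heuristics used by real ML programs.

```python
import math

def jc_loglik(seq1, seq2, t):
    """Log-likelihood of two aligned DNA sequences at JC69 distance t."""
    p_same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    p_diff = 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)
    loglik = 0.0
    for a, b in zip(seq1, seq2):
        # Each site contributes the stationary frequency of the first state
        # times the transition probability to the second state.
        loglik += math.log(0.25 * (p_same if a == b else p_diff))
    return loglik

def ml_branch_length(seq1, seq2, grid=None):
    """Return the grid value of t maximizing the JC69 log-likelihood."""
    grid = grid or [i / 1000.0 for i in range(1, 3000)]
    return max(grid, key=lambda t: jc_loglik(seq1, seq2, t))
```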
3. Scaling up the Pipeline
The massive accumulation of sequence data during recent years has brought about the need for using the methods outlined above over large datasets. Phylogenetic inference requires large amounts of computing time and resources. This makes it difficult to apply phylogenetics over large datasets and poses many limitations on the type of analyses that can
be performed. In the following sections we will see how some of the mentioned approaches need to be adapted while others are simply unfeasible.
3.1. Balancing Speed and Accuracy
Different methodological solutions vary in their computational demands, but also in the accuracy of their results. A general trend is that speed is usually achieved at the cost of accuracy or resolution. Nevertheless, the level of resolution that can be achieved by faster, less accurate methods is sometimes sufficient for drawing robust conclusions. For example, when the hypothesis we are testing is related to the existence of a particular topology, the conclusions will not depend on an accurate estimation of branch lengths. Conversely, if our interest is focused on the specific evolutionary rates of the different sequences, our conclusions may be compromised by an incorrect estimation of branch lengths. Thus, the appropriate balance between speed and accuracy is specific to the question addressed.
Making the most of the available computing resources is always necessary in large-scale phylogenetics, and different strategies can be used to achieve this. First, significant reductions in time requirements may come from the development of faster phylogenetic algorithms or the use of faster computers. Alternatively, the parallelization of the computations and the use of distributed computers can also speed up the process. This latter strategy is especially appropriate for problems such as the reconstruction of trees with large numbers of sequences or sites, which require huge amounts of memory and many CPU cycles. In summary, before starting a large-scale phylogenetic analysis, careful planning is needed that considers, globally, the number of phylogenies to reconstruct, their size, the availability of computational resources and the expected quality of the results.
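Because gene families are mutually independent, the simplest form of such parallelization distributes families across processors. The sketch below assumes a hypothetical build_tree function that wraps whatever alignment and reconstruction programs are used for one family, and distributes the calls with Python's standard multiprocessing module.

```python
from multiprocessing import Pool

def build_tree(family_fasta):
    """Hypothetical per-family worker: align, trim and reconstruct one tree.
    In a real pipeline this would invoke the external programs discussed in
    Section 2 and return, e.g., a (family name, newick string) pair."""
    ...

def build_phylome(family_files, workers=8):
    # Each gene family is independent, so the problem is trivially parallel.
    with Pool(processes=workers) as pool:
        return pool.map(build_tree, family_files)
```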
3.2. Homology Searches
Currently, large-scale BLAST searches can be computationally scaled up without major problems by using distributed systems or parallel approaches. Sequence similarity search is one of the computing tasks in bioinformatics for which specific hardware has been developed. The Paracel BlastMachine, TimeLogic DeCypher and GeneMatcher2 are just some examples of computer devices specifically developed to perform large numbers of similarity searches in an optimized way. By using specially designed hardware circuits and/or optimized hardware configurations, these devices accelerate sequence similarity searches on large databases by several orders of magnitude, as compared to the use of equivalent numbers of standard computers.
3.3. Multiple Sequence Alignments
Multiple sequence alignment is usually not the time-limiting step in a phylogenetic pipeline. For example, aligning 26 protein sequences, of roughly 500 amino acids each, takes 3 seconds, 30 times less than reconstructing a Maximum Likelihood phylogenetic tree from
that alignment (on a 64-bit Linux PC, using MUSCLE [11] for the alignment and PhyML [18] for the phylogenetic reconstruction, both with default parameters). Therefore, we are not limited to the fastest implementations available, and one should always consider the use of slower but more accurate algorithms. For instance, the use of algorithms that include a final refinement phase to correct the alignment, such as MUSCLE [11], is highly advisable. We might decide, however, to discard slower algorithms if the gain in accuracy comes at too high a cost in computing time. For instance, in a recent benchmark T-Coffee achieved slightly better accuracies (73%) than MUSCLE (70%) but required 1,273 times more computing time [21]. Another aspect to consider is the memory requirements of the different algorithms, especially when large concatenated alignments are to be computed.
Regardless of the chosen alignment method, the quality of the alignments in large-scale pipelines is likely to be lower than that of smaller studies. The inclusion of large numbers of automatically selected sequences, as well as the use of standard parameters for all gene families, leads inevitably to higher proportions of unreliable regions in the final alignment. Thus, the inclusion of a trimming phase to automatically eliminate less reliable regions from the alignment is advisable. Programs such as Gblocks [13], which assist in the MSA trimming phase by selecting blocks of conserved regions, have become very popular. They have shown good performance in small to medium-scale datasets [22], but their use over very large datasets is sometimes hampered by the need to define, prior to the analysis, the set of parameters that will be used for all sequence families. A recently developed algorithm, implemented in trimAl [14], allows the trimming parameters to be adjusted automatically so as to optimize the final phylogenetic signal-to-noise ratio. This makes trimAl especially suited for large-scale phylogenomic analyses that involve thousands of large multiple sequence alignments. Another positive effect of analyzing trimmed alignments is that it results in faster phylogenetic analyses, simply because fewer positions have to be considered. Indeed, we [14] showed that, besides improved phylogenetic accuracies, reductions of 20-30% in computing time were achieved when using automatically trimmed alignments.
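The essence of automatic trimming can be conveyed by a deliberately naive rule: discard every alignment column whose fraction of gaps exceeds a cutoff. Gblocks and trimAl combine several criteria (conservation, block length, automated parameter selection), so the sketch below illustrates the principle only.

```python
def trim_columns(aligned_seqs, max_gap_fraction=0.5):
    """Remove alignment columns whose gap fraction exceeds the cutoff.

    aligned_seqs: list of equal-length aligned sequences (strings).
    """
    n = len(aligned_seqs)
    keep = [i for i in range(len(aligned_seqs[0]))
            if sum(s[i] == "-" for s in aligned_seqs) / float(n) <= max_gap_fraction]
    return ["".join(s[i] for i in keep) for s in aligned_seqs]
```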
3.4. Phylogenetic Reconstruction
Of the different phylogenetic reconstruction approaches, Maximum Parsimony is generally avoided in large-scale analyses for the reasons mentioned above. Distance methods, and in particular Neighbor-Joining, are perhaps the most widely used owing to their speed. However, there is general agreement that statistical approaches, such as Bayesian inference and Maximum Likelihood, are currently the most accurate methods. As a result, there is a growing tendency to incorporate these strategies into large-scale analyses, despite their high computational demands. This has been facilitated by recent progress in faster implementations of the algorithms. For instance, the development of new heuristics that reduce the searchable solution space, or that define more efficient ways to explore it, has resulted in fast implementations of ML and BI approaches in several programs [18-20]. In the case of ML, alternatives to the use of statistical re-sampling methods such as the traditional bootstrapping approach have been developed [23]. This provides approximate likelihood ratios as support values for the tree partitions and avoids the need to reconstruct hundreds
of phylogenies to obtain the bootstrap support values for a single tree. Finally, parallel implementations of BI [24] and ML [25] also allow their use on large computing clusters.
4. Challenges in the Analysis and Interpretation of Large-Scale Datasets
Together with the high computational costs associated with large-scale phylogenetic reconstruction, the biological interpretation of extensive collections of trees and alignments represents another important challenge in the field of phylogenomics. The inspection of thousands of phylogenies cannot be addressed manually and, therefore, automatic methods and algorithms for the interpretation of phylogenies are necessary. Several issues have to be taken into account in this direction; here we address some of the most important problems faced when analyzing large phylogenetic datasets.
4.1. Combining Information from Various Genes to Derive a Single Phylogeny
Phylogenetic trees based on a single gene family, also called gene trees, are useful in many areas. However, they have the drawback of presenting different topologies depending on the family used to derive the tree. This situation is especially problematic when the aim of the analysis is to define a single phylogeny that represents the evolutionary relationships among a group of species. A possible solution to this problem is to integrate the phylogenetic information from various genes into a single tree [24]. To achieve this, two main strategies may be used, namely the concatenation of multiple sequences and super-tree construction. Alternative methods such as genome trees based on gene content or gene order do not use sequence information directly and will not be considered here [26].
In principle, the concatenation of groups of genes allows for more robust and representative phylogenies by increasing the number of informative sites. The usual procedure consists of selecting a group of representative genes and concatenating them to create a super-alignment, which is then processed in the usual way to obtain a tree (a minimal sketch of this step is given at the end of this section). This combination of data amplifies the phylogenetic signal and increases the resolving power, for example, when the signal is masked by homoplasy [27]. In order to use gene concatenation, each gene should, ideally, be present in single copy in all species considered. As a result, the number of genes that can be used in such analyses decreases as the number of species included grows. To attenuate this effect, genes absent in a few genomes can be included by introducing gaps for the missing species, and methods to select one gene from among a few recent paralogs can be applied. Gene concatenation is also particularly sensitive to processes such as horizontal gene transfer, so gene families that may have undergone such processes should not be included in the super-alignment [28].
The major drawback of this methodology is that, for large groups of divergent species, the number of genes that can be used in the concatenation is very limited, representing a minimal fraction of the genes encoded in any given genome. For instance, in a recently reported tree of life including 191 species, only 31 genes were used [29]. This has raised
some criticism as to whether such a reduced number of genes can fairly represent the evolution of whole genomes [30]. Reduced gene sets are subject to biases caused by gene-sampling effects or by differences in gene lengths, since house-keeping, widespread genes, and especially those with long sequences, contribute a greater amount of information [31]. In contrast to trees of life representing a broad range of species, those that tackle the evolution of smaller sets of closely related organisms have many genes that are suitable for concatenation. In this case, we face a different problem, since the amount of information to handle in the phylogenetic analysis poses high computational demands.
An alternative strategy to integrate the information from different genes consists of combining the individual gene trees rather than their alignments. This strategy is often called "super-tree reconstruction" and comprises several slightly different methods [32]. The basis of many of these methods is a representation of the topology of the source trees in the form of a matrix, which is then optimized to obtain the super-tree. An advantage of this method over gene concatenation is that it is not limited to genes that are widespread. Moreover, the gene-tree set can even contain trees that share none of their species, as long as other trees in the set can link them together. The method, however, still does not allow for duplicated genes. Even though this limitation may pose a problem for eukaryotic datasets, super-tree methods generally allow the use of larger amounts of data than gene concatenation and, as a result, they include the phylogenetic information from a wider variety of genes. Despite the strengths outlined above, the approach has also been the focus of some criticism. For instance, it has been pointed out that super-trees only use sequence information indirectly, since they are based on tree topologies instead of primary character data [32]. Also, although the large amounts of data used allow this method to overcome difficulties caused by processes like lateral gene transfer, it is unable to fine-tune its topology, as the appearance of novel relationships between species is not supported by the raw data used to build the tree [33]. One special case of super-tree construction occurs when all trees contain the same species; in this case we can build what is called a consensus tree.
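As announced above, the concatenation step itself is straightforward to sketch. In the fragment below, per-gene alignments are represented as dictionaries mapping species names to aligned sequences, an input layout assumed purely for the example; species missing from a given gene are padded with gaps, as discussed earlier.

```python
def concatenate(alignments):
    """Build a super-alignment from a list of per-gene alignments.

    alignments: list of dicts {species: aligned sequence}; within each dict
    all sequences have the same length.
    """
    species = sorted(set().union(*(a.keys() for a in alignments)))
    concat = {sp: [] for sp in species}
    for aln in alignments:
        length = len(next(iter(aln.values())))
        for sp in species:
            # Pad species missing from this gene with gap characters.
            concat[sp].append(aln.get(sp, "-" * length))
    return {sp: "".join(parts) for sp, parts in concat.items()}
```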
4.2. Tree Comparison and Topological Pattern Search
Natural questions that arise when inspecting large datasets of phylogenetic trees include how similar the different trees are to each other, or what fraction of the trees supports a specific topology. There is a large variety of programs that compare tree topologies. Perhaps the quartet [34] and Robinson-Foulds [35] distances are the most commonly used. The quartet distance counts the number of quartets, that is, sub-trees induced by four leaves, that differ between two trees, whereas the Robinson-Foulds distance is based directly on the edge structure of the trees and their induced bipartitions. There are also some methods that take branch length information into account [15], but they cannot be directly applied to trees with different evolutionary rates; a recent implementation, however, allows this [36]. Most of the methods listed above are limited to sets of trees that have the same taxa. In practice, however, this is not usually the case, since gene loss and duplication are rampant. In these cases other methodologies need to be applied. One such method is tree reconciliation [37], which explains the differences between a gene tree and a species
tree in terms of gene duplications and gene losses. Other approaches are based on pruning the trees to retrieve subsets of shared taxa [38]. Related to the problem of comparing two trees is the need to develop algorithms that identify specific topological patterns within the trees. Some groups have implemented algorithms to search for specific topological patterns [39, 40] that are based on the examination of the possible tree partitions. Dufayard and colleagues [23] have implemented a similar algorithm that allows the user to define specific scenarios with the help of a graphical interface.
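As an illustration, the Robinson-Foulds distance between two trees can be computed in a few lines with the ETE toolkit (discussed further in Section 4.3); the sketch below assumes ETE version 3 and trees sharing the same leaf names, and the exact keys of the returned dictionary may differ between releases.

```python
from ete3 import Tree

t1 = Tree("((A,B),(C,D));")
t2 = Tree("((A,C),(B,D));")

# compare() returns a dictionary of similarity statistics, including the
# raw Robinson-Foulds distance and its normalized value in [0, 1].
stats = t1.compare(t2)
print(stats["rf"], stats["max_rf"], stats["norm_rf"])
```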
4.3. Phylogeny-Based Orthology Prediction
Orthologs are genes that diverged after a speciation event [41]. Compared to paralogs, which evolved through a duplication event, orthologous pairs of genes show a higher tendency to perform the same function. This is the reason why predicting them correctly has become so important [1]. There are many methods that predict whether two genes are orthologous to each other, most of them based on pair-wise similarities [42]. However, since the original definition of orthology is an evolutionary one [41], a prediction based on phylogeny seems more appropriate. Such an approach has traditionally been prevented by its high computational demands; only recently has it become possible to infer orthology relationships from phylogenetic trees on a large scale.
There are two main approaches to derive orthology relationships from phylogenetic trees, namely reconciliation and species-overlap methods. Reconciliation methods use a species tree as a model. When comparing a gene tree with the species tree, mismatching nodes can be identified. In the reconciliation phase, these nodes are considered to be duplication events that were subsequently followed by the necessary amount of gene loss to explain the observed differences. This approach will yield correct orthology predictions if the assumption that the gene and species trees are correct holds. In practice, however, this assumption is often violated. Another problem of this methodology is that it requires a fully-resolved species tree, something that is often not available. Some approaches, such as that of soft parsimony [43], address these problems by considering branch statistical supports and allowing for the use of partially unresolved species trees. An alternative way out of the problem of discrepancies between gene and species trees is provided by the so-called species-overlap methods. In this case, nodes are only considered duplications when their descendant branches share species. While simple, this method performs well [44] and has the advantage that the only information needed from the species phylogeny is that required to root the tree. Programs such as ETE [45] and LOFT [46] implement this type of algorithm.
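The species-overlap rule itself is simple enough to sketch directly. Below, each internal node of a rooted gene tree (handled with the ETE toolkit mentioned above) is labeled as a duplication when the species sets of its child branches intersect and as a speciation otherwise; the convention that the species name is the underscore-separated prefix of each leaf name (e.g. HUMAN_geneA) is an assumption made for the example.

```python
from ete3 import Tree

def label_events(gene_tree):
    """Mark internal nodes as duplication (D) or speciation (S) nodes
    according to the species-overlap criterion."""
    for node in gene_tree.traverse():
        if node.is_leaf():
            continue
        # Species set seen under each child branch of this node.
        child_species = [{leaf.split("_")[0] for leaf in ch.get_leaf_names()}
                         for ch in node.children]
        overlap = set.intersection(*child_species)
        node.add_feature("evoltype", "D" if overlap else "S")
    return gene_tree

t = label_events(Tree("(((HUMAN_A,MOUSE_A),HUMAN_B),YEAST_A);"))
```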
5. Addressing Biological Questions Through Phylogenomics
So far we have surveyed most of the important challenges that arise in the production and analysis of large-scale phylogenetic data. The picture, however, would not be complete without some illustrative examples of how this type of information has been used to address specific biological questions. Due to space limitations, we have selected only three
case studies from our own work that serve to illustrate the use of large-scale phylogenetics in three different types of biological questions.
5.1. The Human Phylome Project
The completion of the sequence of the human genome represented a breakthrough in the modern history of biology [47]. Since then, the genomes of a diverse group of eukaryotic model organisms have been sequenced. This has brought about the need to identify orthology relationships between our genes and those of the species used in the laboratory. Moreover, the availability of the complete set of human genes has also paved the way to investigate the evolution of our species from the perspective of each one of our genes. The aim of the human phylome project was to provide a complete collection of high-quality human gene phylogenies that could be used to derive genome-wide orthology predictions and to perform different evolutionary analyses [44]. It constituted a great technical challenge, since it involved the implementation of a high-quality pipeline, which included alignment trimming procedures, evolutionary model testing, and maximum likelihood and Bayesian phylogenetic inference. To achieve this, large computational resources had to be used, including the MareNostrum supercomputer at the Barcelona Supercomputing Center, at that time one of the top ten supercomputers in the world. The resulting phylogenies are available through the PhylomeDB database [45].
Besides producing such a collection of trees, the human phylome project also focused on some evolutionary analyses. First, a species-overlap algorithm (see Section 4.3) was used to identify all duplication and speciation events and subsequently derive a comprehensive set of orthology and paralogy relationships among the genomes involved. Next, all duplications were dated by using the topological information of the phylogeny and mapped to different evolutionary stages. Quantitative results showed a high ratio of duplications per gene before the radiation of vertebrates, which would be compatible with the existence of at least one round of whole-genome duplication at that time [48]. Moreover, by inspecting the functional annotations of the families duplicated at different evolutionary periods, a coherent parallelism emerged between the biology of the organismal groups and the over-represented functional terms associated with the genes duplicated at those periods [44]. Finally, by using tree-pattern search algorithms, some alternative evolutionary scenarios were contrasted with the whole phylome, quantifying the fraction of trees that were compatible with each of the scenarios. Three controversial scenarios were chosen, including the relative position of nematodes, chordates and arthropods; the relationships among rodents, primates and laurasiatherians; and, lastly, the grouping of opisthokonts with amoebozoans. In all cases, the results indicated a great topological diversity, without large differences in the fraction of trees supporting the two best hypotheses.
5.2. Reconstructing the Fungal Tree of Life
The problem of the level of congruence between species trees and gene trees was the focus of a more recent study [49]. In this case, we used fungi, one of the best-sampled groups
of eukaryotes in terms of complete genome sequences [50]. We first reconstructed the fungal tree of life using a standard alignment concatenation procedure and data from 60 sequenced genomes (Figure 2).
Figure 2. Phylogenetic tree representing the evolutionary relationships among 60 fungal species. The tree is based on the ML analysis of the concatenated alignment of 69 widespread proteins. The figure was built using the iTOL web interface [59]. Numbers on the nodes indicate two different types of support values. The first number indicates the phylome support for that node. An asterisk next to this number indicates that the topology obtained in the TOL is not the most common among the trees in the phylome. Whenever there is a second number (in bold), this indicates the bootstrap support. Only partitions that have a bootstrap support lower than 100 are indicated in bold. Branches with dashed lines indicate evolutionary relationships that are supported by less than 50% of the trees in the phylome.
Then, to investigate its genome-wide support, we reconstructed the Saccharomyces cerevisiae phylome. Using tree scanning algorithms, we compared each of the trees in the phylome with the topology found in the tree of life. The goal of this comparison was to see how many of the trees in the phylome were fully compatible with the tree of life. The results were somewhat startling, as only a small fraction of the trees in the phylome (7%) did not contradict the tree of life. This clearly shows the inability of trees based on a single gene to recover the true evolutionary history of the species. To investigate the effect of taxon sampling we repeated the experiment using smaller sets of species. Two of these sets were
densely sampled, including 21 Saccharomycotina (T21) and 12 Saccharomyces (T12a) species, respectively. Furthermore, we built a second tree with 12 species (T12b) sampled from the main clades of the fungal tree of life. In the phylome of 21 species we barely observed a change in the percentage of trees that were congruent with the tree of life. On the other hand, there was a slight improvement in the T12a comparison, which reached 13% complete congruence between the tree of life and the phylome. T12b doubled that figure, reaching 29% congruence. From these data we can deduce that, when using only one gene to build the tree, we have a better chance of recovering the true tree if we use a small number of distantly related species. Next, we investigated whether the incongruences affected tree nodes differentially. We found that while some nodes are very robust, presenting the same configuration in the majority of the phylome, others are extremely variable, accounting for the large divergence found in trees built from single genes. These results imply that such trees do represent, for most of the nodes, the strongest evolutionary relationships along the genome, but they also warn us that care should be taken with nodes that may be highly supported by bootstrap analyses yet poorly represented among the single-gene phylogenies. Interestingly, the phylome-TOL comparison provides a novel way to incorporate genome-wide information into species trees. Indeed, by introducing some multifurcations (or by marking poorly represented nodes, as in Figure 2) we obtain a tree of life that more accurately represents the evolutionary relationships that are robustly supported across the genome.
5.3. Reconstruction of Ancestral Proteomes
Inferences on extinct organisms have traditionally been limited to the information contained in the fossil record or to observations of derived extant species. More recently, phylogenetics has been used to explore the biochemical properties of ancestral sequences, a field known as ancestral sequence reconstruction [51], but its application is limited to small sets of proteins. Ideally, one would like to completely sequence the genome of the extinct organism of interest and then infer its properties as one would do with extant species with sequenced genomes. Sequencing the DNA of extinct species has now become feasible thanks to the combined use of current high-throughput sequencing techniques and metagenomic approaches [52], but its application is limited to relatively young specimens that have been preserved under exceptional circumstances. Most ancient genomes have likely disappeared forever and therefore other means are needed to gain insight into them. A novel technique, which we called ancestral proteome reconstruction [53], exploits large-scale phylogenetic analyses to reconstruct the protein repertoire encoded in ancestral organisms. The assumption behind this methodology is that the phylogeny of genes that are vertically inherited from the ancestral genome would be similar to the species phylogeny. In such cases, when a given set of proteins from different species is monophyletic and the tree topology is consistent with vertical descent from a common ancestral sequence, the presence of that protein family in the last common ancestor of the species considered can be assumed. This methodology was applied for the first time in the reconstruction of the proteome of the ancestor of mitochondria, a study that involved the automatic reconstruction and analysis of more than 20,000 trees [40].
Mitochondria are eukaryotic organelles that originated from the endosymbiosis of an α-proteobacterium [54]. The only information that had been used to propose the different hypotheses for this endosymbiotic origin [55-57] consisted of the few proteins that are still encoded by the small mitochondrial genome. To broaden this view, we reconstructed the phylomes of sequenced alpha-proteobacterial genomes to search for trees whose topology indicated a vertical descent of eukaryotic proteins from an alpha-proteobacterial ancestor [40, 58]. By mapping the functions encoded in the reconstructed proteome onto metabolic maps, the proto-mitochondrial metabolism could be reconstructed, allowing for a deeper understanding of the processes that initially fixed the mitochondrial endosymbiosis.

Table 1. Compilation of several on-line resources for phylogenomics
Ensembl: produces and maintains automatic annotation on selected eukaryotic genomes (http://www.ensembl.org)
TreeFam: a curated database of phylogenetic trees of animal gene families (http://www.treefam.org)
PhylomeDB: a database for genome-wide collections of gene phylogenies (http://phylomedb.bioinfo.cipf.es)
BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs (http://www-bio3d-igbmc.u-strasbg.fr/balibase/)
NCBI BLAST: runs BLAST programs on the NCBI databases (http://www.ncbi.nlm.nih.gov/blast/)
Inparanoid: a comprehensive database of eukaryotic orthologs (http://inparanoid.sbc.su.se/)
HOMSTRAD: a curated database of structure-based alignments for homologous protein families (http://tardis.nibio.go.jp/homstrad/)
iTOL: interactive Tree Of Life (http://itol.embl.de)
TreeBase: a database of phylogenetic information (http://www.treebase.org/)
Phylemon: a suite of web tools for molecular evolution, phylogenetics and phylogenomics (http://phylemon.bioinfo.cipf.es)
COG: Clusters of Orthologous Groups of proteins (http://www.ncbi.nih.gov/COG)
PHYLIP: a compilation of phylogenetic programs (http://evolution.genetics.washington.edu/phylip/software.html)
References
[1] Gabaldón T: Evolution of proteins and proteomes, a phylogenetics approach. Evolutionary Bioinformatics Online 2005, 1:51-56.
[2] Delsuc F, Brinkmann H, Philippe H: Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet 2005, 6:361-375.
[3] Eisen JA, Fraser CM: Phylogenomics: intersection of evolution and genomics. Science 2003, 300:1706-1707.
[4] Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147:195-197.
[5] Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215:403-410.
[6] Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25:3389-3402.
[7] Durbin R, Eddy SR, Krogh A, Mitchison G: Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids. Cambridge: Cambridge University Press; 1998.
[8] Notredame C: Recent evolutions of multiple sequence alignment algorithms. PLoS Comput Biol 2007, 3:e123.
[9] Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22:4673-4680.
[10] Notredame C, Higgins DG, Heringa J: T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 2000, 302:205-217.
[11] Edgar RC: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 2004, 5:113.
[12] Thompson JD, Koehl P, Ripp R, Poch O: BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins 2005, 61:127-136.
[13] Castresana J: Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol 2000, 17:540-552.
[14] Capella-Gutiérrez S, Silla-Martínez J, Gabaldón T: trimAl: a tool for automatic alignment trimming in large-scale phylogenetic analyses. (submitted) 2008.
[15] Kuhner MK, Felsenstein J: A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol Biol Evol 1994, 11:459-468.
[16] Sneath PH, Sokal RR: Numerical taxonomy. Nature 1962, 193:855-860.
[17] Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 1987, 4:406-425.
[18] Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 2003, 52:696-704.
[19] Stamatakis A, Ludwig T, Meier H: RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics 2005, 21:456-463.
[20] Ronquist F, Huelsenbeck JP: MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 2003, 19:1572-1574.
[21] Nuin PA, Wang Z, Tillier ER: The accuracy of several multiple sequence alignment programs for proteins. BMC Bioinformatics 2006, 7:471.
[22] Talavera G, Castresana J: Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst Biol 2007, 56:564-577.
[23] Dufayard JF, Duret L, Penel S, Gouy M, Rechenmann F, Perriere G: Tree pattern matching in phylogenetic trees: automatic search for orthologs or paralogs in homologous gene sequence databases. Bioinformatics 2005, 21:2596-2603.
[24] Altekar G, Dwarkadas S, Huelsenbeck JP, Ronquist F: Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics 2004, 20:407-415.
[25] Stamatakis A: RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 2006, 22:2688-2690.
[26] Snel B, Huynen MA, Dutilh BE: Genome trees and the nature of genome evolution. Annu Rev Microbiol 2005, 59:191-209.
[27] Brown JR, Douady CJ, Italia MJ, Marshall WE, Stanhope MJ: Universal trees based on large combined protein sequence data sets. Nat Genet 2001, 28:281-285.
[28] Wolf YI, Rogozin IB, Grishin NV, Tatusov RL, Koonin EV: Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evol Biol 2001, 1:8.
[29] Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P: Toward automatic reconstruction of a highly resolved tree of life. Science 2006, 311:1283-1287.
[30] Dagan T, Martin W: The tree of one percent. Genome Biol 2006, 7:118.
[31] Daubin V, Gouy M, Perriere G: A phylogenomic approach to bacterial phylogeny: evidence of a core of genes sharing a common history. Genome Res 2002, 12:1080-1090.
[32] Bininda-Emonds OR: The evolution of supertrees. Trends Ecol Evol 2004, 19:315-322.
[33] Bininda-Emonds OR: Supertree construction in the genomic age. Methods Enzymol 2005, 395:745-757.
[34] Estabrook G, McMorris F, Meacham C: Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Syst Zool 1985:193-200.
[35] Robinson D, Foulds L: Comparison of phylogenetic trees. Mathematical Biosciences 1981:131-147.
[36] Soria-Carrasco V, Talavera G, Igea J, Castresana J: The K tree score: quantification of differences in the relative branch length and topology of phylogenetic trees. Bioinformatics 2007, 23:2954-2956.
[37] Zmasek CM, Eddy SR: A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics 2001, 17:821-828.
[38] Puigbo P, Garcia-Vallve S, McInerney JO: TOPD/FMTS: a new software to compare phylogenetic trees. Bioinformatics 2007, 23:1556-1558.
[39] Esser C, Ahmadinejad N, Wiegand C, Rotte C, Sebastiani F, Gelius-Dietrich G, Henze K, Kretschmann E, Richly E, Leister D, et al: A genome phylogeny for mitochondria among alpha-proteobacteria and a predominantly eubacterial ancestry of yeast nuclear genes. Mol Biol Evol 2004, 21:1643-1660.
[40] Gabaldón T, Huynen MA: Reconstruction of the proto-mitochondrial metabolism. Science 2003, 301:609.
[41] Fitch WM: Distinguishing homologous from analogous proteins. Syst Zool 1970, 19:99-113.
[42] Koonin EV: Orthologs, paralogs, and evolutionary genomics. Annu Rev Genet 2005, 39:309-338.
[43] Berglund-Sonnhammer AC, Steffansson P, Betts MJ, Liberles DA: Optimal gene trees from sequences and species trees using a soft interpretation of parsimony. J Mol Evol 2006, 63:240-250.
[44] Huerta-Cepas J, Dopazo H, Dopazo J, Gabaldón T: The human phylome. Genome Biol 2007, 8:R109.
[45] Huerta-Cepas J, Bueno A, Dopazo J, Gabaldón T: PhylomeDB: a database for genome-wide collections of gene phylogenies. Nucleic Acids Res 2008, 36:D491-496.
[46] van der Heijden RT, Snel B, van Noort V, Huynen MA: Orthology prediction at scalable resolution by phylogenetic tree analysis. BMC Bioinformatics 2007, 8:83.
[47] McPherson JD, Marra M, Hillier L, Waterston RH, Chinwalla A, Wallis J, Sekhon M, Wylie K, Mardis ER, Wilson RK, et al: A physical map of the human genome. Nature 2001, 409:934-941.
[48] Ohno S: Evolution by gene duplication. London: Allen and Unwin; 1970.
[49] Marcet-Houben M, Gabaldón T: The tree versus the forest: the fungal tree of life and the topological diversity within the yeast phylome. (submitted) 2008.
[50] Galagan JE, Henn MR, Ma LJ, Cuomo CA, Birren B: Genomics of the fungal kingdom: insights into eukaryotic biology. Genome Res 2005, 15:1620-1631.
[51] Pupko T, Pe'er I, Hasegawa M, Graur D, Friedman N: A branch-and-bound algorithm for the inference of ancestral amino-acid sequences when the replacement rate varies among sites: Application to the evolution of five gene families. Bioinformatics 2002, 18:1116-1123.
[52] Noonan JP, Hofreiter M, Smith D, Priest JR, Rohland N, Rabeder G, Krause J, Detter JC, Paabo S, Rubin EM: Genomic sequencing of Pleistocene cave bears. Science 2005, 309:597-599.
[53] Gabaldón T, Huynen MA: Reconstruction of ancestral proteomes. In Ancestral sequence reconstruction. Edited by Liberles DA. Oxford: Oxford University Press; 2007:128-138.
[54] Gray MW, Burger G, Lang BF: Mitochondrial evolution. Science 1999, 283:1476-1481.
[55] Margulis L: Symbioses in Cell Evolution. San Francisco: W.H. Freeman; 1981.
[56] Kurland CG, Andersson SG: Origin and evolution of the mitochondrial proteome. Microbiol Mol Biol Rev 2000, 64:786-820.
[57] Martin W, Müller M: The hydrogen hypothesis for the first eukaryote. Nature 1998, 392:37-41.
[58] Gabaldón T, Huynen MA: From endosymbiont to host-controlled organelle: the hijacking of mitochondrial protein synthesis and metabolism. PLoS Comput Biol 2007, 3:e219.
[59] Letunic I, Bork P: Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation. Bioinformatics 2007, 23:127-128.
In: Computational Biology: New Research
Editor: Alona S. Russe
ISBN: 978-1-60692-040-4 © 2009 Nova Science Publishers, Inc.
Chapter 5
CHROMATIN FIBER: 30 YEARS OF MODELS
Julien Mozziconacci1 and Christophe Lavelle2
1 Laboratoire de Physique Théorique de la Matière Condensée, Université Pierre et Marie Curie, Paris, France (E-mail: [email protected]).
2 Interdisciplinary Research Institute, CNRS - USR 3078, Villeneuve d'Ascq, France (E-mail: [email protected]).
Abstract
A thorough understanding of the electrostatic, elastic and topological behaviour of DNA has provided relevant mechanistic insights into the regulation of genetic expression. Although this approach has proved valuable for the study of many biological processes, it is limited by the simple description level represented by DNA alone. Indeed, genomic DNA in eukaryotic cells is basically divided into chromosomes, each consisting of a single huge chromosomal fiber that is hierarchically supercoiled. Since this organisation plays a critical role in all processes involved in DNA metabolism, tremendous efforts have been made to build relevant models of chromatin structure and dynamics. Specifically, by shifting from a DNA (as a simple molecular polyelectrolyte) point of view to a chromatin (as a polymorphic supramolecular nucleoprotein complex) point of view, we should move towards a more efficient mechanistic framework in which the control of genetic expression and other DNA metabolism processes can be interpreted. This review gives a historical overview of the progress made in this field during the last 30 years and discusses what the most challenging issues are now.
Introduction
Since the elucidation of its structure by Franklin, Watson and Crick in 1953 (Watson & Crick, 1953), DNA has been seen as the long polymer carrying the genetic information (Avery et al, 1944). One of the primordial questions in biology today is: where is the regulatory information that controls gene expression encoded? In other words, what specifies which genes are transcribed while others are silenced? The complexity of this question has many origins, including the numerous actors at stake, the combination of their effects and the variety of time and space scales to be considered. It is now well acknowledged that the way
DNA is packaged and organised in the nucleus holds a significant part of the answer. Indeed, DNA is not naked in vivo but associated with basic proteins, the histones, to form a nucleoprotein complex named "chromatin" (Flemming, 1882). Chromatin inevitably holds a significant part of the regulatory information encoded in the nucleus, sometimes collectively referred to as "epigenetic" information. As was true for the elucidation of DNA structure and the genetic code, elucidation of chromatin structure and the epigenetic code obviously requires a combination of experimental and theoretical approaches. Amazingly, after more than 30 years of intense experimental (biochemical, biophysical) and modeling (from the first hand-made models to the most recent all-atom computations) efforts, chromatin structure has not lost much of its mystery (van Holde & Zlatanova, 2007; Widom, 1989). After presenting some critical data on the nucleosome, the basic structural unit of chromatin, we will present the so-called "chromatin 30 nm fiber", considered as the second step in DNA hierarchical folding into chromosomes. We will discuss the various models that have been developed to quantitatively understand the structure and dynamics of this fundamental nucleoprotein macromolecular complex carrying our genetic and (at least part of our) epigenetic information.
Nucleosome/Chromatosome as the Chromatin Fundamental Subunit
The nucleosome was identified more than thirty years ago (Kornberg, 1974; Noll, 1974; Olins & Olins, 1974; Oudet et al, 1975), basically from a combination of nuclease digestion and electron microscopy studies; see (Kornberg & Lorch, 1999; Olins & Olins, 2003) for historical perspectives. This particle consists of ~150 bp of DNA wrapped into two turns of a left-handed superhelix around an octameric core of two copies each of histones H2A, H2B, H3 and H4. Its structure has now been revealed at 1.9 Å resolution (Davey et al, 2002) (fig. 1a). Due to their general property as organizers of the eukaryotic genome, nucleosomes have long been the subject of intense efforts aimed at revealing their exact role in the controlled access of regulatory factors to target DNA sequences. Despite their common representation as simple cylinders, many studies suggest that nucleosomes in the cell nucleus spend a significant fraction of their life-time in various conformations, either structural or replacement isomers, or super- or sub-order structures of the canonical octameric form. Therefore, the "basic repeating unit" qualification of the early days has rapidly evolved towards a much more complex, polymorphic view of the nucleosome. In fact, far from being a repetitive entity of chromatin, each nucleosome has its own characteristics stemming from the existence of histone variants (Kamakaka & Biggins, 2005), histone post-translational modifications (Peterson & Laniel, 2004), potential transient conformational alterations (Lavelle & Prunell, 2007) and the sequence-dependent properties of the wrapped DNA (Sivolob et al, 2003). Adding further complexity to the nucleosome description, a ninth histone known as the "linker histone" (H1 or H5) associates dynamically with the nucleosome and seals the two turns of DNA through the interaction of its globular domain, while its C-terminal tail links the two proximal entry/exit DNA regions together into a stem (Hamiche et al, 1996b). This particle made of the nucleosome plus the linker histone was named the "chromatosome" (Simpson, 1978).
Linker histone association with the nucleosome is known to further compact the chromatin fiber (Bednar et al, 1998; Hannon et al, 1984). Several issues related to the linker histone are still debated (Zlatanova et al, 2007), for instance its exact binding mode on the nucleosome (Thomas, 1999), its role in nucleosome spacing (Woodcock et al, 2006) or the mechanism of its exchange in vivo (Misteli et al, 2000). The structural and functional dynamics of a nucleosome cover many aspects. The conformational dynamics encompass the fluctuation of the linker DNA at the entry/exit of the nucleosome, thought to depend both on the histone tails and on the DNA sequence (De Lucia et al, 1999; Sivolob et al, 2003; Sivolob & Prunell, 2004) and to have drastic effects on the topological properties of the chromatin fiber (Bancaud et al, 2006). Structural polymorphism concerns potential alterations of the nucleosome itself, leading on the one hand to transient deformations of the nucleosomal structure such as the lexosome (Prior et al, 1983) or the gaped nucleosome (Mozziconacci & Victor, 2003; Sivolob et al, 2003), and on the other hand to partially histone-depleted nucleosomes such as the hexasome (Kireeva et al, 2002), the tetrasome (Alilat et al, 1999; Hamiche et al, 1996a) or the hemisome (Dalal et al, 2007a; Dalal et al, 2007b). Lastly, one should also consider the positional dynamics, which encompass nucleosome sliding, either mediated by remodeling factors or resulting from spontaneous thermal movement. While such sliding seems necessary to occasionally free DNA from histone sequestration, the general aims and scope of this feature may be more subtle and are still under debate (Becker, 2002), as is the mechanism of this sliding (Flaus & Owen-Hughes, 2004).
Nucleosome Computation
Strikingly, the nucleosome must act both as a compaction and a regulation tool or, in other words, be reasonably stable while keeping some dynamic properties to allow chromatin opening for promoter access and subsequent transcription. Nucleosome stability, positioning and sliding are different aspects which commonly depend on the nucleosome free energy (Anselmi et al, 2000; Levitsky, 2004; Tolstorukov et al, 2008). While these aspects have attracted numerous analytical theoretical models, standard all-atom molecular dynamics simulation of a single nucleosome (a 23,000-atom structure) has so far remained out of reach, at least for times longer than a few picoseconds (Bishop, 2005). Therefore, various alternative approaches have been developed to model the nucleosome and nucleosomal DNA dynamics. In an attempt to characterize the interaction of the nucleosome core particle (NCP) with DNA, Kulic and Schiessel proposed a toy model in which DNA is a rope with a repulsive self-interaction energy (Kulic & Schiessel, 2004). This rope is wrapped on a spool (representing the NCP), and the strength of their interaction is simply proportional to the length of the DNA/NCP contact line. They mainly showed that DNA unwrapping under force is sudden and happens one full turn at a time, in agreement with previous measurements in single-molecule experiments (Brower-Toland et al, 2002). The group of Peng Ye Wang used Brownian dynamics to elucidate some aspects of nucleosomal DNA dynamics. They built a model in which DNA is a self-avoiding articulated chain of balls and the nucleosome is a cylinder interacting with DNA through a Morse potential. They used their model to study nucleosome formation and stability (Li et al, 2004), chirality (Li et al, 2005), sliding (Li et al, 2006) and stability issues related to histone biochemical modifications (Li et al, 2007).
Other all-atom approaches have also been developed to address DNA/nucleosome interactions. Using a combination of coarse-grained and all-atom methods, Korolev et al showed that lysine-rich histone tails are critical in mediating nucleosome/nucleosome and DNA/nucleosome interactions via salt bridges (Korolev et al, 2006; Korolev & Nordenskiold, 2007). Ruscio et al performed molecular dynamics simulations of the nucleosome and of free DNA in solution using implicit solvent methods; they reported that nucleosomal DNA is considerably more flexible than expected from single-molecule biophysical measurements (Ruscio & Onufriev, 2006). Two coarse-grained models were recently proposed in order to speed up the simulations and open new avenues for the study of nucleosomal particle dynamics (Sharma et al, 2007; Voltz et al, 2008). Both use beads covalently bound along the protein skeleton to represent residues, instead of an explicit all-atom structure. Finally, it is worth noticing two theoretical studies aimed at determining the chromatosome structure: the earlier one was built by homology modeling methods (Bharath et al, 2003), whereas the later one was proposed by combining biochemical results with molecular docking (Fan & Roberts, 2006). Interestingly, not only the linker histone but also the core histone tails are thought to bend linker DNA and influence the DNA path in nucleosomal arrays (Perico et al, 2006).
In chromosomes, nucleosomes are more or less regularly spaced on DNA, forming a beads-on-a-string array. They are not randomly distributed along the genome, and it is now clear that their precise locations participate in gene regulation. Several groups have attempted to determine nucleosome positions on genomes using computational approaches. The bottom-line idea in their models is to identify DNA sequences which have a higher tendency to bend, and therefore to be wrapped around a nucleosome core particle. Using a "wavelet transform microscope", Audit et al showed that long-range correlations can be detected in eukaryotic genomes and proposed that they encode specific nucleosome patterns (Audit et al, 2001; Audit et al, 2002; Vaillant et al, 2007). Two groups extended this approach in order to predict nucleosome locations from the underlying DNA sequence. Fraser et al used a physical Monte Carlo simulation to predict nucleosome positioning on the beta-lactoglobulin gene (Fraser et al, 2006; Gencheva et al, 2006), whereas Segal et al used a probabilistic method, based on experimental results on nucleosome positioning in yeast, to reveal nucleosome occupancy over the full yeast genome (Segal et al, 2006). Such in silico methods have been further developed to predict nucleosome positions in different organisms (Miele et al, 2008).
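The flavour of these sequence-based positioning approaches can be conveyed in a few lines. The sketch below is a deliberately naive illustration of the "bendability" idea, not a reproduction of any of the methods cited above: it detects dinucleotides commonly associated with DNA flexibility recurring with the ~10 bp helical period, and averages this signal over a nucleosome-sized window to obtain an occupancy-like profile. The dinucleotide set, the 10 bp period and the 147 bp window are assumed, illustrative parameters.

```python
import numpy as np

# Assumed, illustrative parameters (not taken from any of the cited methods)
FLEXIBLE = {"AA", "TT", "TA", "AT"}   # dinucleotides often linked to bendability
PERIOD = 10                           # approximate helical repeat of B-DNA (bp)
WINDOW = 147                          # DNA wrapped in one nucleosome core (bp)

def bendability_profile(seq):
    """1.0 where a 'flexible' dinucleotide starts, 0.0 elsewhere."""
    seq = seq.upper()
    return np.array([1.0 if seq[i:i + 2] in FLEXIBLE else 0.0
                     for i in range(len(seq) - 1)])

def positioning_score(seq):
    """Occupancy-like profile: windowed Fourier amplitude of the bendability
    signal at the 10 bp period (phase-invariant), one value per window."""
    b = bendability_profile(seq)
    carrier = np.exp(2j * np.pi * np.arange(len(b)) / PERIOD)
    window = np.ones(WINDOW)
    return np.abs(np.convolve(b * carrier, window, mode="valid")) / WINDOW

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq = "".join(rng.choice(list("ACGT"), size=1000))
    scores = positioning_score(seq)
    print("best window starts at", scores.argmax())
```

Real predictors replace this hand-crafted score with trained probabilistic models (Segal et al, 2006) or physically grounded energies and simulations (Fraser et al, 2006; Miele et al, 2008); the input signal, sequence-dependent bending preferences, is the same.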
Nucleosomal Arrays and the "30 nm Fiber"
Under appropriate conditions, the beads-on-a-string nucleosomal array (fig. 1b) can fold into a compact structure. The first images of condensed nucleosomal arrays showed fibrous structures of roughly 30 nm in diameter (Ris & Kubai, 1970). This so-called "30 nm fiber" is currently considered the next packaging level of DNA after the nucleosome, and is supposed to represent the most relevant form of interphase DNA as a physiological substrate for transcription, repair and replication. Before the nucleosome was characterized, only continuous supercoiled nucleo-histone models had been suggested (Pardon & Wilkins, 1972). Then, evidence for a discrete DNA-histone complex from nuclease digestion patterns (Hewish & Burgoyne, 1973), biochemical and X-ray diffraction studies (Kornberg, 1974; Kornberg & Thomas, 1974) and electron microscopy (Olins & Olins, 1974) dramatically changed the view of the
chromatin fiber and opened the way to dozens of experimental studies and theoretical proposals aiming at deciphering the in vivo folding of these "beads on a string" (fig. 1b).
Figure 1. A nucleosome and an extended nucleosomal array. The chromatin fiber basically consists of the compaction of a nucleosomal array, occurring spontaneously in physiological salt. How this folding precisely takes place is at the heart of much of the computational effort dedicated to this macromolecular structure. (a) Nucleosome crystal structure (as drawn from 1KX5 PDB coordinates). (b) Native nucleosomal array, extracted from CHO (Chinese hamster ovary) cells, observed in low salt. Bar: 100 nm. TEM facilities kindly provided by Eric Le Cam.
The first observations of isolated chromatin fibers led to the solenoid model, in which consecutive nucleosomes are stacked on each other (Finch & Klug, 1976; Thoma et al, 1979; Worcel & Benyajati, 1977) (fig. 2a, 2b, 2c and 2ei). Alternative models, in which each nucleosome is stacked on its second nearest neighbour on DNA, rapidly followed (Williams et al, 1986; Woodcock et al, 1984; Worcel et al, 1981) (fig. 2d and 2eii,iv). Note that some heterogeneous "superbeads" models were also proposed (Hozier et al, 1977; Kiryanov et al, 1976; Renz, 1979; Zentgraf & Franke, 1984; Zentgraf et al, 1980) (fig. 2eiii), but were rapidly eclipsed by the vast array of more convincing continuous models. Further experiments on isolated chromatin suggested that these fibers are left-handed helices of stacked nucleosomes (Williams et al, 1986). Recently, both electron micrographs and digestion data obtained from regular nucleosome arrays reconstituted in vitro gave strong support to a two-start helix organization of the chromatin fiber (Dorigo et al, 2004), soon confirmed by the X-ray structure of a tetranucleosome (Schalch et al, 2005) (fig. 2m). A strong indication that linkers cross-link the fiber in vivo has been obtained from ionizing radiation cutting patterns of DNA in mitotic chromosomes (Rydberg et al, 1998). According to all these pieces of evidence, a zig-zag crossed-linker model was believed to be the most relevant fiber structure. However, another problem soon arose: this structure provides a maximal compaction of 6 nucleosomes per 10 nm of fiber, which corresponds to the compaction found for isolated fibers in vitro (Williams et al, 1986) and may be assumed to be relevant in vivo during interphase, but falls far short of the linear density of the fiber in metaphase chromosomes. A simple calculation of the volume occupied by the fiber in a
mitotic chromosome showed that the fiber compaction must increase from 6 to 10 nucleosomes per 10 nm during the last compaction step of prophase (Daban, 2000; Daban & Bermudez, 1998) (fig. 2h). Models proposed to account for the compaction observed in metaphase chromosomes, as well as in highly dense heterochromatin (Daban & Bermudez, 1998; Grigoryev, 2004; Mozziconacci et al, 2006), will be discussed below.
Conventional biochemical and biophysical techniques used to study chromatin structure and dynamics have recently been complemented by an array of single-molecule approaches, in which chromatin fibers are investigated one at a time (Zlatanova & Leuba, 2003). Among them, micromanipulation (either with magnetic or optical tweezers) is a powerful technique to test single chromatin fiber assembly (Leuba et al, 2003) and the response to various mechanical constraints, either under tension (Bennink et al, 2001; Brower-Toland et al, 2002; Cui & Bustamante, 2000; Pope et al, 2005) or torsion (Bancaud et al, 2006; Bancaud et al, 2007). Notably, force measurements have revealed the existence of an internucleosomal attraction that maintains the higher-order chromatin structure under physiological conditions (Cui & Bustamante, 2000), and have shown the multi-step (and partially reversible) peeling of nucleosomal DNA (Brower-Toland et al, 2002; Pope et al, 2005). On the other hand, torsional manipulation of single chromatin fibers showed that the chromatin fiber can accommodate a surprisingly large amount of torsional stress, either negative or positive, without much change in its length (Bancaud et al, 2006; Bancaud et al, 2007). All these studies have greatly improved our quantitative knowledge of chromatin mechanical and topological properties. Based on all these data, 3D models were gradually developed. Although we cannot be exhaustive on such a broad and well-documented subject, we have tried here to discuss some of the most useful insights provided, at their time, by some cornerstone models.
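Returning to the volume argument above: a back-of-envelope version of it fits in a few lines. The sketch below is our own illustration, not Daban's actual computation; it asks what linear density a 30 nm fiber must have in order to reproduce a given local DNA concentration inside a metaphase chromosome. The concentration of 0.17 g/ml is of the order reported by Daban for metaphase chromosomes, and the repeat length, fiber diameter and packing fraction are assumed round numbers.

```python
import math

# Assumed round numbers, for illustration only
DNA_CONC = 0.17              # local DNA concentration in the chromosome (g/ml)
REPEAT_BP = 200              # nucleosome repeat length (bp per nucleosome)
FIBER_DIAMETER = 30.0        # fiber diameter (nm)
PACKING_FRACTION = 0.6       # fraction of chromosome volume filled by fiber
BP_MASS = 650.0 / 6.022e23   # average mass of one base pair (g)
NM3_TO_ML = 1e-21            # one cubic nanometre in millilitres

def required_linear_density():
    """Nucleosomes per 10 nm of fiber needed to reach DNA_CONC.

    DNA mass per nm of fiber axis:      (rho / 10) * REPEAT_BP * BP_MASS
    chromosome volume per nm of fiber:  cross_section / PACKING_FRACTION
    Setting their ratio equal to DNA_CONC and solving for rho.
    """
    cross_section = math.pi * (FIBER_DIAMETER / 2.0) ** 2          # nm^2
    vol_per_nm = cross_section / PACKING_FRACTION * NM3_TO_ML      # ml per nm
    mass_per_nucleosome = REPEAT_BP * BP_MASS                      # g
    return 10.0 * DNA_CONC * vol_per_nm / mass_per_nucleosome

print(f"~{required_linear_density():.0f} nucleosomes per 10 nm of fiber")
```

With these inputs the requirement comes out at roughly 9 nucleosomes per 10 nm, in line with the ~10 per 10 nm quoted above and well beyond the ~6 per 10 nm of isolated fibers in vitro.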
From Hard Models to In Silico Chromatin Fibers
For the first twenty years, chromatin fiber modelling was mainly achieved using schematic drawings and hard models. While schematic drawings basically help to interpret experimental results, hard models provide a more practical tool to test new hypotheses, but are obviously limited by their static nature. The development of computers and related 3D applications has allowed chromatin modelers to switch from hard models to in silico ones, leading to a wave of numerical models accumulating over the last ten years (fig. 2). The first plastic tube models were published by Worcel et al, for solenoidal (Worcel & Benyajati, 1977) (fig. 2b) or cross-linked (Worcel et al, 1981) fibers. They were followed by other hard models (Makarov et al, 1985; Williams et al, 1986) (fig. 2e), culminating surprisingly recently in a highly meticulous set of models made from copper wire (Engelhardt, 2007) (fig. 2p).
Figure 2. A collection of models (schematic drawings, hand-made models, computer-assisted geometrical and/or computational models) gathered from the abundant literature on this subject is presented here in chronological order. (a) Original solenoid model, adapted from (Finch & Klug, 1976). (b) Plastic-tube-built solenoid model, adapted from (Worcel & Benyajati, 1977). (c) Solenoid model, with H1 and salt-dependent folding, adapted from (Thoma et al, 1979). (d) Twisted helical ribbon model, adapted from (Woodcock et al, 1984). (e) Rope-built solenoid (i), twisted ribbon (ii), supranucleosomal particles (iii) and cross-linked (iv) models, adapted from (Williams et al, 1986). (f) Two-angle model, adapted from (Woodcock et al, 1993). (g) Idem, with linker histone, adapted from (Bednar et al, 1998). (h) First interdigitated model, adapted from (Daban & Bermudez, 1998). (i) Columnar schematic models for (telomeric) chromatin with short linkers, adapted from (Fajkus & Trifonov, 2001). (j) Monte-Carlo simulation, adapted from (Wedemann & Langowski, 2002). (k) Three-angle model (with nucleosome gaping), adapted from (Mozziconacci & Victor, 2003).
Figure 2. Continued. (l) "Molecularly coated" cross-linked model, adapted from (Schalch et al, 2005). (m) Monte-Carlo cross-linked compact model, adapted from (Besker et al, 2005). (n) "Molecularly coated" interdigitated model, adapted from (Robinson et al, 2006). (o) Mesoscopic model, adapted from (Arya et al, 2006). (p) Copper-wire-built cross-linked model, adapted from (Engelhardt, 2007). (q) Linker length-dependent, all-atom models, adapted from (Wong et al, 2007). (r) Monte-Carlo models with defects, adapted from (Diesinger & Heermann, 2008). (s) Idem, adapted from (Kepper et al, 2008).
The most popular geometrical model was published in 1993 (Woodcock et al, 1993) (fig. 2f). Woodcock et al proposed that the geometry of native chromatin fibers extracted from nuclei can be described using just two angles. The first angle (alpha) is the angle between the entry and exit linker directions of a single nucleosome and depends on the nucleosomal conformation; the second (beta) corresponds to the phasing angle between consecutive nucleosomes and depends on the DNA length joining the two nucleosomes, owing to the helical nature of the DNA molecule. This model assumes that DNA linkers are straight, due to the high bending stiffness of naked DNA. The two-angle model has been used through the following years as a toy model whose properties can be calculated analytically. In the early 2000's, two groups of theoretical physicists introduced DNA mechanics into this geometrical model. These studies showed that the chromatin fiber can adopt various structures and tune its mechanical properties, such as its rigidity in bending, twisting or stretching, mainly by changing the distance between consecutive nucleosomes along the DNA (Ben-Haim et al, 2001; Schiessel et al, 2001). Topological constraints were later introduced in the two-angle model, and a decompaction path at constant DNA linking number was found in the alpha/beta space (Barbi et al, 2005). The two-angle model has also been used in Monte-Carlo simulations to better understand the energetics of chromatin folding into a compact fiber (Besker et al, 2005; Diesinger & Heermann, 2008; Kepper et al, 2008; Wedemann & Langowski, 2002) (fig. 2j, 2m, 2r and 2s). The presence of the linker histone H1, and its effect on fiber compaction through the stem-like structure it induces at the entry/exit of each nucleosome, has been investigated mainly by modifying the way alpha is defined, as seen on electron microscopy images (Bednar et al, 1998) (fig. 2g). Despite the success of the two-angle model for theoretical calculations and systematic searches for specific structures (Ben-Haim et al, 2001; Besker et al, 2003; Schiessel et al, 2001), new experimental evidence challenged the idea that the DNA linker is straight in vivo. In 1998, Daban showed that the linear compaction of the chromatin fiber in metaphase chromosomes must be at least 11 nucleosomes per 11 nm, which is twice the density of the most compact structure that can be described using the two-angle model (Daban, 2003; Daban & Bermudez, 1998). Following this idea, he proposed a new model in which nucleosomes are stacked on each other, forming columns which are in turn coiled to form the 30 nm chromatin fiber (fig. 2h). In this purely geometric, space-filling model, bent linkers are added between nucleosomes and DNA mechanical properties are not taken into account. In another attempt to modify the two-angle model so as to attain the compaction of metaphase chromosomes, Mozziconacci et al proposed a conformational change of the nucleosome, similar to the gaping of an oyster (fig. 2k). This flexibility of the particle, which adds a new degree of freedom to the two-angle model, was shown to increase fiber compaction and to enhance the stacking interactions between neighbouring nucleosomes (Mozziconacci et al, 2006; Mozziconacci & Victor, 2003). To build physically realistic models of the DNA linker deformations, chain-like linker models were then developed.
The first such model including DNA mechanical constraints was used to explain direct physical micromanipulations of single fibers (Katritch et al, 2000). A more sophisticated model, explicitly including the electric charges on the surfaces of DNA and histones, was later developed. This so-called Discrete Surface Charge Optimization (DISCO) model was used to investigate electrostatic effects in the salt-induced compaction of the fiber and the role of histone tails in nucleosome/nucleosome interactions (Arya & Schlick, 2006; Arya et al, 2006; Beard & Schlick, 2001; Sun et al, 2005; Zhang et al, 2003) (fig. 2o).
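To make the two-angle description above concrete, the following sketch builds a fiber backbone from just alpha, beta and the linker length, treating each nucleosome as a point joint between straight linkers. Conventions vary between papers; here alpha is taken as the angle between the entry and exit linker directions at each nucleosome, and beta as the rotation of the bending plane between consecutive nucleosomes. This is an illustrative assumption, and the function names are ours, not a reproduction of any published implementation.

```python
import numpy as np

def rotate(v, axis, angle):
    """Rotate vector v about (unit) axis by angle, via Rodrigues' formula."""
    axis = axis / np.linalg.norm(axis)
    return (v * np.cos(angle) + np.cross(axis, v) * np.sin(angle)
            + axis * np.dot(axis, v) * (1.0 - np.cos(angle)))

def two_angle_fiber(n_nuc, alpha, beta, linker=10.0):
    """Positions of n_nuc nucleosomes (modelled as linker joints).

    alpha: entry/exit angle at each nucleosome (radians)
    beta:  phasing angle between consecutive nucleosomes (radians)
    linker: straight linker length (assumed constant, in nm)
    """
    pts = [np.zeros(3)]
    t = np.array([1.0, 0.0, 0.0])    # current linker direction
    n = np.array([0.0, 1.0, 0.0])    # normal spanning the bending plane
    for _ in range(n_nuc - 1):
        n = rotate(n, t, beta)                        # precess bending plane
        t = rotate(t, np.cross(t, n), np.pi - alpha)  # kink: exit linker makes
        n -= np.dot(n, t) * t                         #   angle alpha with entry
        n /= np.linalg.norm(n)                        # keep the frame orthonormal
        pts.append(pts[-1] + linker * t)
    return np.array(pts)

# Example: estimate the compaction of one (alpha, beta) choice by projecting
# the joints onto the principal axis of the resulting point cloud.
fiber = two_angle_fiber(100, alpha=np.radians(50.0), beta=np.radians(110.0))
axis = np.linalg.svd(fiber - fiber.mean(axis=0))[2][0]
rise = np.ptp(fiber @ axis)
print(f"~{1000.0 / rise:.1f} nucleosomes per 10 nm of fiber axis")
```

Scanning alpha and beta in such a loop reproduces the basic point made above: the achievable structures and their compaction are tightly constrained by the two angles, and top out well below the densities required in metaphase.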
However useful they proved for investigating chromatin dynamics, none of these models allowed fibers to reach a compaction comparable to that expected in metaphase chromosomes. In 2005, all these models (except Daban's) were indeed pointing to the same two-start "cross-linker" folding. This structure then received further, almost definitive support from two experiments carried out on reconstituted fibers in Tim Richmond's lab (Dorigo et al, 2004; Schalch et al, 2005) (fig. 2l). In 2006, Daniela Rhodes challenged this idea. She measured the dimensions of reconstituted fibers with varying linker lengths and showed that her results could not be explained by the two-start "cross-linker" model (Robinson et al, 2006). The only structure that could fit inside the measured dimensions was indeed found to be Daban's five-start structure (fig. 2n)! Even more intriguing was the finding that the fiber dimensions depend on linker length in a step-like way, the diameter increasing from 33 to 44 nm as the linker length exceeds 60 bp. These results led to the development of new fiber models. Wong et al recently proposed an explicit all-atom structure for each linker length used in Rhodes' experiment and concluded that all previously proposed models, including the two-start cross-linker structure, are relevant (Wong et al, 2007) (fig. 2q): it all depends on the length of the linkers connecting the nucleosomes. At the same time, Wu et al proposed an analytical calculation of the fiber dimensions (Wu et al, 2007). They used their formula to fit Rhodes' data and proposed that the chromatin fiber can switch from a helical ribbon to a cross-linker model depending on the linker length. Depken and Schiessel proposed a geometrical model in which nucleosomes are stacked. Based on the nucleosome shape, they proposed that chromatin fibers can be built from the wrapping of either 5 or 7 columns of nucleosomes, for linkers respectively shorter than 210 bp or longer than 184 bp (preprint available at http://arxiv.org/abs/0712.0973).
Why So Many Models?
The wealth of experimental data and of apparently contradictory models produced over the last 30 years for the 30-nm chromatin fiber illustrates the difficulty of observing and understanding this macromolecular structure. Indeed, for each model, data can be found that support or contradict it, which explains why no consensus picture exists yet (Tremethick, 2007; van Holde & Zlatanova, 2007). The difficulties come mainly from two sources. First, there is obviously an "extrinsic" problem due to experimental limitations: it is hard to obtain in vivo data, and isolated fibers should be regarded with caution (Woodcock, 1994). Second, there is an "intrinsic", and probably even more limiting, problem due to the fact that chromatin is by itself a polymorphic and dynamic structure. It is not only a convenient way to pack two meters of DNA into the nucleus, but definitely also has a regulatory role in DNA metabolism, and must therefore exhibit structural polymorphism and transient local structural rearrangements, both of which make it a difficult entity to grasp. Last but not least, although the ability of the 30 nm chromatin fiber to readily form in vitro makes it a convincing candidate for a distinct, biologically relevant higher-order chromatin structure, no evidence has so far clearly proven its existence in vivo (van Holde & Zlatanova, 1995). This notably led other kinds of condensed nucleosome phases, such as
liquid crystals, to be proposed as participants in chromosome compaction (Livolant et al, 2006) (see also the columnar packing in fig. 2i).
Conclusion
Since naked DNA does not exist in the eukaryotic nucleus, DNA metabolism is first and foremost chromatin metabolism, which makes the 30 nm chromatin fiber (assuming it exists in vivo!) a distinct level of cellular regulation. More than a "natural barrier" to DNA accessibility, chromatin has a functional role that relevant models should help us to understand. Chromatin faces electrostatic, elastic and topological constraints that have to be integrated into multiscale models including both structural and dynamical parameters (Langowski & Heermann, 2007; Lavelle & Benecke, 2006). The three-dimensional chromatin structure depends on distinct but highly coupled parameters: DNA sequence, nucleosome spacing (and the regularity of this spacing), histone modifications (through variant incorporation and/or post-translational modifications), nucleosome conformation (through DNA fluctuations at the entry/exit sites and potential deformations of the core particle itself), interactions within the fiber (through histone tail interactions and DNA elasticity and topology) and the non-histone proteins possibly present (e.g. HMG proteins, HP1 in heterochromatin, TRF1/2 in telomeres). Almost none of these issues are addressed by current models, which makes the quest for functional and predictive chromatin fiber models a long but surely exciting one.
References
Alilat M, Sivolob A, Revet B, Prunell A (1999) Nucleosome dynamics. Protein and DNA contributions in the chiral transition of the tetrasome, the histone (H3-H4)2 tetramer-DNA particle. J Mol Biol 291(4): 815-841
Anselmi C, Bocchinfuso G, De Santis P, Savino M, Scipioni A (2000) A theoretical model for the prediction of sequence-dependent nucleosome thermodynamic stability. Biophys J 79(2): 601-613
Arya G, Schlick T (2006) Role of histone tails in chromatin folding revealed by a mesoscopic oligonucleosome model. Proc Natl Acad Sci U S A 103(44): 16236-16241
Arya G, Zhang Q, Schlick T (2006) Flexible histone tails in a new mesoscopic oligonucleosome model. Biophys J 91(1): 133-150
Audit B, Thermes C, Vaillant C, d'Aubenton-Carafa Y, Muzy JF, Arneodo A (2001) Long-range correlations in genomic DNA: a signature of the nucleosomal structure. Phys Rev Lett 86(11): 2471-2474
Audit B, Vaillant C, Arneodo A, d'Aubenton-Carafa Y, Thermes C (2002) Long-range correlations between DNA bending sites: relation to the structure and dynamics of nucleosomes. J Mol Biol 316(4): 903-918
Avery OT, MacLeod CM, McCarty M (1944) Studies on the chemical nature of the substance inducing transformation of pneumococcal types. J Exp Med 79: 137-158
Bancaud A, Conde e Silva N, Barbi M, Wagner G, Allemand JF, Mozziconacci J, Lavelle C, Croquette V, Victor JM, Prunell A, Viovy JL (2006) Structural plasticity of single chromatin fibers revealed by torsional manipulation. Nat Struct Mol Biol 13(5): 444-450
Bancaud A, Wagner G, Conde e Silva N, Lavelle C, Wong H, Mozziconacci J, Barbi M, Sivolob A, Le Cam E, Mouawad L, Viovy JL, Victor JM, Prunell A (2007) Nucleosome chiral transition under positive torsional stress in single chromatin fibers. Mol Cell 27(1): 135-147
Barbi M, Mozziconacci J, Victor JM (2005) How the chromatin fiber deals with topological constraints. Phys Rev E Stat Nonlin Soft Matter Phys 71(3 Pt 1): 031910
Beard DA, Schlick T (2001) Modeling salt-mediated electrostatics of macromolecules: the discrete surface charge optimization algorithm and its application to the nucleosome. Biopolymers 58(1): 106-115
Becker PB (2002) Nucleosome sliding: facts and fiction. EMBO J 21(18): 4749-4753
Bednar J, Horowitz RA, Grigoryev SA, Carruthers LM, Hansen JC, Koster AJ, Woodcock CL (1998) Nucleosomes, linker DNA, and linker histone form a unique structural motif that directs the higher-order folding and compaction of chromatin. Proc Natl Acad Sci U S A 95(24): 14173-14178
Ben-Haim E, Lesne A, Victor JM (2001) Chromatin: a tunable spring at work inside chromosomes. Phys Rev E Stat Nonlin Soft Matter Phys 64(5 Pt 1): 051921
Bennink ML, Leuba SH, Leno GH, Zlatanova J, de Grooth BG, Greve J (2001) Unfolding individual nucleosomes by stretching single chromatin fibers with optical tweezers. Nat Struct Biol 8(7): 606-610
Besker N, Anselmi C, De Santis P (2005) Theoretical models of possible compact nucleosome structures. Biophys Chem 115(2-3): 139-143
Besker N, Anselmi C, Paparcone R, Scipioni A, Savino M, De Santis P (2003) Systematic search for compact structures of telomeric nucleosomes. FEBS Lett 554(3): 369-372
Bharath MM, Chandra NR, Rao MR (2003) Molecular modeling of the chromatosome particle. Nucleic Acids Res 31(14): 4264-4274
Bishop TC (2005) Molecular dynamics simulations of a nucleosome and free DNA. J Biomol Struct Dyn 22(6): 673-686
Brower-Toland BD, Smith CL, Yeh RC, Lis JT, Peterson CL, Wang MD (2002) Mechanical disruption of individual nucleosomes reveals a reversible multistage release of DNA. Proc Natl Acad Sci U S A 99(4): 1960-1965
Cui Y, Bustamante C (2000) Pulling a single chromatin fiber reveals the forces that maintain its higher-order structure. Proc Natl Acad Sci U S A 97(1): 127-132
Daban JR (2000) Physical constraints in the condensation of eukaryotic chromosomes. Local concentration of DNA versus linear packing ratio in higher order chromatin structures. Biochemistry 39(14): 3861-3866
Daban JR (2003) High concentration of DNA in condensed chromatin. Biochem Cell Biol 81(3): 91-99
Daban JR, Bermudez A (1998) Interdigitated solenoid model for compact chromatin fibers. Biochemistry 37(13): 4299-4304
Dalal Y, Furuyama T, Vermaak D, Henikoff S (2007a) Structure, dynamics, and evolution of centromeric nucleosomes. Proc Natl Acad Sci U S A 104(41): 15974-15981
Dalal Y, Wang H, Lindsay S, Henikoff S (2007b) Tetrameric structure of centromeric nucleosomes in interphase Drosophila cells. PLoS Biology 5(8): e218
Davey CA, Sargent DF, Luger K, Maeder AW, Richmond TJ (2002) Solvent mediated interactions in the structure of the nucleosome core particle at 1.9 Å resolution. J Mol Biol 319(5): 1097-1113
De Lucia F, Alilat M, Sivolob A, Prunell A (1999) Nucleosome dynamics. III. Histone tail-dependent fluctuation of nucleosomes between open and closed DNA conformations. Implications for chromatin dynamics and the linking number paradox. A relaxation study of mononucleosomes on DNA minicircles. J Mol Biol 285(3): 1101-1119
Diesinger PM, Heermann DW (2008) The influence of the cylindrical shape of the nucleosomes and H1 defects on properties of chromatin. Biophys J 94(11): 4165-4172
Dorigo B, Schalch T, Kulangara A, Duda S, Schroeder RR, Richmond TJ (2004) Nucleosome arrays reveal the two-start organization of the chromatin fiber. Science 306(5701): 1571-1573
Engelhardt M (2007) Choreography for nucleosomes: the conformational freedom of the nucleosomal filament and its limitations. Nucleic Acids Res 35(16): e106
Fajkus J, Trifonov EN (2001) Columnar packing of telomeric nucleosomes. Biochem Biophys Res Commun 280(4): 961-963
Fan L, Roberts VA (2006) Complex of linker histone H5 with the nucleosome and its implications for chromatin packing. Proc Natl Acad Sci U S A 103(22): 8384-8389
Finch JT, Klug A (1976) Solenoidal model for superstructure in chromatin. Proc Natl Acad Sci U S A 73(6): 1897-1901
Flaus A, Owen-Hughes T (2004) Mechanisms for ATP-dependent chromatin remodelling: farewell to the tuna-can octamer? Curr Opin Genet Dev 14(2): 165-173
Flemming W (1882) Zellsubstanz, Kern und Zelltheilung. Leipzig: Vogel
Fraser RM, Allan J, Simmen MW (2006) In silico approaches reveal the potential for DNA sequence-dependent histone octamer affinity to influence chromatin structure in vivo. J Mol Biol 364(4): 582-598
Gencheva M, Boa S, Fraser R, Simmen MW, Whitelaw CB, Allan J (2006) In vitro and in vivo nucleosome positioning on the ovine beta-lactoglobulin gene are related. J Mol Biol 361(2): 216-230
Grigoryev SA (2004) Keeping fingers crossed: heterochromatin spreading through interdigitation of nucleosome arrays. FEBS Lett 564(1-2): 4-8
Hamiche A, Carot V, Alilat M, De Lucia F, O'Donohue MF, Revet B, Prunell A (1996a) Interaction of the histone (H3-H4)2 tetramer of the nucleosome with positively supercoiled DNA minicircles: Potential flipping of the protein from a left- to a right-handed superhelical form. Proc Natl Acad Sci U S A 93(15): 7588-7593
Hamiche A, Schultz P, Ramakrishnan V, Oudet P, Prunell A (1996b) Linker histone-dependent DNA structure in linear mononucleosomes. J Mol Biol 257(1): 30-42
Hannon R, Bateman E, Allan J, Harborne N, Gould H (1984) Control of RNA polymerase binding to chromatin by variations in linker histone composition. J Mol Biol 180(1): 131-149
Hewish DR, Burgoyne LA (1973) Chromatin sub-structure. The digestion of chromatin DNA at regularly spaced sites by a nuclear deoxyribonuclease. Biochem Biophys Res Commun 52(2): 504-510
Hozier J, Renz M, Nehls P (1977) The chromosome fiber: evidence for an ordered superstructure of nucleosomes. Chromosoma 62(4): 301-317
Kamakaka RT, Biggins S (2005) Histone variants: deviants? Genes Dev 19(3): 295-310
Katritch V, Bustamante C, Olson WK (2000) Pulling chromatin fibers: computer simulations of direct physical micromanipulations. J Mol Biol 295(1): 29-40
Kepper N, Foethke D, Stehr R, Wedemann G, Rippe K (2008) Nucleosome geometry and internucleosomal interactions control the chromatin fiber conformation. Biophys J
Kireeva ML, Walter W, Tchernajenko V, Bondarenko V, Kashlev M, Studitsky VM (2002) Nucleosome remodeling induced by RNA polymerase II: loss of the H2A/H2B dimer during transcription. Mol Cell 9(3): 541-552
Kiryanov GI, Manamshjan TA, Polyakov VY, Fais D, Chentsov JS (1976) Levels of granular organization of chromatin fibres. FEBS Lett 67(3): 323-327
Kornberg RD (1974) Chromatin structure: a repeating unit of histones and DNA. Science 184(139): 868-871
Kornberg RD, Lorch Y (1999) Twenty-five years of the nucleosome, fundamental particle of the eukaryote chromosome. Cell 98(3): 285-294
Kornberg RD, Thomas JO (1974) Chromatin structure; oligomers of the histones. Science 184(139): 865-868
Korolev N, Lyubartsev AP, Nordenskiold L (2006) Computer modeling demonstrates that electrostatic attraction of nucleosomal DNA is mediated by histone tails. Biophys J 90(12): 4305-4316
Korolev N, Nordenskiold L (2007) H4 histone tail mediated DNA-DNA interaction and effects on DNA structure, flexibility, and counterion binding: a molecular dynamics study. Biopolymers 86(5-6): 409-423
Kulic IM, Schiessel H (2004) DNA spools under tension. Phys Rev Lett 92(22): 228101
Langowski J, Heermann DW (2007) Computational modeling of the chromatin fiber. Semin Cell Dev Biol 18(5): 659-667
Lavelle C, Benecke A (2006) Chromatin physics: Replacing multiple, representation-centered descriptions at discrete scales by a continuous, function-dependent self-scaled model. Eur Phys J E Soft Matter 19(3): 379-384
Lavelle C, Prunell A (2007) Chromatin polymorphism and the nucleosome superfamily.
Leuba SH, Karymov MA, Tomschik M, Ramjit R, Smith P, Zlatanova J (2003) Assembly of single chromatin fibers depends on the tension in the DNA molecule: magnetic tweezers study. Proc Natl Acad Sci U S A 100(2): 495-500
Levitsky VG (2004) RECON: a program for prediction of nucleosome formation potential. Nucleic Acids Res 32(Web Server issue): W346-349
Li W, Dou SX, Wang PY (2004) Brownian dynamics simulation of nucleosome formation and disruption under stretching. J Theor Biol 230(3): 375-383
Li W, Dou SX, Wang PY (2005) The histone octamer influences the wrapping direction of DNA on it: Brownian dynamics simulation of the nucleosome chirality. J Theor Biol 235(3): 365-372
Li W, Dou SX, Xie P, Wang PY (2006) Brownian dynamics simulation of directional sliding of histone octamers caused by DNA bending. Phys Rev E Stat Nonlin Soft Matter Phys 73(5 Pt 1): 051909
Li W, Dou SX, Xie P, Wang PY (2007) Brownian dynamics simulation of the effect of histone modification on nucleosome structure. Phys Rev E Stat Nonlin Soft Matter Phys 75(5 Pt 1): 051915
Livolant F, Mangenot S, Leforestier A, Bertin A, Frutos M, Raspaud E, Durand D (2006) Are liquid crystalline properties of nucleosomes involved in chromosome structure and dynamics? Philos Transact A Math Phys Eng Sci 364(1847): 2615-2633
Makarov V, Dimitrov S, Smirnov V, Pashev I (1985) A triple helix model for the structure of chromatin fiber. FEBS Lett 181(2): 357-361
Miele V, Vaillant C, d'Aubenton-Carafa Y, Thermes C, Grange T (2008) DNA physical properties determine nucleosome occupancy from yeast to fly. Nucleic Acids Res
Misteli T, Gunjan A, Hock R, Bustin M, Brown DT (2000) Dynamic binding of histone H1 to chromatin in living cells. Nature 408(6814): 877-881
Mozziconacci J, Lavelle C, Barbi M, Lesne A, Victor JM (2006) A physical model for the condensation and decondensation of eukaryotic chromosomes. FEBS Lett 580(2): 368-372
Mozziconacci J, Victor JM (2003) Nucleosome gaping supports a functional structure for the 30nm chromatin fiber. J Struct Biol 143(1): 72-76
Noll M (1974) Subunit structure of chromatin. Nature 251(5472): 249-251
Olins AL, Olins DE (1974) Spheroid chromatin units (v bodies). Science 183(122): 330-332
Olins DE, Olins AL (2003) Chromatin history: our view from the bridge. Nat Rev Mol Cell Biol 4(10): 809-814
Oudet P, Gross-Bellard M, Chambon P (1975) Electron microscopic and biochemical evidence that chromatin structure is a repeating unit. Cell 4(4): 281-300
Pardon JF, Wilkins MH (1972) A super-coil model for nucleohistone. J Mol Biol 68(1): 115-124
Perico A, La Penna G, Arcesi L (2006) Electrostatic interactions with histone tails may bend linker DNA in chromatin. Biopolymers 81(1): 20-28
Peterson CL, Laniel MA (2004) Histones and histone modifications. Curr Biol 14(14): R546-551
Pope LH, Bennink ML, van Leijenhorst-Groener KA, Nikova D, Greve J, Marko JF (2005) Single chromatin fiber stretching reveals physically distinct populations of disassembly events. Biophys J 88(5): 3572-3583
Prior CP, Cantor CR, Johnson EM, Littau VC, Allfrey VG (1983) Reversible changes in nucleosome structure and histone H3 accessibility in transcriptionally active and inactive states of rDNA chromatin. Cell 34(3): 1033-1042
Renz M (1979) Heterogeneity of the chromosome fiber. Nucleic Acids Res 6(8): 2761-2767
Ris H, Kubai DF (1970) Chromosome structure. Annu Rev Genet 4: 263-294
Robinson PJ, Fairall L, Huynh VA, Rhodes D (2006) EM measurements define the dimensions of the "30-nm" chromatin fiber: evidence for a compact, interdigitated structure. Proc Natl Acad Sci U S A 103(17): 6506-6511
Ruscio JZ, Onufriev A (2006) A computational study of nucleosomal DNA flexibility. Biophys J 91(11): 4121-4132
Rydberg B, Holley WR, Mian IS, Chatterjee A (1998) Chromatin conformation in living cells: support for a zig-zag model of the 30 nm chromatin fiber. J Mol Biol 284(1): 71-84
Schalch T, Duda S, Sargent DF, Richmond TJ (2005) X-ray structure of a tetranucleosome and its implications for the chromatin fibre. Nature 436(7047): 138-141
Schiessel H, Gelbart WM, Bruinsma R (2001) DNA folding: structural and mechanical properties of the two-angle model for chromatin. Biophys J 80(4): 1940-1956
Segal E, Fondufe-Mittendorf Y, Chen L, Thastrom A, Field Y, Moore IK, Wang JP, Widom J (2006) A genomic code for nucleosome positioning. Nature 442(7104): 772-778
Sharma S, Ding F, Dokholyan NV (2007) Multiscale modeling of nucleosome dynamics. Biophys J 92(5): 1457-1470
Simpson RT (1978) Structure of the chromatosome, a chromatin particle containing 160 base pairs of DNA and all the histones. Biochemistry 17(25): 5524-5531
Sivolob A, Lavelle C, Prunell A (2003) Sequence-dependent nucleosome structural and dynamic polymorphism. Potential involvement of histone H2B N-terminal tail proximal domain. J Mol Biol 326(1): 49-63
Sivolob A, Prunell A (2004) Nucleosome conformational flexibility and implications for chromatin dynamics. Philos Transact A Math Phys Eng Sci 362(1820): 1519-1547
Sun J, Zhang Q, Schlick T (2005) Electrostatic mechanism of nucleosomal array folding revealed by computer simulation. Proc Natl Acad Sci U S A 102(23): 8180-8185
Thoma F, Koller T, Klug A (1979) Involvement of histone H1 in the organization of the nucleosome and of the salt-dependent superstructures of chromatin. J Cell Biol 83(2 Pt 1): 403-427
Thomas JO (1999) Histone H1: location and role. Curr Opin Cell Biol 11(3): 312-317
Tolstorukov MY, Choudhary V, Olson WK, Zhurkin VB, Park PJ (2008) nuScore: a web-interface for nucleosome positioning predictions. Bioinformatics
Tremethick DJ (2007) Higher-order structures of chromatin: the elusive 30 nm fiber. Cell 128(4): 651-654
Vaillant C, Audit B, Arneodo A (2007) Experiments confirm the influence of genome long-range correlations on nucleosome positioning. Phys Rev Lett 99(21): 218103
van Holde K, Zlatanova J (1995) Chromatin higher order structure: chasing a mirage? J Biol Chem 270(15): 8373-8376
van Holde K, Zlatanova J (2007) Chromatin fiber structure: Where is the problem now? Semin Cell Dev Biol 18(5): 651-658
Voltz K, Trylska J, Tozzini V, Kurkal-Siebert V, Langowski J, Smith J (2008) Coarse-grained force field for the nucleosome from self-consistent multiscaling. J Comput Chem 29(9): 1429-1439
Watson JD, Crick FH (1953) Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature 171(4356): 737-738
Wedemann G, Langowski J (2002) Computer simulation of the 30-nanometer chromatin fiber. Biophys J 82(6): 2847-2859
Widom J (1989) Toward a unified model of chromatin folding. Annu Rev Biophys Biophys Chem 18: 365-395
Williams SP, Athey BD, Muglia LJ, Schappe RS, Gough AH, Langmore JP (1986) Chromatin fibers are left-handed double helices with diameter and mass per unit length that depend on linker length. Biophys J 49(1): 233-248
Wong H, Victor JM, Mozziconacci J (2007) An all-atom model of the chromatin fiber containing linker histones reveals a versatile structure tuned by the nucleosomal repeat length. PLoS ONE 2(9): e877
Woodcock CL (1994) Chromatin fibers observed in situ in frozen hydrated sections. Native fiber diameter is not correlated with nucleosome repeat length. J Cell Biol 125(1): 11-19
Woodcock CL, Frado LL, Rattner JB (1984) The higher-order structure of chromatin: evidence for a helical ribbon arrangement. J Cell Biol 99(1 Pt 1): 42-52
Woodcock CL, Grigoryev SA, Horowitz RA, Whitaker N (1993) A chromatin folding model that incorporates linker variability generates fibers resembling the native structures. Proc Natl Acad Sci U S A 90(19): 9021-9025
Woodcock CL, Skoultchi AI, Fan Y (2006) Role of linker histone in chromatin structure and function: H1 stoichiometry and nucleosome repeat length. Chromosome Res 14(1): 17-25
Worcel A, Benyajati C (1977) Higher order coiling of DNA in chromatin. Cell 12(1): 83-100
Worcel A, Strogatz S, Riley D (1981) Structure of chromatin and the linking number of DNA. Proc Natl Acad Sci U S A 78(3): 1461-1465
Wu C, Bassett A, Travers A (2007) A variable topology for the 30-nm chromatin fibre. EMBO Rep 8(12): 1129-1134
Zentgraf H, Franke WW (1984) Differences of supranucleosomal organization in different kinds of chromatin: cell type-specific globular subunits containing different numbers of nucleosomes. J Cell Biol 99(1 Pt 1): 272-286
Zentgraf H, Muller U, Franke WW (1980) Supranucleosomal organization of sea urchin sperm chromatin in regularly arranged 40 to 50 nm large granular subunits. Eur J Cell Biol 20(3): 254-264
Zhang Q, Beard DA, Schlick T (2003) Constructing irregular surfaces to enclose macromolecular complexes for mesoscale modeling using the discrete surface charge optimization (DISCO) algorithm. J Comput Chem 24(16): 2063-2074
Zlatanova J, Leuba SH (2003) Chromatin fibers, one-at-a-time. J Mol Biol 331(1): 1-19
Zlatanova J, Seebart C, Tomschik M (2007) Nap1: taking a closer look at a juggler protein of extraordinary skills. FASEB J 21(7): 1294-1310
In: Computational Biology: New Research Editor: Alona S. Russe
ISBN 978-1-60692-040-4 © 2009 Nova Science Publishers, Inc.
Chapter 6
FAST MODELLING OF PROTEIN STRUCTURES THROUGH MULTI-LEVEL CONTACT MAPS
Davide Baù∗, Ian Walsh∗, Gianluca Pollastri and Alessandro Vullo
School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland
∗ Contributed equally to this work
Abstract
We present an algorithm to reconstruct protein Cα traces from 4-class distance maps, and benchmark it on a non-redundant set of 258 proteins of length between 51 and 200 residues. We first represent proteins as contact maps, and show that even when exact maps are available, often only low-quality models can be obtained. We then adopt a more powerful simplification of distance maps: multi-class contact maps. We show that reconstructions based on 4-class native maps are significantly better than those from binary maps. Furthermore, we build two predictors of 4-class maps based on recursive neural networks: one ab initio, i.e. relying on the sequence and on evolutionary information; and one in which homology information is provided as a further input, showing that even very low sequence similarity to PDB templates yields more accurate maps than the ab initio predictor. We reconstruct Cα traces based on both ab initio and homology-based 4-class map predictions. We show that homology-based predictions are generally more accurate than ab initio ones, even when homology is dubious.
1. Introduction
Although a protein can be first characterised by its amino acid sequence, most proteins fold into three-dimensional structures that encode their function. Genomics projects leave us with millions of protein sequences, currently ≈ 6 × 10⁶, of which only a small fraction (≈ 2%) have their 3D structure experimentally determined. In the near future, we will probably have to rely upon structural genomics projects in order to bridge the huge gap between sequence and structure. The current high-throughput pipelines have to deal with serious
bottlenecks, e.g. a large fraction of targets are found to be unsuitable for structural determination with available methods [1]. Therefore, computational protein structure prediction remains an irreplaceable instrument for the exploration of sequence-structure-function relationships. This is especially important for analyses at the genome or inter-genome level, where informative structural models need to be generated for thousands of gene products (or a portion of them) in reasonable amounts of time. The fastest and most reliable methods for structure prediction rely on the transfer of knowledge between closely related proteins accumulated in sequence and structure databases – the field known as template-based modelling. The algorithms employed typically adopt heuristics based on sequence and/or structural similarity to model the unknown target structure based on known structures that are deemed homologous to it. Automating the modelling process is difficult: there are several stages and critical points in the design (choice of templates, creation of a correct structural alignment, etc.), and for some of them manual intervention is at least helpful. The accuracy of template-based techniques strongly depends on the amount of detectable similarity, thus preventing the reliable application of these methods to a significant fraction of unannotated proteins. This is the realm of so-called ab initio or de novo protein structure prediction, where models are predicted without relying on similarity to proteins of known structure. Ab initio techniques are obviously not as accurate as those based on templates, but the design in this case is generally much simpler. Moreover, improvements can be obtained by relying on fragment-based algorithms [2], which use fragments of proteins of known structure to reconstruct the complete structure of the target protein. A system for the prediction of protein structures ab initio is generally composed of two elements: an algorithm to search the space of possible protein configurations to minimise some cost function; and the cost function itself, composed of various constraints, either derived from physico-chemical laws or experimental results, or consisting of structural features (e.g. secondary structure or solvent accessibility) predicted by machine learning or other kinds of statistical systems [3]. We describe and benchmark all the components of a fully-automated system for protein structure modelling which is fast and simple in design (modular, with few stages). The same protocol is applied whether or not the unknown input protein shares significant levels of similarity with other proteins of known structure, and is based on two steps, solved efficiently. Given the input protein, we first encode information about the family of homologous sequences and, possibly, structures. Sequence information has the form of profiles extracted from multiple sequence alignments. Unlike the usual template-based methodology, there is no a priori choice of the best available templates used to model the unknown structure. For each position of the input sequence, structural information from putative templates (if present) is carefully weighted according to the quality of their respective structures and the amount of similarity. Based on sequence and structural information, we make inferences about the geometry of the unknown structure by predicting a set of soft constraints by machine learning. The unknown structure is found in the second and final stage.
Here, the system features the typical structure of ab initio methods, where modelling occurs as a result of searching the configurational space of 3D structures with a suitable potential or pseudo-energy function. At the current stage of development, the potential is a non-linear function of the soft constraints found in the first stage, with few parameters and simple enough to be globally optimised by quick Monte Carlo searches using a linear schedule. In order to keep
the simulations within manageable times, protein structures are represented by the trace of their backbone Cα atoms, bearing in mind that it is hard to derive a meaningful energy model for such a stripped-down representation of a protein. We overcome this problem by relying on informative geometrical constraints to discern native-like protein conformations from unfolded, or incorrectly folded, ones. The constraints predicted in the first stage have the form of residue-based pairwise distance attributes labelled into two or more classes. In the past, research has focussed on studying binary contact maps (i.e. two classes, contact or not). It is generally believed that binary maps provide sufficient information to unambiguously reconstruct native or near-native models [4]. Efforts have therefore been put into the prediction of this kind of distance restraints. Unfortunately, the success rates of the most promising techniques developed for this problem have not improved to satisfactory levels, despite years of attempts [5]. The reason for this is at least twofold. First, contact map prediction is an unbalanced problem, with far fewer contacts than non-contacts. Especially for long-range contacts (i.e. those between amino acids that are tens or hundreds of positions apart in the sequence), the ratio between negative and positive examples can exceed 100. Second, contact map predictors are generally ab initio, i.e. they do not exploit all available information. Another problem with binary contact maps is that, although it has long been stated that native maps yield correct structures, this is true only at a relatively low resolution (3-4 Å on average, in the best case).
In this chapter, we introduce a representation of protein structures based on a generalisation of binary contact maps, multi-class distance maps, and show that it is powerful and predictable with some success. Our tests suggest that multi-class maps, when using experimental restraints, are informative enough to quickly guide simple optimisation searches to nearly correct models - significantly better than with binary contact maps. We compare reconstructions based on binary and multi-class maps on a non-redundant set of 258 proteins of length between 51 and 200 residues. The reconstructions based on multi-class maps have an average Root Mean Square Deviation (RMSD) of roughly 2 Å and a TM-score of 0.83 to the native structure (4 Å and 0.65 for binary maps).
We then develop high-throughput systems for the prediction of multi-class contact maps, which exploit similarity to proteins of known structure, where available, in the form of simple structural frequency profiles from sets of PDB templates. We build two predictors of multi-class maps based on recursive neural networks: one ab initio, relying on the sequence and on evolutionary information; and one in which homology information is provided as a further input. We show that even very low sequence similarity to PDB templates (PSI-BLAST e-value up to 10) yields more accurate maps than the ab initio predictor. Furthermore, the predicted map is generally more accurate than the maps of the templates, suggesting that the combination of sequence and template information is more informative than templates alone. Finally, the optimisation search protocol we developed is benchmarked using both ab initio and homology-based multi-class map predictions.
We show that homology-based predictions are generally more accurate than ab initio ones even when homology is dubious, and that fair to accurate protein structure predictions can be generated for a broad range of homology to structures in the PDB. Using the current reconstruction protocol, hundreds of reconstructions for the same protein can be performed in a few minutes on current machines. On small clusters of machines it is possible to perform predictions on a genomic scale in a few hours for simpler organisms, or a few days for the most complex ones.
Figure 1. Full-atom three-dimensional structure of protein 1SF9. While computationally intense, the full-atom representation makes it possible to display all structural information, such as side-chain orientations and secondary structure elements.
2. Representing Protein Structures
Protein three-dimensional (3D) structures are fully represented by the coordinates of their atoms. For a protein with N atoms, 3N coordinates are then needed to describe its 3D structure (Figure 1). Although this is the ideal representation, it has the drawback of yielding a computationally intense model. Simplified representations have been proposed before, in which an amino acid is typically described by fewer points than the atoms it contains, thus reducing the degrees of freedom of the model. Typical simplified representations include backbone-only models, where all the side chain atoms are excluded, and virtual atom models, where each residue in the sequence is assigned a virtual (i.e. geometrical, not physical) point to represent a subset of its atoms [6]. At the extreme of the above cases are representations with only one point per amino acid, typically the Cα atom (see Figure 2) or the Cβ atom. This way, the degrees of freedom needed to represent a protein of N atoms and M amino acids are reduced to 3M, with M ≪ N. The prediction algorithm described in this work represents protein structures as the trace of their backbone Cα atoms, one for each amino acid of the sequence. An obvious advantage of this choice is its extreme simplicity, given that one order of magnitude fewer points are used, which yields two orders of magnitude fewer interactions than a full-atom model. It is worth noting that reliable full-atom models can generally be derived from Cα traces close to the native ones, for instance by refining them using molecular dynamics simulations, or by optimising detailed energy functions applied to full-atom models predicted from the backbone [7]. The real difficulty is to derive a meaningful energy model for a protein from which most of the details have been removed, so as to effectively explore the search space from random initial configurations.
Figure 2. Cα trace of protein 1SF9. Although computationally less intense than a full-atom model (it uses one order of magnitude fewer points, and yields two orders of magnitude fewer interactions), using a representation from which most of the native protein details have been removed has the drawback of making it hard to derive a meaningful energy model.
Here we overcome the problem by relying entirely on non-physical (i.e. geometrical) constraints to discern good (native-like) conformations from bad (unfolded, or incorrectly folded) ones. Although this is a simplified goal, we show that success in this task generally yields informative predictions. The potentials we develop in the next section are based on terms measured using another simplified representation of protein structures, the contact map. A protein's contact map belongs to a class of two-dimensional (2D) projections of 3D representations of geometrical objects. In the next subsection we give a brief description of contact maps, particularly focussing on multi-class ones, which are the representations our prediction algorithm relies on.
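As a practical aside, obtaining the Cα trace of an experimental structure is straightforward. The minimal sketch below reads Cα coordinates from the fixed-column ATOM records of a PDB-format file, keeping only the first model and the first alternate location; a production pipeline would rather use a full parser (e.g. Biopython's), and the file name in the example is hypothetical.

```python
import numpy as np

def ca_trace(pdb_path):
    """Return an (M, 3) array of Calpha coordinates from a PDB-format file."""
    coords = []
    with open(pdb_path) as handle:
        for line in handle:
            if line.startswith("ENDMDL"):       # keep only the first model
                break
            if not line.startswith("ATOM"):
                continue
            if line[12:16].strip() != "CA":     # atom name field
                continue
            if line[16] not in (" ", "A"):      # skip alternate locations
                continue
            # x, y, z are fixed-width fields in columns 31-54
            coords.append([float(line[30:38]),
                           float(line[38:46]),
                           float(line[46:54])])
    return np.array(coords)

# Example (hypothetical file name):
# trace = ca_trace("1sf9.pdb")   # -> one 3D point per residue, as in Figure 2
```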
Distance Matrix and Contact Map
Using two-dimensional projections of 3D objects is an attractive way of encoding geometrical information about protein structures, as these projections are scale and rotation invariant and do not depend on the coordinate frame. Therefore, 2D projections can be modelled as the output variable of learning or statistical systems trained in a supervised fashion, i.e. using samples of (input, target) pairs collected from structure databases. We call this encoding of a structure 2D because it can be graphically represented as a two-dimensional matrix, where the cells denote properties of pairs of objects in the 3D space. In the case of proteins, the geometrical relationship may involve fragments of the structure at different scales, using for instance amino acid [8] or secondary structure segment pairs [9], the former being much more common than the latter. Contact maps at 8 Å have been assessed as a special category at CASP for several years [5].
Figure 3. Distance matrix in greyscale image format. White is 0 Å and black is the maximum distance in the protein.
Geometrical relationships between amino acids can be expressed as a set of distance restraints, e.g. in the form L ≤ d(i, j) ≤ U, where d(i, j) is the distance between the residues in positions i and j, and L (resp. U) is a lower (resp. upper) bound on the distance. Restraints such as these can be experimentally determined, e.g. from NMR experiments. Indeed, algorithms for modelling protein structures from distance restraints are borrowed from the NMR literature and use, for instance, stochastic optimisation methods [4, 10], distance geometry [11, 12], and variants of them [13–15]. There is a trade-off between the resolution of the input restraints, e.g. the uncertainty with which they specify the property of the pairs, and the ability of the reconstruction algorithm to recover the correct model from these inputs. In the best case, the complete noise-free distance matrix is available, and the optimisation problem can be solved analytically by finding a 3D embedding of the 2D restraints. A distance matrix consists of the set {d(i, j)}_{i>j} of N(N − 1)/2 distances between any two points in positions i and j of a protein with N amino acids. Note how the distance matrix corresponds to the above form of constraints with the lower distance bound equal to the upper one. Figure 3 shows a greyscale picture of the distance matrix of the protein with PDB code 1ABV, where the distances are calculated between the Cα atoms. The distance matrix, or even detailed distance restraints, cannot be reliably determined by means of computational techniques, unless experimental data is available or there is strong homology to proteins with known structure. This is why past research has focussed on predicting representations of the distance matrix which are at the same time simpler to learn and able to retain substantial structural information. The contact map of a protein is a labelled matrix derived by thresholding the distance matrix and assigning labels to the resulting discrete intervals. The simplest alternative is the binary contact map, where one distance threshold t is chosen to describe pairs of residues that are in contact (d(i, j) < t) or not (d(i, j) ≥ t). The binary contact map can also be seen as the adjacency matrix of the graph with N nodes corresponding to the residues.
Figure 4. Different secondary structure elements like helices (thick bands along the main diagonal) and parallel or anti-parallel β-sheets (thin bands parallel or anti-parallel to the main diagonal) are easily detected from the contact map.
matrix of the graph with N nodes corresponding to the residues. Binary contact maps are popular as noise-tolerant alternatives to the distance map, and algorithms exist to recover protein structures from these representations [4, 16, 17]. Unfortunately, our studies and other empirical evidence indicate that recovering good-quality models even from the binary map of the native fold is difficult [17]. The definition of contact among amino acids is based on a single atom (normally Cα or Cβ) and depends on a geometrical threshold. This may be ambiguous in situations where other knowledge must be taken into account, for instance when the orientation of the side-chain is important. Although numerous methods have been developed for binary contact map prediction [18–24], improvements are only slowly occurring (e.g. in [21], as shown by the CASP6 experiment [25]). Accurate prediction is far from being achieved, and limitations of existing prediction methods have again emerged at CASP7 and from automatic evaluation of structure prediction servers such as EVA [26]. There are various reasons for this: the number of positive and negative examples (contacts vs. non-contacts) is strongly unbalanced; the number of examples grows with the squared length of the protein, making this a tough computational challenge; and capturing interactions that are long-ranged in the primary sequence is difficult, hence grasping an adequate global picture of the map is a formidable problem. Based on the above considerations, we believe that alternative representations of protein topologies are particularly appealing, provided that they are informative and, especially, predictable. Here we focus on a representation of the distance matrix called the multi-class contact map, based on a set of categorical attributes or classes. Each class corresponds to an interval of distances (in Å) into which a given pair of residues may fall. Formally, given a set of distance thresholds {tk}k=0...T (where t0 = 0 and tT = ∞), a multi-class
Figure 5. Distribution of contacts in [0,50] distance bins with trivial |i − j| ≤ 3 residue contacts ignored. The classes were chosen in order to retain good distance constraints and balanced classes. Class 1 ([0,8) Å) corresponds, as a first approximation, to physical contacts.
contact map of a protein with N amino acids is a symmetric N × N matrix C where the element corresponding to the amino acids in positions i and j is defined as Cij = k if d(i, j) ∈ [tk, tk+1). Obviously, this class of projections contains richer information than binary contact maps (so long as T > 3). Therefore, using multi-class contact maps is expected to improve the resolution of reconstruction algorithms on geometrical constraints. Moreover, if a suitable set of distance thresholds is chosen, the number of instances in each class may be kept approximately balanced, which in turn may improve the generalisation performance of learning algorithms over the (normally unbalanced) binary prediction case. For our experiments, we derived a set of five distance thresholds to define multi-class contact maps based on four distance intervals. As shown in Figure 5, the four classes are empirically chosen from the distribution of distances among amino acids in the training set, ignoring trivial pairs |i − j| ≤ 3 and trying to keep the distance constraints informative and the classes as balanced as possible. The resulting set of thresholds is {0, 8, 13, 19, ∞}, which defines distance intervals corresponding to short ([0, 8)), medium ([8, 13), [13, 19)) and long-ranged interactions among amino acids. A potential improvement beyond this choice is to automatically determine an optimal set of thresholds based on some criterion, e.g. the reconstruction ability on a set of benchmarking proteins.
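As an illustration of this representation, the following Python sketch computes a Cα distance matrix and derives binary and multi-class contact maps from it, using the thresholds above; the function and variable names are ours, not part of any published tool.

```python
import numpy as np

THRESHOLDS = np.array([0.0, 8.0, 13.0, 19.0, np.inf])  # {t_k}, t_0 = 0, t_T = inf

def distance_matrix(ca_coords):
    # All pairwise Ca-Ca Euclidean distances; ca_coords is an (N, 3) array.
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def multiclass_contact_map(ca_coords, thresholds=THRESHOLDS):
    # C_ij = k iff d(i, j) in [t_k, t_k+1): the 4-class map described above.
    d = distance_matrix(ca_coords)
    return np.searchsorted(thresholds, d, side='right') - 1

def binary_contact_map(ca_coords, t=12.0):
    # Binary map: 1 where d(i, j) < t, 0 otherwise.
    return (distance_matrix(ca_coords) < t).astype(int)
```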
3. Modelling Structures with Contact Maps
We predict protein models by solving a global optimisation problem, where a function (pseudo-energy) is minimised by searching the configurational space of 3D structures. The pseudo-energy function we use to guide the search is designed in a way that allows us to
solve an unconstrained minimisation problem by a simple simulated annealing protocol. More specifically, the pseudo-energy function measures the degree of match of a protein conformation to the constraints encoded in the contact map (binary or multi-class) predicted in the first stage. In the following, we describe the set of moves used to explore the configurational space and the different forms of potential functions used respectively for binary and multi-class contact maps.
3.1. Optimisation Algorithm
The algorithm we use for the reconstruction of the coordinates of protein Cα traces is organised in two sequential phases, bootstrap and search. The function of the first phase is to generate an initial, physically realisable configuration. A random structure is created using a self-avoiding random walk and explicit modelling of predicted helices, by adding Cα positions one after the other until an initial draft of the whole backbone is obtained. More specifically, this part runs through a sequence of N steps, with N being the length of the input chain. At stage i, the position of the i-th Cα is computed as ri = ri−1 + d·r/|r|, where d ∈ [3.733, 3.873] and r is a random direction vector. Both d and r are uniformly sampled. If the i-th residue is predicted at the beginning of a helix, all the following residues in the same segment are modelled as an ideal helix with random orientation. In the search step, the algorithm refines the initial bootstrapped structure by global optimisation of a pseudo-potential function using local moves and a simulated annealing protocol. Simulated annealing is a good choice in this case, since the constraints obtained from various predictions are in general not realisable and contradictory. Hence the need for a "soft" method that tries to enforce as many constraints as possible, never terminates with failure, and is robust with respect to local minima caused by contradictions. The search strategy is similar to that in [4], but with a number of modifications. At step t of the search, a randomly chosen Cα atom at position ri(t) is displaced to a new position ri(t+1) by a crankshaft move (Figure 6), leaving all the other Cα atoms of the protein in their original position. Secondary structure elements are displaced as a whole, without modifying their geometry (Figure 7). The move in this case has one further degree of freedom, the helix rotation around its axis; this is assigned randomly and uniformly distributed. A new set of coordinates S(t+1) is accepted as the best next candidate with probability p = min(1, e^(ΔC/T(t))) defined by the annealing protocol, where ΔC = C(S(t), M) − C(S(t+1), M) and T(t) is the temperature at stage t of the schedule.
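A minimal sketch of the bootstrap walk and the annealing acceptance rule is given below, assuming a hard-core self-avoidance distance of our own choosing (the chapter does not state one) and omitting the explicit helix modelling:

```python
import numpy as np

rng = np.random.default_rng()

def bootstrap_trace(n, min_sep=4.0, max_tries=1000):
    # Grow a Ca trace one residue at a time: r_i = r_{i-1} + d * r/|r|,
    # with d uniform in [3.733, 3.873] A and r a uniform random direction
    # (a normalised Gaussian vector). Candidate positions clashing with
    # earlier atoms are re-drawn; min_sep is an assumed hard-core distance.
    coords = np.zeros((n, 3))
    for i in range(1, n):
        for _ in range(max_tries):
            v = rng.normal(size=3)
            v /= np.linalg.norm(v)
            cand = coords[i - 1] + rng.uniform(3.733, 3.873) * v
            if i < 2 or (np.linalg.norm(coords[:i - 1] - cand, axis=1) > min_sep).all():
                break
        coords[i] = cand
    return coords

def accept_move(delta_C, T):
    # Annealing acceptance: p = min(1, exp(dC/T)),
    # with dC = C(S_t, M) - C(S_{t+1}, M).
    return rng.random() < min(1.0, np.exp(delta_C / T))
```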
3.2. Pseudo-energy Function
Let Sn = {ri}i=1...n be a sequence of n 3D coordinates, with ri = (xi, yi, zi) the coordinates of the i-th Cα atom of a given conformation related to a protein p. Let DSn = {dij}i<j, with dij = ‖ri − rj‖2, be the corresponding set of n(n − 1)/2 mutual distances between Cα atoms. A first set of constraints comes from the (predicted) contact map and depends on the type of contact map, i.e. binary (see section 3.2.1.) or multi-class maps (see section 3.2.2.). The representation of protein models induces the constraints B = {dij ∈ [3.733, 3.873], |i − j| = 1}, encoding bond lengths, and another set
Figure 6. Crankshaft move: a randomly chosen Cα atom at position ri(t) is displaced to the new position ri(t+1), leaving all the other Cα atoms of the protein in their original position.
Figure 7. Secondary structure elements are displaced as a whole, without modifying their geometry.
C = {dij ≥ DHC, i ≠ j} for clashes, where DHC is a hard-core clash distance. The set M, i.e. the union of the contact map, bond-length and clash constraints, defines the configurational space of physically realisable protein models.

3.2.1. Binary Contact Map Constraints

When using binary contact maps, the set of constraints coming from the predicted maps can be represented as a matrix C = {cij} ∈ {0, 1}^(n×n). Let F0 = {(i, j) | dij > DT ∧ cij = 1} denote the pairs of amino acids in contact according to C (binary case) but not in Sn ("false negatives"). Similarly, define F1 = {(i, j) | dij ≤ DT ∧ cij = 0} as the pairs of amino acids in contact in Sn but not according to C ("false positives"). The objective function is then defined as:

C(Sn, M) = α0 Σ_{(i,j)∈F0} {1 + (dij/DT)²} + Σ_{(i,j): dij∉B} (dij − DB)² + α1 |F1| + α2 Σ_{(i,j): dij∉C} e^(DHC − dij)    (1)
3.2.2. 4-class Contact Map Constraints

In the case of 4-class contact maps, the constraint derived from the predicted map assumes a slightly different form. Since contacts between pairs of Cα are here predicted in four classes, a contact is penalised not only if it is not present in the predicted map, but also depending on its distance to the boundaries of the corresponding class: Fk = {(i, j) | Dk < dij < Dk+1 ∧ cij ≠ k}, with Dk being the distance thresholds that define the classes. Let D′k = (Dk + Dk+1)/2; then the objective function is defined as:

C(Sn, M) = α0 Σk Σ_{(i,j)∈Fk} {1 + (dij/D′k)²} + Σ_{(i,j): dij∉B} (dij − DB)² + α1 Σ_{(i,j): dij∉C} e^(DHC − dij)    (2)
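The following Python sketch evaluates the multi-class pseudo-energy of equation (2). The class thresholds and the weights α0, α1 are those quoted in section 3.3; DB, DHC and the cap on the unbounded last interval are illustrative assumptions, and we read D′k as the midpoint of the class the map predicts, which is one plausible interpretation of the formula:

```python
import numpy as np

D = [0.0, 8.0, 13.0, 19.0, 50.0]   # class thresholds; inf capped at 50 A (assumption)
D_B, D_HC = 3.8, 3.5               # ideal bond length and clash distance (assumptions)
ALPHA0, ALPHA1 = 0.005, 0.05       # multi-class weights quoted in section 3.3

def multiclass_energy(coords, cmap):
    # coords: (N, 3) Ca trace; cmap: (N, N) predicted class indices in {0..3}.
    n = len(coords)
    E = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dij = np.linalg.norm(coords[i] - coords[j])
            k = int(np.searchsorted(D, dij, side='right')) - 1   # actual class
            if k != cmap[i, j]:                                  # class violation
                c = cmap[i, j]
                Dk_prime = 0.5 * (D[c] + D[c + 1])               # predicted-class midpoint
                E += ALPHA0 * (1.0 + (dij / Dk_prime) ** 2)
            if j == i + 1 and not (3.733 <= dij <= 3.873):       # bond-length term
                E += (dij - D_B) ** 2
            if dij < D_HC:                                       # clash term
                E += ALPHA1 * np.exp(D_HC - dij)
    return E
```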
3.3. Experiments and Results
The protein data set used in reconstruction simulations consists of a non-redundant set of 258 protein structures (S258) showing no homology to the sequences employed to train the contact map predictors (see below). This set includes proteins of moderate size (51 to 200 amino acids) and diverse topology as classified by SCOP (Structural Classification of Proteins database) [27] (all-α, all-β, α/β, α + β, surface, coiled-coil and small). No two proteins in this set share more than 25% sequence identity. In all the experiments, we run the annealing protocol using a non-linear (exponential decay) schedule with initial (resp. final) temperature proportional to the protein size (resp. 0). Pseudo-energy parameters are set to α0 = 0.2 (false non-contacts), α1 = 0.02 (false contacts) and α2 = 0.05 (clashes) for binary maps, and α0 = 0.005 and α1 = 0.05 (clashes) for multi-class maps, so that the conformational search is biased towards the generation of compact, clash-free structures with as many of the predicted contacts realised as possible. In the first set of simulations we compare the quality of reconstructions based on binary maps and multi-class maps for the case in which experimental constraints are known, that is, the maps are native. We use binary maps at 12 Å, since these are more informative than a number of alternatives we tested (tests not shown). In order to assess the quality of predictions, two measures are considered here: root mean square deviation (RMSD) and TM-score [28] between the predicted structure and the native one. For each protein in the test set, we run 10 folding simulations and select the best one. The results for the best simulations are then averaged over all the 258 proteins in the set and are reported in Table 1.
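For reference, RMSD after optimal superposition can be computed with the standard Kabsch procedure; a minimal numpy version is sketched below (our own sketch, not the chapter's code):

```python
import numpy as np

def ca_rmsd(P, Q):
    # RMSD between two (N, 3) Ca coordinate sets after optimal
    # superposition (Kabsch algorithm: SVD of the covariance matrix).
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # avoid improper rotation
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T       # rotation mapping P onto Q
    return np.sqrt((((P @ R.T) - Q) ** 2).sum() / len(P))
```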
4. Contact Map Prediction
Only a small number of algorithms have been developed for the prediction of distance maps [11, 29]. Far more common are methods for the prediction of binary contact maps [18–24], with distance cutoffs of 6 Å, 8 Å, 10 Å, or 12 Å usually chosen to define the
Table 1. Reconstruction algorithm results for the best models derived from binary and multi-class true contact maps.

Maps      RMSD   TM-score
Binary    4.01   0.65
4-Class   2.23   0.83
threshold between a contact and a non-contact. At the Critical Assessment of Protein Structure Prediction, CASP [30], maps are evaluated with a distance threshold of 8 Å between Cβ atoms (Cα in the case of Gly). There is a wide range of machine learning techniques for predicting contacts: hidden Markov models [31], recursive neural networks [9], multi-layer perceptrons [18, 19, 24], support vector machines [22, 23], and self-organising maps [21] are just a few. Predictors of contact maps are virtually always ab initio, meaning that they do not rely directly on similarity to proteins of known structure. In fact, much care is often taken to exclude any detectable similarity between training and test set instances when gauging the predictive performance of structural feature predictors. The method we present here is based on recursive neural networks, in particular 2-dimensional recursive neural networks (2D-RNNs). We predict both binary and multi-class maps. The system presented is an update of the system which took part in CASP7 [30]. The most significant update is the addition of homology information from the PDB [32]. In the following sections we give a detailed overview of the system and show that homology information greatly increases the performance of the predictor, even in the difficult [0,30)% sequence identity homology zone.
4.1. 2D-RNNs

2D-RNNs were previously described in [20] and [33]. This is a family of adaptive models for mapping two-dimensional matrices of variable size into matrices of the same size. If oj,k is the entry in the j-th row and k-th column of the output matrix (in our case, it will represent the estimated probability of residues j and k belonging to a particular class), and ij,k is the input in the same position, the input-output mapping is modelled as:

oj,k = N(O)( ij,k, h(1)j,k, h(2)j,k, h(3)j,k, h(4)j,k )
h(1)j,k = N(1)( ij,k, h(1)j−1,k, .., h(1)j−s,k, h(1)j,k−1, .., h(1)j,k−s )
h(2)j,k = N(2)( ij,k, h(2)j+1,k, .., h(2)j+s,k, h(2)j,k−1, .., h(2)j,k−s )
h(3)j,k = N(3)( ij,k, h(3)j+1,k, .., h(3)j+s,k, h(3)j,k+1, .., h(3)j,k+s )
h(4)j,k = N(4)( ij,k, h(4)j−1,k, .., h(4)j−s,k, h(4)j,k+1, .., h(4)j,k+s )

for j, k = 1, . . ., N and s = 1, . . ., S,
where the h(n)j,k for n = 1, . . ., 4 are planes of hidden vectors transmitting contextual information from each corner of the matrix to the opposite corner. We parametrise the output update and the four lateral update functions (respectively N(O) and N(n) for n = 1, . . ., 4) using five two-layered feed-forward neural networks, as in [33]. Stationarity is assumed for all residue pairs (j, k), that is, the same parameters are used across all j = 1, ..., N and k = 1, ..., N. Each of the 5 neural networks contains its own individual parameters, which are not constrained to those of the other networks. We use 2D-RNNs with shortcut connections. The best way to think of shortcuts is to consider a simple recurrent network in the 1-dimensional (1D) case. The standard definition of 1D recurrent neural networks prescribes an explicit dependency between the input being processed now, at time (position) j, and the item processed previously, j − 1, resulting in an implicit dependency between j and all previous items. Most algorithms lack the power to extract information from the implicit dependencies (especially when using gradient learning) beyond the span of a few steps, because of the well-known problem of vanishing gradients [34]. Shortcuts are an extension of this idea where, in addition to the direct dependency on the previous item j − 1, there is also a direct dependency on the items j − s for s = 2, ..., S. Indeed, shortcut connections can be placed at any of the previous inputs j − s for any s ∈ {1, .., S}. The latter placement of shortcuts was used to produce near-perfect secondary structure predictions in a bidirectional recurrent neural network when (j, s) are native contacts [35]. Notice that increasing the number of shortcuts increases the number of parameters, resulting in a model that may overfit the data. Extending this idea to the 2D case in any direction in the matrix is straightforward (in fact, any dimension can be processed). Shortcut directions and patterns are not strictly constrained (so long as cycles are not introduced in the directed graph representing the network) and may even be learned. With the addition of shortcuts, the span of contextual information analysed by a recursive network can be extended, although this may come at the price of increased noise reaching the input and increased potential for overfitting the examples. The choice of input ij,k is an important factor for the algorithm. In the case of contact map prediction, the simplest input is the pair of amino acid symbols at (j, k). Different input signals can be constructed to improve the algorithm. For example, contact density was used in [8] to improve contact map prediction accuracy significantly. In section 4.4 the design of the input will be discussed.
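To make the recurrence above concrete, the following Python sketch implements a forward pass of a 2D-RNN with shortcut connections. For brevity each update function is a single-layer network rather than the two-layered networks used in the chapter, and all weight shapes and names are our own illustrative choices:

```python
import numpy as np

def corner_pass(I, W, b, dj, dk, S, H):
    # One hidden plane: sweep the N x N grid from the corner implied by the
    # step directions dj, dk in {+1, -1}; each cell sees its input plus the
    # S preceding hidden vectors along each axis (shortcut connections).
    # Out-of-grid context is the zero vector.
    N = I.shape[0]
    h = np.zeros((N, N, H))
    js = range(N) if dj > 0 else range(N - 1, -1, -1)
    ks = range(N) if dk > 0 else range(N - 1, -1, -1)
    for j in js:
        for k in ks:
            ctx = [h[j - s * dj, k] if 0 <= j - s * dj < N else np.zeros(H)
                   for s in range(1, S + 1)]
            ctx += [h[j, k - s * dk] if 0 <= k - s * dk < N else np.zeros(H)
                    for s in range(1, S + 1)]
            h[j, k] = np.tanh(W @ np.concatenate([I[j, k]] + ctx) + b)
    return h

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_map(I, hidden, output, S=2, H=13, D=4):
    # hidden: four (W, b) pairs, one per corner plane; output: (Wo, bo).
    # Corner directions follow planes 1-4 of the equations above.
    corners = [(1, 1), (-1, 1), (-1, -1), (1, -1)]
    planes = [corner_pass(I, W, b, dj, dk, S, H)
              for (W, b), (dj, dk) in zip(hidden, corners)]
    N = I.shape[0]
    O = np.zeros((N, N, D))
    for j in range(N):
        for k in range(N):
            x = np.concatenate([I[j, k]] + [p[j, k] for p in planes])
            O[j, k] = softmax(output[0] @ x + output[1])
    return O   # (N, N, D) class probabilities
```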
4.2. Training

Learning proceeds by gradient descent, minimising the relative cross-entropy between target and output. Careful management of the gradient must take place, not letting it become too small or too large: the absolute value of each component of the gradient is kept within the [0.1, 1] range, meaning that it is set to 0.1 if it is smaller than 0.1, and to 1 if it is greater than 1. The learning rate is set to 0.3 divided by the total number of proteins in the dataset. The weights of the networks are initialised randomly. Learning is slow due to the complexity of the problem. Each 2D-RNN contains 5 neural networks, replicated N² times for a protein of length N. During each training epoch, forward and back-propagation have to occur in each of the 5 × N² networks, for all P proteins in
the training set. The neural network forward and back-propagation have complexity O(θ), where θ is the number of parameters in the network. Learning generally converges after about 300–350 epochs. Although the complexity of an epoch is polynomial at O(θN²P), the large size of the training set, and especially the quadratic term in the length of the proteins, make learning quite time-consuming. Training of all systems (binary, multi-class; ab initio, template-based) took approximately three months on a cluster of 10 2.8GHz CPUs. However, during prediction only one forward propagation needs to run for each instance, meaning that predictions for a set may be run in roughly 3 orders of magnitude less time than a training on the same set. For instance, maps for 1000 proteins of average length 120 amino acids can be predicted in approximately 13 hours on a single 2.8GHz CPU, and genomic-scale predictions are possible even on a small cluster of machines.
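The component-wise gradient management described above amounts to clamping each component's magnitude into [0.1, 1] while preserving its sign; a small sketch (exactly-zero components are left at zero here, a detail the text does not specify):

```python
import numpy as np

def clamp_gradient(grad, lo=0.1, hi=1.0):
    # Force each gradient component's absolute value into [lo, hi].
    return np.sign(grad) * np.clip(np.abs(grad), lo, hi)

def learning_rate(num_proteins):
    # As in the text: 0.3 divided by the number of training proteins.
    return 0.3 / num_proteins
```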
4.3. Architecture In each of the 5 neural networks used to parameterise the functions, N (O) and N (n) for n = 1, . . . , 4, we use a single hidden layer. Let Nhh and Nho denote the number of units associated with the hidden layer and the output layer of the hidden contextual networks respectively. From the definition of the 2D-RNN we see that each hidden network has I regular input units and 2 × Nho + S × Nho contextual inputs, where S are the total number of shortcuts allowed. Thus, including the usual bias terms in each layer, the total number of parameters in one of the four hidden networks is: (I + 2 × Nho + S × Nho ) × Nhh + Nhh + Nhh × Nho + Nho . The output network also contains I regular inputs but it takes contextual inputs from the four hidden networks 4 × Nho resulting in: (I + 4 × Nho ) × Nh + Nh + D × N h + D parameters, where Nh are the number of units in the hidden layer of the output network and D is the number of classes. The activation functions used are softmax and tanh. Only the output units of the output network have softmax functions in order to estimate Bayesian posterior probability of class membership. All other units have tanh transfer functions. No overfitting avoiding techniques such as early stopping or weight decay were applied given the very large size of the datasets, and the fact that we ensemble many networks in the final predictor (see section 4.5.2). Due to the large computational power needed to train one model we ensemble networks both from different trainings and from different stages of the same training. Networks are saved every 5 epochs, and for each training the last 3 networks are ensembled. Three networks with different architectural parameters (Nhh = Nho = Nh = 13, 14, 15) are trained for each predictor. Results for network performances in this work are reported for these ensembles of 3 × 3 = 9 models. Ensembling leads to significant classification performance improvements over single models. All results are in 5-fold cross validation, meaning that, in fact, 5 times 9 models are available for each system. For the reconstruction results (see next section) only the final networks for each training are ensembled, for a total of 1 × 3 × 5 = 15 for each system. The number of classes is D = 2 or D = 4 depending on the problem (binary vs. multiclass). For all networks the number of shortcuts is S = 2, with more sophisticated shortcut placements to be investigated in the future.
4.4. Input Design

Input ij,k associated with the j-th and k-th residue pair contains primary sequence information, evolutionary information, structural information, and direct contact information derived from the PDB templates:

ij,k = ( i(E)j,k, i(T)j,k )    (3)

where, assuming that e units are devoted to evolutionary sequence information and structural information in the form of secondary structure [36, 37], solvent accessibility [36, 38] and contact density [8]:

i(E)j,k = ( i(1)(E)j,k, . . ., i(e)(E)j,k )    (4)

Template information is placed in the remaining t units:

i(T)j,k = ( i(1)(T)j,k, . . ., i(t)(T)j,k )    (5)

Hence ij,k contains a total of e + t components. As in [8], e = 418, consisting of a sparse 20 × 20 matrix corresponding to the frequency of all pairs of amino acids observed in the two columns j and k of the multiple sequence alignment; this was chosen in order to capture information about correlated mutations. Structural information in the form of secondary structure (three classes), solvent accessibility (two classes), and contact density (four classes) for residues j and k is placed in the remaining 6, 4 and 8 input units respectively. For the template units we use t = 3 for binary maps and t = 5 for multi-class maps, representing weighted contact class information from the templates and one template quality unit. For example, in the case of multi-class maps the first four input units contain the weighted average contact class frequency in the PDB templates, while the last unit encodes the average quality of the template column. Assume that d(p)j,k is a 4-component binary vector encoding the contact class of the j-th and k-th residue pair in the p-th template. Then, if P is the total number of templates for a protein:

( i(1)(T)j,k, . . ., i(4)(T)j,k ) = Σ_{p=1}^{P} wp d(p)j,k / Σ_{p=1}^{P} wp    (6)

where wp is the weight attributed to the p-th template. If the sequence identity between template p and the query is idp and the quality of a template (measured as X-ray resolution + R-factor/20, or 10 for NMR hits, as in [39]) is qp, then the weight is defined as:

wp = qp · idp³    (7)

Taking the cube of the identity between template and query allows us to drastically reduce the contribution of low-similarity templates when good templates are available. For instance, a 90% identity template is weighed two orders of magnitude more than a 20% one. In preliminary tests (not shown) this measure performed better than a number of alternatives.
The final unit of ij,k, the quality unit, encodes the weighted average coverage and similarity of a column of the template profile as follows:

i(5)(T)j,k = Σ_{p=1}^{P} wp cp / Σ_{p=1}^{P} wp    (8)

where cp is the coverage of the sequence by template p (i.e. the fraction of non-gaps in the alignment). Encoding template information for the binary maps is similar. Ab initio predictions use only the first part of the input, i(E)j,k from equation 4, including secondary structure, solvent accessibility and contact density, although these are predicted ab initio. The template-based predictions use the complete ij,k as input.
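Equations (6)–(8) are straightforward to compute; a small sketch (names are ours) is:

```python
import numpy as np

def template_weight(identity, quality):
    # Eq. (7): w_p = q_p * id_p^3, with identity in [0, 1].
    return quality * identity ** 3

def template_units(class_onehots, coverages, identities, qualities):
    # Eq. (6): weighted average class vector over the P templates, and
    # eq. (8): weighted average coverage (the quality unit).
    # class_onehots: (P, 4) binary vectors d^(p)_jk; coverages: (P,).
    w = template_weight(np.asarray(identities), np.asarray(qualities))
    avg_class = (w[:, None] * np.asarray(class_onehots)).sum(0) / w.sum()
    avg_cov = (w * np.asarray(coverages)).sum() / w.sum()
    return np.concatenate([avg_class, [avg_cov]])   # the t = 5 template inputs
```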
4.5. Experiments

4.5.1. Problem Definition

The main objective of the experiments is to compare ab initio systems (PDB templates are assumed unavailable) and template-based systems. When very reliable PDB information (e.g. sequence identity to the query greater than 30-35%) is available, we expect template-based predictions to be substantially better and, in fact, to nearly exactly replicate the maps of the best template. More interesting questions are: whether template-based predictions improve on ab initio ones in the so-called twilight zone of sequence similarity (less than 30%); and whether, in this same region, template-based predictions are better than can be obtained by simply copying the map of the best template, or a combination of the maps of the templates. The 4 systems that we test are 12 Å ab initio contact maps (12AI), 12 Å contact maps with templates (12TE), multi-class ab initio (MAI) and multi-class with templates (MTE).

4.5.2. Dataset

The dataset used in the present simulations is extracted from the December 2003 25% pdb select list¹. We use the DSSP program [40] (CMBI version) to assign relevant structural features (secondary structure and relative solvent accessibility). Cα coordinates, directly available from the PDB, are used to calculate contact density [8]. Sequences for which DSSP does not produce an output, due for instance to missing entries or format errors, are removed. For computational reasons, and to focus on single domains, proteins which have more than 200 amino acids are also removed. After processing by DSSP and the removal of long proteins, the set contains 1602 proteins and 163,379 amino acids. All the tests reported in this paper are run in 5-fold cross validation. The 5 folds are of roughly equal size, composed of 318-327 proteins. The datasets are available upon request. Evolutionary information in the form of multiple sequence alignments has long been shown to improve prediction of protein structural features [20, 33, 37, 41–45]. Multiple sequence alignments for the 1602 proteins are extracted from the NR database as available on March 3, 2004, containing over 1.4 million sequences. The database is first redundancy-reduced at a 98% threshold, leading to a final 1.05 million sequences. The alignments are
¹ http://homepages.fh-giessen.de/~hg12640/pdbselect
Table 2. Number of residue pairs contained in the 12 Å binary classes and the four classes in the multi-class definition.

             class 0     class 1      class 2     class 3
12 Å         4,062,483   15,755,172   –           –
Multi-class  1,623,411   3,205,472    5,176,584   9,812,188
generated by three runs of PSI-BLAST [46] with parameters b = 3000, e = 10⁻³ and h = 10⁻¹⁰. Table 2 shows the class distribution of both types of map in the dataset. What is immediately obvious from this table is that the class distribution is more balanced in the 4-class problem, which should therefore be easier to learn.

4.5.3. Template Generation

For each of the 1602 proteins we search for structural templates in the PDB. We base our search on PDBFINDERII [47] as available on August 22, 2005. An obvious problem arising is that all proteins in the set are expected to be in the PDB (barring name changes), hence every protein will have a perfect template. To avoid this, we exclude from PDBFINDERII every protein that appears in the set. We also exclude all entries shorter than 10 residues, leading to a final 66,350 chains. Because of the PDBFINDERII origin, only one chain is present in this set for NMR entries. To generate the actual templates for a protein, we run two rounds of PSI-BLAST against the version of the redundancy-reduced NR database described above, with parameters b = 3000 (maximum number of hits), e = 10⁻³ (expectation of a random hit) and h = 10⁻¹⁰ (expectation of a random hit for sequences used to generate the PSSM). We then run a third round of PSI-BLAST against the PDB using the PSSM generated in the first two rounds. In this third round we deliberately use a high expectation parameter (e = 10) to include hits that are beyond the usual Comparative Modelling scope (e < 0.01 at CASP6 [25]). We further remove from each set of hits thus found all those with sequence similarity exceeding 95% over the whole query, to exclude PDB resubmissions of the same structure at different resolution, other chains in N-mers, and close homologues. The distribution of sequence similarity of the best template, and the average template similarity, is plotted in Figure 8. Roughly 14% of the proteins have no hits at more than 10% sequence similarity. About 19% of all proteins have at least one very high quality (better than 90% similarity) entry in their template set. Although the distribution is not uniform, all similarity intervals are adequately represented: for about 41% of the proteins no hit is above 30% similarity; for nearly 24% of the proteins the best hit is in the 30-50% similarity interval. The average similarity of all PDB hits for each protein is, not surprisingly, generally low: for roughly 73% of all proteins the average identity is below 30%. It should be noted that template generation is an independent module in the systems. We are currently investigating whether more subtle strategies for template recognition would still benefit contact map predictions, with or without retraining the systems on the new template distributions.
Figure 8. Distribution of best-hit (blue) and average (red) sequence similarity in the PSI-BLAST templates for the S2171 set. Hits above 95% sequence similarity are excluded.
4.5.4. Training/Testing Protocol

The predictors of contact maps rely on predictions of secondary structure, solvent accessibility and contact density [8]. True structural information was used for training in both ab initio and template-based systems. For testing, we used predictions from our servers Porter, PaleAle and BrownAle, predicting secondary structure, solvent accessibility and contact density respectively. The ab initio models use ab initio secondary structure, solvent accessibility and contact density predictions. The template models use template-based secondary structure and solvent accessibility and ab initio contact density predictions (template-based contact density remains to be investigated). All our experiments are carried out in 5-fold cross validation. The same dataset and multiple alignments are used to train the ab initio and template-based secondary structure predictor Porter, solvent accessibility predictor PaleAle and contact density predictor BrownAle. By design, these were trained using the same 5-fold split as the map predictors; therefore removing a trained fold while testing was a simple procedure, and all 1D predictions are by models that were trained on a dataset split independent of the query. The accuracy measure over all classes is calculated in order to compare the ab initio and template-based models:

Accuracy = Σ_{c=0}^{C−1} correct_c / total_c    (9)
where C is the total number of classes. All the accuracy values are calculated as a function of the best hit template found in the PDB for the query sequence. The best hit was determined by sequence identity between a template sequence and the query sequence.
Table 3. Percentage of correctly predicted residue pairs for the ab initio (12AI) and template-based 12 Å predictor (12TE) as a function of sequence identity to the best template. Template sequence identity 10 means all proteins that have a best hit template in the identity range [0, 10)%; All is the complete set.

        10    20    30    40    50    60    70    80    90    ≥90   All
12AI    85.9  87.5  86.8  85.6  87.2  86.5  86.2  86.1  86.4  87.3  86.8
12TE    85.3  87.8  91.3  93.6  95.7  96.0  95.8  96.4  97.0  97.3  93.2
Table 4. Identical to Table 3 except only calculated for non-template regions of the map.

        10    20    30    40    50    60    70    80    90    ≥90   All
12AI    85.8  87.6  88.1  89.9  92.0  90.8  93.1  90.5  94.0  94.0  87.9
12TE    85.3  87.1  88.4  91.4  92.8  92.8  94.0  94.3  94.8  94.4  87.7
4.6. Results and Discussion

Table 3 reports the comparison between 12 Å ab initio and template-based predictions (12AI vs. 12TE) as a function of sequence identity to the best PDB hit. The only decrease in performance is in the [0,10)% identity range, where the accuracy slightly decreases, by 0.6%. However, the same results for multi-class maps show that there is never a decrease in performance (Table 6). A role in this is played by the quality of predictions in regions not covered by the templates (reported in Tables 4 and 7). In these areas, for a sequence similarity of 20% and greater, both 12TE and MTE perform better than, respectively, 12AI and MAI. However, for lower similarity, 12AI outperforms 12TE on areas not covered by templates, while MTE still improves on MAI. This may be due either to the more balanced nature of the problem, or to easier contextual propagation in the multi-class case (the narrower class
Table 5. Percentage of correctly predicted residue pairs for 12TE when only considering the residues covered by the best template. Baseline is a predictor that copies the contact assignment from the best hit template. Template sequence identity 10 means all proteins that have a best hit template in the identity range [0, 10)%.

          10    20    30
12TE      79.2  86.8  92.0
Baseline  84.0  89.2  92.1
Table 6. Percentage of correctly predicted residue pairs for the ab initio (MAI) and template-based multi-class predictor (MTE) as a function of sequence identity to the best template. Template sequence identity 10 means all proteins that have a best hit template in the identity range [0, 10)%; All is the complete set.

       10    20    30    40    50    60    70    80    90    ≥90   All
MAI    59.3  59.4  58.4  57.3  58.3  57.4  58.3  58.5  58.2  59.9  58.8
MTE    60.2  64.2  75.9  82.5  87.8  88.8  88.1  89.7  91.5  92.1  80.8
Table 7. Identical to Table 6 except only calculated for non-template regions of the map.

       10    20    30    40    50    60    70    80    90    ≥90   All
MAI    59.0  58.3  61.8  64.8  71.6  69.4  75.0  71.4  75.7  75.5  61.1
MTE    60.3  60.7  65.7  71.2  76.4  75.5  80.3  82.1  80.2  79.4  63.8
Table 8. Percentage of correctly predicted residue pairs for MTE when only considering residues covered by the best template. Baseline is a predictor that copies the class assignment from the best hit template. Template sequence identity 10 means all proteins that have a best hit template in the identity range [0, 10)%.

          10    20    30
MTE       60.2  69.8  78.8
Baseline  54.8  67.1  78.6
ranges impose stricter distance constraints among neighbours), or a combination of both. Ultimately, templates improve multi-class predictions in all regions of sequence similarity (including [0,10)%), both for regions covered and regions not covered by templates. Tables 5 and 8 report the comparison between template-based predictions and a baseline for 12 Å and multi-class maps respectively. The baseline simply calculates the class for position (i, j) from the coordinates in the best template. We also tested different baselines in which, instead of just the top template, the top 10 templates or all templates were used to derive the class by a majority vote among the templates covering each position. We tested both an unweighted vote and a vote in which each template is weighted by its sequence similarity to the query, cubed. The latter weighting scheme is identical to the one used to present the templates to the neural networks (see equation 7). In all cases these baselines are worse than the best-hit baseline, so their results are not reported here. We only report predictions vs. baseline for the [0,30)% templates since, above 30% identity, as expected, the results are essentially the same. In this twilight region, where it is difficult to extract information
from templates, MTE outperforms the baseline, while 12TE does not. The multi-class results are clearly encouraging: they outperform the baseline (Table 8), and always improve on non-template regions (Table 7) and overall maps (Table 6). Figures 10 and 11 show an example of a map predicted for a low best-hit sequence identity of 22.7%.
4.7. Modelling Protein Structures from Predicted Maps

In Figure 9, the average RMSD vs sequence length is shown for models for set S258 derived from true 4-class contact maps (stars), from MTE maps (squares) and from MAI maps (Xs), together with the baseline (crosses). The baseline represents a structure collapsed into its centre of mass. Note that no templates are allowed that show a sequence identity greater than 90% to the query. Hence, the MTE results are based on a mixture of good, bad and no templates, akin to the distribution one would expect when presented with a protein sequence that is not in the PDB. The distribution of templates for S258 (not reported) closely resembles the one for the training/testing set, reported in Figure 8. It is also important to note that the results are an average of 10 reconstructions. If more reconstructions were run and, especially, if these were ranked, the results would likely improve considerably. The average reconstruction RMSD for MTE is 9.46 Å and the average TM-score 0.51. If the best of the 10 reconstructions is picked, these improve to 8.59 Å and 0.55, respectively.
Table 9. Reconstruction algorithm results for models derived from multi-class predicted contact maps with (MTE) and without (MAI) allowing homology information. Note that, since no templates are allowed that show a sequence identity greater than 90% to the query, the MTE results are based on a mixture of good, bad and no templates (see Figure 8 for a sample distribution of template quality). The reported values are the average over the 10 runs of simulated annealing.

Maps  RMSD   TM-score
MAI   14.60  0.27
MTE   9.46   0.51
Reconstructions based on 4-class maps are significantly better than those from binary maps. Tested on both ab initio and homology-based 4-class maps, the results show that homology-based predictions are generally more accurate than ab initio ones, even when homology is dubious. For sequence similarity above 30%, the predictions' TM-score is on average slightly above 0.7, indicating high reliability; it is approximately 0.45 in the 20-30% interval, and 0.27 in the region below 20%. If reconstruction performance is measured on the S258 set without allowing homology information at any stage (pure ab initio predictions), the average TM-score is 0.27, with 43 of the 258 structures above a TM-score of 0.4.
Figure 9. 4-class contact maps: average RMSD vs sequence length for models derived from true contact maps (blue stars), from contact maps predicted using information derived from homologues (MTE) (purple squares), and from ab initio predicted contact maps (green Xs), together with the baseline (red crosses). Note that, since no templates are allowed that show a sequence identity greater than 90% to the query, the MTE results are based on a mixture of good, bad and no templates (see Figure 8 for a sample distribution of template quality).
5. Conclusions
In this work we have described a machine learning pipeline for high-throughput prediction of protein structures, and have introduced a number of novel algorithmic ideas. First, based on the observation that protein binary contact maps are somewhat lossy representations of the structure and yield only relatively low-resolution models, we have introduced multi-class maps, and shown that, via a simple simulated annealing protocol, these lead to much more accurate models, with an average RMSD to the native structure of just over 2 Å and a TM score of 0.83. Secondly, extending ideas we have developed for predictors of secondary structure and solvent accessibility [36], we have presented systems for the prediction of binary and multi-class maps that use structural templates from the PDB to yield far more accurate predictions than their ab initio counterparts. We have also shown that multi-class maps lead to a more balanced prediction problem than binary ones. Although it is unclear whether because of this, or because of the nature of the constraints encoded into them, the template-based systems for the prediction of multi-class maps we tested are capable of exploiting both sequence and structure information even in cases of dubious homology, significantly improving over their ab initio counterpart well into and below the twilight zone of sequence
Figure 10. Protein 1B9LA 12 Å contact maps for ab initio (left) and template-based (right) predictions. The best template sequence identity is 22.7%. The top right of each map is the true map and the bottom left is predicted. In the predicted half, white and red are true negatives and positives respectively; blue and green are false negatives and positives respectively. The three black lines correspond to |i − j| ≥ 6, 12, 24.
Figure 11. Protein 1B9LA multi-class contact maps for ab initio (left) and template-based (right) predictions. The best template sequence identity is 22.7%. The top right of each map is the true map and the bottom left is predicted. In the predicted half, red, blue, green and yellow correspond to classes 0, 1, 2 and 3 respectively. The greyscale in the predicted half corresponds to falsely predicted classes. The three black lines correspond to |i − j| ≥ 6, 12, 24.
identity. This turns out to be only partly true, at least in our tests, for binary contact map predictors. Moreover, multi-class map predictions are far more accurate than the maps of the best templates throughout the twilight and midnight zones of sequence identity, including the case in which only templates with less than 10% sequence identity to the query are available. Conversely, for binary contact maps, the best template is on average more accurate than the prediction over the whole [0%,30%) region of sequence identity. Finally, we have shown that template-based predictions of multi-class maps lead to fair to good predictions of protein structures, with an average TM score of 0.7 or higher to the
native when good templates are available (sequence identity greater than 30%), and of 0.45 in the [20%, 30%) identity region. Ab initio predictions are still, on average, poor, at an average TM score of 0.27. Nevertheless, it is important to note that the component for homology detection in this study is basic (PSI-BLAST) and entirely modular, in that it may be substituted by any other method that finds templates without substantially altering the pipeline. Whether more subtle homology detection or fold recognition components could be substituted for PSI-BLAST, with or without retraining the underlying machine learning systems, is the focus of our current studies. The overall pipeline, including the template-based component, is available at the URL http://distill.ucd.ie/distill/. Protein structure predictions are based on multi-class maps, and templates are automatically provided to the pipeline when available.
Acknowledgments This work is supported by Science Foundation Ireland grant 05/RFP/CMS0029, grant RP/2005/219 from the Health Research Board of Ireland and a UCD President’s Award 2004.
References

[1] M. Adams, A. Joachimiak, G. T. Kim, R. Montelione, and J. Norvell. Meeting review: 2003 NIH protein structure initiative workshop in protein production and crystallization for structural and functional genomics. J. Struct. Funct. Genomics, 5:1–2, 2004.
[2] K. T. Simons, C. Kooperberg, E. Huang, and D. Baker. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J. Mol. Biol., 268:209–225, 1997.
[3] P. Larranaga, B. Calvo, R. Santana, C. Bielza, J. Galdiano, I. Inza, and J. A. Lozano. Machine learning in bioinformatics. Briefings in Bioinformatics, 7(1):86–112, 2006.
[4] M. Vendruscolo, E. Kussell, and E. Domany. Recovery of protein structure from contact maps. Folding and Design, 2:295–306, 1997.
[5] J. M. G. Izarzugaza, O. Grana, M. L. Tress, A. Valencia, and N. D. Clarke. Assessment of intramolecular contact predictions for CASP7. Proteins, 69(S8):152–158, 2007.
[6] D.J. Osguthorpe. Ab initio protein folding. Current Opinion in Structural Biology, 10(2):146–152, 2000.
[7] A. A. Canutescu, A. A. Shelenkov, and R. L. Dunbrack. A graph theory algorithm for protein side-chain prediction. Protein Science, 12:2001–2014, 2003.
[8] A. Vullo, I. Walsh, and G. Pollastri. A two-stage approach for improved prediction of residue contact maps. BMC Bioinformatics, 7:180, 2006.
[9] G. Pollastri, P. Baldi, A. Vullo, and P. Frasconi. Prediction of protein topologies using GIOHMMs and GRNNs. Advances in Neural Information Processing Systems (NIPS) 15, MIT Press, 2003.
[10] D.A. Debe, M.J. Carlson, J. Sadanobu, S.I. Chan, and W.A. Goddard. Protein fold determination from sparse distance restraints: the restrained generic protein direct Monte Carlo method. J. Phys. Chem., 103:3001–3008, 1999.
[11] A. Aszodi, M.J. Gradwell, and W.R. Taylor. Global fold determination from a small number of distance restraints. J. Mol. Biol., 251:308–326, 1995.
[12] E.S. Huang, R. Samudrala, and J.W. Ponder. Ab initio fold prediction of small helical proteins using distance geometry and knowledge-based scoring functions. J. Mol. Biol., 290:267–281, 1999.
[13] J. Skolnick, A. Kolinski, and A.R. Ortiz. MONSSTER: a method for folding globular proteins with a small number of distance restraints. J. Mol. Biol., 265:217–241, 1997.
[14] P.M. Bowers, C.E. Strauss, and D. Baker. De novo protein structure determination using sparse NMR data. J. Biomol. NMR, 18:311–318, 2000.
[15] W. Li, Y. Zhang, D. Kihara, Y.J. Huang, D. Zheng, G.T. Montelione, A. Kolinski, and J. Skolnick. TOUCHSTONEX: protein structure prediction with sparse NMR data. Proteins: Structure, Function, and Genetics, 53:290–306, 2003.
[16] D. Bau, G. Pollastri, and A. Vullo. Analysis of Biological Data: A Soft Computing Approach, chapter Distill: a machine learning approach to ab initio protein structure prediction. World Scientific, 2007.
[17] M. Vassura, L. Margara, P. Di Lena, F. Medri, P. Fariselli, and R. Casadio. FT-COMAR: fault tolerant three-dimensional structure reconstruction from protein contact maps. Bioinformatics, 24(10):1313–1315, 2008.
[18] P. Fariselli and R. Casadio. A neural network based predictor of residue contacts in proteins. Protein Engineering, 12(1):15–21, 1999.
[19] P. Fariselli, O. Olmea, A. Valencia, and R. Casadio. Prediction of contact maps with neural networks and correlated mutations. Protein Engineering, 14(11):835–843, 2001.
[20] G. Pollastri and P. Baldi. Prediction of contact maps by recurrent neural network architectures and hidden context propagation from all four cardinal corners. Bioinformatics, 18, Suppl. 1:S62–S70, 2002.
[21] R.M. MacCallum. Striped sheets and protein contact prediction. Bioinformatics, 20, Suppl. 1:224–231, 2004.
[22] Y. Zhao and G. Karypis. Prediction of contact maps using support vector machines. 3rd International Conference on Bioinformatics and Bioengineering (BIBE), pages 26–33, 2003.
[23] J. Cheng and P. Baldi. Improved residue contact prediction using support vector machines and a large feature set. BMC Bioinformatics, 8:113, 2007.
[24] M. Punta and B. Rost. PROFcon: novel prediction of long-range contacts. Bioinformatics, 21:2960–2968, 2005.
[25] J. Moult, K. Fidelis, B. Rost, T. Hubbard, and A. Tramontano. Critical assessment of methods of protein structure prediction (CASP) - round 6. Proteins, 7:3–7, 2005.
[26] V.A. Eyrich, M.A. Marti-Renom, D. Przybylski, M.S. Madhusudan, A. Fiser, F. Pazos, A. Valencia, A. Sali, and B. Rost. EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics, 17:1242–1251, 2001.
[27] A. Andreeva, D. Howorth, S.E. Brenner, T.J.P. Hubbard, C. Chothia, and A.G. Murzin. SCOP database in 2004: refinements integrate structure and sequence family data. Nucl. Acids Res., 32:D226–D229, 2004.
[28] Y. Zhang and J. Skolnick. Scoring function for automated assessment of protein structure template quality. Proteins, 57:702–710, 2004.
[29] O. Lund, K. Frimand, J. Gorodkin, H. Bohr, J. Bohr, J. Hansen, and S. Brunak. Protein distance constraints predicted by neural networks and probability density functions. Prot. Eng., 10:1241–1248, 1997.
[30] CASP home page, http://predictioncenter.org/.
[31] Y. Shao and C. Bystroff. Predicting interresidue contacts using templates and pathways. Proteins, 53:487–502, 2003.
[32] H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, and P.E. Bourne. The protein data bank. Nucleic Acids Research, 28:235–242, 2000.
[33] P. Baldi and G. Pollastri. The principled design of large-scale recursive neural network architectures – DAG-RNNs and the protein structure prediction problem. Journal of Machine Learning Research, 4(Sep):575–602, 2003.
[34] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5:157–166, 1994.
[35] A. Ceroni, P. Frasconi, and G. Pollastri. Learning protein secondary structure from sequential and relational data. Neural Networks, 18(8):1029–39, 2005.
[36] G. Pollastri, A.J.M. Martin, C. Mooney, and A. Vullo. Accurate prediction of protein secondary structure and solvent accessibility by consensus combiners of sequence and structure information. BMC Bioinformatics, 8(201):12, 2007.
[37] G. Pollastri and A. McLysaght. Porter: a new, accurate server for protein secondary structure prediction. Bioinformatics, 21(8):1719–20, 2005.
[38] G. Pollastri, P. Fariselli, R. Casadio, and P. Baldi. Prediction of coordination number and relative solvent accessibility in proteins. Proteins, 47:142–235, 2002.
[39] U. Hobohm and C. Sander. Enlarged representative set of protein structures. Protein Sci., 3:522–24, 1994.
[40] W. Kabsch and C. Sander. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22(12):2577–637, 1983.
[41] B. Rost and C. Sander. Combining evolutionary information and neural networks to predict protein secondary structure. Proteins, 19(1):55–72, 1994.
[42] S. K. Riis and A. Krogh. Improving prediction of protein secondary structure using structured neural networks and multiple sequence alignments. J. Comput. Biol., 3:163–183, 1996.
[43] D. T. Jones. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292:195–202, 1999.
[44] G. Pollastri, D. Przybylski, B. Rost, and P. Baldi. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins, 47:228–235, 2002.
[45] G. Pollastri and P. Baldi. Prediction of contact maps by recurrent neural network architectures and hidden context propagation from all four cardinal corners. Bioinformatics, 18(S1):S62–S70, 2002.
[46] S.F. Altschul, T.L. Madden, and A.A. Schaffer. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res., 25:3389–3402, 1997.
[47] E. Krieger, R.W.W. Hooft, S. Nabuurs, and G. Vriend. PDBFinderII – a database for protein structure analysis and prediction. Submitted, 2004.
In: Computational Biology: New Research Editor: Alona S. Russe
ISBN: 978-1-60692-040-4 © 2009 Nova Science Publishers, Inc.
Chapter 7
COARSE-GRAINED STRUCTURAL MODEL OF PROTEIN MOLECULES

Kilho Eom¹,² and Sungsoo Na²
¹ Nano-Bio Research Center, Korea Institute of Science and Technology (KIST), Seoul, Republic of Korea
² Department of Mechanical Engineering, Korea University, Seoul, Republic of Korea
Abstract

Understanding protein mechanics is a prerequisite for gaining insight into a protein's biological functions, since most proteins perform their function through the structural deformation known as conformational change. Such conformational change has been computationally delineated by atomistic simulations, although the mechanics of large protein structures is computationally inaccessible to atomistic methods such as molecular dynamics simulation. In the last decade, normal mode analysis with coarse-grained modeling of protein structures has become a computational alternative to atomistic simulations for understanding large protein mechanics. In this review, we delineate the current state of the art in coarse-grained modeling of proteins for normal mode analysis. Specifically, the pioneering coarse-grained models such as the Go model and the elastic network model, as well as recently developed coarse-grained elastic network models, are summarized and discussed for understanding large protein mechanics.
Keywords: protein mechanics, coarse-grained model, normal mode analysis, Go model, elastic network model
Introduction

Protein mechanics plays a vital role in the biological function of proteins, since a protein performs its biological function through structural deformation driven by mechanical loading. For instance, motor proteins are renowned for performing a mechanical function, that is, the transduction of chemical energy into mechanical energy [1]. Specifically,
the mechanical function of the ATPase motor protein is carried out via its structural change upon ATP binding [2-6]. The chaperonin GroEL-GroES complex assists protein folding through rotational motion of its domains upon ATP binding [7, 8]. The giant muscle molecule known as titin performs its mechanical function through structural change from a folded structure to an unfolded (denatured) structure, or vice versa, upon mechanical loading or unloading [9-13]. Protein mechanics related to a protein's biological function has been computationally studied by atomistic simulations [14] since McCammon et al. [15] studied the dynamic behavior of a small protein based on molecular dynamics simulation. The thermal fluctuation behavior of small proteins has been well understood by sampling of trajectories obtained from molecular dynamics simulation [16]. Moreover, the mechanical unfolding of a protein such as titin has been well analyzed by molecular dynamics simulation with consideration of the mechanical loading applied to the termini of the protein [17-19]. The basic principle of molecular dynamics simulation is to numerically solve the equation of motion, i.e. mᵢüᵢ = fᵢ, where mᵢ is the mass of the i-th atom, uᵢ is the displacement field for the i-th atom, and fᵢ is the force acting on the i-th atom [14, 20]. The computational difficulty in molecular dynamics simulation resides in the computation of the force fᵢ, which is a gradient of an anharmonic potential field prescribed to all atoms. Further, the time step for integrating the equation of motion is typically in the order of femtoseconds (10⁻¹⁵ s), whereas a protein performs its function on a much larger time scale, from at least nanoseconds (10⁻⁹ s) to a few seconds. It has been reported that until now the accessible time scale for molecular dynamics is at most in the order of nanoseconds [21]. This indicates that molecular dynamics simulation may be computationally prohibitive for large protein mechanics, where large spatial and temporal scales are required. In recent decades, normal mode analysis (NMA) has been a computational alternative to atomistic simulation such as molecular dynamics for understanding large protein mechanics [14, 22-24]. The principle of NMA is similar to that typically employed in structural mechanics. Specifically, once the stiffness matrix (Hessian matrix) for a structure is constructed, modal analysis provides the vibration information of that structure. The stiffness matrix for a protein structure is usually established based on the computation of the second gradients of the anharmonic potential field prescribed to all atoms. In general, the calculation of the stiffness matrix is implemented at the equilibrium position, which is obtained by minimization of the anharmonic potential. This implies that, for large proteins, the computation of the stiffness matrix along the minimization process is computationally expensive.
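To illustrate the NMA recipe just described, the sketch below builds the Hessian of a harmonic spring network over Cα positions (in the spirit of the elastic network models discussed below) and diagonalises it to obtain normal modes; the cutoff and force constant values are illustrative assumptions:

```python
import numpy as np

def enm_normal_modes(ca_coords, cutoff=10.0, gamma=1.0):
    # Harmonic network: Ca pairs closer than `cutoff` (Angstrom) are joined
    # by springs of identical force constant `gamma`. The 3N x 3N Hessian
    # is assembled from super-elements -gamma * (r_ij r_ij^T)/|r_ij|^2 and
    # diagonalised; the six zero eigenvalues are rigid-body motions.
    X = np.asarray(ca_coords, dtype=float)
    n = len(X)
    H = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            rij = X[j] - X[i]
            d2 = rij @ rij
            if d2 > cutoff ** 2:
                continue
            block = -gamma * np.outer(rij, rij) / d2
            H[3*i:3*i+3, 3*j:3*j+3] = block
            H[3*j:3*j+3, 3*i:3*i+3] = block
            H[3*i:3*i+3, 3*i:3*i+3] -= block
            H[3*j:3*j+3, 3*j:3*j+3] -= block
    eigvals, eigvecs = np.linalg.eigh(H)       # ascending eigenvalues
    return eigvals[6:], eigvecs[:, 6:]         # low-frequency modes first
```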
Consequently, the simplification of the potential field for a protein structure described by α-carbon atoms is the key issue in the coarse-grained modeling of proteins. Go and coworkers [22] introduced a simplified potential field for α-carbon atoms, consisting of covalent-bond terms for consecutive α-carbon atoms and non-bonded interactions (i.e. van der Waals interactions) for native contacts. The thermal fluctuation behavior of protein structures has been well described by the Go model.
Over the past decade, the Go model has attracted increasing attention as a means of gaining insight into protein unfolding mechanics. Cieplak and coworkers [25-28] have shown that molecular simulation with the Go potential yields force-displacement curves for protein unfolding that are quantitatively comparable to experimental data from single-molecule pulling experiments based on atomic force microscopy (AFM) [9]. This suggests that the Go potential may be a versatile potential field for understanding protein mechanics with computational efficiency.

Inspired by the Go model, Tirion suggested an even simpler harmonic potential for protein structures [29]. In her model, the protein structure is regarded as a harmonic spring network in which α-carbon atoms within a neighborhood are connected by harmonic springs with identical force constants. Tirion's model has revolutionized protein modeling for understanding protein dynamics relevant to biological function [30-37]. The model has enormously reduced the computational expense of estimating the low-frequency normal modes related to biological function. Moreover, it is remarkable that the low-frequency normal modes from Tirion's model are highly correlated with the displacement vector representing the conformational change of proteins [38].

Tirion's model has inspired many researchers studying protein dynamics and protein mechanics. For example, Wolynes and coworkers [39] studied the energy landscape for protein conformational change based on Tirion's model. Kim et al. [40] introduced a linear interpolation method based on Tirion's model for describing the conformational change. Brooks and coworkers [41, 42] studied conformational change based on an iterative method applied to the low-frequency normal modes of Tirion's model, with a distance constraint for computing the displacement vector related to incremental conformational change. Micheletti and coworkers [43] employed Tirion's model for depicting the thermal denaturation (thermal unfolding) of a protein's folded structure. Zheng et al. [44] studied the power-stroke mechanism of motor proteins based on Tirion's model. Recent studies by Brooks et al. [37, 45] and Kim et al. [46] have shown that the low-frequency normal modes of Tirion's model are sufficient to provide the functional modes of viral capsid proteins. Recently, Thirumalai and coworkers [47] have shown that low-frequency normal modes are able to describe the allosteric transitions of proteins. Bahar and coworkers [48] have shown that allosteric changes of protein structure are well delineated by the low-frequency normal modes of Tirion's model. Moreover, they have recently suggested that Tirion's model combined with a Markov method may enable one to understand the allosteric signal transduction corresponding to conformational change [49, 50].

Although Tirion's model has been very successful for studying protein dynamics and mechanics with high computational efficiency, model reduction schemes (coarse-graining) have been considered for large protein structures. The model reduction of Tirion's model is rational, since only a few low-frequency normal modes are necessary for describing protein dynamics such as the conformational changes relevant to biological function. Such model reduction was first introduced by Bahar and coworkers [51], who suggested a coarse-grained structure represented by nodal points whose number is less than the total number of α-carbon atoms.
In their model, the nodal points within a neighborhood were connected by harmonic springs with identical spring constants. More recently, Eom and coworkers [52] have provided a more systematic model reduction method applicable to protein structures. Specifically, they used a model condensation method in a similar spirit to the skeletonization method suggested by Rokhlin and coworkers [53, 54]. Bahar and coworkers [55] have introduced a Markov method for transforming the original molecular structure into a coarse-
grained structure. Ma and coworkers [56] have employed the substructure synthesis method, which has been broadly used in engineering structural dynamics, to obtain the low-frequency normal modes relevant to biological function.
Molecular Simulation: Normal Mode Analysis (NMA)

All-atom simulation such as molecular dynamics was first performed by Karplus and coworkers [15]. The potential field V prescribed to a protein structure is given by [14]

$$V = \sum_i \frac{K}{2}\left(b_i - b_i^0\right)^2 + \sum_i \frac{C}{2}\left(\theta_i - \theta_i^0\right)^2 + \sum_i D\left[1 + \cos\left(n\varphi_i - \delta\right)\right] + \sum_{i,j}\left(\frac{A}{r_{ij}^{12}} - \frac{B}{r_{ij}^{6}}\right) + \sum_{i,j}\frac{q_i q_j}{\chi r_{ij}} \qquad (1)$$
Here, $b_i$, $\theta_i$, and $\varphi_i$ are the i-th covalent bond length, bending angle, and dihedral (torsional) angle, respectively; $r_{ij}$ is the distance between the i-th and j-th atoms; $q_i$ is the charge of the i-th atom; and the superscript 0 indicates the equilibrium state. The first term in the potential energy represents the stretching energy of the covalent bonds, the second term the bending energy, and the third term the torsional energy, while the last two terms provide the non-bonded interactions, i.e. the van der Waals interaction and the electrostatic interaction. With the potential energy V given by Eq. 1, molecular dynamics simulation provides the trajectories of the position vectors $\mathbf{x}_i$ of all atoms. The quantity known as the cross-correlation matrix $L_{ij}$ describes the thermal fluctuation behavior, comparable to experimental quantities such as the Debye-Waller factor [16, 57]:

$$L_{ij} = \left\langle \left(\mathbf{x}_i - \mathbf{x}_i^0\right)\cdot\left(\mathbf{x}_j - \mathbf{x}_j^0\right) \right\rangle \qquad (2)$$
where $\mathbf{x}_i$ is the position vector of the i-th atom, the superscript 0 represents the equilibrium state, and the bracket symbol indicates the ensemble average (time average). The diagonal component of the cross-correlation matrix, $L_{ii}$, is the mean-square fluctuation, proportional to the Debye-Waller factor (B-factor), i.e. $B_i = (8\pi^2/3)L_{ii}$. Normal mode analysis (NMA) is also referred to as quasi-harmonic analysis [58], since the modal analysis is implemented with a harmonic approximation to the potential energy V for small displacements:

$$V \approx V_0 + \frac{1}{2}\sum_{i,j} K_{ij}\left(x_i - x_i^0\right)\left(x_j - x_j^0\right) \qquad (3)$$
Here, $x_i$ are the generalized coordinates of the atoms, and $K_{ij}$ is the Hessian matrix (stiffness matrix) of the protein structure, given by $K_{ij} = \partial^2 V/\partial x_i \partial x_j$. Quasi-harmonic analysis (or NMA) solves the eigenvalue problem $\sum_j K_{ij} v_j = \omega^2 m_i v_i$, where ω is a natural frequency, $v_i$ is the normal mode corresponding to ω, and $m_i$ is the atomic mass of the i-th atom. The cross-correlation matrix representing the thermal fluctuation motion can then be computed from equilibrium statistical mechanics [57, 59]:
$$L_{ij} = \sum_{n=7}^{3N} \frac{k_B T}{m_i \omega_n^2}\left(\mathbf{v}_{i,n} \otimes \mathbf{v}_{j,n}\right) \qquad (4)$$
where $k_B$ is Boltzmann's constant, T is the absolute temperature, and the subscript n on the natural frequency and normal mode denotes the mode number. It should be noted that the summation runs from 7 to 3N, where N is the total number of atoms, since there are six rigid-body modes corresponding to zero eigenvalues. Even though several different potential fields, such as CHARMM and AMBER, are applicable to protein structures, it has been shown that the thermal fluctuation motion and the low-frequency normal modes are consistent regardless of the details of the potential field [58].
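As an illustration of Eqs. 3 and 4, the following sketch (Python with NumPy; the function name and unit conventions are ours, and units are left generic) diagonalizes the mass-weighted Hessian, discards the six rigid-body modes, and accumulates the cross-correlation matrix and the corresponding B-factors:

    import numpy as np

    def nma_fluctuations(K, masses, kBT=1.0):
        # K: 3N x 3N stiffness (Hessian) matrix; masses: (N,) atomic masses
        m3 = np.repeat(masses, 3)                  # one mass per Cartesian dof
        minv_sqrt = 1.0 / np.sqrt(m3)
        H = K * np.outer(minv_sqrt, minv_sqrt)     # M^{-1/2} K M^{-1/2}
        w2, U = np.linalg.eigh(H)                  # ascending eigenvalues omega_n^2
        V = U * minv_sqrt[:, None]                 # modes in Cartesian space
        L = np.zeros_like(K)
        for n in range(6, K.shape[0]):             # skip the six rigid-body modes
            L += (kBT / w2[n]) * np.outer(V[:, n], V[:, n])
        # B_i = (8 pi^2 / 3) * mean-square fluctuation (trace of the 3x3 block)
        B = (8.0 * np.pi**2 / 3.0) * np.array(
            [L[3*i:3*i+3, 3*i:3*i+3].trace() for i in range(len(masses))])
        return L, B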
Go Model

As stated above, the low-frequency normal modes relevant to protein dynamics are insensitive to the details of the potential field [58]. One may then ask which of the various interactions appearing in Eq. 1 dominate the protein dynamics. Go and coworkers [22] conjectured that short-range interactions may govern the protein dynamics. Moreover, the motion of a protein structure is well described by that of the backbone chain represented by the α-carbon atoms. The Go potential is simply represented in the form [25, 26]

$$V \approx \sum_i \left[\frac{k_1}{2}\left(r_{i,i+1} - r_{i,i+1}^0\right)^2 + \frac{k_2}{4}\left(r_{i,i+1} - r_{i,i+1}^0\right)^4\right] + \sum_{i,j} 4\varepsilon\left(\frac{1}{r_{ij}^{12}} - \frac{1}{r_{ij}^{6}}\right) \qquad (5)$$
Here, $r_{i,j}$ is the distance between the i-th and j-th α-carbon atoms, and the superscript 0 indicates the equilibrium state. The first summation represents the nonlinear elastic energy of the covalent bonds, while the second summation gives the non-bonded interaction of the native contacts. A native contact is defined such that two α-carbon atoms (the i-th and j-th) are in native contact if $r_{ij}$ is less than a certain distance referred to as the cut-off distance, $d_c$, typically given as $d_c \approx 10$ Å.
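A minimal sketch of the native-contact determination follows (Python with NumPy; the function name is hypothetical); consecutive residues are excluded because they are already covalently bonded:

    import numpy as np

    def native_contacts(r0, dc=10.0):
        # r0: (N, 3) equilibrium alpha-carbon coordinates; dc: cut-off in Angstrom
        d = np.linalg.norm(r0[:, None, :] - r0[None, :, :], axis=-1)
        return [(i, j) for i in range(len(r0)) for j in range(i + 2, len(r0))
                if d[i, j] < dc]   # (i, i+1) pairs are covalent bonds, not contacts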
Tirion's Model: Elastic Network Model (ENM)

The success of the Go model led to the emergence of an even simpler model suggested by Tirion [29]. Specifically, Tirion assumed a harmonic approximation to the potential field prescribed to the α-carbon atoms. Inspired by the Go model, she proposed a harmonic potential field acting only on native contacts and covalent bonds, with an identical force constant:

$$V \approx \sum_{i,j} \frac{\gamma}{2}\left(r_{ij} - r_{ij}^0\right)^2 H\left(r_c - r_{ij}^0\right) \qquad (6)$$
where γ is the force constant, $r_{ij}$ is the distance between the i-th and j-th α-carbon atoms, the superscript 0 indicates the equilibrium state, $r_c$ is the cut-off distance defining a native contact, and H(x) is the Heaviside unit-step function defined as H(x) = 0 if x < 0 and H(x) = 1 otherwise. With Tirion's potential, Bahar and coworkers studied the Gaussian dynamics of proteins, which resulted in the emergence of the Gaussian network model (GNM) [30, 32]. The GNM assumes isotropic fluctuations; that is, the directionality of the fluctuations is not taken into account. Even though the motion of proteins is generally anisotropic, fluctuation properties such as the B-factor are well depicted by the GNM. The stiffness matrix for the GNM, also referred to as the Kirchhoff matrix, is given by

$$\Gamma_{ij} = \frac{\partial^2 V}{\partial r_i \partial r_j} = -\gamma\left(1-\delta_{ij}\right) H\left(r_c - r_{ij}^0\right) - \delta_{ij}\sum_{k\neq i}^{N}\Gamma_{ik} \qquad (7)$$
Here, N is the total number of α-carbon atoms, and $\delta_{ij}$ is the Kronecker delta, defined as $\delta_{ij} = 1$ if i = j and $\delta_{ij} = 0$ otherwise. Since isotropic motion is assumed, the GNM can provide only the low-frequency normal modes related to mean-square fluctuations. Nevertheless, the fluctuation information for every residue may provide insight into the hot-spot residues that undergo large deformations during the conformational change. In general, Tirion's model is referred to as the elastic network model (ENM) [29, 33], since the protein structure is represented by a harmonic spring network, which takes the anisotropy of the thermal fluctuations into account. For simplicity, let us consider only two α-carbon atoms i and j, connected by an entropic spring (Gaussian chain) [33, 60-64]:

$$V\left(r_{ij}\right) = \frac{\gamma}{2}\left(r_{ij} - r_{ij}^0\right)^2 \qquad (8)$$
where $r_{ij} = \left[(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2\right]^{1/2}$, with the position vector of α-carbon atom i given by $\mathbf{r}_i = x_i\mathbf{e}_x + y_i\mathbf{e}_y + z_i\mathbf{e}_z$. The stiffness matrix K for the potential given by Eq. 8 is easily computed as

$$\mathbf{K} = \begin{bmatrix} \mathbf{K}_{ij} & -\mathbf{K}_{ij} \\ -\mathbf{K}_{ij} & \mathbf{K}_{ij} \end{bmatrix} \qquad (9)$$
Here, $\mathbf{K}_{ij}$ is the 3×3 block matrix given by

$$\mathbf{K}_{ij} = \gamma\,\frac{\left(\mathbf{r}_i - \mathbf{r}_j\right)\otimes\left(\mathbf{r}_i - \mathbf{r}_j\right)}{\left|\mathbf{r}_i - \mathbf{r}_j\right|^2} \qquad (10)$$
This indicates that the stiffness matrix for an entropic spring is equivalent to the stiffness matrix for an elastic spring (linear elastic truss) with spring constant γ. Based on the 3×3 block matrix $\mathbf{K}_{ij}$, the stiffness matrix corresponding to Tirion's potential given by Eq. 6 can be computed by assembly of such block matrices. A protein structure described by the ENM is shown in Figure 1.
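The assembly of the ENM stiffness matrix from the 3×3 blocks of Eq. 10 can be sketched as follows (Python with NumPy; the function name and the default cut-off are illustrative assumptions, not values from the original works):

    import numpy as np

    def enm_stiffness(r0, gamma=1.0, rc=12.0):
        # r0: (N, 3) equilibrium alpha-carbon coordinates
        n = len(r0)
        K = np.zeros((3 * n, 3 * n))
        for i in range(n):
            for j in range(i + 1, n):
                d = r0[i] - r0[j]
                dist2 = d @ d
                if dist2 < rc * rc:
                    # 3x3 block of Eq. (10), assembled truss-style as in Eq. (9)
                    block = gamma * np.outer(d, d) / dist2
                    K[3*i:3*i+3, 3*i:3*i+3] += block
                    K[3*j:3*j+3, 3*j:3*j+3] += block
                    K[3*i:3*i+3, 3*j:3*j+3] -= block
                    K[3*j:3*j+3, 3*i:3*i+3] -= block
        return K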
Coarse-Grained Elastic Network Model

Coarse-graining of protein structures to few degrees of freedom has been attempted because a protein structure is composed of several rigid domains whose motion resembles rigid-body motion, such as rotation. Jernigan and coworkers [65, 66] suggested representing the protein structure as a complex of rigid bodies corresponding to protein domains; that is, they introduced a Hamiltonian for the rigid-body motion of each domain as well as the interactions between domains. It was shown that the dynamic behavior, such as the conformational change, of large protein complexes (e.g. GroEL-GroES, viral capsids) is well illustrated by their coarse-grained model [46, 66]. Bahar and coworkers [51] approached the coarse-graining of the ENM based on physical intuition. Their coarse-grained ENM was established in the same manner as the ENM, except that they rescaled the force constant as well as the cut-off distance. Remarkably, this simple coarse-grained model successfully predicts thermal fluctuations comparable to those of the original structure as well as to experimental data. Furthermore, a multi-scale model for proteins has been suggested in which biologically significant substructures, such as binding sites, are described by a refined model such as the ENM, whereas the remaining regions of the protein are described by a coarse-grained ENM [67].
Figure 1. Model protein, i.e. citrate synthase (pdb: 4cts), described by (a) molecular structure and (b) elastic network model.
The coarse-graining of the ENM may be systematically implemented by employing model reduction methods typically used in applied mathematics. For instance, Rokhlin and coworkers [53, 54] suggested a low-rank approximation to linear algebraic equations, resulting in a reduction of the degrees of freedom. They showed that their low-rank approximation, referred to as skeletonization, is directly applicable to electrostatics [68], hydrodynamics [53], and any other applied mathematics problem represented by a linear equation [54]. Inspired by the skeletonization scheme, we have employed a model condensation method to reconstruct the coarse-grained (low-resolution) structure from the original (refined) structure (see Figure 2) [52, 64]. We define the master residues as the residues that are retained in the coarse-grained structure, while the slave residues are the residues that are eliminated during model condensation. The dynamic motion of a protein structure is governed by the harmonic potential V in the form

$$V = \frac{1}{2}\begin{bmatrix} \mathbf{u}_M^T & \mathbf{u}_S^T \end{bmatrix}\begin{bmatrix} \mathbf{K}_M & \mathbf{K}_{MS} \\ \mathbf{K}_{SM} & \mathbf{K}_S \end{bmatrix}\begin{bmatrix} \mathbf{u}_M \\ \mathbf{u}_S \end{bmatrix} \qquad (11)$$
where the subscripts M and S indicate the master residues and slave residues, respectively. $\mathbf{K}_M$ represents the harmonic interactions among master residues, $\mathbf{K}_S$ the harmonic interactions among slave residues, and $\mathbf{K}_{MS}$ the harmonic interactions between master and slave residues. With the assumption that the slave residues remain in equilibrium, the effective stiffness matrix $\mathbf{K}_{\mathrm{eff}}$ for the coarse-grained ENM described by the master residues is computed as follows:

$$\mathbf{K}_{\mathrm{eff}} = \psi\left[\mathbf{K}\right] = \begin{bmatrix} \mathbf{I}_{3M} & -\mathbf{K}_{MS}\mathbf{K}_S^{-1} \end{bmatrix}\begin{bmatrix} \mathbf{K}_M & \mathbf{K}_{MS} \\ \mathbf{K}_{SM} & \mathbf{K}_S \end{bmatrix}\begin{bmatrix} \mathbf{I}_{3M} \\ \mathbf{0} \end{bmatrix} \qquad (12)$$

Here, ψ is the linear operator that transforms the original structure, described by the stiffness matrix K, into the coarse-grained structure, described by the effective stiffness matrix $\mathbf{K}_{\mathrm{eff}}$, and $\mathbf{I}_{3M}$ is the $3N_M \times 3N_M$ identity matrix, where $N_M$ is the total number of master residues.
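A minimal sketch of the model condensation of Eq. 12 (Python with NumPy; naming is ours) exploits the fact that, multiplied out, Eq. 12 reduces to the Guyan-type formula K_eff = K_M - K_MS K_S^{-1} K_SM:

    import numpy as np

    def condense(K, master):
        # K: full 3N x 3N ENM stiffness matrix
        # master: index array of master degrees of freedom (3 per master residue)
        slave = np.setdiff1d(np.arange(K.shape[0]), master)
        K_M  = K[np.ix_(master, master)]
        K_MS = K[np.ix_(master, slave)]
        K_SM = K[np.ix_(slave, master)]
        K_S  = K[np.ix_(slave, slave)]
        # assumes the slave residues stay in mechanical equilibrium
        return K_M - K_MS @ np.linalg.solve(K_S, K_SM)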
Conformational Fluctuation Dynamics

In recent decades, the molecular structures of various proteins have been characterized experimentally by X-ray crystallography and/or nuclear magnetic resonance (NMR) [20]. Many experimentalists are now attempting to characterize large protein structures with these techniques, and the resulting structures are deposited in the Protein Data Bank (PDB; http://www.pdb.org). The characterization of a protein structure in such experiments is typically reported in terms of the Debye-Waller factor (B-factor), which represents the mean-square fluctuation of the residues driven by the thermal energy $k_B T$. Consequently, the dynamic behavior of proteins obtained from theoretical models, whether molecular or coarse-grained, is typically compared with the B-factors obtained from experiments; that is, the conformational fluctuation behavior of proteins plays a role in validating theoretical models of protein structures.

As shown in Figure 3, the conformational fluctuations predicted by Tirion's model (ENM) and/or the GNM are quantitatively comparable to those obtained from experiments. It is quite remarkable that a simple harmonic oscillator network model governed by just two parameters, the force constant and the cut-off distance, is able to reproduce the conformational fluctuations of proteins. This remarkable result indicates that the native topology (the topology of the native contacts) plays a dominant role in the conformational fluctuations. Moreover, comparing the thermal fluctuations predicted by the ENM with those from experiments provides the force constant of the entropic springs connecting the native contacts. For instance, the F1-ATPase motor protein (pdb: 1e79) can be represented by a GNM with a force constant of 0.347 kcal/mol and a cut-off distance of 12 Å. It should be noted that one has to be cautious in selecting the cut-off distance, because a short cut-off distance may generate unphysical behavior of the structure, such as more than six rigid-body modes [33]. Conversely, if one chooses a very large cut-off distance, the structure becomes too rigid to fluctuate in a pattern similar to that of the real protein.

We now consider the coarse-grained elastic network model and its conformational fluctuation behavior. As shown in Figure 3, the coarse-grained ENM predicts a thermal fluctuation behavior, depicted by the B-factor, qualitatively comparable to that estimated from experiments and/or from the original structural model. For a protein composed of N α-carbon atoms, the prediction of the B-factor based on the ENM requires O(N³) computation, while with a
coarse-grained ENM composed of N/n α-carbon atoms the calculation of the B-factor requires O(N³/n²) computation.
Figure 2. Molecular structure of a model protein (citrate synthase) delineated by (a) elastic network model and (b) coarse-grained elastic network model.
Figure 3. B-factor of a motor protein (pdb: 1e79) predicted by elastic network model and coarse-grained elastic network model, in comparison with experimental data.
The coarse-grained ENM thus reduces the computational cost of predicting the thermal fluctuations of proteins by a factor of n² compared with the ENM, while predicting thermal fluctuations quantitatively and qualitatively comparable to those predicted by the ENM. The success of the coarse-grained ENM in depicting the conformational fluctuations of proteins may be attributed to the fact that a protein structure is usually represented by combinations of rigid domains that can be described by few degrees of freedom. This feature of protein structure has been exploited in establishing coarse-grained models of proteins. For instance, Jernigan and coworkers [66] provided the rigid-cluster model, in which a protein structure is represented by clusters of rigid bodies with soft springs connecting the rigid domains. Further, Tama and coworkers [69] suggested block normal mode analysis, in which block matrices are used to describe the rigid blocks of a protein, for delineating its conformational fluctuations.

Moreover, as the protein structure is coarse-grained further, the magnitude of the conformational fluctuations becomes larger, even though the fluctuation patterns predicted by the coarse-grained structure remain qualitatively consistent with those of the original structure. This is rational, since our coarse-graining scheme removes the harmonic springs associated with the slave residues, resulting in an increase in the overall compliance of the protein structure. This is consistent with a recent work by Bahar and coworkers [51], in which they
rescaled the force constant such that the force constant for the coarse-grained structure is larger than that for the original structure. In order for a coarse-grained ENM to predict conformational fluctuations quantitatively comparable to experimental data or to the original structural model, the force constant should be rescaled such that the overall stiffness of the protein structure described by the coarse-grained ENM is comparable to that of the protein structure described by the ENM. Figure 4 shows the thermal fluctuations predicted by the coarse-grained ENM with a rescaled force constant; the conformational fluctuations predicted by the coarse-grained ENM are very consistent with the experimental data.
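Since the predicted B-factors of a harmonic network scale as 1/γ, the rescaled force constant can be obtained from a one-parameter least-squares fit against experimental B-factors, as in the following sketch (Python with NumPy; this fitting recipe is a standard convention we assume here, not necessarily the exact procedure of Ref. [51]):

    import numpy as np

    def fit_force_constant(B_pred_unit_gamma, B_exp):
        # B_pred_unit_gamma: B-factors predicted with gamma = 1
        # B_exp: experimental B-factors; B_pred(gamma) = B_pred_unit_gamma / gamma
        s = np.dot(B_pred_unit_gamma, B_exp) / np.dot(B_pred_unit_gamma,
                                                      B_pred_unit_gamma)
        return 1.0 / s   # gamma minimizing || B_pred_unit_gamma/gamma - B_exp ||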
Lowest-Frequency Normal Mode

Coarse-grained models such as the Go model and the ENM are widely accepted in the computational biology community, since such models are able to capture the low-frequency normal modes relevant to the biological function of proteins. These coarse-grained models enormously reduce the degrees of freedom as well as simplify the potential field, yet they provide meaningful low-frequency normal modes comparable to those computed from atomistic models. This indicates that a de novo coarse-grained model for protein structures can be verified by comparing the low-frequency normal modes computed from that model with those obtained from conventional models, such as atomistic models and/or well-accepted coarse-grained models such as the Go model and Tirion's model.
Figure 4. Comparison between experimental data and the B-factor predicted by the coarse-grained elastic network model with rescaled force constant.
We have validated our coarse-grained ENM by investigating the low-frequency normal modes it predicts. For instance, we consider the lowest-frequency normal modes predicted by both the ENM and the coarse-grained ENM. As shown in Figure 5, the lowest-frequency normal mode of hemoglobin is well predicted by the coarse-grained ENM, in that it is qualitatively comparable to that obtained from the ENM. Specifically, the anti-correlated motion between substructure A (residues 1-287) and substructure B (residues 288-428) is found in both the ENM and the coarse-grained ENM.
This indicates that the coarse-grained ENM can provide the lowest-frequency normal mode, related to the functional motion of the protein structure, qualitatively comparable to that computed from the original structural model, such as the Go model or the ENM. Further, the rescaling of the force constant for the coarse-grained ENM does not affect the pattern of the lowest-frequency normal mode, since the protein topology is described only by the cut-off distance. Our coarse-graining thus allows one to predict the functional lowest-frequency normal mode of proteins while reducing the computational cost by a factor of n³.
Collective and Correlated Motion of Proteins

The conformational motion of proteins is well described as collective and/or correlated motion. As shown previously in Figure 5, the low-frequency normal modes exhibit the collective motion of a protein domain, and such modes depict the correlated motion of protein domains. Before we demonstrate the collective and/or correlated motions predicted by the ENM and the coarse-grained ENM, we review the parameters representing them. The collectivity parameter $\kappa_i$ for a given mode index i is defined as [70]
$$\kappa_i = \frac{1}{N_\omega}\exp\left[-\sum_{j=1}^{N_\omega} v_{i,j}^2 \log v_{i,j}^2\right] \qquad (13)$$
where $N_\omega$ is the total number of normal modes and $v_{i,j}$ represents the j-th component of the normal mode $\mathbf{v}_i$ corresponding to mode index i. The collectivity $\kappa_i$ lies in the range between $1/N_\omega$ and 1; a value close to $1/N_\omega$ represents a localized motion, while a value close to 1 indicates a collective motion.
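Eq. 13 can be evaluated directly from a normalized mode vector, as in this sketch (Python with NumPy; the function name is hypothetical):

    import numpy as np

    def collectivity(v):
        # v: one normal mode vector; normalize so that the squared
        # components form a probability distribution
        p = v**2 / np.sum(v**2)
        nz = p > 0
        entropy = -np.sum(p[nz] * np.log(p[nz]))
        return np.exp(entropy) / len(v)   # lies between 1/N_omega and 1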
Figure 5. Lowest-frequency normal mode for hemoglobin computed from elastic network model and coarse-grained elastic network model.
The correlated motion between residues i and j is well delineated by the correlation matrix $C_{ij}$, defined as [71]

$$C_{ij} = \frac{\left\langle \left(\mathbf{x}_i - \mathbf{x}_i^0\right)\cdot\left(\mathbf{x}_j - \mathbf{x}_j^0\right) \right\rangle}{\left\langle \left|\mathbf{x}_i - \mathbf{x}_i^0\right|^2 \right\rangle^{1/2} \left\langle \left|\mathbf{x}_j - \mathbf{x}_j^0\right|^2 \right\rangle^{1/2}} = \frac{\sum_{p=1}^{3} L_{3(i-1)+p,\,3(j-1)+p}}{\left[\left(\sum_{p=1}^{3} L_{3(i-1)+p,\,3(i-1)+p}\right)\left(\sum_{q=1}^{3} L_{3(j-1)+q,\,3(j-1)+q}\right)\right]^{1/2}} \qquad (14)$$
Here, the correlation matrix $C_{ij}$, written in Eq. 14 in terms of the cross-correlation matrix $L_{ij}$, is based on an ENM with 3N degrees of freedom. A value of $C_{ij}$ close to -1 indicates anti-correlated motion between residues i and j, whereas a value close to 1 indicates correlated motion between the two residues. When $C_{ij}$ is close to zero, the motion of residue i is uncorrelated with (orthogonal to) that of residue j.

For a clear understanding of the correlated motion described by $C_{ij}$, let us consider a simple harmonic oscillator embedded in a heat bath with thermal energy $k_B T$. The potential energy of the oscillator is $V = (\gamma/2)(u_i - u_j)^2$, where γ is the force constant (spring constant) and $u_i$ is the one-dimensional displacement of node i (see Figure 6). The Hessian matrix (stiffness matrix) for this system is

$$\mathbf{K} = \begin{bmatrix} \gamma & -\gamma \\ -\gamma & \gamma \end{bmatrix} \qquad (15)$$
which provides the natural frequencies $\omega_0 = 0$ and $\omega_1 = (2\gamma)^{1/2}$ and the corresponding normal modes $\mathbf{v}_0 = (1, 1)$ and $\mathbf{v}_1 = (1, -1)$. As stated earlier, the zero modes must be excluded when estimating the thermal fluctuations of the system. The non-zero normal mode $\mathbf{v}_1$ of the harmonic oscillator shows directly that the thermal energy drives anti-correlated motion between the two nodal points i and j. This is consistent with the value of the correlation, $C_{ij} = -1$, obtained from the definition $C_{ij} = L_{ij}/(L_{ii}L_{jj})^{1/2}$, where the cross-correlation matrix is given by

$$\mathbf{L} = \frac{k_B T}{\omega_1^2}\,\mathbf{v}_1 \otimes \mathbf{v}_1 = \frac{k_B T}{2\gamma}\begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} \qquad (16)$$
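A quick numerical check of Eqs. 15 and 16 (a self-contained Python/NumPy sketch) confirms that the off-diagonal correlation of this two-node oscillator is exactly -1:

    import numpy as np

    # Two nodes coupled by one spring: the K of Eq. (15), with gamma = kBT = 1
    gamma, kBT = 1.0, 1.0
    K = np.array([[gamma, -gamma], [-gamma, gamma]])
    w2, V = np.linalg.eigh(K)                         # w2 = [0, 2*gamma]
    L = kBT / w2[1] * np.outer(V[:, 1], V[:, 1])      # drop the zero mode, Eq. (16)
    C = L / np.sqrt(np.outer(np.diag(L), np.diag(L)))
    print(C)   # off-diagonal entries are -1: perfectly anti-correlated motion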
Thus, the correlation $C_{ij}$ is a physical parameter describing the correlated motion between two nodal points.

As shown in Figure 7, we consider the collectivity parameters $\kappa_i$ calculated from both the ENM and the coarse-grained ENM. It is quite remarkable that the coarse-grained ENM reproduces the collectivity parameters of the low-frequency normal modes quantitatively comparable to those estimated from the ENM. This indicates that the collective motion of proteins can be well depicted by a coarse-grained structure represented by few degrees of freedom. This may be attributed to the fact that a protein consists of several rigid domains describable by few degrees of freedom, and that the collective motion arises from the low-frequency functional modes. However, the coarse-grained ENM cannot predict the collectivity of the high-frequency normal modes. Specifically, as shown in Figure 7, the high-frequency
normal modes correspond to localized motions, which cannot be predicted by the coarse-grained ENM. This indicates that the localized (high-frequency) modes of a protein can only be estimated from a refined molecular model. Figure 8 shows the correlation matrix $C_{ij}$ evaluated from the ENM and the coarse-grained ENM. The collective motion of each domain is well described by both models. Further, the coarse-grained ENM provides the correlation between domains qualitatively comparable to that predicted by the ENM. However, the coarse-grained ENM overestimates the magnitude of the correlation between domains. This may be ascribed to our coarse-graining scheme: the removal of the harmonic springs associated with the slave residues leads to an overestimation of the overall flexibility and, correspondingly, of the correlated motion between domains.
Conformational Transition

The conformational change of a protein is closely related to its biological function. Atomistic simulations such as targeted MD have been employed for understanding the conformational changes of very small proteins. Remarkably, NMA has been an alternative to MD simulation, since the low-frequency normal modes at the equilibrium state are able to describe the conformational changes of proteins well. This type of NMA is referred to as principal component analysis (PCA), which diagonalizes the Hessian matrix (stiffness matrix) [72].
Figure 6. Schematic of a one-dimensional harmonic oscillator undergoing thermal fluctuation.
Since it has been shown that the low-frequency normal modes are independent of the details of the potential field [58] but depend on the topology of the protein structure [73], the ENM has been broadly employed for understanding the conformational changes of proteins. Tama and coworkers [37, 38] showed that the low-frequency normal modes obtained from the ENM are highly correlated with the vector representing the conformational change between two equilibrium states. Bahar and coworkers [74] showed that the conformational change from the tense form to the relaxed form of hemoglobin is driven by an entropic effect described by the low-frequency normal modes from the ENM. Brooks and coworkers [41, 67] predicted the conformational change depicted by low-frequency normal modes with a perturbation of Tirion's potential that incorporates a distance constraint. Kim et al. [40] suggested a linear interpolation between two conformations, with the constraint that the intermediate conformation distant from the interpolated coordinate is determined by minimization of the harmonic potential. Wolynes and coworkers [39]
provided a nonlinear elastic energy landscape for the conformational change of proteins based on the low-frequency normal modes from the ENM. Karplus and coworkers [75] employed the same methodology to describe the conformational change of a motor protein. Further, Kidera and coworkers [76] used linear response theory with Tirion's model for depicting the conformational changes of proteins.
Figure 7. Collectivity parameter κi for hemoglobin (pdb: 1a3n) estimated from elastic network model and coarse-grained elastic network model.
Figure 8. Correlation matrix Cij for a motor protein (pdb: 1e79) evaluated by (a) elastic network model and (b) coarse-grained elastic network model.
To delineate the correlation between the low-frequency normal modes and the conformational change, the parameters referred to as the overlap $I_k$ and the cumulative involvement $S_k$ are defined as

$$I_k = \frac{\left(\mathbf{r}_{\mathrm{open}} - \mathbf{r}_{\mathrm{close}}\right)\cdot\mathbf{v}_k}{\left|\mathbf{r}_{\mathrm{open}} - \mathbf{r}_{\mathrm{close}}\right|} \qquad (17a)$$

and

$$S_k = \sum_{p=1}^{k} I_p^2 \qquad (17b)$$
Here, $\mathbf{r}_{\mathrm{open}}$ and $\mathbf{r}_{\mathrm{close}}$ represent the position vectors of the open and closed forms, respectively, and $\mathbf{v}_k$ is the k-th normal mode. $I_k$ indicates the correlation between the k-th normal mode and the conformational change, and $S_k$ quantifies the contribution of the low-frequency normal modes (from the first mode to the k-th mode) to the conformational change. Figure 9 shows the overlap and cumulative involvement predicted by the ENM and the coarse-grained ENM. It is remarkable that both models predict that the conformational change is highly correlated with a few low-frequency normal modes. This is consistent with the recent finding that a few low-frequency normal modes are sufficient to represent the conformational change of a protein.
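Eqs. 17a and 17b translate directly into code, as in the following sketch (Python with NumPy; the names are ours):

    import numpy as np

    def overlap_involvement(r_open, r_close, modes):
        # r_open, r_close: (3N,) position vectors of the two conformations
        # modes: (3N, K) matrix whose columns are normalized normal modes
        d = r_open - r_close
        d = d / np.linalg.norm(d)
        I = modes.T @ d          # overlap of each mode with the change, Eq. (17a)
        S = np.cumsum(I**2)      # cumulative involvement, Eq. (17b)
        return I, S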
Figure 9. Overlap Ik and cumulative involvement Sk for citrate synthase computed from the elastic network model and the coarse-grained elastic network model. Blue represents the calculation based on the elastic network model, whereas red indicates the computation based on the coarse-grained elastic network model. The bar graph shows the square of the overlap, and the dotted line shows the cumulative involvement.
Conclusion

In this article, we have reviewed the current state of the art in the coarse-graining of protein molecules for understanding their dynamics relevant to biological function. The coarse-graining procedure is usually acceptable as long as the protein topology related to the dynamics is sufficiently delineated by the coarse-grained model. We briefly overviewed the broadly accepted coarse-grained models, such as the Go model and Tirion's model (ENM), which enable
one to gain insight into protein dynamics such as the conformational fluctuations and conformational changes related to biological function. Moreover, recently developed coarse-grained ENM models were considered, and it was shown that such coarse-grained ENMs may allow one to achieve fast computation of the low-frequency normal modes related to biological function. The feasibility of coarse-graining a protein structure is attributed to the fact that a protein structure is usually composed of several rigid domains that can be described by few degrees of freedom. As shown above, both the ENM and the coarse-grained ENM predict low-frequency normal modes and thermal fluctuations quantitatively similar to those obtained from experiments. Further, both the ENM and coarse-grained models such as the rigid-cluster model predict the conformational transitions between two conformations. However, it remains to be validated whether the coarse-grained ENM is acceptable for the prediction of conformational change. To our knowledge, this issue has not been well considered, except in a recent work by Brooks and coworkers [67], who employed a mixed ENM for understanding conformational change.

As stated above, coarse-grained models have been successful for studying the conformational dynamics of proteins. However, since some proteins such as titin perform a mechanical function, protein unfolding behavior has to be well understood to gain insight into the mechanical function. Atomistic simulation is still restricted to small proteins, leading to the consideration of coarse-grained models. A recent study by Cieplak and coworkers [28] suggested a molecular model based on the Go potential under mechanical loading. It is remarkable that their model allows them to predict the force-displacement relation under mechanical loading, comparable to the results of AFM experiments. Moreover, McCammon and coworkers [77] employed Tirion's potential with mechanical loading applied to the termini of a protein; it was shown that even Tirion's model is acceptable for gaining insight into the mechanical unfolding of proteins. Recently, Dietz and Rief [78] revisited Tirion's model with a bond-breaking model for protein unfolding mechanics. Their elastic bond network model [78] allowed them to predict the probability distribution of the rupture force, quantitatively comparable to AFM experimental data [79].

In summary, coarse-grained models such as the Go model and Tirion's model have been reviewed for protein dynamics relevant to biological function. Moreover, such models can be extended to the understanding of the mechanical unfolding of protein structures. In conclusion, coarse-grained models such as Tirion's model may be versatile tools for understanding large-protein dynamics and/or large-protein unfolding mechanics.
Acknowledgement

This work was supported in part by the Nano-Bio Research Center at KIST (to K.E.) and by the LG YONAM FOUNDATION and the Basic Research Program of the Korea Science and Engineering Foundation (KOSEF) under grant No. R01-2007-000-10497-0 (to S.N.).
References

[1] Kolomeisky, A.B. and M.E. Fisher, Molecular Motors: A Theorist's Perspective. Annu. Rev. Phys. Chem., 2007. 58: 675.
[2] Duncan, T.M., V.V. Bulygin, Y. Zhou, M.L. Hutcheon, and R.L. Cross, Rotation of subunits during catalysis by Escherichia coli F1-ATPase. Proc. Natl. Acad. Sci. USA, 1995. 92: 10964.
[3] Kinoshita, K., R. Yasuda, K. Noji, S. Ishiwata, and M. Yoshida, F1-ATPase: a rotary motor made of a single molecule. Cell, 1998. 93: 21.
[4] Noji, H., R. Yasuda, M. Yoshida, and K. Kinosita, Direct observation of the rotation of F1-ATPase. Nature, 1997. 386: 299.
[5] Sabbert, D., S. Engelbrecht, and W. Junge, Functional and idling rotatory motion within F1-ATPase. Proc. Natl. Acad. Sci. USA, 1997. 94: 4401.
[6] Yasuda, R., H. Noji, K. Kinosita, and M. Yoshida, F1-ATPase is a highly efficient molecular motor that rotates with discrete 120° steps. Cell, 1998. 93: 1117.
[7] Ranson, N.A., D.K. Clare, G.W. Farr, D. Houldershaw, A.L. Horwich, and H.R. Saibil, Allosteric signaling of ATP hydrolysis in GroEL-GroES complexes. Nat. Struct. Mol. Biol., 2006. 13: 147.
[8] Keskin, O., I. Bahar, D. Flatow, D.G. Covell, and R.L. Jernigan, Molecular mechanisms of chaperonin GroEL-GroES function. Biochemistry, 2002. 41: 491.
[9] Marszalek, P.E., H. Lu, H.B. Li, M. Carrion-Vazquez, A.F. Oberhauser, K. Schulten, and J.M. Fernandez, Mechanical unfolding intermediates in titin modules. Nature, 1999. 402: 100.
[10] Carrion-Vazquez, M., A.F. Oberhauser, T.E. Fisher, P.E. Marszalek, H. Li, and J.M. Fernandez, Mechanical design of proteins studied by single-molecule force spectroscopy and protein engineering. Prog. Biophys. Mol. Biol., 2000. 74: 63.
[11] Li, H., A.F. Oberhauser, S.B. Fowler, J. Clarke, and J.M. Fernandez, Atomic force microscopy reveals the mechanical design of a modular protein. Proc. Natl. Acad. Sci. USA, 2000. 97: 6527.
[12] Oberhauser, A.F., P.E. Marszalek, H.P. Erickson, and J.M. Fernandez, The molecular elasticity of the extracellular matrix protein tenascin. Nature, 1998. 393: 181.
[13] Schafer, L.V., E.M. Muller, H.E. Gaub, and H. Grubmuller, Elastic Properties of Photoswitchable Azobenzene Polymers from Molecular Dynamics Simulations. Angew. Chem. Int. Ed., 2007. 46: 2232.
[14] McCammon, J.A. and S. Harvey, Dynamics of proteins and nucleic acids. 1987, Cambridge: Cambridge University Press.
[15] McCammon, J.A., B.R. Gelin, and M. Karplus, Dynamics of folded proteins. Nature, 1977. 267: 585.
[16] Amadei, A., A.B.M. Linssen, and H.J.C. Berendsen, Essential Dynamics of Proteins. Proteins: Struct. Funct. Genet., 1993. 17: 412.
[17] Lu, H., B. Isralewitz, A. Krammer, V. Vogel, and K. Schulten, Unfolding of titin immunoglobulin domains by steered molecular dynamics simulation. Biophys. J., 1998. 75: 662.
[18] Lu, H. and K. Schulten, Steered molecular dynamics simulations of force-induced protein domain unfolding. Proteins, 1999. 35: 453.
[19] Sotomayor, M. and K. Schulten, Single-Molecule Experiments in Vitro and in Silico. Science, 2007. 316: 1144.
[20] Brooks, C.L., M. Karplus, and B.M. Pettit, Adv. Chem. Phys., 1988. 71: 1.
[21] Elber, R., Long-timescale simulation methods. Curr. Opin. Struct. Biol., 2005. 15: 151.
[22] Hayward, S. and N. Go, Collective Variable Description of Native Protein Dynamics. Annu. Rev. Phys. Chem., 1995. 46: 223.
[23] Cui, Q., G.H. Li, J.P. Ma, and M. Karplus, A normal mode analysis of structural plasticity in the biomolecular motor F-1-ATPase. J. Mol. Biol., 2004. 340: 345.
[24] Ma, J.P., Usefulness and limitations of normal mode analysis in modeling dynamics of biomolecular complexes. Structure, 2005. 13: 373.
[25] Cieplak, M., T.X. Hoang, and M.O. Robbins, Thermal folding and mechanical unfolding pathways of protein secondary structures. Proteins: Struct. Funct. Genet., 2002. 49: 104.
[26] Cieplak, M., T.X. Hoang, and M.O. Robbins, Folding and stretching in a Go-like model of titin. Proteins: Struct. Funct. Genet., 2002. 49: 114.
[27] Cieplak, M., T.X. Hoang, and M.O. Robbins, Thermal effects in stretching of Go-like models of titin and secondary structures. Proteins: Struct. Funct. Bioinfo., 2004. 56: 285.
[28] Cieplak, M., A. Pastore, and T.X. Hoang, Mechanical properties of the domains of titin in a Go-like model. J. Chem. Phys., 2005. 122: 054906.
[29] Tirion, M.M., Large amplitude elastic motions in proteins from a single-parameter, atomic analysis. Phys. Rev. Lett., 1996. 77: 1905.
[30] Haliloglu, T., I. Bahar, and B. Erman, Gaussian dynamics of folded proteins. Phys. Rev. Lett., 1997. 79: 3090.
[31] Bahar, I., A.R. Atilgan, M.C. Demirel, and B. Erman, Vibrational dynamics of folded proteins: Significance of slow and fast motions in relation to function and stability. Phys. Rev. Lett., 1998. 80: 2733.
[32] Bahar, I., B. Erman, R.L. Jernigan, A.R. Atilgan, and D.G. Covell, Collective motions in HIV-1 reverse transcriptase: Examination of flexibility and enzyme function. J. Mol. Biol., 1999. 285: 1023.
[33] Atilgan, A.R., S.R. Durell, R.L. Jernigan, M.C. Demirel, O. Keskin, and I. Bahar, Anisotropy of fluctuation dynamics of proteins with an elastic network model. Biophys. J., 2001. 80: 505.
[34] Bahar, I. and A.J. Rader, Coarse-grained normal mode analysis in structural biology. Curr. Opin. Struct. Biol., 2005. 15: 586.
[35] Tozzini, V., Coarse-grained models for proteins. Curr. Opin. Struct. Biol., 2005. 15: 144.
[36] Cui, Q. and I. Bahar, Normal Mode Analysis: Theory and Applications to Biological and Chemical Systems. 2005: CRC Press.
[37] Tama, F. and C.L. Brooks, Symmetry, form, and shape: Guiding principles for robustness in macromolecular machines. Annu. Rev. Biophys. Biomol. Struct., 2006. 35: 115.
[38] Tama, F. and Y.H. Sanejouand, Conformational change of proteins arising from normal mode calculations. Protein Eng., 2001. 14: 1.
[39] Miyashita, O., J.N. Onuchic, and P.G. Wolynes, Nonlinear elasticity, proteinquakes, and the energy landscapes of functional transitions in proteins. Proc. Natl. Acad. Sci. USA, 2003. 100: 12570.
[40] Kim, M.K., W. Li, B.A. Shapiro, and G.S. Chirikjian, A comparison between elastic network interpolation and MD simulation of 16S ribosomal RNA. J. Biomol. Struct. Dyn., 2003. 21: 395.
[41] Zheng, W.J. and B.R. Brooks, Normal-modes-based prediction of protein conformational changes guided by distance constraints. Biophys. J., 2005. 88: 3109.
[42] Zheng, W.J. and B.R. Brooks, Modeling protein conformational changes by iterative fitting of distance constraints using reoriented normal modes. Biophys. J., 2006. 90: 4327.
[43] Micheletti, C., J.R. Banavar, and A. Maritan, Conformations of Proteins in Equilibrium. Phys. Rev. Lett., 2001. 87: 088102.
[44] Zheng, W.J. and S. Doniach, A comparative study of motor-protein motions by using a simple elastic-network model. Proc. Natl. Acad. Sci. USA, 2003. 100: 13253.
[45] Tama, F. and C.L. Brooks, Diversity and Identity of Mechanical Properties of Icosahedral Viral Capsids Studied with Elastic Network Normal Mode Analysis. J. Mol. Biol., 2005. 345: 299.
[46] Kim, M.K., R.L. Jernigan, and G.S. Chirikjian, An elastic network model of HK97 capsid maturation. J. Struct. Biol., 2003. 143: 107.
[47] Zheng, W.J., B.R. Brooks, and D. Thirumalai, Low-frequency normal modes that describe allosteric transitions in biological nanomachines are robust to sequence variations. Proc. Natl. Acad. Sci. USA, 2006. 103: 7664.
[48] Tobi, D. and I. Bahar, Structural changes involved in protein binding correlate with intrinsic motions of proteins in the unbound state. Proc. Natl. Acad. Sci. USA, 2005. 102: 18908.
[49] Chennubhotla, C. and I. Bahar, Markov propagation of allosteric effects in biomolecular systems: application to GroEL-GroES. Mol. Syst. Biol., 2006. 2: Article No 36.
[50] Chennubhotla, C. and I. Bahar, Signal propagation in proteins and relation to equilibrium fluctuations. PLOS Computat. Biol., 2007. 3: 1716.
[51] Doruker, P., R.L. Jernigan, and I. Bahar, Dynamics of large proteins through hierarchical levels of coarse-grained structures. J. Comput. Chem., 2002. 23: 119.
[52] Eom, K., S.-C. Baek, J.-H. Ahn, and S. Na, Coarse-graining of protein structures for normal mode studies. J. Comput. Chem., 2007. 28: 1400.
[53] Cheng, H., Z. Gimbutas, P.G. Martinsson, and V. Rokhlin, On the compression of low rank matrices. SIAM J. Sci. Comput., 2005. 26: 1389.
[54] Liberty, E., F. Woolfe, P.-G. Martinsson, V. Rokhlin, and M. Tygert, Randomized algorithms for the low-rank approximation of matrices. Proc. Natl. Acad. Sci. USA, 2007. 104: 20167.
[55] Chennubhotla, C. and I. Bahar, Markov methods for hierarchical coarse-graining of large protein dynamics, in Lecture Notes in Computer Science. 2006. p. 379.
[56] Ming, D., Y. Kong, Y. Wu, and J. Ma, Substructure synthesis method for simulating large molecular complexes. Proc. Natl. Acad. Sci., 2003. 100: 104.
[57] Chandler, D., Introduction to modern statistical mechanics. 1987: Oxford University Press.
[58] Teeter, M.M. and D.A. Case, Harmonic and quasiharmonic descriptions of crambin. J. Phys. Chem., 1990. 94: 8091.
[59] Weiner, J.H., Statistical mechanics of elasticity. 1983: Dover publication.
[60] Doi, M. and S.F. Edwards, The Theory of Polymer Dynamics. 1986, New York: Oxford University Press.
[61] Makarov, D.E. and G.J. Rodin, Configurational entropy and mechanical properties of cross-linked polymer chains: Implications for protein and RNA folding. Phys. Rev. E., 2002. 66: 011908.
[62] Eom, K., P.C. Li, D.E. Makarov, and G.J. Rodin, Relationship between the Mechanical Properties and Topology of Cross-Linked Polymer Molecules: Parallel Strands Maximize the Strength of Model Polymers and Protein Domains. J. Phys. Chem. B, 2003. 107: 8730.
[63] Eom, K., D.E. Makarov, and G.J. Rodin, Theoretical studies of the kinetics of mechanical unfolding of cross-linked polymer chains and their implications for single-molecule pulling experiments. Phys. Rev. E., 2005. 71: 021904.
[64] Eom, K., J.H. Ahn, S.C. Baek, J.I. Kim, and S. Na, Robust reduction method for biomolecules modeling. CMC-Computers Materials and Continua, 2007. 6: 35.
[65] Kurkcuoglu, O., R.L. Jernigan, and P. Doruker, Mixed levels of coarse-graining of large proteins using elastic network model succeeds in extracting the slowest motions. Polymer, 2004. 45: 649.
[66] Kim, M.K., R.L. Jernigan, and G.S. Chirikjian, Rigid-cluster models of conformational transitions in macromolecular machines and assemblies. Biophys. J., 2005. 89: 43.
[67] Zheng, W., B.R. Brooks, and G. Hummer, Protein conformational transitions explored by mixed elastic network models. Proteins: Struct. Funct. Bioinfo., 2007. 69: 43.
[68] Martinsson, P.G., Fast evaluation of electro-static interactions in multi-phase dielectric media. J. Comput. Phys., 2006. 211: 289.
[69] Tama, F., F.X. Gadea, O. Marques, and Y.H. Sanejouand, Building-block approach for determining low-frequency normal modes of macromolecules. Proteins: Struct. Funct. Genet., 2000. 41: 1.
[70] Lienin, S.F. and R. Bruschweiler, Characterization of collective and anisotropic reorientational protein dynamics. Phys. Rev. Lett., 2000. 84: 5439.
[71] Van Wynsberghe, A.W. and Q. Cui, Interpreting correlated motions using normal mode analysis. Structure, 2006. 14: 1647.
[72] Lou, H. and R.I. Cukier, Molecular Dynamics of Apo-Adenylate Kinase: A Principal Component Analysis. J. Phys. Chem. B, 2006. 110: 12796.
[73] Lu, M.Y. and J.P. Ma, The role of shape in determining molecular motions. Biophys. J., 2005. 89: 2395.
[74] Xu, C.Y., D. Tobi, and I. Bahar, Allosteric changes in protein structure computed by a simple mechanical model: Hemoglobin T <-> R2 transition. J. Mol. Biol., 2003. 333: 153.
[75] Maragakis, P. and M. Karplus, Large amplitude conformational change in proteins explored with a plastic network model: Adenylate kinase. J. Mol. Biol., 2005. 352: 807.
[76] Ikeguchi, M., J. Ueno, M. Sato, and A. Kidera, Protein structural change upon ligand binding: Linear response theory. Phys. Rev. Lett., 2005. 94.
[77] Shen, T., L.S. Canino, and J.A. McCammon, Unfolding Proteins under External Forces: A Solvable Model under the Self-Consistent Pair Contact Probability Approximation. Phys. Rev. Lett., 2002. 89: 068103.
[78] Dietz, H. and M. Rief, An elastic bond network model for protein unfolding mechanics, unpublished.
[79] Dietz, H., F. Berkemeier, M. Bertz, and M. Rief, Anisotropic deformation response of single protein molecules. Proc. Natl. Acad. Sci. USA, 2006. 103: 12724.
In: Computational Biology: New Research Editor: Alona S. Russe
ISBN: 978-1-60692-040-4 © 2009 Nova Science Publishers, Inc.
Chapter 8
DIFFERENTIATING SUPERFICIAL AND ADVANCED UROTHELIAL BLADDER CARCINOMAS BASED ON GENE EXPRESSION PROFILES ANALYZED USING SELF-ORGANIZING MAPS

Phei Lang Chang∗, Ke Hung Tsui, Tzu Hao Wang♣, Chien Lun Chen and Sheng Hui Lee

Department of Surgery, Division of Urology and Chang Gung Bioinformatics Center, and ♣ Genomic Medicine Research Core Laboratory, Chang Gung Memorial Hospital, Chang Gung University, Taipei, Taiwan
Abstract

The aim of this study was to differentiate between superficial and advanced bladder cancers by analyzing the gene expression profiles of these tumors using self-organizing maps (SOMs). We also used the GoMiner software for the biological interpretation of 473 genes of interest. Materials and Methods: Between December 2003 and November 2004, 17 patients with urothelial bladder cancers who were admitted to the Chang Gung Memorial Hospital for transurethral resection of the tumor were included in this study. The gene expression data comprised 7400 cDNAs in 17 arrays. The software GeneCluster 2.0 was used for analyzing the gene expression data using SOMs. We used a 2-cluster SOM to automatically cluster the set of 17 tissue samples into superficial and advanced bladder cancers based on the gene expression patterns. We also used the GoMiner software for the biological interpretation of the top 473 genes of interest. Results: The patients included 11 males and 6 females. Pathological studies confirmed the presence of superficial tumors in 9 patients and advanced tumors in 8 patients. Of the 7400 genes analyzed, 473 showed significant changes in their expression; of these, 268 were up-regulated and 205 were down-regulated. Using the top 473 genes, SOMs were used to differentiate between the gene expression patterns of superficial and advanced bladder cancer tissue samples.
∗ Address correspondence to: Phei Lang Chang, M.D., Chairman, Department of Surgery, Professor, Department of Urology, Chang Gung Memorial Hospital, No. 5, Fu-Shing Street, Kweishan, Taoyuan 333, Taiwan; Tel: 886-3-3281200-2137; Fax: 886-3-3274541; E-mail: [email protected]
The patient tissue samples were clustered into 2 groups, namely superficial and advanced bladder cancers, comprising 10 and 7 samples, respectively. Only one tissue sample, from a patient with advanced bladder cancer, was clustered into the superficial bladder cancer group. This analysis had a high accuracy rate of 94% (16/17). The top 473 genes were also classified into biologically coherent categories by the GoMiner software. The results revealed that 452, 435, and 452 genes were associated with biological processes, cellular components, and molecular functions, respectively. Conclusion: Based on our results, we believe that superficial and advanced urothelial bladder cancers can be differentiated by their gene expression profiles analyzed by SOMs. The SOM method may be used in microarray data analysis to distinguish tumor stages and predict clinical outcomes. The genes that are uniquely expressed in either stage of bladder cancer can be considered possible candidate biomarkers.
Introduction

Urothelial carcinoma of the urinary bladder is a common cancer in adult patients and is the second most common cancer of the genitourinary tract.[1] It accounts for 7% and 2% of primary carcinomas in males and females, respectively. Two-thirds of all urothelial carcinoma patients have superficial tumors, and 30% of these tumors become infiltrative.[2] The incidence of primary urinary bladder carcinomas in the United States is approximately 54,000 per year; it is the eighth leading cancer in women and the fourth leading cancer in men.[3] Although the incidence of urinary bladder cancer is lower than that of other cancers, its recurrence rate is the highest.[4] The overall mortality of urinary bladder cancer is approximately 30%.[2] The treatment of bladder cancer is based on the clinical stage and the degree of differentiation of the tumor. Generally, superficial bladder tumors are treated by transurethral resection (TUR) with or without intravesical chemotherapy. Radical cystectomy is an effective treatment for patients with advanced muscle-invasive bladder tumors.[5] Although bladder cancer responds to chemotherapy, the prognosis is still poor for metastatic tumors.[6] The differential diagnosis of superficial and advanced bladder cancers therefore plays an important role in determining the treatment strategy and in evaluating the prognosis of these cancers.

The behavior of bladder cancers varies according to the clinical stage and histological grade of the tumor. The diagnostic modalities clinically used for the diagnosis and staging of bladder cancers are urinary cytology, cystourethroscopy, chest x-ray, intravenous urography, intravesical ultrasonography, computed tomography (CT), magnetic resonance imaging (MRI), positron emission tomography, and radionuclide bone scan.[7] Recently, bladder tumor staging and outcome prediction have been carried out based on the genetic alterations and molecular markers of bladder cancer.[1] Gene expression profiling by DNA microarray technology enables the identification of the genes responsible for the heterogeneity, recurrence, and progression of bladder cancers. This profiling enables the selection of appropriate treatment strategies and the prediction of disease outcome.[8] The identification of molecular genetic alterations in bladder cancer has enhanced its diagnosis, and many such alterations have been identified in superficial and advanced bladder cancers. The roles of telomerase, methylation, oncogenes, and tumor suppressor genes in the development and progression of bladder cancer have been studied for many years, and certain genetic alterations have proven useful in the diagnosis and treatment of superficial and advanced bladder cancers.[9]
Microarray analysis is an effective tool for understanding the progression and metastasis of bladder cancer.[10] However, the data obtained from microarrays are extremely complicated, and it is crucial to select an appropriate method to analyze them. The features of self-organizing maps (SOMs) make them well suited to the clustering of gene expression patterns. They are particularly useful for exploratory data analysis and visualization, and they impose a partial structure on the clusters, thereby facilitating the interpretation of the data. SOMs have been tested on a wide variety of problems and were found to be significantly superior in both robustness and accuracy.[11] GeneCluster 2.0 is a software package for analyzing gene expression data. It offers various methods to evaluate class predictors, visualize marker lists, cluster data, and validate results. Released in June 1999, it implements the SOM algorithm popularized by Tamayo et al. as well as various standard preprocessing methodologies used in microarray analysis.[12] Therefore, we selected this software to analyze the microarray data.[13] Further, we used the GoMiner software to study the characteristics of noteworthy genes. This program organizes the lists of expressed genes generated from microarray analysis for biological interpretation within the context of the Gene Ontology (GO) database, providing quantitative and statistical output files and useful visualizations.[14] We identified the fundamental patterns of the top 473 genes inherent in the gene expression data of patients with superficial or advanced bladder cancers. Further, we analyzed and visualized the gene expression data obtained by complementary DNA (cDNA) microarray analysis using the SOM method. The aim of this study was to differentiate between superficial and advanced bladder cancers by analyzing the gene expression profiles of these tumors using SOMs. We also used the GoMiner software for the biological interpretation of the 473 genes of interest.
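To illustrate the idea of a 2-cluster SOM on expression profiles, the following is a minimal sketch (Python with NumPy, operating on a hypothetical preprocessed expression matrix); it is not the GeneCluster 2.0 implementation, and with only two units the SOM neighborhood function degenerates, so each update simply moves the winning prototype toward the presented sample:

    import numpy as np

    def som_two_clusters(X, n_iter=2000, lr0=0.5, seed=0):
        # X: (n_samples, n_genes) preprocessed expression matrix
        # (the study used 17 arrays x 7400 cDNAs)
        rng = np.random.default_rng(seed)
        w = X[rng.choice(len(X), size=2, replace=False)].astype(float)  # prototypes
        for t in range(n_iter):
            x = X[rng.integers(len(X))]
            k = np.argmin(np.linalg.norm(w - x, axis=1))   # best-matching unit
            w[k] += lr0 * (1 - t / n_iter) * (x - w[k])    # decaying learning rate
        labels = np.argmin(np.linalg.norm(X[:, None] - w[None], axis=2), axis=1)
        return labels, w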
Materials and Methods

Between December 2003 and November 2004, 17 patients with urothelial bladder cancers who were admitted to the Chang Gung Memorial Hospital underwent TUR and were included in this study. Tumor specimens were collected after TUR and divided into 2 groups: fresh tissue and formalin-fixed tissue. Fresh tumor tissue samples obtained during TUR were immediately cut into small pieces, snap-frozen in liquid nitrogen, and stored at -70°C for microarray studies. Formalin-fixed tissues were stained with hematoxylin and eosin (H&E) for pathological examination.
Extraction of RNA from human tissue samples

During TUR, samples of cancerous and normal neighboring tissues were obtained from the 17 patients. For RNA isolation, 1 ml of Trizol reagent (Invitrogen, Carlsbad, Calif, USA) was added to every 50–100 mg of pulverized frozen tissue immediately after removal; we used 1 ml of Trizol for (5–10) × 10^6 cultured cells. We incubated the homogenized tissue at room temperature for 5 min and then added 0.2 ml of chloroform. The mixture was shaken vigorously for 15 s, incubated at room temperature for another 3 min, and centrifuged at 12,000 × g at 4°C for 15 min. The upper colorless aqueous phase was transferred to a new
microfuge tube. The RNA in this phase was precipitated by adding 0.5 ml of isopropanol per ml of Trizol reagent and centrifuging at 12,000 × g at 4°C for 15 min. The RNA pellet was washed with 1 ml of 75% ethanol, vacuum dried briefly, and dissolved in RNase-free water. RNA quantity and quality were initially evaluated by measuring absorbance at 260/280 nm. We then evaluated RNA quality and quantity using the RNA LabChip, a lab-on-a-chip device, and the Bioanalyzer 2100 (Agilent, Calif, USA). The Agilent Bioanalyzer provides a common platform for sample handling, separation, detection, and data analysis. With as little as 25–500 ng of RNA input, the Bioanalyzer calculates the ratio of 28S to 18S ribosomal RNA and reports the concentration of total RNA.
Microarray Procedures

We used 5 to 10 μg of total RNA for labeling and hybridization with the 3DNA Submicro Expression Array Detection Kit (Genisphere, Hatfield, PA) according to the manufacturer's protocol. Indirect labeling of cDNA targets using the 3DNA kit was performed in 2 steps. First, 10 μg of total RNA from each experimental (test) and control sample was reverse transcribed to target cDNA using an oligo-d(T) primer tagged with either Cy3- or Cy5-specific 3DNA-capture sequences. Both the synthesized test and control cDNAs were then competitively hybridized to probe cDNAs spotted on the microarray slides for 16 hours (overnight) in 10-slide hybridization chambers (Genetix Ltd, UK). In the second step, the synthesized tagged target cDNAs were labeled in situ for 2 hours on the microarray with Cy3-3DNA or Cy5-3DNA based on sequence complementation to the capture sequence. Replications of the experiment with the dye-swapping microarray design, previously described, were used to minimize the statistical variance of the data.[15, 16]

After hybridization and washing, the slides were scanned with a ChipReader confocal laser scanner (Virtek, Waterloo, Canada). Spot and background intensities were obtained using the GenePix Pro 4.1 software (Axon Instruments, Union City, CA). Within-slide normalization based on the locally weighted regression (LOWESS) algorithm was then carried out.
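To make the normalization step concrete, the following is a minimal sketch of MA-plot LOWESS normalization for a single two-channel slide. It is an illustration rather than the pipeline actually used in the study; the function name, smoothing fraction, and the use of statsmodels are our assumptions.

```python
# Sketch of within-slide LOWESS normalization on an MA plot (illustrative only).
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def lowess_normalize(red, green, frac=0.3):
    """Return LOWESS-normalized log-ratios M for one two-channel slide."""
    M = np.log2(red) - np.log2(green)           # log-ratio per spot
    A = 0.5 * (np.log2(red) + np.log2(green))   # mean log-intensity per spot
    # Fit the intensity-dependent trend of M on A, then subtract it.
    trend = lowess(M, A, frac=frac, return_sorted=False)
    return M - trend

# Toy usage with simulated spot intensities carrying a dye bias:
rng = np.random.default_rng(0)
green = rng.lognormal(8, 1, 7400)
red = green * rng.lognormal(0.1, 0.2, 7400)
print(lowess_normalize(red, green).mean())  # close to 0 after normalization
```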
SOMs were developed by Teuvo Kohonen as a method to visualize high-dimensional data.[17] The SOM is an unsupervised neural network algorithm and an excellent tool for data mining. Unsupervised learning involves the aggregation of a diverse collection of data into clusters based on different features in a data set. After preprocessing of the data, a SOM can cluster the data into biologically meaningful groups. It projects the input space onto prototypes arranged on a regular grid, which can be used to explore the properties of the data. When the number of SOM units is large, similar units are clustered to facilitate effective analysis of the map and the data.[18] The relationships between the known functional classes of genes are studied by analyzing their distribution on the SOM. The SOM presents a non-linear mapping of the data as a 2-dimensional map grid that can be used as a data analysis tool for generating hypotheses on the relationships between genes. We used SOMs to visualize a high-dimensional data set of gene expression patterns on a graphical map display, on which the similarity relationships within the analyzed data can be interpreted. SOMs were used to differentiate between the gene expression patterns of superficial and advanced bladder cancer tissue samples. The details of the SOM analysis method are shown in Figure 1.
Figure 1. The SOM analysis method. (SOMs: Self-Organizing Maps)
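As a concrete picture of the SOM procedure, here is a toy one-dimensional SOM in plain numpy that clusters sample expression profiles. It is a stand-in for illustration only, not GeneCluster's implementation, and all names and parameter choices are ours.

```python
# Toy 1-D self-organizing map (illustrative stand-in for GeneCluster's SOM).
import numpy as np

def som_cluster(data, n_units=2, n_iter=2000, seed=0):
    """Assign each row of `data` to one of `n_units` SOM units on a 1-D grid."""
    rng = np.random.default_rng(seed)
    proto = data[rng.choice(len(data), n_units, replace=False)].astype(float)
    for t in range(n_iter):
        lr = 0.5 * (1 - t / n_iter)                       # decaying learning rate
        sigma = max(1.0 * (1 - t / n_iter), 0.5)          # neighborhood width
        x = data[rng.integers(len(data))]                 # random training sample
        bmu = np.argmin(((proto - x) ** 2).sum(axis=1))   # best-matching unit
        d = np.abs(np.arange(n_units) - bmu)              # grid distance to BMU
        h = np.exp(-(d ** 2) / (2 * sigma ** 2))          # neighborhood kernel
        proto += lr * h[:, None] * (x - proto)            # pull units toward x
    return np.array([np.argmin(((proto - x) ** 2).sum(axis=1)) for x in data])

# Toy usage: 17 samples x 473 genes (random here; expression values in practice).
samples = np.random.default_rng(1).normal(size=(17, 473))
print(som_cluster(samples, n_units=2))
```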
A file containing the gene intensity values for each scanned sample, along with the associated annotations, was used to enter data into the GeneCluster 2.0 software. Necessary image processing steps such as background subtraction and dye correction were carried out. Once loaded, the experiment file was filtered and normalized using a variety of algorithms, including thresholding, scaling, normalizing to a given mean and variance, fold-change analysis, and exclusion of high- and low-scoring features. We also randomized the dataset by bootstrap sampling columns from the dataset with replacement. Unsupervised learning, or clustering, is implemented in GeneCluster by the SOM algorithm, which allowed us to perform batch runs varying the number of clusters and the cluster geometries. Results were viewed in a visualizer that displays cluster profiles and relevant information about the cluster members.
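The filtering steps listed above can be sketched as follows; the cutoff values are illustrative, not the ones used in the study.

```python
# Sketch of GeneCluster-style filtering: thresholding, a variation (fold-change)
# filter, and per-gene normalization to mean 0 and variance 1. Illustrative only.
import numpy as np

def preprocess(expr, floor=20.0, ceil=16000.0, min_fold=3.0, min_delta=100.0):
    """expr: genes x samples intensity matrix; returns filtered, scaled rows."""
    x = np.clip(expr, floor, ceil)                    # thresholding
    fold = x.max(axis=1) / x.min(axis=1)              # per-gene max/min ratio
    delta = x.max(axis=1) - x.min(axis=1)             # per-gene absolute change
    keep = (fold >= min_fold) & (delta >= min_delta)  # variation filter
    x = x[keep]
    x = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)
    return x, keep

expr = np.random.default_rng(2).lognormal(6, 1, size=(7400, 17))
filtered, kept = preprocess(expr)
print(filtered.shape)  # genes passing the filter x 17 samples
```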
GoMiner is a tool for the biological interpretation of genomic and proteomic data, including data from gene expression microarrays.[14] It leverages the GO to identify the biological processes, molecular functions, and cellular components represented in a gene list. It takes as input two lists of genes: the total gene set on the microarray and the flagged subset of interesting genes. It displays the genes within the framework of the GO hierarchy, both as a directed acyclic graph and as the equivalent tree structure. The GoMiner software classifies the genes into biologically coherent categories and assesses these categories.[14]
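The category assessment can be viewed as an over-representation test of the flagged genes within each GO category. The sketch below uses Fisher's exact test as a generic stand-in for GoMiner's statistics; the function and variable names are ours.

```python
# Generic GO-category over-representation test (a stand-in, not GoMiner itself).
from scipy.stats import fisher_exact

def category_enrichment(total_genes, flagged_genes, category_genes):
    """2x2 Fisher test: category membership vs. membership in the flagged set."""
    total, flagged = set(total_genes), set(flagged_genes)
    cat = set(category_genes) & total
    table = [[len(cat & flagged), len(cat - flagged)],
             [len(flagged - cat), len(total - cat - flagged)]]
    return fisher_exact(table, alternative="greater")  # (odds ratio, p-value)

# Hypothetical usage with gene identifier lists:
# odds, p = category_enrichment(all_7400_ids, top_473_ids, apoptosis_ids)
```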
Results

Between December 2003 and November 2004, cDNA microarray gene expression data were obtained from 17 bladder cancer patients. The patients ranged in age from 47 to 87 years (average age, 64 years). All patients had undergone TUR. Pathological studies confirmed the presence of superficial tumors in 9 patients and advanced tumors in 8 patients. Eight of the 9 patients with superficial bladder tumors received intravesical Bacillus Calmette-Guerin (BCG) instillation after TUR. One of the 9 patients with superficial tumors underwent radical cystectomy and concurrent chemotherapy and radiotherapy due to the presence of local lymph node metastasis. All 8 patients with advanced bladder tumors underwent radical cystectomy after TUR. After radical cystectomy, 2 of these 8 patients received concurrent chemotherapy and radiotherapy, while 1 received only chemotherapy.

The gene expression data comprised data from 7400 cDNAs in 17 arrays. Table 1 shows the characteristics of the patient tissue samples. Of the 7400 genes analyzed, 473 showed significant changes in their expression; of these, 268 were up-regulated and 205 were down-regulated. Gene expression data were filtered and normalized using the GeneCluster program. Using the top 473 interesting genes, the SOM method was applied to a balanced dataset to analyze the gene expression data. The underlying structure of the data was explored by varying the geometry of the SOMs. The SOM forms a non-linear mapping of the data to a two-dimensional map grid that can be used as an exploratory data analysis tool for generating hypotheses on the relationships of the genes. A 2-cluster SOM was used to automatically cluster the set of 17 tissue samples into superficial and advanced bladder cancer groups based on gene expression patterns. The relationships between the known functional classes of the top 473 genes were investigated by analyzing their distribution on the SOM.

The patient tissue samples were clustered into 2 groups, namely, superficial and advanced bladder cancers, comprising 10 and 7 samples, respectively (Figure 2). One patient tissue sample with advanced bladder cancer was clustered into the superficial bladder cancer group. Thus, SOM analysis of the top 473 genes differentiated between the gene expression patterns of superficial and advanced bladder cancer tissue samples with a high accuracy rate of 94% (16/17).
Table 1. Clinical and pathological parameters of patient tissue samples
Figure 2. A 2-cluster SOM in GeneCluster analysis.
The top 473 genes were also classified into biologically coherent categories by the GoMiner software. The results revealed that 452, 435, and 452 genes were associated with biological processes, cellular components, and molecular functions, respectively. Of the 452 genes annotated as being involved in biological processes, most were involved in cellular processes (438 genes) and physiological processes (426 genes), with substantial numbers in the regulation of biological processes (194 genes) and development (143 genes). Of the 435 genes involved in cellular components, 417 genes were associated with cells and cell
components; 317 genes with cell organelles; 158 genes with organelle components; and 138 genes with protein complexes. Of the 452 genes influencing molecular function, 382 genes were associated with binding; 212, with catalytic activity; and 80, with signal transducer activity. Table 2 shows the results in detail.

Table 2. GO analysis of the top 473 genes using GoMiner

• biological_process (452 genes)
  cellular processes (438 genes)
  development (143 genes)
  growth (14 genes)
  interaction between organisms (8 genes)
  physiological processes (426 genes)
  regulation of biological processes (194 genes)
  reproduction (28 genes)
  response to stimulus (114 genes)
  viral life cycle (5 genes)

• cellular_component (435 genes)
  cell (417 genes)
  cell component (417 genes)
  envelope (23 genes)
  extracellular matrix (9 genes)
  extracellular matrix part (1 gene)
  extracellular region (68 genes)
  extracellular region component (61 genes)
  membrane-enclosed lumen (68 genes)
  organelle (307 genes)
  organelle part (158 genes)
  protein complex (138 genes)
  synapse (2 genes)
  virion (1 gene)
  virion part (1 gene)

• molecular_function (452 genes)
  antioxidant activity (4 genes)
  binding (382 genes)
  catalytic activity (212 genes)
  enzyme regulator activity (36 genes)
  motor activity (5 genes)
  protein tag (1 gene)
  signal transducer activity (80 genes)
  structural molecule activity (33 genes)
  transcription regulator activity (65 genes)
  translation regulator activity (10 genes)
  transporter activity (53 genes)
According to the organizing principles of GO, many important genes are found in cancer-related biological processes. The cell differentiation-related genes include BMP7, BRD2, CD24, CSF1, DRG1, DZIP1, EFNB2, ELAVL3, FGF1, FZD5, INHA, MDK, NOS3, NTRK2, PMCH, PMP22, SORT1, and TSSK2. The apoptosis-related genes comprise APAF1, APPBP1, BAG1, BCAP31, BIRC5, CDKN2A, ELMO1, IL1B, RAD21, RIPK2, TEGT, TFDP1, UNC5B, VDAC1, and YARS. The genes associated with cell migration include VAV3, WASF2, APBB2, BMP7, PGRMC1, UNC5B, CDH1, HMGCR, IL1B, SPP1, IL8, SYK, ARHGAP8, CDKN1B, and AKAP4. Cell proliferation-related genes include BCAT1, BLZF1, CKS2, CSF1, DYRK2, FGF1, FRAT2, IL1B, KIP2, MCM7, MDK, PIM1, PRDX1, RPS21, SYK, TACSTD2, TFDP1, YY1, and ZFP36L2. Genes related to transcription regulator activity are FALZ, HES1, MEN1, NEO1, NUP62, TBX5, TCF3, and TRIP4. These data are summarized in Table 3.

Table 3. Important genes found in cancer-related biological processes

Cell differentiation: BMP7, BRD2, CD24, CSF1, DRG1, DZIP1, EFNB2, ELAVL3, FGF1, FZD5, INHA, MDK, NOS3, NTRK2, PMCH, PMP22, SORT1, TSSK2
Apoptosis: APAF1, APPBP1, BAG1, BCAP31, BIRC5, CDKN2A, ELMO1, IL1B, RAD21, RIPK2, TEGT, TFDP1, UNC5B, VDAC1, YARS
Cell migration: VAV3, WASF2, APBB2, BMP7, PGRMC1, UNC5B, CDH1, HMGCR, IL1B, SPP1, IL8, SYK, ARHGAP8, CDKN1B, AKAP4
Cell proliferation: BCAT1, BLZF1, CKS2, CSF1, DYRK2, FGF1, FRAT2, IL1B, KIP2, MCM7, MDK, PIM1, PRDX1, RPS21, SYK, TACSTD2, TFDP1, YY1, ZFP36L2
Transcription regulator activity: FALZ, HES1, MEN1, NEO1, NUP62, TBX5, TCF3, TRIP4
Discussion

It has been reported that gene profiling provides a genome-based classification method for the diagnosis and prognosis of advanced urothelial bladder carcinoma and may help identify the patients who require aggressive therapeutic interventions.[19] The gene expression profiling of urothelial bladder cancer may provide insights into the biology of cancer progression and help identify patients with distinct clinical phenotypes.[8] Therefore, in this study, we compared the gene expression profiles of superficial and advanced urothelial bladder cancers and attempted to identify changes in gene expression during the progression of urothelial cancer. We also analyzed the annotation of the significant genes to understand the role of critical genes and pathways during bladder cancer progression.

Microarray technology has simplified the monitoring of gene expression patterns during cellular differentiation.[20, 21] Many statistical techniques have been developed for identifying the underlying patterns in complex data. Recently, several clustering techniques
for gene expression data have been described, including hierarchical clustering, Bayesian clustering, k-means clustering, and SOMs.[22-24] Hierarchical clustering has proven to be a valuable method of analysis. Bayesian clustering is a structured approach that is useful when a prior data distribution is available. K-means clustering is an unstructured approach that produces an unorganized collection of clusters. The features of SOMs make them well suited for clustering and analysis of gene expression patterns; they impose a partial structure on the clusters,[12] facilitate easy visualization and interpretation, and are easy to implement, fast, and scalable to large data sets. In this study, the program package GeneCluster 2.0, developed by the Center for Genome Research/Massachusetts Institute of Technology, was used to produce and display SOMs of gene expression data and to facilitate their interpretation.

Transcription studies on a genomic scale using microarray technology have led to major advances in our understanding of the pathogenesis of human diseases. In this study, the gene expression data comprised 7400 cDNAs in 17 arrays. We used the top 473 genes that demonstrated significant changes in expression for the differential diagnosis of superficial and advanced urothelial bladder cancers. Microarray studies have generated controversy due to the probabilistic and sometimes inconsistent nature of the results obtained; it is, however, possible to develop simple expression measures that allow comparisons across platforms and studies.[25]

A SOM is an unsupervised neural network learning algorithm that has been successfully used for the analysis and organization of large data files.[26] Therefore, we applied the SOM algorithm for the analysis and visualization of gene expression profiles. A 2-cluster SOM was used to automatically cluster a set of 17 urothelial bladder cancer samples into superficial and advanced bladder cancers based on the expression patterns of 473 genes. The SOM analysis yielded a high accuracy rate of approximately 94%, and the clustering algorithm was effective in categorizing the samples into biologically meaningful groups. However, clustering yields results that are interpretable only in the context of a priori knowledge of bladder cancer subclasses; in the absence of such knowledge, the biological interpretation of clustering results remains a challenge.

A gene product may be located in one or more cellular components and influence biological processes in which certain molecular functions are involved. GO addresses the need for consistent descriptions of genes in databases. The GO project has developed 3 structured ontologies that describe gene products in terms of their associated cellular components, biological processes, and molecular functions.[27] The functions of the differentially expressed genes can be assessed by querying the GO database via the GoMiner software, a program package that organizes lists of expressed genes for biological interpretation within the context of the GO database.[14] Therefore, in our study, we used the GoMiner software to query the GO database and analyze the functions of the 473 genes. Of the 473 genes, 268 were up-regulated and 205 were down-regulated. We analyzed these genes and attempted to identify the genes that could function as biomarkers of urothelial bladder cancers.
Of these 473 genes, many have been reported to be associated with superficial or advanced bladder cancers. In 2007, it was reported that the overexpression of CD24 is associated with invasiveness in urothelial carcinoma of the bladder. The CD24 gene is related to cell differentiation and may be a potential serum marker for urothelial bladder carcinoma or a target of antibody-based therapeutics for bladder carcinoma.[28] FGF1 has been observed to behave as a tumorigenic factor in a bladder carcinoma cell model.[29] NOS3
may play a role in the differentiation of the transitional epithelium in fetal life. It may also be involved in physiological functions of the adult bladder mucosa, and it is involved in bladder carcinogenesis.[30] It has been suggested that the expression of the tumor suppressor gene APAF1 is controlled by promoter methylation in the early stages of bladder cancer.[31] The apoptosis-related gene BIRC5 is a useful adjunct marker for the grading of papillary urothelial carcinoma.[32] Decreased CDKN2A expression was found to be correlated with tumor progression in patients with minimally invasive bladder cancer.[33] It was reported that a single nucleotide polymorphism (SNP) in the promoter of the cell migration-related gene CDH1 is a risk factor for bladder cancer.[34] The overexpression of interleukin 8 (IL-8) in transitional cell carcinoma (TCC) results in increased tumorigenicity and metastasis.[35] An analysis of the different patterns of aberrant methylation of the SYK gene in the various types of bladder cancers indicated that SYK may play a role in tumor cell differentiation causing the conversion of TCC into non-urothelial carcinomas, thus increasing the malignant potential of TCC.[36] Analysis of the cell cycle regulators of urothelial bladder cancer indicated that CDKN1B is a potentially useful prognostic biomarker.[37] A real-time PCR assay confirmed that the up-regulation of the CKS2 gene was significantly higher in advanced than in superficial bladder cancer.[10] The KIP2 gene could be an important target of genetic and epigenetic alterations in bladder cancer affecting the maternal chromosome 11p15.5.[38] The up-regulation of the cell proliferation-related gene YY1 in bladder cancer may account for the efficacy of gefitinib administration in the treatment of this disease.[39]

Gene expression profiling of urothelial bladder cancers provides insights into the pathogenesis and prognosis of bladder cancer. Further, it enables the identification of the distinct clinical phenotypes of bladder cancer patients. Alterations in specific genes are associated with modifications in various cellular functions such as cell differentiation, signal transduction, transcription, translation, DNA replication, and mitosis. Studies over the past decade have demonstrated that such genetic alterations are responsible for the development and progression of bladder cancer.
Conclusion

Based on our results, we believe that superficial and advanced urothelial bladder cancers can be differentiated by their gene expression profiles analyzed with SOMs. The differences between the gene expression profiles of superficial and advanced bladder carcinomas suggest that these cancers may develop through unique genetic pathways while sharing some genetic alterations. The genes that are uniquely expressed in either type of bladder cancer are possible candidates for urinary biomarkers.
References

[1] Konety, BR; Carroll, PR. Urothelial carcinoma: cancers of the bladder, ureter and renal pelvis. In: Smith's General Urology. 16th ed. Columbus, OH: McGraw-Hill; 2008.
[2] Flechon, A; Droz, JP. Chemotherapy practices and perspectives in invasive bladder cancer. Expert Rev Anticancer Ther, 2006;6(10):1473-82.
[3] Greenlee, RT; Hill-Harmon, MB; Murray, T; Thun, M. Cancer statistics. CA Cancer J Clin, 2001;51:15-36.
[4] Agarwal, PK; Black, PC; Kamat, AM. Considerations on the use of diagnostic markers in management of patients with bladder cancer. World J Urol, 2007;155(2):208-14.
[5] Goebell, PJ; Vom Dorp, F; Rödel, C; Frohneberg, D; Thüroff, JW; Jocham, D; Stief, C; Roth, S; Knüchel, R; Schmidt, KW; Kausch, I; Zaak, D; Wiesner, C; Miller, K; Sauer, R; Rübben, H. Noninvasive and invasive bladder cancer: diagnostics and treatment. Urologe A, 2006;45(7):873-84; quiz 885.
[6] Meyer, D; Schmid, HP; Engeler, DS. Therapy and follow-up of bladder cancer. Wien Med Wochenschr, 2007;157(7-8):162-9.
[7] Satoh, H; Morimoto, Y; Arai, T; Asanuma, H; Kawauchi, S; Seguchi, K; Kikuchi, M; Murai, M. Intravesical ultrasonography for tumor staging in an orthotopically implanted rat model of bladder cancer. J Urol, 2007;177(3):1169-73.
[8] Modlich, O; Prisack, HB; Pitschke, G; Ramp, U; Ackermann, R; Bojar, H; Vögeli, TA; Grimm, MO. Identifying superficial, muscle-invasive, and metastasizing transitional cell carcinoma of the bladder: use of cDNA array analysis of gene expression profiles. Clin Cancer Res, 2004;10(10):3410-21.
[9] Baffa, R; Letko, J; McClung, C; LeNoir, J; Vecchione, A; Gomella, LG. Molecular genetics of bladder cancer: targets for diagnosis and therapy. J Exp Clin Cancer Res, 2006;25(2):145-60.
[10] Kawakami, K; Enokida, H; Tachiwada, T; Gotanda, T; Tsuneyoshi, K; Kubo, H; Nishiyama, K; Takiguchi, M; Nakagawa, M; Seki, N. Identification of differentially expressed genes in human bladder cancer through genome-wide gene expression profiling. Oncol Rep, 2006;16(3):521-31.
[11] van Osdol, WW; Myers, TG; Paull, KD; Kohn, KW; Weinstein, JN. Use of the Kohonen self-organizing map to study the mechanisms of action of chemotherapeutic agents. J Natl Cancer Inst, 1994;86(24):1853-9.
[12] Tamayo, P; Slonim, D; Mesirov, J; Zhu, Q; Kitareewan, S; Dmitrovski, E; Lander, E; Golub, T. Interpreting patterns of gene expression with self-organizing maps: methods and applications to hematopoietic differentiation. Proc Natl Acad Sci USA, 1999;96:2907-12.
[13] Reich, M; Ohm, K; Angelo, M; Tamayo, P; Mesirov, JP. GeneCluster 2.0: an advanced toolset for bioarray analysis. Bioinformatics, 2004;20(11):1797-8.
[14] Zeeberg, BR; Feng, W; Wang, G; Wang, MD; Fojo, AT; Sunshine, M; Narasimhan, S; Kane, DW; Reinhold, WC; Lababidi, S; Bussey, KJ; Riss, J; Barrett, JC; Weinstein, JN. GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biology, 2003;4(4):R28.
[15] Wang, TH; Lee, YS; Chen, ES; Kong, WH; Chen, LK; Hsueh, DW; Wei, ML; Wang, HS; Lee, YS. Establishment of cDNA microarray analysis at the Genomic Medicine Research Core Laboratory (GMRCL) of Chang Gung Memorial Hospital. Chang Gung Med J, 2004;27:243-60.
[16] Chao, A; Wang, TH; Lee, YS; Hsueh, S; Chao, AS; Chang, TC; Kung, WH; Huang, SL; Chao, FY; Wei, ML; Lai, CH. Molecular characterization of adenocarcinoma and squamous carcinoma of the uterine cervix using microarray analysis of gene expression. Int J Cancer, 2006;119(1):91-8.
[17] Kohonen, T. Self-Organizing Maps. NY: Springer; 1995.
[18] Vesanto, J; Alhoniemi, E. Clustering of the self-organizing map. IEEE Trans Neural Netw, 2000;11(3):586-600.
[19] Sanchez-Carbayo, M; Socci, ND; Lozano, J; Saint, F; Cordon-Cardo, C. Defining molecular profiles of poor outcome in patients with invasive bladder cancer using oligonucleotide microarrays. J Clin Oncol, 2006;24(5):778-89.
[20] Wodicka, L; Dong, H; Mittmann, M; Ho, M; Lockhart, D. Genome-wide expression monitoring in Saccharomyces cerevisiae. Nat Biotechnol, 1997;15:1359-67.
[21] Chu, S; DeRisi, J; Eisen, M; Mulholland, J; Botstein, D; Brown, PO; Herskowitz, I. The transcriptional program of sporulation in budding yeast. Science, 1998;282:699-705.
[22] Spellman, PT; Sherlock, G; Zhang, MQ; Iyer, VR; Anders, K; Eisen, MB; Brown, PO; Botstein, D; Futcher, B. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell, 1998;9:3273-97.
[23] Eisen, MB; Spellman, PT; Brown, PO; Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA, 1998;95:14863-8.
[24] Kangas, JA; Kohonen, TK; Laaksonen, JT. Variants of self-organizing maps. IEEE Trans Neural Netw, 1990;1(1):93-9.
[25] Garrett-Mayer, E; Parmigiani, G; Zhong, X; Cope, L; Gabrielson, E. Cross-study validation and combined analysis of gene expression microarray data. Biostatistics, 2008;9(2):333-54.
[26] Toronen, P; Kolehmainen, M; Wong, G; Castren, E. Analysis of gene expression data using self-organizing maps. FEBS Letters, 1999;451:142-6.
[27] Cimino, JJ; Zhu, X. The practical impact of ontologies on biomedical informatics. Yearb Med Inform, 2006;124-35.
[28] Choi, YL; Lee, SH; Kwon, GY; Park, CK; Han, JJ; Choi, JS; Choi, HY; Kim, SH; Shin, YK. Overexpression of CD24: association with invasiveness in urothelial carcinoma of the bladder. Arch Pathol Lab Med, 2007;131(2):275-81.
[29] Jouanneau, J; Plouet, J; Moens, G; Thiery, JP. FGF-2 and FGF-1 expressed in rat bladder carcinoma cells have similar angiogenic potential but different tumorigenic properties in vivo. Oncogene, 1997;14(6):671-6.
[30] Shochina, M; Fellig, Y; Sughayer, M; Pizov, G; Vitner, K; Podeh, D; Hochberg, A; Ariel, I. Nitric oxide synthase immunoreactivity in human bladder carcinoma. Mol Pathol, 2001;54(4):248-52.
[31] Christoph, F; Hinz, S; Kempkensteffen, C; Weikert, S; Krause, H; Schostak, M; Schrader, M; Miller, K. A gene expression profile of tumor suppressor genes commonly methylated in bladder cancer. J Cancer Res Clin Oncol, 2007;133(6):343-9.
[32] Chen, YB; Tu, JJ; Kao, J; Zhou, XK; Chen, YT. Survivin as a useful adjunct marker for the grading of papillary urothelial carcinoma. Arch Pathol Lab Med, 2008;132(2):224-31.
[33] Krüger, S; Mahnken, A; Kausch, I; Feller, AC. P16 immunoreactivity is an independent predictor of tumor progression in minimally invasive urothelial bladder carcinoma. Eur Urol, 2005;47(4):463-7.
[34] Kiemeney, LA; van Houwelingen, KP; Bogaerts, M; Witjes, JA; Swinkels, DW; den Heijer, M; Franke, B; Schalken, JA; Verhaegh, GW. Polymorphisms in the E-cadherin (CDH1) gene promoter and the risk of bladder cancer. Eur J Cancer, 2006;42(18):3219-27.
[35] Mian, BM; Dinney, CP; Bermejo, CE; Sweeney, P; Tellez, C; Yang, XD; Gudas, JM; McConkey, DJ; Bar-Eli, M. Fully human anti-interleukin 8 antibody inhibits tumor growth in orthotopic bladder cancer xenografts via down-regulation of matrix metalloproteases and nuclear factor-kappaB. Clin Cancer Res, 2003;9(8):3167-75.
[36] Kunze, E; Wendt, M; Schlott, T. Promoter hypermethylation of the 14-3-3 sigma, SYK and CAGE-1 genes is related to the various phenotypes of urinary bladder carcinomas and associated with progression of transitional cell carcinomas. Int J Mol Med, 2006;18(4):547-57.
[37] Brunner, A; Verdorfer, I; Prelog, M; Mayerl, C; Mikuz, G; Tzankov, A. Large-scale analysis of cell cycle regulators in urothelial bladder cancer identifies p16 and p27 as potentially useful prognostic markers. Pathobiology, 2008;75(1):25-33.
[38] Oya, M; Schulz, WA. Decreased expression of p57(KIP2) mRNA in human bladder cancer. Br J Cancer, 2000;83(5):626-31.
[39] Inoue, R; Matsuyama, H; Yano, S; Yamamoto, Y; Iizuka, N; Naito, K. Gefitinib-related gene signature in bladder cancer cells identified by a cDNA microarray. Anticancer Res, 2006;26(6B):4195-202.
In: Computational Biology: New Research
Editor: Alona S. Russe
ISBN 978-1-60692-0404-4 © 2009 Nova Science Publishers, Inc.
Chapter 9
FULL SIBLING RECONSTRUCTION IN WILD POPULATIONS FROM MICROSATELLITE GENETIC MARKERS

Mary V. Ashley*, Tanya Y. Berger-Wolf†, Isabel C. Caballero*, Wanpracha Chaovalitwongse‡, Bhaskar DasGupta† and Saad I. Sheikh†

* Department of Biological Sciences, University of Illinois at Chicago, Chicago, IL 60607. Email: {ashley,icabal2}@uic.edu
† Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607. Email: {tanyabw,dasgupta,ssheikh}@cs.uic.edu
‡ Department of Industrial Engineering, Rutgers University, Piscataway, NJ 08854. Email: [email protected]
I do not believe that the accident of birth makes people sisters and brothers. It makes them siblings. Gives them mutuality of parentage.

– Maya Angelou
Abstract

New technologies for collecting genotypic data from natural populations open the possibilities of investigating many fundamental biological phenomena, including behavior, mating systems, heritabilities of adaptive traits, kin selection, and dispersal patterns. The power and potential of genotypic information often rests in the ability to reconstruct genealogical relationships among individuals. These relationships include parentage, full and half-sibships, and higher order aspects of pedigrees. Some areas of genealogical inference, such as parentage, have been studied extensively. Although methods for pedigree inference and kinship analysis exist, most make assumptions that do not hold for wild populations of animals and plants. In this chapter, we focus on the full sibling relationship and first review existing methods for full sibship reconstruction from microsatellite genetic markers. We then
describe our new combinatorial methods for sibling reconstruction based on simple Mendelian laws and their extension to the case where the data contain errors. We also describe a generic consensus method for combining sibling reconstruction results from other methods. We present an experimental comparison of the best existing approaches on both biological and simulated data. We discuss the relative merits and drawbacks of existing methods and suggest a practical approach for reconstructing sibling relationships in wild populations.
1. Introduction
Kinship analysis of wild populations is often an important and necessary component of understanding an organism's biology and ecology. Population biologists studying plants and animals in the field want to know how individuals survive, acquire mates, reproduce, and disperse to new populations. Often these parameters are difficult or impossible to infer from observational studies alone, and the establishment of kinship patterns (parentage or sibling relationships, for example) can be extremely useful. The powerful toolbox provided by advances in molecular biology and genome analysis has offered population biologists a growing list of possibilities for inferring kinship.

Paternity analysis in wild populations became common upon the arrival of the first DNA-based markers in the mid-1980s, when multi-locus DNA fingerprinting methods became available. Probably the most notable discoveries came from studies of avian mating systems. Multi-locus DNA fingerprinting revealed that many bird species that were behaviorally monogamous were in fact often reproductively promiscuous. Females of such species would furtively engage in extra-pair copulations, apparently unbeknownst to their cuckolded male social partners. In fact, the frequency of extra-pair fertilizations (up to 50% in some species) led avian behavioral ecologists to distinguish between social mating systems and genetic mating systems (reviewed in [55]).

The invention of the polymerase chain reaction (PCR) [38] quickly led to the replacement of multi-locus fingerprinting with single-locus PCR-based techniques by the mid-1990s [3, 39]. Microsatellites (also known as SSRs and STRs) were the first and still are the most widespread molecular marker for inferring kinship in wild populations, although their development in each new species studied is often a time-consuming and expensive obstacle. Microsatellite genotypes, which can be obtained from tiny amounts of blood, tissue, or even feces, have been used to infer parentage, particularly paternity, in a large number of wild species. Notable examples include the study of pollination patterns in forest trees [13, 14, 47], identifying fathers of the famed chimpanzees of Gombe [12], and evaluating the success of alternative mating strategies used by male bighorn sheep [24]. A breakthrough in paternity assignment came with the release of the software program CERVUS [30], a user-friendly Windows-based program that employs a statistical likelihood method to assign paternity to a candidate father with an estimated level of statistical confidence.

There are many cases where field studies can sample cohorts of offspring yet sampling putative parents is problematic. In these cases, sibling relationship (sibship) reconstruction, rather than parentage assignment, is required. For genetic markers showing Mendelian inheritance, such as microsatellites, parentage assignment (maternity or paternity) is computationally much simpler than sibship reconstruction. In diploid organisms, a parent and
each offspring must share an allele at every genetic locus (barring rare mutations). On the other hand, full siblings will share, on average, half their alleles, but at any one locus they may share 0, 1, or 2 alleles. Sibling reconstruction methods have lagged behind those developed for paternity assignment, but several methods of sibling reconstruction are now available. In this review, we will examine the constraints that Mendelian inheritance dictates for sibling reconstruction, review the use of microsatellite genotyping in wild populations, and evaluate alternative genetic markers. We will then review the various methods for full sibling reconstruction that are currently available and present experimental validation of various methods using both real biological data and simulated data.
1.1. Microsatellites
While there are several molecular markers used in population genetics, microsatellites are the most commonly used in kinship studies of wild populations. First discovered in the late 1980s when genomic sequencing studies began [48, 54], microsatellites are short (one to six base pairs) simple sequence repeats, such as (CA/GT)_n or (AGC/TCG)_n, that are scattered around eukaryotic genomes. A genomic library for a study species is screened for such repeats, and primers for PCR amplification are constructed from the regions flanking the short repeats. Alternatively, microsatellite primers developed for one species may be used for closely related species. For example, microsatellites developed for humans amplify homologous loci in chimpanzees [12]. Figure 1 shows a schematic example of a microsatellite marker with three alleles and the resulting genotypes. Because there is a relatively high rate of mutation for adding or subtracting repeat units, microsatellite loci have high numbers of alleles and high levels of heterozygosity. PCR-based microsatellite analysis provides co-dominant, unlinked markers where alleles and genotypes can be scored precisely by size. These are the characteristics that make them especially useful for estimating kinship and relatedness. There are some technical problems associated with scoring microsatellites, and any method of sibling reconstruction with microsatellites needs to be able to accommodate a low frequency of scoring errors or artifacts, in addition to occasional mutation.

Microsatellites have been successfully applied to a wide range of non-model organisms, including vertebrates, invertebrates, plants, and fungi, and are used to infer large-scale population structure as well as individual kinship. For kinship studies, microsatellites have been used more commonly for parentage than for sibship reconstruction, but an increasing number of studies have attempted to reconstruct sibships with partial or no parental sampling. In lemon sharks, cohorts of juvenile sharks were sampled annually from nursery lagoons, and sibship reconstruction was used to infer the mating system and fertility of adults [17]. Sibship reconstruction was used to infer patterns of brood parasitism for individual female cowbirds, who lay their eggs in the nests of other birds [45, 46]. In a study of wood frogs, tadpoles were sampled from ponds and sibgroups reconstructed to study their spatial distribution and the potential for kin selection [22]. Such studies have employed a variety of methods to reconstruct sibling groups from microsatellite data because there was no widely accepted or easily implemented software available.

In addition to microsatellites, which assay DNA repeat variation, several PCR-based methods are available to assay variation in DNA sequence. RAPDs (randomly amplified
polymorphic DNA), ISSRs (inter-simple sequence repeats), and AFLPs (amplified fragment length polymorphisms) are dominant, multi-locus techniques that are problematic for kinship inference. SNPs (single nucleotide polymorphisms) are single-locus markers that focus on a variable single nucleotide position in the genome. While they are numerous in the genome and, once identified, easy to score, they have limitations in the area of kinship reconstruction. The power to identify related individuals depends mainly on the number of alleles per locus and their heterozygosity. SNPs are usually biallelic, whereas microsatellites may have 10 or more alleles per locus and typically have high heterozygosities. It appears that, for at least the next few years, microsatellites will remain the marker of choice for estimating relatedness in wild populations. We thus focus our efforts on developing and comparing methods of sibling reconstruction that are applicable to microsatellites or, more generally, codominant, multiallelic markers.

Figure 1. A schematic example of a microsatellite marker.
2. Sibling Reconstruction Problem
In order to reason about the inherent computational properties of the problem of reconstructing sibling relationships and to compare the accuracy and performance of various computational methods for solving the problem, we must define it formally. The problem of sibling reconstruction was first formally defined in [5] and is restated here.

Definition 1. Let $U$ be a population of $n$ diploid individuals of the same generation genotyped at $l$ microsatellite loci:
$$U = \{X_1, \ldots, X_n\}, \quad \text{where } X_i = (\langle a_{i1}, b_{i1} \rangle, \ldots, \langle a_{il}, b_{il} \rangle)$$
and $a_{ij}$ and $b_{ij}$ are the two alleles of individual $i$ at locus $j$, represented as some identifying string. The goal of the Sibling Reconstruction Problem is to reconstruct the full sibling groups (groups of individuals with the same parents). We assume no knowledge of parental information. Formally, the goal is to find a partition of individuals $P_1, \ldots, P_m$ such that
$$\forall\, 1 \le k \le m, \ \forall\, X_p, X_q \in P_k : \mathit{Parents}(X_p) = \mathit{Parents}(X_q).$$
Note that we have not defined the function $\mathit{Parents}(X)$. This is a biological objective. Computational approaches use the formalization of various biological assumptions and constraints to achieve a good estimate of the biological sibling relationship. We describe the fundamental genetic properties that serve as a basis for most computational approaches in the next section.
3. Genetics of Sibship

3.1. Mendelian Genetics
Mendelian genetics lays down a very simple rule for gene inheritance in diploid organisms: an offspring inherits one allele from each of its parents at each locus. This introduces two overlapping necessary (but not sufficient) constraints on full sibling groups in the absence of genotyping errors or mutations: the 4-allele property and the 2-allele property [5, 10].

4-Allele Property: The total number of distinct alleles occurring at any locus may not exceed 4. Formally, a set of individuals $S \subseteq U$ has the 4-allele property if
$$\forall\, 1 \le j \le l : \Bigl|\bigcup_{i \in S} \{a_{ij}, b_{ij}\}\Bigr| \le 4.$$
Clearly, the 4-allele property is necessary, since a group of siblings can inherit only combinations of the 4 alleles of their common parents. The 4-allele property is effective for identifying sibling groups where the data are mostly heterozygous and the parent individuals share few common alleles. Generally, as in Table 1, a set consisting of any two individuals trivially satisfies the 4-allele property. The set of individuals 1, 3 and 4 from Table 1 satisfies the 4-allele property. However, the set of individuals 2, 3 and 5 fails to satisfy it, as there are five alleles occurring at the first locus: {12, 28, 56, 44, 51}.

2-Allele Property: There exists an assignment of individual alleles within a locus to maternal and paternal such that the number of distinct alleles assigned to each parent at this locus does not exceed 2. Formally, a set of individuals $S \subseteq U$ has the 2-allele property if for each individual $X_i$ in each locus there exists an assignment $a_{ij} = c_{ij}$ or $b_{ij} = c_{ij}$ (with the other allele assigned to $\bar{c}_{ij}$) such that
$$\forall\, 1 \le j \le l : \Bigl|\bigcup_{i \in S} \{c_{ij}\}\Bigr| \le 2 \quad \text{and} \quad \Bigl|\bigcup_{i \in S} \{\bar{c}_{ij}\}\Bigr| \le 2.$$
The 2-allele property is clearly stricter than the 4-allele property. Looking at Table 1, our previous 4-allele set of individuals 1, 3 and 4 fails to satisfy the 2-allele property, since there are more than two alleles on the left side of locus 1: {44, 28, 13}. Moreover, there is no swapping of the left and right sides of alleles that will bring down the number of alleles on each side to two: individuals 1 and 4, with their alleles 44/44 and 13/13, already fill the capacity. Again, any two individuals trivially satisfy the 2-allele property.

Table 1. An example of input data for the sibling reconstruction problem. The five individuals have been sampled at two genetic loci. Each allele is represented by a number. Same numbers within a locus represent the same alleles.

Individual   Alleles ⟨a, b⟩ at locus 1   Alleles ⟨a, b⟩ at locus 2
1            44, 44                      55, 27
2            12, 56                      18, 39
3            28, 44                      55, 18
4            13, 13                      39, 27
5            28, 51                      18, 39
Assuming the order of the parental alleles is always the same in the offspring (i.e., the maternal allele is always on the same side), the 2-allele property is equivalent to a biologically consistent full sibling relationship. The parental allele order, however, is not preserved, and an interesting problem arises: given a set of individuals S that satisfies the 4-allele property, does there exist a series of allele reorderings within some loci of individuals in S so that after those reorderings S satisfies the 2-allele property? For example, in Table 1, individuals 1, 3, and 5 have more than two alleles on the right side of locus 2: {27, 18, 39}. However, switching the alleles 18 and 39 at locus 2 in individual 5 brings the number of alleles on either side down to two. Since the number of alleles on either side of locus 1 is also two, the set of individuals 1, 3, and 5 satisfies the 2-allele property. In [10] we show the connection between the two properties, which we restate here:

Theorem 1. Let $a$ be the number of distinct alleles present in a given locus and $R$ be the number of distinct alleles that either appear with three different alleles in this locus or are homozygous (appear with themselves). Then, given a set of individuals with the 4-allele property, there exists a series of allele reorderings within loci resulting in a set that satisfies the 2-allele property if and only if $a + R \le 4$ for all loci in the set.

In our example of individuals 1, 3, and 5, at locus 1 we have $a = |\{44, 28, 51\}| = 3$ and $R = 1$, since each allele is paired with at most two different alleles but 44 is a homozygote. At locus 2, $a = |\{55, 27, 18, 39\}| = 4$ but $R = 0$, since there are no homozygous alleles and no allele appears with more than two different alleles. Thus, the set of individuals 1, 3, and 5 satisfies $a + R \le 4$ for all loci and, hence, the 2-allele property. The 2-allele property takes into account the fact that the parents can contribute only two alleles each to their offspring. Note that the 2-allele property is, again, a necessary but not a sufficient constraint for a group of individuals to be siblings (in absence of errors or
mutations). The full formalization of the Mendelian inheritance constraints in the context of sibling reconstruction is presented in [5, 10].
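Both properties are easy to check mechanically. The sketch below runs the checks on the Table 1 data: the 4-allele property directly, and the 2-allele property via the a + R ≤ 4 condition of Theorem 1. The code is our illustration, not the implementation from [5, 6, 10].

```python
# 4-allele and 2-allele (via Theorem 1) checks on the Table 1 data.
TABLE1 = {
    1: [(44, 44), (55, 27)],
    2: [(12, 56), (18, 39)],
    3: [(28, 44), (55, 18)],
    4: [(13, 13), (39, 27)],
    5: [(28, 51), (18, 39)],
}

def four_allele(group):
    """True if the group carries at most 4 distinct alleles at every locus."""
    loci = zip(*(TABLE1[i] for i in group))
    return all(len({x for pair in locus for x in pair}) <= 4 for locus in loci)

def two_allele(group):
    """Theorem 1: a + R <= 4 at every locus, where a counts distinct alleles
    and R counts alleles that are homozygous or appear with >= 3 others."""
    for locus in zip(*(TABLE1[i] for i in group)):
        alleles = {x for pair in locus for x in pair}
        partners = {x: set() for x in alleles}
        homozygous = set()
        for a, b in locus:
            if a == b:
                homozygous.add(a)
            partners[a].add(b)
            partners[b].add(a)
        R = sum(1 for x in alleles
                if x in homozygous or len(partners[x] - {x}) >= 3)
        if len(alleles) + R > 4:
            return False
    return True

print(four_allele({1, 3, 4}), two_allele({1, 3, 4}))  # True False
print(four_allele({2, 3, 5}))                         # False: 5 alleles at locus 1
print(two_allele({1, 3, 5}))                          # True
```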
3.2. Relatedness Estimators
In the 1980s, several statistical coefficients of relatedness were introduced [31, 33, 36]. All of these methods use observed allele frequencies to define the probabilistic degree of relatedness between two individuals. In 1999, Queller and Goodnight improved on their approach [37] by defining simple statistical likelihood formulae for different types of relationships and used those to infer sibling relationships. The 1999 paper also defines a method to determine the statistical significance, or "p-value", of a relationship estimate. This is done by randomly generating two individuals using the observed allele frequencies and the estimated probabilities of inheriting a shared allele as defined in the paper. Such random pairs of individuals are generated a large number of times, and the likelihood ratio that excludes 95% of them is accepted as the threshold at p-value 0.05. Even though this approach was not presented or aimed as a method for sibship reconstruction, it served as a basis for the likelihood methods that followed.

A number of assumptions are made by all relatedness estimators, including the absence of mutations and genotyping errors. More importantly, the methods assume that a sample representative of the population has been scored and that accurate estimates of allele frequencies for the entire population are available. If these assumptions do not hold, results will be biased [34]. Finally, any method relying purely on a pairwise genetic distance may lead to inconsistent results, i.e., the transitivity of the sibling relationship may not hold. Moreover, as mentioned before, any pair of individuals can be siblings, and no pairwise distance estimate can exclude that possibility [49].
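The significance procedure described above is simple to simulate. In the sketch below, a plain allele-sharing fraction stands in for the Queller and Goodnight likelihood ratio, and the allele frequencies are hypothetical; only the calibration idea is taken from the text.

```python
# Calibrating a pairwise-statistic threshold by simulating unrelated pairs.
import numpy as np

def sample_individual(freqs, rng):
    """Draw one diploid genotype; freqs[j] maps allele -> frequency at locus j."""
    return [tuple(rng.choice(list(f), size=2, p=list(f.values()))) for f in freqs]

def allele_sharing(x, y):
    """Fraction of loci at which two individuals share at least one allele."""
    return np.mean([len(set(a) & set(b)) > 0 for a, b in zip(x, y)])

def null_threshold(freqs, n_pairs=10000, q=95, seed=0):
    rng = np.random.default_rng(seed)
    stats = [allele_sharing(sample_individual(freqs, rng),
                            sample_individual(freqs, rng))
             for _ in range(n_pairs)]
    return np.percentile(stats, q)  # cutoff excluding 95% of unrelated pairs

freqs = [{101: 0.5, 103: 0.3, 105: 0.2},   # hypothetical 2-locus frequencies
         {200: 0.25, 202: 0.25, 204: 0.5}]
print(null_threshold(freqs))
```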
4. Methods for Full Sibling Reconstruction
As more microsatellite markers become available for wild species, there is a growing interest in the possibility of inferring relatedness among individuals when part or all of the pedigree information is lacking [43]. The majority of the available software requires parental data; however, recently there have been several methods attempting to reconstruct sibship groups from genetic data without parental information [1, 2, 6, 8, 29, 32, 43, 49, 53]. Fernandez and Toro [18] and Butler et al. [9] review many of the methods discussed here. In their survey, Butler et al. [9] classified sibship reconstruction methods into two main groups: (1) methods that generate complete genealogical structures and, thus, require explicit pedigree reconstruction, and (2) pairwise methods that do not imply such pedigree reconstruction. The latter group can be subdivided into methods that estimate pairwise relatedness based on genotypic similarity and likelihood approaches that classify pairs of individuals into different types of relationships based on marker information.

In one of the earlier examples of the first type of method, Painter [32] used a Bayesian approach to calculate relationship likelihood and then an exhaustive search to find the most likely sibship in a small population of 9 individuals. He identified the need for better optimization techniques for larger populations. Among the methods that followed, some use Markov-Chain Monte Carlo (MCMC) techniques to locate a partition of individuals that maximizes the likelihood of the proposed family relationship, such as the COLONY [53]
software and Almudevar's method [1]. Smith [43] has developed an approach that maximizes a relatedness configuration score derived from the pairwise relatedness likelihood ratio. Almudevar and Field [2] used an exclusion principle that looks for the largest full-sibling families, using partial likelihoods to pick between families of the same size. Another approach is based on Simpson's index of concentration [9], where groups that conform to Mendelian inheritance rules are formed according to marker information. One of the advantages of these methods is that they avoid the inconsistency problems of the pairwise estimators described below. However, the statistical likelihood methods still depend on knowledge of population allelic data (to calculate likelihoods), which is typically unavailable or inaccurate. Moreover, since most of these methods employ global optimization at their core, they are usually computationally demanding.

As described above, a second type of approach, pairwise methods, is widely used for sibship reconstruction. While these methods are typically simple and fast, they suffer several disadvantages. First, they can lead to incongruous assignments because only two individuals are considered at a time and transitivity is not preserved. Second, like all statistical methods, they depend on knowledge of the allelic frequencies of the population considered. Third, if multiple definite relationships exist, such as full siblings, half siblings, or unrelated, arbitrary thresholds have to be defined to decide the category to which a particular pair is assigned [18].

Here, we consider a different classification of sibling reconstruction methods, based on the computational approach a method employs as the basis for reconstruction. SIBSHIP [49], Pedigree [43], KINGROUP [29], and COLONY [53] rely on statistical estimates of relatedness [37] and reconstruct the maximum likelihood sibling groups. Family Finder [8] and Almudevar [1] mix statistical and combinatorial approaches. Finally, Almudevar and Field [2], 2-allele Minimum Set Cover [5, 6, 10, 41] and Sheikh et al. [40] use only the fundamental Mendelian constraints and combinatorial techniques to reconstruct sibling groups. A common assumption of all but two (Sheikh et al. [40] and COLONY [53]) of the sibship reconstruction methods is that the molecular data are error- and mutation-free [18]. Data that contain errors test the robustness of these methods and are a major problem for the estimators involving pedigree reconstruction [9]. Following our computationally based classification, we now describe some of the methods in more detail, providing deeper analysis of the two best-performing methods (see Section 5. for experimental comparison), the likelihood-based COLONY and the combinatorial 2-allele Minimum Set Cover.
4.1. Statistical Likelihood Methods

As Painter's [32] first likelihood-based sibling reconstruction method exemplified, likelihood maximization methods require sophisticated optimization techniques to find the most likely sibship partition for datasets of more than 10 individuals. In 2000, Thomas and Hill [49] introduced a Markov Chain Monte Carlo (MCMC) approach to find the maximum likelihood sibship reconstruction. The method compares the likelihood ratio of two individuals being siblings to that of the pair being unrelated [36]. Starting with a random partition of individuals into potential sibling groups, the
method uses a "hill-climbing" approach to explore different sibship reconstructions, reassigning individuals into sibling groups to improve the likelihood of all pairs being siblings. The process continues until one of the halting conditions is reached: either the number of iterations exceeds a threshold, or the sibling reconstruction stabilizes, i.e., the likelihood value reaches a fixed point. The algorithm was not computationally efficient and was subsequently improved. Like most likelihood-based methods, its main assumption is that the sample at hand is representative of the entire population in terms of allele frequencies and, thus, relatedness probabilities. More detrimentally, the method also assumes that the population contains only full siblings and unrelated individuals, which typically does not hold for any population.

In 2002, Thomas and Hill [50] extended their approach by adding half sibling relationships, thus creating a limited family hierarchy. The algorithm is similar to their previous approach in [49], with the addition that an individual could be assigned to either a half sibling group or a full sibling group at every iteration. Half sibling groups were randomly created every few hundred iterations to ensure that a hierarchical structure existed in the population. In that paper, Thomas and Hill also explored the effects of population size, population structure, and the available allelic information on the performance of their MCMC approach. Typical of the statistical approaches, the accuracy of the reconstruction improved with the increase of available marker information and the nestedness of the full siblings within half sibling groups, but decayed with the increase of the population size.

In 2001, Smith et al. [43] presented two different MCMC methods for sibship reconstruction. One of the methods is very similar to [50], while the other aims to maximize the joint likelihood of the entire sibship reconstruction rather than the pairwise relatedness ratio. The methods performed very well on the Atlantic salmon dataset the authors used in the original publication. The software PEDIGREE is now available for general use as an online service. Smith et al. have also assayed the dependency of the accuracy of reconstruction on various data parameters. In general, the methods suffer from the typical assumptions of other statistical methods: the accuracy of reconstruction decreases when there is insufficient allelic diversity per locus or the sample is not representative of the population.

Konovalov et al. [29] introduced KINGROUP, available as an open-source Java program. KINGROUP uses the relatedness estimators of [37] with additional algorithms designed for the reconstruction of groups of kin that share a common relationship.

Family Finder [8] was introduced in 2003. It is a very efficient method that uses a combination of statistics and graph theory. This approach constructs a graph with individuals as vertices. Edges represent pairwise sibling relationships and are weighted using, again, the likelihood ratio of individuals being siblings versus being unrelated [37]. After constructing this graph, "clusters", or components, corresponding to sibling groups are identified by finding light edge cuts: cuts with fewer than one third of the edges in the graph are chosen. It is a simple and efficient method that can be effective if enough loci are available and allelic diversity is high.
While there is some theoretical basis, the use of the likelihood ratio implies the same assumptions as [37]. Furthermore, Family Finder assumes that sibling groups are roughly equally sized, a dubious assumption that often does not hold, especially for wild population samples.
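The graph formulation can be rendered in a few lines with networkx. The sketch below keeps only heavy (likely-sibling) edges and reads off connected components, which simplifies Family Finder's light-edge-cut search; `pair_score` and the threshold rule are placeholders, not Family Finder's actual statistic.

```python
# Simplified graph-based sibling grouping (not Family Finder's actual algorithm).
import networkx as nx

def sibling_components(individuals, pair_score, threshold):
    """Vertices are individuals; an edge is kept when its pairwise score is
    at least `threshold`; groups are the resulting connected components."""
    g = nx.Graph()
    g.add_nodes_from(range(len(individuals)))
    for i in range(len(individuals)):
        for j in range(i + 1, len(individuals)):
            w = pair_score(individuals[i], individuals[j])
            if w >= threshold:
                g.add_edge(i, j, weight=w)
    return list(nx.connected_components(g))
```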
4.1.1. COLONY

A different likelihood maximization approach was used by Wang [53]. COLONY is a comprehensive statistical approach that uses the simulated annealing heuristic to find a (local) likelihood maximum of a sibship reconstruction. The algorithm starts with the known full and half siblings (if any are available) and places the rest into singleton sibling groups, along with the computed likelihood of each group. A proposed alternate solution at every iteration is created by moving a random number of individuals from one full sibling group to another (neither group may be one of the known full sibling groups). For half siblings, a random number of entire full sibling groups are moved from one half sibling group to another; as before, these must not be the original known half sibling families. After generating a new proposed solution, the likelihood of the old and new configurations of the altered families is calculated, and the new configuration is accepted or rejected based on a threshold that depends on the ratio of the new and old likelihoods.

COLONY is the first method to fully accommodate sampling bias and genotyping errors, although it relies on many user input parameters to do so. Errors are estimated using the calculated probability of observing the given allele assuming the actual allele is different. The probabilities of allelic dropouts and other typing errors are based on [19]; allelic dropout is considered to be twice as likely as other errors. Simulated annealing relies on random numbers and explores a vast solution space, so COLONY can be quite slow, and its performance, both in terms of time and accuracy, depends drastically on the amount of microsatellite information available. COLONY was designed for both diploid and haplodiploid species. It is perhaps the most comprehensive and sophisticated method currently available for full sibling reconstruction, with a strong theoretical basis. However, in addition to other disadvantages common to all statistical sibship reconstruction methods, it also assumes that one of the parents is monogamous, which, unfortunately, renders it inappropriate for the many species that have promiscuous mating systems.
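The simulated-annealing search at the core of COLONY (and of Almudevar's method below) can be outlined generically. The sketch anneals over partitions with single-individual moves; `log_likelihood` is a placeholder for the method's actual sibship likelihood, and the cooling schedule and move set are our simplifications.

```python
# Generic simulated annealing over sibling-group partitions (illustrative only).
import math
import random

def anneal_partition(n, log_likelihood, steps=10000, t0=1.0, seed=0):
    rng = random.Random(seed)
    groups = [{i} for i in range(n)]            # start from singleton groups
    score = log_likelihood(groups)
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-6   # linear cooling schedule
        src, dst = rng.sample(range(len(groups)), 2)
        if not groups[src]:
            continue
        ind = rng.choice(sorted(groups[src]))   # move one individual src -> dst
        groups[src].discard(ind)
        groups[dst].add(ind)
        new_score = log_likelihood(groups)
        accept = (new_score >= score or
                  rng.random() < math.exp((new_score - score) / temp))
        if accept:
            score = new_score
        else:                                   # revert the rejected move
            groups[dst].discard(ind)
            groups[src].add(ind)
    return [g for g in groups if g]             # drop emptied groups
```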
4.2. Combinatorial Approaches

Combinatorial approaches to sibling reconstruction use Mendelian constraints to eliminate sibling groups that are infeasible and to form potential sibling groups that conform to these constraints. Different methods then use different objectives to choose among these groups to form the solution.

Almudevar and Field [2] were the first to introduce a combinatorial approach. They formulated the Mendelian properties in the form of graphs and constructed all maximal feasible sibling groups. They then performed an exhaustive search to select the minimal number of these groups, using the maximum likelihood of the reconstruction as the guide. Their approach yielded reasonably good results but was computationally very expensive, often resulting in the system running out of memory in our experiments (see Section 5.). Almudevar presented a “hybrid” approach in [1] that used simulated annealing based on MCMC methods to find a locally optimal solution. The method generates putative triplets of parents and children, and then uses simulated annealing to explore the space of different possible pedigrees. The exploration is similar to the approach taken by COLONY described above and uses the likelihood of the sibling group configuration as a guide. Such a heuristic approach is not
guaranteed to find a globally minimum number of sets. This new version of the method allows for the use of other information in the reconstruction, such as multiple generations of siblings, parental genotypes, and sex, where available. All the information is translated into constraints that guide the formation of the potential feasible solution.

4.2.1. 2-Allele Minimum Set Cover

The 2-Allele Minimum Set Cover approach [5, 6, 10, 41], like Almudevar and Field's, uses Mendelian constraints, specifically the 2-allele property, to form all maximal feasible sibling groups. The goal, then, is to find the smallest number of these that contain all individuals. Unlike Almudevar and Field's approach, this approach finds the true global, rather than a local, minimum. We describe the technical details of the approach and the computational complexity of this formulation of the problem below.

Recall that we are given a population U of n diploid individuals sampled at l loci, U = {X_1, ..., X_n}, where X_i = (⟨a_{i1}, b_{i1}⟩, ..., ⟨a_{il}, b_{il}⟩) and a_{ij} and b_{ij} are the two alleles of the individual i at locus j. The goal of the Minimum 2-Allele Set Cover problem is to find the smallest number of subsets S_1, ..., S_m such that each S_i ⊆ U satisfies the 2-allele constraint and ∪_i S_i = U. We shall denote the Minimum 2-Allele Set Cover on n individuals with l sampled loci as 2-ALLELE_{n,l}. Of all the sibling reconstruction problem formulations, this is the only one for which the computational complexity is known.

Computational Complexity

The Minimum 2-Allele Set Cover problem is a special case of the MINIMUM SET COVER problem, a classical NP-complete problem [28]. MINIMUM SET COVER is defined as follows: given a universe U of elements X_1, ..., X_n and a collection of subsets S of U, the goal is to find the minimum collection of subsets C ⊆ S whose union is the entire universe U. Recall that a (1 + ε)-approximate solution (or simply a (1 + ε)-approximation) of a minimization problem is a solution with an objective value no larger than 1 + ε times the value of the optimum, and an algorithm achieving such a solution is said to have an approximation ratio of at most 1 + ε. To say that a problem is r-inapproximable under a certain complexity-theoretic assumption means that the problem does not have an r-approximation unless that complexity-theoretic assumption is false. MINIMUM SET COVER cannot be approximated in polynomial time to within a factor of (1 − ε) ln n unless NP ⊆ DTIME(n^{log log n}) [16]. Johnson introduced a (1 + ln n)-approximation in 1974 [27]. In the 2-ALLELE_{n,l} problem the elements are the sampled individuals, and the sets S are the groups of individuals that satisfy the 2-allele property. The main difference between MINIMUM SET COVER and 2-ALLELE_{n,l}, or more generally the k-ALLELE_{n,l} problem for k ∈ {2, 4}, is that the latter adds the 2-allele or the 4-allele restriction on
the structure of the subsets S. We show that this restriction does not make the problem computationally easier, and that k-ALLELE_{n,l} remains NP-complete. A natural parameter of interest in this class of problems is the maximum size (number of elements) a of any set in S. We denote the corresponding problem of finding the minimum set cover when the size of sibling sets is at most a as a-k-ALLELE_{n,l} in the subsequent discussion. For example, 2-4-ALLELE_{n,l} and 2-2-ALLELE_{n,l} are the problem instances where each subset contains at most two individuals. Recall that any pair of individuals necessarily satisfies both the 2-allele and the 4-allele properties. Thus, the collection S for 2-k-ALLELE_{n,l} consists of all possible pairs of individuals, and the smallest number of subsets that contain all the individuals is any n/2 disjoint pairs. In general, if a is a constant, then a-k-ALLELE_{n,l} can be posed as a minimum set cover problem with the number of subsets polynomial in n and the maximum set size being a. This problem has a natural (1 + ln a)-approximation using the standard approximation algorithms for the minimum set cover problem [51]. For general a, the same algorithm guarantees an (a/c + ln c)-approximation for any constant c > 0.

Recently, Ashley et al. [4] obtained several non-trivial computational complexity results for these problems, which we restate here. For the smallest non-trivial value a = 3, the 3-k-ALLELE_{n,n³} problem is 1.0065-inapproximable unless RP = NP. This was proved by a reduction from the TRIANGLE PACKING problem [20, p. 192]. A (7/6 + ε)-approximation for any l > 0 and any constant ε > 0 is easily achieved using the results of Hurkens and Schrijver [25]. For the second smallest value a = 4 and l = 2, 4-k-ALLELE_{n,2} is 1.00014-inapproximable unless RP = NP, proved by a reduction from the MAX-CUT problem on cubic graphs via an intermediate novel mapping of a geometric nature. A (3/2 + ε)-approximation can be achieved for a = 4 by using the result of Berman and Krysta [7]. An n^ε-inapproximability result under the assumption ZPP ≠ NP was proved for all sufficiently large values of a, that is, a = n^δ, where ε is any constant strictly less than δ. This result was obtained by reducing from a suitable hard instance of the graph coloring problem.

In all the reductions above, additional loci play the important role of adding complexity to the problem to ensure the inapproximability result. Thus, interestingly and somewhat counterintuitively, while sampling more loci provides more information and typically improves the accuracy of most sibling reconstruction methods, it also adds computational complexity and increases the computational time needed to construct the solution, even beyond the scope of practical computability.

The Algorithm

In [6] we presented a fully combinatorial solution for the sibling reconstruction problem based on the 2-Allele Minimum Cover formulation. We briefly describe the 2-ALLELE COVER algorithm here. The algorithm works by first generating all maximal sibling groups that obey the 2-allele property and then finding the minimum number of sibling groups necessary to explain the data. The algorithm maintains a complete enumeration of canonical possible sibling groups, called the possibilities table, shown in Table 2. Each potential sibling group is mapped to a set of possible canonical representations. Genetic feasibility of membership of each new individual in a sibling group is checked using this
mapping. The intricate process of generating the maximal feasible 2-allele sets is described in detail in [6]. The 2-allele property reduces the possible combinations of alleles at a locus in a group of siblings down to a few canonical options, assuming that the alleles in the group are renumbered 1 through 4. Table 2 lists all the different types of sibling groups possible under the 2-allele property using such a numbering. We obtain the table by listing all possible pairs of parents whose alleles are among 1, 2, 3, and 4 and all the genetically different offspring they can produce. However, in any sibling group with a given set of parents only a subset of the offspring possibilities from the table may be present.

Table 2. Canonical possible combinations of parent alleles and all resulting offspring allele combinations ⟨allele a, allele b⟩

    Parents              Offspring ⟨a, b⟩
    (1, 2) and (3, 4)    ⟨1,3⟩ ⟨1,4⟩ ⟨2,3⟩ ⟨2,4⟩ ⟨3,1⟩ ⟨4,1⟩ ⟨3,2⟩ ⟨4,2⟩
    (1, 2) and (1, 3)    ⟨1,1⟩ ⟨1,3⟩ ⟨2,1⟩ ⟨2,3⟩ ⟨3,1⟩ ⟨1,2⟩ ⟨3,2⟩
    (1, 2) and (1, 2)    ⟨1,1⟩ ⟨1,2⟩ ⟨2,1⟩ ⟨2,2⟩
    (1, 1) and (1, 1)    ⟨1,1⟩
    (1, 1) and (1, 2)    ⟨1,1⟩ ⟨1,2⟩ ⟨2,1⟩
    (1, 1) and (2, 3)    ⟨1,2⟩ ⟨1,3⟩ ⟨2,1⟩ ⟨3,1⟩
    (1, 1) and (2, 2)    ⟨1,2⟩ ⟨2,1⟩
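As a concrete illustration, a candidate sibling group can be tested for feasibility locus by locus using the two necessary conditions that appear as the Theorem 1 constraints later in this section (distinct plus homozygous alleles at most 4; no allele co-occurring with more than two others). The sketch below is ours; the data layout and names are assumptions.

    def satisfies_2allele_at_locus(genotypes):
        # genotypes: list of (a, b) allele pairs of the group at one locus.
        distinct = {a for pair in genotypes for a in pair}
        homozygous = {a for (a, b) in genotypes if a == b}
        # Distinct alleles plus homozygous alleles must not exceed 4.
        if len(distinct) + len(homozygous) > 4:
            return False
        # No allele may co-occur with more than two other alleles.
        partners = {a: set() for a in distinct}
        for a, b in genotypes:
            if a != b:
                partners[a].add(b)
                partners[b].add(a)
        return all(len(p) <= 2 for p in partners.values())

    def feasible_sibling_group(group):
        # group: list of individuals, each a list of (a, b) pairs per locus;
        # the 2-allele property must hold at every locus independently.
        return all(satisfies_2allele_at_locus([ind[l] for ind in group])
                   for l in range(len(group[0])))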
The maximal feasible 2-allele sets are generated using the canonical possibilities in Table 2 in a way that provably produces all such maximal sets, using a provably minimal number of queries per individual. After that, the minimum set cover is constructed
as the solution to the sibling reconstruction problem. Note that since 2-allele minimum cover and MINIMUM SET COVER are both NP-complete problems, the solution time is not guaranteed to be polynomial. We use the commercial mixed integer linear programming solver CPLEX¹ to solve the problem to optimality. On datasets with several hundred individuals it may take several hours to days to obtain a solution.
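To make the set cover step concrete, here is a minimal integer-programming sketch of the cover selection, written with the open-source PuLP modeller instead of CPLEX; sib_sets stands for the maximal feasible 2-allele groups produced by the enumeration step, and all names are ours.

    from pulp import LpProblem, LpVariable, LpMinimize, LpBinary, lpSum

    def min_set_cover(individuals, sib_sets):
        prob = LpProblem("two_allele_min_cover", LpMinimize)
        use = [LpVariable("use_%d" % j, cat=LpBinary)
               for j in range(len(sib_sets))]
        prob += lpSum(use)                       # minimize number of groups
        for i in individuals:                    # every individual covered
            prob += lpSum(use[j] for j, s in enumerate(sib_sets) if i in s) >= 1
        prob.solve()
        return [s for j, s in enumerate(sib_sets) if use[j].value() > 0.5]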
Subsequently, Chaovalitwongse et al. [10] presented a full mathematical optimization formulation for the Minimum 2-Allele Cover problem. We shall briefly describe the 2-ALLELE OPTIMIZATION MODEL (2AOM) here. The formulation directly models the objective of finding the minimum number of 2-allele sets that contain all individuals, rather than using the intermediate steps of generating all maximal 2-allele sets and finding the minimum set cover of those.

    Individual   Locus 1 alleles ⟨a, b⟩   Locus 2 alleles ⟨a, b⟩   ...
    1            44, 44                   55, 27
    2            12, 56                   18, 39
    3            28, 44                   55, 18
    4            13, 13                   39, 27
    5            28, 51                   18, 39
    ...

Figure 2. A multidimensional matrix representation of a dataset of microsatellite samples.

Recall that U is the set of individuals, S is the collection of sibling groups, and C ⊆ S is the reconstructed set of sibling groups which is returned as the solution. Let K be the set of possible observed alleles and L be the set of sampled loci. As the input, we are given |U| = n individuals sampled at |L| = l loci. We represent the data as a multidimensional 0-1 matrix M, shown in Figure 2. The matrix entry M(i, k, l) = 1 if the individual i ∈ U has the allele k ∈ K at locus l ∈ L. From the input matrix, a^l_{ik} is defined as an indicator variable equal to 1 if the first allele at locus l of individual i is k. Similarly, b^l_{ik} is an indicator variable for the second allele at locus l of individual i being k. f^l_{ik} = max{a^l_{ik}, b^l_{ik}} is an indicator of whether k appears at locus l of individual i, that is, M(i, k, l) = f^l_{ik}. Finally, h^l_{ik} = a^l_{ik} · b^l_{ik} is an indicator of whether the individual i is homozygous (allele k appears twice) at locus l. The following decision variables are then used:

• z_s ∈ {0, 1}: indicates whether any individual is selected to be a member of sibling group s;
• x_{is} ∈ {0, 1}: indicates whether the individual i is selected to be a member of sibling group s;
¹ CPLEX is a registered trademark of ILOG.
• y^l_{sk} ∈ {0, 1}: indicates whether any member of sibling group s has the allele k at locus l;
• w^l_{sk} ∈ {0, 1}: indicates whether there is at least one homozygous individual in sibling group s with the allele k appearing twice at locus l;
• v^l_{skk′} ∈ {0, 1}: indicates whether the allele k appears with allele k′ in sibling group s at locus l.
With these variables, the mathematical representation of the objective function and the constraints of the 2AOM problem is as follows.

Objective function: The overall objective is to minimize the total number of sibling groups:

    min Σ_{s∈S} z_s
The minimization objective is subject to three types of constraints, stated below.

Cover and logical constraints: Ensure that every individual is assigned to at least one sibling group:

    Σ_{s∈S} x_{is} ≥ 1,  ∀i ∈ U
The binary sibling group variable z_s is activated by the assignment of any individual i to the sibling group s:

    x_{is} ≤ z_s,  ∀i ∈ U, ∀s ∈ S
2-allele constraints: Activate the binary allele indicator variable y^l_{sk} with the assignment of any individual i to the sibling set s. Here C1 is a large constant which can be defined as C1 = 2|U| + 1:

    Σ_{i∈U} f^l_{ik} x_{is} ≤ C1 · y^l_{sk},  ∀s ∈ S, ∀k ∈ K, ∀l ∈ L
Activate the binary indicator variables for homozygous individuals with allele k appearing twice at locus l in sibling group s. Here C2 is a large constant which can be defined as C2 = |U| + 1:

    Σ_{i∈U} h^l_{ik} x_{is} ≤ C2 · w^l_{sk},  ∀s ∈ S, ∀k ∈ K, ∀l ∈ L

Activate the binary indicator variable for the allele pair v^l_{skk′} for any assignment to the sibling group s of an individual i with alleles ⟨k, k′⟩ at locus l. Here C3 is a large constant and can be defined as C3 = |U| + 1:

    Σ_{i∈U} f^l_{ik} f^l_{ik′} x_{is} ≤ C3 · v^l_{skk′},  ∀s ∈ S, ∀k ≠ k′ ∈ K, ∀l ∈ L
Ensure that the number of distinct alleles plus the number of homozygous alleles does not exceed 4, conforming to Theorem 1:

    Σ_{k∈K} (y^l_{sk} + w^l_{sk}) ≤ 4,  ∀s ∈ S, ∀l ∈ L

Every allele in the set should not appear with more than two other alleles (excluding itself), also conforming to Theorem 1:

    Σ_{k′∈K∖{k}} v^l_{skk′} ≤ 2,  ∀s ∈ S, ∀k ∈ K, ∀l ∈ L
Binary and nonnegativity constraints:

    z_s, x_{is}, y^l_{sk}, w^l_{sk} ∈ {0, 1},  ∀i ∈ U, ∀s ∈ S, ∀k ∈ K, ∀l ∈ L
The total number of discrete variables in the 2AOM is O(|U||K||S|), and so is the total number of constraints. Thus, the 2AOM formulation of the 2-allele minimum cover problem is a very large-scale mixed integer programming problem and may not be easy to solve for large instances. The main justification for a formal mathematical model of the problem is that it allows for the theoretical investigation of its computational properties and guides approximation approaches.
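As an illustration of the preprocessing the 2AOM requires, the sketch below builds the f and h indicator arrays of Figure 2 from raw genotypes; the data layout and function name are our assumptions.

    import numpy as np

    def build_indicators(genotypes, allele_index):
        # genotypes[i][l] = (a, b) is the genotype of individual i at locus l;
        # allele_index maps an observed allele to a column index in K.
        n, L, K = len(genotypes), len(genotypes[0]), len(allele_index)
        f = np.zeros((n, K, L), dtype=int)   # f[i, k, l]: allele k present
        h = np.zeros((n, K, L), dtype=int)   # h[i, k, l]: allele k homozygous
        for i, ind in enumerate(genotypes):
            for l, (a, b) in enumerate(ind):
                f[i, allele_index[a], l] = 1
                f[i, allele_index[b], l] = 1
                if a == b:
                    h[i, allele_index[a], l] = 1
        return f, h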
4.3. Consensus-based Approach

Among all the methods for sibling reconstruction, only COLONY [53] is designed to tolerate genotyping errors or mutations. Yet errors and mutations cannot be avoided in practice, and identifying them without any prior kinship information is a challenging task. A new approach for reconstructing sibling relationships from microsatellite data, designed explicitly to tolerate genotyping errors and mutations and based on the idea of a consensus of several partial solutions, was proposed by Sheikh et al. in [40, 42].

Consider an individual X_i which has some genotyping error(s). Any error that affects sibling reconstruction must prevent X_i's sibling relationship with at least one other individual X_j who in reality is its sibling. It is unlikely that an error would cause two unrelated individuals to be paired up as siblings, unless all error-free loci do not contain enough information. Thus, we can discard one locus at a time, assuming it to be erroneous, and obtain a sibling reconstruction solution based on the remaining loci. If all such solutions put the individuals X_i and X_j in the same sibling group (i.e., there is a consensus among those solutions), we consider them to be siblings. The core of the consensus-based error-tolerant approach is concerned with pairs of individuals that do not consistently end up in the same sibling group during this process, that is, for which there is no consensus about their sibling relationship.

Definition 2. A consensus method for the sibling reconstruction problem is a computable function f that takes k solutions S = {S_1, ..., S_k} as input and computes one final solution.
The strict consensus places two individuals into the same sibling group only if they are together in all input solutions. While it always results in a consistent solution, it also produces many singleton sibling groups. In [40, 42] a distance-based consensus for sibling reconstruction was introduced. Starting with the strict consensus of the input solutions, the distance-based consensus iteratively merges two sets until the quality of the solution can no longer be improved. The computational complexity and the algorithms change depending on the cost of the merging operations and the function that defines the quality of the solution. The approach taken in [40, 42] uses the number of sibling groups in the resulting solution as the measure of its quality; that is, it seeks to minimize the number of groups. The cost of the merging operation is based on the size of the groups being merged and the errors that need to be corrected for the 2-allele property to be preserved in the combined group. Any method, or a mix of methods, for sibling reconstruction can be used as the base to produce the input solutions for the consensus method. The running time of the consensus method depends on the running times of the base methods. In our experiments (see Section 5.) the consensus based on the 2-allele minimum cover algorithm typically achieved over 95% accuracy.
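A minimal sketch of the strict consensus step described above: each input solution is one sibling partition (for instance, the reconstruction obtained after dropping one locus), and two individuals share a final group only if every solution puts them together. Names are ours.

    def strict_consensus(solutions, individuals):
        def together(i, j, solution):
            return any(i in g and j in g for g in solution)

        groups = []
        for i in individuals:
            for g in groups:
                rep = next(iter(g))   # one representative suffices, since
                                      # "together in all solutions" is an
                                      # equivalence relation
                if all(together(i, rep, sol) for sol in solutions):
                    g.add(i)
                    break
            else:
                groups.append({i})    # no consensus: start a singleton
        return groups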
5. Experimental Validation
To assess and compare the accuracy of various sibling reconstruction methods, we used datasets with known genetics and genealogy. Since most sibling reconstruction methods do not tolerate errors in data, we first used error-free datasets. However, biological datasets containing no errors are rare. Thus, in addition to biological datasets, we created simulated datasets using a large number of parameters over a wide range of values. We compare the performance of five sibling reconstruction methods, spanning the variety of computational techniques: Almudevar and Field [2], Family Finder [8], KINGROUP [29], COLONY [53], and 2-allele Minimum Cover [6]. In addition, we used the same datasets with introduced errors to assess the performance of COLONY and the distance-based consensus of the 2-allele Minimum Cover when errors are present. We measure the error by comparing the known sibling sets with those generated by the various sibling reconstruction methods and calculating the minimum partition distance [21]. The error is the percentage of individuals that would need to be removed to make the reconstructed sibling sets equal to the true sibling sets. Note that we compute the error in terms of individuals, not in terms of the number of sibling groups reconstructed incorrectly; thus, the accuracy is the percentage of individuals correctly assigned to sibling groups. The experiments were run on a combination of a cluster of 64 mixed AMD and Intel Xeon nodes with 2.8 GHz and 3.0 GHz processors and a single Intel Xeon Quad Core 3.2 GHz machine with 24 GB of RAM.
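The error measure lends itself to a compact implementation: the minimum partition distance reduces to a maximum-weight matching between true and reconstructed groups, computed here with SciPy's assignment solver. This sketch is ours.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def partition_error(true_groups, found_groups, n):
        # Overlap matrix between every true and every reconstructed group.
        overlap = np.array([[len(t & f) for f in found_groups]
                            for t in true_groups])
        rows, cols = linear_sum_assignment(-overlap)   # maximize agreement
        matched = overlap[rows, cols].sum()
        # Fewest individuals to remove so the two partitions coincide,
        # expressed as a percentage of the n sampled individuals.
        return 100.0 * (n - matched) / n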
5.1. Biological Datasets
For validation of our methods, both the 2-allele algorithm and the consensus extension, we used biological datasets of offspring that resulted from one generation of controlled crosses; thus,
the identity of the parents and their microsatellite genotypes were known.

Radishes. The wild radish Raphanus raphanistrum dataset is a subsample of [11]. It consists of samples from 64 radishes from two families with 11 sampled loci. Close to 53% of allele entries are missing.

Salmon. The Atlantic salmon Salmo salar dataset comes from the genetic improvement program of the Atlantic Salmon Federation [23]. We use a truncated sample of 351 individuals from 6 families and 4 loci. There are no missing alleles at any locus. This dataset is a subset of one of the samples of genotyped individuals used by [2] to illustrate their technique.

Shrimp. The tiger shrimp Penaeus monodon dataset [26] consists of 59 individuals from 13 families with 7 loci. There are 16 missing allele entries (3.87% of all allele entries).

Flies. The Scaptodrosophila hibisci dataset [56] consists of 190 same-generation individuals (flies) from 6 families sampled at a varying number of loci with up to 8 alleles per locus. All individuals shared at least 2 sampled loci, which were chosen for our study. 25% of allele entries were missing.

Table 3 summarizes the results of the four algorithms on the biological datasets.

Table 3. Accuracy (percent) of the 2-allele algorithm and the three reference algorithms on biological datasets. Here l is the number of loci in a dataset and the "Inds" column gives the number of individuals in the dataset. The three reference algorithms are [2] (A&F), Family Finder [8] (B&M), and KINGROUP [29] (KG).

    Dataset     l   Inds   Ours     A&F             B&M     KG
    Shrimp      7   59     77.97    67.80           77.97   77.97
    Salmon      4   351    98.30    Out of memory   99.71   96.02
    Radishes    5   64     75.90    Out of memory   53.30   29.95
    Flies       2   190    100.00   31.05           27.89   54.73
Almudevar and Field's algorithm ran out of the available 4 GB of memory on the salmon and radish datasets.
5.2. Synthetic Datasets
To test and compare sibling reconstruction approaches, we also use random simulations to produce synthetic datasets. We first create random diploid parents and then generate complete genetic data for their offspring, varying the number of males, females, alleles, loci, families, and offspring per family. We then use the 2-allele algorithm described above to reconstruct the sibling groups. We compare our results to the actual known sibling groups in the data to assess accuracy, measuring the error rate of the algorithm using the Gusfield partition distance [21]. In addition, we compare the accuracy of our 2-allele algorithm to the two reference sibling reconstruction methods, [8] and [29], described
above. We repeat the entire process 1000 times for each fixed combination of parameter values. We omit a comparison with the algorithm of [2], since the current version of the provided software requires user interaction, which makes it infeasible to use in an automated simulation pipeline of 1000 iterations over a hundred combinations of parameter values.

First, we generate the parent generation of M males and F females with l loci and a specified number of alleles per locus a. We create populations with uniform as well as non-uniform allele distributions. After the parents are created, their offspring are generated by selecting f pairs of parents. A male and a female are chosen independently, uniformly at random from the parent population, and for these parents a specified number of offspring o is generated. Here, too, we create populations with uniform as well as skewed family size distributions. Each offspring randomly receives one allele from its mother and one from its father at each locus (a sketch of this procedure is given after the parameter list below). This is a rather simplistic approach; however, it is consistent with the genetics of known parents and provides a baseline for the accuracy of the algorithm, since biological data are generally not random and uniform. The parameter ranges for the study are as follows:

• The number of adult females F and the number of adult males M were equal and set to 5, 10, or 15.
• The number of loci sampled: l = 2, 4, 6.
• The number of alleles per locus (for the uniform allele frequency distribution): a = 5, 10, 15.
• Non-uniform allele frequency distribution (for 4 alleles): 12 - 4 - 1 - 1, as in [1].
• The number of families in the population: f = 2, 5, 10.
• The number of offspring per mating pair (for the uniform family size distribution): o = 2, 5, 10.
• Non-uniform family size distribution (for 5 families): 25 - 10 - 10 - 4 - 1, as in [1].

All datasets were generated on the 64-node cluster running RedHat Linux 9.0. The 2-allele algorithm is used on each generated population to find the smallest number of 2-allele sets necessary to explain the offspring population. We use the commercial MIP solver CPLEX 9.0 for Windows XP on a single-processor machine to solve the minimum set cover problem to optimality. The reference algorithms were run on a single-processor machine running Windows XP².

We measure the reconstruction accuracy of the various methods as a function of the number of alleles per locus, the family size (number of offspring), the number of families (and polygamy), and the variation in the allele frequency and family size distributions. Figure 3 shows representative results for the accuracy of our 2-allele algorithm, the Greedy Consensus algorithm, and the two reference algorithms on uniform allele frequency and family size distributions.
² The difference in platforms and operating systems is dictated by the available software licenses and provided binary code.
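A sketch of the generation procedure referenced above (uniform variant; the non-uniform allele and family-size settings would replace the uniform draws with weighted ones). Parameter names are ours.

    import random

    def simulate_population(M, F, f, o, l, a):
        # M males, F females, f families, o offspring per family,
        # l loci, a alleles per locus (uniform frequencies).
        def random_parent():
            return [(random.randrange(a), random.randrange(a))
                    for _ in range(l)]

        males = [random_parent() for _ in range(M)]
        females = [random_parent() for _ in range(F)]
        families = []
        for _ in range(f):
            dad, mom = random.choice(males), random.choice(females)
            # Each offspring receives one allele from each parent per locus.
            kids = [[(random.choice(d), random.choice(m))
                     for d, m in zip(dad, mom)] for _ in range(o)]
            families.append(kids)
        return families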
Figure 3. Accuracy of the sibling group reconstruction methods on randomly generated data. The y-axis shows the accuracy of reconstruction as a function of various simulation parameters. The accuracy of our 2-allele algorithm and the Greedy Consensus approach is shown, as well as that of the two reference algorithms, [8] and COLONY [53]. The title of each panel shows the values of the fixed parameters: the number of adult males/females, the number of families, the number of offspring per family, the number of loci, and the number of alleles per locus.
The results of COLONY, our 2-allele Minimum Cover, and the consensus-based approach on simulated datasets with introduced errors are shown in Figure 4.
Figure 4. Results on simulated datasets with errors. Only 50 iterations were used for the COLONY algorithm due to its computational inefficiency and time constraints.
Overall, we have compared our 2-allele algorithm, as well as the robust consensus approach, to the best existing sibling reconstruction methods on biological and synthetic data over a wide range of parameters. We have identified the strengths and weaknesses of the various approaches to sibling reconstruction and pinpointed the data parameters under which they are manifested.
6. Conclusion
Full utilization of the new genetic tools provided by advances in DNA and genome analysis will only be realized if computational approaches to exploit the genetic information keep pace. Pedigree reconstruction in wild populations is an emerging field, made possible by the development of markers, particularly DNA microsatellites, that can be used to genotype any organism, including free-living populations sampled in the field. Rules of Mendelian inheritance and principles of population genetics can be applied to microsatellite genotyping data to infer familial relationships such as parentage and sibship, and thus reconstruct wild pedigrees. Such pedigrees, in turn, can be used to learn about a species' evolutionary potential, mating systems and reproductive patterns, dispersal, and inbreeding (reviewed in [35]). The findings of pedigree reconstruction have been especially notable in the area of paternity assignment, where dozens of examples of previously undocumented multiple paternity have now been reported (e.g. [15, 17, 44, 52]). Our focus has been on a more challenging computational problem than paternity (or parentage) assignment, that of sibling reconstruction. Sibling reconstruction is needed when wild samples consist primarily of offspring cohorts, in cases where it is logistically difficult or impossible to sample the parental generation. We first developed a formal definition of the sibling reconstruction problem and formalized the genetics of sibship. Sibling reconstruction methods can be divided into three categories depending on their approach: methods that rely only on statistical estimates of relatedness [29, 32, 43, 49, 50, 53], those that combine statistical and combinatorial approaches [8], and those that use only Mendelian constraints and combinatorial techniques [1, 2, 5, 6, 10, 41]. Statistical methods rely on estimates of pairwise relatedness and typically reconstruct maximum likelihood sibling groups. The performance of statistical methods depends upon an accurate estimate of the underlying allele frequencies within the sampled population, rather than the observed sample. Furthermore, they are often computationally demanding. Combinatorial approaches offer the advantage that sibling groupings are based only on Mendelian constraints, without needing information on population allele frequencies. A new method we describe here, the 2-allele minimum set cover, generates all sibling groups that obey the 2-allele property and then finds the minimum number of sibling groups needed to explain the data. To accommodate genotyping errors and mutations, we also describe a new consensus-based approach, applied here to the 2-allele minimum cover algorithm. We tested the performance of various sibling reconstruction methods using both real biological data and synthetic datasets. For the real data, the actual pedigree and sibgroups were known from controlled crosses, and we tested the accuracy of five different methods in recovering the known sibgroups. We found that our 2-allele distance-based consensus method performed very well, recovering over 95% of the known sibgroups. We also produced synthetic datasets which simulated a variety of mating systems, family structures,
and genetic data. Again, our method produced very good results. Of the other methods tested, COLONY [53], a statistical approach, also performed very well when the assumption of monogamy held and there were a sufficient number of loci and accurate estimates of allele frequencies. There is no one method that is guaranteed to provide the correct answer, since samples of different populations suffer from different sampling biases, and all methods make assumptions that may not hold for a specific dataset. We favor the 2-allele method for this very reason: it makes the fewest assumptions. Moreover, the 2-allele algorithm performs well over a wide range of data parameters, making it a good general method, especially when few loci are sampled or the allelic variation is low. Our current recommendation is to use the proposed consensus approach on the 2-allele method in combination with other available methods, keeping in mind aspects of the study organism's biology or sampling biases, as a way to achieve confidence in sibling reconstruction. Another consideration is the presentation and implementation of the methods. Most molecular ecologists do not have a background in computer science, and will opt for a method that is easily accessible, user-friendly, and produces results that can be readily interpreted, regardless of the underlying mathematical or computational elegance. COLONY is available as a Windows executable; however, it is computationally intensive and, as such, is impractical to run on a personal computer. Our method does not require installation on a user's computer but is provided as a web-based service: it only requires an Internet connection to send the dataset for analysis using a web interface³. Our software accepts files formatted using Excel, which is widely used by biologists. Sibling reconstruction is among the first kinship reconstruction problems to have generated a variety of computational methods. However, more complicated pedigrees and genealogical relationships await computational solutions. Computationally, kinship reconstruction in wild populations is not only a rich source of interesting problems, but one that poses the particular challenge of testing the accuracy of devised solutions. Real biological data must be used to conduct comparisons of the feasibility and accuracy of different methods. More benchmark data are needed to ground-truth algorithms and software. Finally, novel approaches must be developed to assess the accuracy of the resulting solutions and the confidence in the answers provided.
Acknowledgments

This research is supported by the following grants: NSF IIS-0612044 and IIS-0611998 (Berger-Wolf, Ashley, Chaovalitwongse, DasGupta), Fulbright Scholarship (Sheikh), NSF CCF-0546574 (Chaovalitwongse). We are grateful to the people who have shared their data with us: Jeff Connor, the Atlantic Salmon Federation, Dean Jerry, and Stuart Barker. We would also like to thank Anthony Almudevar, Bernie May, and Dmitry Konovalov for sharing their software.
³ See http://compbio.cs.uic.edu for more details.
References

[1] A. Almudevar. A simulated annealing algorithm for maximum likelihood pedigree reconstruction. Theoretical Population Biology, 63(2):63–75, 2003.
[2] A. Almudevar and C. Field. Estimation of single generation sibling relationships based on DNA markers. Journal of Agricultural, Biological, and Environmental Statistics, 4(2):136–165, 1999.
[3] M. V. Ashley and B. D. Dow. The use of microsatellite analysis in population biology: background, methods and potential applications. EXS, 69:185–201, 1994.
[4] M. Ashley, T. Y. Berger-Wolf, P. Berman, W. Chaovalitwongse, B. DasGupta, and M.-Y. Kao. On approximating four covering/packing problems with applications to bioinformatics. Technical report, DIMACS, 2007.
[5] T. Y. Berger-Wolf, B. DasGupta, W. Chaovalitwongse, and M. V. Ashley. Combinatorial reconstruction of sibling relationships. In Proceedings of the 6th International Symposium on Computational Biology and Genome Informatics (CBGI 05), pages 1252–1255, Utah, July 2005.
[6] T. Y. Berger-Wolf, S. I. Sheikh, B. DasGupta, M. V. Ashley, I. C. Caballero, W. Chaovalitwongse, and S. P. Lahari. Reconstructing sibling relationships in wild populations. Bioinformatics, 23(13):49–56, July 2007.
[7] P. Berman and P. Krysta. Optimizing misdirection. In SODA '03: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 192–201, Philadelphia, PA, USA, 2003. Society for Industrial and Applied Mathematics.
[8] J. Beyer and B. May. A graph-theoretic approach to the partition of individuals into full-sib families. Molecular Ecology, 12:2243–2250, 2003.
[9] K. Butler, C. Field, C. M. Herbinger, and B. R. Smith. Accuracy, efficiency and robustness of four algorithms allowing full sibship reconstruction from DNA marker data. Molecular Ecology, 13(6):1589–1600, 2004.
[10] W. Chaovalitwongse, T. Y. Berger-Wolf, B. DasGupta, and M. V. Ashley. Set covering approach for reconstruction of sibling relationships. Optimization Methods and Software, 22(1):11–24, February 2007.
[11] J. K. Conner. Personal communication, 2006.
[12] J. L. Constable, M. V. Ashley, J. Goodall, and A. E. Pusey. Noninvasive paternity assignment in Gombe chimpanzees. Molecular Ecology, 10(5):1279–1300, 2001.
[13] B. D. Dow and M. V. Ashley. Microsatellite analysis of seed dispersal and parentage of saplings in bur oak, Quercus macrocarpa. Molecular Ecology, 5(5):615–627, May 1996.
[14] B. D. Dow and M. V. Ashley. High levels of gene flow in bur oak revealed by paternity analysis using microsatellites. Journal of Heredity, 89:62–70, January 1998.
[15] H. L. Dugdale, D. W. MacDonald, L. C. Pop, and T. Burke. Polygynandry, extra-group paternity and multiple-paternity litters in European badger (Meles meles) social groups. Molecular Ecology, 16:5294–5306, 2007.
[16] U. Feige. A threshold of ln n for approximating set cover. Journal of the ACM, 45:634–652, 1998.
[17] K. A. Feldheim, S. H. Gruber, and M. V. Ashley. Population genetic structure of the lemon shark (Negaprion brevirostris) in the western Atlantic: DNA microsatellite variation. Molecular Ecology, 10(2):295–303, February 2001.
[18] J. Fernández and M. A. Toro. A new method to estimate relatedness from molecular markers. Molecular Ecology, pages 1657–1667, May 2006.
[19] P. Gagneux, C. Boesch, and D. S. Woodruff. Microsatellite scoring errors associated with noninvasive genotyping based on nuclear DNA amplified from shed hair. Molecular Ecology, 6(9):861–868, September 1997.
[20] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., 1979.
[21] D. Gusfield. Partition-distance: A problem and class of perfect graphs arising in clustering. Information Processing Letters, 82(3):159–164, May 2002.
[22] M. A. Halverson, D. K. Skelly, and A. Caccone. Kin distribution of amphibian larvae in the wild. Molecular Ecology, 15(4):1139–1145, 2006.
[23] C. M. Herbinger, P. T. O'Reilly, R. W. Doyle, J. M. Wright, and F. O'Flynn. Early growth performance of Atlantic salmon full-sib families reared in single family tanks versus in mixed family tanks. Aquaculture, 173(1–4), March 1999.
[24] J. T. Hogg and S. H. Forbes. Mating in bighorn sheep: frequent male reproduction via a high-risk unconventional tactic. Behavioral Ecology and Sociobiology, 41(1):33–48, July 1997.
[25] C. A. Hurkens and A. Schrijver. On the size of systems of sets every t of which have an SDR, with applications to worst-case heuristics for packing problems. SIAM Journal on Discrete Mathematics, 2(1):68–72, 1989.
[26] D. R. Jerry, B. S. Evans, M. Kenway, and K. Wilson. Development of a microsatellite DNA parentage marker suite for black tiger shrimp Penaeus monodon. Aquaculture, pages 542–547, May 2006.
[27] D. S. Johnson. Approximation algorithms for combinatorial problems. Journal of Computer and System Sciences, 9:256–278, 1974.
[28] R. M. Karp. Reducibility among combinatorial problems. In R. E. Miller and J. W. Thatcher, editors, Complexity of Computer Computations, pages 85–103. Plenum Press, 1972.
[29] D. A. Konovalov, C. Manning, and M. T. Henshaw. KINGROUP: a program for pedigree relationship reconstruction and kin group assignments using genetic markers. Molecular Ecology Notes, 4(4):779–782, December 2004.
[30] T. C. Marshall, J. Slate, L. E. B. Kruuk, and J. M. Pemberton. Statistical confidence for likelihood-based paternity inference in natural populations. Molecular Ecology, 7(5):639–655, May 1998.
[31] D. E. McCauley, M. J. Wade, F. J. Breden, and M. Wohltman. Spatial and temporal variation in group relatedness: Evidence from the imported willow leaf beetle. Evolution, 42(1):184–192, January 1988.
[32] I. Painter. Sibship reconstruction without parental information. Journal of Agricultural, Biological, and Environmental Statistics, 2(2):212–229, 1997.
[33] P. Pamilo. Genotypic correlation and regression in social groups: multiple alleles, multiple loci and subdivided populations. Genetics, 107(2):307–320, 1984.
[34] P. Pamilo. Estimating relatedness in social groups. Trends in Ecology & Evolution, 4(11):353–355, 1989.
[35] J. M. Pemberton. Wild pedigrees: the way forward. Proceedings of the Royal Society B: Biological Sciences, 2008.
[36] D. C. Queller and K. F. Goodnight. Estimating relatedness using genetic markers. Evolution, 43(2):258–275, March 1989.
[37] D. C. Queller and K. F. Goodnight. Computer software for performing likelihood tests of pedigree relationship using genetic markers. Molecular Ecology, 8(7):1231–1234, July 1999.
[38] R. K. Saiki, D. H. Gelfand, S. Stoffel, S. J. Scharf, R. Higuchi, G. T. Horn, K. B. Mullis, and H. A. Erlich. Primer-directed enzymatic amplification of DNA with a thermostable DNA polymerase. Science, 239(4839):487–491, 1988.
[39] C. Schlötterer. The evolution of molecular markers — just a matter of fashion? Nature Reviews Genetics, 5:63–69, January 2004.
[40] S. I. Sheikh, T. Y. Berger-Wolf, M. V. Ashley, I. C. Caballero, W. Chaovalitwongse, and B. DasGupta. Error-tolerant sibship reconstruction in wild populations. In Proceedings of the 7th Annual International Conference on Computational Systems Bioinformatics (CSB) (to appear), 2008.
[41] S. I. Sheikh, T. Y. Berger-Wolf, W. Chaovalitwongse, and M. V. Ashley. Reconstructing sibling relationships from microsatellite data. In Proceedings of the European Conference on Computational Biology (ECCB), January 2007.
[42] S. I. Sheikh, T. Y. Berger-Wolf, A. A. Khokhar, and B. DasGupta. Consensus methods for reconstruction of sibling relationships from genetic data. In Proceedings of the 4th Workshop on Advances in Preference Handling (to appear), 2008.
[43] B. R. Smith, C. M. Herbinger, and H. R. Merry. Accurate partition of individuals into full-sib families from genetic data without parental information. Genetics, 158(3):1329–1338, July 2001.
[44] S. M. Sogard, E. Gilbert-Horvath, E. C. Anderson, R. Fisher, S. A. Berkeley, and J. Carlos Garza. Multiple paternity in viviparous kelp rockfish, Sebastes atrovirens. Environmental Biology of Fishes, 81:7–13, 2008.
[45] B. M. Strausberger and M. V. Ashley. Breeding biology of brood parasitic brown-headed cowbirds (Molothrus ater) characterized by parent-offspring and sibling-group reconstruction. The Auk, 120(2):433–445, 2003.
[46] B. M. Strausberger and M. V. Ashley. Host use strategies of individual female brown-headed cowbirds Molothrus ater in a diverse avian community. Journal of Avian Biology, 36(4):313–321, 2005.
[47] R. Streiff, A. Ducousso, C. Lexer, H. Steinkellner, J. Gloessl, and A. Kremer. Pollen dispersal inferred from paternity analysis in a mixed oak stand of Quercus robur L. and Q. petraea (Matt.) Liebl. Molecular Ecology, 8(5):831–841, 1999.
[48] D. Tautz. Hypervariability of simple sequences as a general source for polymorphic DNA markers. Nucleic Acids Research, 17(16):6463–6471, August 1989.
[49] S. C. Thomas and W. G. Hill. Estimating quantitative genetic parameters using sibships reconstructed from marker data. Genetics, 155(4):1961–1972, 2000.
[50] S. C. Thomas and W. G. Hill. Sibship reconstruction in hierarchical population structures using Markov chain Monte Carlo techniques. Genetics Research, 79:227–234, 2002.
[51] V. Vazirani. Approximation Algorithms. Springer, 2001.
[52] M. J. Vonhof, D. Barber, M. B. Fenton, and C. Strobeck. A tale of two siblings: multiple paternity in big brown bats (Eptesicus fuscus) demonstrated using microsatellite markers. Molecular Ecology, 15:241–247, 2006.
[53] J. Wang. Sibship reconstruction from genetic data with typing errors. Genetics, 166:1968–1979, April 2004.
[54] J. L. Weber and P. E. May. Abundant class of human DNA polymorphisms which can be typed using the polymerase chain reaction. American Journal of Human Genetics, 44(3):388–396, March 1989.
[55] D. F. Westneat and M. S. Webster. Molecular analysis of kinship in birds: interesting questions and useful techniques. In B. Schierwater, B. Streit, G. P. Wagner, and R. DeSalle, editors, Molecular Ecology and Evolution: Approaches and Applications, pages 91–128. Basel, 1994.
[56] A.A.C. Wilson, P. Sunnucks, and J.S.F. Barker. Isolation and characterization of 20 polymorphic microsatellite loci for Scaptodrosophila hibisci. Molecular Ecology Notes, 2:242–244, 2002.
In: Computational Biology: New Research Editor: Alona S. Russe
ISBN: 978-1-60692-040-4 © 2009 Nova Science Publishers, Inc.
Chapter 10
RECENT ISSUES AND COMPUTATIONAL APPROACHES FOR DEVELOPING PROGNOSTIC GENE SIGNATURES FROM GENE EXPRESSION DATA

Seon-Young Kim¹,* and Hyun Ju Kim²,†

¹ Functional Genomics Research Center, KRIBB, 111 Gwahangno, Yuseong-gu, Daejeon, 305-806, Korea
² Department of Food and Nutrition, Daejeon Health Sciences College, 77-3 Gayang 2-dong, Dong-gu, Daejeon, 300-711, Korea
Abstract

Microarray gene expression profiling, which monitors the expression of tens of thousands of genes simultaneously, is a promising tool for developing prognostic markers for cancer patients. Many researchers have applied microarray gene expression profiling in order to develop better prognostic markers, and have demonstrated promising results in many types of cancer. Unfortunately, there are concerns regarding the premature clinical use of newly developed prognostic gene signatures, as problems associated with their application remain unresolved, diminishing the reliability of their intended results. This review first discusses these presently unsolved problems in the development of prognostic gene signatures. Recent computational approaches to circumventing these problems are then presented, and therein we discuss these approaches in the categorized framework of mechanism-derived bottom-up approaches, meta-analytic approaches, integrative approaches that combine genomics and clinical data, and subtype-specific analysis approaches. We believe that recent bioinformatics approaches, which integrate rapidly accumulating genomics, clinical, and other forms of data, will help overcome current problems, and will help realize the successful application of prognostic gene signatures in personalized medicine.
∗ Correspondence: Seon-Young Kim ([email protected]); Tel.: 82-42-879-8116; Fax: 82-42-879-8119.
† E-mail: [email protected]
1. Introduction

The accurate prognosis of cancer patients is important for avoiding over- or under-treatment, and is a means to improve a patient's survival and quality of life. For example, many early-stage, node-negative breast cancer patients currently receive hormonal therapy and/or adjuvant chemotherapy to prevent distant metastases, while it is estimated that approximately 70 to 80% of these patients would have survived without these treatments [1]. The situation is similar for other types of cancer. For example, while adjuvant chemotherapy is clearly beneficial for stage III colon cancer patients, its benefit for stage II colon cancer patients is not clearly defined [2]. Improved prognostic markers that can discern high-risk from low-risk patients will save the 75% of colon cancer patients who are ordinarily treated by surgery only from the unnecessary suffering that results from chemotherapy [2, 3].

The recent development of several high-throughput technologies, including microarray gene expression profiling, proteomics, metabolomics, and genome-wide genotyping, has provided researchers with an unprecedented opportunity to develop effective biomarkers for cancer diagnosis, prognosis, and treatment. Among these technologies, microarray gene expression profiling has been the most widely used, as it readily provides information concerning the expression of tens of thousands of genes. The use of gene expression profiling for prognostic marker development has steadily increased in recent years. For example, when we queried PubMed with the keywords 'microarray cancer prognosis,' the number of returned articles increased from five in 1999 to 365 in 2007 (Figure 1). Indeed, many researchers have applied gene expression profiling to develop better prognostic markers, and have demonstrated that gene expression signatures can predict patient outcomes more effectively than conventional clinical criteria for many types of cancer, including bladder [4, 5], breast [1, 3, 6], colon [3, 7, 8], glioma [9, 10], head and neck [5, 11], kidney [12, 13], leukemia [14, 15], liver [16], lung [17, 18], lymphoma [19, 20], prostate [21], and stomach [22] cancers. There are already several commercial products specifically targeted to breast cancer, including MammaPrint from Agendia (http://www.agendia.com) and Oncotype DX from Genomic Health (http://www.genomichealth.com), which are used to select breast cancer patients for adjuvant chemotherapy. Additionally, several large-scale clinical trials are presently under way to prove the clinical utility of gene expression-based prognostic tests [23-27].

Unfortunately, there are also some concerns regarding the usefulness of gene expression-based prognostic markers, as several issues remain unresolved [28-30], and many promising results presented in the literature suffer from one or more flaws [31-33]. These unresolved issues include the small overlap between independently developed gene signatures [30], the instability of many published gene signatures [28], and the poor performance of a gene signature when applied to other data sets [34]. In this review, we first discuss the current challenges to the development of prognostic gene signatures. We then describe recent computational approaches to overcoming these problems, with the intent to develop improved prognostic gene signatures.
[Figure 1: plot of the number of papers (y-axis, 0–400) by year (x-axis, 1998–2008).]

Figure 1. Exponential growth of the number of publications studying the development of prognostic gene signatures for predicting cancer patient outcomes. We searched PubMed using the keywords 'microarray cancer prognosis' from 1999 to 2007 and summarized the number of published articles in each year.
2. Current issues in the development of prognostic gene signatures

2.1. Small overlap and instability of gene expression signatures

The first challenge to prognostic gene signature development is that there is only a small overlap between independently identified prognostic gene signatures, which undermines their reliability. This problem can be illustrated by two examples, consisting of three breast cancer prognostic signatures and three colon cancer prognostic signatures (Figure 2). In the breast cancer example, the pairwise overlap between three well-known prognostic gene signatures is only three genes between the Amsterdam-70 [1] and Veridex-76 [3] signatures, one gene between the Amsterdam-70 and Sweden-64 signatures [35], and no genes between the Veridex-76 and Sweden-64 signatures (Figure 2A). The situation is similar in the colon cancer example. When three prognostic gene signatures for discerning good- from poor-prognosis patients were compared, there was an overlap of only one gene between the Wang-23 [3] and Eschrich-121 [7] gene signatures, and no gene overlap between the Barrier-30 [8] and either the Wang-23 or Eschrich-121 gene signatures (Figure 2B). In fact, since many gene expression studies have observed such small overlaps between gene signatures, this has become a subject of great interest.
[Figure 2: Venn diagrams of the pairwise gene overlaps among the three breast cancer signatures (A) and the three colon cancer signatures (B).]

Figure 2. The small overlap between independently developed prognostic gene signatures. Two examples, derived from breast and colon cancer studies, are shown. We retrieved the reported prognostic gene signatures from each study and counted the number of overlapping genes between the different gene signatures. A. Breast cancer: Amsterdam-70 [1, 6], Veridex-76 [3], and Sweden-64 [35]. B. Colon cancer: Wang-23 [3], Eschrich-121 [7], and Barrier-30 [8].
The small overlap between gene signatures, as observed in several studies in the literature, was originally ascribed to many factors, including differences in the microarray platforms used, patient cohorts, and statistical analyses; however, Ein-Dor et al. demonstrated that even with a single data set, wherein many of the above-mentioned variables were controlled, it is possible to create many non-overlapping prognostic gene signatures [30]. Similarly, when Michiels et al. re-analyzed seven published microarray studies using a multiple random sampling strategy, they found that many of the originally reported prognostic gene signatures were not re-identified, casting doubt on the stability of prognostic gene signatures [28]. Moreover, they found that five of the seven re-analyzed studies did not classify patients better than random chance, and concluded that the prognostic value of many overly optimistic studies needs to be cautiously evaluated. Ein-Dor et al. further studied this instability problem, developed a mathematical method to better understand the relationship between gene list overlap and sample size, and concluded that thousands of samples are needed to generate a gene list with more than 50% overlap [34].

Clearly, one of the causes of the small overlap and instability of prognostic gene signatures is small sample size [34, 36]; however, researchers are now realizing that the complexity and scale of genomic data are a more fundamental cause of the small overlap and instability problems [67, 68]. Current gene expression data contain information on tens of thousands of genes, whose expression levels are correlated and co-regulated in complex ways. Due to the complexity and correlation structure of gene expression data, many equally prognostic but related gene signatures can be identified from the same data [62, 67]. For example, although Ein-Dor et al. were able to identify at least eight equally prognostic gene signatures from a single data set, the eight gene signatures were not completely independent of one another, but represented similar biological processes when analyzed at the gene set level [62, 69, 70]. Fan et al. applied five gene expression signatures
to a single data set to compare the predictions derived from distinct gene signatures, and found that four of the five gene signatures agreed well in their predicted outcomes, despite the poor overlap between them [70]. Thus, in terms of the predicted outcome, which should be the final criterion for measuring classifier performance, recent gene expression-based prognostic classifiers agree well with one another [68, 70]. In this regard, thousands of samples may not be needed for developing prognostic gene signatures, so long as a reliable predicted outcome, rather than a high degree of overlap between gene lists, is the primary criterion for a marker's performance [68].
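The multiple-random-sampling idea of Michiels et al. [28] is straightforward to express in code: re-derive the signature on many random half-samples and record how often each gene recurs. In the sketch below, a plain correlation ranking stands in for whatever selection rule a given study uses; the data layout and names are our assumptions.

    import numpy as np

    def signature_stability(X, y, k=70, n_splits=100, seed=0):
        # X: samples x genes expression matrix; y: numeric outcome.
        rng = np.random.default_rng(seed)
        counts = np.zeros(X.shape[1])
        for _ in range(n_splits):
            idx = rng.choice(len(y), size=len(y) // 2, replace=False)
            Xc = X[idx] - X[idx].mean(axis=0)
            yc = y[idx] - y[idx].mean()
            corr = np.abs(Xc.T @ yc) / (
                np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
            counts[np.argsort(corr)[-k:]] += 1   # top-k "signature" genes
        return counts / n_splits                 # per-gene selection frequency

Low per-gene selection frequencies across resamplings reproduce, in miniature, the instability these studies report.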
2.2. Poor inter-study performance

The second challenge to prognostic gene signature development is poor inter-study performance, wherein the performance of a prognostic gene signature developed and validated in one data set is not reproduced in another data set [34]. The causes of poor inter-study performance include differences in microarray platforms, patient cohorts, and data analysis. For example, many important prognostic genes selected on one microarray platform cannot be evaluated in data sets produced using another microarray platform, simply because those genes are not represented on the second platform. As a result of the reduced number of prognostic genes, the overall performance of a gene signature is likely to be reduced. Poor inter-study performance is also closely related to the problems of small overlap and instability of prognostic gene signatures. According to Michiels et al., the performance of many promising gene signatures was not reproduced even in their own data sets when seven data sets were re-analyzed using a multiple random sampling method [28]. The predicted outcome performance was observed to depend strongly on the patients selected for the training sets [28]. Another important cause of poor inter-study performance is the preponderance of flaws in the statistical analyses of many gene expression studies, which is the subject of the next section.
2.3. Flaws in statistical analyses in many published microarray studies

The third challenge to prognostic gene signature development involves the preponderance of flaws in many inadequately reported microarray gene expression studies. Microarray gene expression studies deal with data whose features number in the tens of thousands, while samples typically number in the tens to hundreds, creating a situation that can be simultaneously described as both a curse of dimensionality and a curse of dataset sparsity [36]. This peculiar nature of microarray data is a significant challenge to statistical analysis, and is also a source of the flaws inherent in many reported studies [31-33, 36-39]. Dupuy et al. reviewed many published microarray studies investigating cancer patient outcomes, found various flaws, and summarized them in three categories: an inadequate control of the false-discovery rate in outcome-related gene findings, a spurious claim of outcome-related class discovery using genes selected for their correlation with outcome, or a biased estimation of the prediction accuracy of outcomes in supervised class prediction [33].
Among these, we discuss in greater detail a few flaws commonly found in supervised class prediction. The first flaw is the reporting of overly optimistic results stemming from the inadequate validation of identified prognostic gene signatures. Because samples are limited in most gene expression studies, training-testing splits or n-fold cross-validation are commonly used. The critical point is that the entire model-building process should be repeated in each cross-validation step [33, 38]; a partial validation is likely to produce overly optimistic results [33, 38]. For example, van't Veer et al.'s work has been criticized for including the same patients in the validation step, leading to much lower error rates [1, 6, 38]. Reporting an odds ratio, a hazard ratio, or a p-value from a log-rank test to assess the performance of a prognostic classifier is another flaw found in many studies [33, 40, 41]. The odds or hazard ratio is simply a measure of association, not of prediction accuracy [33, 40]. The performance of a prognostic marker should be assessed by how successfully it classifies patients into different prognostic groups; therefore, the prediction error rate, specificity, and sensitivity should be the ultimate results for reporting [40]. In addition, the value of new markers should be judged by their ability to improve an already optimized predictive model [42, 43].
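As a concrete illustration of full (rather than partial) validation, the scikit-learn sketch below nests gene selection inside the cross-validation loop by wrapping selection and classification in one pipeline; the toy data and the choice of selector and classifier are ours, for illustration only.

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline

    # Toy stand-in data: 100 "patients" by 2000 "genes", binary outcome.
    X, y = make_classification(n_samples=100, n_features=2000,
                               n_informative=20, random_state=0)

    # Because selection sits inside the pipeline, the entire model-building
    # process (gene selection included) is redone on each training fold.
    model = Pipeline([
        ("select", SelectKBest(f_classif, k=70)),   # signature chosen per fold
        ("classify", LogisticRegression(max_iter=1000)),
    ])
    print(cross_val_score(model, X, y, cv=10).mean())

Selecting the 70 genes once on the full data and only cross-validating the classifier would, by contrast, leak outcome information into the signature and understate the error rate.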
3. Current computational approaches for developing prognostic gene signatures

We now describe recent computational approaches to developing prognostic gene signatures. Many researchers have suggested interesting ideas for integrating genomic, clinical, and other forms of data to develop more efficient prognostic gene signatures, and to enhance our understanding of the carcinogenic processes underlying different clinical outcomes. Herein, we have grouped these ideas into the categories of bottom-up approaches, meta-analyses, integrative analyses, and subtype-specific analyses.
3.1. Bottom-up approach

Most gene expression studies aiming to develop prognostic gene signatures begin with complete gene expression data, and then derive a prognostic model by selecting a number of genes (usually in the tens) according to the degree of association between individual genes and clinical parameters. This approach is referred to as top-down, as it is not based on any mechanistic assumptions about the prognostic model. Recently, an opposite approach, referred to as the bottom-up approach, has been successfully applied to the development of prognostic gene signatures for several types of cancer [44, 45]. In the bottom-up approach, the first step is the derivation of mechanism-based gene model(s) from experimental data or a priori knowledge, such as pathway or gene ontology information. The prognostic value of the gene model is then validated in gene expression data from clinical samples. In addition to its potential for developing effective prognostic gene signatures, the bottom-up approach has the added advantage of providing specific, testable hypotheses about the mechanisms behind the disease [45]. In the bottom-up approach, mechanism-based gene models are derived from diverse sources (Figure 3). In vitro-derived cellular processes, comparative genomics approaches using transgenic or
knockout mouse models, and a priori knowledge, such as pathways and gene ontology information, are examples of data sources used in gene model construction.

[Figure 3 is a schematic: gene sets from diverse sources (an in vitro cell model, an in vivo animal model, Gene Ontology, KEGG, and BioCarta) yield candidate gene signature(s), which are evaluated with gene expression data of clinical samples to produce prognostic gene signatures.]

Figure 3. Bottom-up approaches for developing prognostic gene signatures. In vitro cellular models, comparative genomics approaches using in vivo animal models, and the gene sets approach using predefined gene sets are three examples of mechanism-derived bottom-up approaches for gene expression signature identification.
3.1.1. In vitro-derived prognostic models

In vitro-derived prognostic models begin with the identification of genes differentially expressed during specific cellular processes. For example, Chang et al. identified a common serum response signature by selecting genes differentially expressed in fibroblasts upon serum treatment, and defined a core serum response signature by removing a set of proliferation-related genes [46]. They then observed that the defined core serum response signature was consistently expressed in diverse human cancers, including breast, lung, gastric, prostate, and hepatocellular carcinomas, and was prognostic of metastasis and patient survival in those cancers [45, 46]. Bild et al. identified gene expression patterns of several oncogenic pathways by infecting human mammary epithelial cells with adenoviruses expressing c-Myc, activated H-Ras, c-Src, E2F3, or activated β-catenin [47]. They then demonstrated that tumors can be classified into prognostically different groups by their patterns of pathway deregulation. Moreover, they demonstrated that the patterns of pathway deregulation in breast cancer cell lines could predict the sensitivity of the cells to therapeutic drugs that target those pathways, thus paving the way for individualized treatments [48]. Oh et al. treated the ER+ MCF-7 breast cancer cell line with 17β-estradiol to identify estrogen-related genes, observed their natural pattern of expression in primary tumors in order to divide primary tumors into two
groups, identified a gene signature by supervised analysis of the two groups of tumors, and finally validated their outcome predictor in three independent data sets [49].
3.1.2. Comparative genomics approach

The comparative genomics approach using in vivo transgenic or knockout mouse models has also been useful for understanding tumor progression processes and developing prognostic gene signatures. Lee et al. developed seven different mouse models of hepatocellular carcinoma (HCC), including Myc, Myc-Tgfa, Myc-E2f1, E2f1, Acox1-/-, diethylnitrosamine (DENA)-induced, and ciprofibrate-induced models, and compared the patterns of gene expression in the seven mouse models to the patterns from 91 human HCCs [50]. They found that the gene expression patterns of Myc, E2f1, and Myc-E2f1 transgenic mice were most similar to those of human HCC patients with good prognoses, while the gene expression patterns of Myc-Tgfa mice and DENA-induced mice were most similar to those of human HCC patients with poor prognoses. Their work demonstrated that appropriate mouse models can be effectively used to understand human cancers [50]. Lee et al. later integrated gene expression data from rat fetal fibroblasts and adult hepatocytes with HCCs from human and mouse models, and demonstrated that a fetal fibroblast-like gene expression pattern was indicative of poor prognoses among human HCC patients [51]. Sweet-Cordero et al. generated mouse lung cancers using the latent mutated Kras2 allele, developed a gene expression signature from the Kras2-mediated mouse lung cancer model, and applied gene set enrichment analysis to compare the gene expression patterns of mouse models and human lung cancers [52]. They then identified a gene expression signature for the KRAS2 mutation in human lung cancer by integrating mouse and human data, and validated their gene expression signature by gene expression analysis of a KRAS2-knockdown human lung cancer cell line [52].

3.1.3. Gene sets approach

The gene sets approach is a recently introduced bottom-up approach that uses predefined gene sets prepared from diverse biological knowledge, including pathways, chromosomal locations, protein domains, protein-protein interactions, and gene ontology information [53-55]. Because genes in the same pathways or biological processes are often co-regulated, observing gene expression changes at the gene set level enables the detection of moderate but coordinated changes that are often missed by individual gene analysis [55, 56]. Moreover, concordance between two independent studies is greatly improved by gene set-level comparison [53, 57]; thus, a gene set approach is a potential solution to the problem of the small overlap between independent studies pursuing similar biological questions. Pang et al. described a pathway-based classification and regression method using a random forests algorithm for the analysis of gene expression data [58]. They prepared a total of 441 pathways, derived from the KEGG and BioCarta pathway databases, and applied random forest classification to analyze categorical phenotypes and random forest regression to analyze continuous clinical outcomes. They demonstrated that the pathway-based method may be more useful for identifying and developing good classifiers and predictors than single gene-based methods [58]. Chuang et al. described a network-based classification of metastasis that combines the analysis of protein-protein interaction and gene expression data [59]. Their approach circumvents the weakness of microarray gene expression analysis in identifying genes that contribute to metastasis through gene mutation (e.g., TP53, BRCA1, and ERBB2 in breast cancer)
rather than by changes in gene expression. By focusing on sub-networks of interconnected genes instead of individual genes, their method facilitated the identification of mutated, but not differentially expressed, genes that interconnect many differentially expressed genes. In addition, they showed that sub-network markers are more reproducible between different studies than markers developed without network information, and that sub-network markers achieve a higher accuracy of outcome prediction [59]. Kim et al. described a gene sets approach for identifying prognostic gene signatures for outcome prediction by simultaneously applying a gene set-based classification on multiple data sets [56]. They collected 12 publicly available breast cancer data sets comprising 1,756 tissues, and prepared a total of 2,411 gene sets from diverse sources, including gene ontology, pathways, protein domains, and chromosomal locations. By exhaustively searching all gene sets against all data sets, they found many gene sets to be prognostic in most of the analyzed data sets. Many gene sets related to biological processes, such as cell cycle and proliferation, were found to have prognostic power in differentiating metastatic from non-metastatic breast cancer patients. As more data sets become available, their approach will be useful in developing stable prognostic gene signatures and understanding the underlying biology for different patient outcomes [56].
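The core computation in such gene set analyses is compact. As a hedged illustration, in the spirit of parametric gene set enrichment [53] rather than any exact published implementation, the score below compares a set's mean expression change with the global distribution of gene-level changes:

import numpy as np

def gene_set_z(fold_changes, gene_set):
    # fold_changes: per-gene (log) expression changes for the whole array
    # gene_set: indices of the genes belonging to one predefined set
    fold_changes = np.asarray(fold_changes, dtype=float)
    mu = fold_changes.mean()                # mean of all gene-level changes
    sigma = fold_changes.std(ddof=1)        # their standard deviation
    members = fold_changes[list(gene_set)]
    return (members.mean() - mu) * np.sqrt(len(members)) / sigma

Because the set mean is averaged over many genes, moderate but coordinated shifts yield a large |Z| even when no single member passes a gene-level threshold, which is exactly the behavior the gene sets approach exploits [55, 56].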
3.2. Meta-analysis

Meta-analysis is the quantitative synthesis of information from several studies [60], and is applicable to various study designs in genetics, from family-based linkage studies and population-based association studies to genome-wide association studies [61]. By combining the relevant evidence from many studies, one can reach more precise estimates of effect. In the area of cancer prognosis, the integrated analysis of multiple data sets can provide broader insight into the genetic regulation of specific biological pathways under a variety of conditions. Moreover, increasing the sample size by meta-analysis increases the possibility of developing more robust prognostic gene signatures. Using the integrated analysis of two independent microarray data sets of breast cancer prognosis, Zhang et al. confirmed that a gene expression profile generated by integrated analysis of multiple data sets achieves better prediction of breast cancer recurrence [62]. Choi et al. adopted the classical meta-analysis framework in microarray analysis, and used a t-like statistic, defined as an effect size, as the summary statistic [63]. Using a hierarchical modeling approach to assess both intra- and inter-study variation across multiple data sets, they estimated, through parameter estimation, an overall effect size as the measurement of the magnitude of differential expression for a gene. They then presented the advantages of the effect size approach applied to microarray data: it provides a standardized index that allows a direct comparison between results from different measures; it is based on a well-established statistical framework for combining different results, allowing multiple microarray data sets to be integrated efficiently; and, through appropriate modeling of inter-study variation, it can accommodate variability between multiple studies. With the explosion of microarray data generated by different investigators working on similar experiments, combining results across multiple studies using different platforms is challenging. Shen et al. proposed a Bayesian mixture model-based transformation of DNA microarray data, and applied it to develop a signature of breast
cancer recurrence across multiple microarray experiments produced using different platforms [64]. They combined multiple studies on a common probability scale, and developed a 90-gene meta-signature that was strongly associated with survival in breast cancer patients. The meta-signature accommodated the heterogeneity of diverse study settings, and achieved better prognostic performance than the individual signatures. A key feature of the model is the use of latent variables representing quantities that can be combined across diverse platforms [65]. Warnat et al. used mean rank scores and quantile discretization to derive numerically comparable measures of gene expression from different platforms, and achieved higher classification accuracies with the combined data set than with individual data sets [66]. Recently, Hong et al. compared three meta-analysis methods for detecting differentially expressed genes in microarray experiments: t-based hierarchical modeling, rank products, and Fisher's inverse χ2 test for combining P-values [67]. Using both simulated and real data sets, they demonstrated that, in general, the non-parametric rank product method had higher sensitivity and specificity than the parametric t-based method [67]. They also demonstrated that meta-analysis, whether parametric or non-parametric, always identified more genes at the same P-value, suggesting increased power and a potentially lower false negative rate.
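At its core, the effect-size framework reduces to an inverse-variance weighted average. The following is a minimal fixed-effects sketch; the hierarchical, random-effects treatment of Choi et al. [63] additionally estimates a between-study variance that is added to each study's variance, a step omitted here for brevity:

import numpy as np

def combine_effect_sizes(effects, variances):
    # effects: per-study effect sizes for one gene (e.g., standardized
    # mean differences); variances: their estimated variances
    effects = np.asarray(effects, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)   # precision weights
    mu = (w * effects).sum() / w.sum()             # pooled effect size
    se = np.sqrt(1.0 / w.sum())                    # its standard error
    return mu, se, mu / se                         # pooled estimate and z

# invented example: three studies measuring the same gene
pooled, se, z = combine_effect_sizes([0.8, 1.1, 0.6], [0.09, 0.12, 0.25])

Because the pooled standard error shrinks as studies accumulate, genes with consistent moderate effects across studies can reach a significance that no single study would grant them.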
3.3. Integrated analysis of genomic and clinical data

Cancer is an enormously heterogeneous disease, represented by complex biological phenotypes that reflect multiple genetic changes. As a result of this complexity and heterogeneity, individual cancer patients have distinct tumor phenotypes, disease outcomes, and responses to therapies [68]. For these reasons, the integrated analysis of both genomic and clinical information is necessary to understand the full spectrum of diverse carcinogenic processes, and to develop individualized prognostic and predictive regimens [68]. The integrated use of both genomic and clinical data was introduced by Pittman et al., who suggested an integrated clinicogenomic modeling framework based on statistical classification tree models [69]. They first summarized gene expression data in terms of metagenes, each a dominant common expression pattern within a cluster of genes, and achieved improved accuracy in predicting recurrence for individual breast cancer patients by combining metagenes with traditional clinical risk factors [69]. Gevaert et al. proposed a strategy based on Bayesian networks to integrate both clinical and microarray data in constructing prognostic models that classify cancer patients into poor or good prognosis groups [70, 71]. The probabilistic model has the advantage of flexibility in model building, allowing data sources to be integrated in several ways [70]. The integrative analysis of genomic and clinical data was also successfully applied to predict the outcome of patients with diffuse large B-cell lymphoma after chemotherapy [72]. Sun et al. applied a new feature selection algorithm, referred to as I-RELIEF, to derive a hybrid prognostic signature from both gene expression and clinical data, and demonstrated that the hybrid signature performed better than either conventional clinical markers or markers developed from gene expression data alone [73].
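The metagene idea is easy to sketch: each gene cluster is summarized by its dominant expression pattern, and clinical covariates are appended before any model fitting. This is a hedged illustration of the general recipe, not the exact framework of Pittman et al.; the clustering itself and the downstream model are left open.

import numpy as np

def metagene_features(X, clusters):
    # X: samples x genes matrix; clusters: list of gene-index lists
    feats = []
    for genes in clusters:
        sub = X[:, genes] - X[:, genes].mean(axis=0)     # center the cluster
        _, _, vt = np.linalg.svd(sub, full_matrices=False)
        feats.append(sub @ vt[0])   # first principal component = metagene
    return np.column_stack(feats)

# hybrid design matrix: metagenes plus clinical covariates (age, grade, ...)
# Z = np.column_stack([metagene_features(X, clusters), clinical])

Any downstream model, classification trees as in [69] or Bayesian networks as in [70], can then weigh both data sources when predicting outcome.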
3.4. Subtype-specific analysis

As previously mentioned, cancer is a vastly heterogeneous disease. For example, molecular profiling studies have established that breast cancer consists of at least three to six heterogeneous subtypes [74-76]. The status of estrogen receptor (ER) expression is one of the most important molecular characteristics for distinguishing breast cancer patients and guiding their hormonal therapy [74, 75, 77]. Another study demonstrated that different breast cancer molecular subtypes respond differently to preoperative chemotherapy, emphasizing the importance of identifying molecular subtypes within each cancer [78]. The subtype-specific development of prognostic gene signatures is thus another promising approach, considering the enormous molecular heterogeneity of many cancers. For example, in breast cancer, where several ER+-specific and a few ER−-specific prognostic gene signatures have been reported, ER+ and ER− prognostic gene signatures were found to be fundamentally different from one another in terms of biological processes and pathways [3, 49, 79, 80]. While most ER+ gene signatures are primarily composed of genes related to cell proliferation and growth, an immune response gene expression module was representative of an ER− prognostic gene signature, clearly demonstrating the importance of developing separate prognostic gene signatures for ER+ and ER− breast cancers [49, 79, 80].
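In code, subtype-specific development simply means stratifying before, rather than after, signature selection. A minimal sketch, reusing the illustrative select_genes helper from the cross-validation example in Section 2.3:

import numpy as np

def subtype_signatures(X, y, er_status, n_genes=50):
    # er_status: per-sample subtype label (e.g., 0 for ER-, 1 for ER+)
    signatures = {}
    for subtype in np.unique(er_status):
        mask = er_status == subtype
        signatures[subtype] = select_genes(X[mask], y[mask], n_genes)
    return signatures

Run on breast cancer data, the ER+ and ER− signatures selected this way would be expected to share few genes, mirroring the proliferation- versus immune-dominated modules described above [49, 79, 80].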
4. Conclusion

Microarray gene expression profiling has dramatically increased our understanding of cancer biology, and has revealed a new opportunity for developing effective prognostic markers for cancer patients. Many gene expression studies have presented promising results toward developing prognostic markers, although there are also concerns about the premature clinical use of gene expression-based prognostic markers due to several unresolved issues. Clearly, the complexity and enormous amount of information in gene expression data is one source of these problems. Recently, many promising computational approaches, including bottom-up, meta-analytic, integrative, and subtype-specific analyses, have contributed to an improved use of complex genomics data, and will lead to the development of more effective prognostic gene signatures.
Acknowledgements

This work was supported by grant NTC700711 from the Korea Research Council for Fundamental Science & Technology and by a KRIBB Research Initiative Grant (to S.Y.K.).
References

[1] van 't Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., Schreiber, G. J., Kerkhoven, R. M., Roberts, C., Linsley, P. S., Bernards, R. & Friend, S. H. (2002).
Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415(6871), 530-6.
[2] Johnston, P. G. (2005). Stage II colorectal cancer: to treat or not to treat. The oncologist, 10(5), 332-4.
[3] Wang, Y., Klijn, J. G., Zhang, Y., Sieuwerts, A. M., Look, M. P., Yang, F., Talantov, D., Timmermans, M., Meijer-van Gelder, M. E., Yu, J., Jatkoe, T., Berns, E. M., Atkins, D. & Foekens, J. A. (2005). Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet, 365(9460), 671-9.
[4] Dyrskjot, L., Thykjaer, T., Kruhoffer, M., Jensen, J. L., Marcussen, N., Hamilton-Dutoit, S., Wolf, H. & Orntoft, T. F. (2003). Identifying distinct classes of bladder carcinoma using microarrays. Nature genetics, 33(1), 90-6.
[5] Chung, C. H., Parker, J. S., Karaca, G., Wu, J., Funkhouser, W. K., Moore, D., Butterfoss, D., Xiang, D., Zanation, A., Yin, X., Shockley, W. W., Weissler, M. C., Dressler, L. G., Shores, C. G., Yarbrough, W. G. & Perou, C. M. (2004). Molecular classification of head and neck squamous cell carcinomas using patterns of gene expression. Cancer cell, 5(5), 489-500.
[6] van de Vijver, M. J., He, Y. D., van't Veer, L. J., Dai, H., Hart, A. A., Voskuil, D. W., Schreiber, G. J., Peterse, J. L., Roberts, C., Marton, M. J., Parrish, M., Atsma, D., Witteveen, A., Glas, A., Delahaye, L., van der Velde, T., Bartelink, H., Rodenhuis, S., Rutgers, E. T., Friend, S. H. & Bernards, R. (2002). A gene-expression signature as a predictor of survival in breast cancer. The New England journal of medicine, 347(25), 1999-2009.
[7] Eschrich, S., Yang, I., Bloom, G., Kwong, K. Y., Boulware, D., Cantor, A., Coppola, D., Kruhoffer, M., Aaltonen, L., Orntoft, T. F., Quackenbush, J. & Yeatman, T. J. (2005). Molecular staging for survival prediction of colorectal cancer patients. J Clin Oncol, 23(15), 3526-35.
[8] Barrier, A., Boelle, P. Y., Roser, F., Gregg, J., Tse, C., Brault, D., Lacaine, F., Houry, S., Huguier, M., Franc, B., Flahault, A., Lemoine, A. & Dudoit, S. (2006). Stage II colon cancer prognosis prediction by tumor gene expression profiling. J Clin Oncol, 24(29), 4685-91.
[9] Pomeroy, S. L., Tamayo, P., Gaasenbeek, M., Sturla, L. M., Angelo, M., McLaughlin, M. E., Kim, J. Y., Goumnerova, L. C., Black, P. M., Lau, C., Allen, J. C., Zagzag, D., Olson, J. M., Curran, T., Wetmore, C., Biegel, J. A., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D. N., Mesirov, J. P., Lander, E. S. & Golub, T. R. (2002). Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415(6870), 436-42.
[10] Nutt, C. L., Mani, D. R., Betensky, R. A., Tamayo, P., Cairncross, J. G., Ladd, C., Pohl, U., Hartmann, C., McLaughlin, M. E., Batchelor, T. T., Black, P. M., von Deimling, A., Pomeroy, S. L., Golub, T. R. & Louis, D. N. (2003). Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer research, 63(7), 1602-7.
[11] Cromer, A., Carles, A., Millon, R., Ganguli, G., Chalmel, F., Lemaire, F., Young, J., Dembele, D., Thibault, C., Muller, D., Poch, O., Abecassis, J. & Wasylyk, B. (2004). Identification of genes associated with tumorigenesis and metastatic potential of hypopharyngeal cancer by microarray analysis. Oncogene, 23(14), 2484-98.
[12] Vasselli, J. R., Shih, J. H., Iyengar, S. R., Maranchie, J., Riss, J., Worrell, R., Torres-Cabala, C., Tabios, R., Mariotti, A., Stearman, R., Merino, M., Walther, M. M., Simon, R., Klausner, R. D. & Linehan, W. M. (2003). Predicting survival in patients with metastatic kidney cancer by gene-expression profiling in the primary tumor. Proceedings of the National Academy of Sciences of the United States of America, 100(12), 6958-63.
[13] Yang, X. J., Tan, M. H., Kim, H. L., Ditlev, J. A., Betten, M. W., Png, C. E., Kort, E. J., Futami, K., Furge, K. A., Takahashi, M., Kanayama, H. O., Tan, P. H., Teh, B. S., Luan, C., Wang, K., Pins, M., Tretiakova, M., Anema, J., Kahnoski, R., Nicol, T., Stadler, W., Vogelzang, N. G., Amato, R., Seligson, D., Figlin, R., Belldegrun, A., Rogers, C. G. & Teh, B. T. (2005). A molecular classification of papillary renal cell carcinoma. Cancer research, 65(13), 5628-37.
[14] Yagi, T., Morimoto, A., Eguchi, M., Hibi, S., Sako, M., Ishii, E., Mizutani, S., Imashuku, S., Ohki, M. & Ichikawa, H. (2003). Identification of a gene expression signature associated with pediatric AML prognosis. Blood, 102(5), 1849-56.
[15] Bullinger, L., Dohner, K., Bair, E., Frohling, S., Schlenk, R. F., Tibshirani, R., Dohner, H. & Pollack, J. R. (2004). Use of gene-expression profiling to identify prognostic subclasses in adult acute myeloid leukemia. The New England journal of medicine, 350(16), 1605-16.
[16] Lee, J. S., Chu, I. S., Heo, J., Calvisi, D. F., Sun, Z., Roskams, T., Durnez, A., Demetris, A. J. & Thorgeirsson, S. S. (2004). Classification and prediction of survival in hepatocellular carcinoma by gene expression profiling. Hepatology, 40(3), 667-76.
[17] Garber, M. E., Troyanskaya, O. G., Schluens, K., Petersen, S., Thaesler, Z., Pacyna-Gengelbach, M., van de Rijn, M., Rosen, G. D., Perou, C. M., Whyte, R. I., Altman, R. B., Brown, P. O., Botstein, D. & Petersen, I. (2001). Diversity of gene expression in adenocarcinoma of the lung. Proceedings of the National Academy of Sciences of the United States of America, 98(24), 13784-9.
[18] Beer, D. G., Kardia, S. L., Huang, C. C., Giordano, T. J., Levin, A. M., Misek, D. E., Lin, L., Chen, G., Gharib, T. G., Thomas, D. G., Lizyness, M. L., Kuick, R., Hayasaka, S., Taylor, J. M., Iannettoni, M. D., Orringer, M. B. & Hanash, S. (2002). Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nature medicine, 8(8), 816-24.
[19] Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. C., Sabet, H., Tran, T., Yu, X., Powell, J. I., Yang, L., Marti, G. E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D. B., Tibshirani, R., Sherlock, G., Chan, W. C., Greiner, T. C., Weisenburger, D. D., Armitage, J. O., Warnke, R., Levy, R., Wilson, W., Grever, M. R., Byrd, J. C., Botstein, D., Brown, P. O. & Staudt, L. M. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403(6769), 503-11.
[20] Rosenwald, A., Wright, G., Chan, W. C., Connors, J. M., Campo, E., Fisher, R. I., Gascoyne, R. D., Muller-Hermelink, H. K., Smeland, E. B., Giltnane, J. M., Hurt, E. M., Zhao, H., Averett, L., Yang, L., Wilson, W. H., Jaffe, E. S., Simon, R., Klausner, R. D., Powell, J., Duffey, P. L., Longo, D. L., Greiner, T. C., Weisenburger, D. D., Sanger, W. G., Dave, B. J., Lynch, J. C., Vose, J., Armitage, J. O., Montserrat, E., Lopez-Guillermo, A., Grogan, T. M., Miller, T. P., LeBlanc, M., Ott, G., Kvaloy, S., Delabie, J., Holte, H., Krajci, P., Stokke, T. & Staudt, L. M. (2002). The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. The New England journal of medicine, 346(25), 1937-47.
[21] LaTulippe, E., Satagopan, J., Smith, A., Scher, H., Scardino, P., Reuter, V. & Gerald, W. L. (2002). Comprehensive gene expression analysis of prostate cancer reveals distinct transcriptional programs associated with metastatic disease. Cancer research, 62(15), 4499-506.
[22] Chen, X., Leung, S. Y., Yuen, S. T., Chu, K. M., Ji, J., Li, R., Chan, A. S., Law, S., Troyanskaya, O. G., Wong, J., So, S., Botstein, D. & Brown, P. O. (2003). Variation in gene expression patterns in human gastric cancers. Molecular biology of the cell, 14(8), 3208-15.
[23] Bueno-de-Mesquita, J. M., van Harten, W. H., Retel, V. P., van't Veer, L. J., van Dam, F. S., Karsenberg, K., Douma, K. F., van Tinteren, H., Peterse, J. L., Wesseling, J., Wu, T. S., Atsma, D., Rutgers, E. J., Brink, G., Floore, A. N., Glas, A. M., Roumen, R. M., Bellot, F. E., van Krimpen, C., Rodenhuis, S., van de Vijver, M. J. & Linn, S. C. (2007). Use of 70-gene signature to predict prognosis of patients with node-negative breast cancer: a prospective community-based feasibility study (RASTER). Lancet Oncol, 8(12), 1079-87.
[24] Cardoso, F., Van't Veer, L., Rutgers, E., Loi, S., Mook, S. & Piccart-Gebhart, M. J. (2008). Clinical application of the 70-gene profile: the MINDACT trial. J Clin Oncol, 26(5), 729-35.
[25] Mook, S., Van't Veer, L. J., Rutgers, E. J., Piccart-Gebhart, M. J. & Cardoso, F. (2007). Individualization of therapy using Mammaprint: from development to the MINDACT Trial. Cancer Genomics Proteomics, 4(3), 147-55.
[26] Bogaerts, J., Cardoso, F., Buyse, M., Braga, S., Loi, S., Harrison, J. A., Bines, J., Mook, S., Decker, N., Ravdin, P., Therasse, P., Rutgers, E., van 't Veer, L. J. & Piccart, M. (2006). Gene signature evaluation as a prognostic tool: challenges in the design of the MINDACT trial. Nat Clin Pract Oncol, 3(10), 540-51.
[27] Eng-Wong, J. & Zujewski, J. A. (2008). Current NCI-sponsored Cooperative Group trials of endocrine therapies in breast cancer. Cancer, 112(3 Suppl), 723-9.
[28] Michiels, S., Koscielny, S. & Hill, C. (2005). Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet, 365(9458), 488-92.
[29] Ioannidis, J. P. (2005). Microarrays and molecular research: noise discovery? Lancet, 365(9458), 454-5.
[30] Ein-Dor, L., Kela, I., Getz, G., Givol, D. & Domany, E. (2005). Outcome signature genes in breast cancer: is there a unique set? Bioinformatics (Oxford, England), 21(2), 171-8.
[31] Ioannidis, J. P. (2007). Is molecular profiling ready for use in clinical decision making? The oncologist, 12(3), 301-11.
[32] Ioannidis, J. P., Polyzos, N. P. & Trikalinos, T. A. (2007). Selective discussion and transparency in microarray research findings for cancer outcomes. Eur J Cancer, 43(13), 1999-2010.
[33] Dupuy, A. & Simon, R. M. (2007). Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. Journal of the National Cancer Institute, 99(2), 147-57.
[34] Ein-Dor, L., Zuk, O. & Domany, E. (2006). Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proceedings of the National Academy of Sciences of the United States of America, 103(15), 5923-8.
[35] Pawitan, Y., Bjohle, J., Amler, L., Borg, A. L., Egyhazi, S., Hall, P., Han, X., Holmberg, L., Huang, F., Klaar, S., Liu, E. T., Miller, L., Nordgren, H., Ploner, A., Sandelin, K., Shaw, P. M., Smeds, J., Skoog, L., Wedren, S. & Bergh, J. (2005). Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Res, 7(6), R953-64.
[36] Somorjai, R. L., Dolenko, B. & Baumgartner, R. (2003). Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions. Bioinformatics (Oxford, England), 19(12), 1484-91.
[37] Ransohoff, D. F. (2005). Bias as a threat to the validity of cancer molecular-marker research. Nature reviews, 5(2), 142-9.
[38] Simon, R., Radmacher, M. D., Dobbin, K. & McShane, L. M. (2003). Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. Journal of the National Cancer Institute, 95(1), 14-8.
[39] Simon, R. (2006). A checklist for evaluating reports of expression profiling for treatment selection. Clin Adv Hematol Oncol, 4(3), 219-24.
[40] Pepe, M. S., Janes, H., Longton, G., Leisenring, W. & Newcomb, P. (2004). Limitations of the odds ratio in gauging the performance of a diagnostic, prognostic, or screening marker. American journal of epidemiology, 159(9), 882-90.
[41] Pepe, M. S. (2005). Evaluating technologies for classification and prediction in medicine. Statistics in medicine, 24(24), 3687-96.
[42] Kattan, M. W. (2003). Judging new markers by their ability to improve predictive accuracy. Journal of the National Cancer Institute, 95(9), 634-5.
[43] Eden, P., Ritz, C., Rose, C., Ferno, M. & Peterson, C. (2004). "Good Old" clinical markers have similar power in breast cancer prognosis as microarray gene expression profilers. Eur J Cancer, 40(12), 1837-41.
[44] Liu, E. T. (2005). Mechanism-derived gene expression signatures and predictive biomarkers in clinical oncology. Proceedings of the National Academy of Sciences of the United States of America, 102(10), 3531-2.
[45] Chang, H. Y., Nuyten, D. S., Sneddon, J. B., Hastie, T., Tibshirani, R., Sorlie, T., Dai, H., He, Y. D., van't Veer, L. J., Bartelink, H., van de Rijn, M., Brown, P. O. & van de Vijver, M. J. (2005). Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival. Proceedings of the National Academy of Sciences of the United States of America, 102(10), 3738-43.
[46] Chang, H. Y., Sneddon, J. B., Alizadeh, A. A., Sood, R., West, R. B., Montgomery, K., Chi, J. T., van de Rijn, M., Botstein, D. & Brown, P. O. (2004). Gene expression signature of fibroblast serum response predicts human cancer progression: similarities between tumors and wounds. PLoS biology, 2(2), E7.
[47] Bild, A. H., Yao, G., Chang, J. T., Wang, Q., Potti, A., Chasse, D., Joshi, M. B., Harpole, D., Lancaster, J. M., Berchuck, A., Olson, J. A., Jr., Marks, J. R., Dressman, H. K., West, M. & Nevins, J. R. (2006). Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature, 439(7074), 353-7.
[48] Bild, A. H., Potti, A. & Nevins, J. R. (2006). Linking oncogenic pathways with therapeutic opportunities. Nature reviews, 6(9), 735-41.
[49] Oh, D. S., Troester, M. A., Usary, J., Hu, Z., He, X., Fan, C., Wu, J., Carey, L. A. & Perou, C. M. (2006). Estrogen-regulated genes predict survival in hormone receptor-positive breast cancers. J Clin Oncol, 24(11), 1656-64.
[50] Lee, J. S., Chu, I. S., Mikaelyan, A., Calvisi, D. F., Heo, J., Reddy, J. K. & Thorgeirsson, S. S. (2004). Application of comparative functional genomics to identify best-fit mouse models to study human cancer. Nature genetics, 36(12), 1306-11.
[51] Lee, J. S., Heo, J., Libbrecht, L., Chu, I. S., Kaposi-Novak, P., Calvisi, D. F., Mikaelyan, A., Roberts, L. R., Demetris, A. J., Sun, Z., Nevens, F., Roskams, T. & Thorgeirsson, S. S. (2006). A novel prognostic subtype of human hepatocellular carcinoma derived from hepatic progenitor cells. Nature medicine, 12(4), 410-6.
[52] Sweet-Cordero, A., Mukherjee, S., Subramanian, A., You, H., Roix, J. J., Ladd-Acosta, C., Mesirov, J., Golub, T. R. & Jacks, T. (2005). An oncogenic KRAS2 expression signature identified by cross-species gene-expression analysis. Nature genetics, 37(1), 48-55.
[53] Kim, S. Y. & Volsky, D. J. (2005). PAGE: parametric analysis of gene set enrichment. BMC bioinformatics, 6, 144.
[54] Nam, D. & Kim, S. Y. (2008). Gene-set approach for expression pattern analysis. Brief Bioinform, 9(3), 189-97.
[55] Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A., Pomeroy, S. L., Golub, T. R., Lander, E. S. & Mesirov, J. P. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545-50.
[56] Kim, S. Y. & Kim, Y. S. (2008). A gene sets approach for identifying prognostic gene signatures for outcome prediction. BMC genomics, 9(1), 177.
[57] Cheadle, C., Becker, K. G., Cho-Chung, Y. S., Nesterova, M., Watkins, T., Wood, W., 3rd, Prabhu, V. & Barnes, K. C. (2007). A rapid method for microarray cross platform comparisons using gene expression signatures. Mol Cell Probes, 21(1), 35-46.
[58] Pang, H., Lin, A., Holford, M., Enerson, B. E., Lu, B., Lawton, M. P., Floyd, E. & Zhao, H. (2006). Pathway analysis using random forests classification and regression. Bioinformatics (Oxford, England), 22(16), 2028-36.
[59] Chuang, H. Y., Lee, E., Liu, Y. T., Lee, D. & Ideker, T. (2007). Network-based classification of breast cancer metastasis. Mol Syst Biol, 3, 140.
[60] Lau, J., Ioannidis, J. P. & Schmid, C. H. (1997). Quantitative synthesis in systematic reviews. Annals of internal medicine, 127(9), 820-6.
[61] Trikalinos, T. A., Salanti, G., Zintzaras, E. & Ioannidis, J. P. (2008). Meta-analysis methods. Advances in genetics, 60, 311-34.
[62] Zhang, Z., Chen, D. & Fenstermacher, D. A. (2007). Integrated analysis of independent gene expression microarray datasets improves the predictability of breast cancer outcome. BMC genomics, 8, 331.
[63] Choi, J. K., Yu, U., Kim, S. & Yoo, O. J. (2003). Combining multiple microarray studies and modeling interstudy variation. Bioinformatics (Oxford, England), 19 Suppl 1, i84-90.
[64] Shen, R., Ghosh, D. & Chinnaiyan, A. M. (2004). Prognostic meta-signature of breast cancer developed by two-stage mixture modeling of microarray data. BMC genomics, 5(1), 94.
[65] Choi, H., Shen, R., Chinnaiyan, A. M. & Ghosh, D. (2007). A latent variable approach for meta-analysis of gene expression data from multiple microarray experiments. BMC bioinformatics, 8, 364.
[66] Warnat, P., Oberthuer, A., Fischer, M., Westermann, F., Eils, R. & Brors, B. (2007). Cross-study analysis of gene expression data for intermediate neuroblastoma identifies two biological subtypes. BMC cancer, 7, 89.
[67] Hong, F. & Breitling, R. (2008). A comparison of meta-analysis methods for detecting differentially expressed genes in microarray experiments. Bioinformatics (Oxford, England), 24(3), 374-82.
[68] West, M., Ginsburg, G. S., Huang, A. T. & Nevins, J. R. (2006). Embracing the complexity of genomic data for personalized medicine. Genome Res, 16(5), 559-66.
[69] Pittman, J., Huang, E., Dressman, H., Horng, C. F., Cheng, S. H., Tsou, M. H., Chen, C. M., Bild, A., Iversen, E. S., Huang, A. T., Nevins, J. R. & West, M. (2004). Integrated modeling of clinical and gene expression information for personalized prediction of disease outcomes. Proceedings of the National Academy of Sciences of the United States of America, 101(22), 8431-6.
[70] Gevaert, O., De Smet, F., Timmerman, D., Moreau, Y. & De Moor, B. (2006). Predicting the prognosis of breast cancer by integrating clinical and microarray data with Bayesian networks. Bioinformatics (Oxford, England), 22(14), e184-90.
[71] Gevaert, O., Van Vooren, S. & de Moor, B. (2008). Integration of microarray and textual data improves the prognosis prediction of breast, lung and ovarian cancer patients. Pac Symp Biocomput, 279-90.
[72] Li, L. (2006). Survival prediction of diffuse large-B-cell lymphoma based on both clinical and gene expression information. Bioinformatics (Oxford, England), 22(4), 466-71.
[73] Sun, Y., Goodison, S., Li, J., Liu, L. & Farmerie, W. (2007). Improved breast cancer prognosis through the combination of clinical and genetic markers. Bioinformatics (Oxford, England), 23(1), 30-7.
[74] Perou, C. M., Jeffrey, S. S., van de Rijn, M., Rees, C. A., Eisen, M. B., Ross, D. T., Pergamenschikov, A., Williams, C. F., Zhu, S. X., Lee, J. C., Lashkari, D., Shalon, D., Brown, P. O. & Botstein, D. (1999). Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proceedings of the National Academy of Sciences of the United States of America, 96(16), 9212-7.
[75] Perou, C. M., Sorlie, T., Eisen, M. B., van de Rijn, M., Jeffrey, S. S., Rees, C. A., Pollack, J. R., Ross, D. T., Johnsen, H., Akslen, L. A., Fluge, O., Pergamenschikov, A., Williams, C., Zhu, S. X., Lonning, P. E., Borresen-Dale, A. L., Brown, P. O. & Botstein, D. (2000). Molecular portraits of human breast tumours. Nature, 406(6797), 747-52.
[76] Sorlie, T., Tibshirani, R., Parker, J., Hastie, T., Marron, J. S., Nobel, A., Deng, S., Johnsen, H., Pesich, R., Geisler, S., Demeter, J., Perou, C. M., Lonning, P. E., Brown, P. O., Borresen-Dale, A. L. & Botstein, D. (2003). Repeated observation of breast tumor subtypes in independent gene expression data sets. Proceedings of the National Academy of Sciences of the United States of America, 100(14), 8418-23.
[77] Sorlie, T., Perou, C. M., Tibshirani, R., Aas, T., Geisler, S., Johnsen, H., Hastie, T., Eisen, M. B., van de Rijn, M., Jeffrey, S. S., Thorsen, T., Quist, H., Matese, J. C., Brown, P. O., Botstein, D., Eystein Lonning, P. & Borresen-Dale, A. L. (2001). Gene
expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences of the United States of America, 98(19), 10869-74.
[78] Rouzier, R., Perou, C. M., Symmans, W. F., Ibrahim, N., Cristofanilli, M., Anderson, K., Hess, K. R., Stec, J., Ayers, M., Wagner, P., Morandi, P., Fan, C., Rabiul, I., Ross, J. S., Hortobagyi, G. N. & Pusztai, L. (2005). Breast cancer molecular subtypes respond differently to preoperative chemotherapy. Clin Cancer Res, 11(16), 5678-85.
[79] Teschendorff, A. E., Naderi, A., Barbosa-Morais, N. L., Pinder, S. E., Ellis, I. O., Aparicio, S., Brenton, J. D. & Caldas, C. (2006). A consensus prognostic gene expression classifier for ER positive breast cancer. Genome biology, 7(10), R101.
[80] Teschendorff, A. E., Miremadi, A., Pinder, S. E., Ellis, I. O. & Caldas, C. (2007). An immune response gene expression module identifies a good prognosis subtype in estrogen receptor negative breast cancer. Genome biology, 8(8), R157.
In: Computational Biology: New Research Editor: Alona S. Russe
ISBN: 978-1-60692-040-4 © 2009 Nova Science Publishers, Inc.
Chapter 11
COMPARISON OF Φ-VALUES AND FOLDING TIME PREDICTIONS BY USING MONTE-CARLO AND DYNAMIC PROGRAMMING APPROACHES Oxana V. Galzitskaya∗ and Sergiy O. Garbuzynskiy Institute of Protein Research, Russian Academy of Sciences, Pushchino, Moscow Region, Russian Federation, 142290
Abstract

We calculate folding times and explore the transition state ensembles of ten proteins with known experimental data at the point of thermodynamic equilibrium between the unfolded and the native state, using a Monte Carlo Gō model and a Dynamic Programming approach in which each residue is considered either folded as in the native state or completely disordered. The order of events in the folding simulations has been explored in detail for each of the proteins. The folding times for the ten proteins that reach the native state within a limit of 10^8 Monte Carlo steps are in good correlation with the experimentally measured folding times at the mid-transition point (the correlation coefficient is 0.71). A lower correlation was obtained with the Dynamic Programming approach (the correlation coefficient is 0.53). Moreover, Φ-values calculated from the Monte Carlo simulations for the ten proteins correlate with experimental data (the correlation coefficient is 0.41) at practically the same level as Φ-values calculated with the Dynamic Programming approach (the correlation coefficient is 0.48). The model provides good predictions of folding nuclei for proteins whose 3D structures have been determined by X-ray crystallography, and exhibits more limited success for proteins whose structures have been determined by NMR.
∗ To whom correspondence should be addressed. E-mail: [email protected]

Introduction

Folding nucleus from experiment and theory

The progress in the understanding of protein folding achieved in the 1990s (Fersht, 1995; Dobson and Karplus, 1999) came from the investigation of "simple" proteins: without accumulation of any intermediates on the folding pathways, without cis-trans proline
isomerization, and without S-S bond formation. The folding (and unfolding) kinetics look very simple in this case: all the properties of the native (or denatured) protein are restored synchronously, following single-exponential kinetics (Kragelund et al., 1995). For some proteins, this simplicity is observed in a wide range of conditions, including denaturant-free water ("biological" conditions in Fig. 1), the zone of the reversible thermodynamic transition between the two phases (the native and the denatured state), and the unfolding zone; these proteins have been termed "two-state proteins". For the other, "multi-state" proteins, two-state folding occurs only in the transition zone, if at all, while the unfolding proceeds in a two-state manner. Usually, the complicated folding demonstrates three phases, and the corresponding proteins have been termed "three-state proteins" (Fersht, 1995; Dobson and Karplus, 1999; Kragelund et al., 1995). Thus, the most universal features of folding (and unfolding) can be observed just in and around the transition zone, while moving this zone towards the "biological" conditions reveals the individualities of various proteins (which are "unnecessary complications" when one tries to understand the basics of protein folding).
Figure 1. Folding nucleus identification using site-directed mutations (a scheme). (a) Mutation of a residue having its native environment and conformation (i.e., its native interactions) already in the transition state TS changes the mutant's folding rate rather than its unfolding rate. (b) Mutation of a residue that remains denatured in the TS has the opposite effect. "Wild type" means the non-mutated protein. kapp = kf + ku, where kf is the folding rate and ku is the unfolding rate; thus, kapp ≈ kf in the folding zone (where kf » ku), kapp ≈ ku in the unfolding zone (where kf « ku), and kf ≈ ku ≈ kapp/2 at the mid-transition (Matouschek et al., 1990). The extrapolations necessary for Φ-value analysis are drawn as dotted lines to zero denaturant concentration.
The transition state corresponds to the free energy maximum on the folding/unfolding pathway, or, better to say, to the free energy saddle point on the network of these pathways (see Fig. 3 below). The folded part of the transition state is called the "folding nucleus", and the folding pathway via formation of a nucleus (which usually consists of amino acid residues remote in the protein chain (Abkevich et al., 1994a; Itzhaki et al., 1995)) has been termed the "nucleation-condensation" mechanism of folding. The folding nucleus, being the folded part of the transition state, plays a key role in protein folding: its instability determines the folding and unfolding rates. It should be stressed that the folding nucleus corresponds to the free energy maximum. It has been shown that the nucleus looks like a part of the 3D structure of the native protein
(Matouschek et al., 1989; Matouschek et al., 1990) which is often surrounded by some unstructured, probably molten globule-like drop. So far, there is only one (unfortunately, only one and very laborious) experimental method to identify folding nuclei in proteins: to find residues whose mutations affect the folding rate by changing the TS stability as strongly as that of the native protein (Fig. 1). For the basics of this method and pioneering works see (Matouschek et al., 1989; Fersht et al., 1992; Leffler and Grunwald, 1963; Matthews, 1987; Goldenberg et al., 1989). The participation of a residue in the folding nucleus is expressed by the residue's Φ value. For a given residue, its Φ is defined as

Φ = Δln kf / Δln K,    (1)
where kf is the folding rate constant, K = kf/ku is the folding-unfolding equilibrium constant, and Δ means the shift of the corresponding value induced by mutation of this residue. According to the model of a native-like folding nucleus (Matouschek et al., 1989; Matouschek et al., 1990), Φ = 1 means that the residue has its native conformation and environment already in the transition state (i.e., that this residue is in the folding nucleus), while Φ = 0 means that the residue remains unfolded in the TS. The values Φ ≈ 0.5 are ambiguous: either the residue is at the surface of the nucleus, or it is in one of the alternative nuclei belonging to different folding pathways. It is noteworthy that the values Φ < 0 and Φ > 1 (which would be inconsistent with the model of a native-like folding nucleus) are extremely rare and never concern a residue with a reliably measured ΔlnK. To estimate Φ, the rates kf and ku have to be measured at (or extrapolated to) the same conditions. Usually, being interested in the "biologically relevant" nucleus, one extrapolates them to zero denaturant concentration. However, it should be noted that the nucleus corresponding to the protein's mid-transition is outlined more reliably: here the extrapolation is shorter and therefore more robust, especially when the branches of the chevron are curved; the latter suggests a change of the nucleus with the folding conditions (Otzen et al., 1999).

The major assumptions underlying the Φ-analysis of the folding nucleus by point mutations (Matouschek et al., 1989) are that the mutations do not substantially change the folding pathway, the nucleus, the structure of the folded state, or the unfolded state ensemble. Experimentally, this usually proves to be correct when the mutated residue is not larger than the initial one, and when the mutation does not introduce charges inside the globule; the proof is done by double mutations (Fersht et al., 1992). However, some strong mutations can significantly affect the distribution of structures in the TS ensemble (Burton et al., 1997).

Several other observations have been made: (1) The TS-stabilizing contacts are very diverse. In some proteins the nucleus is stabilized by hydrophobic interactions (Itzhaki et al., 1995; Fulton et al., 1999; Kragelund et al., 1999), in some it includes hydrogen bonds and salt bridges (Lopez-Hernandez and Serrano, 1996; Grantcharova et al., 1998). (2) The position of the nucleus relative to the whole protein structure is very diverse. In some proteins it is situated in the centre, in the hydrophobic core (Itzhaki et al., 1995; Kragelund et al., 1999; Chiti et al., 1999), in some it is on the boundary of the globule (Kragelund et al., 1999; Lopez-Hernandez and Serrano, 1996; Grantcharova et al., 1998). (3) The accessible surface areas of the nuclei are also rather different (Jackson, 1998). (4) Proteins with different amino acid sequences but with similar three-dimensional structures have similar folding nuclei as a rule (Martinez and Serrano, 1999; Riddle et al., 1999; Perl et al., 1998). However, there are several examples which show that this is not always so (Galzitskaya, 2002; see the examples below, Fig. 2).
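Equation (1), together with the extrapolation procedure sketched in Fig. 1, can be illustrated numerically. The following is a minimal sketch with invented rate values; a real chevron analysis must also handle curved branches and measurement error.

import numpy as np

def extrapolate_ln_k(denaturant, ln_k):
    # linear fit of ln k versus denaturant concentration; the intercept is
    # ln k extrapolated to zero denaturant (dotted lines in Fig. 1)
    slope, intercept = np.polyfit(denaturant, ln_k, 1)
    return intercept

def phi_value(ln_kf_wt, ln_ku_wt, ln_kf_mut, ln_ku_mut):
    # Phi = Delta ln kf / Delta ln K, with K = kf/ku (Eq. 1); all rates
    # must refer to the same (here: extrapolated) conditions
    d_ln_kf = ln_kf_mut - ln_kf_wt
    d_ln_K = (ln_kf_mut - ln_ku_mut) - (ln_kf_wt - ln_ku_wt)
    return d_ln_kf / d_ln_K

# invented mutant that slows folding 10-fold but leaves unfolding unchanged:
# the residue behaves as part of the nucleus (Phi = 1)
print(phi_value(np.log(100.0), np.log(0.1), np.log(10.0), np.log(0.1)))

Conversely, a mutation that accelerates unfolding without touching the folding branch gives Φ = 0, marking a residue that is still disordered in the TS.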
Summing up the experimental data, Grantcharova et al. (2001) conclude that mutations, both artificial and natural, can radically change folding pathways (create and destroy folding intermediates, transforming two-state into multi-state folding proteins and vice versa, shift the folding nuclei to the opposite side of the molecule, etc.) without any considerable variation of the three-dimensional structures of the native proteins (Grantcharova et al., 2001). This means that the native structure is subject to much more severe natural selection than the folding nucleus or the folding pathways, at least when we speak about relatively small proteins, which usually fold much faster than they are synthesized by a ribosome anyway.

As regards the theoretical search for folding/unfolding nuclei in proteins, several different approaches have been suggested. The most direct approach is to generate a plausible transition state for unfolding (which must coincide with that for folding close to the mid-transition) using all-atom molecular dynamics simulations of protein unfolding (Li and Daggett, 1996; Caflisch and Karplus, 1995; Brooks et al., 1998). According to these simulations, performed for a few very small proteins at highly denaturing conditions (otherwise, the calculation takes too long), the unfolding is hierarchic (Lazaridis and Karplus, 1997; Tsai et al., 1999; Daggett and Fersht, 2003) (at least when it occurs far from the equilibrium): tertiary interactions break early, whereas secondary structures remain for a longer time. The repeated trajectories show a statistical distribution around the experimentally found transition states and demonstrate a broad ensemble of transition state (TS) structures. However, these simulations usually need extremely denaturing conditions (500 K, etc.) to be completed. Therefore, the transition states found for such extreme unfolding can be, in principle, rather different from those existing for folding (Finkelstein, 1997). Recently, however, some molecular dynamics simulations of the unfolding of very small proteins (Mayor et al., 2000; Ferguson et al., 2001; Mayor et al., 2003) have been performed at more realistic, although also highly denaturing, conditions. They have been performed at temperatures accessible to "wet" experiments (350 K), as well as to simulations on current supercomputers. They gave TS structures which are consistent with experiment (Mayor et al., 2003); however, these simulations take enormous time and can be performed for very small proteins only.
Figure 2. (a) Profiles of experimental Φ-values obtained for the B1 domains of protein G (filled circles) and of protein L (open circles). Schemes of the three-dimensional structures of these domains are colored according to the Φ-values of the amino acid residues, from white (Φ = 0) to black (Φ = 1). The experimentally studied residues are shown as beads against the background of the native chain fold; Φ-values are given for them only. Adapted from (Galzitskaya, 2002). Although the sequence identity of the B1 domains of proteins G and L is as low as 15% (McCallister et al., 2000), the RMSD between the Cα atoms of these two structures after superposition is 1.35 Å, indicating that the 3D structures of these domains are similar. Nevertheless, their folding nuclei have different locations. (b) Profiles of Φ-values obtained from experiments on SH3 domains. Investigated residues are shown by filled circles for α-spectrin, by open circles for src-kinase, and by filled triangles for the Sso7d protein. Schemes of the three-dimensional structures of these proteins are drawn according to the Φ-values of the amino acid residues, from white (Φ = 0) to black (Φ = 1). Triangles on the structure correspond to residues with Φ < 0 and Φ > 1.
Further progress is due to the analysis of multidimensional networks of protein folding-unfolding trajectories by various algorithms (Galzitskaya and Finkelstein, 1999; Alm and Baker, 1999). All these approaches (Galzitskaya and Finkelstein, 1999; Alm and Baker, 1999; Muñoz and Eaton, 1999) use different approximations and algorithms, consider only the attractive native interactions (the "Gō model", Taketomi et al., 1975) to reduce the energy frustrations and heterogeneity of interactions, and model the trade-off between the formation of attractive interactions and the loss of conformational entropy during protein folding. These works also simulate the unfolding of known 3D protein structures rather than their folding, but the unfolding is considered close to the mid-transition point, where folding and unfolding pathways coincide according to the detailed balance principle. Under these "near-equilibrium" conditions, all single-domain proteins demonstrate two-state (i.e., "all-or-none") transitions both in thermodynamics (Privalov, 1979) and in kinetics (Fersht, 1995; 1997). This means that at the mid-transition all semi-folded and misfolded globules are unstable relative to both the native and the unfolded states of the protein chain, and this allows us to take into account only the pathways going from the native to the unfolded state and to neglect those leading to misfolded globules stabilized by non-native interactions. These works allowed the authors to outline the folding nuclei. Despite the relative simplicity of these models, they give a promising (~50%) correlation with experimental Φ-values (Baker, 2000; Takada, 1999; Alm et al., 2002; Garbuzynskiy et al., 2004). This suggests that the chain's folding pattern and the size of the protein, taken into account by these models, play a more important role in folding than the high-resolution details of the protein structure (Alm and Baker, 1999; Finkelstein and Badretdinov, 1997; Clementi et al., 2000).

Some progress has been made using experimental constraints to obtain the folding nucleus at atomic resolution (or rather, to visualize a possible shape of the folding nucleus that is consistent with the available, but sparse, experimental data). Vendruscolo and co-authors reconstructed the putative transition state ensemble for acylphosphatase using experimental Φ-values as constraints in high-temperature unfolding simulations (Vendruscolo et al., 2001; Paci et al., 2002). However, they did not test whether the proposed conformations represent the set of conformations for which the transmission coefficient to the folded state is equal to 0.5 (Galzitskaya and Finkelstein, 1998).
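The class of models just described, in which each residue is either native-like or disordered, native contacts attract, and ordering a residue costs entropy, admits a compact Monte Carlo sketch. The following is a generic illustration of the model class, not the specific algorithm of any cited work; the energy and entropy parameters are placeholders, and real implementations usually restrict the allowed states (e.g., to a limited number of contiguous native-like segments) to keep the network of pathways tractable.

import numpy as np

def metropolis_go(native_contacts, n_res, n_steps=100000,
                  eps=-1.25, entropy_per_res=1.0, T=1.0, seed=None):
    # native_contacts: list of residue index pairs (i, j) in contact in the
    # native structure; state[i] = 1 if residue i is folded, 0 if disordered
    rng = np.random.default_rng(seed)
    state = np.zeros(n_res, dtype=np.int8)            # start fully unfolded

    def free_energy(s):
        # native contacts between two folded residues contribute eps < 0;
        # each folded residue pays the entropy it loses upon ordering
        e = eps * sum(1 for i, j in native_contacts if s[i] and s[j])
        return e + T * entropy_per_res * int(s.sum())

    g = free_energy(state)
    for step in range(n_steps):
        i = rng.integers(n_res)
        state[i] ^= 1                                 # propose flipping one residue
        g_new = free_energy(state)
        if g_new <= g or rng.random() < np.exp(-(g_new - g) / T):
            g = g_new                                 # accept the move
        else:
            state[i] ^= 1                             # reject: flip back
        if state.all():
            return step + 1                           # first passage to native
    return None                                       # did not fold in n_steps

The number of steps to first reach the all-folded state plays the role of the folding time, and the states visited near the free energy maximum approximate the TS ensemble from which model Φ-values can be read off.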
Sensitivity of folding pathway to the details of amino-acid sequence

One of the important questions in molecular biology is what determines folding pathways: the native structure or the protein sequence. There are many proteins that have similar structures but very different sequences, and a relevant question is whether such proteins have similar or different folding mechanisms. Comparison of proteins having similar native topologies is an important approach for elucidating fundamental aspects of the protein folding process. Experimental evidence regarding folding nucleus structure shows that proteins similar in three-dimensional (3D) structure have, as a rule, similar folding nuclei (Martinez and Serrano, 1999; Riddle et al., 1999; Perl et al., 1998). However, there are several exceptions, indicating that folding pathways are sensitive to some features of the amino acid sequence (Martinez and Serrano, 1999; Riddle et al., 1999).
Proteins with the ferredoxin-like fold

One of the possibilities to study the influence of sequence details on the folding process is to consider two closely related proteins with similar topology from the same family. Thus, the transition states of four proteins with the ferredoxin-like fold have been characterized and experimentally studied: AcP, Ada2h, U1A, and S6 (two helices packed against a β-sheet with five or four strands). These proteins have a symmetrical arrangement of secondary structure elements, which is broken by the connection of these elements in the chain. The TS of the proteins Ada2h (the human activation domain of procarboxypeptidase A2) and AcP (acylphosphatase) are similar in structure despite the low sequence similarity (13%) and the different lengths of the secondary structure elements (Taddei et al., 2000; Chiti et al., 1999). For both proteins, the second α-helix and the inner strands are more structured than the rest of the protein structure. An alternative nucleus, which includes the first α-helix, has been found for protein U1A, and another nucleus, involving both α-helices, for protein S6 (Ternstrom et al., 1999). At the same time, the folding rates of Ada2h and AcP differ by three orders of magnitude. The authors explain this result by the difference in the relative contact order of these proteins (Chiti et al., 1999). A strong correlation is observed between the relative contact order and the logarithm of the folding rate for some proteins with similar topology (HPr and MerP) (Oliveberg et al., 2001).

Immunoglobulin-binding domains of proteins L and G

The immunoglobulin-binding domains of proteins L and G are structural homologs (see Fig. 2a), but have little sequence similarity. The α-helix is packed across a four-stranded sheet in these proteins. Of interest is the experimental fact that the symmetry of the given topology is fully broken upon folding of these proteins. So, the first (N-terminal) β-hairpin belongs to the folding nucleus of protein L, and the second (C-terminal) one to that of protein G (McCallister et al., 2000; Kim et al., 2000). Such a result can be explained by the existence of more favorable contacts in the second β-hairpin of protein G. Indeed, the isolated fragment corresponding to the second β-hairpin is stable in water solution (Blanco et al., 1994). Therefore, a region of the protein that has a high probability of forming local structure in the unfolded state can play an important role in the stabilization of the TS ensemble. Experimental data for other proteins (Yi et al., 2000; Kortemme et al., 2000; Gillespie and Shortle, 1997; Cordier-Ochsenbein et al., 1998) also show that local characteristics of the sequence can probably be important in choosing the specific folding pathway.

Proteins with the SH3 domain fold

The folding of small proteins with the SH3 domain fold (with a simple topology, see Fig. 2b) has been studied. Protein engineering and kinetic analysis of the src SH3 domain has been done by the group of Baker (Grantcharova et al., 1998), who give a detailed picture of the TS, in which the distal hairpin and the short 3₁₀-helix turn are highly ordered in the TS ensemble. Simultaneously, Martinez and Serrano (Martinez and Serrano, 1999) described the TS of the α-spectrin SH3 domain, making mutations in the same structural positions. A remarkable similarity between the TS of these proteins has been observed despite only 27% sequence identity.
Stabilizing mutations (Martinez et al., 1998) and changes in pH (Martinez and Serrano, 1999) do not change the structure of the TS of the α-spectrin SH3 domain. In the case of the src SH3 domain, stabilization of local structure by the introduction of S-S bonds and global stabilization by sodium sulfate do not change the position of the TS along the reaction coordinate (Grantcharova and Baker, 2001). Seemingly, as the authors believe, the structure
of the SH3 domain allows for a large variation in the sequence and experimental conditions without a change of the TS. Probably, the reason is that there are no alternative structural elements that could be sufficiently stabilized upon folding to become the folding nucleus. On the other hand, modifying the topology by circular permutation of the α-spectrin SH3 domain (Viguera et al., 1996) or by circularization of the src SH3 domain (Grantcharova and Baker, 2001) can significantly change the distribution of structures in the TS ensemble in favor of alternative folding. Thus, a shift in the structure of transition states can be achieved by at least two methods: by covalently linking distant elements to reduce the entropic cost of their interaction, or by introducing mutations that strongly destabilize (or stabilize) interactions in the protein. The characterization of SH3 structural analogs has shown that the TS structure is not always conserved in proteins with similar topologies. Thus, the DNA-binding protein Sso7d has a TS different from that of the src and α-spectrin SH3 domains: the n-src loop and the C-terminus (which is an α-helix in Sso7d instead of a β-strand) are structured in the TS, while the distal hairpin is only weakly ordered (Guerois and Serrano, 2000). The authors concluded that predicting the TS structure of a given protein requires taking into account not only the protein topology but also characteristics of the sequence that have not been accounted for in theoretical methods for TS prediction. From these works, one can conclude that the TS structure is conserved within an SH3 sequence superfamily rather than among SH3 analogs. If one assumes that the SH3 fold allows several alternative folding pathways, then the domination of one pathway over the others depends on the detailed structure. The authors think this may be because functional restrictions result in the appearance of conserved regions within one superfamily but not between superfamilies. It is these characteristics that can partly determine which folding pathway will be preferred for a given topology. It should be mentioned that kinetic analysis of other SH3 domains, Fyn and PI3-kinase, gives a correlation between the folding rate and the stability of the native state similar to that pointed out by Clarke et al. (1999) for immunoglobulin domains.
Prediction of protein folding rates

There is enormous diversity in protein folding behavior, from small proteins that usually fold with simple two-state kinetics to large proteins that usually fold with multi-state kinetics. Some general trends and correlations are beginning to emerge between the structural, thermodynamic and kinetic properties of proteins (Jackson, 1998; Plaxco et al., 1998; Shakhnovich, 1998; Galzitskaya et al., 2003; Kuznetsov and Rackovsky, 2004; Ivankov and Finkelstein, 2004). Until now, the differences in folding rates have been investigated much better than the differences in folding behavior, though these two aspects are closely related. The first comparison of a parameter with experimentally observed folding rates was made when it was shown that topology may be a critical determinant of two-state folding kinetics (Plaxco et al., 1998). However, topology by itself cannot explain the differences in the refolding rates of some proteins sharing the same fold (SH3 domains, cold shock proteins, fibronectin
domains, proteins of the ferredoxin fold) (Guijarro et al., 1998; Plaxco et al., 1998; Perl et al., 1998; van Nuland et al., 1998; Zerovnik et al., 1998). On the other hand, a number of basic correlations between protein size and folding rate have been suggested (Thirumalai et al., 1995; Gutin et al., 1996; Finkelstein and Badretdinov, 1997). All of them indicate that, as might be expected, the folding rate decreases with protein size, but they suggest different scaling laws for this decrease. However, statistical analysis of protein folding data shows that all the suggested scalings, from ln(L) to -L^(1/2) and -L^(2/3), correlate with the observed folding rates nearly equally well; moreover, the correlation between folding rates and protein sizes is not very high, about 60% (Gutin et al., 1996; Galzitskaya et al., 2001; Finkelstein and Galzitskaya, 2004). It has been shown, though, that protein size by itself determines the folding rates only of three-state folding proteins and fails to predict those of two-state folders (Galzitskaya et al., 2003). However, sequence length, being the major determinant of the type of folding behavior, is not sufficient to determine the folding type of a protein, since large proteins do not necessarily exhibit multi-state kinetics (for example, the variable surface antigen VlsE, which is 348 residues long, is nevertheless a two-state protein). These first attempts to explain the differences in folding rates of various proteins were a new stimulus for further efforts to find new parameters and a simple model for describing protein folding processes. It has been found that proteins with two-state and multi-state kinetics have different rate-determining amino acids: proteins with two-state kinetics are rich in F and G, while proteins with multi-state kinetics are rich in C, H, L, and R (Ma et al., 2007). Although the amino acid composition may be one of the determining factors of protein folding behavior, it gives no further explanation of why the difference in intrinsic properties leads to different folding types. As the authors noted, the amino acid composition as an indicator of protein folding type may be unable to account for the effect of a single amino acid mutation that can switch the folding type of a protein. In contrast, another work demonstrated, using a simple model, that folding rates depend only on the topology of the native state and not on the sequence composition (Voelz and Dill, 2007). One more parameter, the number of native contacts (Makarov et al., 2002), which can be predicted from the primary structure, was suggested for predicting the folding rates of small single-domain proteins that fold with simple two-state kinetics (Punta and Rost, 2005). These somewhat conflicting results demonstrate that the theory of protein folding rates requires further development. Therefore, the search for factors affecting the protein folding process continues. The capillarity model (Finkelstein and Badretdinov, 1997) gave rise to the hypothesis that protein folding rates are determined by the average "entropy capacity" (the entropy capacity of an amino acid residue is defined as the number of contacts divided by the number of degrees of freedom; this value is thus, in a sense, reciprocal to the expected melting temperature) (Galzitskaya et al., 2000).
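To make the comparison of the above scaling laws concrete, the snippet below (a minimal Python sketch, not code from this work; the function name and the use of NumPy are our own illustrative choices) computes the correlation of ln(kf) with each of the proposed size scalings for a given set of proteins:

    import numpy as np

    def scaling_correlations(lengths, ln_kf):
        """Correlate ln(kf) with each proposed size scaling.

        lengths: chain lengths L of a protein set; ln_kf: their measured
        ln folding rates. Folding slows with size, hence the minus signs.
        """
        lengths = np.asarray(lengths, dtype=float)
        ln_kf = np.asarray(ln_kf, dtype=float)
        scalings = {"ln L": np.log(lengths),
                    "L^(1/2)": lengths ** 0.5,
                    "L^(2/3)": lengths ** (2.0 / 3.0)}
        return {name: np.corrcoef(-x, ln_kf)[0, 1]
                for name, x in scalings.items()}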
It has been shown (Galzitskaya and Garbuzynskiy, 2006) that entropy capacity correlates with folding rates for α-helical proteins (correlation coefficient 0.79) and for proteins with mixed (α/β) secondary structure (correlation coefficient 0.84). Consideration of compactness specifically addresses the question of why some proteins fold more rapidly than others. Statistical analysis demonstrates that the four main structural
classes (Murzin et al., 1995) of proteins (all-α, all-β, α/β, α+β) differ from one another in a statistically significant manner with respect to the number of rotatable angles φ, ψ and χ and the average number of contacts per residue (Galzitskaya and Garbuzynskiy, 2006). On the whole, it has been shown that among proteins of the same size, α/β proteins have, on average, a greater number of contacts per residue due to their more compact (more "spherical", since a sphere has the minimal accessible surface area among geometrical bodies of equal volume) structure, rather than to tighter packing (Galzitskaya et al., 2008). From previous works, one can suggest a relationship between the number of contacts and folding rates. For 75 proteins for which both folding rates and tertiary structures are known, α-helical proteins have, on average, the fastest folding kinetics and the smallest number of contacts per residue (they are less compact than others), whereas α/β proteins have, on average, the slowest folding kinetics and the largest number of contacts (they are more compact than others) (Galzitskaya et al., 2008). An explanation is that the expected surface of the boundary between the folded and unfolded phases in the transition state (Galzitskaya et al., 2001) is larger for a more spherical protein than for a non-spherical one. Thus, the fact that α/β proteins are more spherical explains both their higher average number of contacts per residue (Galzitskaya et al., 2008) and their slower folding kinetics. In this work, we predict the folding time and the structure of the folding nucleus for proteins with known experimental data on both, and compare the predictions with the experimental results. The predicted order of events in the course of folding is also analyzed. We model the folding process by means of a relatively simple theoretical approach and use two methods (Monte-Carlo simulations and Dynamic Programming) to investigate the free-energy landscapes of protein folding. It is noteworthy that the correlation between predicted and experimental Φ-values is considerably worse than that typical of predictions of protein folding rates. The first obvious reason is that the observed Φ-values being predicted are restricted to the narrow range of 0-1 with an experimental error of ~±0.1, while the observed folding rates (determined with a relatively small experimental error) span the wide range of 10^7 s^-1 to 10^-4 s^-1. A more important reason is that the folding nucleus is not as robust to mutations (and thus to the unavoidable errors in the energy estimates used to outline it) as the 3D protein structure, and it would be strange to obtain a perfect prediction of folding nuclei with the same force fields that are still unable to predict the mutation-stable 3D native structure of a protein (Shakhnovich, 2006; Krieger et al., 2004).
Results

Monte-Carlo simulations of protein folding

To construct a theoretical pathway of protein folding, we performed Monte-Carlo simulations for 17 proteins (Garbuzynskiy et al., 2004). We travelled from the unfolded state to the native 3D structure without misfolding into other compact states. In our model, the folding pathways are treated as sequential insertion of residues from the unfolded state into their native
positions according to the 3D native structure, or as removal of residues from their native positions to the coil (Fig. 3).
Figure 3. A sketch of the network of pathways of sequential unfolding (and folding) of the native 3D protein structure (I0). IL is the coil, where all L links of the protein chain are disordered. In each of the many intermediates of the type Iν, ν chain links (shown as a dashed line) are unfolded, while the other L-ν links keep their native positions and conformations (they are shown as a solid line against the background of a dotted cloud denoting the globular part of the intermediate). The central structure in the lower line exemplifies a microstate with ν unfolded links forming one closed unfolded loop and one unfolded tail; the central structure in the central line exemplifies a microstate where ν unfolded links form two closed unfolded loops. The networks used in the computations are much larger than the one shown in the sketch: they include millions of semi-folded microstates.
The removed (inserted) residues are assumed to lose (gain) all their non-bonded interactions and to gain (lose) the coil entropy, except that spent to close the disordered loops protruding from the remaining globule (Galzitskaya and Finkelstein, 1999). The general assumption of this model is that the residues remaining in the globule keep their native positions and that the unfolded regions do not fold into another, non-native globule. Thus, we neglect non-native interactions, which makes our model similar to that of Gō (Taketomi et al., 1975).
Estimation of free energy

Our model considers the native structure I0, the unfolded state IL, and an ensemble of intermediate microstates Iν consisting of a native-like part and ν unfolded chain links (ν=0 for I0, ν=L for IL, L being the total number of chain links, and ν=1, …, L-1 for the semi-folded intermediates with ν disordered links). The model uses a simple free energy estimate (Galzitskaya and Finkelstein, 1999) for each microstate I:
F(I) = ε × n_I^nb − T [η_I × σ + Σ_{loops ∈ I} S_loop].   (2)
Here n_I^nb is the number of native atom-atom contacts in the native-like part of I (n_I^nb does not include contacts of neighboring residues, which also exist in the coil); ε is the energy of one contact; η_I is the number of residues in the unfolded part of I; T is the temperature; and σ is the entropy difference between the coil and the native state of a residue (we take σ = 2.3R according to Privalov (Privalov, 1979), R being the gas constant). The sum Σ is taken over all closed unfolded loops (see legend to Fig. 3) protruding from the native-like part of I. At the point of equilibrium between the native state I0 and the coil IL, we have F(I0) = F(IL), i.e., the contact energy ε (which is influenced by the solvent) and the temperature T are connected (at the mid-transition) by the equation
ε = −TNσ / n_0^nb,   (3)

where n_0^nb is the number of contacts in the native structure and N is the total number of protein chain residues. It follows from equations (2) and (3) that the F(I)/T values (which alone determine the transition state, see equation (5) below) do not depend on temperature, provided that the solvent composition corresponds to the mid-transition at this temperature. The entropy spent to close a disordered loop between the still-fixed residues k and l is estimated (Finkelstein and Badretdinov, 1997) as

S_loop = −(5/2)R ln|k − l| − (3/2)R (r_kl^2 − a^2) / (2Aa|k − l|);   (4)
here r_kl is the distance between the Cα atoms of residues k and l, a = 3.8 Å is the distance between neighboring Cα atoms in the chain, and A is the persistence length of a polypeptide (following Flory (Flory, 1969), we take A = 20 Å). The term −(5/2)R ln|k − l| is the leading one in equation (4); the coefficient −5/2 (rather than Flory's value −3/2) follows from the condition that a loop cannot penetrate inside the globule (Finkelstein and Badretdinov, 1997). We consider our model an approximation of the protein folding process rather than a detailed description of the chain motions. Our model of folding is thereby a trade-off between the configurational entropy loss and the gain of attractive interactions. The model takes into account the topology of the native state.
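For illustration, a minimal Python sketch of the free-energy estimate of equations (2)-(4) is given below. It is not the authors' actual program; the function and variable names are ours, and the mid-transition calibration of ε (equation (3)) is included as a helper.

    import math

    R = 1.987e-3          # gas constant, kcal/(mol*K)
    SIGMA = 2.3 * R       # coil-native entropy difference per residue (Privalov, 1979)
    A_PERSIST = 20.0      # persistence length of a polypeptide, angstroms (Flory, 1969)
    BOND = 3.8            # Calpha-Calpha distance a, angstroms

    def loop_entropy(k, l, r_kl):
        """Entropy spent to close a disordered loop between fixed residues k and l, eq. (4)."""
        n = abs(k - l)
        return (-2.5 * R * math.log(n)
                - 1.5 * R * (r_kl ** 2 - BOND ** 2) / (2.0 * A_PERSIST * BOND * n))

    def free_energy(n_contacts, n_unfolded, loops, eps, T):
        """Free energy of a semi-folded microstate I, eq. (2).

        n_contacts: native atom-atom contacts in the native-like part of I;
        n_unfolded: residues in the unfolded part of I;
        loops: (k, l, r_kl) triples, one per closed disordered loop in I.
        """
        s_loops = sum(loop_entropy(k, l, r) for k, l, r in loops)
        return eps * n_contacts - T * (n_unfolded * SIGMA + s_loops)

    def midtransition_eps(n_residues, n_contacts_native, T):
        """Contact energy at the mid-transition, eq. (3), where F(I0) = F(IL)."""
        return -T * n_residues * SIGMA / n_contacts_native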
Investigation of folding kinetics

We calculated how long it takes a given protein chain to fold into its known native structure, starting from the unfolded chain, by Monte-Carlo (MC) simulation using the Metropolis scheme (Metropolis et al., 1953) at the point of mid-transition. The kinematic scheme of elementary movements includes the removal of a residue from its native position to the coil or the insertion of a residue from the coil into its native position (Galzitskaya and Finkelstein, 1998). Thus, we travelled from the unfolded state to the native 3D structure without misfolding into other compact states.
An elementary MC step was done as follows. We randomly chose a residue (i.e., we generated a random number between 1 and L, where L is the total number of residues; thus, a link was one residue long). If the chosen residue was already fixed in its native position, we tried to unfold it; if it was in the coil, we tried to fix it in its native position. Then we computed the free-energy difference, ΔF, between the new and the previous intermediate structures. The MC step leads to the new structure with a probability w equal to exp(−ΔF/RT) if ΔF > 0, or to 1 if ΔF ≤ 0. Thus, if ΔF ≤ 0, the MC step always leads to the new structure. If ΔF > 0, w is compared with a random number ξ (0 < ξ < 1, generated from the uniform distribution): if w > ξ, the transition is accepted; if w ≤ ξ, it is rejected and the previous state is preserved. To estimate the characteristic first passage time, t1/2, we performed 50 MC runs for every protein (Galzitskaya and Finkelstein, 1998). For every set of values, t1/2 was determined as the number of MC steps required to complete 50% of the MC runs (25 of 50). Each individual run was started from the completely unfolded structure and was completed when the molecule became completely folded. Within the limit of 10^8 MC steps, we could calculate the folding time for 10 of the 17 investigated proteins (Garbuzynskiy et al., 2004); the remaining 7 proteins did not fold within this number of steps. Figure 4 presents typical MC Gō kinetics for one of the ten proteins. One can see (Figure 4a) that the protein chain (in this case, the src SH3 domain) stays near the unfolded state (the fraction of folded residues is below 20%) for a long time (here, for 230,000 MC steps), then folds very quickly compared to this first period (the period of overcoming the folding transition state barrier is shown in Figure 4b), then spends some time near the folded state (the fraction of folded residues is near 80% for this protein), and finally reaches the completely folded state. It should be noted that some almost-folded structures have free energies not much higher than that of the completely folded structure (thus, even if each residue can be in only one of two states, folded or unfolded, there is a basin of native states rather than a single completely folded state). The same applies, though to a lesser extent, to the basin of unfolded structures (which is definitely an ensemble, because unfolded residues have more than one conformation by definition). The rather stable "almost folded" structures are usually those in which one or several residues are still unfolded (usually residues that form a small number of contacts in the completely folded structure). The most striking cases are NMR structures that have long tails forming virtually no contacts in the native structure; it is thus not surprising that they "prefer" to be unfolded. Moreover, in some cases, almost-folded structures have lower free energy than the completely folded one, and two proteins did not fold into a completely folded structure at all (within 10^8 steps), while they did fold into almost-folded structures. For these two proteins, we took as t1/2 the time of folding to a stable structure rather than to the completely folded state.
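The elementary move and the first-passage measurement can be summarized in the following minimal Python sketch. It illustrates the Metropolis scheme described above but is not the authors' program; the free_energy callable and the list-of-booleans microstate are our own simplifications. The characteristic time t1/2 would then be the number of steps by which 25 of 50 such runs have finished.

    import math
    import random

    def mc_step(folded, free_energy, RT):
        """One elementary Metropolis move: flip a random residue between coil and native."""
        i = random.randrange(len(folded))        # choose a residue at random
        f_old = free_energy(folded)              # in practice dF would be computed incrementally
        folded[i] = not folded[i]                # try to unfold (or fold) it
        d_f = free_energy(folded) - f_old
        if d_f > 0 and random.random() >= math.exp(-d_f / RT):
            folded[i] = not folded[i]            # w <= xi: reject, restore the previous state

    def first_passage(n_residues, free_energy, RT, max_steps=10 ** 8):
        """Number of MC steps until the chain first becomes completely folded."""
        folded = [False] * n_residues            # start from the fully unfolded chain
        for step in range(1, max_steps + 1):
            mc_step(folded, free_energy, RT)
            if all(folded):
                return step
        return None                              # did not fold within max_steps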
Figure 4. (a) An example of Monte-Carlo kinetics of refolding of the src SH3 domain. The plots show the dependence of the free energy (black line) and the fraction of native residues (gray line) on the number of MC steps. The native state here is reached in ~2.5×10^5 steps. Arrows point to the native state and to FN, the free energy of the native state. (b) A fragment of the same trajectory corresponding to the period of barrier crossing.
Dynamic Programming method

A simplified description of a network of folding/unfolding pathways

In our model (Galzitskaya and Finkelstein, 1999) we consider a network of simplified stepwise unfolding pathways (see Fig. 3), each step of which is the removal of one "chain link" (i.e., a chain fragment of one or a few residues) from the native protein 3D structure. The removed chain fragments are assumed to form a random coil; they lose all their non-bonded interactions and gain the coil entropy (except that spent to close the disordered loops protruding from the remaining globule). This reductive description of the unfolded chain is our first simplification. The next is the assumption that the chain residues remaining in the globule keep their native positions and conformations, and that the unfolded regions do not fold into another, non-native globule. Thus, we actually neglect non-native interactions (cf. Taketomi et al., 1975). The last and main simplification is that we concentrate on the TS and its free energy, i.e., on the stability (actually, the instability) of partly unfolded intermediates rather than on a detailed description of the chain motions.

To use Dynamic Programming in searching for the TS in a network of folding/unfolding pathways, we have, for computational reasons, to restrict this network to no more than 10^6–10^7 intermediates. Therefore, we divide an N-residue protein chain into L ~ 20–30 chain links (of l = N/L residues each, except for the last link, which may include l or fewer residues) and use these links rather than residues as the units of unfolding. To the same end, i.e., to restrict the number of semi-folded intermediates, we consider only intermediates with no more than two closed disordered loops in the middle of the chain plus the N- and C-terminal disordered tails. These four unfolded regions should be enough to describe the unfolding of a protein of up to N ≈ 100 residues, since the estimated (Finkelstein and Badretdinov, 1997) number of coil regions in the folding nucleus is close to N^(2/3)/6. The above assumptions sharply decrease the time necessary for the calculations, while links of fewer than N/10 ≈ 6–10 residues limit the accuracy of our calculations only a little: these links are still much smaller than the expected size of a nucleus in the vicinity of the mid-transition between the folded and unfolded phases (where the folded nucleus should include roughly N/3–N/2 residues (Finkelstein and Badretdinov, 1997)).

A search for transition states in the whole network of folding/unfolding pathways by the Dynamic Programming method

Let us consider some unfolding pathway w = (I0→I1→…→IL); then F_w^# = max{F(I0), F(I1), …, F(IL)} is the free energy of the TS ("free-energy barrier") on the pathway w. The most efficient kinetic pathway has the minimal (over all pathways) TS free energy, F#min = min over all possible w {F_w^#}: this pathway passes from I0 (the native state) to IL (the coil) via the lowest free-energy barrier. Let Iν−1∈{Iν−1→Iν} mean that Iν−1 can be transformed into Iν in an elementary step (i.e., by removal of one link from the globular part of Iν−1). On every pathway I0→I1→…→IL−1→IL, all intermediates satisfy the conditions I1∈{I1→I2}, …, IL−2∈{IL−2→IL−1} (while the condition IL−1∈{IL−1→IL} is satisfied automatically). Thus,
F#min = min_{I1,…,IL−1: I1∈{I1→I2}, …, IL−2∈{IL−2→IL−1}} {max{F(I0), F(I1), …, F(IL)}},   (5)
where the maximum is taken over the microstates' free energies along every pathway I0→I1→…→IL−1→IL, and the minimum is taken across all pathways. Despite the huge number of possible pathways, the F#min value can be calculated by dynamic programming (Aho et al., 1976; Finkelstein and Roytberg, 1993). The algorithm is as follows (Galzitskaya and Finkelstein, 1999). Let p(Iν) be the altitude of the lowest free-energy barrier on the pathways leading from I0 to Iν inclusively (thus, F#min = p(IL)). The p(Iν) values are computed recursively:
p(I1) = max{F(I0), F(I1)}   for all intermediates I1;
p(I2) = min_{I1∈{I1→I2}} {max{p(I1), F(I2)}}   for all intermediates I2;
.....
p(IL−1) = min_{IL−2∈{IL−2→IL−1}} {max{p(IL−2), F(IL−1)}}   for all intermediates IL−1;   (6)
F#min = p(IL) = min_{IL−1} {max{p(IL−1), F(IL)}}.
Here, again, the maxima are taken along, and the minima across, the possible transitions between microstates. The above algorithm computes the altitude of the lowest saddle point of the free-energy barrier separating the native fold and the coil, and it does so in a time proportional to the number of elementary transitions between microstates. All the p(I) values are stored for later use in finding the saddle-point microstate(s) themselves. To find these microstate(s) (we say "microstate(s)" because there is no guarantee that only one saddle point has the minimal free energy), we perform a similar recursion in the inverse direction (Aho et al., 1976; Finkelstein and Roytberg, 1993; Galzitskaya and Finkelstein, 1999). The aim of this inverse recursion is to find q(I), the altitude of the lowest free-energy barrier on the pathways leading from I (exclusively) to IL, and then to compute

F#(I) = max{p(I), q(I)},   (7)

the altitude of the lowest free-energy barrier on the pathways leading from I0 to IL via the intermediate I. The q(I) values are also computed recursively:
q(IL−1) = F(IL)   for all intermediates IL−1;
q(IL−2) = min_{IL−1∈{IL−2→IL−1}} {max{F(IL−1), q(IL−1)}}   for all intermediates IL−2;
.....
q(I1) = min_{I2∈{I1→I2}} {max{F(I2), q(I2)}}   for all intermediates I1;   (8)
here Iν∈{Iν−1→Iν} means that microstate Iν can be obtained from Iν−1 in one elementary step. The intermediates I with F#(I) = F#min form a narrow ensemble of "the best" transition microstates {I#min} with the minimal free energy, while the intermediates with F#(I) = F(I) form a broader ensemble {I#} of all possible passes over the free-energy barrier separating I0 from IL. Although the ensemble {I#} may be somewhat redundant (since a pathway to a TS high in free energy may pass via some TS of lower free energy), it has been shown (Galzitskaya and Finkelstein, 1999) that this ensemble of all possible passes describes the TS better than the ensemble {I#min} of "the best" TSs only. Further on, we use only the ensemble {I#}. To outline the nucleus, we investigate the ensembles {I#} of all possible transition states. The Boltzmann probability of microstate I# in the ensemble {I#} is

P(I#) = exp(−F(I#)/RT) / exp(−F#/RT),   (9)

where

exp(−F#/RT) = Σ_{I#} exp(−F(I#)/RT)   (10)

is the partition function of the totality of transition states, and F# is their total free energy. The sum is taken over the whole ensemble {I#}. The lower the free energy F(I#), the higher the weight P(I#) and the more rapid the pathway via this I# (according to the conventional (Moore and Pearson, 1981) exponential dependence of the reaction rate on the transition-state free energy), and therefore the more often the chains use this pass I# during folding and unfolding.
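The two recursions can be written compactly for a generic layered network of microstates. The sketch below is a minimal Python illustration of equations (5)-(8); the graph representation and all names are our own, and the real networks contain millions of microstates.

    def transition_barriers(layers, F, succ):
        """Forward/backward minimax recursions (6) and (8) on a layered network.

        layers: lists of microstates, layers[0] = [I0], layers[-1] = [IL];
        F: dict microstate -> free energy; succ: dict microstate -> list of
        microstates reachable from it in one elementary step.
        Returns F#min (eq. 5) and F#(I) = max{p(I), q(I)} (eq. 7) for every I.
        """
        pred = {}                            # invert succ to get predecessors
        for i_state, outs in succ.items():
            for j_state in outs:
                pred.setdefault(j_state, []).append(i_state)

        i0, i_last = layers[0][0], layers[-1][0]
        p = {i0: F[i0]}                      # forward: lowest barrier on I0 -> I
        for layer in layers[1:]:
            for j_state in layer:
                p[j_state] = min(max(p[i], F[j_state]) for i in pred[j_state])
        f_min = p[i_last]                    # eq. (5)

        q = {i_last: float("-inf")}          # backward: lowest barrier on I -> IL
        for layer in reversed(layers[:-1]):
            for i_state in layer:
                q[i_state] = min(max(F[j], q[j]) for j in succ[i_state])

        f_sharp = {i: max(p[i], q[i]) for i in p}   # eq. (7)
        return f_min, f_sharp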
Prediction of the folding nucleus from MC simulations and the Dynamic Programming method

To predict the structure of the folding nucleus of each protein in the MC simulations, we took the absolute maximum of the free energy (which corresponds to the folding transition state) between the last occurrence of the completely unfolded state and the first occurrence of the completely folded state (for the two proteins which folded to an "almost folded" state, we took the interval between the last occurrence of the completely unfolded state and the first occurrence of the "almost folded" state). That is, this is the period of successful crossing of the free-energy barrier between the unfolded and folded states. The experimental data on the TS structure are expressed in Φ-values (Matouschek et al., 1990; Fersht, 1995; 1997; Fersht et al., 1992). Φ for the chain residue r is close to 1 when this residue has its native conformation and environment in the TS, close to 0 when the residue is unfolded in the TS, and 0 < Φ < 1 when the residue loses a part of its native contacts in the TS. Strictly speaking, the experimental value of Φ for residue r is determined as (Matouschek et al., 1990)
Φ = Δ_r[F(TS) − F(IL)] / Δ_r[F(I0) − F(IL)],   (11)
where Δ_r[F(I0) − F(IL)] is the change, induced by a mutation of residue r, of the free-energy difference between the native state I0 and the unfolded state IL, and Δ_r[F(TS) − F(IL)] is the mutation-induced change of the free-energy difference between the transition state (TS) and the unfolded state IL. The theory estimates the Φ-values as follows. According to equation (2), the value Δ_r[F(I0) − F(IL)] is equal to ε × Δ_r(n_0^nb), where ε is the contact energy and Δ_r(n_0^nb) is the mutation-induced change in the number of contacts of residue r in the native state I0 (since all native contacts are assumed to be equal, and no contacts are assumed to be present in the unfolded state IL). Correspondingly, Δ_r[F(TS) − F(IL)] = ε × <Δ_r(n_I^nb)>_{I#}, where <Δ_r(n_I^nb)>_{I#} is the same mutation-induced change in the number of native contacts of residue r in the transition state, averaged over the TS ensemble {I#}. This change can be calculated as

<Δ_r(n_I^nb)>_{I#} = Σ_{I#} P(I#) Δ_r(n_{I#}^nb),   (12)
where P(I#) is the probability of microstate I# in the TS ensemble (calculated according to equation (9)) and Δ_r(n_{I#}^nb) is the mutation-induced change in the number of native contacts of residue r in microstate I#. In the Monte-Carlo approach, we performed 50 runs, and thus the number of transition states was 50 (corresponding to the maximum of free energy during barrier crossing in each trajectory). In the Dynamic Programming approach, we considered all transition states found on the free-energy folding landscape. The values Δ_r(n_I^nb) can be calculated for each microstate I from the atomic coordinates of the non-mutated protein when we know which atoms are deleted in the mutant. However, this calculation assumes that the protein structure is not disturbed by the mutation. Therefore, we have to consider only those mutations which do not insert new groups into residue r. The computed values

Φ = <Δ_r(n_I^nb)>_{I#} / Δ_r(n_0^nb)   (13)
are to be compared with the experimental Φ-values to estimate the correlation between theory and experiment. Thus, we excluded from consideration those experimental mutations which added new groups. It should be noted, though, that experimentalists also prefer not to make mutations that increase the side-chain size, because of possible clashes (Fersht et al., 1992). We also excluded several Φ-values which were below zero or greater than unity, since they have no structural interpretation in terms of a native-like folding nucleus.
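Equations (9), (12) and (13) combine into a short Boltzmann-weighted average. The following minimal Python sketch (with illustrative input names, not the chapter's program) shows the computation of one predicted Φ-value:

    import math

    def predicted_phi(d_contacts_ts, f_ts, d_contacts_native, RT):
        """Predicted Phi-value of residue r, eqs. (9), (12), (13).

        d_contacts_ts[m]: mutation-induced change of the native contacts of r
        in TS microstate m; f_ts[m]: free energy F(I#) of that microstate;
        d_contacts_native: the same change in the native structure I0.
        """
        weights = [math.exp(-f / RT) for f in f_ts]          # unnormalized eq. (9)
        z = sum(weights)                                     # partition function, eq. (10)
        mean_change = sum(w * d for w, d in
                          zip(weights, d_contacts_ts)) / z   # eq. (12)
        return mean_change / d_contacts_native               # eq. (13)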
Comparison of Φ-values and folding time predictions by using Monte-Carlo and Dynamic Programming approaches

To study the order of events in the process of folding, we divided the time interval between the last occurrence of the completely unfolded state and the first occurrence of the completely
folded state into 6 equal time periods. We averaged over 50 runs to find, for every contact, the mean period of its first formation. We then constructed a matrix of contacts according to the first formation of each contact (see Figures 5-14) to understand at which step each contact forms. It should be noted that, for the sake of simplicity, contacts between amino acid residues rather than between atoms are shown in the contact matrices (although atom-atom contacts are considered in the program, a residue folds or unfolds as a whole in our model, so we can depict a residue as a whole).
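A minimal NumPy sketch of this averaging and binning is given below; the array names are illustrative, and the actual program works with atom-atom contacts.

    import numpy as np

    def contact_time_matrix(first_formation, t_start, t_end, n_bins=6):
        """Bin the mean first-formation time of each residue-residue contact.

        first_formation: array (runs, N, N) of the MC step at which contact
        (i, j) first forms in each run; t_start, t_end: arrays (runs,) with
        the last fully unfolded and the first fully folded step of each run.
        Returns an (N, N) matrix of bin indices 0..n_bins-1 (the color code
        used in Figures 5-14).
        """
        span = (t_end - t_start)[:, None, None]
        frac = (first_formation - t_start[:, None, None]) / span
        mean_frac = frac.mean(axis=0)            # average over the 50 runs
        return np.clip((mean_frac * n_bins).astype(int), 0, n_bins - 1)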
Figure 5. (a) The matrix of contacts for barnase, colored according to the mean period of the first formation of each contact averaged over 50 MC runs. The time interval between the last occurrence of the completely unfolded state and the first occurrence of the completely folded state has been divided into 6 equal periods: 0-16% of the time corresponds to red on the matrix, 17-33% to yellow, 34-50% to green, 51-66% to cyan, 67-83% to blue and 84-100% to violet. (b) The structure of barnase, colored according to the matrix of contacts (if a residue formed different contacts at different stages, we took the earliest time). (c) Comparison of Φ-values predicted by the Monte-Carlo and Dynamic Programming methods for each residue, corresponding to its replacement with glycine (if the residue is a glycine in the wild-type protein, the Φ-value is presented as the probability P(I#) that this residue is structured in the TS ensemble). (d) Profiles of experimental and predicted (for the same mutation as was made experimentally) Φ-values by the Monte-Carlo and Dynamic Programming methods for the experimentally investigated residues. The Φ-value calculations were done with hydrogen atoms, with a contact distance between heavy atoms r_cont ≤ 6 Å, between hydrogen and heavy atoms r_H-heavy ≤ 5 Å, and between hydrogen atoms r_HH ≤ 4 Å. Φ-values were calculated for each of the mutations using four-residue links. The experimental Φ-values were taken from (Serrano et al., 1992).
It should be noted that, for eight of the ten proteins, during the last time period the protein fluctuates near the completely folded state, being almost folded (80-95% of the amino acid residues are folded). Thus, we will consider five periods as the barrier crossing time, and the last period as the native area. In barnase, the first part to fold in our Monte-Carlo simulations is the β-sheet in the C-terminal part of the molecule; this takes the first 20% of the barrier crossing time. Moreover, this part of the molecule behaves as a kind of intermediate, since the molecule remains in this stage for a rather long period of time (the β-sheet is folded while the rest of the molecule is unfolded). Then the N-terminal (α-helical) part of the molecule folds (the other 80% of the barrier crossing time). This sequence of events persists in all 50 Monte-Carlo runs; however, the exact order of structuring of individual β-strands is not the same in different runs. Thus, there is in fact an ensemble of folding pathways, but in all 50 runs the rough order is the same: first the C-terminal β-sheet, then the α-helical part of the molecule. According to our analysis, the folding transition state of this protein consists of the C-terminal β-sheet and the N-terminal α-helix which is in contact with the β-sheet. Although both computational methods underestimate the Φ-values in the N-terminal α-helix (see Fig. 5), the predictions are in considerable agreement with experiment (the correlation coefficient between predicted and experimental Φ-values is 0.55 for the Monte-Carlo method and 0.68 for the Dynamic Programming method).

Among the analyzed proteins, three belong to one fold, the SH3-like barrel: two SH3 domains (the src SH3 domain and the SH3 domain of α-spectrin) and their structural analog, the Sso7d protein. In both SH3 domains, whose folding pathways were investigated experimentally, the sequence of events during folding is predicted to be roughly the same. The C-terminal β-hairpin (the distal β-hairpin) folds first (during the first 20% of the barrier crossing time); then the N-terminal part of the molecule and the very C-terminal β-strand fold. In the case of the SH3 domain of α-spectrin, the second phase occupies 80% of the barrier crossing time (see Fig. 6). In the case of the src SH3 domain, the protein is almost folded after 60% of the time and spends the rest of the time fluctuating near the native state (see Fig. 7). In Sso7d, which is a distant structural analog of SH3 domains, the central part of the molecule (β-hairpin II and then β-hairpin III) folds first (during the first 40% of the barrier crossing time), then the C-terminal α-helix folds (by 60% of the time), and then the N-terminal part of the molecule (β-hairpin I) folds during the remaining 40% of the barrier crossing time (see Fig. 8). In the case of the two SH3 domains, the sequence of events in each of the 50 runs is roughly the same; however, there are some differences in the order of formation of individual β-strands. In the case of Sso7d, the difference between individual runs is somewhat greater: in some runs the folding process begins with the formation of the C-terminal α-helix, followed by the C-terminal β-hairpin and the N-terminal part of the molecule. For all three proteins, the predicted Φ-values correlate well with the experimental ones. For the SH3 domain of α-spectrin, the correlation between theory and experiment is 0.79 and 0.77 for the Monte-Carlo and Dynamic Programming methods, respectively.
For the src SH3 domain, the correlation coefficients are 0.46 and 0.50, respectively. For Sso7d, they are 0.78 and 0.68 for the Monte-Carlo and Dynamic Programming methods, respectively. High Φ-values are predicted (in agreement with the experimental ones) in the C-terminal half of the SH3 domains. In the case of Sso7d, high Φ-values (both predicted and experimental) occur in the central part of the sequence as well as in the C-terminal α-helix.
Figure 6. The same as Figure 5, for the SH3 domain of α-spectrin. In the Dynamic Programming method, the link size was two residues. The experimental Φ-values were taken from (Martinez and Serrano, 1999).
Figure 7. The same as Figure 5, for the src SH3 domain. In the Dynamic Programming method, the link size was two residues. The experimental Φ-values were taken from (Grantcharova et al., 1998).
Figure 8. The same as Figure 5, for the Sso7d protein. In the Dynamic Programming method, the link size was two residues. The experimental Φ-values were taken from (Guerois and Serrano, 2000).
CI2 is one of the best-characterized proteins in terms of Φ-value analysis. In CI2, the C-terminal part of the large α-helix is predicted to fold first (during the first 20% of the barrier crossing time). Then (by 40% of the time) the rest of the α-helix folds, followed by two β-strands (adjacent to the functional binding loop) C-terminal to the α-helix as well as the N-terminal β-strand (by 60% of the barrier crossing time). The last to fold is the C-terminal β-strand. In individual runs, there are considerable deviations from the average pathway: in some runs folding begins not from the large α-helix but from a loop preceding it, and in other runs folding begins from the N-terminal and C-terminal β-strands while the large α-helix folds later. Thus, several parallel folding pathways are rather pronounced in CI2 folding. There is also a considerable difference between the transition state ensembles predicted by the Monte-Carlo and Dynamic Programming methods (see Fig. 9). The correlation coefficients between prediction and experiment are 0.33 and 0.63 for the Monte-Carlo and Dynamic Programming methods, respectively.
Figure 9. The same as Figure 5, for the CI2 protein. In the Dynamic Programming method, the link size was two residues. The experimental Φ-values were taken from (Itzhaki et al., 1995).
B1 domains are another case of proteins with the same fold but low sequence identity. In the B1 domain of protein G, the C-terminal β-hairpin and the α-helix fold first in our Monte-Carlo simulations (during the first 20% of the barrier crossing time). Then the N-terminal β-hairpin folds (during 40-60% of the barrier crossing time). The remaining time the protein spends near the native state (see Fig. 10). In the B1 domain of protein L, the N-terminal hairpin tends to fold first (40% of the time); the C-terminal hairpin follows (50-60% of the barrier crossing time), and the α-helix is the last (the 80-100% interval of the barrier crossing time). In both proteins, a comparison of individual folding trajectories shows parallel folding pathways: in the case of protein G, there are several trajectories where it folds in a protein L manner (folding begins from the N-terminal hairpin and then the C-terminal one, with the α-helix last; see Fig. 11), while some trajectories for protein L resemble those typical of protein G (the C-terminal β-hairpin and the α-helix fold first). It should be underlined here that some information can be lost through averaging: it is difficult to resolve parallel folding pathways by examining just the average matrix of contacts. A similar conclusion was obtained in the work of Shimada and Shakhnovich (2002), where it was shown, using all-atom Monte-Carlo simulation
with a Gō potential, that protein G folds through multiple pathways, each of which passes through an on-pathway intermediate. However, both our methods correctly predict the presence of differences in the ensembles of folding nuclei of these two domains. For protein G, the correlation coefficients are 0.48 and 0.83 for the Monte-Carlo and Dynamic Programming methods, respectively. For protein L, the correlation coefficients are -0.03 and 0.15, respectively. Although both our methods overestimate the Φ-values in the C-terminal β-hairpin of protein L, and the Monte-Carlo method also underestimates the Φ-values in the C-terminal β-hairpin of protein G, in other regions the predictions are consistent with the experimental data.
Figure 10. The same as Figure 5, for the B1 domain of protein G. In the Dynamic Programming method, the link size was two residues. The experimental Φ-values were taken from (McCallister et al., 2000).
Figure 11. The same as Figure 5, for the B1 domain of protein L. In the Dynamic Programming method, the link size was two residues. The experimental Φ-values were taken from (Kim et al., 2000).
In the fibronectin domain FNfn10, the first part of the protein to fold is the central β-hairpin (during the first 20% of the barrier crossing time). Then (during 40-60% of the barrier crossing time) the C-terminal part of the molecule folds. The last is the N-terminal β-hairpin (Fig. 12). It should be mentioned, though, that this domain does not fold completely: the very N-terminus as well as two large loops "prefer" to be unfolded. For this domain, there are some differences between individual trajectories: the order of structuring of different segments of secondary structure varies, but less than for the other proteins. For the fibronectin domain FNfn10, we obtained the worst correlation of Φ-value predictions with experiment: the correlation coefficients are -0.14 and -0.56 for the Monte-Carlo and Dynamic Programming methods, respectively.
Figure 12. The same as Figure 5, for the FNfn10 domain of fibronectin. In the Dynamic Programming method, the link size was three residues. The experimental Φ-values were taken from (Cota et al., 2001).
In domain 1 of villin (Fig. 13), the first thing to fold in the MC simulations is a rather compact structure (known as the "aliphatic core" of this protein) formed by β-strands β1, β2, β3, and β6. Moreover, this compact structure "lives" for a rather long time before the rest of the protein folds (see Fig. 13 (e, f)), and it may be the intermediate that is observed for this protein experimentally. Then (by 60% of the barrier crossing time) the β4 strand folds, and the last to fold is the C-terminal part (α2, β5, α3, and β7). The N-terminal α-helix (α1) as well as the very C-terminus remain unfolded in the final structure: this domain does not fold into the completely folded state. All individual runs are approximately the same in the first step (all runs lead to the formation of the same intermediate structure composed of β-strands β1, β2, β3, and β6). However, there are substantial differences between individual trajectories in the second step of folding (from the intermediate structure to the native one). It should be stressed once again that this protein does not fold completely, and some elements of the structure (especially the α1 helix) tend to be destabilized in the stable structure. For villin, the correlation coefficients are 0.41 and 0.53 for the Monte-Carlo and Dynamic Programming methods, respectively. It is also noteworthy that experiments show that the folding nucleus of this protein is indeed located in the "aliphatic core"; thus, our predictions are consistent with the experimental data.
Figure 13. (a-d) The same as Figure 5, for domain 1 of villin. In the Dynamic Programming method, the link size was four residues. The experimental Φ-values were taken from (Choe et al., 2000). (e) A fragment of one of the folding trajectories. The black curve shows the free energy as a function of time; the gray curve shows the number of native amino acid residues as a function of time. (f) The dependence of the free energy on the number of native amino acid residues for the same fragment of the same folding trajectory.
In CheY (Fig. 14), the N-terminal part of the molecule (residues 1-86) folds first (during the first 20% of the barrier crossing time), forming an intermediate state which exists for quite a long time (see Fig. 14 (e, f)) before the C-terminus folds. After the C-terminus folds, the native state is almost complete. Before this protein folds completely, it fluctuates near its native state for an extremely long time in most Monte-Carlo trajectories, and each contact forms and breaks more than once during this fluctuation; thus, the first formation of all contacts appears to occur during the first 20% of the time between the last occurrence of the completely
unfolded state and the first occurrence of the completely folded state, and thus the whole structure is red in Figure 14 (a, b). This behavior (the N-terminal part folds first) is roughly the same in all individual folding trajectories; however, there are some differences in the order of folding of individual elements of secondary structure. For CheY, the correlation coefficients are 0.49 and 0.61 for the Monte-Carlo and Dynamic Programming methods, respectively.
Figure 14. (a-d) The same as Figure 5, for the CheY protein. In the Dynamic Programming method, the link size was four residues. The experimental Φ-values were taken from (López-Hernández and Serrano, 1996). (e) A fragment of one of the folding trajectories. The black curve shows the free energy as a function of time; the gray curve shows the number of native amino acid residues as a function of time. (f) The dependence of the free energy on the number of native amino acid residues for the same fragment of the same folding trajectory.
Summing up our data, one can see that the Φ-values calculated from the Monte-Carlo simulations of ten proteins correlate with the experimental data (correlation coefficient 0.41, see Table 1) roughly at the same level as the Φ-values calculated with the Dynamic Programming approach (correlation coefficient 0.48, see Table 1). We have already shown that our model is rather sensitive to the experimental method used to resolve the 3D structure employed in our calculations, X-ray or NMR (Garbuzynskiy et al., 2004). Moreover, we found statistically significant differences in contacts between X-ray and NMR structures of the same proteins in a large database of proteins whose 3D structures were determined by both methods (Garbuzynskiy et al., 2005). Thus, here we divided all structures into X-ray and NMR ones and computed the correlation coefficients between theory and experiment separately for each group. The average correlation coefficients are 0.70 (Dynamic Programming) and 0.57 (Monte-Carlo simulations) for X-ray structures, while for NMR structures they are 0.16 and 0.18, respectively. Thus, we can predict folding nuclei in X-ray structures more successfully than in NMR structures.

Table 1. The computed characteristic first passage time t1/2 (the number of MC steps made until half of the molecules fold) and comparison of correlations between predicted and observed Φ-values for X-ray- and NMR-resolved proteins
Protein (or domain) name (PDB entry) | ln(1/kf) at mid-transition (experiment) | ln(t1/2) (Monte-Carlo steps) | Correlation coefficient (Monte-Carlo simulations) | Correlation coefficient (Dynamic Programming)
SH3 domain of α-spectrin (1shg) | 3.0 (Viguera et al., 1996) | 16.8 | 0.79 | 0.77
Sso7d (1bf4) | -0.5 (Guerois and Serrano, 2000) | 15.8 | 0.78 | 0.68
B1 domain of protein G (1pgb) | 0.6 (McCallister et al., 2000) | 16.9 | 0.48 | 0.83
Barnase (1rnb) | 4.3 (Matouschek et al., 1990) | 18.5 | 0.55 | 0.68
CheY (3chy) | 1.6 (Muñoz et al., 1994) | 17.4 | 0.49 | 0.61
CI2 (2ci2) | 3.7 (Jackson and Fersht, 1991) | 17.5 | 0.33 | 0.63
Average for proteins whose structures were solved by X-ray | | | 0.57(±0.07) | 0.70(±0.03)
src SH3 domain (1srl) | 0.3 (Grantcharova and Baker, 1997) | 13.7 | 0.46 | 0.50
Domain 1 of villin (2vik) | 0.2 (Choe et al., 1998) | 16.1 | 0.41 | 0.53
B1 domain of protein L (2ptl) | 1.9 (Kim et al., 2000) | 17.9 | -0.03 | 0.15
FNfn10 domain of fibronectin (1ttg) | 0.4 (Cota and Clarke, 2000) | 15.7 | -0.14 | -0.56
Average for proteins whose structures were solved by NMR | | | 0.18(±0.15) | 0.16(±0.25)
Average for all proteins | | | 0.41(±0.10) | 0.48(±0.10)
Estimation of the protein folding rate from calculations

To use dynamic programming in searching for the TS in a network of folding/unfolding pathways, we have, for computational reasons, to restrict this network to no more than ~10^6 intermediates. Therefore, we divide an N-residue protein chain into L ~ 20–30 chain links and use a limited number of loops. Using the Monte-Carlo (MC) simulation of folding, we can, in principle, avoid these simplifications. Since the folding rate kf should be proportional to exp(−F#/RT), and since our model allows us to calculate the transition state free energy F# (see eq. 10), we can estimate the correlation of the computed F#/RT values with the experimentally obtained folding rates kf (or rather, ln(kf)). The correlation coefficient between the computed (for mid-transition, see above) −F#/RT values and ln(kf at mid-transition) is 0.53±0.11 (see Figure 15b).
Figure 15. (a) Correlation between the computed characteristic first passage time t1/2 (the number of MC steps made until half of the molecules fold) and the experimentally measured folding time (i.e., t = 1/kf) at the mid-transition for the 10 proteins whose folding simulation was completed within 10^8 MC steps. The experimental folding time t is measured in seconds; the simulation time t1/2 is measured in MC steps. Both are represented on a logarithmic scale. The correlation coefficient is 0.71±0.05. (b) Correlation between the computed transition state free energy F#/RT and the mid-transition folding time t = 1/kf (measured in seconds and represented on a logarithmic scale) for all 10 investigated wild-type proteins. The correlation coefficient is 0.53±0.11.
The characteristic first passage times t1/2 computed at the mid-transition for the 10 proteins correlate well with ln(kf at mid-transition): the correlation coefficient is -0.71±0.11 (see Figure 15a).
Discussion

It should be underlined that it is difficult to detect transition states in Molecular Dynamics simulations because they do not directly calculate the free energy (Shakhnovich, 2006). From the presented results, it is evident that our method is able to predict the folding rates as well as the folding nuclei of proteins with known 3D structures. Both methods (Monte-Carlo simulations of the folding of individual molecules and analysis of the whole landscape of folding/unfolding pathways by Dynamic Programming) give predictions which are in reasonable agreement with experiment. Interestingly, the predictions of folding nuclei are better if one considers the whole landscape of pathways, while the predictions of folding rates are better if one considers the folding of individual protein molecules directly by the Monte-Carlo method. The presented model of protein folding is a rough approximation of the process, but it captures essential features of folding. The model neglects the ruggedness of the landscape, although fluctuation effects can lead to a very broad trapping-time distribution (Wolynes, 1997). Consideration of models with native topology is an efficient way to explore the global structural features of biological molecules and provides some information about the folding pathway both in theoretical models and in real proteins. The presented theoretical simulations agree with the available experimental data on folding rates. In our theoretical modeling of protein folding, the rate of reaching the native state at the mid-transition is limited by the height of the transition state barrier on the pathway from the unfolded to the native state. Given the dependence between the experimental folding rates and the number of Monte-Carlo steps, we can predict the folding rates of proteins for which experimental data are unavailable using Monte-Carlo Gō simulations. The obtained results may be useful for predicting protein folding rates from the 3D protein structure alone.
Acknowledgments

This work was supported by the "Molecular and Cell Biology" and "Fundamental Science − Medicine" programs, by the Russian Foundation for Basic Research (08-04-00561-a and 07-04-00388-а), by the Science School grant (2791.2008.4), by the INTAS grant (05-1000004-7747), by the "Russian Science Support Foundation", and by the Howard Hughes Medical Institute (grant 55005607).
References

Abkevich, V. I., Gutin, A. M. and Shakhnovich, E. I. (1994). Free energy landscape for protein folding kinetics: Intermediates, traps, and multiple pathways in theory and lattice model simulations. J. Chem. Phys., 101, 6052-6062.
Abkevich, V. I., Gutin, A. M. and Shakhnovich, E. I. (1994). Specific nucleus as the transition state for protein folding: evidence from the lattice model. Biochemistry, 33, 10026-10036.
Aho, A., Hopcroft, J. and Ullman, J. (1976). The Design and Analysis of Computer Algorithms (Addison-Wesley, Reading, MA).
Alm, E. and Baker, D. (1999). Prediction of protein-folding mechanisms from free-energy landscapes derived from native structures. Proc. Natl Acad. Sci. USA, 96, 11305-11310.
Alm, E., Morozov, A. V., Kortemme, T. and Baker, D. (2002). Simple physical models connect theory and experiment in protein folding kinetics. J. Mol. Biol., 322, 463-476.
Baker, D. (2000). A surprising simplicity to protein folding. Nature, 405, 39-42.
Bastolla, U., Farwer, J., Knapp, E. W. and Vendruscolo, M. (2001). How to guarantee optimal stability for most representative structures in the Protein Data Bank. Proteins: Struct. Funct. Genet., 44, 79-96.
Blanco, F. J., Rivas, G. and Serrano, L. (1994). A short linear peptide that folds into a native stable beta-hairpin in aqueous solution. Nature Struct. Biol., 1, 584-590.
Brooks, C. L. III, Gruebele, M., Onuchic, J. N. and Wolynes, P. G. (1998). Chemical physics of protein folding. Proc. Natl Acad. Sci. USA, 95, 11037-11038.
Burton, R. E., Huang, G. S., Daugherty, M. A., Calderoni, T. L. and Oas, T. G. (1997). The energy landscape of a fast-folding protein mapped by Ala-->Gly substitutions. Nature Struct. Biol., 4, 305-310.
Caflisch, A. and Karplus, M. (1995). Acid and thermal denaturation of barnase investigated by molecular dynamics simulations. J. Mol. Biol., 252, 672-708.
Chiti, F., Taddei, N., White, P., Bucciantini, M., Magherini, F., Stefani, M. and Dobson, C. (1999). Mutational analysis of acylphosphatase suggests the importance of topology and contact order in protein folding. Nature Struct. Biol., 6, 1005-1009.
Choe, S. E., Matsudaira, P. T., Osterhout, J., Wagner, G. and Shakhnovich, E. I. (1998). Folding kinetics of villin 14T, a protein domain with a central beta-sheet and two hydrophobic cores. Biochemistry, 37, 14508-14518.
Choe, S. E., Li, L., Matsudaira, P. T., Wagner, G. and Shakhnovich, E. I. (2000). Differential stabilization of two hydrophobic cores in the transition state of the villin 14T folding reaction. J. Mol. Biol., 304, 99-115.
Clarke, J., Cota, E., Fowler, S. B. and Hamill, S. J. (1999). Folding studies of immunoglobulin-like beta-sandwich proteins suggest that they share a common folding pathway. Structure Fold. Design, 7, 1145-1153.
Clementi, C., Jennings, P. A. and Onuchic, J. N. (2000). How native-state topology affects the folding of dihydrofolate reductase and interleukin-1beta. Proc. Natl Acad. Sci. USA, 97, 5871-5876.
Cordier-Ochsenbein, F., Guerois, R., Baleux, F., Huynh-Dinh, T., Lirsac, P. N., Russo-Marie, F., Neumann, J. M. and Sanson, A. (1998). Exploring the folding pathways of annexin I, a
multidomain protein. I. Non-native structures stabilize the partially folded state of the isolated domain 2 of annexin I. J. Mol. Biol., 279, 1163-1175.
Cota, E. and Clarke, J. (2000). Folding of beta-sandwich proteins: three-state transition of a fibronectin type III module. Protein Sci., 9, 112-120.
Cota, E., Steward, A., Fowler, S. B. and Clarke, J. (2001). The folding nucleus of a fibronectin type III domain is composed of core residues of the immunoglobulin-like fold. J. Mol. Biol., 305, 1185-1194.
Daggett, V. and Fersht, A. R. (2003). The present view of the mechanism of protein folding. Nature Rev. Mol. Cell Biol., 4, 497-502.
Dobson, C. M. and Karplus, M. (1999). The fundamentals of protein folding: bringing together theory and experiment. Curr. Opin. Struct. Biol., 9, 92-101.
Ferguson, N., Pires, J. R., Toepert, F., Johnson, C. M., Pan, Y. P., Volkmer-Engert, R., Schneider-Mergener, J., Daggett, V., Oschkinat, H. and Fersht, A. (2001). Using flexible loop mimetics to extend phi-value analysis to secondary structure interactions. Proc. Natl Acad. Sci. USA, 98, 13008-13013.
Ferguson, N., Capaldi, A. P., James, R., Kleanthous, C. and Radford, S. E. (1999). Rapid folding with and without populated intermediates in the homologous four-helix proteins Im7 and Im9. J. Mol. Biol., 286, 1597-1608.
Fersht, A. R. (1995). Characterizing transition states in protein folding: an essential step in the puzzle. Curr. Opin. Struct. Biol., 5, 79-84.
Fersht, A. R. (1997). Nucleation mechanisms in protein folding. Curr. Opin. Struct. Biol., 7, 3-9.
Fersht, A. R., Matouschek, A. and Serrano, L. (1992). The folding of an enzyme. I. Theory of protein engineering analysis of stability and pathway of protein folding. J. Mol. Biol., 224, 771-782.
Finkelstein, A. V. (1997). Can protein unfolding simulate protein folding? Prot. Eng., 10, 843-845.
Finkelstein, A. V. and Roytberg, M. A. (1993). Computation of biopolymers: a general approach to different problems. Biosystems, 30, 1-19.
Finkelstein, A. V. and Badretdinov, A. Ya. (1997). Physical reasons for fast folding of stable spatial structure of proteins: a solution of the Levinthal paradox. Mol. Biol. (Russia), 31, 391-398.
Finkelstein, A. V. and Galzitskaya, O. V. (2004). Physics of protein folding. Phys. Life Rev., 1, 23-56.
Finkelstein, A. V. and Badretdinov, A. Ya. (1997). Rate of protein folding near the point of thermodynamic equilibrium between the coil and the most stable chain fold. Fold. Des., 2, 115-121.
Flory, P. J. (1969). Statistical Mechanics of Chain Molecules. Interscience, New York.
Fulton, K., Main, E., Daggett, V. and Jackson, S. E. (1999). Mapping the interactions present in the transition state for unfolding/folding of FKBP12. J. Mol. Biol., 291, 445-461.
Galzitskaya, O. V., Ivankov, D. N. and Finkelstein, A. V. (2001). Folding nuclei in proteins. FEBS Lett., 489, 113-118.
Galzitskaya, O. V. (2002). Sensitivity of folding pathway to the details of amino-acid sequence. Mol. Biol. (Moscow), 36, 386-390.
Galzitskaya, O. V. and Finkelstein, A. V. (1995). Folding of chains with random and edited sequences: similarities and differences. Protein Eng., 8, 883-892.
Galzitskaya, O. V. and Finkelstein, A. V. (1999). A theoretical search for folding/unfolding nuclei in three-dimensional protein structures. Proc. Natl Acad. Sci. USA, 96, 11299-11304.
Galzitskaya, O. V., Skoogarev, A. V., Ivankov, D. N. and Finkelstein, A. V. (1999). Folding nuclei in 3D protein structures. In: Proceedings of the Pacific Symposium on Biocomputing 2000, ed. R. B. Altman, A. K. Dunker, L. Hunter, K. Lauderdale and T. E. Klein (World Scientific, Singapore - New Jersey - London - Hong Kong), 131-142.
Galzitskaya, O. V. and Garbuzynskiy, S. O. (2006). Entropy capacity determines protein folding. Proteins, 63, 144-154.
Galzitskaya, O. V., Reifsnyder, D. C., Bogatyreva, N. C., Ivankov, D. N. and Garbuzynskiy, S. O. (2008). More compact protein globules exhibit slower folding rates. Proteins, 70, 329-332.
Galzitskaya, O. V., Garbuzynskiy, S. O., Ivankov, D. N. and Finkelstein, A. V. (2003). Chain length is the main determinant of the folding rate for proteins with three-state folding kinetics. Proteins, 51, 162-166.
Galzitskaya, O. V. and Finkelstein, A. V. (1998). Folding rate dependence on the chain length of RNA-like heteropolymers. Folding & Design, 3, 69-78.
Garbuzynskiy, S. O., Finkelstein, A. V. and Galzitskaya, O. V. (2004). Outlining folding nuclei in globular proteins. J. Mol. Biol., 336, 509-525.
Garbuzynskiy, S. O., Melnik, B. S., Lobanov, M. Yu., Finkelstein, A. V. and Galzitskaya, O. V. (2005). Comparison of X-ray and NMR structures: is there a systematic difference in residue contacts between X-ray- and NMR-resolved protein structures? Proteins, 60, 139-147.
Gillespie, J. R. and Shortle, D. (1997). Characterization of long-range structure in the denatured state of staphylococcal nuclease. II. Distance restraints from paramagnetic relaxation and calculation of an ensemble of structures. J. Mol. Biol., 268, 170-184.
Goldenberg, D. P., Frieden, R. W., Haack, J. A. and Morrison, T. B. (1989). Mutational analysis of a protein-folding pathway. Nature, 338, 498-511.
Grantcharova, V. P., Riddle, D. S., Santiago, J. V. and Baker, D. (1998). Important role of hydrogen bonds in the structurally polarized transition state for folding of the src SH3 domain. Nature Struct. Biol., 5, 714-720.
Grantcharova, V. P. and Baker, D. (1997). Folding dynamics of the src SH3 domain. Biochemistry, 36, 15685-15692.
Grantcharova, V., Alm, E. J., Baker, D. and Horwich, A. L. (2001). Mechanisms of protein folding. Curr. Opin. Struct. Biol., 11, 70-82.
Grantcharova, V. P. and Baker, D. (2001). Circularization changes the folding transition state of the src SH3 domain. J. Mol. Biol., 306, 555-563.
Guerois, R. and Serrano, L. (2000). The SH3-fold family: experimental evidence and prediction of variations in the folding pathways. J. Mol. Biol., 304, 967-982.
Guijarro, J. I., Morton, C. J., Plaxco, K. W., Campbell, I. D. and Dobson, C. M. (1998). Folding kinetics of the SH3 domain of PI3 kinase by real-time NMR combined with optical spectroscopy. J. Mol. Biol., 276, 657-667.
Gutin, A. M., Abkevich, V. I. and Shakhnovich, E. I. (1996). Chain length scaling of protein folding time. Phys. Rev. Lett., 77, 5433-5436.
Itzhaki, L. S., Otzen, D. T. and Fersht, A. R. (1995). The structure of the transition state for folding of chymotrypsin inhibitor 2 analysed by protein engineering methods: evidence for a nucleation-condensation mechanism for protein folding. J. Mol. Biol., 254, 260-288.
Ivankov, D. N. and Finkelstein, A. V. (2004). Prediction of protein folding rates from the amino acid sequence-predicted secondary structure. Proc. Natl Acad. Sci. USA, 101, 8942-8944.
Jackson, S. E. (1998). How do small single-domain proteins fold? Fold. Des., 3, R81-R91.
Jackson, S. E. and Fersht, A. R. (1991). Folding of chymotrypsin inhibitor 2. 1. Evidence for a two-state transition. Biochemistry, 30, 10428-10435.
Kim, D. E., Fisher, C. and Baker, D. (2000). A breakdown of symmetry in the folding transition state of protein L. J. Mol. Biol., 298, 971-984.
Kortemme, T., Kelly, M. J., Kay, L. E., Forman-Kay, J. and Serrano, L. (2000). Similarities between the spectrin SH3 domain denatured state and its folding transition state. J. Mol. Biol., 297, 1217-1229.
Kragelund, B. B., Osmark, P., Neergaard, T. B., Schiodt, J., Kristiansen, K., Knudsen, J. and Poulsen, F. M. (1999). The formation of a native-like structure containing eight conserved hydrophobic residues is rate limiting in two-state protein folding of ACBP. Nature Struct. Biol., 6, 594-601.
Kragelund, B. B., Robinson, C. V., Knudsen, J., Dobson, C. M. and Poulsen, F. M. (1995). Folding of a four-helix bundle: studies of acyl-coenzyme A binding protein. Biochemistry, 34, 7217-7224.
Krieger, E., Darden, T., Nabuurs, S. B., Finkelstein, A. and Vriend, G. (2004). Making optimal use of empirical energy functions: force-field parameterization in crystal space. Proteins, 57, 678-683.
Kuznetsov, I. B. and Rackovsky, S. (2004). Class-specific correlations between protein folding rate, structure-derived, and sequence-derived descriptors. Proteins, 54, 333-341.
Lazaridis, T. and Karplus, M. (1997). “New view” of protein folding reconciled with the old through multiple unfolding simulations. Science, 278, 1928-1931.
Leffler, J. E. and Grunwald, E. (1963). Rates and Equilibria of Organic Chemistry. Dover, New York.
Li, A. and Daggett, V. (1996). Identification and characterization of the unfolding transition state of chymotrypsin inhibitor 2 by molecular dynamics simulations. J. Mol. Biol., 257, 412-429.
Lopez-Hernandez, E. and Serrano, L. (1996). Structure of the transition state for folding of the 129 aa protein CheY resembles that of a smaller protein, CI-2. Fold. Design, 1, 43-55.
Ma, B. G., Chen, L. L. and Zhang, H. Y. (2007). What determines protein folding type? An investigation of intrinsic structural properties and its implications for understanding folding mechanisms. J. Mol. Biol., 370, 439-448.
Makarov, D. E., Keller, C. A., Plaxco, K. W. and Metiu, H. (2002). How the folding rate constant of simple, single-domain proteins depends on the number of native contacts. Proc. Natl Acad. Sci. USA, 99, 3535-3539.
Martinez, J. C. and Serrano, L. (1999). The folding transition state between SH3 domains is conformationally restricted and evolutionarily conserved. Nature Struct. Biol., 6, 1010-1016.
Martinez, J. C., Pisabarro, M. T. and Serrano, L. (1998). Obligatory steps in protein folding and the conformational diversity of the transition state. Nature Struct. Biol., 5, 721-729.
Matouschek, A., Kellis, J. T., Jr., Serrano, L. and Fersht, A. R. (1989). Mapping the transition state and pathway of protein folding by protein engineering. Nature, 340, 122-126.
Matouschek, A., Kellis, J. T., Jr., Serrano, L., Bycroft, M. and Fersht, A. R. (1990). Transient folding intermediates characterized by protein engineering. Nature, 346, 440-445.
Matthews, C. R. (1987). Effect of point mutations on the folding of globular proteins. Methods Enzymol., 154, 127-132.
Mayor, U., Guydosh, N. R., Johnson, C. M., Grossman, J. G., Sato, S., Jas, G. S., Freund, S. M. V., Alonso, D. O. V., Daggett, V. and Fersht, A. R. (2003). The complete folding pathway of a protein from nanoseconds to microseconds. Nature, 421, 863-867.
Mayor, U., Johnson, C. M., Daggett, V. and Fersht, A. R. (2000). Protein folding and unfolding in microseconds to nanoseconds by experiment and simulation. Proc. Natl Acad. Sci. USA, 97, 13518-13522.
McCallister, E. L., Alm, E. and Baker, D. (2000). Critical role of beta-hairpin formation in protein G folding. Nature Struct. Biol., 7, 669-673.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A. and Teller, E. (1953). Equation of state calculations by fast computing machines. J. Chem. Phys., 21, 1087-1092.
Mirny, L. A. and Shakhnovich, E. I. (1999). Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function. J. Mol. Biol., 291, 177-196.
Mirny, L. and Shakhnovich, E. (2001). Evolutionary conservation of the folding nucleus. J. Mol. Biol., 308, 123-129.
Moore, J. W. and Pearson, R. G. (1981). Kinetics and Mechanism. Wiley, New York.
Muñoz, V., Lopez, E. M., Jager, M. and Serrano, L. (1994). Kinetic characterization of the chemotactic protein from Escherichia coli, CheY. Kinetic analysis of the inverse hydrophobic effect. Biochemistry, 33, 5858-5866.
Muñoz, V. and Eaton, W. A. (1999). A simple model for calculating the kinetics of protein folding from three-dimensional structures. Proc. Natl Acad. Sci. USA, 96, 11311-11316.
Murzin, A. G., Brenner, S. E., Hubbard, T. and Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536-540.
Nölting, B. and Andert, K. (2000). Mechanism of protein folding. Proteins: Struct. Funct. Genet., 41, 288-298.
Oliveberg, M. (2001). Characterisation of the transition states for protein folding: towards a new level of mechanistic detail in protein engineering analysis. Curr. Opin. Struct. Biol., 11, 94-100.
Otzen, D. E., Kristensen, O., Proctor, M. and Oliveberg, M. (1999). Structural changes in the transition state of protein folding: alternative interpretations of curved chevron plots. Biochemistry, 38, 6499-6511.
Paci, E., Vendruscolo, M., Dobson, C. M. and Karplus, M. (2002). Determination of a transition state at atomic resolution from protein engineering data. J. Mol. Biol., 324, 151-163.
Perl, D., Welker, Ch., Schindler, Th., Schroder, K., Marahiel, M. A., Jaenicke, R. and Schmid, F. X. (1998). Conservation of rapid two-state folding in mesophilic, thermophilic and hyperthermophilic cold shock proteins. Nature Struct. Biol., 5, 229-235.
Plaxco, K. W., Larson, S., Ruczinski, I., Riddle, D. S., Thayer, E. C., Buchwitz, B., Davidson, A. R. and Baker, D. (2000). Evolutionary conservation in protein folding kinetics. J. Mol. Biol., 298, 303-312.
Plaxco, K. W., Guijarro, J. I., Morton, C. J., Pitkeathly, M., Campbell, I. D. and Dobson, C. M. (1998). The folding kinetics and thermodynamics of the Fyn-SH3 domain. Biochemistry, 37, 2529-2537.
Plaxco, K. W., Simons, K. T. and Baker, D. (1998). Contact order, transition state placement and the refolding rates of single domain proteins. J. Mol. Biol., 277, 985-994.
Privalov, P. L. (1979). Stability of proteins: small globular proteins. Adv. Protein Chem., 33, 167-241.
Ptitsyn, O. B. (1995). Molten globule and protein folding. Adv. Prot. Chem., 47, 83-229.
Ptitsyn, O. B. (1998). Protein folding and protein evolution: common folding nucleus in different subfamilies of c-type cytochromes? J. Mol. Biol., 278, 655-666.
Ptitsyn, O. B. and Ting, K.-L. (1999). Non-functional conserved residues in globins and their possible role as a folding nucleus. J. Mol. Biol., 291, 671-682.
Punta, M. and Rost, B. (2005). Protein folding rates estimated from contact predictions. J. Mol. Biol., 348, 507-512.
Riddle, D. S., Grantcharova, V. P., Santiago, J. V., Alm, E., Ruczinski, I. and Baker, D. (1999). Experiment and theory highlight role of native state topology in SH3 folding. Nature Struct. Biol., 6, 1016-1024.
Serrano, L., Matouschek, A. and Fersht, A. R. (1992). The folding of an enzyme. III. Structure of the transition state for unfolding of barnase analyzed by a protein engineering procedure. J. Mol. Biol., 224, 805-818.
Shakhnovich, E., Abkevich, V. and Ptitsyn, O. (1996). Conserved residues and the mechanism of protein folding. Nature, 379, 96-98.
Shakhnovich, E. I. (1998). Protein design: a perspective from simple tractable models. Fold. Des., 3, R45-R58.
Shakhnovich, E. (2006). Protein folding thermodynamics and dynamics: where physics, chemistry, and biology meet. Chem. Rev., 106, 1559-1588.
Shimada, J. and Shakhnovich, E. I. (2002). The ensemble folding kinetics of protein G from an all-atom Monte Carlo simulation. Proc. Natl Acad. Sci. USA, 99, 11175-11180.
Taddei, N., Chiti, F., Fiaschi, T., Bucciantini, M., Capanni, C., Stefani, M., Serrano, L., Dobson, C. M. and Ramponi, G. (2000). Stabilisation of alpha-helices by site-directed mutagenesis reveals the importance of secondary structure in the transition state for acylphosphatase folding. J. Mol. Biol., 300, 633-647.
Takada, S. (1999). Gō-ing for the prediction of protein folding mechanisms. Proc. Natl Acad. Sci. USA, 96, 11698-11700.
Taketomi, H., Ueda, Y. and Gō, N. (1975). Studies on protein folding, unfolding and fluctuations by computer simulation. I. The effect of specific amino acid sequence represented by specific inter-unit interactions. Int. J. Pept. Protein Res., 7, 445-459.
Ternstrom, T., Mayor, U., Akke, M. and Oliveberg, M. (1999). From snapshot to movie: phi analysis of protein folding transition states taken one step further. Proc. Natl Acad. Sci. USA, 96, 14854-14859.
Thirumalai, D. (1995). From minimal models to real proteins: time scales for protein folding kinetics. Journal de Physique I (France), 5, 1457-1467.
Tsai, J., Levitt, M. and Baker, D. (1999). Hierarchy of structure loss in MD simulations of src SH3 domain unfolding. J. Mol. Biol., 291, 215-225.
van Nuland, N. A., Chiti, F., Taddei, N., Raugei, G., Ramponi, G. and Dobson, C. M. (1998). Slow folding of muscle acylphosphatase in the absence of intermediates. J. Mol. Biol., 283, 883-891.
Vendruscolo, M., Paci, E., Dobson, C. M. and Karplus, M. (2001). Three key residues form a critical contact network in a protein folding transition state. Nature, 409, 641-645.
Viguera, A. R., Serrano, L. and Wilmanns, M. (1996). Different folding transition states may result in the same native structure. Nature Struct. Biol., 3, 874-880.
Voelz, V. A. and Dill, K. A. (2007). Exploring zipping and assembly as a protein folding principle. Proteins, 66, 877-888.
Wolynes, P. G. (1997). Folding funnels and energy landscapes of larger proteins within the capillarity approximation. Proc. Natl Acad. Sci. USA, 94, 6170-6175.
Yi, Q., Scalley-Kim, M. L., Alm, E. J. and Baker, D. (2000). NMR characterization of residual structure in the denatured state of protein L. J. Mol. Biol., 299, 1341-1351.
Zerovnik, E., Virden, R., Jerala, R., Turk, V. and Waltho, J. P. (1998). On the mechanism of human stefin B folding: I. Comparison to homologous stefin A. Influence of pH and trifluoroethanol on the fast and slow folding phases. Proteins, 32, 296-303.
In: Computational Biology: New Research Editor: Alona S. Russe
ISBN: 978-1-60692-040-4 © 2009 Nova Science Publishers, Inc.
Chapter 12
COMPUTATIONAL METHODS FOR PROTEIN STRUCTURAL CLASS PREDICTION

Susan Costantini1,2,∗ and Angelo M. Facchiano1,♣

1 Laboratorio di Bioinformatica e Biologia Computazionale, Istituto di Scienze dell’Alimentazione, CNR, via Roma 52 A/C, 83100 Avellino, Italy
2 Centro Ricerche Oncologiche di Mercogliano “Fiorentino Lo Vuolo”, via Ammiraglio Bianco, 83013 Mercogliano, Italy
Abstract

The structural class of a given protein represents the first level in the hierarchical structural classification. Its knowledge starts the progressive identification of the next levels, which allows the protein to be related to a family, in evolutionary as well as functional terms. A number of computational methods have been proposed to predict the structural class from the primary sequence. Most prediction methods use simple sequence representations, such as composition vectors and polypeptide composition, or more advanced representations that combine physico-chemical properties and sequence composition. Moreover, different classification algorithms, including neural networks, rough sets and logistic regression, as well as complex classification models, such as ensembles, bagging and boosting, have recently been applied. However, the accuracy of all these methods is strongly affected by sequence similarity. Some algorithms were tested on small datasets with a high percentage of sequence identity, which results in an overestimated prediction accuracy; on the other hand, low-similarity sequences pose a substantial challenge. The main aim of this paper is to present the state of the art in this field, describing some methods developed in recent years for the prediction of the protein structural class and underlining the need to use protein datasets of varying similarity, and new testing procedures, in order to correctly evaluate the quality and accuracy of new prediction methods.
∗ Corresponding author: Susan Costantini; Istituto di Scienze dell'Alimentazione – CNR, Via Roma 52A/C, 83100 Avellino, Italy; Email: [email protected]; Tel. +39 0825 299651; Fax: +39 0825 781585
♣ Angelo M. Facchiano; Email: [email protected]; Tel. +39 0825 299625; Fax: +39 0825 781585
1. Introduction

Since Levitt and Chothia [1] introduced the concept of protein structural class about three decades ago, it has been generally accepted that most globular proteins can be assigned to one of four structural classes: all-α, all-β, α/β and α+β (Figure 1). The all-α and all-β classes are formed by proteins that contain only α-helices and β-strands, respectively. The α/β class includes proteins with an alternation of α-helices and β-strands along the backbone, the latter typically forming parallel β-sheets. The α+β class includes proteins with segregated α-helices and β-strands, the latter typically forming antiparallel β-sheets. The most accurate classifications of protein structural classes can be found in the SCOP and CATH databases [2-3].
Figure 1. Examples of all-α (A), all-β (B), α/β (C) and α+β (D) proteins.
Knowledge of the protein structural class provides useful information for assigning a protein to families and superfamilies and for inferring its function. It is well known that the number of protein sequences reported in databases is of the order of 10^6, whereas the number of tertiary structures is of the order of 10^4. This is due to the slow process of experimental determination of the tertiary structure, whereas the
primary structure is determined by direct sequencing or by translation of gene sequences, and the genome projects have dramatically increased the number of protein sequences in the databases. Therefore, for most proteins the amino acid sequence is known, but there is no experimental knowledge of the tertiary structure and, therefore, of the structural class. For many of these proteins, membership in a protein family can be recognized by sequence similarity, and both structural organization and function can then be assigned. However, putative proteins identified by genome analyses may show no significant similarity to known proteins from other organisms, and such sequences are often reported in databases without a real name, just codes, and with annotations like “unknown protein” and/or “unknown function”. When the absence of sequence similarity makes it impossible to assign a protein to a known family, other structural information, starting with the structural class, can help to suggest its function. Moreover, knowledge of the protein structural class may provide useful information for further predictions about protein structure; in fact, the accuracy of secondary structure prediction can be significantly improved by incorporating knowledge of the structural class [4]. In recent years, numerous methods have been proposed for the computational prediction of the secondary structure content (see Section 4) and of the protein structural class (see Section 5 and Table 1), and they have been evaluated using different procedures (see Section 3).

Table 1. Methods for computational prediction of the protein structural class

Method                                        References
Simple sequence representations
  composition vectors                         [7, 11, 25-26]
  global description of amino acid sequence   [27]
  coupling effect among the amino acids       [17, 28-29]
  amino acid index                            [15]
  polypeptide composition                     [16, 30]
  pseudo AA composition                       [31]
Methods using classification algorithms
  unsupervised fuzzy clustering               [33]
  supervised fuzzy clustering                 [34]
  neural network                              [35-36]
  Bayesian classification                     [37]
  support vector machine                      [38-40]
  rough sets                                  [41]
  ensembles                                   [43]
  bagging                                     [45]
  boosting                                    [46-47]
  logistic regression                         [5, 42-44]
However, some of these algorithms were tested on small datasets with uncontrolled (often high) sequence similarity, which results in an overestimated prediction accuracy [5]. At the same time, the secondary structure of homologous sequences can be reliably predicted [6], and this information can be used to infer the corresponding structural class.
2. Structural Class Definitions

Over the years, different structural classifications have been proposed to describe a protein structure (Table 2). The most widely used classifications are those of Nakashima et al. (1986) and Chou (1995). Nakashima et al. [7] proposed to assign the structural class on the basis of the secondary structure content: proteins with more than 15% α-helix and less than 10% β-strand content are classified as all-α proteins; proteins with less than 15% α-helix and more than 10% β-strand content are classified as all-β proteins; proteins with more than 15% α-helix and more than 10% β-strand content are classified as mixed proteins, comprising α/β and α+β; the remaining proteins are classified as irregular. A similar criterion, but with different percentage values, has been proposed by Chou [8]. According to the criterion of Chou, all-α proteins have at least 40% α-helix content and less than 5% β-strand content; all-β proteins have at least 40% β-strand content and less than 5% α-helix content; mixed proteins (considering the combination of the α+β and α/β classes) contain more than 15% α-helix and more than 15% β-strand content; irregular proteins have less than 10% α-helix and β-strand content. A complete description of four other structural classifications is reported in Table 2 [9-12]. Two main problems exist with these kinds of structural classification. Firstly, although intuitive and simple to apply, this approach does not allow discrimination between the α/β and α+β classes. Moreover, the use of threshold values may assign to different classes two proteins with very similar secondary structure contents that lie near the threshold. For these reasons, more complex classification criteria have been developed, based on the structural comparison among protein structures, the calculation of similarity scores, and statistical approaches to cluster them. These approaches allow a deeper classification that can distinguish not only the structural class but also other sublevels. The larger classifications currently in use, such as SCOP [2], CATH [3] and DALIDD [13], define similar hierarchical classifications. The top level of the hierarchy is the structural class: this is a definition based on the content of α-helices and β-strands, or on the proportion between them. Therefore, the assignment of a protein to one of the four main classes depends mainly on the amount of amino acids folded into helices or strands. The four main classes include almost all the known 3D structures of proteins; a few other classes contain proteins with a low content of secondary structure elements and other irregular structures. The other levels of the hierarchy describe the architecture, the topology, and the homologous superfamily and family; the names of these levels explain the name CATH (Class, Architecture, Topology, Homologous superfamily).
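As an illustration, a minimal sketch of this kind of threshold-based assignment is shown below, following the criterion of Chou [8] described above; the function name and the handling of borderline contents that fall outside all thresholds are our own assumptions.

```python
def chou_structural_class(helix_pct: float, strand_pct: float) -> str:
    """Assign a structural class from secondary-structure content (%),
    following the threshold criterion of Chou [8]."""
    if helix_pct >= 40 and strand_pct < 5:
        return "all-alpha"
    if strand_pct >= 40 and helix_pct < 5:
        return "all-beta"
    if helix_pct > 15 and strand_pct > 15:
        return "mixed (alpha/beta or alpha+beta)"
    if helix_pct < 10 and strand_pct < 10:
        return "irregular"
    return "unassigned"  # borderline contents not covered by the thresholds

print(chou_structural_class(45.0, 2.0))   # -> all-alpha
```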
Table 2. Definitions of protein structural classes according to the content of secondary structure

Reference               Structural class   α-helix content (%)   β-strand content (%)
Sheridan et al. [9]     all-α              α > β
                        all-β                                    β > α
Nakashima et al. [7]    all-α              >15                   <10
                        all-β              <15                   >10
                        α/β and α+β        >15                   >10
Klein and DeLisi [10]   all-α              >40                   <5
                        all-β              <10                   >30
                        α/β and α+β        >15                   >15
Chou, P.Y. [11]         all-α              >45                   <5
                        all-β              <5                    >45
                        α/β and α+β        >30                   >20
Kneller et al. [12]     all-α              >30                   >5
                        all-β              <10
                        α/β and α+β        <15
Chou, K.C. [8]          all-α              >40                   <5
                        all-β              <5                    >40
                        α/β and α+β        >15                   >15
Currently the SCOP classification includes 11 classes: all-α proteins, all-β proteins, α/β proteins, α+β proteins, multi-domain proteins, membrane and cell surface proteins, small proteins, coiled-coil proteins, low-resolution proteins, peptides and designed proteins. While SCOP uses manual inspection of the structure for the assignment of the complete classification, the other methods are based on the calculation of similarity scores and statistical approaches. It is interesting to note that, despite the different criteria adopted, there is prevailing agreement among SCOP, CATH and DALIDD, also at the superfamily level [14].
3. Methods for Evaluating the Prediction Quality

The quality of the prediction of structural classes from sequences can be measured using three possible tests: resubstitution (or self-consistency), jackknife and 10-fold cross-validation.
The so-called resubstitution test is an examination of the self-consistency of a prediction algorithm. It predicts the secondary structural class of each protein in a given dataset using features derived from the same dataset, i.e. the training dataset. As a consequence, the properties derived from the training dataset also include information about the protein used in the test, and this certainly gives a somewhat optimistic error estimate. The jackknife test is also called the leave-one-out test: each protein in a given dataset is singled out in turn as a test sample, and the features used in the prediction are calculated by training on all the remaining proteins (without using this protein). In other words, the secondary structural class of each protein is predicted from the properties derived using all the other proteins except the one being predicted. During the jackknife analysis, each protein has one chance to be the test sample and is included in the training dataset for all other tests. The 10-fold cross-validation breaks the data (n proteins) into 10 sets of size n/10, trains on 9 of them and tests on the remaining one; the procedure is repeated 10 times and the mean accuracy is taken. The resubstitution test therefore leads to unrealistically high accuracies, while the jackknife test is perceived as very rigorous and reliable for evaluating the classification accuracy and generalization ability of the tested algorithms [15-17] but is, at the same time, computationally very expensive. Recent studies [5] showed that the 10-fold cross-validation procedure is computationally less demanding and can be used instead of the jackknife test, since there is no statistically significant difference between the two procedures.
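A minimal sketch of the two resampling schemes is given below, assuming scikit-learn is available; the dataset, labels and the nearest-neighbor classifier are placeholders standing in for any of the classifiers discussed in this chapter.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, KFold
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data: 20-D composition vectors and 4 structural-class labels.
rng = np.random.default_rng(0)
X = rng.random((120, 20))
y = rng.integers(0, 4, size=120)

def evaluate(splitter) -> float:
    """Accuracy under a given resampling scheme (jackknife or 10-fold CV)."""
    correct = 0
    for train, test in splitter.split(X):
        clf = KNeighborsClassifier(n_neighbors=3).fit(X[train], y[train])
        correct += (clf.predict(X[test]) == y[test]).sum()
    return correct / len(y)

print("jackknife: ", evaluate(LeaveOneOut()))
print("10-fold CV:", evaluate(KFold(n_splits=10, shuffle=True, random_state=0)))
```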
4. Prediction of Secondary Structure Content

The first method for the computational prediction of the secondary structure content was proposed in 1973. Krigbaum and Knutton [18] used multiple linear regression to obtain relationships for predicting the amount of secondary structure in a protein molecule from the knowledge of its amino acid composition. They tested these relations using 18 proteins of known structure, and predictions were also made for the two subchains of hemoglobin and for insulin. The average errors for the tested chains were: helix ±7.1%, sheet ±6.9%, turn ±4.2% and coil ±5.7%. Muskal and Kim [19] presented a method that uses two neural networks placed in “tandem” to predict the secondary structure content of water-soluble, globular proteins. The first of the two networks, NET1, predicted a protein's helix and strand content given information about the protein's amino acid composition, molecular weight and heme presence, while the second, NET2, learned to determine when NET1 was in a state of generalization. Together, these two networks produced prediction errors as low as 5.0% and 5.6% for helix and strand content, respectively. An improved multiple linear regression method was proposed by Zhang et al. [20-21] to predict the content of α-helix and β-strand of a globular protein based on its primary sequence. These authors used the amino acid composition and auto-correlation functions based on the hydrophobicity profile of the primary sequence. In detail, only the compositions of a subset of the amino acids and a subset of the auto-correlation functions were selected as regression terms, leading to the smallest prediction error. The resubstitution test showed that the average absolute errors were 0.052 and 0.047, with standard deviations of 0.050 and
0.047, for the prediction of helix and strand content, respectively. Moreover, the jackknife test showed that the average absolute errors were 0.058 and 0.053, with standard deviations of 0.057 and 0.053, for the prediction of helix and strand content, respectively. These results greatly improved those previously obtained by Eisenhaber et al. [22], who applied two vector decomposition methods for secondary structure content prediction from amino acid composition alone, with and without consideration of amino acid compositional coupling in the learning set of tertiary structures, and reached average absolute errors of 0.137, 0.126 and 0.114 for the prediction of helix, sheet and coil elements. Liu and Chou [23] developed a new algorithm for predicting the content of protein secondary structure elements that was based on the amino acid composition, in which the sequence coupling effects were explicitly included through a series of conditional probability elements. A remarkable improvement was obtained for predicting the contents of α-helix, β-sheet, β-bridge, 3₁₀-helix, π-helix, H-bonded turn, bend and random coil. The average absolute error for the prediction of all 10 secondary structure elements by the self-consistency test was 0.062, but, excluding the parallel and antiparallel β-strands, the overall average error became 0.028. These results showed that the incorporation of the sequence coupling effect can improve the prediction quality, as it does for structural class prediction methods. Recently, Homaeian et al. [24] proposed a novel method for the prediction of secondary structure content through a comprehensive sequence representation, called PSSC-core. The method uses a multiple linear regression model and introduces a comprehensive feature-based sequence representation to predict the amount of helices and strands for sequences from the twilight zone, i.e. sequences with low similarity (see Section 5.3 for the definition). In detail, the proposed feature-based sequence representation uses a comprehensive set of physicochemical properties custom-designed for each of the helix and strand content predictions. It also includes composition and composition moment vectors, the frequency of tetrapeptides associated with helical and strand conformations, various property-based groups (such as exchange groups, chemical groups of the side chains and a hydrophobic group), auto-correlations based on hydrophobicity, side-chain masses and hydropathy, and conformational patterns for β-sheets. The PSSC-core method was tested and compared with two other state-of-the-art prediction methods on a set of 2187 twilight-zone sequences, providing statistically significantly better results than the competing methods and reducing the prediction error by 5-7% for helix and 7-9% for strand content predictions.
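A minimal sketch of this family of regression approaches is shown below: ordinary least squares mapping amino acid composition to helix content. The training arrays are placeholders, not data from the cited studies.

```python
import numpy as np

# Placeholder training set: 20-D amino acid compositions -> helix fractions.
rng = np.random.default_rng(1)
comp = rng.dirichlet(np.ones(20), size=200)   # 200 hypothetical composition vectors
helix = rng.uniform(0.0, 0.6, size=200)       # their (assumed) helix fractions

# Least-squares fit of helix = comp @ w + b, as in regression-based predictors.
A = np.hstack([comp, np.ones((200, 1))])      # append intercept column
coef, *_ = np.linalg.lstsq(A, helix, rcond=None)

def predict_helix(composition: np.ndarray) -> float:
    """Predict helix fraction from a 20-D amino acid composition vector."""
    return float(composition @ coef[:-1] + coef[-1])

print(predict_helix(comp[0]))
```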
5. Methods of Secondary Structural Class Prediction

Most of the methods for predicting the protein structural class use relatively simple sequence representations (Table 1), such as composition vectors [7, 11, 25-26], a global description of the amino acid sequence [27], the coupling effect among the amino acids [17, 28-29], an amino acid index [15], the polypeptide composition [16, 30], or the pseudo amino acid composition and a complexity measure factor [31-32]. Different classification algorithms, including unsupervised and supervised fuzzy clustering [33-34], neural networks [35-36], Bayesian classification [37], support vector machines [38-40], rough sets [41] and logistic regression [5, 42-44], have been
already used. Recent works have also explored the application of complex classification models, such as ensembles [43], bagging [45] and boosting [46-47]. A comparison of the results obtained by some prediction methods on the 359 [17], 1189 [37] and 25PDB [48] datasets is reported in Tables 3 and 4.

Table 3. Results of some prediction methods obtained using the 359 and 1189 datasets, as reported by Kedarisetti et al. [32]

Dataset   Method                               References   Total (%)
359       Geometric classifier                 [15]         84.7
359       Component coupled                    [15]         90.5
359       Support vector machine               [47]         95.2
359       Information discrepancy classifier   [30]         95.8
359       StackingC ensemble                   [32]         96.4
359       Support vector machine               [5]          97
359       Instance-based classifier            [5]          97
1189      Support vector machine               [5]          52.3
1189      Bayes classifier                     [37]         53.8
1189      Logistic regression                  [5]          53.9
1189      StackingC ensemble                   [32]         58.9
Table 4. Results of some prediction methods obtained using the 25PDB dataset, as reported by Kurgan and Chen [44]

Method                                 References   α      β      α/β    α+β    Total
SVM, 1st order polyn. kernel           [5]          50.1   49.4   28.8   29.5   34.2
Multinomial logistic regression        [42]         56.2   44.5   41.3   18.8   40.2
Bagging with random tree               [45]         58.7   47.0   35.5   24.7   41.8
Information discrepancy, tripeptides   [30, 43]     45.8   48.5   51.7   32.5   44.7
LogitBoost with decision tree          [47]         56.9   51.5   45.4   30.2   46.0
Information discrepancy, dipeptides    [30, 43]     59.6   54.2   47.1   23.5   47.0
LogitBoost with decision stump         [45]         62.8   52.6   50.0   32.4   49.4
SVM, 3rd order polyn. kernel           [45]         61.2   53.5   57.2   27.7   49.5
SVM, Gaussian kernel                   [47]         68.6   59.6   59.8   28.6   53.9
Multinomial logistic regression        [5]          69.1   61.6   60.1   38.3   57.1
SVM with RBF kernel                    [32]         69.7   62.1   67.1   39.3   59.5
Multinomial logistic regression        [32]         71.1   65.3   66.5   37.3   60.0
StackingC ensemble                     [32]         74.6   67.9   70.2   32.4   61.3
LLSC-PRED                              [44]         75.2   67.5   62.1   44.0   62.2
SVM, Gaussian kernel                   [44]         76.5   64.6   63.3   44.9   62.3
SVM, 1st order polyn. kernel           [44]         77.4   66.4   61.3   45.4   62.7
5.1 Methods using simple sequence representation

5.1.1 Composition vectors
Some prediction methods use a simple composition-vector sequence representation and apply discriminant analysis with different distance measures. The composition vector is a 20-dimensional vector that represents the occurrence frequencies of the 20 amino acids. Example distance measures include the Euclidean distance [7], the Hamming distance [11, 25] and the component-coupled algorithm [49]. Suppose that N domains form a set S, which is the union of seven subsets, S = Sα ∪ Sβ ∪ Sα/β ∪ Sα+β ∪ Sμ ∪ Sσ ∪ Sρ, where the subset Sα consists of all-α domains, Sβ of all-β domains, Sα/β of α/β domains, Sα+β of α+β domains, Sμ of multi-domain proteins, Sσ of small proteins and Sρ of peptides. According to the correlation between the structural class of a protein domain and its amino acid composition, any domain in the set S corresponds to a vector, or a point, in the 20-D space, Xξk = [xξk,1, xξk,2, ..., xξk,20], where k = 1, 2, ..., Nξ and ξ = α, β, α/β, α+β, μ, σ, ρ. Here xξk,1, xξk,2, ..., xξk,20 are the normalized occurrence frequencies of the 20 amino acids in the kth domain Xξk of the subset Sξ, and Nξ is the number of domains that the subset contains. The standard vector for the subset Sξ is defined by Xξ = [xξ1, xξ2, ..., xξ20], where
x^{\xi}_{i} = \frac{1}{N_{\xi}} \sum_{k=1}^{N_{\xi}} x^{\xi}_{k,i}
with i = 1, 2, ..., 20. Given X, a protein domain whose structural class is to be predicted, it corresponds to a point (x1, x2, ..., x20) in the 20-D space, where xi has the same meaning as xξk,i but is associated with domain X instead of Xξk.

5.1.1.1 Euclidean distance
The squared Euclidean distance between the standard vector Xξ and the domain X in the 20-D space is given by

D^{2}_{E}(X, X^{\xi}) = \sum_{i=1}^{20} \left[ x_{i} - x^{\xi}_{i} \right]^{2}
with ξ = α, β, α/β, α+β, μ, σ, ρ, and the prediction is made according to the following formulation:

D^{2}_{E}(X, X^{\lambda}) = \min\{ D^{2}_{E}(X, X^{\alpha}), D^{2}_{E}(X, X^{\beta}), D^{2}_{E}(X, X^{\alpha/\beta}), D^{2}_{E}(X, X^{\alpha+\beta}), D^{2}_{E}(X, X^{\mu}), D^{2}_{E}(X, X^{\sigma}), D^{2}_{E}(X, X^{\rho}) \}

5.1.1.2 Hamming distance
The Hamming distance between the standard vector Xξ and the domain X in the 20-D space is

D_{H}(X, X^{\xi}) = \sum_{i=1}^{20} \left| x_{i} - x^{\xi}_{i} \right|
with ξ = α, β, α/β, α+β, μ, σ, ρ, and the domain X is predicted to belong to the structural class for which the corresponding Hamming distance has the least value:

D_{H}(X, X^{\lambda}) = \min\{ D_{H}(X, X^{\alpha}), D_{H}(X, X^{\beta}), D_{H}(X, X^{\alpha/\beta}), D_{H}(X, X^{\alpha+\beta}), D_{H}(X, X^{\mu}), D_{H}(X, X^{\sigma}), D_{H}(X, X^{\rho}) \}

where λ can be α, β, α/β, α+β, μ, σ or ρ.

5.1.1.3 Component-coupled algorithm
Both the Euclidean and Hamming distances are simple geometric distances in which coupling effects among different amino acid components are not taken into account. In contrast, the component-coupled algorithm is based on the squared Mahalanobis distance [50-51], defined by

D^{2}_{M}(X, X^{\xi}) = (X - X^{\xi})^{T} C^{-1}_{\xi} (X - X^{\xi})

where ξ can be α, β, α/β, α+β, μ, σ or ρ, Cξ is a covariance matrix, the superscript T is the transposition operator and C⁻¹ξ is the inverse matrix of Cξ. The matrix elements cξi,j are given by
c^{\xi}_{i,j} = \frac{1}{N_{\xi} - 1} \sum_{k=1}^{N_{\xi}} \left[ x^{\xi}_{k,i} - x^{\xi}_{i} \right] \left[ x^{\xi}_{k,j} - x^{\xi}_{j} \right], \qquad i, j = 1, \ldots, 20
The denominator Nξ − 1 can be ignored when the number of protein domains in each of the seven subsets Sξ is the same; in that case the original component-coupled algorithm is formulated as

D^{2}_{M}(X, X^{\lambda}) = \min\{ D^{2}_{M}(X, X^{\alpha}), D^{2}_{M}(X, X^{\beta}), D^{2}_{M}(X, X^{\alpha/\beta}), D^{2}_{M}(X, X^{\alpha+\beta}), D^{2}_{M}(X, X^{\mu}), D^{2}_{M}(X, X^{\sigma}), D^{2}_{M}(X, X^{\rho}) \}

However, when the subset sizes are different, some modification is needed and the prediction should be based on the corresponding Mahalanobis discriminant [50-52],

F_{M}(X, X^{\xi}) = D^{2}_{M}(X, X^{\xi}) + \ln \Pi_{\xi} - 2 \ln \Psi_{\xi} + \Lambda \ln(2\pi)

where Πξ is the product of all positive eigenvalues of Cξ, Ψξ is the prior probability of the subset Sξ and Λ is the dimension of the amino acid composition space. The last term in the above equation is a constant and can be ignored.
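The sketch below illustrates these three distance-based decision rules on composition vectors; the per-class training data are placeholders, and the small ridge term added before matrix inversion is our own numerical safeguard, not part of the published algorithms.

```python
import numpy as np

rng = np.random.default_rng(2)
classes = ["all-alpha", "all-beta", "alpha/beta", "alpha+beta"]
# Placeholder training data: 50 composition vectors per class.
train = {c: rng.dirichlet(np.ones(20), size=50) for c in classes}

means = {c: v.mean(axis=0) for c, v in train.items()}
# Ridge term keeps the covariance matrices invertible (assumed safeguard).
covs = {c: np.cov(v.T) + 1e-6 * np.eye(20) for c, v in train.items()}

def predict(x: np.ndarray, rule: str = "mahalanobis") -> str:
    """Assign x to the class minimizing the chosen distance to the standard vector."""
    def dist(c: str) -> float:
        d = x - means[c]
        if rule == "euclidean":
            return float(d @ d)                       # squared Euclidean distance
        if rule == "hamming":
            return float(np.abs(d).sum())             # sum of absolute differences
        return float(d @ np.linalg.inv(covs[c]) @ d)  # squared Mahalanobis distance
    return min(classes, key=dist)

query = rng.dirichlet(np.ones(20))
print(predict(query, "euclidean"), predict(query, "mahalanobis"))
```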
5.1.2 Global optimization from amino acid sequence
Zhang and Chou [26] developed a method that describes a protein as a vector in 20-dimensional space, v(x) = [v1(x), v2(x), ..., v20(x)]. This vector is decomposed into four component vectors, each corresponding to one of the four standard vectors v(α), v(β), v(α+β) and v(α/β), which represent the norms of the four protein structural classes. Therefore the vector v(x) can be expressed as

v(x) = aα v(α) + aβ v(β) + aα+β v(α+β) + aα/β v(α/β)

or, component by component, vi(x) = Σk ak vi(k), where the ak (k = α, β, α+β, α/β) are the four component coefficients to be determined. Because the relation for vi(x) comprises 20 equations in four unknowns, the authors adopted the least-squares method, which defines an objective function possessing a unique solution for this kind of overdetermined (contradictory) system. In this way the structural class of a given protein x is predicted by

aj = max(aα, aβ, aα+β, aα/β)

i.e. the class whose component coefficient is the largest. This method was tested on the 64 proteins investigated by Chou [11] and predicted correctly 100%, 80%, 71.4% and 75% of the all-α, all-β, α/β and α+β proteins, respectively. Especially for the all-α proteins, the rate of correct prediction obtained by this method was much higher than that obtained by Chou (i.e. 84.2%).
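A minimal sketch of this least-squares decomposition is given below; the four standard vectors are invented placeholders rather than the published norms of the structural classes.

```python
import numpy as np

rng = np.random.default_rng(3)
# Placeholder standard vectors v(alpha), v(beta), v(alpha+beta), v(alpha/beta),
# stacked as the columns of a 20 x 4 matrix.
V = rng.dirichlet(np.ones(20), size=4).T
class_names = ["all-alpha", "all-beta", "alpha+beta", "alpha/beta"]

def predict_class(vx: np.ndarray) -> str:
    """Decompose v(x) over the four standard vectors by least squares and
    assign the class with the largest component coefficient a_k."""
    a, *_ = np.linalg.lstsq(V, vx, rcond=None)   # solves V @ a ~= v(x)
    return class_names[int(np.argmax(a))]

print(predict_class(rng.dirichlet(np.ones(20))))
```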
5.1.3 Global description of amino acid sequence
Dubchak et al. [27] presented a method for predicting the protein folding class based on a global protein chain description and a voting process. Selection of the best descriptors was achieved by a computer-simulated neural network trained on a database consisting of 83 folding classes. The protein-chain descriptors included the overall composition, transition and distribution of amino acid attributes, such as relative hydrophobicity, predicted secondary structure and predicted solvent exposure. In fact, amino acids were divided into three groups based on hydrophobicity (hydrophobic, neutral and polar), three groups based on secondary structure prediction (helix, strand and coil), four groups based on consensus secondary structure prediction (helix, coil, strand and unknown), and two groups based on solvent accessibility (buried and exposed). Cross-validation testing was performed on 15 of the largest classes. The test showed that proteins were assigned to the correct class (correct positive prediction) with an average accuracy of 71.7%, whereas the inverse prediction of proteins as not belonging to a particular class (correct negative prediction) was 90-95% accurate. When the method was tested on 254 structures, the top two predictions contained the correct class in 91% of the cases.

5.1.4 Coupling effect among the amino acids
Bahar et al. [28] indicated that the coupling effects among different amino acid components, as originally formulated by K. C. Chou [8], are important for improving the prediction of protein structural classes. These authors proposed a compact lattice model to illuminate the
physical insight contained in the component-coupled algorithm. Several tests were conducted with various approaches on the SCOP database [2]. The results obtained by both the self-consistency and jackknife tests indicated that the overall rates of correct prediction by the algorithm incorporating the coupling effect among different amino acid components were significantly higher than those of the algorithms that do not account for such an effect. This is fully consistent with the knowledge that the folding of a protein is the result of a collective interaction among its constituent amino acid residues, and hence the coupling effects of different amino acid components must be incorporated in order to improve the prediction quality [53]. Moreover, Chou and Maggiora [17] also used the SCOP database to test different prediction algorithms, in order to determine whether the rate of correct structural class prediction could be significantly improved by taking into account the coupling effect among the different amino acid components of a protein. Their results, obtained using the resubstitution and jackknife tests, indicated that the overall rates of correct prediction by an algorithm incorporating the coupling effect were significantly higher than those of the algorithms that did not include such an effect; a consistent conclusion was also obtained when tests were performed on two large independent testing datasets classified into four and seven structural classes, respectively. In detail, the predicted results for the datasets from the SCOP database [2] indicated that: i) for the resubstitution test the improvement was about 40%; ii) for the jackknife test the improvement was about 15-17% when the dataset contained 138 domains, about 28-30% when it was expanded to 253 domains, and about 31-42% when it was expanded to 359 domains; iii) applying this method to an independent testing dataset of 510 domains classified into four structural classes, the improvement was about 39%; and iv) applying this method to an independent testing dataset of 2438 domains classified into seven structural classes, the improvement was about 30-34%.

In 1999 Chou [29] suggested that the interaction among the components of amino acid composition may play a considerable role in determining the structural class of a protein. To test this hypothesis quantitatively at a deeper level, three potential functions, U(0), U(1) and U(2), were formulated, representing respectively the 0th-order, 1st-order and 2nd-order approximations for the interaction among the components of the amino acid composition of a protein. The author observed that the correct rates in recognizing protein structural classes by U(2) were significantly higher than those by U(0) and U(1), indicating that an algorithm able to incorporate the interaction contributions more completely can yield better recognition quality. This demonstrated that the interaction among the components of amino acid composition is an important driving force in determining the structural class of a protein during the sequence folding process.
5.1.5 Amino acid index
Bu et al. [15] proposed a new formulation to predict the structural class of a protein from its primary sequence. Instead of the amino acid composition (AAC), they used in the structural class prediction the auto-correlation functions (ACF) based on the profile of an amino-acid index along the primary sequence of the query protein. In detail, the vector X was defined as
X = (r1, r2, ..., rm)T, where the ri (i = 1, 2, ..., m) are the auto-correlation functions and m is an integer to be determined. To calculate the auto-correlation functions, each residue in the primary sequence is replaced by its amino-acid index of Oobatake and Ooi [54], which is defined as the average non-bonded energy per residue. The replacement yields a numerical sequence h1, h2, ..., hN, where hi is the amino-acid index for the ith residue and N is the number of residues of the protein. The auto-correlation function rn is defined as
r_{n} = \frac{1}{N-n} \sum_{i=1}^{N-n} h_{i} h_{i+n}, \qquad n = 1, 2, \ldots, m

Using exactly the same database and the same algorithm as Chou and Maggiora [17], Bu et al. examined the effect of this replacement. The overall predictive accuracy was remarkably improved: for the same training database consisting of 359 proteins and the same component-coupled algorithm used by Chou and Maggiora [17], this method gave a jackknife-test accuracy about 5-7% higher than the accuracy based on the amino-acid composition alone. This result indicated that a significant improvement can be achieved by making full use of the information contained in the primary sequence. The improvement depends on the size of the training database, the auto-correlation functions selected and the amino-acid index used.
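A minimal sketch of this auto-correlation representation is shown below; the index values in the lookup table are invented placeholders, not the Oobatake-Ooi energies.

```python
import numpy as np

# Placeholder amino-acid index (the real method uses the Oobatake-Ooi
# average non-bonded energy per residue); values here are invented.
aa_index = {aa: v for aa, v in zip("ACDEFGHIKLMNPQRSTVWY",
                                   np.linspace(-1.0, 1.0, 20))}

def acf_features(sequence: str, m: int = 10) -> np.ndarray:
    """Auto-correlation functions r_1..r_m of the amino-acid-index profile:
    r_n = (1/(N-n)) * sum_{i=1..N-n} h_i * h_{i+n}."""
    h = np.array([aa_index[aa] for aa in sequence])
    n_res = len(h)
    return np.array([(h[:n_res - n] * h[n:]).mean() for n in range(1, m + 1)])

print(acf_features("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", m=5))
```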
5.1.6 Polypeptide composition
Luo et al. [16] presented a new approach based on the amino acid composition and the composition of several dipeptides, tripeptides, tetrapeptides, pentapeptides and hexapeptides, which are taken into account in a stepwise discriminant analysis. Let S(1) be the set composed of the 20 amino acids, namely S(1) = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}, and let S(i) be the set of all possible i-peptides, where an i-peptide is a string of i members of S(1). Therefore, S(2) = {AA, AC, AD, ..., AY, CA, CC, ..., YW, YY}, S(3) = {AAA, AAC, AAD, ..., AAY, ACA, ACC, ..., YYY}, etc. The number of elements in S(i) is 20^i (i = 1, 2, ...), and it grows at an exponential rate. To predict the structural class of each query sequence, a subset of S(i), denoted T(i), is constructed such that the result of prediction using T(i) is almost the same as using S(i), while the number of elements in T(i) is much smaller than 20^i. To form T(i), first a subset of T(i−1), denoted T0(i−1) and called the “seeded (i−1)-peptide(s)”, is constructed; then each amino acid in S(1) is added in front of and behind each element in T0(i−1) to obtain a set including T(i). Therefore, the algorithm 1) chooses “seeded residues”; 2) lengthens the “seeded i-peptides”,
constructing T(i+1); 3) chooses “seeded (i+1)-peptides”; 4) checks the discriminant results when the polypeptides are taken into account: the variables obtained by stepwise discrimination are the variables used in the final prediction, and R(i) is the highest predictive accuracy; otherwise the algorithm goes back to step 2 to lengthen the “seeded i-peptides”, in an iterative process. The results of the jackknife test showed that this approach can lead to higher predictive sensitivity and specificity on datasets of reduced sequence similarity. For the dataset PDB40-B constructed by Brenner and colleagues [55], 75.2% of the protein domain sequences were correctly assigned in the jackknife test for the four structural classes all-α, all-β, α/β and α+β, an improvement of 19.4% in the jackknife test and 25.5% in the resubstitution test with respect to the component-coupled algorithm using the amino acid composition alone (AAC approach) on the same dataset. In the cross-validation test with the dataset PDB40-J constructed by Park and colleagues [56], more than 80% predictive accuracy was obtained. Furthermore, for the dataset constructed by Chou and Maggiora, accuracies of 100% and 99.7% were easily achieved in the resubstitution and jackknife tests, respectively, merely taking into account the composition of dipeptides.

Jin et al. [30] proposed a new measure of information discrepancy (the FDOD measure) that was applied to protein structural class prediction. Differently from the methods based on the amino acid composition or on an amino acid index, this method is based on the comparison of subsequence distributions. These authors showed that the set of subsequence distributions of a protein (domain) involves the residue order along the sequence, and therefore incorporates more information about the sequence than its amino acid composition does. The 20 amino acids form the alphabet Σ = {a1, a2, ..., am} = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}, and Ψ = {S1, S2, ..., Ss} represents a set of sequences formed from Σ. For a sequence Sk belonging to Ψ, let Lk be its length, nlik the number of occurrences of the ith contiguous subsequence of length l in Sk, and plik = nlik/(Lk − l + 1). Therefore, if S is a sequence of L residues and the subsequence length l is given, the subsequence distribution of S is UlS = (pl1, pl2, ..., plm(l))T, where T is the transposition operator, Σi pli = 1 and m(l) = 20^l. When l = 1 and 20^1 = 20, UlS reduces to the conventional amino acid composition; when l = 2 and 20^2 = 400, U2S includes all the information of the first-order coupled composition introduced by Liu and Chou [29] for secondary structure content prediction. According to the construction of complete information sets, the longer the subsequence, the more information it includes; as a matter of fact, when l > 2, the residue order along a sequence is contained in its subsequence distribution set. Given the set of distributions of s sequences,

UlS1 = (pl11, pl21, ..., plm(l)1)
UlS2 = (pl12, pl22, ..., plm(l)2)
...
UlSs = (pl1s, pl2s, ..., plm(l)s)

where Σi plik = 1 and k = 1, 2, ..., s, the FDOD measure is defined as:
B(U^{l}_{S_1}, U^{l}_{S_2}, \ldots, U^{l}_{S_s}) = \sum_{k=1}^{s} \sum_{i=1}^{m(l)} p^{l}_{ik} \log \frac{p^{l}_{ik}}{\frac{1}{s} \sum_{k=1}^{s} p^{l}_{ik}}

B_{k}(U^{l}_{S_1}, U^{l}_{S_2}, \ldots, U^{l}_{S_s}) = \sum_{i=1}^{m(l)} p^{l}_{ik} \log \frac{p^{l}_{ik}}{\frac{1}{s} \sum_{k=1}^{s} p^{l}_{ik}}
B(UlS1, UlS2, ..., UlSs) denotes a measure of discrepancy among the s sequences, while Bk(UlS1, UlS2, ..., UlSs) measures the discrepancy between the kth sequence and all the other sequences in the group. Suppose that a set T is the union of four subsets of sequences (Tα, Tβ, Tα/β, Tα+β), where Tα consists of all-α proteins (domains), Tβ of all-β proteins (domains), and so forth. The formula for B is used to calculate the discrepancy between a query protein (domain) X and each group of proteins (domains): the discrepancy between Tα and X is denoted Bαx and, similarly, Bβx, Bα/βx and Bα+βx are obtained. Accordingly, the query protein (domain) X is assigned to class R when BRx = min{Bαx, Bβx, Bα/βx, Bα+βx}. For the same dataset as that used by Bu et al. [15], the predictive accuracies of the FDOD measure were higher than those of the previous approaches in both the resubstitution and jackknife tests. Results of the resubstitution test on the same dataset of 359 proteins for three different prediction methods indicated that the predictive accuracies improved quickly as the length of the subsequence increased from 1 to 3. With l = 3, the new method performed better than the previous ones; in fact, the overall rate of correct prediction was as high as 95.8%, better than the component-coupled algorithm and the ACF approach achieved on the same dataset. For the set of 1401 proteins (domains) with redundancy of no more than 30%, the accuracies of the resubstitution and jackknife tests reached 99.4% and 75.02%, respectively.
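The sketch below computes subsequence distributions and the FDOD discrepancy described above; the toy sequences, the subsequence length and the handling of zero counts are our own choices.

```python
import numpy as np
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def subseq_distribution(seq: str, l: int) -> np.ndarray:
    """Distribution U^l_S over all 20^l contiguous subsequences of length l."""
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=l))}
    u = np.zeros(len(index))
    for i in range(len(seq) - l + 1):
        u[index[seq[i:i + l]]] += 1.0
    return u / (len(seq) - l + 1)

def fdod(dists) -> float:
    """FDOD discrepancy B among s distributions: sum over sequences and
    subsequences of p * log(p / mean_p), with 0 * log(...) taken as 0."""
    mean_p = np.mean(dists, axis=0)
    total = 0.0
    for u in dists:
        mask = u > 0
        total += float(np.sum(u[mask] * np.log(u[mask] / mean_p[mask])))
    return total

s1, s2 = "MKTAYIAKQR", "MKTAYLAKQK"
print(fdod([subseq_distribution(s, 2) for s in (s1, s2)]))
```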
5.1.7 Pseudo AA composition
To improve the prediction quality for protein structural classification by effectively incorporating sequence-order effects, Xiao [31] developed a novel approach for measuring the complexity of a protein sequence, based on the concept of the pseudo-amino acid composition [57]. The pseudo-amino acid composition is obtained from a combination of a set of discrete sequence correlation factors and the 20 components of the conventional amino acid composition. The advantage of incorporating the complexity measure factor into the pseudo-amino acid composition as one of its components is that it can capture the essence of the overall sequence pattern of a protein and hence reflect its sequence-order effects more effectively. The jackknife cross-validation test demonstrated that the overall success rate of the new approach is significantly higher than those of the others.
5.2 Methods using classification algorithms

5.2.1 Unsupervised fuzzy c-means
Zhang et al. [33] proposed a new method, based upon fuzzy clustering, for predicting the structural class of a protein from its amino acid composition. Each of the structural classes was described by a fuzzy cluster and each protein was characterized by its membership degree, a number between 0 and 1, in each of the four clusters, with the constraint that the sum of the membership degrees equals unity. A given protein was classified as belonging to the structural class corresponding to the fuzzy cluster with maximum membership degree. Calculation of the membership degrees was carried out using the fuzzy c-means algorithm on a training set of 64 proteins. Results obtained for the training set showed that the fuzzy clustering approach produced results comparable with or better than those obtained by other methods. A test set of 27 proteins also produced results comparable to those obtained with the training set. The success of this work on protein structural class prediction suggested that further refinements of the method may lead to improved predictions.

5.2.2 Neural network
Metfessel et al. [35] proposed an approach that used amino acid composition and hydrophobic pattern frequency information as input to two types of neural network: a three-layer back-propagation network and a learning vector quantization network. They compared their results with those obtained from a Euclidean statistical clustering algorithm. To drive these algorithms the authors used the normalized frequencies of the 20 amino acid types and six hydrophobic amino acid patterns (i, i+2; i, i+3; i, i+2, i+4; i, i+1, i+4; i, i+3, i+4; i, i+5). The algorithms were trained on 56 training examples and tested on 8 examples. The best performing algorithm on the test sets was the learning vector quantization network using 17 inputs, obtaining a prediction accuracy of 80.2%. The results of Metfessel et al. [35] showed that the relevant information exists in protein primary sequences and is easily obtainable and useful for the prediction of protein structural class by neural networks as well as by standard statistical clustering algorithms. To predict the structural class of a protein, Cai et al. [36] applied T. Kohonen's self-organization model, a typical neural network model. In this work, the neural network method was applied to the SCOP database. As a result, high rates of self-consistency and jackknife tests were obtained. This indicates that the structural class of a protein is considerably correlated with its amino acid composition, and that the neural network method can become a powerful tool for predicting the structural classes of proteins. In detail, this method was tested on two datasets, of 277 and 498 domains, and the rates of correct prediction for the four structural classes were 93.5% and 94.6%, respectively. The corresponding rates of correct prediction by jackknife analysis for the four structural classes were 74.7% for the 277 domains and 89.2% for the 498 domains. These results indicated that, after being trained, the neural network grasped the complicated relationship between amino acid composition and protein structure.

5.2.3 Bayesian classification
Wang and Yuan [37] proposed a new method for predicting the structural class of a protein according to its amino acid composition, based on the normality assumption and the Bayes
decision rule for minimum error. Their detailed theoretical analysis indicated that, if the four protein folding classes are governed by normal distributions, their method yields the optimum predictive result in a statistical sense. A non-redundant data set of 1,189 protein domains was used to evaluate the performance of the new method. When both the self-consistency and jackknife tests were applied on a larger data set without significant pairwise similarity, the Authors found that the knowledge of amino acid composition alone cannot lead to a success rate higher than 60% for a 4-class prediction by their method. The apparently high accuracy level (more than 90%) attained in other studies, which exceeds the success rates (75%) of structural class predictions using traditional secondary structure prediction techniques (including those combining evolutionary information and neural networks), was due to the preselection of test sets, which may not be adequately representative of all unrelated proteins.
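A minimal sketch of this normality-assumption classifier follows: each class is modelled as a multivariate normal over composition vectors and a query is assigned by the maximum class-conditional log-density, i.e. the minimum-error Bayes rule with equal priors. The random Dirichlet data and the covariance regularisation are assumptions used only to make the sketch self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)
classes = ["all-alpha", "all-beta", "alpha/beta", "alpha+beta"]

# toy training compositions: 30 proteins per class, 20-D, rows sum to 1
train = {c: rng.dirichlet(np.ones(20), size=30) for c in classes}

def fit_gaussian(X, eps=1e-3):
    mu = X.mean(axis=0)
    # regularised covariance keeps the estimate invertible for small samples
    cov = np.cov(X, rowvar=False) + eps * np.eye(X.shape[1])
    return mu, np.linalg.inv(cov), np.linalg.slogdet(cov)[1]

models = {c: fit_gaussian(X) for c, X in train.items()}

def log_density(x, model):
    """Log of the class-conditional normal density, up to a shared constant."""
    mu, inv, logdet = model
    d = x - mu
    return -0.5 * (d @ inv @ d) - 0.5 * logdet

query = rng.dirichlet(np.ones(20))
print(max(classes, key=lambda c: log_density(query, models[c])))
```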
5.2.4 Support vector machine
Cai et al. [38] applied a new machine learning method, the so-called Support Vector Machine (SVM) method, to predict the protein structural class. SVM is a kind of learning machine based on statistical learning theory. First, the input vectors are mapped into a feature space, either linearly or non-linearly. Then, within that feature space, an optimized linear division is sought, i.e. a hyperplane is constructed that separates two classes (the scheme can be extended to the multi-class case). This SVM method was applied to the SCOP database, in which protein domains are classified on the basis of known structures, evolutionary relationships and the principles that govern their three-dimensional structure. The accuracy of the method was tested by self-consistency and jackknife tests on the two datasets of Zhou [53] comprising 277 and 498 domains. The rates of correct prediction obtained from the self-consistency test reached 100% for both datasets. In the jackknife test the overall rates of correct prediction were 79.4% and 93.2% for the 277 and 498 domains, respectively. These results suggested that the structural class of a protein is considerably correlated with its amino acid composition. The Authors underlined that the SVM method and the component-coupled method, if complemented with each other, can provide a powerful computational tool for predicting the structural classes of proteins. Subsequently, in 2003 Cai et al. [39] tested the SVM method on four datasets from Chou and Maggiora [17] comprising 138, 253, 359 and 1601 domains. The overall rates of correct prediction were 57%, 83%, 95% and 84% on the four datasets, always higher than those obtained by Chou and Maggiora [17]. These authors also examined the quality of their method by an independent-dataset test, using 225 protein domains for training and 510 domains for testing, and reached a 95% overall rate of correct prediction. Recently, Chen et al. [40] developed a computational prediction method featured by employing a support vector machine learning system and by using a different pseudo-amino acid composition (PseAA) which takes the sequence-order effects into account to represent protein samples. In detail, PseAA was defined in a (20+λ)-D space in which the first 20 components are the normalized occurrence frequencies of the 20 native amino acids in a given protein and the other components are associated with λ different ranks of sequence-order correlations. The correlation function was given by Θ(Ri, Rj) = H(Ri) × H(Rj), where H(Ri) and H(Rj) are the hydrophobicity values of residues Ri and Rj taken from Kyte and Doolittle [58]. The jackknife
test was performed on a dataset containing 204 non-homologous proteins constructed by Chou [29]. The overall rate was 85.3%, about 10% higher than those obtained by the second-order component-coupled algorithm [29] and the supervised fuzzy clustering [34].
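A minimal sketch of this (20+λ)-D PseAA representation fed to an SVM follows. The Kyte–Doolittle hydropathy values are as published [58]; the weight w, the normalisation, the RBF kernel and the random toy labels are assumptions of the sketch, not the exact protocol of Chen et al.:

```python
import numpy as np
from sklearn.svm import SVC

KD = {  # Kyte-Doolittle hydropathy values [58]
    'A': 1.8, 'C': 2.5, 'D': -3.5, 'E': -3.5, 'F': 2.8, 'G': -0.4,
    'H': -3.2, 'I': 4.5, 'K': -3.9, 'L': 3.8, 'M': 1.9, 'N': -3.5,
    'P': -1.6, 'Q': -3.5, 'R': -4.5, 'S': -0.8, 'T': -0.7, 'V': 4.2,
    'W': -0.9, 'Y': -1.3}
AA = sorted(KD)

def pseaa(seq, lam=5, w=0.1):
    f = np.array([seq.count(a) for a in AA], float) / len(seq)
    # rank-j correlation factor: mean of Θ(R_i, R_{i+j}) = H(R_i) * H(R_{i+j})
    theta = np.array([np.mean([KD[seq[i]] * KD[seq[i + j]]
                               for i in range(len(seq) - j)])
                      for j in range(1, lam + 1)])
    vec = np.concatenate([f, w * theta])
    return vec / np.abs(vec).sum()  # simple normalisation (an assumption)

rng = np.random.default_rng(1)
seqs = ["".join(rng.choice(list(AA), size=60)) for _ in range(40)]
y = rng.integers(0, 4, size=40)          # toy labels for the four classes
X = np.array([pseaa(s) for s in seqs])
clf = SVC(kernel="rbf").fit(X, y)        # one-vs-one multi-class by default
print(clf.predict(X[:5]))
```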
5.2.5 Supervised fuzzy clustering
Shen et al. [34] developed a novel approach, the so-called "supervised fuzzy clustering approach", featured by utilizing the class label information during the training process. Based on this approach, a set of "if-then" fuzzy rules for predicting the protein structural classes was extracted from a training dataset. It was demonstrated on two different working datasets that the overall success prediction rates obtained by the supervised fuzzy clustering approach were all higher than those obtained by the unsupervised fuzzy c-means introduced by previous investigators [33]. It is anticipated that this predictor may play an important complementary role to other existing predictors in this area, further strengthening the power of predicting the structural classes of proteins and their other characteristic attributes. In detail, the success rate of the re-substitution test applied to the dataset of 64 protein entries originally constructed by Chou [11] was 84.4%, better than that of the unsupervised fuzzy clustering approach (unsupervised FCM). This stems from the fact that the unsupervised fuzzy clustering approach neglects the class information of each sample during the training process, thereby missing important guidance information. Moreover, using the dataset constructed by Chou [29] to test the prediction quality, the success rates of the supervised fuzzy clustering approach in the re-substitution and jackknife tests were 87.25% and 73.5%, respectively, superior to the unsupervised approach.

5.2.6 Boosting
Feng et al. [46] introduced the so-called LogitBoost classifier to predict the structural class of a protein domain according to its amino acid sequence. LogitBoost is featured by introducing a log-likelihood loss function to reduce the sensitivity to noise and outliers, and by performing classification via the combination of many weak classifiers into a very strong and robust classifier. The prediction was made on the two datasets taken from Zhou [53] by the jackknife test. The success rates, 84.1% and 94.8% for the 277 and 498 domains, were higher than the corresponding rates obtained by the neural network [36] and SVM [38]. Cai et al. [47] used LogitBoost, which performs classification using a regression scheme as the base learner, can handle multi-class problems and is particularly superior in coping with noisy data. The dataset used was that of 204 protein chains taken from Chou [29]. LogitBoost achieved 100% and 83.82% overall success rates in the self-consistency and jackknife tests, respectively. The LogitBoost jackknife result was about 8% higher than that of SVM.

5.2.7 Rough sets
Cao et al. [41] developed a new method for the prediction of protein structural class based on the Rough Sets algorithm, a rule-based data mining method. In detail, rough set theory is a machine learning method and a tool for representing and reasoning about imprecise and uncertain data. It distinguishes between objects based on the concept of indiscernibility, deals with the approximation of sets by means of binary relations and constitutes a mathematical
framework for inducing minimal rules from training examples. Cao et al. [41] used as conditional attributes for the construction of the decision system the amino acid composition and eight physicochemical properties, i.e. positive charge (KR), negative charge (ED), total charge (KRED), net charge (KR-ED), major hydrophobic (LVIFM), and the groupings ST, AGP and FIKMNY. After reducing the decision system, decision rules are generated, which can be used to classify new objects. Self-consistency and jackknife tests on the two datasets constructed by Zhou [53], comprising 277 and 498 domains, were used to verify the performance of this method. The self-consistency tests indicated that all the percentages of correct prediction reached 100%, the same as the results of the SVM method [38]. The results of the jackknife tests, i.e. 79.4% and 90.8% for the 277 and 498 domains, showed that the performance of rough sets exceeded that of the component-coupled algorithm and neural networks and was on a par with the SVM algorithm. These results showed that the rough sets approach is very promising and may play a complementary role to the existing powerful approaches, such as the component-coupled, neural network, SVM and LogitBoost approaches.
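A minimal sketch of these eight conditional attributes follows; encoding each grouping as a simple fraction of sequence length, and the net charge as a signed difference, is an assumption about their exact representation in the decision system:

```python
GROUPS = ["KR", "ED", "KRED", "LVIFM", "ST", "AGP", "FIKMNY"]

def conditional_attributes(seq):
    """Fractions of residues falling in each physicochemical grouping."""
    n = len(seq)
    attrs = {g: sum(seq.count(a) for a in g) / n for g in GROUPS}
    # net charge (KR-ED) as the signed difference of the two charge fractions
    attrs["KR-ED"] = attrs["KR"] - attrs["ED"]
    return attrs  # eight attributes, joined with the 20-D composition vector

print(conditional_attributes("MKRLLEDSTAGPVIFMKYNE"))
```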
5.2.8 Bagging
Dong et al. [45] introduced Bagging (bootstrap aggregating) for classifying and predicting protein structural classes. By a bootstrap aggregating procedure, Bagging can improve a weak classifier, for instance the random tree method, bringing it a significant step towards optimality (a minimal code sketch is given after the next subsection). The dataset used was that of 204 protein chains taken from Chou [29]. This method achieved 100% accuracy in the resubstitution test. In 10-fold cross-validation, Bagging performed at least as well as LogitBoost and nearly as well as SVM: the overall rates were 82.8% for SVM, 78.9% for LogitBoost and 82.3% for Bagging. From these results it is evident that SVM was slightly better than Bagging and LogitBoost. Bagging was the best predictor for the α+β class.

5.2.9 Ensembles
Kedarisetti et al. [43] proposed an ensemble classification method and a compact feature-based sequence representation. In detail, these Authors began by creating an extensive feature-based sequence representation (including the composition vector, autocorrelations of hydrophobicity indices, etc.) and then reduced the dimensionality of the feature space via feature selection. The resulting feature vectors were fed to four heterogeneous classifiers, each of which outputs a predicted structural class. Finally, these predictions were combined by a specialized "voting" module that outputs the final prediction. In this study a total of six datasets characterized by different sequence similarities were considered. Of these, four datasets with strictly controlled sequence similarity, i.e. 25PDB, 50PDB, 70PDB and 90PDB with sequence similarities of 25%, 50%, 70% and 90%, respectively, were used for the design of the prediction method. The 25PDB dataset [48] and the remaining two datasets, i.e. 359 [17] and 1189 [37], were used to compare the designed method with other competing methods. The 359 dataset includes highly similar sequences (over 95% sequence identity), while the 1189 and 25PDB datasets are both low sequence identity datasets, i.e. 40% and 25%, respectively. This method, indicated as the StackingC ensemble, improved prediction accuracy for the four main structural classes compared to competing methods, and provides highly accurate predictions for sequences of widely varying homologies (see Table 3). The experimental
evaluation of the proposed method showed superior results across the entire spectrum of sequence similarity, ranging from 25% to 90%. The error rates were reduced by over 20% when compared with individual prediction methods and with the most commonly used composition vector representation of protein sequences. Comparisons with competing methods on three large benchmark datasets consistently showed the superiority of the proposed method.
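As an illustration of the bagging procedure of Section 5.2.8, the following minimal sketch bags a shallow decision tree and evaluates it by 10-fold cross-validation. The random composition features and labels are placeholders, and the choice of a depth-limited tree as the "random tree" weak learner is an assumption of this sketch:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.dirichlet(np.ones(20), size=204)   # 204 toy composition vectors
y = rng.integers(0, 4, size=204)           # four structural classes

# bootstrap-aggregated ensemble of 50 weak trees
bag = BaggingClassifier(DecisionTreeClassifier(max_depth=3),
                        n_estimators=50, random_state=0)
scores = cross_val_score(bag, X, y, cv=10)  # 10-fold CV, as in the paper
print(scores.mean())
```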
5.2.10 Logistic regression
In 2006 Kurgan and Homaeian [5] revisited and reevaluated the state of the art in the field of the prediction of the structural class of proteins. To this end, they performed a first-of-its-kind comprehensive and multi-goal study, including the investigation of eight prediction algorithms, three protein sequence representations, three datasets with different homologies and three test procedures. The quality of several previously unused prediction algorithms, of a newly proposed sequence representation, and of a new-to-the-field testing procedure was evaluated. In detail, these Authors proposed a novel representation including a variety of features related to AA composition, position, hydrophobicity and weight. These features were selected based on their prior successful application in protein structure prediction. The composition vector and the composition moment vector were defined based on the count and position of residues in the sequence. Moreover, another amino acid property found useful for structure prediction was the chemical composition of the side chains. The composition of each of the chemical groups was computed for a protein sequence, i.e. all residues that have a given chemical group are counted, and the resulting vector constitutes a set of 19 features for the prediction. A feature selection study was conducted to select a subset of the features most relevant to structural class prediction. The 25PDB and 1189 datasets were used together with three feature selection methods: 1) the Feature Subset Consistency (FSC) method, which selects a subset of features using a probabilistic filter-based approach that uses a Las Vegas algorithm to search through different feature subsets [59]; 2) the Wrapper Subset Selection (WSS) method, a classification-based wrapper that uses the Naïve Bayes algorithm [60]; 3) the Feature Correlation (FC) method, which selects a subset of features based on their correlation with the class while maintaining low inter-correlation between the selected features [61]. The results of the different feature selection algorithms were consistent and showed that the composition vector and the molecular weight are strongly related to structural classes; the first-order composition moment vector, the autocorrelation based on hydrophobicity and the chemical groups were related to structural classes, while no features based on the energy index autocorrelations were selected. The chemical composition feature set contained seven features that were never selected, but at the same time its average number of folds was similar to that of the first-order composition moment vector and hydrophobicity autocorrelation feature sets, and thus it was also selected. Finally, the energy index autocorrelation set proved relatively less valuable, since its average number of folds was at least an order of magnitude lower than those of the other sets. Therefore the proposed representation included 66 features: the 20-dimensional composition vector, the 20-dimensional composition moment vector, the 19-dimensional chemical group composition vector, the 6-dimensional hydrophobic autocorrelations, and the sequence molecular weight. This representation was compared with two other published representations: (1) the 20-dimensional composition vector and (2) the 30-dimensional energy autocorrelations.
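A minimal sketch of part of this representation follows: the composition vector, a first-order composition moment vector, hydrophobicity autocorrelations at six lags, and the molecular weight. The exact moment normalisation is an assumption, and the 19 chemical-group features are omitted for brevity:

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
KD = dict(zip(AA, [1.8, 2.5, -3.5, -3.5, 2.8, -0.4, -3.2, 4.5, -3.9, 3.8,
                   1.9, -3.5, -1.6, -3.5, -4.5, -0.8, -0.7, 4.2, -0.9, -1.3]))
MW = dict(zip(AA, [89.1, 121.2, 133.1, 147.1, 165.2, 75.1, 155.2, 131.2,
                   146.2, 131.2, 149.2, 132.1, 115.1, 146.2, 174.2, 105.1,
                   119.1, 117.1, 204.2, 181.2]))  # amino acid weights, g/mol

def features(seq):
    L = len(seq)
    comp = np.array([seq.count(a) for a in AA]) / L
    # first-order composition moment vector: position-weighted composition
    moment = np.array([sum(j + 1 for j, r in enumerate(seq) if r == a)
                       for a in AA]) / (L * (L + 1) / 2)
    h = np.array([KD[r] for r in seq])
    h = h - h.mean()
    # hydrophobicity autocorrelations at lags 1..6
    auto = np.array([np.mean(h[:-k] * h[k:]) for k in range(1, 7)])
    weight = sum(MW[r] for r in seq)
    return np.concatenate([comp, moment, auto, [weight]])

print(features("MKVLAAGIVTREEQLKALEEKLKALE").shape)  # 20+20+6+1 = 47 features
```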
Eight different classification algorithms were used to perform structural class prediction, including some of the previously used algorithms, such as Bayesian classification (Naïve Bayes) [62], nearest neighbour (instance-based learning algorithm) [63], support vector machines [64], decision trees (C4.5 and random forest) [65-66], rule-based algorithms (RIPPER) [67], neural networks (RBF network) [68], and logistic regression [69]. From this comparative study several important conclusions and findings were drawn. First, the logistic regression classifier was shown to perform better than the other prediction algorithms, and the high quality of the previously used support vector machines was confirmed. The results also showed that the proposed new sequence representation improved the accuracy of the high-quality prediction algorithms. Their study also showed that the commonly used jackknife test is computationally expensive; therefore the computationally less demanding 10-fold cross-validation procedure was proposed, with the observation that there is no statistically significant difference between the two procedures. Their experiments showed that sequence identity has a very significant impact on the prediction accuracy: using highly similar datasets results in higher accuracies. The best achieved prediction accuracy for low sequence identity datasets was about 54%, which confirms the results reported by Wang and Yuan [37] (Table 4). For a highly similar dataset, instance-based classification achieved 97% prediction accuracy, demonstrating that similarity is a major factor that can result in overestimated prediction accuracy (Table 3). In 2007 Jahandideh et al. [42] used the multinomial logistic regression model to evaluate the contribution of sequence parameters in determining the protein structural class. These authors generated, using an in-house program, parameters including single amino acid and all dipeptide composition frequencies. The most effective parameters were selected by multinomial logistic regression, a generalization of the logistic regression model. The selected variables were V among the single amino acid composition frequencies and AG, CR, DC, EY, GE, HY, KK, LD, LR, PC, QM, QT, SW, VN and WN among the dipeptide composition frequencies. Then, a neural network was constructed and fed with the parameters selected by multinomial logistic regression to build a hybrid predictor. This method was tested on the database made by Zhou [53] containing 498 domains. All the percentages of correct prediction reached 100% in the resubstitution test, the same result as the SVM and rough sets based methods [38, 41]. These results indicated that the hybrid model captured the relationship between sequences and their classes through single amino acid and dipeptide composition frequencies. The jackknife tests showed that the hybrid method achieved a 94.4% overall prediction rate, higher than both the SVM and rough sets based methods, i.e. 93.2% and 90.8%.
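A minimal sketch contrasting the two test procedures on a logistic regression classifier follows; the random composition data are placeholders, serving only to show the relative computational cost (one model fit per sample for the jackknife versus ten fits for 10-fold cross-validation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, LeaveOneOut

rng = np.random.default_rng(3)
X = rng.dirichlet(np.ones(20), size=200)  # 200 toy composition vectors
y = rng.integers(0, 4, size=200)          # four structural classes

clf = LogisticRegression(max_iter=1000)
jk = cross_val_score(clf, X, y, cv=LeaveOneOut())  # jackknife: 200 fits
cv10 = cross_val_score(clf, X, y, cv=10)           # 10-fold CV: 10 fits
print(jk.mean(), cv10.mean())
```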
5.3 Structural class predictions for twilight zone sequences
In 1986 Doolittle [70] coined the term twilight zone to identify protein pairs with 15-25% sequence identity. Kurgan and Chen [44] developed an accurate method, LLSC-PRED, for the in silico prediction of structural classes from low similarity (twilight zone) protein sequences. This method applied a linear logistic regression classifier and a custom-designed, feature-based sequence representation to provide predictions. LLSC-PRED was based on a comprehensive representation that included 58 features describing the composition and physicochemical properties of the sequences, and on the transparency of the prediction model. This
representation also included the predicted secondary structure contents, contentHf and contentEf, where H corresponds to helix content, E to strand content and f to the prediction method, i.e., the methods by Lin and Pan [71] and by Zhang and colleagues [21]. Based on tests performed on a large set of 1673 twilight zone domains, LLSC-PRED's prediction accuracy, which exceeds 62%, was better than that of 13 recently published competing in silico methods and similar to the accuracy of other, non-transparent classifiers that use the proposed representation (see Table 4). Moreover, the only other comparable results were generated by using SVM on the proposed representation. Although LLSC-PRED and SVM shared similar accuracy, the linear logistic regression model is transparent and easy to interpret, while the SVM models are virtually impossible to comprehend. The other competing methods obtained accuracies ranging between 35% and 60%. The only two competing methods that reached 60% accuracy were also based on a custom-designed representation that included both composition and physicochemical properties [43]. The most accurate predictions concerned the all-α class (75-77% accuracy), while the best results for the all-β and α/β classes ranged between 65% and 67% and between 61% and 63%, respectively. The lowest accuracy (44-45%) was obtained for the α+β class. The main reason for the good performance on the all-α class was that these sequences are helix-rich and helical structures are the easiest to predict. Therefore, this paper highlighted that simple representations result in low accuracy for twilight zone proteins and underlined the importance of the sequence representation; in fact, future improvements could come from designing more advanced representations rather than from using more advanced classification methods.
6. Conclusion
The prediction of protein structural classes is a very important and challenging problem. In fact, knowledge of the structural class of a new protein provides useful information for assigning it to families and superfamilies, for inferring its function, and for improving the accuracy of secondary structure prediction. In the last three decades many attempts have been made to propose such prediction methods, and on the basis of these past works progress has been made. Usually newly proposed methods were compared with competing methods using datasets differing in size and sequence similarity, different sequence representations and different test procedures. This makes an evaluation of the true state of the art of this prediction problem difficult. On the basis of extensive experimental studies, three important conclusions can be drawn. First, sequence similarity is found to significantly affect the prediction accuracy. New algorithms should therefore not be compared using datasets of unknown and different similarity, as the results for highly similar datasets are significantly higher than those for datasets with a low identity percentage. For these reasons, tests should be performed using low similarity and standard (benchmark) datasets. Second, the resubstitution tests are shown to overestimate the prediction accuracy, but the jackknife test procedure leads to unnecessarily high computational demand, while the 10-fold cross-validation, being simpler in terms of computational load, is not significantly different
from the jackknife test. Therefore, the 10-fold cross-validation test should be used in future studies. Finally, the best results were generated by the prediction algorithms based on logistic regression and support vector machine classifiers (Tables 3 and 4).
References

[1] Levitt, M.; Chothia, C. (1976) Structural patterns in globular proteins. Nature, 261, 552-557.
[2] Murzin, A.; Brenner, S.; Hubbard, T.; Chothia, C. (1995) SCOP: a structural classification of protein database for the investigation of sequence and structures. J. Mol. Biol., 247, 536-540.
[3] Orengo, C.A.; Michie, A.D.; Jones, S.; Jones, D.T.; Swindells, M.B.; Thornton, J.M. (1997) CATH - A Hierarchic Classification of Protein Domain Structures. Structure, 5, 1093-1108.
[4] Gromiha, M.; Selvaraj, S. (1998) Protein secondary structure prediction in different structural classes. Protein Engineering, 11, 249-251.
[5] Kurgan, L.; Homaeian, L. (2006) Prediction of structural classes for protein sequences and domains - impact of prediction algorithms, sequence representation and homology, and test procedures on accuracy. Pattern Recognition, 39, 2323-2343.
[6] Pollastri, G.; McLysaght, A. (2005) Porter: a new, accurate server for protein secondary structure prediction. Bioinformatics, 21, 1719-1720.
[7] Nakashima, H.; Nishikawa, K.; Ooi, T. (1986) The folding type of a protein is relevant to the amino acid composition. J. Biochem., 99, 152-162.
[8] Chou, K.C. (1995) A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space. Proteins, 21, 319-344.
[9] Sheridan, R.P.; Dixon, J.S.; Venkataranghavan, R.; Kuntz, I.D.; Scott, K.P. (1985) Amino acid composition and hydrophobicity patterns of protein domains correlate with their structures. Biopolymers, 24, 1995-2023.
[10] Klein, P.; Delisi, C. (1986) Prediction of protein structural class from amino acid sequence. Biopolymers, 25, 1659-1672.
[11] Chou, P.Y. (1989) Prediction of Protein Structure and the Principles of Protein Conformation. In: G.D. Fasman (Ed.), pp. 549-586. Plenum Press, New York.
[12] Kneller, D.G.; Cohen, F.E.; Langridge, R. (1990) Improvements in protein secondary structure prediction by an enhanced neural network. J. Mol. Biol., 214, 171-182.
[13] Holm, L.; Sander, C. (1998) Dictionary of recurrent domains in protein structures. Proteins, 33, 88-96.
[14] Hadley, C.; Jones, D.T. (1999) A systematic comparison of protein structure classifications: SCOP, CATH and FSSP. Structure Folding and Design, 7, 1099-1112.
[15] Bu, W.S.; Feng, Z.P.; Zhang, Z.; Zhang, C.T. (1999) Prediction of protein (domain) structural classes based on amino-acid index. Eur. J. Biochem., 266, 1043-1049.
[16] Luo, R.; Feng, Z.; Liu, J. (2002) Prediction of protein structural class by amino acid and polypeptide composition. Eur. J. Biochem., 269, 4219-4225.
[17] Chou, K.C.; Maggiora, G.M. (1998) Domain structural class prediction. Protein Eng., 11, 523-538.
[18] Krigbaum, W.R.; Knutton, S.P. (1973) Prediction of the amount of secondary structure in a globular protein from its amino acid composition. PNAS, 70, 2809-2813.
[19] Muskal, S.M.; Kim, S.H. (1992) Predicting protein secondary structure content. A tandem neural network approach. J. Mol. Biol., 225, 713-727.
[20] Zhang, C.T.; Lin, Z.S.; Zhang, Z.; Yan, M. (1998) Prediction of helix/strand content of globular proteins based on their primary sequences. Protein Eng., 11, 971-979.
[21] Zhang, Z.D.; Sun, Z.R.; Zhang, C.T. (2001) A new approach to predict the helix/strand content of globular proteins. Journal of Theoretical Biology, 208, 65-78.
[22] Eisenhaber, F.; Imperiale, F.; Argos, P.; Frommel, C. (1996) Prediction of secondary structural content of proteins from their amino acid composition alone. I. New analytic vector decomposition methods. Proteins, 25, 157-168.
[23] Liu, W.M.; Chou, K.C. (1999) Prediction of protein secondary structure content. Protein Eng., 12, 1041-1050.
[24] Homaeian, L.; Kurgan, L.A.; Ruan, J.; Cios, K.J.; Chen, K. (2007) Prediction of protein secondary structure content for the twilight zone sequences. Proteins, 69, 486-498.
[25] Chou, P.Y. (1980) Abstracts of Papers, Part I, Second Chemical Congress of the North American Continent, Las Vegas.
[26] Zhang, C.T.; Chou, K.C. (1992) An optimization approach to predicting protein structural class from amino acid composition. Protein Sci., 1, 401-408.
[27] Dubchak, I.; Muchnik, I.; Holbrook, S.R.; Kim, S.-H. (1995) Prediction of protein folding class using global description of amino acid sequence. Proc. Natl. Acad. Sci. USA, 92, 8700-8704.
[28] Bahar, I.; Atilgan, A.R.; Jernigan, R.L.; Erman, B. (1997) Understanding the recognition of protein structural classes by amino acid composition. Proteins, 29, 172-185.
[29] Chou, K.C. (1999) A Key Driving Force in Determination of Protein Structural Classes. Biochem. Biophys. Res. Commun., 264, 216-224.
[30] Jin, L.; Fang, W.; Tang, H. (2003) Prediction of protein structural classes by a new measure of information discrepancy. Comp. Biol. Chem., 27, 373-380.
[31] Xiao, X.; Shao, S.; Huang, Z.; Chou, K.C. (2006) Using pseudo amino acid composition to predict protein structural classes: approached with complexity measure factor. Journal of Computational Chemistry, 27, 478-482.
[32] Kedarisetti, K.D.; Kurgan, L.; Dick, S. (2006) A comment on prediction of protein structural classes by a new measure of information discrepancy. Comp. Biol. Chem., 30, 393-394.
[33] Zhang, C.T.; Chou, K.C.; Maggiora, G.M. (1995) Predicting protein structural classes from amino-acid-composition: application of fuzzy clustering. Protein Eng., 8, 425-435.
[34] Shen, H.B.; Yang, J.; Liu, X.-J.; Chou, K.C. (2005) Using supervised fuzzy clustering to predict protein structural classes. BBRC, 334, 577-581.
[35] Metfessel, B.A.; Saurugger, P.N.; Connelly, D.P.; Rich, S. (1993) Cross-validation of protein structural class prediction using statistical clustering and neural networks. Protein Sci., 2, 1171-1182.
[36] Cai, Y.D.; Zhou, G.P. (2000) Prediction of protein structural classes by neural network. Biochimie, 82, 783-785.
[37] Wang, Z.-X.; Yuan, Z. (2000) How good is the prediction of protein structural class by the component-coupled method? Proteins, 38, 165-175.
[38] Cai, Y.D.; Liu, X.J.; Xu, X.; Zhou, G.P. (2001) Support vector machines for predicting protein structural class. BMC Bioinformatics, 2, 3.
[39] Cai, Y.D.; Liu, X.J.; Xu, X.B.; Chou, K.C. (2003) Support vector machines for prediction of protein domain structural class. Journal of Theoretical Biology, 221, 115-120.
[40] Chen, C.; Tian, Y.-X.; Zou, X.-Y.; Cai, P.-X.; Mo, J.-Y. (2006) Using pseudo-amino acid composition and support vector machine to predict protein structural class. Journal of Theoretical Biology, 243, 444-448.
[41] Cao, Y.; Liu, S.; Zhang, L.; Qin, J.; Wang, J.; Tang, K. (2006) Prediction of protein structural class with rough sets. BMC Bioinformatics, 7, 20.
[42] Jahandideh, S.; Abdolmaleki, P.; Jahandideh, M.; Hayatshahi, S.H. (2007) Novel hybrid method for the evaluation of parameters contributing in determination of protein structural classes. J. Theor. Biol., 244, 275-281.
[43] Kedarisetti, K.D.; Kurgan, L.; Dick, S. (2006b) Classifier ensembles for protein structural class prediction with varying homology. BBRC, 348, 981-988.
[44] Kurgan, L.; Chen, K. (2007) Prediction of protein structural class for the twilight zone sequences. BBRC, 357, 453-460.
[45] Dong, L.; Yuan, Y.; Cai, T. (2006) Using bagging classifier to predict protein domain structural class. Journal of Biomolecular Structure and Dynamics, 24, 239-242.
[46] Feng, K.Y.; Cai, Y.D.; Chou, K.C. (2005) Boosting classifier for predicting protein domain structural class. BBRC, 334, 213-217.
[47] Cai, Y.D.; Feng, K.Y.; Lu, W.C.; Chou, K.C. (2006) Using the LogitBoost classifier to predict protein structural classes. J. Theor. Biol., 238, 172-176.
[48] Hobohm, U.; Sander, C. (1994) Enlarged representative set of protein structures. Protein Science, 3, 522.
[49] Chou, K.C.; Zhang, C.T. (1995) Prediction of protein structural classes. Crit. Rev. Biochem. Mol. Biol., 30, 275-349.
[50] Mahalanobis, P.C. (1936) On the generalized distance in statistics. Proc. Natl Inst. Sci. India, 2, 49-55.
[51] Pillai, K.C.S. (1985) Encyclopedia of Statistical Sciences. In: Kotz, S. and Johnson, N.L. (Eds), Vol. 5, pp. 176-181. Wiley, New York.
[52] Duda, R.O.; Hart, P.E. (1973) Pattern Classification and Scene Analysis. Wiley, New York, Ch. 2.
[53] Zhou, G.P. (1998) An intriguing controversy over protein structural class prediction. J. Protein Chem., 17, 729-738.
[54] Oobatake, M.; Ooi, T. (1977) An analysis of non-bonded energy of proteins. J. Theor. Biol., 67, 567-584.
[55] Brenner, S.E.; Chothia, C.; Hubbard, T. (1998) Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA, 95, 6073-6078.
[56] Park, J.; Karplus, K.; Barrett, C.; Hughey, R.; Haussler, D.; Hubbard, T.; Chothia, C. (1998) Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol., 284, 1201-1210.
[57] Chou, K.C. (2001) Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins, 43, 246-255.
[58] Kyte, J.; Doolittle, R.F. (1982) A simple method for displaying the hydropathic character of a protein. Journal of Molecular Biology, 157, 105-132.
[59] Liu, H.; Setiono, R. (1996) A probabilistic approach to feature selection - a filter solution. In: Proceedings of the 13th International Conference on Machine Learning, pp. 319-327, Italy.
[60] Kohavi, R.; John, G. (1997) Wrappers for feature subset selection. Artif. Intell., 97, 273-324.
[61] Hall, M.A. (1999) Correlation-based feature subset selection for machine learning. Ph.D. Thesis, Department of Computer Science, University of Waikato, Hamilton, New Zealand.
[62] John, G.H.; Langley, P. (1995) Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 338-345. Morgan Kaufmann, San Mateo.
[63] Saha, A.; Wu, C.L.; Tang, D.S. (1993) Approximation, dimension reduction and nonconvex optimization using linear superpositions of Gaussians. IEEE Trans. Comput., 42, 1222-1233.
[64] Aha, D.; Kibler, D. (1991) Instance-based learning algorithms. Mach. Learn., 6, 37-66.
[65] Breiman, L. (2001) Random forests. Mach. Learn., 45, 5-32.
[66] Quinlan, R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo.
[67] Cohen, W. (1995) Fast effective rule induction. In: Proceedings of the 12th International Conference on Machine Learning, pp. 115-123, Lake Tahoe, CA.
[68] Keerthi, S.S.; Shevade, S.K.; Bhattacharyya, C.; Murthy, K.R. (2001) Improvements to Platt's SMO algorithm for SVM classifier design. Neural Comput., 13, 637-649.
[69] le Cessie, S.; van Houwelingen, J.C. (1992) Ridge estimators in logistic regression. Appl. Stat., 41, 191-201.
[70] Doolittle, R.F. (1986) Of URFs and ORFs: a primer on how to analyze derived amino acid sequences. University Science Books, Mill Valley, CA, USA.
[71] Lin, Z.; Pan, X. (2001) Accurate prediction of protein secondary structural content. Journal of Protein Chemistry, 20, 217-220.
In: Computational Biology: New Research Editor: Alona S. Russe
ISBN: 978-1-60692-040-4 © 2009 Nova Science Publishers, Inc.
Chapter 13
FUNDAMENTALS OF NATURAL COMPUTATION IN LIVING SYSTEMS

Abir U. Igamberdiev∗
Memorial University of Newfoundland, Department of Biology, St. John's, NL, Canada A1B 3X9

Measurement is the correlation of numbers with entities that are not numbers
Ernest Nagel
Abstract

The computational process is based on the fundamental semiotic activity linking mathematical equations to the materialized physical world. Its limits are defined by the set of imposed physical values constituting the background structure of the Universe. Computation becomes a direct consequence of semiosis in the case where the arbitrariness of semiotic signs appears strictly defined in the semiotic context. This results in the emergence of an internal formal structure of the system that can be modeled and computed. We consider the formalization of Peircean semiotics in the framework of the Peirce algebra as a prerequisite for understanding the fundamentals of computation. In reproducible semiotic structures such as biological entities, the factor that makes these systems "closed to efficient causation" (in Robert Rosen's sense) is a basic element of the Peirce algebra that provides semantic closure to the system via the introduction of the parameter of organizational invariance. Approaches to defining this factor as a set of dual components, related both to relations and to sets, are discussed within the semiotic interpretation of quantum mechanics, where actualization is explained as a signification process within the network of quantum measurements. In this framework, enzymes are molecular automata whose set maintains the highly ordered, robust coherent state of the quantum computer, and the genome concatenates error-correcting codes into a single reflective set that can be described by the Peirce algebra. Biological evolution can be viewed as a functional unfolding of the organizational invariance constraints in which new limits of iteration emerge, possessing criteria of perfection and having selective values.
Keywords: computation; organizational invariance; Peirce algebra; quantum measurement; Robert Rosen; semiosis

∗ E-mail: [email protected]
Introduction: Grounds for Computability and the Peircean Paradigm

The substantial difference between physics and mathematics is not a trivial question. A view exists that everything that can be computed may be present in the physical Universe (Tegmark, 2007). However, in the real physical world only a few solutions are realized. The other solutions could be referred to other universes of the assumed multiverse continuum, but the basis for such a view is not evident. In the terms of Greek philosophy, the mathematical reality can be referred to Logos (λογος), while the physical reality is defined as Physis (φυσις). The relation between Logos and Physis is based on a fragile correspondence: Logos is interpreted in Physis, while Physis, within certain limits and in its temporal frames, becomes non-contradictory by holding Logos in it. Thus we have the basic trinitary structure, where Logos signifies Physis, Physis is signified by Logos, and their relation is an Interpretant that in a particular case can be represented by the acting observer. All this is united in a trinitary structure which is reflected in the Peircean sign (Peirce, 1955). The Peircean triad is presented in Figure 1. It includes two material causes (between the object and the sign-representamen, and between the sign and the interpretant). The relation of object and interpretant constitutes the efficient cause (in the Aristotelian sense). Closure to efficient causation (Rosen, 1991, 2000) as a definition of life goes back to Aristotle's treatise On the Soul (De Anima 2:1, 412a), where life is defined through its internal determination: "by life we mean self-nutrition and growth (with its correlative decay)" (Aristotle, 1984).
Figure 1. The scheme of triadic structure of the Peircean sign. O – object (corresponding to the Firstness in Peircean terms), R – Representamen (signifier, corresponding to the Secondness), I – Interpretant (corresponding to the Thirdness). The interaction bonds correspond to efficient causes (solid arrow) and material cause (broken arrow).
To exist, the world should be closed by potentially including its observer. Figure 2 shows a further development of the Peircean structure, in which the triadic sign structure is reflected (observed) by itself. It forms another closed structure where the reflection makes it complete. While the original triangle is two-dimensional, its reflection forms a 3D structure, in which a new interpretant is related to the initial object in the same way as pure Logos is related to pure matter (potency, the immediate object in the Aristotelian sense). This structure is described by the simplest Platonic solid, the tetrahedron. It now contains a new link representing not the efficient cause but the final cause: the new interpretant becomes the goal for the immediate object. From this structure also follows the Pythagorean triangle, the tetractys, containing all Peircean types of
signs (Christiansen, 1985). This really develops Plato's idea (expressed in his dialogue "Timaeus") that the essences of all things are triangles.
Figure 2. The complete reflective structure of the Peircean sign in which the interpretant (I) of the first relation becomes the representamen (R’) of the second relation, and the representamen (R) of the first relation becomes the object (O’) of the second relation. The new interpretant (I’) encloses the new triangle. The interaction bonds correspond to efficient causes (solid arrows) and material cause (broken arrows), the final cause of initial object O through the new interpretant I’ is depicted by the dotted line.
C.S. Peirce made a contribution to non-classical logic no less significant than those of Frege, Wittgenstein or Russell. His semiotic triads, however, were not formalized to the extent that would allow them to be considered part of a special mathematical construction. This postponed his acknowledgement as one of the major logicians of all times. A real understanding of these triads establishes the bridge between non-classical and Aristotelian logic. In the semiotic triads of Peirce, logical schemes appear that possess the properties of Gödel incompleteness and Gödel numbering, although expressed in a different language. Translated into formalizable structures, they exhibit properties that allow a better understanding of the foundations of mathematical logic (Kauffman, 2001). The semiotic triads of C.S. Peirce represent subtle structures including both elements and relations linked together. A relation in one triangle can become an element of another, and this semiotic flexibility makes the development of a mathematical formalization of these structures difficult. Recently, however, significant progress in this direction has been made, resulting in the formulation of the Peirce algebra, which formalizes the basic principles of Peircean semiotics (Brink et al., 1994). Peirce defines 'semiosis' as a process by which representations of objects function as signs. In semiosis, the cooperation between signs, their objects, and their 'interpretants' (i.e. their relational representations) takes place. In the interpretant, the set and the relation are related to each other, and this relation is not reducible to the original relation and the original set but mediates between them. The translation of Aristotelian logic into mathematical equations was a great breakthrough achieved by George Boole. A similar translation of Peircean semiotics aims to develop a powerful apparatus for the description of a generalized semiotic process (Böttner, 2001).
In brief, the Peirce algebra is a two-sorted algebra of relations and sets interacting with each other (Brink et al., 1994). It describes the relations between the object, the sign, and the interpretant. In the Peirce algebra, sets can combine with each other as in a Boolean algebra, relations can combine with each other as in a relation algebra, and, in addition, both a set-forming operator on relations (the Peirce product of Boolean modules) and a relation-forming operator on sets (a cylindrification operation) are introduced. The latter operator is a truly unique property of the Peirce algebra, a kind of organizational invariance principle for the concrete Boolean module determining its uniqueness. It is a calculus of relations interacting with sets (Hirsch, 2007). In the framework of the Peirce algebra, not just the algebra of relations as distinct from the algebra of sets is introduced, but ultimately the algebra of relations interacting with sets is constructed. This is achieved by including a unique relation-forming operator on sets. As a result, the relations and the sets interact with each other. This reflects the pioneering approach of C.S. Peirce, who was the first thinker to take a step towards the development of a calculus of relations interacting with sets. His ideas were presented formally only at the initial stage of formalization, and their further development should bring the expected clarification to this approach. The Peirce algebra becomes a necessary next step after Boolean algebras, relation algebras, and Boolean modules, when one can manipulate both sets and relations simultaneously. Such manipulation is a major characteristic of the Peircean semiotic approach. The central operation in the Peirce algebra, the relation-forming procedure imposed on sets, is the one that establishes semiosis in the system. It introduces a significative relation on sets. The operator of right cylindrification obeys a finite set of equational axioms. It takes a Boolean element and returns a relation algebra element. This procedure does not correspond to a description logic formula but can be defined by a finite set of algebraic equations. The introduction of the right cylindrification procedure makes a representation of the Peirce algebra necessarily straight, while some representable Boolean modules do not have any straight representations at all (Brink et al., 1994). This means the imposition of order by semiosis, which is reflected in the formal apparatus of the Peirce algebra. The formulation of the Peirce algebra shows that the "logic of relations" introduced by C.S. Peirce in his semiotic approach corresponds to a formalizable linear associative algebra (De Rijke, 1995).
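A toy rendering of the two signature operations on a finite universe may help fix ideas; representing relations as explicit sets of pairs is an illustrative assumption, not the abstract algebraic definition of Brink et al. (1994):

```python
U = {1, 2, 3, 4}                      # finite universe
R = {(1, 2), (2, 3), (3, 3), (4, 1)}  # a relation on U
A = {2, 3}                            # a set (Boolean element)

def peirce_product(R, A):
    """R : A = {x | there is y with (x, y) in R and y in A} (set-forming)."""
    return {x for (x, y) in R if y in A}

def right_cylindrification(A, U):
    """A^c = A x U: relates each element of A to everything (relation-forming)."""
    return {(a, u) for a in A for u in U}

print(peirce_product(R, A))  # {1, 2, 3}
# sanity check of one equational axiom: (A^c) : U recovers A
print(peirce_product(right_cylindrification(A, U), U) == A)  # True
```

The last line illustrates how the cylindrification of a set, followed by the Peirce product with the full universe, returns the original set, one of the finite set of equational axioms mentioned above.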
Quantum Mechanical Basis of Computability

One attempt at a formulation of this algebra was introduced in relation to physical reality via a linear transformation to quaternions corresponding to the C(3,0) algebra of W.K. Clifford (Beil and Ketner, 2003). Interestingly, this Clifford algebra contains the Pauli matrices known in physics and thus constitutes an operator basis for the nonrelativistic quantum theory of spin one-half particles (Beil, 2004). A further unification is achieved by taking the quantum mechanical wave functions themselves to be 2 × 2 matrices which are the Peirce logical operators and also the elements of the Clifford algebra. Thus a direct path exists from the Peirce logic to the quantum theory. This means that one application of the Peirce algebra is the quantum theory, and an approach arising from the Peircean representations results in understanding the quantum measurement as a semiotic process where actualization is related
to the formation of a sign, which can be reflected in a cylindrification procedure forming a new relation in the set of quantum mechanical state vectors. The semiotic nature of quantum measurements was first stated in relation to the biological reality by Pattee (1972) and Rosen (1978a). It is based on the understanding of quantum measurement as an internal process in which the measuring device is included in the measured reality and thus establishes a semiotic relation between the device and the quantum system interpreted by the observer. In the semiotic interpretation of quantum mechanics (Christiansen, 1985), decoherent histories are reduced not by the environment (as in the consistent histories approach; see Zurek, 2003) but through their significative validity. Moreover, the formation of the sign itself is equivalent to this reduction (Igamberdiev, 2008). The sign is formed in the trinitary process including signification (preparation for actualization, the path from the object to the sign), interpretation (detection, i.e. real actualization, the path from the sign to the interpretant), and organization (the path from the interpretant to the object). Peirce described semiosis as a triadic relation in which an object is referred to by a sign and by an interpretant. The interpretant is itself a sign which may have a triadic relation with the sign which it signifies and with its own secondary interpretant. Thus, the triadic relation between a sign, an object, and an interpretant may be repeated infinitely. The triadic relation produces another triadic relation between the relation itself, its signification, and the interpretation of that signification. This is reflected in the three-dimensional tetrahedron structure of Figure 2 and its two-dimensional representation as a tetractys containing all Peircean types of signs (Christiansen, 1985). In the reflective sign structure, the initial representamen becomes an object (O') for a new signification, and the initial interpretant becomes a new representamen (R') (Figure 2). This is based on the unique organizational invariance parameter appearing through the creation of a new interpretant I'. The initial object is the mode of being a possibility, the signification is the mode of making it a fact, and the interpretant makes it a real sign in representation with other signs (generates the law of representation). The computational process has a physical limitation, which stems from the fact that any calculation action has a price (Liberman, 1989); e.g., the addition of one takes energy, and this energy cannot be reduced to zero. This means that computability is limited: if the price of action is high, we have no physical capacity to compute the process effectively (Conrad and Liberman, 1982). The main condition for applying computation to the physical world is that it should be performed stably. To proceed with a high precision of the output, which is a prerequisite of information transfer, the process should preserve quantum coherence over a prolonged period of time (Igamberdiev, 1993, 2004). The quantum device possesses its own potential internal quantum state, which is maintained for the internally set time period via reflective error-correction. The error-correction is a reflection over this state, concatenated within the three-dimensional space and represented as a kind of molecular computer. The internal quantum state can exhibit itself through the creative generation of limits of iteration in the physical world.
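The claim above that the Pauli matrices realize the C(3,0) Clifford algebra can be verified numerically; the following sketch checks the defining anticommutation relations σiσj + σjσi = 2δijI and involves no assumptions beyond standard definitions:

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
sigma = [sx, sy, sz]

for i in range(3):
    for j in range(3):
        anti = sigma[i] @ sigma[j] + sigma[j] @ sigma[i]
        expected = 2 * I2 if i == j else np.zeros((2, 2))
        assert np.allclose(anti, expected)
print("Pauli matrices satisfy the C(3,0) anticommutation relations")
```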
Superpositions can exist only in quantum systems that are free from external influences; thus the operation over the coherent state should be restricted to error correction that does not disturb this internal state. A decay from a superposition to a statistical mixture of states is called decoherence, and decoherence rates scale exponentially with the size of the superposition.
The internal quantum state, by its internal determinism, causes decoherence while remaining coherence-free itself. It is a decoherence-free subspace, which applies decoherence through its external shield (body) and remains robust against the measurement process. Although the standard view represents the quantum wave function as extremely fragile against measurement (e.g. the Schrödinger's cat), the maintenance of internal quantum states is responsible for securing the stability of the macroscopic material world. The internal coherent state is a necessary infrastructure to uphold the material world informationally. The robustness of this state corresponds to the physical limitation of the computational process (Liberman, 1989). In biological systems, it is impossible to define statically the structural parameters of the state space; instead, an analysis of the system in terms of observables (similar to the quantum mechanical observables), which are functions of states, can be introduced (Wolkenhauer and Hofmeyr, 2007).
Robert Rosen's Theory of (M,R) Systems and Autopoiesis

The semiotic approach to biological systems (Igamberdiev, 1992) reflects the view that the biosystem is not simply fractionable into two functional parts corresponding to hardware and software. It contains a significative system (the genome) which encodes its structure, but in turn the system of significations is internally reproduced within the system and is repaired through the sets of internal constraints coming from the elements that it encodes. Such a system represents a single entity, with a locally stable point attractor (steady state). Moreover, it possesses other styles of control besides error-based, cybernetic controls. One of them is based on anticipation, in which some present action is taken not to correct an error which has already occurred, but to pre-empt an error which would occur in the absence of that action. The control here is based not on the past but on the future, through the agency of a predictive model which converts present information into predicted future consequences. These consequences provide the basis for present actions. This control system works on the basis of predictive models rather than cybernetic feedbacks. In such behavior, something in the system must be doing a "double duty": serving as an ordinary material constituent, and as a predictor for something apparently quite unrelated in material terms (Letelier et al., 2006). It means that this element of the system represents an anticipatory relation, i.e. it belongs both to the sets of elements and to the sets of relations (see also Rosen, 1985). This kind of "double duty" is difficult to accommodate in dynamical terms, but it can be analyzed in semiotic terms. It is related to the phenotype-genotype dualism, which is characteristic of biological systems in general but essentially absent in the inorganic world. However, it is not completely reducible to the phenotype-genotype dualism, representing a quality of certain elements of the system that can fulfill the duties of providing the organizational invariance. These elements can belong simultaneously both to the "hardware" and to the "software" parts. Rosen (1991, 2000) suggested an alternative to the classical dualistic genetic model of the biological system. He called it the (M,R) system, where he considered M as metabolism and R as repair. In a recent publication (Letelier et al., 2006) it is suggested that instead of "repair" it is more correct to say "replacement". The elements of the metabolic system are continuously replaced, the elements that replace them are also replaced, and this could go on in an infinite regression. However, Rosen stated that the system can be "closed to efficient causation" and
contain the internal principle of organizational invariance (Rosen, 1991), which avoids the infinite regression and closes the system in a stable non-equilibrium state in which the system, while remaining open to material flows, becomes selective to them and affords being closed to the efficient causes that are locked inside it. By formulating these basic principles, Rosen introduced the general basic structure for life, which is the basis for its definition as "closed to efficient causation" going back to Aristotle, and for the computational approaches based on the (M,R) structure. This structure has a capacity for internal development via internal rearrangements with simultaneous redefinition of the organizational invariance. In fact, the structure imposed by Rosen is triadic: it includes the central principle of "organizational invariance" (which can be denoted O) holding M and R in the state of avoidance of the infinite regression for the internal system's time T, so his theory can be defined as a theory of (M,R,O) systems. Rosen's theory contains an attempt to formulate a relevant formal apparatus for describing biological systems. Although this apparatus needs further development, it represents a unique attempt to structure the formal basis for a description of living systems. Other attempts are either reductionistic, like the most advanced Eigen's theory of hypercycles (Eigen and Schuster, 1979), which has some features in common with Rosen's ideas in the notion of hypercycle closure but reduces evolution to random mutations within hypercycles and to their natural selection, or purely phenomenological in description, including numerous biosemiotic models going back to Pattee (1972), or the autopoietic theory of Maturana and Varela (1980). These latter models have features in common with Rosen's approach but remain less developed with respect to their computational value and have other differences (Nomura, 2007). Below we try to outline the basic mathematical background of Rosen's concept and to describe its possible connection to Peircean semiotics and the Peirce algebra. It is known that Rosen extensively used category theory for the description of (M,R) systems. The semiotic aspect of this theory was mentioned earlier (Kull, 1998). We will attempt to show that the category approach can be merged with the approach of the Peirce algebra. In the basic Rosen theory, the (M,R) system is viewed as a conjunction of three mappings F, Φ and β uniting two sets A and B with elements a and b, respectively (for details, see Letelier et al., 2006):

$$A \xrightarrow{\;F\;} B \xrightarrow{\;\Phi\;} \mathrm{Map}(A,B) \xrightarrow{\;\beta\;} \mathrm{Map}\bigl(B, \mathrm{Map}(A,B)\bigr)$$
In this system of mappings, the following assumptions are set: F(a) = b, Φ(b) = F, β(F) = Φ, with β equivalent to b. The formal entity β has the property that β(F) = Φ. Thus, β is a mapping between Map(A, B) (the set of possible metabolisms) and Map(B, Map(A, B)) (the set of possible selectors). The procedure defined by β consists in the operation that, given a metabolism F, produces the corresponding selector Φ that selects that metabolism. For β to exist it is required that the equation Φ(b) = F have one and only one solution Φ. The most fundamental question here is whether it is possible to produce a definitive mathematical description (preferably an algorithm) to
calculate β. Rosen recognized that producing such a mathematical description is difficult (Rosen, 2000, pp. 261–265). The tricky thing here is that β is a function, but it is simultaneously "equivalent" to an element b ∈ B, in the sense that β sends any F to the unique Φ such that Φ(b) = F. As a result, this construction solves the problem of infinite regress. The infinity generates its limit within the system by allotting a certain element b with a property reflecting the whole system. Thus the element b acquires a dual function, becoming a sign (argument) designating the system as a whole. In this sense it is equivalent to a Gödel number, or in general to a set of Gödel numbers. It has a semiotic nature, imposing semantic closure on the system in which it is present. If we turn to the Peirce algebra, Φ is a Peircean operator in the Boolean module, while β represents a cylindrification procedure. In topos logic, Φ may correspond to a subobject classifier (Goldblatt, 1979), while for β there is no precise definition. Actually, the operator β is the factor which makes the system autopoietic. What does autopoietic mean? According to Maturana and Varela (1980), the autopoietic system continuously regenerates itself, being self-contained and self-maintaining. The autopoietic system has its own time determining the limit of "global system failure" (Rosen, 1978b). An attempt to determine this time was made by Rosen (1978b); however, this part of his theory was not significantly developed. This internal time can be substantiated via the theory of quantum measurements and the Heisenberg energy–time uncertainty relation (Igamberdiev, 1993, 2007). In all of Rosen's writings, the precise nature of β has been left open. But as we discussed before, the operation defined by β designates the system as a whole within itself. It acts as a generator of the complete structure of an (M,R) system (Letelier et al., 2006). In other words, β represents the set of Gödel numbers of the given system. The problem thus remains whether β can be recursively deduced from the elements and relations of the system, or whether it is introduced nondeterministically. In cases where the β parameter is not unique, the system may be autocatalytic, but it is not organizationally invariant and thus is not a true (M,R) system (Letelier et al., 2006). In the true (M,R) system, the β parameter is unique according to Rosen (1991). Figure 3 introduces a diagram that represents Rosen's explanation of the closure of an (M,R) system, which is his "central result" (Cornish-Bowden et al., 2007). The broken arrows (representing Aristotelian material causes) indicate that a function (located at the start of the arrow) uses a variable (at the destination) to produce a result. Each solid arrow (representing Aristotelian efficient cause) indicates a transformation. The essential aspect of this diagram is that every biological function, i.e. metabolism F, replacement Φ, or organizational invariance (the implied β which is equivalent to some element b ∈ B), is entailed by another element in the diagram. No outside causality is needed. This is the basic bootstrapping property of a living system that justifies Rosen's statement that "organisms are closed to efficient causes" (Rosen, 1991).
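To make the mapping chain concrete, the following toy sketch (a minimal illustration, not Rosen's construction) realizes the three assumptions F(a) = b, Φ(b) = F and β(F) = Φ on single-element sets, where the closure of entailment is trivial to verify. The real mathematical difficulty Rosen discusses concerns whether such a β can be defined non-trivially for realistic metabolisms.

```python
# Toy realization of Rosen's mapping chain on one-element sets (illustrative only).
A, B = {"a"}, {"b"}

def F(a):            # metabolism, F: A -> B
    return "b"

def Phi(b):          # repair/replacement, Phi: B -> Map(A, B)
    return F         # given b, selects the metabolism F

def beta(f):         # organizational invariance, beta: Map(A,B) -> Map(B, Map(A,B))
    return Phi       # given F, produces the selector Phi such that Phi(b) = F

# Closure of entailment: every function is entailed by another element of the system.
assert F("a") == "b" and Phi("b") is F and beta(F) is Phi
```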
Figure 3. The diagram representing the structure of an (M,R) system. The broken arrows indicate that a function (located at the start of the arrow) uses a variable (at the destination) to produce a result (i.e. they correspond to material causes). The solid arrows indicate a transformation (corresponding to efficient cause). The dashed arrow indicates the final cause (goal) directed at the initial variable. Every function (metabolism, replacement, and organizational invariance) is entailed by another element in the diagram (no external causality).
From this view it follows that the metabolic networks which constitute the foundation of living systems must satisfy certain logical regularities that go beyond stoichiometric or thermodynamic constraints. These logical regularities arise from the circular nature of biological organization, which can be summarized by the statement that the three-step chain of mappings and sets constituting the framework of Rosen's central result suggests a construction that generates mathematical objects that are solutions to the puzzling-looking equation F(F) = F (a fixed-point solution). The generation of mathematical objects here is internal: it comes not as a property of our description but as an internal property of the system to generate its own description. The function F: A → B represents metabolism viewed as a mapping. The function Φ: B → H(A, B) represents "repair" in Rosen's sense and generally can be regarded as replacement (Letelier et al., 2006). The function β: H(A, B) → H(B, H(A, B)) is a unique parameter of the organizational invariance. The latter function essentially acts as a generator of the complete formal structure of an (M,R) system. In effect it is possible to reformulate the definition of an organizationally invariant (M,R) system as the kind of system where, for some b, the equation Φ(b) = F has exactly one solution Φ for any given F, giving rise to the operator β, which sends any F to its associated Φ and thereby implicitly gives the structure of the whole system (Letelier et al., 2006). The explicit construction of β was referred to by Rosen as the realization problem (Rosen, 2000, p. 262), and he conceded that it was difficult, at the level of the theory or at the level of a physical model, to construct a metabolic network that would embody the notion of β. The imposition of β is really the generation of a Gödel number, making possible an internal description of the system itself. It is a creative action that can be successful only if it takes into account the previous organization and the relation to the environment, and anticipates possible changes that occur to the system due to this introduction.
In Rosen's "double triangle" (Figure 3), the material causes and the efficient causes follow each other. However, they merge upon the signification of β in the element b (the double arrow from B to F). This merging of material and efficient causes creates the basis for anticipation within the system. The double triangle introduced by Rosen, however, still does not contain a link connecting the initial object A with the final interpretant (which is the replacement function Φ). In fact, Φ determines the existence of A through its continuous regeneration. This means that it corresponds to the final cause for A, and the system with this extra connection becomes anticipatory in Rosen's sense. As a result, the "double triangle" is transformed into a tetrahedron (the simplest Platonic solid), which makes the structure three-dimensional. This three-dimensional structure is complete: the system contains all Aristotelian modes of causation (including the formal cause, which is not incorporated directly into the diagram but is contained in the whole structure, limiting possible ways of its further development). So the question remains whether it is possible to derive the value of β from the observed (M,R) structure. The general answer can be positive, within the framework of the problem of the uniqueness of Gödel numbers. The Gödel numbering function g can be chosen to be total recursive. We need to proceed with the uniqueness of quantification, i.e. unique existential quantification. This procedure, known in logic, can be applied to Rosen's β procedure. In general, Gödel numbering is not unique, in the sense that for any proof using Gödel numbers there are many ways in which these numbers could be defined. But from the point of view of optimality, the uniqueness can be restricted by the organizational invariance of the system in the biological realm, or by observability in the physical realm. The organization (the knowledge of which components are needed for each function) of the (M,R) system is encoded within itself (Cornish-Bowden and Cárdenas, 2007), and the known biological codes (such as the genetic code) are consequences of this primary encoding based on the organizational invariance. The stable non-equilibrium state is the physical realization of the organizational invariance principle, being the visible feature of the metabolic closure. Material-energy fluxes are open, while the efficient causes are locked within the system. The principle of the stable non-equilibrium state was introduced by E.S. Bauer (1935) as the basic principle of theoretical biology underlying the structure of all living systems. Indeed, the stable non-equilibrium principle is clearly supported by Rosen's scheme of the (M,R) system.
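As an illustration of a total recursive Gödel numbering, the classical prime-power encoding maps a finite sequence of positive integers to a single number; unique prime factorization makes the encoding invertible, while the choice of scheme itself is not unique, which is exactly the point made above. A minimal sketch:

```python
def godel_number(symbols):
    """Classical prime-power Goedel encoding of a finite sequence of positive
    integers: g(a1, ..., an) = 2**a1 * 3**a2 * 5**a3 * ...  Unique prime
    factorization guarantees that the encoding can be inverted."""
    primes, n = [], 2
    while len(primes) < len(symbols):
        if all(n % p for p in primes):   # trial division by the primes found so far
            primes.append(n)
        n += 1
    g = 1
    for p, a in zip(primes, symbols):
        g *= p ** a
    return g

print(godel_number([1, 2, 3]))  # 2**1 * 3**2 * 5**3 = 2250
```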
Organizational Invariance as the Principle of Optimal Design

The solution for β can be considered only as anticipatory, based on the final cause of the optimal design principle. This principle was outlined by Rosen (1967) in his early monograph and developed in later works (Rosen, 1980). The main idea is that the system closed to efficient causation ultimately contains feedforwards, i.e. dynamic trends based on anticipation (Rosen, 1985). In this regard, the β parameter expresses a feedforward solution for the whole system. Such a solution for a finite system should itself be finite, i.e. it assumes the system's failure after a certain period of time T, also defined internally. Rosen only drafted a preliminary outline for the definition of this time T, which is the ultimate internal time for a given stable non-equilibrium structure. Given any such anticipatory system, can we state that the specific constants (like the organizational invariance β) appearing in the system are the ones which optimize it with
respect to any single reactant, whose rate will play the role of a criterion of the cost of a particular action (the price of action of the system)? This action is represented by a quantity which can only accumulate, while the remainder of the system is designed in such a way that its rate of accumulation is minimal. Rosen says that the accumulation of any such substance may be regarded as defining a clock. If the dynamical equations are such that no configuration variable changes at a rate independent of its value, we can generally transform the system to new variables in which at least one of the equations of motion does possess this property (Rosen, 1978b). In other words, the quantity in question (whose rate of change is independent of its own value) will serve as an independent variable in the system and will play the role of its internal clock. The value β characterizing optimal design has its characteristic time T that characterizes its anticipatory capacity, beyond which "global system failure" will take place (Rosen, 1978b). We may define T as a value which characterizes the stability of a system within the frames of a given Gödel incompleteness. It is the Aristotelian time that should be counted (contrary to the external time by which we count) in biological systems (Igamberdiev, 1996). It is clear that the solution can only be temporary: it is not possible to keep contrary statements divided for eternity. Rosen (1978b) stated the problem of internal time but did not define how to calculate this time. The internal time of the system can be illustrated by concrete biological examples. Rubner introduced a constant that defines the average life span from the features of metabolism (Zotin and Alekseeva, 1984). It is related to the unit of physiological time τ(t), which is a variable value, so the internal time flows unevenly relative to the physical time. The more internal time units τ, i.e. elementary acts of energy consumption, fit into the unit of physical time t, the longer the unit of physical time is for the unit of active mass relative to the internal time unit, i.e. the physical time is seemingly slowed down. On the contrary, the fewer elementary acts of energy consumption take place during the unit of physical time, the shorter the unit t seems, i.e. the physical time is seemingly accelerated. The unequal course of the internal time is determined by the curve of specific metabolism q(t) during life under specific conditions and, hence, internal time is individual. The total amount of energy that can be assimilated is limited, being a species-specific parameter of the organism (Rubner's constant) proportional to the free energy of the ovicell (see also Bauer, 1935).
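The bookkeeping described in this paragraph can be sketched numerically: internal time τ advances with the specific metabolic rate q(t), and the life span T is reached when the cumulative assimilated energy per unit mass hits a Rubner-style constant R. The shape of q(t) and all numbers below are hypothetical illustrations, not measured values.

```python
import numpy as np

def q(t):
    # Hypothetical specific-metabolism curve (energy per unit mass per unit of
    # physical time), declining with age; purely illustrative.
    return 2.0 * np.exp(-0.02 * t) + 0.2

R, dt = 100.0, 0.01        # assumed Rubner-style energy limit and time step
t = tau = 0.0
while tau < R:             # tau = integral of q dt: cumulative elementary acts
    tau += q(t) * dt       # of energy consumption, i.e. the internal time
    t += dt

print(f"physical life span T is about {t:.1f} time units")
print(f"physical time per internal unit: {1/q(0):.2f} at start, {1/q(t):.2f} at T")
```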
Internal Computability of Metabolic Systems

The organizational invariance assumes that biological systems are dynamically stable. Being closed to efficient causation, these systems are open to flows of matter and energy and use these flows for the maintenance of efficient causes and the generation of anticipatory feedforwards. The "stable non-equilibrium principle" is the basic principle of theoretical biology introduced by Bauer (1935). Within the framework of the discussed approach, it is equivalent to Rosen's principle of organizational invariance in (M,R) systems. How does the stable non-equilibrium work? It is based on the double functions of certain components of the system. For example, a hormone affects certain processes on membranes and simultaneously initiates processes at the genomic level, which will provide long-term support to the initial effect. The nucleotide system, forming matrices for proteins by the linear binding of four nucleotides, is another example. Formed from nucleoside triphosphates splitting to nucleoside monophosphates, RNA represents a flexible operational memory
system, which in turn is reflected in the stable memory of DNA. On the other hand, the same set of nucleotides also forms a system of "energy charge" governing metabolism (Igamberdiev and Kleczkowski, 2003). In other words, the nucleotide system has a dual (both informational and energetic) function. From the total pool of adenylates and other nucleotides, the genetic system is formed. DNA is metabolically isolated from the major metabolic pool of nucleotides because it uses deoxynucleotides, whose synthesis is strictly controlled, while RNA is more directly linked to the nucleoside triphosphate pool, which can be used both for general metabolism and for RNA turnover. This means that the function of nucleoside triphosphates is double: they are used both in general metabolism as free nucleotides and in RNA metabolism by covalent polymerization into RNA molecules. Changes in their pools directly affect both metabolism and RNA turnover. The network of biochemical processes that constitutes metabolism bootstraps itself without the help of external agents generated outside the network, thus keeping cell organization invariant in spite of continuous structural change. Different nucleotides serve as cosubstrates (coenzymes) of many enzymes, and their association into nucleic acids generates matrices for the reproduction of the enzymes themselves. When cosubstrates serve as allosteric effectors, they form reflective arrows in metabolic networks, leading to the formation of a switching network possessing an intrinsic internal logic. Polymerization of cosubstrates into nucleic acids generates a self-referential set of arrows for the set of catalysts, resulting in the appearance of the digital information of the genetic code, forming the internal programmable structure of the biosystem (Igamberdiev, 1999a). The genetic code possesses its own intrinsic arithmetic that can be uniquely set from the point of view of optimality (Gusev and Schulze-Makuch, 2004; Shcherbak, 2003, 2008). A strict theory linking metabolism M (the nucleotide system governing metabolism as "energy charge") and replacement R (the nucleotide system governing metabolic functors as "genetic information") has yet to be built. The most important point here is that both systems use the same components. One subset of these components forms "the energy charge" while another subset forms "the genetic information". These subsets are linked by the set of enzymes, for whose reproduction both information and energy, as formal and efficient causes, are needed, while the material cause (amino acids) is taken from internal biosyntheses and from external feeding. It is important to mention that both systems (the genetic and the metabolic) have computational properties, and both have limits of computation. For the metabolic system, certain ratios between adenylates are established via the thermodynamic buffering of enzymes such as adenylate kinase, nucleoside diphosphate kinases, and others. Deviations from such equilibrium will result in an unstable state that tends to return not to the equilibrium but to the initial stable non-equilibrium state (Igamberdiev and Kleczkowski, 2003). Simple rules of nature, when discovered, can substantially clarify our knowledge and further develop science. The classic example is Mendeleev's discovery of the periodic system of elements; other examples are the rules of complementarity (Chargaff et al., 1953) and the genetic code (Gamow, 1954; Crick, 1963), which established the foundation of molecular biology and became the basis of genomics.
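The "energy charge" side of this dual function can be made concrete with Atkinson's adenylate energy charge and the adenylate kinase equilibrium (2 ADP ⇌ ATP + AMP, with an equilibrium constant near 1), which underlies the thermodynamic buffering mentioned above. The concentrations below are hypothetical illustrative values:

```python
# Illustrative adenylate pool (mM); the numbers are hypothetical.
ATP, ADP, AMP = 2.0, 0.5, 0.05

# Atkinson's adenylate energy charge: (ATP + 0.5*ADP) / (ATP + ADP + AMP).
energy_charge = (ATP + 0.5 * ADP) / (ATP + ADP + AMP)
print(f"energy charge = {energy_charge:.2f}")   # healthy cells sit near 0.9

# Adenylate kinase buffering: at equilibrium K = [ATP][AMP]/[ADP]^2 is close
# to 1, so AMP is pinned near ADP**2/ATP and amplifies small drops in ATP.
K = 1.0
AMP_eq = K * ADP ** 2 / ATP
print(f"AMP at adenylate kinase equilibrium = {AMP_eq:.3f} mM")
```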
It was found that nucleic acid and protein synthesis are governed by simple computable rules, and the main original problem was to understand that such rules really exist. Understanding the rules following from protein sequences became a more complicated task, and current approaches are mainly based on sequence similarities of protein domains. Protein function emerges during the protein folding
process, which cannot be fully computed to date, but simple rules governing it may indeed exist (Root-Bernstein, 1982; Siemion, 1994). By recognizing a role for "repair" or "replacement", represented by the mapping Φ, F became entailed. "The obvious idea of iterating the construction we have used, and introducing repair components...leads to an infinite regress and is useless. In biological cells, the problem is solved by making the repair components... (self-) replicating" (Rosen, 1972). This idea of self-replication of functional components (not parts) of the system is the key to the resolution of the problem. In complex systems, the function is "spread" over the parts of the system in a manner which does not map 1:1 onto those parts (Letelier et al., 2006). To capture this essential property, Rosen utilized the functional component to represent this newly recognized reality within the system. The functional components are the ontological embodiment of the non-fragmentable aspects of the system's organization. Meyen (1977) called these elements merons, emphasizing that for taxonomy the principles of internal structurization are more important than the visible structures. The merons are defined by their context and have no necessary meaning outside that context. Thus, they capture what is lost by reductionism. The notion that functional components have the same reality as the parts, or even more, is present in Rosen's works. However, the most ultimate reality is not material and not functional; it is the non-fragmentable reality of organizational invariance, a kind of Platonic eidos immanent to the system as a whole. Contrary to taxonomy, which categorizes on the basis of discrete sets, meronomy deals with relationships between the whole and its parts.
Computability of the Evolutionary Process

Some non-trivial conclusions about biological evolution follow from the consideration of the biological system as an (M,R) system. Rosen's idea of organizational invariance has a certain parallel with the concept of meron developed by S.V. Meyen (1977, 1978). In these views, "meron means a set of parts, belonging to its objects and having some common traits, i.e., the concepts of meron and trait are different", while the concept of taxon means a set of objects united by common traits. The internal polymorphism of taxa makes all extrapolations uncertain to some extent (Sharov, 1995). The concept of meron replaces the notion of archetype as the basis for taxon distinction, because sometimes it may be impossible to find even one trait common to all objects in the taxon. Nevertheless, such a taxon is natural if it has its specific appearance and is well distinguished from other taxa; it is distinguished not by a definite form set (archetype) but by a certain law of polymorphism. This law can be formulated by frequency distributions of traits and by sequences of form modification (Sharov, 1995). Meyen associated the application of the same transformation to different initial forms with theme development in a musical composition and named this phenomenon "refrain", an archetype-like character usually more complex than an ordinary morphological trait. The law of polymorphism associated with a meron is really the law of organizational invariance of a given biological system formulated in Rosen's concept. A meron, according to Meyen, also possesses its own individual time. This time includes a component which is similar in all other objects and is associated with the physical time. The evolutionary process can be considered as proceeding not in a "container" time but rather as generating the time flow via resetting the values of organizational invariance. Such resetting is
creative: it cannot be predicted with certainty and can be evaluated a posteriori in the newly evolved system embedded in a new environment modified by it. On the other hand, many traits are independent of the values of organizational invariance, forming homology series of variation according to Vavilov's law. Vavilov (1922) formulated his law of homologous series in such a way that different taxa (species, genera and up) are distinguished by their radicals (L1, L2, L3), representing organizational invariance, while the changes in traits (a, b, c...) are parallel and can be predicted:

L1 (a + b + c + d + e + f + g + h + i + k...)
L2 (a + b + c + d + e + f + g + h + i + k...)
L3 (a + b + c + d + e + f + g + h + i + k...)

The difficulty of defining radicals in some cases can be related to the difficulty of formulating organizational invariance, because it is not generally reducible to a spatial archetype but has a temporal constituent expressed in certain transformation principles. The understanding of the evolutionary process was originally based on the phenotypic (largely metabolically derived) features of organisms, and then it acquired basic support from the genetic point of view (i.e. evolution understood as a change in the replacement system). In a deeper context, the evolutionary process is based on the principles of transformation of the organizational invariance parameters, which include determinism of the transformation of previous structures restricted to concrete rules (the nomothetic aspect); in general, however, these transformations cannot be predicted but can only be forecast. Thus the evolutionary process is both creative and nomothetic. It represents a contradictory process of growing complexity, which includes fundamental principles of perfection of canons, regarded as its nomogenetic laws in the sense of Berg (1969), and free creativity in their construction, based on internal choice in the sense of Bergson (1917). It also includes parallel series of changes in traits allowed by, and not directly dependent on, the given organizational invariance.
Computable Structures in Psychology

In psychology, the biological interpretant represents the object for a new reality, in which the subject becomes a part of a system where it can reflect itself. In the new triad, the biological reality appears as the unconscious (the Id in Freud's definition or the Imaginary in Lacan's). It is interpreted as Ego (Freud, 1976) or Real (Lacan, 1970) through signification by the image of the other, the Superego (Freud) or Symbolic (Lacan). The particulars of this representation are described in my paper (Igamberdiev, 1999b). The Peircean triad of the structure of the subject is called in psychology the Oedipus complex. It includes relations between three defined elements: the relation of Ego to Id represents the desire for the Mother, while the relation of Superego to Ego represents the prohibition of possession of the Mother. All this is described in detail in Freud's works and, in a more semiotically oriented language, in the works of Lacan (1970), who claimed that his task was only to interpret Freud correctly. This semiotic structure determines the way of internalization of the external world. It can be considered as a logical pattern describing the interrelations between consciousness and the external world, which determines the fixation of somebody's image into the other as a possibility to substitute the other (Igamberdiev, 1999b, 2008).
The future task is to describe the structure of the Oedipus complex formally by means of the Peirce algebra. What has already been done successfully, however, is a simplification of the Peircean scheme by reducing it to Boolean algebra (Lefebvre, 1990). It is possible to interpret this approach in such a way that Lefebvre took the three components of the Freudian scheme as the reflection of the Id over itself (Ego) and as a reflection of this reflection, coming as the subject's estimate of itself (Superego). This resulted in a simple algebra of binary relations representing the rules of reflexive control. In the model of reflection developed by Lefebvre (1990), the trinitarian Freudian semiotic structure of consciousness is reduced to recursive Boolean schemes. A unique system of dichotomous constructs serves as a special axis for projecting (mapping) the other person (organism or neighboring cellular automaton). In the internal process of making a choice, the system performs a procedure of maximization of the pragmatic status of its image of itself. The golden section appears here when the internal choice is made in reflective modeling of the self (Lefebvre, 1990), i.e. when the subject A1 chooses a positive role with a frequency equal to the frequency of positive stimuli input to his image of himself (a2) and to his image's image of himself (a3). The subject's state a1 will be a composition of the contradictory statements a2 and a3, so the subject corresponds to the character A1 ≡ a1^(a2^a3). When a1 = ½ (e.g. in the case of a choice between two equal elements or equal contradictory states), the values of a2 and a3 correspond exactly to the golden section, the positive root of x² + x = 1 (approximately 0.618) (Lefebvre, 1990). The value of the golden section emerges under the condition that a subject chooses a positive role with the frequency of positive stimuli input to its image of itself and to its image's image of itself. The development by Lefebvre is very important, but it has to be derived from the basic structures of the Peirce algebra and topos logic, otherwise it is not fully substantiated. The closure to efficient causation and Peircean schemes can be applied to the origin of human society and to the Oedipus complex. The labor activity of humans is accompanied by the semiotic internalization of tools in language. The thesis of Vygotsky (1962) about the internalization of labor tools in the signs of language (which emerged from the original idea of Engels on the role of labor in human biological and social evolution) can be incorporated into the broader paradigm of Peircean semiotics. To link labor activity to mental development, an internalization factor is necessary, similar to the "organizational invariance" β in Rosen's theory. This transition reflects Vygotsky's idea of development as a process of internalization. Bodies of knowledge and tools of thought, evolving to be linked together, form a new entity of human culture internalizing the environment. Development consists of the gradual internalization of the environment, primarily through language, to form the cultural adaptation. Piaget (1955) considers several stages of such internalization during the individual development of the child, possibly reflecting the evolutionary succession. Such internalization proceeds via the generation of non-countable sets of new Gödel numbers, thereby setting a relation to the external reality. The psychological triad of the Oedipus complex can in turn be interpreted via its own signification.
Here we come to the same problem that was formulated in Robert Rosen's theory of (M,R) systems. The determinism imposed by the Superego can be mediated via the total essence that makes this determinism an expression of the universal law, thus diminishing its negative estimation. As a result we get an Ego which contains the image of the whole world (Igamberdiev, 1999b). A new interpretant is formed, enclosing the cycle in the same way as in Rosen's (M,R) system, but including the whole world in the semiotic scheme. This
supreme semiotic structure can be illustrated through Jorge Luis Borges's story "The Aleph" (Borges, 1970), where it appears as a small sphere that contains the entire universe: past, present, and future. The uniqueness of β means that the embedded arithmetic system is unique for systems closed to efficient causation. This means that β represents a factor of individual closure for a given system, being its main characteristic. This also gives an understanding of the uniqueness of the genetic code (Shcherbak, 2003, 2008): it can be analyzed not from the almost infinite number of possible versions of the genetic code, but from the optimality of the existing structure. The same approach can be applied to the uniqueness of the set of fundamental constants. This set may arise as the only solution for the observability of the world, which, however, can possibly change through the evolutionary process of the whole universe. The parameter β is the statement that cannot be proven or disproven within the formal system. It is the Gödel statement organizing the system as a whole and determining the way of its closure to efficient causation. The notion of β incorporates the arithmetic system into a more general semiotic structure. The future development of the theory of computation should show that the incompleteness of the formal representation of the Peircean triangle leads to new semiotic formulations of Gödel's incompleteness theorems. The first steps in this direction have been made in the formulation of the theorem that an autopoietic machine necessarily includes Gödel statements (Simeonov, 2002). When the system generates Gödel statements, it becomes temporarily complete (closed to efficient causation) in relation to its previous state before the formulation of such statements. The relative character of such a closure is reflected in Gödel's second incompleteness theorem, stated as follows: "For any formal recursively enumerable (i.e. effectively generated) theory T including basic arithmetical truths and also certain truths about formal provability, T includes a statement of its own consistency if and only if T is inconsistent." The Gödel statements show themselves in the evolving network of non-equivalent observers (Igamberdiev, 2008).
Conclusion: Rosen's "Central Result" and Computability

Rosen's central result is similar to Gödel's (the encoding of statements about the system within the system itself), applied to biological systems. The β parameter representing organizational invariance is equivalent to a set of Gödel numbers generated within the system. The whole biological system can be viewed as triadic in the Peircean sense. It consists of a) Metabolism – sets and relations; b) Replacement – relations on relations; and c) Organizational invariance – Gödel statements about the sets and relations. It is important that Gödel statements are not sets or relations; they are metamathematical statements with a dual function, facing both sets and relations. So the triadic structure includes sets, relations, and metamathematical statements (encoded within the system). Translation of these statements into the system occurs via a sophisticated "machinery" of transcription, translation and recombination (e.g. splicing) events. Rosen's structure includes all possible efficient and material causes entailed within the system. In addition to this, and making the final ultimate closure, the arrow forming the tetrahedron (Figure 3) establishes the link between the ultimate interpretant and the initial object. This link establishes the final cause for the initial object. In this total structure, all initially external effects are internalized.
Knowing that the material and efficient causes are incorporated in the (M,R) system, we can move to the analysis of formal causes. The formal cause is the shape of this (M,R) structure as a whole. The system possessing its own organizational invariance is embodied in the physical world as a certain shape (a spatiotemporal pattern). This shape imposes limits on its evolution and thus determines the nomothetic aspect of evolutionary development. Three-dimensional structures can be viewed as solutions (projections into physical space) for the existence of triadic (M,R) structures. It should be assumed from the outset that within pure mathematics we should not search for a substantiation of the fundamental constants. They appear as conditions of the embodiment of Logos into the physical world. Fundamental constants can be viewed as ultimate Gödel numbers in the global system that close the physical world. Any Gödel number obtained can be uniquely factored into prime factors. The physical temporal world evolves by perpetually solving its paradoxes via the creative generation of sets of organizational invariance possessing their own internal time, which determines the limit for anticipation before global system failure can occur. This is the time which itself counts, not the time by which we count (according to Aristotle). The triadic Peircean structure of the physical world is essentially the same as the structure described in Aristotle's treatise De Anima, namely:

Matter – Entelechy as knowledge – Entelechy as an exercise of knowledge (i.e. the activity of recognition)

The matter corresponds to the set of metabolisms, the entelechy-as-knowledge is the set of replacements (e.g. the genetic system), and the entelechy-as-activity-of-recognition is the organizational invariance within the frames of the stable non-equilibrium state. In fact, Plato did not develop in detail the principles of the physical representation of eidoi; Aristotle did this in his theory of four causes and by introducing entelechy as a measure of information. The emergence of computational properties in semiotic systems is a kind of introduction of mathematics into the real world, by which the world becomes comprehensible. The paradox is that this introduction of mathematics can indeed be described mathematically, using a non-trivial formal metamathematical language. Einstein stated that the eternal mystery of the world is its comprehensibility. The same is said by the Apostle Paul; in his words we can see the notion of the implementation of Logos into the physical world observable by living systems as a "hidden wisdom which God predestined before the ages to our glory" (1 Corinthians 2:7).
References

Aristotle (1984). The Complete Works of Aristotle, ed. by J. Barnes. Princeton, NJ: Princeton University Press.
Bauer, E.S. (1935). Theoretical Biology. Moscow-Leningrad: VIEM.
Beil, R.G. & Ketner, K.L. (2003). Peirce, Clifford, and quantum theory. International Journal of Theoretical Physics 42, 1957-1972.
Beil, R.G. (2004). Peirce, Clifford, and Dirac. International Journal of Theoretical Physics 43, 1301-1315.
Berg, L.S. (1969). Nomogenesis. Cambridge: MIT Press. [Originally published in 1922]
Bergson, H. (1917). L'Évolution Créatrice. Paris: Alcan.
Borges, J.L. (1970). The Aleph & Other Stories 1933-1969. Boston: Dutton.
Böttner, M. (2001). Peirce grammar. Grammars 4, 1-19.
Brink, C., Britz, K. & Schmidt, R.A. (1994). Peirce algebras. Formal Aspects of Computing 6, 339-358.
Chargaff, E., Crampton, C.F. & Lipshitz, R. (1953). Separation of calf thymus deoxyribonucleic acid into fractions of different composition. Nature 172, 289-292.
Christiansen, P.V. (1985). The Semiotics of Quantum Non-locality (Tekster fra IMFUFA [ISSN 0106-6242], text number 93). University of Roskilde, Roskilde. Available at http://mmf.ruc.dk/~PVC/.
Conrad, M. & Liberman, E.A. (1982). Molecular computing as a link between biological and physical theory. Journal of Theoretical Biology 98, 239-252.
Cornish-Bowden, A. & Cárdenas, M.L. (2007). Organizational invariance in (M,R)-systems. Chemistry and Biodiversity 4, 2396-2406.
Cornish-Bowden, A., Cárdenas, M.L., Letelier, J.C. & Soto-Andrade, J. (2007). Beyond reductionism: metabolic circularity as a guiding vision for a real biology of systems. Proteomics 7, 839-845.
Crick, F.H. (1963). On the genetic code. Science 139, 461-464.
De Rijke, M. (1995). The logic of Peirce algebras. Journal of Logic, Language and Information 4, 227-250.
Eigen, M. & Schuster, P. (1979). The Hypercycle: A Principle of Natural Self-Organization. Berlin: Springer.
Freud, S. (1976). Vorlesungen zur Einführung in die Psychoanalyse. Frankfurt a. M.: S. Fischer Verlag.
Gamow, G. (1954). Possible relation between deoxyribonucleic acid and protein structures. Nature 173, 318.
Goldblatt, R. (1979). Topoi: The Categorial Analysis of Logic. Amsterdam: North Holland.
Gusev, V.A. & Schulze-Makuch, D. (2004). Genetic code: lucky chance or fundamental law of nature? Physics of Life Reviews 1, 202-229.
Hirsch, R. (2007). Peirce algebras and Boolean modules. Journal of Logic and Computation 17, 255-283.
Igamberdiev, A.U. (1992). Organization of biosystems: a semiotic approach. In: Biosemiotics. A Semiotic Web 1991 (=Approaches to Semiotics 106) (T. Sebeok & J. Umiker-Sebeok, eds.). Berlin: Mouton de Gruyter, pp. 125-144.
Igamberdiev, A.U. (1993). Quantum mechanical properties of biosystems: a framework for complexity, structural stability, and transformations. BioSystems 31, 65-73.
Igamberdiev, A.U. (1996). Life as self-determination. In: Defining Life: The Central Problem in Theoretical Biology (M. Rizzotti, ed.). Padova: Padova University Publishers, pp. 129-148.
Igamberdiev, A.U. (1999a). Foundations of metabolic organization: coherence as a basis of computational properties in metabolic networks. BioSystems 50, 1-16.
Igamberdiev, A.U. (1999b). Semiosis and reflectivity in life and consciousness. Semiotica 123, 231-246.
Igamberdiev, A.U. (2004). Quantum computation, non-demolition measurements, and reflective control in living systems. BioSystems 77, 47-56.
Igamberdiev, A.U. (2007). Physical limits of computation and emergence of life. BioSystems 90, 340-349.
Igamberdiev, A.U. (2008). Objective patterns in the evolving network of non-equivalent observers. BioSystems 92, 122-131.
Igamberdiev, A.U. & Kleczkowski, L.A. (2003). Membrane potential, adenylate levels and Mg2+ are interconnected via adenylate kinase equilibrium in plant cells. Biochimica et Biophysica Acta – Bioenergetics 1607, 111-119.
Kauffman, L.H. (2001). The mathematics of Charles Sanders Peirce. Cybernetics & Human Knowing 8, 79-110.
Kull, K. (1998). On semiosis, Umwelt, and semiosphere. Semiotica 120, 299-310.
Lacan, J. (1970). Écrits. Paris: Seuil.
Lefebvre, V.A. (1990). The Fundamental Structures of Human Reflexion. New York: Peter Lang.
Letelier, J.-C., Soto-Andrade, J., Abarzúa, F.G., Cornish-Bowden, A. & Cárdenas, M.L. (2006). Organizational invariance and metabolic closure: analysis in terms of (M,R) systems. Journal of Theoretical Biology 238, 949-961.
Liberman, E.A. (1989). Molecular quantum computers. Biofizika 34, 913-925.
Maturana, H.R. & Varela, F.G. (1980). Autopoiesis and Cognition. Dordrecht: Reidel.
Meyen, S.V. (1977). Taxonomy and meronomy. In: Methodological Problems in Geological Sciences. Kiev: Naukova Dumka, pp. 36-33 (in Russian).
Meyen, S.V. (1978). Principal aspects of organism typology. Zhurnal Obshchei Biologii [Journal of General Biology] 39, 495-508 (in Russian).
Nomura, T. (2007). Category theoretical distinction between autopoiesis and (M,R) systems. In: F. Almeida e Costa et al. (eds.), ECAL 2007, LNAI 4648. Berlin: Springer, pp. 465-474.
Pattee, H.H. (1972). Laws and constraints, symbols and languages. In: C.H. Waddington (ed.), Towards a Theoretical Biology 4, Essays. Edinburgh: Edinburgh University Press, pp. 248-258.
Peirce, C.S. (1955). Philosophical Writings of Charles Sanders Peirce. J. Buchler (ed.). New York: Dover Publications. [Originally published in 1906]
Piaget, J. (1955). The Child's Construction of Reality. London: Routledge and Kegan Paul.
Richardson, I.W. & Rosen, R. (1979). Aging and the metrics of time. Journal of Theoretical Biology 79, 415-423.
Root-Bernstein, R.S. (1982). Amino acid pairing. Journal of Theoretical Biology 94, 885-894.
Rosen, R. (1967). Optimality Principles in Biology. New York: Plenum Press.
Rosen, R. (1972). Some relational cell models: the metabolism-repair systems. In: R. Rosen (ed.), Foundations of Mathematical Biology, Volume II. New York: Academic Press.
Rosen, R. (1978a). Fundamentals of Measurement and Representation of Natural Systems. New York: Elsevier.
Rosen, R. (1978b). Feedforwards and global system failure: a general mechanism for senescence. Journal of Theoretical Biology 74, 579-590.
Rosen, R. (1978c). Cells and senescence. International Review of Cytology 54, 161-191.
Rosen, R. (1980). On control and optimal control in biodynamic systems. Bulletin of Mathematical Biology 42, 889-897.
Rosen, R. (1985). Anticipatory Systems. New York: Pergamon Press.
Rosen, R. (1991). Life Itself. New York: Columbia University Press.
Rosen, R. (2000). Essays on Life Itself. New York: Columbia University Press.
Sharov, A.A. (1995). Analysis of Meyen's typological concept of time. In: On the Way to Understanding the Time Phenomenon: The Constructions of Time in Natural Science. Part 1. Interdisciplinary Time Studies. Singapore: World Scientific, pp. 57-67.
Shcherbak, V. (2003). Arithmetic inside the universal genetic code. BioSystems 70, 187-209.
Shcherbak, V. (2008). The arithmetical origin of the genetic code. In: M. Barbieri (ed.), The Codes of Life: The Rules of Macroevolution. Dordrecht: Springer, pp. 153-185.
Siemion, I.Z. (1994). Compositional frequencies of amino acids in the proteins and the genetic code. BioSystems 32, 163-170.
Simeonov, P.L. (2002). The Viator approach: about four principles of autopoietic growth on the way to future hyperactive network architectures. Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS-02), 1530-2075/02.
Tegmark, M. (2007). The mathematical universe. Foundations of Physics 38, 101-150.
Vavilov, N.I. (1922). The law of homologous series in variation. Journal of Genetics 12, 47-89.
Vygotsky, L.S. (1962). Thought and Language. Cambridge, MA: MIT Press.
Wolkenhauer, O. & Hofmeyr, J.-H.S. (2007). An abstract cell model that describes the self-organization of cell function in living systems. Journal of Theoretical Biology 246, 461-476.
Zotin, A.I. & Alekseeva, T.A. (1984). Rubner's constant as a criterion of species' life duration. Fiziologicheskii Zhurnal [Russian Journal of Physiology] 30, 59-64.
Zurek, W.H. (2003). Decoherence, einselection, and the quantum origins of the classical. Reviews of Modern Physics 75, 715-775.
Chapter 14
EXTRACTION OF POSITION-SENSITIVE PROMOTER CONSTITUENTS

Yoshiharu Y. Yamamoto∗ and Junichi Obokata
Center for Gene Research, Nagoya University, Nagoya 464-8602, Japan
∗Email address: [email protected]
Abstract

Extraction of functional sequences from promoters has been achieved with alignment-based motif search algorithms, but recently several new approaches have been developed. In this review, I would like to introduce a methodology called the LDSS (Local Distribution of Short Sequences) analysis. This approach evaluates the distribution profiles of short sequences along the promoter region, and sequences that preferentially appear at specific promoter regions are extracted. Application of this strategy to human, mouse, Drosophila, C. elegans, Arabidopsis, rice, and also yeast resulted in successful extraction of both well-known promoter elements and novel putative elements. The method is so sensitive that various kinds of minor elements can be detected simultaneously by analyzing all the promoters of a genome as one batch. However, the LDSS analysis does not detect all the elements involved in transcriptional regulation: position-insensitive elements are out of the range of the analysis. Because the analysis requires no microarray data, it can be applied to a wide range of species beyond the model organisms.
Introduction

Characteristics of Pol II-Dependent Promoters

The genome sequence encodes not only transcribed regions but also their promoters, which control the timing, frequency, direction and position of transcriptional initiation. A typical promoter is 0.5 to 2 kbp long, judged by the capability of the promoter fragment to fully reproduce the original expression profile in reporter fusion experiments. Functional elements scattered within the promoter are 5 to 15 bp long (Fickett and Hatzigeorgiou, 1997), and they recruit proteinaceous transcription-related factors, including RNA polymerases,
transcriptional activators and repressors, and chromatin remodeling factors, via sequence-specific DNA-protein interactions. Sequences between functional elements are considered non-functional spacers (McKnight and Kingsbury, 1982). Therefore, the information of a promoter is embedded in a patchy manner. A pol II-dependent promoter is divided into two domains: the core promoter region (-40 to +50 bp relative to the transcription start site (TSS)), which provides binding sites for the preinitiation complex, and the regulatory region (-500 to -40), which supports transcriptional regulation (Fig. 1) (Carey and Smale, 2001). Because the core region is recognized by the general transcription machinery, only a few types of elements constitute this region, such as the TATA box and the Initiator (Inr) motif. This region determines the direction, position, and frequency of transcriptional initiation. The regulatory region contains functional cis-regulatory elements that control the timing and frequency of transcriptional initiation, and the sequences of cis-elements are very diverse, reflecting the wide variety of trans-acting factors, in contrast to the core elements. In the case of human, nearly one thousand trans-factors are registered in Transfac (Wingender et al., 2001). In some cases, transcription is also controlled by enhancers located outside of the two promoter regions, such as in the far upstream region, an intron, or the downstream region (Blackwood and Kadonaga, 1998; Carey and Smale, 2001; Visel et al., 2007).
[Figure 1 schematic: the TSS marks the start of the coding region; an enhancer is position-independent; the regulatory region (-500 to -40) contains diverse types of elements; the core region (-40 to +50) contains a few conserved types of elements.]
Figure 1. Promoter domains. The promoter can be divided into core and regulatory regions. Enhancers are found upstream of the regulatory region, in introns, or in the downstream region. More information is summarized by Carey and Smale (2001).
In addition to these functional promoter elements, the positioning of nucleosomes also affects transcription. Depletion of nucleosomes at the core promoter region is observed in yeast (Segal et al., 2006) and human (Ozsolak et al., 2007). This feature would also be reflected in the promoter sequence (Ganapathi et al., 2007). Because of technical limitations, experimental identification of all the cis-elements in a genome has not been achieved. This situation raises the importance of computational approaches for the identification of all the cis-elements. Overviews of methodologies for motif discovery from promoters can be found in several books (e.g., Brejová et al., 2003; Jones and Pevzner, 2004).
Strategies for Motif Extraction from Promoters

Development of microarray technology provided large lists of genes that show co-regulated expression. Because co-regulation is thought to be achieved through shared cis-regulatory element(s), these lists can be used for the extraction of novel cis-regulatory elements
responsible for the co-regulation. Therefore, the availability of microarray data promoted the development of methodologies for motif extraction from promoter sequences. Gibbs Sampling (Lawrence et al., 1993; Roth et al., 1998; Hughes et al., 2000) and MEME (Bailey and Elkan, 1995; Bailey et al., 2006) are well-established procedures for alignment-based discovery of a common motif among a set of co-regulated promoters (Fig. 2). As illustrated in the figure, these methods start with a random model (motif) and refine it through cycles of 1) detection of short sequences fitting the motif and 2) reconstruction of the motif according to the detected sequences.
[Figure 2 schematic: a set of co-regulated promoters aligned at the TSS; "Search for motif positions & selection" alternates with "Describe a motif of selected sequences", yielding a modified motif for the next cycle. The bottom of the figure shows the following position weight matrix.]

Motif position   1    2    3    4    5     6
A                0    1    0    1    0.4   0.7
C                0    0    0    0    0.05  0.05
G                0    0    0    0    0.05  0.05
T                1    0    1    0    0.5   0.2
Figure 2. Alignment-based motif discovery. Top: a set of co-regulated promoters is subjected to motif search. Diamonds of the same color are related short sequences with high homology, but they do not necessarily represent functional promoter elements. Middle: short sequences matching the motif in search are picked up from the promoters with a threshold. Bottom: the aligned sequence group defines the motif for the next search. Repeating the process refines the motif. A motif is described as a consensus sequence, a position weight matrix (the table above), or an HMM model. The methods used to set a (random) seed motif, describe a motif, search among promoters, and refine the motif differ depending on the algorithm. Gibbs Sampling and MEME follow this scheme.
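The cycle of Figure 2 can be sketched as follows. This is a simplified illustration of the general alignment-based scheme (pick the best-matching site in each promoter, rebuild the position weight matrix, repeat), not the actual Gibbs Sampling or MEME implementations, which are probabilistic and use background models and log-odds scoring.

```python
import random
from collections import Counter

BASES = "ACGT"

def build_pwm(sites, width, pseudo=0.5):
    """Rebuild a position weight matrix from the currently selected sites."""
    pwm = []
    for j in range(width):
        counts = Counter(site[j] for site in sites)
        total = len(sites) + 4 * pseudo
        pwm.append({b: (counts[b] + pseudo) / total for b in BASES})
    return pwm

def score(pwm, window):
    """Product of position-specific probabilities for one candidate site."""
    p = 1.0
    for j, base in enumerate(window):
        p *= pwm[j][base]
    return p

def refine_motif(promoters, width=6, iterations=20):
    """Seed with one random site per promoter, then alternate site selection
    and PWM reconstruction (promoters are assumed longer than the width)."""
    sites = []
    for s in promoters:
        i = random.randrange(len(s) - width + 1)
        sites.append(s[i:i + width])
    for _ in range(iterations):
        pwm = build_pwm(sites, width)
        sites = [max((s[i:i + width] for i in range(len(s) - width + 1)),
                     key=lambda w: score(pwm, w))
                 for s in promoters]
    return build_pwm(sites, width), sites
```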
The output quality of this strategy depends in part on the selection of promoters. With respect to an analyzed motif, selecting too large a promoter population, with a majority of motif-negative promoters, may cause the motif detection to fail, because this method is not good at detecting minor motifs. Besides this feature, one major problem of this strategy is the extraction of unspecific motifs that are not regulatory elements but represent common core promoter elements or spacer sequences. Genome sequence is not neutral but biased, even in non-functional intergenic regions. Therefore, it is not very surprising that spacer sequences contain sequence motifs that are over-represented in the promoter region. These false positives of motif extraction cannot be excluded by this strategy alone, but they can be removed by parallel analysis of an appropriate sequence set as a negative control, or by combination with the following strategy. Another approach is the detection of overrepresented sequences in a co-regulated promoter set over a reference set (van Helden et al., 1998; Marino-Ramirez et al., 2004). This strategy can be applied alone, in an enumerative fashion over short sequences of fixed length, or as a downstream process of the alignment-based motif extraction; a sketch is given after this paragraph. Microarray data is sometimes substituted by a promoter set that is expected to be regulated in the same manner according to biological expectation, e.g., a gene family or genes involved in the same biological process. However, if the expectation is not supported by expression data, extraction will be less accurate than microarray-based analysis. Another substitution is the utilization of chromatin immunoprecipitation (ChIP) data. ChIP is a method that can identify the in vivo target loci of a DNA-binding factor with the aid of an appropriate antibody. Identification of the precipitated DNA fragments is often achieved by tiling array hybridization or by sequencing of the precipitated fragments; in both cases, DNA sequences several hundred bp long are obtained as the target fragments of the analyzed factor. The advantage of motif extraction from co-regulated promoter sequences is the easy association of the extracted motifs with biological information. On the other hand, its limitation is that one co-regulated promoter set can serve as a source of only quite a limited number of motifs. Therefore, in order to extract the many cis-elements that are used in a genome, a large number of promoter sets are required. For example, a promoter set that responds to a glucocorticoid hormone can serve only for motif extraction of the hormone-responsive elements and, potentially, elements for its downstream pathways. In order to extract responsive elements for other hormones, developmental processes, or environmental responses, other promoter sets based on additional microarray experiments are required. Because the preparation of promoter sets depends on microarray experiments, there is a practical limitation on their preparation. Because of this feature, this strategy can detect only a limited part of the cis-regulatory elements of a genome. This is also true for the ChIP-aided motif extraction.
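A minimal enumerative sketch of the overrepresentation approach referred to above is given below. It compares k-mer frequencies between a target promoter set and a reference set using a simple frequency ratio with a pseudocount; tools in the spirit of van Helden et al. (1998) use proper statistics (e.g., binomial significance), so the fixed ratio threshold here is an illustrative assumption.

```python
from collections import Counter

def kmer_counts(seqs, k):
    """Count every k-mer occurrence in a set of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def overrepresented(target_seqs, reference_seqs, k=6, min_ratio=3.0):
    """Enumerate k-mers enriched in the target promoters over a reference set."""
    target = kmer_counts(target_seqs, k)
    reference = kmer_counts(reference_seqs, k)
    t_total = sum(target.values())
    r_total = sum(reference.values())
    hits = {}
    for kmer, n in target.items():
        t_freq = n / t_total
        r_freq = (reference.get(kmer, 0) + 1) / (r_total + 1)  # pseudocount
        ratio = t_freq / r_freq
        if ratio >= min_ratio:
            hits[kmer] = ratio
    return dict(sorted(hits.items(), key=lambda kv: -kv[1]))
```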
This strategy utilizes locally conserved regions at promoters among multiple genomes as candidates for motif extraction under the assumption of motif conservation between the compared genomes. While this approach does not require experiments, there are two disadvantages: difficulty of association of biological information with the extracted motifs, and no availability of a single genome-specific information that is not conserved among multiple genomes.
LDSS Strategy

In this review, I would like to focus on another strategy for the extraction of promoter elements, called LDSS analysis, for Local Distribution of Short Sequences. Examination of the positional distribution of several cis-regulatory elements revealed that they tend to appear at specific promoter positions relative to the TSS. Elkon et al. reported that several cell cycle-regulated cis-elements, including binding sites for NRF-1, Sp1, NF-Y, CREB, ATF, and E2F, all tend to concentrate in the proximity of the TSS (-200 to -1, relative to the TSS) within the examined area (-1200 to -1) of 568 cell cycle-regulated promoters (Elkon et al., 2003). Localized distribution was also observed in an analysis of E2F-binding sites in Arabidopsis promoters (Ramirez-Parra et al., 2003). These studies evoked the idea that localized distribution is a signature of a functional promoter element. This idea is also supported by a large-scale deletion analysis of human promoters that suggests some relationship between the presence of functional elements and the distance from the TSS (Cooper et al., 2006). Following the above idea, FitzGerald et al. extracted all the octamer sequences that showed localized distribution in the promoter region (FitzGerald et al., 2004). They prepared 13,010 human promoter sequences aligned by experimentally determined TSSs, and analyzed the accumulation profile of each octamer sequence according to promoter position. After examination of all 65,536 (= 4⁸) possible octamer sequences, they extracted the sequences that showed localized distribution. As a result, they found that the extracted 159 octamers that "cluster" in the promoter region include many known motifs, such as various cis-regulatory elements, the TATA box, and the Kozak sequence (a translational initiation motif), demonstrating successful detection of functional promoter elements by this strategy. In addition, they also detected octamers that are putative novel cis-regulatory elements. The same strategy was also successfully applied to plant promoters, and this approach was named LDSS (Local Distribution of Short Sequences) analysis (Yamamoto et al., 2007b). Several other applications have also been reported: detection of the human TATA box (Shi and Zhou, 2006), of cis-regulatory elements and the translational initiation motif of Arabidopsis, S. cerevisiae, and C. elegans (Berendzen et al., 2006), and of various promoter constituents of Arabidopsis, rice, human, and mouse (Yamamoto et al., 2007a). The strategy is also incorporated in the AGLAM package that extracts cis-elements (Tharakaraman et al., 2005).
Procedures of LDSS Analysis

An outline of the LDSS analysis is illustrated in Figure 3. The information necessary prior to the analysis is 1) the genome sequence and 2) experimentally identified TSS. Other information, such as microarray data, is not necessary for the extraction step. Therefore, the analysis is easily applicable to non-model organisms for which functional genomics information is limited. Initially, promoter sequences of the same length (typically 1 kbp) are cut out and aligned, anchoring the alignment on each promoter's TSS (Fig. 3, top). Then, the appearance of short sequences of fixed length is counted in an enumerative manner, and the accumulation in the promoter set is summarized in a position-sensitive manner (Fig. 3, middle).
Figure 3. Outline of the LDSS analysis. Top: All the available promoter sequences in a genome are subjected to the analysis. Sequences are aligned at the experimentally determined TSS (red arrow). Diamonds of the same color represent the same short sequence. Middle: Accumulating the appearances according to promoter position gives a distribution. The green sequence is found in all the promoters at similar locations. The green and red sequences show localized distribution, but the blue one does not. Therefore, the green and red sequences are LDSS-positive and the blue one is negative. Bottom: Extracted positive sequences are summarized.
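As a minimal sketch of the counting step just outlined, the following Python fragment accumulates position-wise octamer profiles; the list `promoters` of TSS-aligned 1-kbp sequences and the octamer length are assumptions for illustration.

```python
# Hedged sketch: accumulate position-wise occurrence profiles of octamers
# over promoters that are already aligned at the TSS (positions -1000..-1).
import numpy as np

def positional_profiles(promoters, k=8, length=1000):
    profiles = {}                                  # k-mer -> counts per position
    for seq in promoters:
        for pos in range(length - k + 1):
            word = seq[pos:pos + k]
            if 'N' in word:                        # skip ambiguous bases
                continue
            if word not in profiles:
                profiles[word] = np.zeros(length - k + 1, dtype=int)
            profiles[word][pos] += 1
    return profiles
```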
Analysis with longer sequences gives higher sequence specificity but, at the same time, results in fewer occurrences, leading to higher statistical fluctuations. Considering the balance between sequence specificity and statistical reliability, octamer analysis would be the first choice. For small genomes with only a few thousand promoters, heptamer or hexamer analysis might give better results than octamers. The next step is measurement of the localization strength and selection of positive sequences. To achieve this, several parameters can be utilized (Fig. 4). The figure shows two possible ways of measurement: direct and model-based. There are several reports on direct measurement. FitzGerald et al. introduced CF, for Clustering Factor, calculated as CF = (Peak_height − base_line)/SD, together with a CF-based P value (FitzGerald et al., 2004). In addition to height-based measurement, we utilized area-based assessment (Yamamoto et al., 2007b), and the corresponding P value turned out to correlate well with the functionality of mutated TATA boxes (Yamamoto et al., 2007a). In our experience, height-based measurement is good at detecting narrow, sharp peaks such as the TATA box and Inr, while area-based analysis is superior for detecting wide, broad peaks such as regulatory elements. Another way to evaluate localization is to determine Shannon's entropy of the distribution curve.
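As a rough illustration of the direct measurement, the sketch below computes a CF-like value and an area-based value for one positional profile; the smoothing window and the use of the profile flanks to estimate the base line are assumptions, not the published procedures.

```python
# Hedged sketch of the direct measurement in Fig. 4A.
import numpy as np

def localization_scores(profile, window=11, flank=100):
    kernel = np.ones(window) / window
    smooth = np.convolve(profile, kernel, mode='same')   # sliding-average smoothing
    # Assumption: estimate the base line and its SD from the profile flanks.
    edges = np.concatenate([smooth[:flank], smooth[-flank:]])
    base, sd = edges.mean(), edges.std()
    cf = (smooth.max() - base) / sd if sd > 0 else 0.0   # CF of FitzGerald et al.
    area = np.clip(smooth - base, 0, None).sum()         # area-based alternative
    return cf, area
```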
Figure 4. Parameters for peak evaluation. A. Direct measurement. Data can be smoothed with a sliding average window of 3 to 20 bp before measurement. Several parameters useful for evaluating localization strength are shown by red arrows: peak position, peak height, peak width, peak area, and base line. B. Measurement of a fitted model. The observed accumulation data (black solid line) are fitted with a Gaussian model (red). The validity of the curve fitting is assessed by the difference between the observed data and the model (gray). The model curve provides the peak position, peak height, half width, and base line (red arrows).
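To make the curve fitting of panel B concrete, a sketch using SciPy is given below; the position grid, initial guesses, and Gaussian-plus-baseline model are illustrative assumptions.

```python
# Hedged sketch of the model-based measurement in Fig. 4B.
import numpy as np
from scipy.optimize import curve_fit

def gauss(x, height, center, width, base):
    return base + height * np.exp(-((x - center) ** 2) / (2.0 * width ** 2))

def fit_peak(profile, positions=None):
    """profile: np.ndarray of positional counts for one short sequence."""
    if positions is None:
        positions = np.arange(-len(profile), 0)          # positions relative to TSS
    p0 = [profile.max() - profile.min(),                 # initial height
          positions[profile.argmax()], 50.0,             # initial center and width
          profile.min()]                                 # initial base line
    params, _ = curve_fit(gauss, positions, profile, p0=p0)
    residual = profile - gauss(positions, *params)       # used to judge the fit
    return params, residual                              # height, center, width, base
```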
Height-based analysis such as CF utilizes information on the base line, its SD, and the peak height, but the sub-peak fractions flanking the peak are not evaluated. This problem is partially and fully resolved by the entropy- and area-based evaluation methods, respectively. As for the model-based measurement (Fig. 4B), Gaussian fitting can be easily achieved with distributed software such as R, PAW, or IgorPro, and there are established methods to evaluate the validity of the model. The parameters of the fitted model can be used directly to measure the strength of localization. Concentration in the sub-peak fractions is also reflected in the model. The problem with this approach is a slight shift of the peak position due to the asymmetric shape of the distribution profiles. Indeed, most LDSS-positive profiles show a sharper decrease on the TSS-proximal side, which moves the peak position of the fitted model to the distal side of the TSS (Fig. 4B). Although no report applying model-based analysis has appeared, it is also a potentially useful method.

After extraction of LDSS-positive elements, classification according to distribution profiles is useful. This can be achieved by sorting by peak positions or by applying clustering techniques that evaluate distribution profiles over the entire promoter region (one possible realization is sketched below). The latter approach nicely distinguishes regulatory elements, the TATA box, Inr, and CpG islands, and novel core element groups were also identified by this classification (Yamamoto et al., 2007a). Related sequences in a group can be consolidated into a motif, or used directly for promoter annotation (Fig. 5).
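As one way (not the published implementation) to realize the clustering just mentioned, profiles could be grouped hierarchically by the correlation of their shapes:

```python
# Hedged sketch: classify LDSS-positive profiles by shape similarity.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def classify_profiles(profiles, n_groups=4):
    """profiles: dict of k-mer -> smoothed positional profile (np.ndarray)."""
    words = sorted(profiles)
    mat = np.array([profiles[w] / profiles[w].sum() for w in words])  # shape only
    tree = linkage(mat, method='average', metric='correlation')
    labels = fcluster(tree, t=n_groups, criterion='maxclust')
    return dict(zip(words, labels))                       # k-mer -> group label
```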
(Figure 5 panel: the At1g10960 ferredoxin precursor isolog promoter, positions -100 to +1, with mapped LDSS-positive octamers annotated as REG, TATA box, Y Patch, and YR Rule elements.)
Figure 5. Application of LDSS-positive octamers. Extracted LDSS-positive octamers can be directly used for promoter annotation. As shown in the figure, mapping of the octamers on a promoter reveals promoter structure. Genome-wide promoter annotation is available at Plant Promoter DataBase, ppdb (Yamamoto and Obokata, 2008).
The group of regulatory elements is composed of highly heterogeneous sequences with various biological roles. Because LDSS analysis itself does not provide any biological information about them, another type of analysis is required in order to associate the regulatory sequences with biological functions. Mapping the elements onto promoters, associating them with their target genes, and subsequently consulting GO or expression data provides a rough link to biological information. Identification of previously reported cis-regulatory elements can be achieved by consulting cis-element databases such as TRANSFAC (Matys et al., 2006) or PLACE (Higo et al., 1999).
Advantage and Limitation

The information necessary for applying LDSS analysis is the genomic sequence and large-scale TSS data; essentially no other information, such as microarray data, is needed. Because the preparation of TSS information is relatively easy, this analysis is also open to non-model organisms as long as these two types of data are available. Another advantage of LDSS analysis is the simultaneous detection of major and minor elements. This is demonstrated by the successful extraction of minor regulatory elements (REG) that appear in less than 1% of the analyzed promoter set, together with major core elements (Yamamoto et al., 2007b). Because of this feature, a single extraction from the whole promoter set of a genome is enough, and the preparation of promoter subsets to enhance detection sensitivity is not necessary. In addition, because the analysis does not involve genome comparison, information specific to a single genome can be detected.

Because of its strategy, LDSS analysis does not detect so-called long-range enhancer elements that regulate transcription over a long distance, 1 Mb in some cases, from their target promoters (Carter et al., 2002; Lettice et al., 2002). In addition, FitzGerald et al. reported that two sequences are severely underrepresented near the TSS, the binding sites of the zinc finger protein LYF1 (Ikaros) and of the HMG protein SRY (sex-determining region Y gene product), and that the majority of TF binding sites, including Myb, HSF2, and TRE, are uniformly distributed from -1,000 to 500 bp (FitzGerald et al., 2004). These results indicate that this strategy can detect only a subset of cis-regulatory elements. Analysis of Arabidopsis promoters also showed that about half of the reported regulatory elements could not be detected as LDSS-positive elements (Yamamoto et al., 2007b). For example, bZIP-binding elements were LDSS-positive, but Myb- and Myc-binding elements were negative (Yamamoto et al., 2007b). As for the core elements, BRE could not be detected from mammalian promoters (Yamamoto et al., 2007a). Because BRE functions together with the TATA box, BRE itself might not have sequence specificity. In any case, these examples document a fundamental limitation of LDSS analysis that cannot be overcome by improving detection sensitivity.

Some genomes cause additional difficulty for LDSS analysis. We encountered some trouble in promoter analysis of Physcomitrella patens, a moss. This genome contains many promoters derived from transposons or LTRs, and as a result, hundreds of promoters have quite similar sequences, resulting in high conservation of many promoter sequences beyond the functional elements. This situation makes many sequences of the transposon-derived promoters LDSS-positive, including many false positives. One solution is the exclusion of these promoters from the analysis.
Examples of LDSS Analysis

The TATA box

In 1987, Joshi aligned 79 published gene sequences from 15 plant species and extracted consensus sequences for the TATA box, Inr, and Kozak sequence (Joshi, 1987). The consensus sequence for the TATA box, TCACTATATATAG, was then used for more than a decade to search for the TATA box in newly determined plant promoters. Re-definition of the Arabidopsis TATA box with genome-wide information was recently performed with the aid of MEME and Gibbs sampling (Molina and Grotewold, 2005), giving the most common TATA sequence as TCTATATATA. Another genome-wide analysis, by LDSS, detected TATATATA as the most frequently observed TATA-related octamer in Arabidopsis and rice (Yamamoto et al., 2007b) (Tables S2 and S3). Although completely different methodologies were utilized, these results are quite consistent, demonstrating the reliability of LDSS analysis. Because the TATA box is a major core element found at a specific promoter position, it is one of the easiest elements to extract from promoters. In addition, it is highly conserved between the animal and plant kingdoms (Yamamoto et al., 2007a). Therefore, it is a good idea to try extraction of the TATA box first in order to detect any problems in the preparation of data and software.
REG subgroups

Distribution profiles can be used for the characterization of regulatory elements. Transcription factor binding sites are found not only in the promoter region but also in the first intron and the 3'-proximal region, and there is a regional preference depending on the factor (Blanchette et al., 2006). We analyzed the distribution profiles of promoter-localizing regulatory elements that are LDSS-positive (REG) (Yamamoto et al., 2007a) and found that human and mouse REGs can be further classified into 3 to 4 subgroups. Because different subgroups are rich in different motifs, each motif turned out to have a distinct distribution profile. This suggests functional differentiation of the subgroups with respect to their mode of involvement in transcriptional initiation. Another example is the analysis of CpG islands. Although CpG islands are well known as a core promoter element, their role in transcriptional initiation is poorly defined (Smale and Kadonaga, 2003). The exception is the Sp1-binding site, whose sequence is GC-rich (GGGCGG) (Kriwacki et al., 1992) and considered to be related to CpG islands (Smale and Kadonaga, 2003). However, analysis of the distribution profiles of CpG-related octamers and of sequences containing Sp1-binding sites revealed that they have distinct characteristics, indicating that Sp1 sites and CpG islands are distinct elements (Yamamoto et al., 2007a).
Detection of novel element groups

Extraction of LDSS-positive octamers from rice promoters and classification according to distribution profiles revealed putative core promoter elements that had not been reported previously (Yamamoto et al., 2007a). They include the Y Patch, CA, and GA groups. They are also found in Arabidopsis promoters, and their correlation with expression profiles suggests that they are actually involved in transcriptional initiation (YY Yamamoto, T Yoshitsugu, T Sakurai, M Seki, K Shinozaki and J Obokata, unpublished results). These examples demonstrate the discovery of previously unknown promoter element groups.
Differentiation of promoter architecture between plants and mammals

Parallel analysis of human, mouse, rice, and Arabidopsis promoters by LDSS enabled a comparison of core promoter architectures (Yamamoto et al., 2007a). The regulatory element group (REG), the TATA box, and Inr were found in all four species, judging from the conservation of the characteristics of these groups, such as peak position, distribution width, and direction sensitivity. However, comparison at the sequence level revealed that the REG and Inr groups are not as conserved as the TATA box, indicating greater differentiation of the former groups. In addition, plant- and mammal-specific core groups were identified. The Y Patch, GA, and CA groups are plant-specific, while the CpG-island and Sp1-related groups are specific to mammalian promoters. These results reveal both differentiation and conservation of core promoter architecture between plants and mammals.
Other Applications of LDSS Analysis

LDSS analysis has been developed for promoter studies, but it is applicable to other studies as a method to detect position-sensitive motifs. We applied it to the detection of polyadenylation signals and obtained reasonable results (Yamamoto YY, unpublished results).
Conclusion

LDSS analysis is a method for the discovery of position-dependent motifs. It is especially suitable for promoter analysis, and several successful applications have been reported in animal and plant studies. However, it cannot detect position-independent motifs, owing to the nature of the methodology.
References

Bailey, T.L. and Elkan, C. (1995) The value of prior knowledge in discovering motifs with MEME. Proc Int Conf Intell Syst Mol Biol, 3, 21-29.
Bailey, T.L., Williams, N., Misleh, C. and Li, W.W. (2006) MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res, 34, W369-373.
Berendzen, K.W., Stuber, K., Harter, K. and Wanke, D. (2006) Cis-motifs upstream of the transcription and translation initiation sites are effectively revealed by their positional disequilibrium in eukaryote genomes using frequency distribution curves. BMC Bioinformatics, 7, 522.
Blackwood, E.M. and Kadonaga, J.T. (1998) Going the distance: a current view of enhancer action. Science, 281, 60-63.
Blanchette, M., Bataille, A.R., Chen, X., Poitras, C., Laganiere, J., Lefebvre, C., Deblois, G., Giguere, V., Ferretti, V., Bergeron, D., Coulombe, B. and Robert, F. (2006) Genome-wide computational prediction of transcriptional regulatory modules reveals new insights into human gene expression. Genome Res, 16, 656-668.
Brejová, B., Vinar, T. and Li, M. (2003) Pattern discovery: methods and software. In Krawetz, S.A. and Womble, D.D. (eds.), Introduction to bioinformatics. Humana Press, Totowa, New Jersey, pp. 491-521.
Carey, M. and Smale, S.T. (2001) A primer on transcriptional regulation in mammalian cells. In Transcriptional regulation in eukaryotes. Cold Spring Harbor Laboratory Press, New York.
Carter, D., Chakalova, L., Osborne, C.S., Dai, Y.F. and Fraser, P. (2002) Long-range chromatin regulatory interactions in vivo. Nat Genet, 32, 623-626.
Cooper, S.J., Trinklein, N.D., Anton, E.D., Nguyen, L. and Myers, R.M. (2006) Comprehensive analysis of transcriptional promoter structure and function in 1% of the human genome. Genome Res, 16, 1-10.
Elkon, R., Linhart, C., Sharan, R., Shamir, R. and Shiloh, Y. (2003) Genome-wide in silico identification of transcriptional regulators controlling the cell cycle in human cells. Genome Res, 13, 773-780.
Fickett, J.W. and Hatzigeorgiou, A.G. (1997) Eukaryotic promoter recognition. Genome Res, 7, 861-878.
FitzGerald, P.C., Shlyakhtenko, A., Mir, A.A. and Vinson, C. (2004) Clustering of DNA sequences in human promoters. Genome Res, 14, 1562-1574.
Ganapathi, M., Singh, G.P., Sandhu, K.S., Brahmachari, S.K. and Brahmachari, V. (2007) A whole genome analysis of 5' regulatory regions of human genes for putative cis-acting modulators of nucleosome positioning. Gene, 391, 242-251.
Harbison, C.T., Gordon, D.B., Lee, T.I., Rinaldi, N.J., Macisaac, K.D., Danford, T.W., Hannett, N.M., Tagne, J.B., Reynolds, D.B., Yoo, J., Jennings, E.G., Zeitlinger, J., Pokholok, D.K., Kellis, M., Rolfe, P.A., Takusagawa, K.T., Lander, E.S., Gifford, D.K., Fraenkel, E. and Young, R.A. (2004) Transcriptional regulatory code of a eukaryotic genome. Nature, 431, 99-104.
Higo, K., Ugawa, Y., Iwamoto, M. and Korenaga, T. (1999) Plant cis-acting regulatory DNA elements (PLACE) database: 1999. Nucleic Acids Res, 27, 297-300.
Hughes, J.D., Estep, P.W., Tavazoie, S. and Church, G.M. (2000) Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol, 296, 1205-1214.
IgorPro. http://www.wavemetrics.com
Jones, N.C. and Pevzner, P.A. (2004) An introduction to bioinformatics algorithms. The MIT Press, Boston.
Joshi, C.P. (1987) An inspection of the domain between putative TATA box and translation start site in 79 plant genes. Nucleic Acids Res, 15, 6643-6653.
Kellis, M., Patterson, N., Endrizzi, M., Birren, B. and Lander, E.S. (2003) Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature, 423, 241-254.
Kriwacki, R.W., Schultz, S.C., Steitz, T.A. and Caradonna, J.P. (1992) Sequence-specific recognition of DNA by zinc-finger peptides derived from the transcription factor Sp1. Proc Natl Acad Sci U S A, 89, 9759-9763.
Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F. and Wootton, J.C. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208-214.
Lettice, L.A., Horikoshi, T., Heaney, S.J., van Baren, M.J., van der Linde, H.C., Breedveld, G.J., Joosse, M., Akarsu, N., Oostra, B.A., Endo, N., Shibata, M., Suzuki, M., Takahashi, E., Shinka, T., Nakahori, Y., Ayusawa, D., Nakabayashi, K., Scherer, S.W., Heutink, P., Hill, R.E. and Noji, S. (2002) Disruption of a long-range cis-acting regulator for Shh causes preaxial polydactyly. Proc Natl Acad Sci U S A, 99, 7548-7553.
Manson McGuire, A. and Church, G.M. (2000) Predicting regulons and their cis-regulatory motifs by comparative genomics. Nucleic Acids Res, 28, 4523-4530.
Marino-Ramirez, L., Spouge, J.L., Kanga, G.C. and Landsman, D. (2004) Statistical analysis of over-represented words in human promoter sequences. Nucleic Acids Res, 32, 949-958.
Matys, V., Kel-Margoulis, O.V., Fricke, E., Liebich, I., Land, S., Barre-Dirrie, A., Reuter, I., Chekmenev, D., Krull, M., Hornischer, K., Voss, N., Stegmaier, P., Lewicki-Potapov, B., Saxel, H., Kel, A.E. and Wingender, E. (2006) TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res, 34, D108-110.
McKnight, S.L. and Kingsbury, R. (1982) Transcriptional control signals of a eukaryotic protein-coding gene. Science, 217, 316-324.
Molina, C. and Grotewold, E. (2005) Genome wide analysis of Arabidopsis core promoters. BMC Genomics, 6, 25.
Ozsolak, F., Song, J.S., Liu, X.S. and Fisher, D.E. (2007) High-throughput mapping of the chromatin structure of human promoters. Nat Biotechnol, 25, 244-248.
PAW. http://wwwasd.web.cern.ch/wwwasd/paw/
Prakash, A. and Tompa, M. (2005) Discovery of regulatory elements in vertebrates through comparative genomics. Nat Biotechnol, 23, 1249-1256.
R. http://www.r-project.org
Ramirez-Parra, E., Frundt, C. and Gutierrez, C. (2003) A genome-wide identification of E2F-regulated genes in Arabidopsis. Plant J, 33, 801-811.
Roth, F.P., Hughes, J.D., Estep, P.W. and Church, G.M. (1998) Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat Biotechnol, 16, 939-945.
Segal, E., Fondufe-Mittendorf, Y., Chen, L., Thastrom, A., Field, Y., Moore, I.K., Wang, J.P. and Widom, J. (2006) A genomic code for nucleosome positioning. Nature, 442, 772-778.
Shi, W. and Zhou, W. (2006) Frequency distribution of TATA Box and extension sequences on human promoters. BMC Bioinformatics, 7 Suppl 4, S2.
Smale, S.T. and Kadonaga, J.T. (2003) The RNA polymerase II core promoter. Annu Rev Biochem, 72, 449-479.
Tharakaraman, K., Marino-Ramirez, L., Sheetlin, S., Landsman, D. and Spouge, J.L. (2005) Alignments anchored on genomic landmarks can aid in the identification of regulatory elements. Bioinformatics, 21 Suppl 1, i440-448.
van Helden, J., Andre, B. and Collado-Vides, J. (1998) Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol, 281, 827-842.
Visel, A., Bristow, J. and Pennacchio, L.A. (2007) Enhancer identification through comparative genomics. Semin Cell Dev Biol, 18, 140-152.
Wingender, E., Chen, X., Fricke, E., Geffers, R., Hehl, R., Liebich, I., Krull, M., Matys, V., Michael, H., Ohnhauser, R., Pruss, M., Schacherer, F., Thiele, S. and Urbach, S. (2001) The TRANSFAC system on gene expression regulation. Nucleic Acids Res, 29, 281-283.
Yamamoto, Y.Y., Ichida, H., Abe, T., Suzuki, Y., Sugano, S. and Obokata, J. (2007a) Differentiation of core promoter architecture between plants and mammals revealed by LDSS analysis. Nucleic Acids Res, 35, 6219-6226.
Yamamoto, Y.Y., Ichida, H., Matsui, M., Obokata, J., Sakurai, T., Satou, M., Seki, M., Shinozaki, K. and Abe, T. (2007b) Identification of plant promoter constituents by analysis of local distribution of short sequences. BMC Genomics, 8, 67.
Yamamoto, Y.Y. and Obokata, J. (2008) ppdb, a plant promoter database. Nucleic Acids Res, 36, D977-981.
In: Computational Biology: New Research Editor: Alona S. Russe
ISBN: 978-1-60692-040-4 © 2009 Nova Science Publishers, Inc.
Chapter 15
DECONVOLUTION OF POSITIONAL SCANNING SYNTHETIC COMBINATORIAL LIBRARIES: MATHEMATICAL MODELS AND BIOINFORMATICS APPROACHES Yingdong Zhao∗ and Richard Simon♣ Biometric Research Branch, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892
Abstract

Combinatorial peptide library technology has proven to be a powerful approach both to T-cell epitope determination and to the analysis of TCR specificity and degeneracy. During the past ten years, we have developed mathematical models and bioinformatics approaches for the deconvolution of positional scanning synthetic combinatorial libraries (PS-SCL). PS-SCL are composed of trillions of peptides systematically arranged in mixtures with defined amino acids at each position. Starting from the mathematical model built to deconvolute the spectrum of PS-SCL, we proposed a biometrical approach using the score matrix to systematically search protein databases for putative antigenic peptide candidates. We then evaluated the assumption of independent contributions of the side chains of the amino acids in peptides and applied more sophisticated machine learning algorithms to improve the prediction accuracy based on synthesized peptide data. Finally, we implemented the above approach in a web-based tool for searching protein databases for T-cell epitopes based on experimental data from PS-SCL; the website employs a powerful statistical analysis package, a relational database, and Java applets. Our work has provided a sound basis for PS-SCL data analysis and has proven efficient and successful for identifying target antigens and highly active peptide mimics.
∗ Email address: [email protected]
♣ Email address: [email protected]
1. Introduction

T cell epitopes are peptides degraded from foreign or self proteins that bind to the T cell receptor (TCR) in conjunction with a peptide-presenting major histocompatibility complex (MHC) molecule to activate T cells (Figure 1). Factors contributing to T cell recognition/activation include not only the affinity of the TCR for the MHC but also the affinity of the TCR for the peptide, as well as other biological and physiological factors. Identifying characteristic patterns of immunogenic peptide epitopes can provide fundamental information for understanding disease pathogenesis and etiology, and for the design of vaccines and therapeutic approaches to immune-mediated, infectious and neoplastic diseases.
Figure 1. The TCR recognition model: interactions between the T cell epitope, MHC molecule and T cell receptor.
Previous studies using combinatorial peptide libraries to identify biologically active peptides [Dooley and Houghten, 1993; Eichler et al., 1994; Houghten et al., 1999; Pinilla et al., 1999] have shown successful results. Positional scanning synthetic combinatorial libraries (PS-SCL) are composed of trillions of peptides systematically arranged in mixtures, each with a defined amino acid at one position (Figure 2). For example, decapeptide libraries consist of 200 mixtures in OX9 format, where O represents one of the 20 natural L-amino acids in a defined position and X represents all of the natural amino acids (with the exception of cysteine) at each of the remaining positions. Therefore, each mixture is composed of 3.04×10^11 peptides in equimolar concentration. Such libraries provide an unbiased way to study the interaction of the tri-molecular complex between TCR, MHC, and peptide. T-cell clones (TCC) are established from peripheral blood or cerebrospinal fluid (CSF) lymphomononuclear cells. The proliferation of the TCC in response to each peptide mixture in the PS-SCL is then tested and measured [Hemmer et al., 1999]. Figure 3 shows the proliferative response of a TCC induced by each mixture of the PS-SCL. Due to the complexity of the PS-SCL, it was technically impossible to fully utilize this technology without the development of quantitative methods for predicting the stimulatory potential of peptides based on data from these complex libraries. In the past ten years, we have developed a strategy that combines data acquisition from PS-SCL and biometric modeling, in conjunction with large-scale database searches, in order to better understand, measure and predict both specific and degenerate interactions between clonotypic T-cell receptors and MHC-peptide ligands [Zhao et al., 2001a]. The approach allows us to quantitatively describe the complex interactions of the tri-molecular antigen recognition complex and to accurately predict the spectrum of stimulatory ligands for individual T cell clones. We present here, for the first time, the mathematical derivation of the model to deconvolute the PS-SCL. This provides a theoretical basis for PS-SCL data analysis, which can play an important role in the recognition of antigenic peptide candidates. To take into account the interactions of the side chains of the amino acids within the peptide, we apply a machine learning algorithm to build a model with good prediction accuracy. We have implemented the above approach and designed a web-based tool, TEST (http://linus.nci.nih.gov/TEST.html), that incorporates analysis of combinatorial library T cell stimulation data and searches of genomic databases for peptides predicted to stimulate the T cell clone [Zhao et al., 2001b]. Peptides can be identified efficiently from database searches and ranked according to a score that is predictive of their stimulatory potency. This is an effective approach to identify stimulatory peptides for individual TCRs and to predict their actual stimulatory potency with relatively high accuracy.
Figure 2. Graphical display of the single position fixed PS-SCLs. The libraries consist of mixtures in OX9 format, where O represents one of the 20 natural L-amino acids in a defined position and X represents all of the natural amino acids at each of the remaining positions.
Figure 3. Proliferative response of T-cell clones to the 200 mixtures of a decapeptide PS-SCL in which each position has one defined amino acid (20 for each of the 10 positions; the single letter amino acid code is used). Proliferation is shown as CPM induced by each mixture of the PS-SCL (mean and standard deviations of duplicate wells).
2. Mathematical Model

2.1 PS-SCL with single defined position

An n-mer peptide PS-SCL with a single defined position can be represented as a mixture of peptides $X_1X_2\ldots O_{ij}\ldots X_n$, all of which contain amino acid $O_i$ in the $j$th position, for some indexing of the 20 natural L-amino acids. We will assume that when a T-cell clone is exposed to concentration $c$ of an n-mer peptide, the expected level of stimulation is proportional to

$$ f(c)\cdot\Big(\sum_{j=1}^{n}\sum_{i=1}^{20} S_{ij}\,x_{ij}\Big) \tag{1} $$

where $f(c)$ is an arbitrary function of the concentration of the peptide. Inclusion of an additive function of the concentration would not change our final result. $x_{ij}$ is defined as 1 if the $j$th position in the peptide is occupied by the $i$th amino acid, for some indexing of the amino acids. For a given peptide, each position $j$ has $x_{ij}=1$ for exactly one value of $i$.
The $S_{ij}$ are unknown values which represent the contribution to stimulation from having amino acid $i$ in position $j$ of a peptide. We assume that the components of a mixture do not interact, and hence the expected stimulation for a mixture is the sum of the expected stimulation levels of its components. If $x_{ij}^{(k)}$ denotes the value of $x_{ij}$ for the $k$th component of the mixture, and if $c_k$ denotes the concentration of the $k$th component, then the expected stimulation for the mixture is

$$ f(c_k)\cdot\Big(\sum_{j=1}^{n}\sum_{i=1}^{20} S_{ij} \sum_{k=1}^{N} x_{ij}^{(k)}\Big) \tag{2} $$

where $N$ is the number of components. For PS-SCL mixtures, all components are equimolar and hence (2) can be written

$$ f\!\Big(\frac{c}{N}\Big)\cdot \sum_{j=1}^{n}\sum_{i=1}^{20} S_{ij} \sum_{k=1}^{N} x_{ij}^{(k)} \tag{3} $$

where $c$ is the total concentration of the mixture. The last summation in (3) is the number of components for which $x_{ij}=1$.

Consider a PS-SCL mixture of n-mers in which amino acid $i^*$ is fixed in position $j^*$. When we examine the terms of expression (3) for which $j=j^*$, all of the $x_{ij^*}^{(k)}$ values are zero except those with $i=i^*$. How many terms are there of the form $x_{i^*,j^*}^{(k)}$? For a PS-SCL mixture of n-mers there are $N = 20^{\,n-1}$ such terms if all 20 amino acids are used in creating the mixtures. When we examine the terms of expression (3) for positions $j$ other than $j^*$, for each amino acid $i$ there are $N/20$ terms with $x_{ij}^{(k)}$ equal to 1. Hence expression (3) can be written as

$$ f\!\Big(\frac{c}{N}\Big)\cdot\Big\{N\cdot S_{i^*,j^*} + \frac{N}{20}\cdot\Big(\sum_{\substack{j=1\\ j\neq j^*}}^{n}\sum_{i=1}^{20} S_{ij}\Big)\Big\} \tag{4} $$

or

$$ R_{i^*,j^*} = \frac{N}{20}\cdot f\!\Big(\frac{c}{N}\Big)\cdot\Big\{20\cdot S_{i^*,j^*} + \sum_{\substack{j=1\\ j\neq j^*}}^{n}\sum_{i=1}^{20} S_{ij}\Big\} \tag{5} $$

where $R_{i^*,j^*}$ is the expected stimulation for a PS-SCL with position $j^*$ containing amino acid $i^*$.
2.2 Generalized model

In some cases PS-SCLs are constructed from less than the full list of amino acids. If $m$ denotes the number of amino acids used, then $N = m^{\,n-1}$. For a completely randomized mixture with no position fixed, the expected response is

$$ R_0 = \frac{N}{m}\cdot f\!\Big(\frac{c}{N}\Big)\cdot \sum_{j=1}^{n}\sum_{i=1}^{20} S_{ij} \tag{6} $$

Consequently, the generalization of (5) can be written as

$$ R_{i^*,j^*} - R_0 = \frac{N}{m}\cdot f\!\Big(\frac{c}{N}\Big)\cdot\Big\{m\cdot S_{i^*,j^*} - \sum_{i=1}^{m} S_{ij^*}\Big\} \tag{7} $$

Equation (7) can be further written as

$$ R_{i^*,j^*} - R_0 = N\cdot f\!\Big(\frac{c}{N}\Big)\cdot\Big\{S_{i^*,j^*} - \frac{1}{m}\sum_{i=1}^{m} S_{ij^*}\Big\} \tag{8} $$

$N\cdot f(c/N)$ is a constant related to concentration, while $\frac{1}{m}\sum_{i=1}^{m} S_{ij^*}$ is the average of the scores for all amino acids in each position. $R_{i^*,j^*}$ has a linear relationship with $S_{i^*,j^*}$. Hence, whether we use the stimulation index $R_{i^*,j^*}$ for the score matrix or $S_{i^*,j^*}$, it will not affect the rankings of the peptides.
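As a practical reading of equation (8), the background-subtracted responses can serve directly as a positional score matrix; the NumPy sketch below (array shapes and amino acid ordering are illustrative assumptions) scores candidate peptides this way.

```python
# Hedged sketch: use the stimulation indices R - R0 as a positional score
# matrix and score peptides additively, as justified by equation (8).
import numpy as np

AMINO = 'ACDEFGHIKLMNPQRSTVWY'
INDEX = {aa: i for i, aa in enumerate(AMINO)}

def score_matrix(R, R0):
    """R[i, j]: response with amino acid i fixed at position j; R0: random mixture."""
    return R - R0                      # linear in S[i, j] up to a concentration constant

def peptide_score(S, peptide):
    return sum(S[INDEX[aa], j] for j, aa in enumerate(peptide))
```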
3. Model Evaluation

The score matrix approach to evaluating peptides is based on an assumed independent contribution of each amino acid side chain in the peptide sequence to MHC binding and TCR binding. Indeed, the assumption of an independent contribution of each amino acid side chain in the peptide sequence to MHC binding has been used to develop quantitative methods that predict peptide binding to MHC alleles [Hammer et al., 1994; Mallios, 1994; Parker et al., 1994; Southwood et al., 1998]. We can explore the adequacy of this assumption using data for a dual position defined PS-SCL used by Dooley et al. [Dooley et al., 1997], which was designed to study the binding and in vitro activities of peptides with high affinity for the nociceptin/orphanin FQ receptor. An n-mer peptide PS-SCL containing two adjacent defined positions can be represented as $X_1X_2\ldots O_{i_1 j}O_{i_2 j+1}\ldots X_n$, in which the positions labeled $O$ are individually defined by the 20 natural L-amino acids. The number of combinations of 20 amino acids at two positions is 400. For example, 400 hexapeptide PS-SCL mixtures of each of the following types were studied: $[O_{i_1}1\,O_{i_2}2\,X_3X_4X_5X_6]$, $[X_1X_2\,O_{i_1}3\,O_{i_2}4\,X_5X_6]$, and $[X_1X_2X_3X_4\,O_{i_1}5\,O_{i_2}6]$.
Under the assumption of independent contributions of the amino acid side chains, the theoretical inhibition data for the dual position defined PS-SCL can be expressed as the sum of the inhibition measurements for the two corresponding single position mixtures. By comparing the predicted results based on the assumption of independence with the experimental dual position defined PS-SCL data, we can evaluate the assumption of independent contributions of the side chains. Figure 4 is a scatter plot of the inhibition data for the 1200 mixtures. The x-axis gives the theoretical value generated by our model, while the y-axis gives the experimental inhibition measurements. The Pearson correlation coefficient is 0.8. The comparison shows that the approximation is valid for the majority of the 1200 mixtures; however, there are deviations where the assumption of independent contributions of the amino acid side chains does not hold true.
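A minimal sketch of this comparison is given below; the data structures holding the single and dual position measurements are hypothetical, chosen only to make the additivity test explicit.

```python
# Hedged sketch: test additivity by predicting each dual-defined mixture
# as the sum of the two corresponding single-defined measurements.
import numpy as np

def evaluate_independence(single, dual):
    """single[i, j]: inhibition for amino acid i defined at position j;
    dual[(i1, j1, i2, j2)]: measured inhibition for the dual-defined mixture."""
    predicted, observed = [], []
    for (i1, j1, i2, j2), measured in dual.items():
        predicted.append(single[i1, j1] + single[i2, j2])
        observed.append(measured)
    return np.corrcoef(predicted, observed)[0, 1]   # Pearson correlation coefficient
```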
Figure 4. Scatter plot of the inhibition data (experimental measurement, y axis) for the dual position defined PS-SCL versus the theoretical data (x axis) generated from single position defined PS-SCL, in a project to study the binding and in vitro activities of peptides with high affinity for the nociceptin/orphanin FQ receptor.
4. Support Vector Machine

We also developed a Support Vector Machine (SVM)-based T-cell epitope prediction model [Zhao et al., 2003]. SVM theory provides an approach to minimizing a bound on the generalization error of a model rather than minimizing the mean-square error over the data set [Vapnik, 1995]. SVMs provide a framework for more sophisticated modeling that can take into account the interactions among the numerous factors that may influence T cell recognition, and thereby accelerate the process of finding T-cell epitopes. Compared to Artificial Neural Networks (ANN), which are limited by over-fitting of training data and the computational difficulty of finding global optima for model parameters, SVMs are particularly appealing for T-cell epitope prediction because of their ability to build a predictive model with good generalization power when the dimensionality of the data is high and the number of observations is limited. For two-group classification, an SVM separates the classes with a surface that maximizes the margin between them, similar to ridge regression fitting. SVM classification of a sample with a vector x of predictors is based on:
$$ f(x) = \operatorname{sign}\Big(\sum_{i} y_i \alpha_i\, k(x_i, x) + b\Big) \tag{9} $$
If $f(x)$ is positive, then the sample is predicted to be in class +1; otherwise class −1. $k(x_i, x_j)$ is the kernel function, which measures the similarity of two vectors of predictor values. For a linear SVM, the inner product kernel function is used. The summation is over the set of "support vectors" that define the boundary between the classes. Support vector $x_i$ is associated with a class label $y_i$ that is either +1 or −1. The $\{\alpha_i\}$ and $b$ coefficients are determined by "learning" the data. We found that the simple linear kernel performed best on our two data sets, compared to the polynomial and radial basis kernel functions. Because the VC dimension is lower with a linear kernel, generalization performance with limited training data is likely to be better. In order to further reduce the dimensionality, we encoded the twenty amino acids by ten factors, which were derived from 188 physical properties in previous work [Kidera et al., 1985]. This encoding reduces the dimension of the predictors by half while enabling structural and biophysical properties to be better represented than with amino acid indicator variables. For each training set consisting of 80% of the observations, a fully specified SVM is developed. This SVM model is then applied to the 20% test set. During learning on the 80% training set, leave-one-out cross-validation is employed to automatically optimize the relative misclassification costs for the two classes and to optimize the tuning parameter that reflects the trade-off between the training error and class separation. Training and testing are repeated ten times for randomly determined training/test set partitions. The final indices are averaged over the ten replicates. Predictive performance was evaluated based on sensitivity, positive predictive value (PPV) and accuracy. Sensitivity is the proportion of all positive (i.e., stimulatory) peptides that are correctly identified. PPV is the probability that a peptide predicted to be positive actually is positive. Accuracy is defined as the proportion of all predictions that are correct.
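A present-day analogue of this protocol could be sketched with scikit-learn as follows; the stand-in factor table, the balanced class weighting, and the fixed random splits are illustrative assumptions and not the original implementation.

```python
# Hedged sketch: linear SVM on factor-encoded peptides with repeated
# 80/20 train/test partitions, loosely following the described protocol.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

AMINO = 'ACDEFGHIKLMNPQRSTVWY'
rng = np.random.default_rng(0)
# Stand-in for the 10 Kidera factors per residue; substitute the published table.
FACTORS = {aa: rng.normal(size=10) for aa in AMINO}

def encode(peptide):
    return np.concatenate([FACTORS[aa] for aa in peptide])   # 10 values per residue

def mean_test_accuracy(peptides, labels, repeats=10):
    X = np.array([encode(p) for p in peptides])
    y = np.array(labels)                                     # +1 stimulatory, -1 not
    scores = []
    for seed in range(repeats):
        Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=seed)
        clf = SVC(kernel='linear', class_weight='balanced').fit(Xtr, ytr)
        scores.append(clf.score(Xte, yte))
    return float(np.mean(scores))
```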
The SVMs performed well on the two sets of data we tested [Zhao et al., 2003]. The first set contained 205 synthetic peptides that were tested against an A*0201-restricted T cell clone (TCC) from tumor-infiltrated lymph node cells of a melanoma patient: the average cross-validated sensitivity was 76.25%, the average cross-validated PPV was 70%, and the average cross-validated overall accuracy was 89.29% in the test set. The second set contained 144 peptides that were tested against a DRB5*0101-restricted TCC from the peripheral blood of a multiple sclerosis (MS) patient: the average cross-validated sensitivity was 90.00%, the average cross-validated PPV was 83.07%, and the average cross-validated overall accuracy was 94.14% in the test set. We demonstrated that SVMs can be trained on relatively small data sets to provide predictions more accurate than those based on previously published methods or on the original additive model (Figure 5).
Figure 5. ROC curves for predictions on peptides for a Melanoma clone. Solid line represents averaged predictions using SVM applied on 10 different test sets. Dashed line represents the prediction using the original additive model.
5. Bioinformatics Approach

We developed a web-based epitope search tool to enable investigators to search protein databases for proteins containing sequences that are predicted to stimulate a target T cell clone (Figure 6). Based on our model, the peptide score is the sum of the position-specific scores of the component amino acids. The scoring is accomplished by calculating a matrix in which the columns represent positions and the rows the 20 amino acids used in PS-SCL libraries. The scoring matrix entry for a particular amino acid in a specific position is based on the stimulation assay results for the PS-SCL mixture with that amino acid defined in that position. The epitope search tool is located on a server running the Red Hat Linux operating system. Java and HTML are used on the front end, and the S-PLUS statistical package and a MySQL database on the back end. CGI programming is written in Perl.
Figure 6. Screen shot of the user interface for the epitope search tool TEST. The interface allows the search to be restricted to categories such as bacterial, viral, and human sequences, myelin proteins, and B. burgdorferi.
5.1 Score matrix and statistical significance

The S-PLUS statistical package is used to calculate the scoring matrix used in the search and to determine the threshold level based on statistical significance. Under the assumption of independent contributions to stimulation, the predicted stimulatory potential of a given peptide is the sum of the scores in each position. A 10-mer peptide sequence can be represented by a $20 \times 10$ matrix of 0's and 1's, $(p_{ij})$, where $p_{ij}=1$ if the $i$th amino acid is in position $j$. Let $S_{ij}$ denote the components of the positional scoring matrix. Then the score for the peptide is

$$ S = \sum_{i=1}^{20}\sum_{j=1}^{10} p_{ij} S_{ij} \tag{10} $$
A statistical significance test is developed for the hypothesis that the score for a peptide is no greater than would be expected if the peptide were obtained from 10 random draws of amino acids. Under the null hypothesis it is not assumed that all amino acids are equally likely; rather, the relative frequencies $f_1, f_2, \ldots, f_{20}$ are derived from the protein database being searched. Under the null hypothesis, the distribution of $S$ will be approximately normal. The mean and the variance of this null distribution can be expressed as
$$ m = \sum_{i=1}^{20} f_i \sum_{j=1}^{10} S_{ij} \tag{11} $$

$$ \mathrm{var} = E[S^2] - m^2 \tag{12} $$

The variance can be shown to equal

$$ \mathrm{var} = \sum_{i=1}^{20} f_i \sum_{j=1}^{10} S_{ij}^2 + 2\sum_{j=1}^{9}\sum_{j'=j+1}^{10} m_j\, m_{j'} - m^2 \tag{13} $$

where $m_j = \sum_{i=1}^{20} f_i S_{ij}$.
The statistical significance of any score $S$ can be approximated as $p = \Phi\!\big((m-S)/\sqrt{\mathrm{var}}\big)$, where $\Phi$ denotes the standard normal distribution function. This significance level does not account for the large number of 10-mer sequences contained in the database for which tests are performed; hence a very stringent significance threshold is used.

5.2 Online search tool

The GenPept flat file is loaded into our local MySQL database. GenPept contains translated protein coding sequences obtained by translating the GenBank flat file release. The current version in our database is GenPept version 156. It contains 506,176 viral sequences, 1,326,096 bacterial sequences, and 238,360 human sequences. The total number of protein sequences is 3,697,523. There are five fields for each protein sequence in the database table: GenPept accession number, Keyword, Annotation, Organism, and Amino acid sequence. An easy interface enables users to restrict and customize their search using any of the above fields. The database also serves as a remote data management system for users, either to maintain their search results in their own tables of the MySQL database or to retrieve those data and find the common hits across different searches. A Perl script is used to systematically search the GenPept database. A window with the same length as the peptides used in the positional scanning combinatorial peptide libraries is slid over the available translated protein-coding sequences. The sum of the scores within the window is used as the ranking criterion. All peptides with scores higher than a threshold are output to a file. Once the search is completed, the server sends an email to the user. The user can open the html page that contains the peptides with their GenPept accession numbers, peptide sequences, and protein annotations. The peptides are hyperlinked to the NCBI protein database, where users can check detailed information about the sequences. We have also implemented a Java applet to display the score distribution of all peptides in the organisms selected by the user.
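The scoring and significance computation of equations (10)–(13), together with the sliding-window search, might be sketched in Python as follows (the production tool used Perl with S-PLUS; the variable names and cutoff here are illustrative).

```python
# Hedged sketch: slide a 10-mer window over a protein, score it with the
# positional matrix S (20 x 10), and attach the normal-approximation p-value.
import numpy as np
from math import erf, sqrt

AMINO = 'ACDEFGHIKLMNPQRSTVWY'
INDEX = {aa: i for i, aa in enumerate(AMINO)}

def null_moments(S, freqs):
    """Mean and variance of the null score distribution, eqs. (11)-(13)."""
    m_j = freqs @ S                                  # per-position mean score
    m = m_j.sum()
    second = (freqs @ (S ** 2)).sum() + 2 * sum(
        m_j[j] * m_j[k] for j in range(9) for k in range(j + 1, 10))
    return m, second - m ** 2

def search(protein, S, freqs, p_cut=1e-6):
    m, var = null_moments(S, freqs)
    hits = []
    for start in range(len(protein) - 9):
        window = protein[start:start + 10]
        score = sum(S[INDEX[aa], j] for j, aa in enumerate(window))
        p = 0.5 * (1 + erf((m - score) / sqrt(2 * var)))   # Phi((m - score)/sqrt(var))
        if p < p_cut:
            hits.append((start, window, score, p))
    return hits
```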
6. Applications

We have applied our approach to a variety of T cell clone systems with known and unknown specificities. It led to the identification of potentially relevant T-cell targets derived from both Borrelia burgdorferi and the host in a patient with chronic neuroborreliosis, demonstrating that flexibility in T-cell recognition does not preclude specificity: the antigen specificity of a single T-cell clone can be degenerate, and yet the clone can preferentially recognize different peptides derived from the same organism [Hemmer et al., 1999]. Studies of myelin-specific TCCs, as well as of clones from the nervous system of patients suffering from chronic central nervous system (CNS) Lyme disease, made it clear that at least some T cells are more degenerate than previously anticipated [Martin et al., 2001]. The approach allowed the identification of T cell epitopes for both autoreactive and foreign antigen-specific TCCs with unprecedented efficacy [Zhao et al., 2001a]. The same approach has also been successfully used for the prediction and identification of antigens recognized by CD8+ TCCs [Pinilla et al., 2001]. For the first time, the recognition of antigens by clones of unknown specificity can be decrypted. We have applied our approach in projects aiming to identify tumor antigens. A tumor-reactive CTL clone specific for an immunodominant peptide from the melanocyte differentiation and tumor-associated antigen Melan-A was used [Rubio-Godoy et al., 2002], demonstrating the high predictive value of PS-SCL for the identification of sequences cross-recognized by antigen-specific T cells. We studied the extent of cross-reactivity of a CD4+ T-cell clone (TCC) specific for the immunodominant influenza virus hemagglutinin (Flu-HA) peptide, derived from a patient with multiple sclerosis (MS), using PS-SCL; this demonstrated that flexibility of TCR recognition is present even in a clone with a high degree of TCR specificity for an infectious agent [Markovic-Plese et al., 2005]. Finally, we analyzed the cross-reactivity of a human HIV-1 gag p24-specific CD4+ T cell clone obtained from an HIV-1-seronegative donor using PS-SCL [Venturini et al., 2006]. It showed that an HIV-1-specific human T helper clone can react efficiently with peptides from other pathogens, suggesting that cellular immune responses identified as being specific for one human pathogen (HIV-1) could arise from exposure to other pathogens. In summary, our approach has proven efficient and successful for identifying target antigens and highly active peptide mimics. The results of the above applications have implications for vaccine design and for antigen-specific treatment strategies for autoimmune diseases.
Acknowledgement We thank our collaborators Drs Roland Martin from Center for Molecular Neurobiology at University of Hamburg (previously at Cellular Immunology Section, Neuroimmunology Branch, National Institute of Neurological Disorders and Stroke, National Institute of Allergy and Infectious Diseases, NIH), Clemencia Pinilla from Torrey Pines Institute for Molecular Studies, and Danila Valmori from Division of Clinical Onco-Immunology at Ludwig Institute for Cancer Research for their experimental data and helpful discussions.
References

[1] Dooley, C.T. & Houghten, R.A. (1993) The use of positional scanning synthetic peptide combinatorial libraries for the rapid determination of opioid receptor ligands. Life Sci., 52, 1509-17.
[2] Dooley, C.T.; Spaeth, C.G.; Berzetei-Gurske, I.P.; Craymer, K.; Adapa, I.D.; Brandt, S.R.; Houghten, R.A.; Toll, L. (1997) Binding and in vitro activities of peptides with high affinity for the nociceptin/orphanin FQ receptor, ORL1. The Journal of Pharmacology and Experimental Therapeutics, 283, 735-41.
[3] Eichler, J.; Lucka, A. W.; Houghten, R. (1994) Cyclic peptide template combinatorial libraries: synthesis and identification of chymotrypsin inhibitors. Pept. Res., 7, 300-7.
[4] Hammer, J.; Bono, E.; Gallazzi, F.; Belunis, C.; Nagy, Z.; Sinigaglia, F. (1994) Precise prediction of major histocompatibility complex class II-peptide interaction based on peptide side chain scanning. J. Exp. Med., 180, 2353-8.
[5] Hemmer, B.; Gran, B.; Zhao, Y.; Marques, A.; Pascal, J.; Tzou, A.; Kondo, T.; Cortese, I.; Bielekova, B.; Straus, S. E.; McFarland, H. F.; Houghten, R.; Simon, R.; Pinilla, C.; Martin, R. (1999) Identification of candidate T-cell epitopes and molecular mimics in chronic Lyme disease. Nat. Med., 5, 1375-82.
[6] Houghten, R. A.; Pinilla, C.; Appel, J. R.; Blondelle, S. E.; Dooley, C. T.; Eichler, J.; Nefzi, A.; Ostresh, J. M. (1999) Mixture-based synthetic combinatorial libraries. J. Med. Chem., 42, 3743-78.
[7] Kidera, A.; Konishi, Y.; Oka, M.; Ooi, T.; Scheraga, H. A. (1985) Statistical analysis of the physical properties of the 20 naturally occurring amino acids. J. Protein Chem., 4, 23-55.
[8] Mallios, R. R. (1994) Multiple regression analysis suggests motifs for class II MHC binding. J. Theor. Biol., 166, 167-72.
[9] Markovic-Plese, S.; Hemmer, B.; Zhao, Y.; Simon, R.; Pinilla, C.; Martin, R. (2005) High level of cross-reactivity in influenza virus hemagglutinin-specific CD4+ T-cell response: implications for the initiation of autoimmune response in multiple sclerosis. J Neuroimmunol., 169, 31-8.
[10] Martin, R.; Gran, B.; Zhao, Y.; Markovic-Plese, S.; Bielekova, B.; Marques, A.; Sung, M.H.; Hemmer, B.; Simon, R.; McFarland, H.F.; Pinilla, C. (2001) Molecular mimicry and antigen-specific T cell responses in multiple sclerosis and chronic CNS Lyme disease. J Autoimmun., 16, 187-92.
[11] Parker, K. C.; Bednarek, M. A.; Coligan, J. E. (1994) Scheme for ranking potential HLA-A2 binding peptides based on independent binding of individual peptide side-chains. J. Immunol., 152, 163-75.
[12] Pinilla, C.; Martin, R.; Gran, B.; Appel, J. R.; Boggiano, C.; Wilson, D. B.; Houghten, R. A. (1999) Exploring immunological specificity using synthetic peptide combinatorial libraries. Curr. Opin. Immunol., 11, 193-202.
[13] Pinilla, C.; Rubio-Godoy, V.; Dutoit, V.; Guillaume, P.; Simon, R.; Zhao, Y.; Houghten, R. A.; Cerottini, J. C.; Romero, P.; Valmori, D. (2001) Combinatorial peptide libraries as an alternative approach to the identification of ligands for tumor-reactive cytolytic T lymphocytes. Cancer Res., 61, 5153-60.
[14] Rubio-Godoy, V.; Pinilla, C.; Dutoit, V.; Borras, E.; Simon, R.; Zhao, Y.; Cerottini, J.C.; Romero, P.; Houghten, R.; Valmori, D. (2002) Toward synthetic combinatorial peptide libraries in positional scanning format (PS-SCL)-based identification of CD8+ tumor-reactive T-cell ligands: a comparative analysis of PS-SCL recognition by a single tumor-reactive CD8+ cytolytic T-lymphocyte clone. Cancer Res., 62, 2058-63.
[15] Southwood, S.; Sidney, J.; Kondo, A.; del Guercio, M. F.; Appella, E.; Hoffman, S.; Kubo, R. T.; Chesnut, R. W.; Grey, H. M.; Sette, A. (1998) Several common HLA-DR types share largely overlapping peptide binding repertoires. J. Immunol., 160, 3363-73.
[16] Vapnik, V. N. (1995) The Nature of Statistical Learning Theory. Springer.
[17] Venturini, S.; Allicotti, G.; Zhao, Y.; Simon, R.; Burton, D.R.; Pinilla, C.; Poignard, P. (2006) Identification of peptides from human pathogens able to cross-activate an HIV-1-gag-specific CD4+ T cell clone. Eur J Immunol., 36, 27-36.
[18] Zhao, Y.; Gran, B.; Pinilla, C.; Markovic-Plese, S.; Hemmer, B.; Tzou, A.; Whitney, L. W.; Biddison, W. E.; Martin, R.; Simon, R. (2001a) Combinatorial peptide libraries and biometric score matrices permit the quantitative analysis of specific and degenerate interactions between clonotypic TCR and MHC peptide ligands. J. Immunol., 167, 2130-41.
[19] Zhao, Y.; Grovel, L.; Simon, R. (2001b) TEST: a web-based T-cell Epitope Search Tool. Proceedings of the 14th IEEE Symposium on Computer-Based Medical Systems, 493-497.
[20] Zhao, Y.; Pinilla, C.; Valmori, D.; Martin, R.; Simon, R. (2003) Application of support vector machines for T-cell epitopes prediction. Bioinformatics, 19, 1978-84.
In: Computational Biology: New Research Editor: Alona S. Russe
ISBN: 978-1-60692-040-4 © 2009 Nova Science Publishers, Inc.
Chapter 16
SCRIPTING OF MOLECULAR STRUCTURE VIEWER FOR DATA ANALYSIS USING LUA LANGUAGE INTERPRETER Yutaka Ueno∗ Neuroscience, National Institute of Advanced Industrial Science and Technology, Umezono 1-1, Central 2, Tsukuba, Ibaraki, 305-8568, Japan
Abstract To improve the flexibility and extensibility of application programs, a method to support scripting function by embedding a Lua language interpreter is described. Using this approach, variations of input data and parameters for calculations are supported by a script file without rewriting the application program. For instance, the script information is supplied to the program and internal data can also be exported to a script for extended calculations. This chapter summarizes the basic framework of embedding this scripting language to interact with existing codes using the application programming interface provided by Lua. The method was applied to the molecular structure viewer program MOSBY to support additional visualizations and calculations from atomic coordinate data by script programs. Atomic structure data in the original C structure were mapped to a Lua script by using a mechanism called "metamethod" in Lua. In addition, the table data type in Lua provides a simple database useful for a configuration of molecular graphics. The design of a “domainspecific language” in biocomputing is discussed with reference to other scripting languages.
Introduction A scripting language is used to automate, customize and configure computational tasks, with scripts interpreted to perform individual tasks or launch programs [1]. Perl [2], Python [3] and Ruby [4] are popular in scientific computing with a UNIX-based software environment. They have also been regarded as “glue” languages to connect software components and organize them to a specific computational task. In particular, modular libraries and scripts for ∗
bioinformatics [5] have been provided as BioPerl, BioPython, or BioRuby. These languages are popular because their comprehensible programming style with procedural syntax allows many scientists to cope with various computational tasks without complex knowledge of programming. Successful application programs sometimes support their own scripting languages to enhance functionality and support extensions, allowing users to adapt them to various computational tasks. For example, NIH Image and its successor ImageJ are equipped with macro languages for scripting image-processing tasks in bioscience. These embedded languages share the same concept of an interpreter of procedures plus libraries of processing functions, but their implementations and methods differ among individual programs. Scheme [6] pioneered a common embedded language and has become widely employed in the general-purpose image-processing program GIMP. In contrast to its LISP syntax, Tcl [1], Basic and C-INT [7] use a familiar procedural syntax for scripting. As our experience with the design and implementation of such languages rapidly increases through open-source development projects, the essence of their software design could be extracted as a shared core for scripting software tools, or as a reusable software component of embedded languages. Scripting is a high-level programming approach to organizing and managing data and software components in a large software suite, and it differs from traditional studies of programming languages [1]. While compiler technology is used to generate the highest-performance executable code specific to the computer hardware, a high-level programming language addresses engineering issues in practical software development. Ierusalimschy and colleagues have proposed the programming language Lua [8] as a candidate for such an embedded language, based on their experience in both academic research and industrial applications. This study demonstrates a result with Lua embedded in an existing application program to support a scripting facility. Lua also reinforces the structure of the program, providing an infrastructure for data management together with modern programming techniques. The concept of the programming language Lua was described in the original paper [9] as "an extensible extension language", which is now demonstrated by the large number of application software programs that use it. It was originally designed as an embedded language, so the methods of interfacing with existing code in C/C++ were extensively examined to support the construction of a host software system. Lua has the functions of a complete programming language, featuring an extensible mechanism of the language itself. In particular, "metamethod", a mechanism to extend the semantics of the language, allows users to redefine the behavior of mathematical operations and of data handling of the associative array, in order to support new data types in the script.
The result of embedding Lua in MOSBY, a molecular structure viewer program [10], is described to demonstrate how Lua contributes to the construction and configuration of a large software system with hierarchical complex data. The methods shown with Lua are not only programming techniques but also important design methods in software engineering.
Lua provides the desired mechanisms of extensibility and supports the construction of large and complex software systems. Prior to this study, Lua had been embedded into a sequence map viewer program, GUPPY [11], illustrating its potential to facilitate a variety of computational tasks in bioscience. The programming cost of embedding the Lua language is substantially lower than that of other embedded languages. Although scripting languages are indispensable computational tools in the biosciences, most individual programs have remained single executables, usually launched from a script file with limited command-line parameters. By contrast, an embedded script provides more flexible configurations, with additional calculations using contexts residing in the original program.
Introduction to Lua
(1) Requirements for the language
Many languages have been successfully embedded into application programs: Scheme [6], Basic, C-INT [7], Forth and Tcl [1] are traditional choices, while Python [3], Lua and Ruby [4] represent the modern trend of object-oriented programming. For a scripting language intended for data description and data-processing procedures, the following demands usually arise in the design and implementation of the target computer program.
1) Automatic memory management for hierarchical data construction. Biological macromolecules have a hierarchical structure, from the sequence down to the atomic coordinates, together with attributes and associated data [11]. Dynamic descriptions of such data, and arrays with automatic extension, are required to edit molecular structure data. Basic and C-INT are unsuitable for this purpose.
2) Maximum processing performance by linking original native code on various types of hardware. For large-scale calculations, the critical part of the program should be written in C/C++, which is usually more than two-fold faster than an interpreted language. Portability is always required in scientific computing on current high-performance CPUs and their cluster machines, without dependence on a particular operating system. Meeting this demand with Perl or Ruby requires laborious techniques.
3) Comprehensible programming for customizing parameters and procedures. A procedural syntax is always preferred by general users in biocomputing, who have only casual programming training and little knowledge of the underlying hardware environment. Scheme, Forth and Tcl are unfavorable in this respect.
Python and Lua meet these requirements. While Python is popular in scientific computing, Lua has demonstrated an outstanding number of applications in computer game software, a field that shares the same requirements [12]. In this study, Lua was chosen for two additional reasons: its size and its comprehensibility. Its language interpreter and library are small enough to be embedded and used by programmers with an average amount of knowledge. In addition, its source code is readable enough to be traced with a debugger when unknown problems are encountered.
Comprehensibility is an important design concept of Lua, which minimizes abstract concepts and expressions in programming. Scientists are usually already equipped with abstract concepts and technical reasoning in their own fields, and are likely to avoid learning additional ones for their computing tools, e.g., object-oriented programming or programming that requires an understanding of S-expressions. As a matter of fact, scripts are typically edited by a non-programmer member of the workgroup without extra education. Lua is an implementation of a scripting language that meets these demands of scientists.
(2) An overview of Lua
Lua is a general-purpose programming language with a simple procedural syntax and dynamically typed variables. A program statement consists of an assignment of data or a function call, where functions are themselves first-class data. In addition to independent data items such as numbers and strings, the table data type, which implements an associative array, offers a structured data container especially useful for describing hierarchical data. Tables are created and edited dynamically, and are cleaned up by the garbage collector when no longer used. Conventional flow-control structures are supported, e.g., if-then-else-end, while-do-end, and repeat-until. The current version also supports coroutines for cooperative multi-tasking. The interpreter is implemented in ANSI C with great portability, and achieves good performance by compiling to byte code for a virtual machine. Recently, a just-in-time compiler implementation reported excellent execution performance for the virtual machine. Below is an example of a Lua script.

width=800
title="sample-1"
mol={                      -- a comment starts with two hyphens in Lua
  name="alanine",          -- a table datum
  smiles="CC(N)C(=O)O",
  form="CH3CH(NH2)COOH"
}
(3) Metamethods in Lua
To extend the semantics of a program written in Lua's compact grammar, a mechanism called the metamethod is provided. It allows new data types to be given to variables, and the actions taken when processing them to be customized. A set of "events" defines the behavior for mathematical operations, comparisons and the indexing of data fields, just as for a Lua table. This set, called a "metatable", provides the attributes of a new data type. It evolved from the "fallback" of the original paper on Lua [9] to support various practical extensions. For example, suppose we have a new data type that is represented as an array of strings but is actually stored in a file with a fixed record size. Metamethods allow users to present such data as a table, with custom metamethods to get or set the data. For reading data, the following metamethod would be used for an "index" event.
mydiskdata={}
-- mydiskdata serves as the metatable
-- a 2-underscore prefix is added to the event name
mydiskdata.__index=function(t,idx)
  t.file:seek("set", idx*RECSIZE)
  return t.file:read(RECSIZE)
end

A new data type is created as a Lua table that opens a file for reading. It starts to work once this metatable is established; the metatable also serves as an identifier for the new data type.

one={}
one.file=io.open(filename)
setmetatable(one,mydiskdata)
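Once the metatable is set, record access looks like ordinary table indexing; a brief usage sketch (the record index 3 is arbitrary):

print(one[3])   -- triggers mydiskdata.__index and reads record 3 from the file

The same event mechanism covers mathematical operations. The following is a minimal illustrative sketch, not from MOSBY, giving Lua tables element-wise addition through the "add" event:

vecmeta = {
  __add = function(a, b)
    local r = setmetatable({}, vecmeta)
    for i = 1, #a do r[i] = a[i] + b[i] end
    return r
  end,
}
v1 = setmetatable({1, 2, 3}, vecmeta)
v2 = setmetatable({10, 20, 30}, vecmeta)
v3 = v1 + v2   -- calls vecmeta.__add; v3 is {11, 22, 33}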
Figure 1. The basic framework of the Lua interpreter, with its data management, embedded in a program. A script is compiled into an intermediate code executed by a virtual machine; functions in Lua are likewise converted to this intermediate code. In addition to the virtual machine, the Lua interpreter provides data management for string data. Memory for these variable-length strings is allocated and, after use, reclaimed by the garbage collector, which maintains the pool of unused memory. The same memory management applies to Lua tables, userdata and functions. Numerical and boolean data are atomic data represented directly by a Lua variable. Alongside Lua variables, the Lua stack is used by a C program to locate and handle Lua data. (1)-(5): the interactions indicated by the bold arrows correspond to the subsections of Embedding a Lua Interpreter in the text.
Embedding a Lua Interpreter
Figure 1 illustrates the basic framework of a Lua interpreter in a host program, with the memory management of strings, Lua tables and functions. This is the general infrastructure of an interpreted language. For an embedded language, the interface to these data from a C program has to be established. By embedding a scripting language we expect to share useful functions of the application program that are already available in this framework. In addition to string data, arrays of floating-point numbers, matrix data and other data specific to the application program can be managed in this framework. Lua provides an excellent implementation of such a memory-management software component with garbage collection. Since the Lua interpreter is provided as a library for C programs, it is ready to be linked into a program. The C language application programming interface (API) provides full control of the Lua environment created by the interpreter. In this study, a simplified version of the API, the EasyLua API, was developed and is used here instead of the original C API because it is easier to understand and practically useful. Details are provided in our software distribution [13].
(1) Invoking the Lua interpreter from a program
First, the Lua interpreter is initialized by an API function call. Then a script file written in Lua can be executed by an API function. Code written as string data in the C program can also be executed as Lua code.

el_init(NULL);
el_dofile("test.lua");
el_dostring("mywidth=10");

Optionally, existing program code can be organized as a Lua module to be called from a Lua script. In that case, the modules can be shared among programs embedding Lua.
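For instance, a script executed by el_dofile() could itself pull in such a shared module through Lua's standard require() mechanism. A minimal sketch, in which the module name "molutil" is invented for illustration:

-- test.lua
local molutil = require("molutil")   -- a module implemented in C, shared among host programs
print(molutil.version)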
(2) Reading and writing Lua data
Once the interpreter is running, Lua data are available for the program to look up in the following way using EasyLua functions.

n1    = el_num(el_var("width"));
title = el_str(el_var("title"));

Global variables are sufficient to define the data and parameters of a single program. For large programs with a growing number of global variables, Lua tables are useful for describing them as hierarchical data. For example, the user account information of an operating system can be represented in a Lua table as follows.

user={
  name="tamago san",
  uid=9601,
  home="/home/ueno"
}
In a Lua script, a field of a Lua table is referenced with the period notation, as in:

user.name

For a C program to access such field data, the Lua table has to be identified by a handle, an integer index, to the data in Lua. For example, the index number 1 is assigned to refer to the table:

el_table(1, el_var("user"));
n1 = el_num(el_field(1, "uid"));

The first call assigns handle 1 to the Lua table named "user". The second call then accesses a field of that table and converts it to a number variable. The field data is in fact handled through a temporary handle, in the same way as with el_var(). Thus all Lua data are handled in this way without invoking the language interpreter. In Lua, these data handles live on the Lua stack, which is actually an indexed array expandable by pushing data.
(3) Exchanging array data between Lua and C
Arrays are convenient for listing data in scientific computing. A Lua table can be used as an array if its indices are integers. The following C code reads the array data of a Lua table; note that arrays in Lua usually start from 1.

el_table(1, el_var("wave"));
n = el_getn(1);                      /** number of array elements **/
for(i=0; i<n; i++)
    w[i] = el_num(el_index(1, i+1)); /** Lua arrays start at 1 **/
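The Lua side of this exchange can build the "wave" array with an ordinary loop; a brief sketch, with arbitrary sample data:

wave = {}
for i = 1, 256 do
  wave[i] = math.sin(2 * math.pi * (i - 1) / 256)   -- one period of a sine wave
end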
(4) Making C functions callable from Lua
A C function can be called from a Lua script, which matches the event-driven programming style widely employed in graphical user interfaces. The following example
interfaces the Legendre polynomial function to a global Lua function with two arguments.

int legendre_lua(lua_State *L)
{
    int l;
    double g, x;
    l = el_num(1);
    x = el_num(2);
    g = gsl_sf_legendre_Pl(l, x);   /** GNU Scientific Library **/
    el_pushnum(g);
    return 1;
}
el_register("legendre", legendre_lua);

Arguments to the function are directly accessible via the Lua stack with integer indices. Return values are also stored on the Lua stack, where they are pushed sequentially. By Lua convention, the number of values pushed onto the Lua stack must be given as the return value of the C function. The stack contents are cleaned up after being delivered to the calling script. In this way, custom code is invoked for an event as a registered Lua function from a script.
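Once registered, the function can be called from a script like any built-in Lua function; a brief usage sketch:

g = legendre(4, 0.5)   -- evaluates the C function from Lua
print(g)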
(5) Allocating "userdata" for C functions
If memory for C data is allocated through Lua, it benefits from the garbage-collection mechanism provided by Lua, while the data can still be handled in C functions just as ordinary C data. The following example uses an array of 80-bit floating-point numbers as "userdata", although not all C compilers support this "long double" data type.

int new80bit_lua(lua_State *L)   /** C language **/
{
    long double *lw;
    int na = el_num(1);
    lw = el_pushnewuserdata(na * sizeof(long double));
    return 1;
}
el_register("new80bit", new80bit_lua);

If the data structure has to be created by a custom C function, we can still use an item of "userdata" to hold a pointer to such data, and tell Lua to call another function that frees the memory upon garbage collection. This is the metamethod for the "gc" event, assigned with the standard C API, i.e., luaL_newmetatable() and lua_setmetatable(), as described in the official Lua documentation [8].
(6) Mapping C data into Lua
Data can be shared between C and Lua with metamethods. Suppose we introduce an 80-bit floating-point number as a new data type for precise calculation in C. Whenever Lua needs to read or write such values, custom functions converting them to and from Lua numbers can be employed. The new data type is first created as a Lua table holding the C data in a field of "userdata". The metamethods are arranged from a Lua script, so the original code must set up the data in C and register new Lua functions to be called from the metamethods. Below is the helper function, called from the metamethod, that implements reading for this new data type.

int cdata_index_lua(lua_State *L)
{
    long double *dptr = (long double *)el_udata(1);
    int idx = el_num(2);
    el_pushnum(dptr[idx]);   /** reading **/
    return 1;
}
el_register("cdata_index", cdata_index_lua);

The corresponding function to store the third argument into the data on a "newindex" event is trivial:

dptr[idx] = el_num(3);   /** writing; the function returns 0 **/
With these two functions registered in Lua-callable form, we can implement the new data type in Lua as a "proxy" table for an array of 80-bit floating-point numbers, actually stored in its field named "w80".

one={}
one.w80=new80bit(100)   -- an array of 100 elements

mt80bit={
  __index=function(t,idx)
    return cdata_index(t.w80,idx)
  end,
  __newindex=function(t,idx,d)
    cdata_newindex(t.w80,idx,d)
  end,
}
setmetatable(one,mt80bit)

Although metatables and metamethods can also be managed through the C API, it is good practice to organize them in a Lua script rather than in rigid C code. Optionally, numerical operations such as addition and subtraction can be supported incrementally as they are needed.
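After this setup, a script manipulates the proxy like an ordinary array; a brief usage sketch:

one[3] = 1.25     -- routed to cdata_newindex() by the "newindex" event
print(one[3])     -- routed to cdata_index() by the "index" event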
(7) Calling Lua from C
The remaining interaction is calling a Lua function from C code. Calling a Lua function without arguments is simple:

el_call(el_var("print"));

Arguments to the function are pushed onto the Lua stack in sequence. With this EasyLua function, return values are moved from the Lua stack into a new Lua table.

ip = el_var("print");
el_pushnum(10);
el_pushnum(20);
ir = el_call(ip);
el_num(el_index(ir,1));
el_num(el_index(ir,2));

It is now possible to start a Lua script from the program, for example attached to a specific event in the application. In general, such event-driven programming eases adaptation not only to Lua but also to many graphical user interface environments.
Figure 2. Block diagram of the program with the molecular data list and display list of MOSBY. Atomic coordinate data from a PDB database entry usually consist of hierarchical molecular fragments of a polypeptide; this sample consists of two chains, "A" and "B". They are described directly in a Lua table, where the array of atomic coordinates, designated the 'atom list', is an array of C structure data. These are converted into the display list to maximize graphical rendering performance for large molecules. The display list is an array of primitives for graphical rendering, using vertex data and connection data for the chemical bonds. By changing the display primitives for a molecule, various molecular representations are supported. Attributes attached to a molecular fragment include a transformation matrix used in docking studies and a description of the symmetric unit, as in a crystal structure. An electron density map can also be loaded into the molecular data list; it is marked "M" in this sample. The surface generation module converts the map data into a display list of polygon primitives.
An Application to Molecular Graphics
The methods described in the previous section, which cover most of the ways that Lua can interact with a C/C++ program, were applied to MOSBY, a portable molecular viewer program for proteins and macromolecules. Although this program already employs plug-in modules for extensibility [10], a scripting language is a further desirable enhancement, supporting a macro-processing facility and a property database of atomic elements and molecular parts. Once a scripting language is embedded in a program, it also participates in the construction of the program. In fact, many parts of the program written in C were replaced with Lua code, because an interpreter is convenient for assembling the program, while the molecular structure data and the graphics rendering display list have to be implemented in C. These lists of C data are attached to a Lua table, with attribute values and annotations used for figures with the atomic model (Figure 2).
(1) The molecular structure data
The atomic coordinates, organized as an array of C structure data, are accessible from Lua. This compact data layout was designed for quick rendering, but it is mapped so that it appears to be defined in Lua. Below is a simplified version of the atomic coordinate data in MOSBY, together with the helper function that lets a Lua script set the x-coordinate of an atom.

struct xyza {          /** a definition of the structure **/
    double x, y, z;
    int atyp;
    /** and more attributes **/
};

int cmyset(lua_State *L)
{
    struct xyza *cobj = (struct xyza *)el_udata(1);
    int idx = el_num(2) - 1;
    char *name = el_str(3);
    if(name[0]=='x') { cobj[idx].x = el_num(4); return 0; }
    /** and more members ... **/
    return 0;
}
el_register("mysetxyza", cmyset);

This helper function allows Lua code to set the x-coordinate of an atom in MOSBY as follows:

mysetxyza(obj, 1, 'x', 5.0)   -- set 5.0 for x of atom 1
Using metamethods, this operation can be expressed far more conveniently through a Lua table; that is, a custom function is called whenever data in the table are read or written:

obj[1].x = 5.0     -- in a Lua script
print(obj[1].x)    -- it will show "5.0"
The implementation provided in our software distribution [13] uses two metamethods. The first, for an "index" event on the molecular object, creates a Lua table that exposes the fields of the C structure. This table carries a second metamethod, for a "newindex" event, which in turn stores Lua data into the C structure whenever a value is assigned to a field of the table.
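The following Lua sketch shows one way such a double proxy can be arranged; it is illustrative rather than the actual MOSBY code, and the getter mygetxyza() is a hypothetical counterpart of the mysetxyza() function above.

molmeta = {
  __index = function(obj, i)      -- "index" event: evaluated for obj[i]
    local proxy = {}
    setmetatable(proxy, {
      __index    = function(t, k) return mygetxyza(obj, i, k) end,  -- read from C
      __newindex = function(t, k, v) mysetxyza(obj, i, k, v) end,   -- write to C
    })
    return proxy
  end,
}
setmetatable(obj, molmeta)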
(2) Scripts for software configuration
The scripting function is particularly useful for importing and exporting files in different formats. The file-format handling modules, which test for and load different file formats, are configured dynamically with a Lua script to support foreign files. The coloring of atomic models is defined in a property database of the periodic table of elements. In calculations with atomic coordinates, various constants per atom type are frequently used and sometimes updated. A Lua table, as an associative array, provides a simple database for these constants used in various tasks. Below is a part of the amino acid residue database, to which additional fields can be added.

AminoAcid={
  THR={nick="T", lname="Threonine", color=11},
  PRO={nick="P", lname="Proline",   color=9},
  ASP={nick="D", lname="Aspartate", color=2},
  GLU={nick="E", lname="Glutamate", color=2},
  ...
}
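A script can then look up residue properties directly by name; a brief usage sketch:

res = "GLU"
print(AminoAcid[res].lname, AminoAcid[res].color)   -- Glutamate   2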
(3) Scripts for calculations
The atomic coordinate data in the program are fully accessible from Lua scripts and can be edited, including deletions and additions. Below is a sample calculation of the radius of gyration, which the code describes quite naturally.

gx,gy,gz=msMolGrav(protein)
mass=0
rsq=0
atom=msAtomLoop(protein)
while(atom) do
  x=atom.x-gx
  y=atom.y-gy
  z=atom.z-gz
  rsq = rsq + atom.ele*(x*x+y*y+z*z)
  mass = mass + atom.ele
  atom = atom.next()
end
rg = math.sqrt(rsq/mass)

Functions specific to the molecular data are used here: msMolGrav() obtains the center of gravity of the molecule, and msAtomLoop() generates an iterator over the atomic data. These functions are organized in a molecular structure data library for Lua scripts. Figure 3 is another example, a custom rendering of the electrostatic potential obtained from a molecular orbital calculation [14]. The gradient of the potential field was estimated from the differences between adjacent grid points of the calculated map, stored in a 4-dimensional sparse matrix. Prototyping such calculations greatly benefits from a dynamically typed interpreted language like Lua.
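Lua tables make such a sparse matrix almost trivial to prototype. A minimal sketch (not the MOSBY code), storing gradient components only at grid points where they are needed:

grad = {}   -- sparse 4-dimensional matrix: grad[i][j][k][c]
function grad_set(m, i, j, k, c, v)
  m[i] = m[i] or {}
  m[i][j] = m[i][j] or {}
  m[i][j][k] = m[i][j][k] or {}
  m[i][j][k][c] = v
end
function grad_get(m, i, j, k, c)
  local a = m[i]; if not a then return nil end
  local b = a[j]; if not b then return nil end
  local d = b[k]; if not d then return nil end
  return d[c]
end
grad_set(grad, 10, 4, 7, 1, 0.25)    -- x-component at grid point (10,4,7)
print(grad_get(grad, 10, 4, 7, 1))   -- 0.25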
Figure 3. Visualization of the polarity and gradient of the electrostatic potential. On a stick representation of a tetramer of poly-glycine, arrows are depicted at grid points on the molecular surface. The arrows show the magnitude and direction of the gradient; their color corresponds to the polarity, plus (blue) or minus (red), and to the magnitude of the electrostatic potential. The gradient was estimated from differences between the electrostatic potential values of adjacent grid points. Only the required values of the gradient vector in the three-dimensional grid are calculated and saved, in a 4-dimensional sparse matrix.
Discussion
(1) Related work
In molecular graphics, while a graphical user interface conveniently provides browsing and editing of molecular illustrations, a script allows users to save and restore the steps that recreate a picture, maintaining rendering contexts as reusable descriptions. VMD [15] pioneered this scripting method with Tcl, but Tcl lacks a numerical data type, which is indispensable for comprehensible programming in scientific computing. PyMOL [16] is a popular molecular graphics program with scripting support through Python; however, object-oriented programming knowledge and training are strongly recommended for its users. In contrast, CueMol [17] uses an original scripting language, which may be ideal for that program but prevents communication with other programs, which a common scripting language would afford. In addition, implementing such embedded scripting languages requires substantial technical work, and programs developed with scripting support are sometimes difficult to customize and extend. The method described here with Lua affords a feasible solution to this issue, as a possible new architecture for a glue language [12]. This technology allows programmers to maintain legacy application programs as important software components and fruitful resources in our community.
(2) The C language API of Lua
Informed by the many practical uses of Lua embedded in applications, its C language API has been redesigned a few times, and the current specification is satisfactory in most cases. Efficient operation of the Lua stack, however, still involves tricky know-how. This study describes EasyLua, an alternative API that greatly simplifies stack management. This kind of wrapper code is widely used in practical software development with Lua. Since Lua is still actively developed to address complex programming issues in computer science, a new version sometimes introduces incompatibilities in the C API specification. In that case, an alternative API like EasyLua helps in adapting to the new version without rewriting everything. There are also other approaches to integrating Lua into C/C++ applications. For example, the template mechanism of C++ can be used to establish the Lua interface [18]. SWIG [19] is a tool that automates interface generation for a variety of scripting languages. This study describes the principles actually employed in these tools, to aid programmers in choosing a tool for adapting Lua to existing programs.
(3) A multi-paradigm language
Lua is also regarded as a 'multi-paradigm' language, which Tim Budd [20] describes as "a framework in which programmers can work in a variety of styles, freely intermixing constructs from different paradigms". Lua supports functional, imperative and object-oriented programming, which serve different levels of users in adaptive scripting work.
Object-oriented programming, in particular, is supported in a very flexible manner in Lua. A Lua table describes an object without a strict notion of class, allowing prototyping with ad-hoc additions to test new functions. After such modifications, the class definition can be protected to maintain the consistency of object instances. Purely object-oriented languages and tools have refrained from this kind of rearrangement and extension of object designs, because stability of the object model has been a priority. In active software development, however, rearrangements do occur in response to a variety of user requests. On the functional side, it is notable that functions in Lua are first-class values and can be defined dynamically at runtime. This mechanism is very often useful in scientific computing, for example when sorting data lists with an arbitrary comparison function supplied at run time.
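A brief Lua sketch of both styles; the names here are invented for illustration.

Atom = {}
Atom.__index = Atom
function Atom.new(name, mass)        -- a prototype, not a formal class
  return setmetatable({name=name, mass=mass}, Atom)
end
function Atom:heavier(other)         -- a method added ad hoc, after the fact
  return self.mass > other.mass
end

list = { Atom.new("C", 12.011), Atom.new("H", 1.008), Atom.new("O", 15.999) }

-- first-class function: an arbitrary comparison supplied at run time
table.sort(list, function(a, b) return a.mass < b.mass end)
for _, a in ipairs(list) do print(a.name, a.mass) end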
(4) Scripting languages developed in molecular dynamics
Useful molecular dynamics programs were developed in FORTRAN, and the individual components of those programs each adopted their own file formats for data and job control. X-PLOR [21] introduced an original scripting language for managing data and control statements, which greatly facilitated a variety of tasks in X-ray crystallography and NMR data analysis. However, implementing additional functions by modification requires substantial knowledge of the code, and the language is specific to molecular dynamics tasks and cannot be reused in other programs. Meanwhile, MOE (Chemical Computing Group Inc.) released a cleanly constructed molecular dynamics software suite based on a general-purpose vector data-processing language, the Scientific Vector Language [22]. Unfortunately, this proprietary software is used in a limited application domain, without sharing its expertise with other domains. Another challenge in molecular dynamics is a parallel-processing framework organized by a scripting language. For example, a script could segment and schedule tasks while most of the parallel code resides in native code with MPI or OpenMP, which is not trivial to implement or modify. The coroutine and multi-threading support in Lua is ready to control such parallel computing jobs, and the mechanisms with Lua demonstrated in this study could contribute substantially to the design and implementation of a parallel-processing scripting language, as sketched below.
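A minimal sketch of how Lua coroutines could schedule a set of jobs cooperatively; submitting the work to native MPI/OpenMP code is abstracted behind a hypothetical run_segment() function that returns true when its segment is finished.

jobs = {}
for seg = 1, 4 do
  jobs[seg] = coroutine.create(function()
    while not run_segment(seg) do   -- hypothetical native call; false = not done
      coroutine.yield()             -- give the other jobs a turn
    end
  end)
end

repeat
  local busy = false
  for _, job in ipairs(jobs) do
    if coroutine.status(job) ~= "dead" then
      coroutine.resume(job)
      busy = true
    end
  end
until not busy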
(5) A vector data-processing tool
Lua and its C API also allow the prototyping of a next-generation vector data-processing tool. Of course, GNU R, Octave (GNU Project, Free Software Foundation) and MATLAB (The MathWorks Inc.) are successful software packages for vector data processing, with rich programming and visualization functions. Their scripting languages allow comprehensible programming with custom code written in C/C++. However, their dependence on a particular software package limits platform scalability. For example, a parallel execution program in a grid environment always prefers a small footprint for its executable code. In such a case, a small independent program with only a specific calculation is required, rather than a whole software suite. Sometimes such an additional program is developed even though it duplicates functions of existing packages. It is notable that
the method described here with Lua is directly applicable to such a small tool with minimal functions for a specific program. The Lua interpreter itself runs within a very small memory footprint, and such economy is always beneficial in maximizing computational performance. With upcoming multi-core CPU architectures, these kinds of tools will maximize the performance of vector data processing. In addition, Lua provides not only complete programming functionality without compromise, but also the means to implement techniques such as memoization and closures, making it a strong candidate for a next-generation vector data-processing tool.
(6) Domain-specific languages
Various tools for molecular graphics and modeling feature a scripting facility. Most of their scripting languages have remained primitive enough to be used without professional programming knowledge; PyMOL, for instance, offers an imperative scripting grammar besides Python. Still, users must learn each specific script and its semantics, even though a common language would be preferable within this application domain. This study suggests that Lua could be a key milestone toward a shared core for domain-specific languages in wider scientific computing applications. Once a scripting language with a simple grammar is accepted by users, development priority goes to additional functions for the necessary computational tasks rather than to improvements in the language itself. A typical example is how JavaScript, the embedded language of web browsers, became accepted without established academic studies of the language [23]. For prototyping with available scripting tools in biocomputing, Lua can contribute substantially, because embedding it is easier than with other languages and its syntax is simple enough to speed up software development. Dynamically defined Lua tables allow commands and their parameters to be rearranged instantly and flexibly. Supported by the open-source distribution of Lua, several languages, for example Io [24] and Squirrel [25], have evolved to suit their application domains. Although their syntaxes differ, the methodology around Lua establishes a shared core for domain-specific languages applied in scientific computing.
Conclusion
With the programming language Lua, embedding a scripting language in an application program is no longer a difficult hurdle. A practical example, achievable with ordinary C/C++ programming knowledge, has been demonstrated. This lightweight language provides a method for integrating existing programs written in C with interpreted scripts. In addition to the formal C language API, a simplified interface, the EasyLua API, facilitates cooperative use of Lua in biocomputing. The example of MOSBY, a molecular structure browser, demonstrates that data structures written in C/C++ can be exported to Lua to support further scripting tasks. The method is applicable to wider scientific computing applications that require a shared core for domain-specific languages.
Acknowledgements
The author thanks Dr. Roberto Ierusalimschy, Dr. Waldemar Celes, and Dr. Luiz Henrique de Figueiredo for the creation of Lua and its continued fruitful development at LabLua, a laboratory at the Pontifical Catholic University of Rio de Janeiro. This study is part of software development for image analysis in electron microscopy supported by the JST-CNRS Strategic International Cooperative Program of the Japan Science and Technology Agency (JST).
References
[1] Ousterhout, J.K. (1998). Scripting: higher level programming for the 21st century. IEEE Computer, 31(3), 23-30.
[2] Perl.com. The source for Perl. 2008. O'Reilly. Available from: http://www.perl.com
[3] Python Software Foundation. Python programming language. 2008. Available from: http://www.python.org
[4] Matsumoto, Y. Ruby programming language. 2008. Available from: http://www.ruby-lang.org/en/
[5] Open Bioinformatics Foundation. 2008. Available from: http://www.open-bio.org
[6] Kelsey, R., Clinger, W. & Rees, J. (1998). Revised5 report on the algorithmic language Scheme. Higher-Order and Symbolic Computation, 11(3), 7-105.
[7] Gotoh, M. The CINT C/C++ interpreter. 2008. Available from: http://root.cern.ch/twiki/bin/view/ROOT/CINT
[8] Lua.org. The programming language Lua. 2008. Available from: http://www.lua.org
[9] Ierusalimschy, R., de Figueiredo, L.H. & Celes, W. (1996). Lua - an extensible extension language. Software: Practice & Experience, 26(6), 635-652.
[10] Ueno, Y. & Asai, K. (1997). A new plug-in software architecture applied for a portable molecular structure browser. Proceedings of Intelligent Systems for Molecular Biology, 5, 329-332.
[11] Ueno, Y., Arita, M., Kumagai, T. & Asai, K. (2003). Processing sequence annotation data using the Lua programming language. Genome Informatics, 14, 154-163.
[12] Garcés, A. (2007). Scripting language survey. In M. Dickheiser (Ed.), Game Programming Gems 6 (1st ed., pp. 323-340). Boston, MA: Charles River Media.
[13] Ueno, Y. MOSBY: molecular structure browser with analysis. 2008. Available from: http://moonscript.net/mosby
[14] Nakano, T., Kaminuma, T., Sato, T., Fukuzawa, K., Akiyama, Y., Uebayasi, M. & Kitaura, K. (2002). Fragment molecular orbital method: use of approximate electrostatic potential. Chem. Phys. Lett., 351, 475-480.
[15] Dalke, A. & Schulten, K. (1997). Using Tcl for molecular visualization and analysis. Proceedings of the Pacific Symposium on Biocomputing, January.
[16] DeLano, W.L. The PyMOL molecular graphics system. 2002. Available from: http://www.pymol.org
[17] Ishitani, R. CueMol: molecular visualization framework. 2008. Available from: http://www.cuemol.org
[18] Celes, W. toLua - accessing C/C++ code from Lua. 2008. Available from: http://www.tecgraf.puc-rio.br/~celes/tolua/
[19] Beazley, D.M. SWIG: Simplified Wrapper and Interface Generator. 2008. Available from: http://www.swig.org
[20] Budd, T.A. (1995). Multiparadigm Programming in Leda (1st ed.). Reading, MA: Addison-Wesley.
[21] Brunger, A.T. (1992). X-PLOR Manual, Version 3.1. New Haven, CT: Yale University. Available from: http://www.ocms.ox.ac.uk/mirrored/xp_mirror.html
[22] Santavy, M. & Labute, P. SVL: The Scientific Vector Language. 2008. Available from: http://www.chemcomp.com/journal/svl.htm
[23] ECMA International. ECMAScript language specification. 2008. Available from: http://www.ecma.ch/stand/ecma-262.htm
[24] Dekorte, S. Io programming language. 2008. Available from: http://www.iolanguage.com
[25] Demichelis, A. Squirrel: the programming language. 2008. Available from: http://squirrel-lang.org
In: Computational Biology: New Research
Editor: Alona S. Russe
ISBN: 978-1-60692-040-4 © 2009 Nova Science Publishers, Inc.
Chapter 17
COMPUTATIONAL MEDICINE RESEARCH IN HEMATOLOGY: A STUDY ON HEMOGLOBIN AND PROTHROMBIN DISORDERS

Viroj Wiwanitkit
Wiwanitkit House, Bangkhae, Bangkok 10160, Thailand
Abstract
At present, the third wave of medical experimentation, in silico or computational simulation, is accepted as a powerful tool to drive the medical community into the new post-genomics phase. Computational biology research is an important facet of bioinformatics. In this article, the author presents the concept of computational hematology research and shares his experience with it. Cases of hemoglobin and prothrombin disorders are demonstrated. Briefly, computational research can help in understanding the genome, proteome and expression of hemoglobin and prothrombin disorders.
Introduction to Computational Hematology
At present, the third wave of medical experimentation, in silico or computational simulation, is accepted as a powerful new tool to drive the medical community into the post-genomics phase. For two decades, the identification of disease genes has been expanding rapidly [1]. Those identified in the earlier part of that period were largely found through positional cloning, and the majority concern relatively rare disorders related to single genes [1]. With the Human Genome Project complete, the rate of gene discovery has increased substantially through modern DNA sequencing techniques and in silico approaches [1]. Computational biology research is an important facet of bioinformatics. Attention has increasingly focused on the use of computational techniques for the design of combinatorial libraries [2]. Genomics, whether comparative, functional or structural, is useful in medicine. Beyond basic genomics, other "omics" sciences have developed. Proteomics is the protein counterpart of genomics and specifically assesses gene expression at the functional level [3]. The proteome of an organism is the corresponding
protein complement of its genome [3]. However, unlike the genome, the proteome is dynamic [3]. Computer-based experimental approaches are useful in all branches of medicine; many applications in medical microbiology, cellular pathology, clinical chemistry, haematology/immunology, pharmacology and toxicology can be seen today [3]. In hematology, computational research is widely applied. Beyond database technology, other computational tools of bioinformatics are useful for hematological experiments. Of interest, a specific database for hematology has also been launched. Hembase (http://hembase.niddk.nih.gov) is an integrated browser and genome portal designed for web-based assessment of the human erythroid transcriptome [4]. To date, Hembase holds 15,752 entries from erythroblast expressed sequence tags (ESTs) and 380 referenced genes relevant to erythropoiesis [4]. The database is well accepted, providing a cytogenetic band position, a unique name and a concise annotation for each entry [4]. Based on genomics and proteomics, simulation studies of many hematological disorders can readily be carried out down to the nano level [5]. In addition, expression analysis by the microarray technique is available. Microarrays are fast becoming routine, fundamental tools for the high-throughput analysis of gene expression in a wide range of biologic systems, including hematology [6]. Although a number of approaches can be taken when implementing microarray-based studies, all are capable of providing important insights into biologic function [7-9]. Gene Ontology has also been launched to systematize the mass of expression data. Briefly, computational research can help one understand the genome, proteome and expression of hemoglobin and prothrombin disorders. In this article, the author presents the concept of computational hematology research and shares his experience with it. Cases of hemoglobin and prothrombin disorders are demonstrated.
Computational Medicine Research on Hemoglobin Disorders
A. Basic knowledge of hemoglobin disorders
Hemoglobin is composed of heme and globin chains. The two main hemoglobin disorders are thalassemia and hemoglobinopathy. The fundamental abnormality in thalassemia is impaired production of the alpha or beta globin chains; the defects may lie in the alpha globin chain genes or in the beta globin chain gene. Alpha thalassemia occurs when one or more of the four alpha chain genes fail to function, while beta thalassemia occurs when one or both of the beta chain genes fail to function [10-14]. Compared with alpha thalassemia, beta thalassemia rarely arises from the complete loss of a beta globin gene [10-14]. The beta globin gene is usually present but produces little beta globin protein, and the degree of suppression varies widely [10-14]. In some cases, the affected beta gene makes essentially no beta globin protein (beta-0-thalassemia), while in others the production of beta chain protein is significantly lower than normal but not zero (beta-(+)-thalassemia) [10-14]. As with other inherited disorders, the defect can be found in homozygous or heterozygous form. In addition, the combination of thalassemia and hemoglobinopathy is common and contributes to anemia in millions of people worldwide. Apart from thalassemia, the other inherited hemoglobin disorder is the specific abnormality called
"hemoglobinopathy". Hemoglobinopathies are a group of inherited hemoglobin disorders in which the structure of hemoglobin is abnormal or hemoglobin is improperly formed, but not owing to loss of a globin gene. To date, more than one hundred hemoglobinopathies have been documented in the literature. Each hemoglobinopathy has its own underlying genetic defect and therefore its own specific properties and manifestations. Like thalassemia, hemoglobinopathy can result in anemia and has become an important health problem around the world.
B. Some interesting computational medicine research on hemoglobin disorders
There is much recent computational medicine research on hemoglobin disorders. For data mining, a new hemoglobin disorder database has been launched. HbVar (http://globin.cse.psu.edu/globin/hbvar/) is a relational database developed by a multi-center academic effort to provide up-to-date, high-quality information on the genomic sequence changes underlying hemoglobin variants and all kinds of thalassemia and hemoglobinopathy [15]. Extensive information from the literature is recorded for each variant and mutation, including sequence alterations, biochemical and hematological effects, associated pathology, ethnic occurrence and references [15]. In addition, HbVar has been linked with GALA (the Genome Alignment and Annotation database, available at http://globin.cse.psu.edu/gala/) so that users can combine information on hemoglobin variants and thalassemia mutations with a wide spectrum of genomic data [15]. Regarding genomics, human genomics can lead to predictive-preventive medicine and precision medicine, with profound medical and social implications [16]. DNA diagnosis is relatively inexpensive, helps to develop skills in molecular biology, and provides a foundation for developing national expertise in genomics [17-18]. As previously mentioned, genomics has proved useful for the prevention and control of congenital hemoglobin disorders, and also for in-depth study of their pathogenesis. Genetic factors affecting postnatal gamma-globin expression, a major modifier of the severity of both beta-thalassemia and sickle cell anemia, have not been easy to study [19]. To model the human beta-globin cluster in mice, with the goal of screening for loci affecting human gamma-globin expression in vivo, Lin et al. recently introduced a human beta-globin cluster YAC transgene into the genome of FVB/N mice [19]. Combining transgenic modeling of the human beta-globin gene cluster with quantitative trait analysis, they identified and mapped a murine locus that influences human gamma-globin levels in vivo [19]. In addition to genomics, proteomics is useful for studying congenital hemoglobin disorders. The red blood cell, or erythrocyte, is easily purified and has a relatively simple structure; it has therefore become one of the best-studied cells in terms of protein composition and function [20]. Erythrocyte proteomic studies performed over the last few years, by many laboratories, have identified hundreds of proteins within the human erythrocyte [20]. In a recent study by Chou et al.
in 2006 [21], a proteomic approach using a cleavable ICAT reagent and nano-LC ESI tandem mass spectrometry was applied to profile core RBC membrane skeleton proteins in sickle cell patients versus healthy controls and to determine the efficacy of this technology. In this work, there was no
significant difference in the mean ratios within control populations (AA1/AA2) or between sickle cell and healthy control populations (AA/SS) [21]. Quantitative changes in the red blood cell membrane proteome in sickle cell disease were also analyzed, using the two-dimensional fluorescence difference gel electrophoresis (2D-DIGE) technique, in another study by Kakhniashvili et al. [22]. They concluded that the elevated content of protein-repair participants and oxygen radical scavengers might reflect the increased oxidative stress observed in sickle cells [22]. In addition, there are many recent studies on the structures of disordered hemoglobins; important ones are listed in Table 1.

Table 1. Recent studies on the structures of hemoglobin disorders

Wiwanitkit [23]: Hb Suan-Dok is an example of a hemoglobinopathy that was first identified and described in Thailand [23]. It has been identified as an unstable hemoglobin variant associated with alpha-thalassemia [23]. In this study the amino acid sequence of human alpha globin was extracted using ExPASY and compared with that of the Hb Suan-Dok disorder [23]. The derived sequences, the alpha globin chains of both the normal and the Hb Suan-Dok form, were used for further investigation of the tertiary structures, modeled with the CPHmodels 2.0 server [23]. For comparison, the tertiary structures of the normal and Hb Suan-Dok human alpha globin chains are calculated and presented [23]. The data suggest that the thalassemic defect related to the Suan-Dok mutation results from another, unidentified process rather than from structural aberration, and that the thalassemic picture might be owing to another undetected inherited hemoglobin disorder [23].

Wiwanitkit [24]: Hb Q-India is a hemoglobinopathy that was first identified in India [24]. It is caused by the specific mutation GAC --> CAC at codon 64 of the alpha-1 globin gene [24]. A correlation between this hemoglobinopathy and thalassemia has been reported. Although the primary structure of Hb Q-India is well documented, the secondary and tertiary structures, which could help explain its pathogenesis, were not known [24]. In this study, the amino acid sequence of human alpha globin was retrieved with ExPASY and mutated in silico to the Hb Q-India variant [24]. The main difference between the predicted alpha globin secondary structures of the normal and Hb Q-India chains is a specific extra helix in Hb Q-India; the predicted tertiary structure also supports this finding [24].

Wiwanitkit [25]: Hb Geneva is an unstable hemoglobin with abnormal elongation [25], known for its high instability. Concerning its pathogenesis, the data indicate a change at codon 114 from CTG (Leu) to -GG, resulting in a frameshift and the presumed synthesis of an abnormal beta chain that is 156 residues long with a completely different C-terminal amino acid sequence [25]. This frameshift results in elongation of the beta chain [25]. A bioinformatic analysis, with computer-based protein structure modeling, was used to study the secondary and tertiary structures of the abnormal amino acid sequence [25]. In the tertiary structure, deterioration of the folds, accompanied by aberration of the secondary structure of the globin in Hb Geneva, can be identified [25].

Wiwanitkit [26]: Hb Pakse is an unstable hemoglobin with abnormal elongation, first described in Indochina [26]. An alpha2-globin gene termination codon mutation, TAA --> TAT or Term --> Tyr, has been described in the pathogenesis of Hb Pakse [26]. Computer-based protein structure modeling was used in a bioinformatic analysis of the tertiary structure of the elongated amino acid sequence. The elongated part of Hb Pakse showed additional helices, which may cause the main alteration in Hb Pakse [26]. Abnormalities in the fold structure of the globin were identified, and helices additional to those of the normal alpha globin chain were shown in the elongated part [26].

Wiwanitkit [27]: A functional analysis was performed on four important beta hemoglobinopathies (hemoglobins C, D, E and S) using PolyPhen, a novel bioinformatic tool. The mutations Hb C (beta 6, Glu --> Lys), Hb D (beta 121, Glu --> Gln), Hb E (beta 26, Glu --> Lys) and Hb S (beta 6, Glu --> Val) were selected for study [27]. According to this in silico mutation study, the functional change in the studied hemoglobinopathies was variable [27]. The position-specific independent counts (PSIC) difference score ranged from 1.362 (Hb D) to 2.986 (Hb S) [27]. Regarding the degree of damage, all were classified as probably damaging [27]. The analysis demonstrated that the functional aberration in these hemoglobinopathies rests on a complex pathogenesis; identifying only the structural aberration is not sufficient, and additional functional analysis is recommended [27]. The functional analysis presented here may be a good model for further research [27].
Pharmacogenomic studies of congenital hemoglobin disorders are also of interest [28-30]. Genetic association studies, which attempt to link polymorphisms with particular disease phenotypes and drug responses, are taking the first steps toward individualized therapy for sickle cell patients, aiming to enhance efficacy and decrease toxicity [28]. Finally, the newest "omic" science, interactomics, has also been proposed for use in hematology. The red blood cell interactome can provide considerable insight into disorder diagnosis, severity, and response to drug or gene therapy [20].
C. Examples
• A study of structural aberration in Hb Siam
Hb Siam is an example of a hemoglobinopathy that was first identified in Thailand. It is caused by the mutation [alpha15(A13)Gly-->Arg (alpha1) (GGT-->CGT)] in the alpha globin gene [31-33]. Like other hemoglobinopathies, Hb Siam is a protein disorder. At present, the molecular structure of Hb Siam is not well documented. A study of the tertiary structure is warranted for complete knowledge of the structural change in any protein disorder, and a study of the secondary and tertiary structures can help explain the pathogenesis of the Hb Siam disorder. The main objective of this study was to determine the secondary and tertiary structures of Hb Siam by bioinformatic methods. The author performed a bioinformatic analysis to assess the effect of the sequence change in Hb Siam on the secondary and tertiary structures of the alpha globin chain. A computer-based study of amino acid sequence comparison and protein structure modeling was done. The ExPASY database [34] was used to retrieve the amino acid sequence of the normal human alpha globin chain; the mutation alpha 15 Gly-->Arg was then introduced in silico. For secondary structure modeling, the author predicted the secondary structures of the alpha globins of both normal hemoglobin and the hemoglobin Siam disorder from the primary sequence using the NNPREDICT server [35]. The calculated secondary structures were presented and compared, as were the calculated tertiary structures. All programs used in this study are standard programs of bioinformatic research. The sequence of the alpha globin chain retrieved from ExPASY, and the in silico mutated alpha globin chain of the hemoglobin Siam disorder, are presented in Table 2. Using the NNPREDICT server, the secondary structures of the alpha globin chains of normal hemoglobin and the Hb Siam disorder were calculated.
Table 2. Alpha globin chains according to the ExPASY database and the derived mutated alpha globin chain of the hemoglobin Siam disorder

1. Normal alpha globin chain (entry name: Q9NYR7, gene name: HBA2, AC number: Q9NYR7):
MVLSPADKTNVKAAWGKVGAHAGEYGAEALEKMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR

2. Derived alpha globin chain of the hemoglobin Siam disorder (alpha 15 Gly-->Arg):
MVLSPADKTNVKAAWRKVGAHAGEYGAEALEKMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
Considering the derived secondary structures, there are 73 helix residues and 4 strand residues in the normal alpha globin chain, and 72 helix residues and 4 strand residues in the alpha globin chain of Hb Siam. In this study, the secondary structures of the human alpha globin chains of normal hemoglobin and the hemoglobin Siam disorder were calculated and presented. Based on this information, the main difference between the predicted secondary structures of the normal and hemoglobin Siam alpha globin chains is the deletion of a helix in Hb Siam. The results of this study provide useful data for further work on the hemoglobin Siam disorder, leading to a better understanding of this hemoglobinopathy.

A. alpha globin chain in normal
--------H--HHHHHHH---H-HHHHHHHHHHH--------------------HEH---HHHHHHHHHHH------HHHHHHHHHHH---------HHHHHHHHHHHHHH-------HHHHHHHHHHHHHHHEEE-----

B. alpha globin chain in hemoglobin Siam *
--------H--HHH-HHH---H-HHHHHHHHHHH--------------------HEH---HHHHHHHHHHH------HHHHHHHHHHH---------HHHHHHHHHHHHHH-------HHHHHHHHHHHHHHHEEE-----

Figure 1. Calculated secondary structures of the alpha globin chains of normal hemoglobin and the Hb Siam disorder (secondary structure prediction: H = helix, E = strand, - = no prediction). * The difference is the interrupted helix segment (HHH-HHH in Hb Siam versus HHHHHHH in the normal chain).
• A study of structural aberration in Hb Siriraj
Among the several types of hemoglobinopathy, the hemoglobin (Hb) Siriraj disorder is a beta chain variant in which beta 7 Glu is replaced by lysine [36-38]. The disorder was first found in Bangkok, Thailand [36-38]. An individual with Hb Siriraj usually presents with mild or asymptomatic manifestations [36-37]; however, an individual with concomitant sickle cell anemia is more severely affected and manifests anemic symptoms [36]. The molecular structure of Hb Siriraj is not well understood. Concerning the primary structure, the specific mutation has long been recognized and can easily be detected by isolation of the beta chain through ion-exchange chromatography of total globin on CM-cellulose [36-38]. However, knowledge of the secondary structure of this hemoglobinopathy is lacking, and further study, which could better explain its pathogenesis, is needed. Here, the author performed a bioinformatic analysis to study the effect of the sequence change in Hb Siriraj on the secondary structure of the beta globin chain, using a computer-based study of amino acid sequence comparison and protein structure modeling. A procedure similar to that presented for Hb Siam was performed; the main difference between the globin chains of normal hemoglobin and Hb Siriraj is an additional helix in the structure. Indeed, an additional helix within a globin is reported to be an important factor leading to the instability of Hb [39]. Similar additional helices within the molecules of Hb C and Hb S have been reported [40], and this might be a clue to the mild sickle cell syndrome resulting from Hb Siriraj.
Computational Medicine Research on Prothrombin Disorders
A. Basic knowledge of prothrombin disorders
Prothrombin is an important protein in thrombohaemostasis, critical for the coagulation of blood, and its structure and function have been studied extensively. The biological functions of prothrombin and its activated form, thrombin, as well as the structure and functional domains of the protein, have been described [41]. Prothrombin deficiencies represent a group of thrombohaemostatic disorders that occur in both acquired and congenital forms. Congenital prothrombin deficiency is a rare bleeding disorder inherited as an autosomal recessive trait [42]. Some cases are lethal, but most are of mild severity [42]. Several prothrombin deficiency disorders have been reported.
B. Some interesting computational medicine research on prothrombin disorders
As with hemoglobin, there is much recent computational medicine research on prothrombin disorders. Genomics can be used successfully to study congenital prothrombin disorders. Regarding proteomics, the analysis of prothrombin complex concentrates can serve as a model to assess to what extent these technologies can detect differences in blood-derived treatments beyond standard quality control [43]. Proteomic technologies allow the identification of potentially modified proteins in clotting factor concentrates, showing that they could become a useful tool for transfusion medicine to assess the impact of processing on the integrity of blood-derived therapeutics [43].
prothrombin and plasma proteins associated with the G20210A mutation was recently reported by Gelfi et al. [44]. This study, based on proteomic investigation by two-dimensional gel electrophoresis with protein identification by electrospray ionization tandem mass spectrometry, indicated that the G20210A mutation is associated with increased glycosylation of prothrombin, which implies greater stability of the protein [44]. In addition, there are many recent studies on the structures of prothrombin disorders. For example, Wiwanitkit recently reported the structural aberration in prothrombin Shanghai [45]. The main structural aberration in the prothrombin Shanghai disorder is the loss of two helices [45]. Wiwanitkit noted that the disorder in this region, as detected in that study, could be a good explanation of the pathogenesis [45].

C. Examples

• A study on functional aberration in some prothrombin disorders
Congenital prothrombin deficiency is a rare bleeding disorder inherited as an autosomal recessive trait [46]. Some cases are lethal but most are mild [46]. Several prothrombin deficiency disorders have been reported, and to date many prothrombin deficiency variants have been documented in the literature. Each variant has its own specific underlying genetic defect and therefore its own specific properties and manifestations. A single substitution in the amino acid chain is the more common form of prothrombin deficiency variant, and such a variant usually presents only one aberration in the secondary structure. The functional aberrations corresponding to the structural aberration are also well documented; however, although many prothrombin deficiency variants present similar abnormal structural points, their functions are sometimes discordant. Here, the author performed a functional analysis of some prothrombin deficiency variants using a novel bioinformatic tool. The ExPASy (Expert Protein Analysis System) server was used for data mining of the amino acid sequence of human prothrombin. The mutations of three well-known prothrombin deficiency variants, Shanghai (29 Glu-->Gly) [47], Carora (44 Tyr-->Cys) [48] and Barcelona (273 Arg-->Cys) [49], were selected for further investigation. All three selected prothrombin deficiencies have a single amino acid substitution as their underlying pathogenesis. A novel bioinformatic simulation tool, PolyPhen [50], was applied to study the effect of each mutation on prothrombin structure and function. Briefly, PolyPhen is an automatic tool for predicting the possible impact of an amino acid substitution on the structure and function of a specific human protein. The prediction is based on classical rules applied to the sequence, phylogenetic and structural information characterizing the substitution. Concerning the input, PolyPhen works with human proteins and identifies them by the amino acid sequence itself; the amino acid replacement is indicated by position number and substitution. For a given amino acid substitution in a human protein, PolyPhen performs several steps: a) sequence-based characterization of the substitution site, b) calculation of the degree of functional change (PSIC scores), c) calculation of structural parameters and contacts, and d) the specific prediction.
Table 3. Functional change in the studied prothrombin deficiency variants

Studied prothrombin deficiency variant   PSIC difference score*   Degree of damaging**
Shanghai                                 2.6                      Probably damaging
Carora                                   2.4                      Probably damaging
Barcelona                                0.4                      Benign

* The PSIC difference score shows the degree of functional change.
** There are four degrees of damaging: a) probably damaging (with high confidence supposed to affect protein function or structure); b) possibly damaging (supposed to affect protein function or structure); c) benign (most likely lacking any phenotypic effect); and d) unknown (in some rare cases, the lack of data does not allow PolyPhen to make a prediction).
According to the in silico mutation study, the functional change in the studied prothrombin variants is shown in Table 3. The PSIC difference score varies from 0.4 to 2.6. Concerning the degree of damaging, benign is detected for the Barcelona variant, while probably damaging is detected for the Carora and Shanghai variants (Table 3). Here, the author studied the functional changes in some common prothrombin deficiency variants. The three selected prothrombin deficiency variants are those with a single substitution and a single structural aberration. According to this study, the functional aberration increases in order from Barcelona through Carora to Shanghai. This finding is concordant with the reported clinical features of these variants [47-49]. It also shows a trend for mutations located earlier in the amino acid sequence to have a more severe presentation. It can therefore be seen that the functional aberration in a prothrombin deficiency variant rests on a complex pathogenesis: identifying only the structural aberration in a prothrombin deficiency variant is not sufficient, and it should be supplemented with a further functional analysis for better insight into this specific aspect of prothrombin.
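The last step of this analysis, turning a PSIC difference score into a degree of damaging, can be sketched in a few lines of Python. This is a minimal illustration rather than PolyPhen's actual implementation, and the cutoff values below are assumptions chosen so that the scores in Table 3 reproduce the reported categories; they are not PolyPhen's published parameters:

# A minimal sketch of mapping a PSIC difference score to a qualitative
# degree of damaging; the cutoffs are illustrative assumptions chosen to
# reproduce Table 3, not PolyPhen's published parameters.
def degree_of_damaging(psic_difference):
    if psic_difference is None:
        return "unknown"              # rare case: not enough data to predict
    if psic_difference >= 2.0:
        return "probably damaging"
    if psic_difference >= 1.5:
        return "possibly damaging"
    return "benign"

variants = {"Shanghai": 2.6, "Carora": 2.4, "Barcelona": 0.4}
for name, score in sorted(variants.items(), key=lambda kv: kv[1]):
    # prints Barcelona (benign) first, then Carora and Shanghai (probably damaging)
    print(f"{name}: PSIC difference {score} -> {degree_of_damaging(score)}")

Running the sketch on the Table 3 scores labels Barcelona benign and both Carora and Shanghai probably damaging, matching the reported degrees.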
References

[1] Thomas SM. Genomics: the implications for ethics and education. Br Med Bull. 1999;55(2):429-45.
[2] Leach AR, Hann MM. The in silico world of virtual libraries. Drug Discov Today. 2000;5(8):326-36.
[3] Marshall T, Williams KM. Proteomics and its impact upon biomedical science. Br J Biomed Sci. 2002;59(1):47-64.
[4] Goh SH, Lee YT, Bouffard GG, Miller JL. Hembase: browser and genome portal for hematology and erythroid biology. Nucleic Acids Res. 2004;32(Database issue):D572-4.
[5] Waenlor W. Analysis of hemoglobin disorders by application of computational tools for nanohematology. Nanomedicine. 2005;1(3):219.
[6] Walker J, Flower D, Rigley K. Microarrays in hematology. Curr Opin Hematol. 2002;9(1):23-9.
[7] Thomas PD, Mi H, Lewis S. Ontology annotation: mapping genomic regions to biological function. Curr Opin Chem Biol. 2007;11(1):4-11.
[8] Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R. The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res. 2004;32(Database issue):D262-6.
[9] Daraselia N, Yuryev A, Egorov S, Mazo I, Ispolatov I. Automatic extraction of gene ontology annotation and its correlation with clusters in protein networks. BMC Bioinformatics. 2007;8:243.
[10] Lo L, Singer ST. Thalassemia: current approach to an old disease. Pediatr Clin North Am. 2002;49:1165-91.
[11] Old JM. Screening and genetic diagnosis of haemoglobin disorders. Blood Rev. 2003;17:43-53.
[12] Fucharoen S, Wanichagoon G. Thalassemia and abnormal hemoglobin. Int J Hematol. 2002;76 Suppl 2:83-9.
[13] Gabutti V. Current therapy for thalassemia in Italy. Ann N Y Acad Sci. 1990;612:268-74.
[14] Glader BE, Look KA. Hematologic disorders in children from southeast Asia. Pediatr Clin North Am. 1996;43:665-81.
[15] Patrinos GP, Giardine B, Riemer C, Miller W, Chui DH, Anagnou NP, Wajcman H, Hardison RC. Improvements in the HbVar database of human hemoglobin variants and thalassemia mutations for population and sequence variation studies. Nucleic Acids Res. 2004;32(Database issue):D537-41.
[16] Wasi P. Human genomics: implications for health. Southeast Asian J Trop Med Public Health. 1997;28 Suppl 2:19-24.
[17] Alwan A, Modell B. Recommendations for introducing genetics services in developing countries. Nat Rev Genet. 2003;4(1):61-8.
[18] Weatherall DJ. Genomics and global health: time for a reappraisal. Science. 2003;302(5645):597-9.
[19] Lin SD, Cooper P, Fung J, Weier HU, Rubin EM. Genome scan identifies a locus affecting gamma-globin level in human beta-cluster YAC transgenic mice. Mamm Genome. 2000;11(11):1024-9.
[20] Goodman SR, Kurdia A, Ammann L, Kakhniashvili D, Daescu O. The human red blood cell proteome and interactome. Exp Biol Med (Maywood). 2007;232(11):1391-408.
[21] Liu W, Silverstein AM, Shu H, Martinez B, Mumby MC. Protein profiling of sickle cell versus control RBC core membrane skeletons by ICAT technology and tandem mass spectrometry. Cell Mol Biol Lett. 2006;11(3):326-37.
[22] Kakhniashvili DG, Griko NB, Bulla LA Jr, Goodman SR. The proteomics of sickle cell disease: profiling of erythrocyte membrane proteins by 2D-DIGE and tandem mass spectrometry. Exp Biol Med (Maywood). 2005;230(11):787-92.
[23] Wiwanitkit V. Modeling for tertiary structure of globin chain in Hemoglobin Suan-Dok disorder. Hematology. 2005;10(2):163-5.
[24] Wiwanitkit V. Secondary and tertiary structure aberration of alpha globin chain in haemoglobin Q-India disorder. Indian J Pathol Microbiol. 2006;49(4):491-4.
[25] Wiwanitkit V. Structural analysis on the abnormal elongated hemoglobin "hemoglobin Geneva". Nanomedicine. 2005;1(3):216-8.
Computational Medicine Research in Hematology
417
[26] Wiwanitkit V. Tertiary structural analysis of the elongated part of an abnormal hemoglobin, hemoglobin Pakse. Int J Nanomedicine. 2006;1(1):105-7.
[27] Wiwanitkit V. Analysis of functional aberration of some important beta hemoglobinopathies (hemoglobin C, D, E, and S) from nanostructures. Nanomedicine. 2005;1(3):213-5.
[28] Makis AC, Hatzimichael EC, Stebbing J. The genomics of new drugs in sickle cell disease. Pharmacogenomics. 2006;7(6):909-17.
[29] Weatherall D. Sir David Weatherall reflects on genetics and personalized medicine. Interviewed by Ulrike Knies-Bamforth. Drug Discov Today. 2006;11(13-14):576-9.
[30] Motulsky AG, Stamatoyannopoulos G. Drugs, anesthesia and abnormal hemoglobins. Ann N Y Acad Sci. 1968;151(2):807-21.
[31] Turbpaiboon C, Svasti S, Sawangareetakul P, Winichagoon P, Srisomsap C, Siritanaratkul N, Fucharoen S, Wilairat P, Svasti J. Hb Siam [alpha15(A13)Gly-->Arg (alpha1) (GGT-->CGT)] is a typical alpha chain hemoglobinopathy without an alpha-thalassemic effect. Hemoglobin. 2002;26:77-81.
[32] Pootrakul S, Srichiyanont S, Wasi P, Suanpan S. Hemoglobin Siam (alpha 2 15 arg beta 2): a new alpha-chain variant. Humangenetik. 1974;23:199-204.
[33] Yodsowan B, Svast J, Srisomsap C, Winichagoon P, Fucharoen S. Hb Siam [alpha15(A13)Gly-->Arg] is a GGT-->CGT mutation in the alpha1-globin gene. Hemoglobin. 2000;24:71-5.
[34] Gasteiger E, Gattiker A, Hoogland C, Ivanyi I, Appel RD, Bairoch A. ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res. 2003;31:3784-8.
[35] Kneller DG, Cohen FE, Langridge R. Improvements in protein secondary structure prediction by an enhanced neural network. J Mol Biol. 1990;214:171-82.
[36] Rhoda MD, Arous N, Garel MC, Mazarin M, Monplaisir N, Braconnier F, Rosa J, Cohen-Solal M, Galacteros F. Interaction of hemoglobin Siriraj with hemoglobin S: a mild sickle cell syndrome. Hemoglobin. 1986;10:21-31.
[37] Foldi J, Horanyi M, Szelenyi JG, Hollan SR, Aseeva EA, Lutsenko IN, Spivak VA, Toth O, Rozynov BV. Hemoglobin Siriraj found in the Hungarian population. Hemoglobin. 1989;13:177-80.
[38] Ittarat W, Ongcharoenjai S, Rayatong O, Pirat N. Correlation between some discrimination functions and hemoglobin Siriraj. J Med Assoc Thai. 2000;83:259-65.
[39] Wiwanitkit V. Structural analysis of the elongated part of abnormal haemoglobin, haemoglobin Tak. Haema. 2005;8:626-8.
[40] Hirsch RE, Juszczak LJ, Fataliev NA, Friedman JM, Nagel RL. Solution-active structural alterations in liganded hemoglobins C (beta6 Glu-->Lys) and S (beta6 Glu-->Val). J Biol Chem. 1999;274:13777-82.
[41] Sun WY, Degen SJ. Gene targeting in hemostasis. Prothrombin. Front Biosci. 2001;6:D222-D238.
[42] Strijks E, Poort SR, Renier WO, Gabreels FJ, Bertina RM. Hereditary prothrombin deficiency presenting as intracranial haematoma in infancy. Neuropediatrics. 1999;30:320-4.
[43] Brigulla M, Thiele T, Scharf C, Breitner-Ruddock S, Venz S, Völker U, Greinacher A. Proteomics as a tool for assessment of therapeutics in transfusion medicine: evaluation of prothrombin complex concentrates. Transfusion. 2006;46(3):377-85.
[44] Gelfi C, Viganò A, Ripamonti M, Wait R, Begum S, Biguzzi E, Castaman G, Faioni EM. A proteomic analysis of changes in prothrombin and plasma proteins associated with the G20210A mutation. Proteomics. 2004;4(7):2151-9.
[45] Wiwanitkit V. Structure aberration of prothrombin in prothrombin Shanghai disorder. Haema. 2006;9(2):270-3.
[46] Strijks E, Poort SR, Renier WO, Gabreels FJ, Bertina RM. Hereditary prothrombin deficiency presenting as intracranial haematoma in infancy. Neuropediatrics. 1999;30:320-4.
[47] Wang WB, Wang HL, Huang CY, Fang Y, Fu QH, Zhou RF, Xie S, Ding QL, Wu WM, Wang XF, Hu YQ, Wang ZY. Prothrombin deficiency resulted from a homozygous Glu29 to Gly mutation in the prothrombin gene. Zhonghua Xue Ye Xue Za Zhi. 2003;24:449-51.
[48] Sun WY, Ruiz-Saez A, Burkart MC, Bosch N, Degen SJ. Prothrombin Carora: hypoprothrombinaemia caused by substitution of Tyr-44 by Cys. Br J Haematol. 1999;105:670-2.
[49] Rabiet MJ, Furie BC, Furie B. Molecular defect of prothrombin Barcelona. Substitution of cysteine for arginine at residue 273. J Biol Chem. 1986;261:15045-8.
[50] Ramensky V, Bork P, Sunyaev S. Human non-synonymous SNPs: server and survey. Nucleic Acids Res. 2002;30:3894-900.
INDEX A AAA, 327 AAC, 326, 327, 328 aberrant methylation, 226 academic, 12, 16, 390, 409 ACC, 327 accessibility, 157, 161, 166, 179, 180, 182, 186, 190, 191, 325 accounting, 62, 75, 142 accuracy, 13, 14, 15, 76, 90, 92, 133, 134, 135, 136, 145, 166, 177, 182, 183, 216, 217, 220, 225, 234, 239, 240, 242, 247, 248, 249, 250, 252, 253, 263, 264, 267, 268, 291, 315, 317, 318, 320, 325, 327, 328, 330, 331, 333, 335, 336, 337, 375, 377, 382 acetylation, 93, 96 ACF, 326, 329 Ach, 97 achievement, 307 acid, 13, 15, 16, 17, 41, 134, 146, 162, 165, 168, 169, 171, 174, 177, 279, 280, 281, 282, 285, 295, 296, 303, 304, 310, 311, 313, 317, 320, 321, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 337, 338, 339, 340, 352, 358, 359, 376, 378, 379, 380, 381, 382, 383, 384, 385, 400, 410, 411, 413, 414, 415 acidic, 14 ACM, 254, 255 actin, 23, 362, 372 activation, 17, 93, 97, 178, 283, 376 activators, 362 active site, 45 acute, 62, 84, 271 acute leukemia, 84 acute myeloid leukemia (AML), 271 acyl transferase, 14 Adams, 188
adaptation, 20, 24, 25, 355, 398 adaptive control, 67, 80 adenine, 3 adenocarcinoma, 227, 271 adenoviral vectors, 91 adenovirus, 265 adenoviruses, 91 adenylate kinase, 352, 359 adjustment, 76, 77, 85, 127 administration, 226 adult, 216, 226, 249, 250, 266, 271 adults, 233 Africa, 20 African American, 25 Ag, 386 AGC, 233 age, 50, 53, 145, 220 agent, 100, 101, 102, 104, 105, 107, 109, 110, 112, 116, 386 agents, 9, 11, 12, 99, 100, 101, 102, 103, 104, 109, 110, 113, 114, 115, 116, 119, 126, 227, 352 aggregation, 218 aid, 364, 369, 373, 402 air, 246 Airlines, 99 alanine, 392 albinism, 20 allele, 3, 7, 8, 21, 22, 23, 24, 25, 232, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 252, 253, 266 alleles, 9, 3, 19, 20, 21, 22, 23, 24, 25, 233, 234, 235, 236, 241, 243, 244, 245, 246, 248, 249, 250, 380 allosteric, 195, 212, 352 alpha, 143, 145, 155, 285, 313, 408, 410, 411, 412, 417 alternative, 9, 12, 1, 8, 62, 63, 65, 67, 69, 72, 73, 74, 75, 77, 78, 109, 132, 138, 139, 141, 149, 170,
420
Index
171, 175, 193, 194, 205, 232, 233, 279, 283, 284, 312, 346, 387, 402 alternative hypothesis, 63, 65, 73, 74, 78 alternatives, 76, 77, 101, 137, 171, 179 American Airlines, 99 amino acid side chains, 381 amino acids, 15, 43, 133, 136, 167, 168, 169, 170, 172, 174, 175, 178, 179, 180, 285, 317, 318, 320, 321, 323, 325, 327, 328, 331, 352, 360, 375, 376, 377, 378, 379, 380, 382, 383, 384, 387, 410 ammonia, 13 amphibia, 255 amplitude, 211, 213 Amsterdam, 127, 261, 262, 358 analog, 31, 296 anatomy, 6, 7 anemia, 409 angiogenesis, 10, 47, 48 angiogenic, 228 animal models, 10, 59, 265 animals, 13, 60, 61, 66, 79, 231, 232 anisotropy, 198, 211 ANN, 382 annealing, 173, 175, 185, 186, 188, 240, 253 annexin I, 309 annotation, 41, 42, 44, 90, 143, 146, 224, 368, 385, 405, 408, 416 antibacterial, 17 antibody, 9, 6, 11, 15, 16, 17, 18, 225, 229, 364 antigen, 15, 16, 285, 377, 386, 387 antioxidant, 223 API, 394, 395, 396, 397, 402, 403, 404 apoptosis, 10, 47, 48, 93, 96, 97, 224, 226 apoptotic, 97 application, 10, 13, 14, 15, 16, 7, 39, 41, 42, 43, 62, 65, 66, 67, 73, 75, 79, 82, 83, 100, 102, 128, 130, 142, 158, 166, 212, 259, 272, 315, 318, 322, 334, 338, 344, 353, 356, 361, 371, 389, 390, 391, 394, 398, 402, 403, 404, 415 applied mathematics, 200 aqueous solution, 308 ARB, 55 archetype, 353, 354 arginine, 13, 14, 17, 418 argument, 348, 397, 398 Ariel, 228 Aristotelian, 342, 343, 348, 350, 351 Aristotle, 342, 347, 357 arithmetic, 21, 134, 352, 356 arrest, 97 arthropods, 141 Asia, 416 Asian, 416
assessment, 43, 65, 77, 79, 80, 85, 190, 366, 407, 408, 418 assignment, 11, 21, 99, 100, 102, 103, 104, 106, 109, 113, 116, 123, 126, 127, 128, 183, 184, 232, 233, 235, 245, 246, 252, 254, 319, 392, 400 assumptions, 10, 13, 30, 32, 33, 34, 59, 60, 63, 66, 77, 78, 79, 231, 235, 237, 239, 253, 264, 279, 291, 347 asymptomatic, 54, 413 asymptotics, 82 ATF, 365 Atlantic, 239, 248, 253 atomic force, 195 atomic force microscopy (AFM), 195, 209 atoms, 148, 149, 167, 168, 170, 173, 174, 176, 194, 195, 196, 197, 198, 201, 281, 288, 294, 295 ATP, 159, 194, 210 ATPase, 194, 201, 209, 210 autocorrelation, 334 autoimmune, 386, 387 autoimmune diseases, 386 automata, 15, 341 autosomal recessive, 413, 414 availability, 3, 41, 70, 130, 135, 140, 363, 364 averaging, 126, 294, 299 avoidance, 347 Azobenzene, 210
B B cell, 62 Bacillus, 220 Bacillus Calmette-Guerin (BCG), 220 back, 177, 178, 327, 330, 352 bacterial, 145, 385 Banach spaces, 37 barrier, 157, 289, 290, 291, 292, 293, 294, 296, 298, 299, 301, 302, 303, 307 base pair, 91, 162, 233 Bayesian, 85, 133, 134, 136, 144, 145, 178, 225, 237, 267, 268, 275, 317, 321, 330, 335, 340 Bayesian analysis, 134 B-cell, 268, 271, 275 B-cell lymphoma, 268, 271, 275 beads-on-a-string, 150 behavior, 13, 106, 194, 195, 196, 198, 199, 200, 201, 209, 216, 284, 285, 304, 346, 392 benchmark, 12, 136, 143, 144, 165, 166, 253, 334, 336 benchmarking, 172 bending, 155, 157, 160, 196 benefits, 4, 20, 48, 55, 396, 401 benign, 6, 415 bias, 68, 178, 240
Index biliary tract, 62, 81 binary decision, 102 binding, 14, 15, 16, 17, 43, 90, 91, 92, 93, 94, 95, 97, 149, 159, 160, 161, 194, 199, 212, 213, 223, 283, 284, 298, 311, 352, 362, 364, 365, 369, 370, 380, 381, 387, 388 binomial distribution, 76 biochemistry, 12 biogenesis, 93, 95 bioinformatics, 9, 14, 15, 16, 3, 4, 11, 12, 13, 14, 15, 16, 43, 45, 90, 92, 93, 130, 135, 188, 254, 259, 274, 371, 372, 375, 390, 407, 408, 410 biological processes, 12, 13, 43, 60, 147, 216, 220, 222, 223, 224, 225, 262, 266, 267, 269 biological rhythms, 93 biological systems, 346, 347, 351, 356 biomarker, 226 biomarkers, 13, 1, 4, 216, 225, 226, 260, 273 biometric, 377, 388 biomolecular, 210, 212 biomolecular systems, 212 biomolecules, 213 biopolymers, 158, 160, 161, 191, 309, 337 biosciences, 1, 391 biosemiotic, 347 birds, 233, 257 birth, 231 bladder, 10, 12, 13, 47, 54, 215, 216, 217, 219, 220, 224, 225, 226, 227, 228, 229, 260, 270 bladder cancer, 12, 13, 54, 215, 216, 217, 219, 220, 224, 225, 226, 227, 228, 229 bladder carcinogenesis, 226 bleeding, 413, 414 blocks, 14, 77, 136, 144, 145, 202 blood, 232, 376, 382, 409, 411, 413, 416 Bohr, 190 bonds, 196, 283, 342, 343 bone scan, 216 Boolean algebras, 344 bootstrap, 134, 137, 141, 142, 173, 219, 333 Boston, 358, 372, 405 bottlenecks, 166 bottom-up, 14, 259, 264, 265, 266, 269 bounds, 101, 102, 103, 106, 112, 126 brain, 62 branching, 11, 100, 106, 109, 110, 111, 112, 113, 119, 120, 121, 126 breakdown, 311 breast cancer, 7, 62, 81, 86, 260, 261, 265, 266, 267, 268, 269, 270, 272, 273, 274, 275, 276 breast carcinoma, 275 British Columbia, 10, 47, 50, 52 brothers, 231
421
browser, 404, 405, 408, 415 browsing, 402 budding, 228
C C++, 390, 391, 399, 402, 403, 404, 405, 406 cadherin, 228 calculus, 344 calf, 358 calibration, 82 Canada, 47, 48, 49, 50, 55, 218, 341 cancer, 9, 10, 13, 1, 2, 3, 4, 5, 6, 7, 8, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 62, 79, 80, 84, 96, 97, 215, 216, 217, 219, 220, 224, 225, 226, 227, 228, 229, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276 cancer cells, 1, 7, 79, 97, 229 cancer progression, 224, 273 cancerous cells, 93 candidates, 13, 15, 1, 6, 131, 216, 226, 364, 375, 377 carbon, 194, 195, 196, 197, 198, 201 carbon atoms, 194, 195, 196, 197, 198, 201 carboxyl, 14 carcinogen, 60, 77 carcinogenesis, 90, 94, 226, 268 carcinogenic, 264 carcinoma, 62, 91, 216, 224, 225, 226, 227, 228, 265, 266, 270, 271, 274 carcinomas, 85, 216, 226, 229, 265, 270, 275 case study, 96 casting, 262 catabolic, 17 catabolism, 17 catalysis, 14, 209 catalytic activity, 223 categorization, 353 category a, 347 Catholic, 405 cation, 27 Caucasians, 25 causality, 348, 349 causation, 15, 341, 342, 346, 350, 351, 355, 356 cave, 146 CD8+, 386, 388 cDNA, 1, 2, 3, 4, 6, 59, 62, 74, 80, 81, 82, 83, 91, 217, 218, 220, 227, 229 CDR, 15, 16CE, 229 cell, 11, 15, 1, 2, 4, 6, 48, 49, 79, 80, 89, 90, 92, 93, 94, 95, 96, 97, 148, 163, 222, 223, 224, 225, 226, 228, 229, 265, 266, 267, 268, 269, 270, 271, 272, 275, 319, 352, 359, 360, 365, 372, 375, 376, 377, 378, 381, 386, 387, 388, 409, 411, 413, 416 cell culture, 92
422
Index
cell cycle, 90, 92, 93, 94, 95, 97, 226, 228, 229, 267, 365, 372 cell differentiation, 224, 225, 226 cell fate, 11, 89 cell growth, 96 cell line, 6, 265, 266 cell organelles, 223 cell organization, 352 cell surface, 319 cellular automaton, 355 cellular regulation, 157 cellulose, 413 central nervous system, 270 centromeric, 158 cerebrospinal fluid, 376 cervix, 49, 51, 227 CGT, 411, 417 chemical energy, 194 chemical properties, 14, 315, 321 chemotherapeutic agent, 227 chemotherapy, 216, 220, 226, 260, 268, 269, 271, 276 chest, 216 children, 416 chiral, 157, 158 chirality, 149, 160 chloroform, 217 chromatin, vi, 12, 91, 92, 96, 97, 147, 148, 149, 151, 152, 153, 155, 156, 157, 158, 159, 160, 161, 162, 163, 362, 364, 371, 373 chromatography, 413 chromosome, 22, 23, 24, 93, 152, 155, 157, 159, 160, 161, 226 chromosomes, 12, 3, 147, 148, 150, 151, 155, 156, 158, 161 chymotrypsin, 311, 387 circadian, 93, 96 circadian clock, 93 circadian rhythm, 93, 96 circadian timing, 96 cis, 278, 362, 364, 365, 368, 369, 372 classes, 11, 99, 116, 117, 119, 167, 171, 172, 175, 178, 179, 181, 182, 187, 191, 218, 220, 270, 286, 316, 318, 319, 325, 326, 328, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 382 classical, 11, 63, 100, 129, 130, 241, 267, 343, 346, 360, 414 classical logic, 343 classification, 14, 4, 17, 18, 41, 42, 47, 63, 79, 80, 82, 84, 178, 224, 238, 266, 267, 268, 270, 271, 273, 274, 312, 315, 317, 318, 319, 320, 321, 329, 330, 332, 333, 334, 335, 336, 337, 368, 370, 382 clinical oncology, 273
clinical trials, 10, 59, 260 clinics, 48 clone, 2, 377, 378, 382, 383, 386, 388 cloning, 407 closure, 15, 341, 347, 348, 350, 355, 356, 359, 404 cluster analysis, 21 cluster model, 202, 208, 213 clustering, 12, 21, 80, 134, 217, 219, 224, 225, 255, 317, 321, 330, 332, 338, 368 clusters, 6, 137, 167, 202, 217, 218, 219, 225, 239, 330, 365, 416 CMC, 213 c-Myc, 90, 94, 96, 265 CNS, 386, 387 coagulation, 413 codes, 15, 16, 43, 48, 49, 317, 341, 350, 389, 403 coding, 4, 43, 90, 92, 373, 385 codominant, 234 codon, 410 coefficient of variation, 71 coenzyme, 311 coherence, 345, 346, 358 coil, 161, 175, 287, 288, 289, 291, 292, 309, 320, 321, 325 colon, 60, 61, 66, 68, 70, 80, 97, 260, 261, 262, 270 colon cancer, 61, 66, 68, 70, 97, 260, 261, 262, 270 colorectal cancer, 7, 270 Columbia, 10, 47, 50, 52, 360 Columbia University, 360 communication, 254, 402 communities, 202 community, 44, 257, 272, 402 compaction, 149, 151, 155, 157, 158 compiler, 390, 392, 396 complement, 408 complementarity, 15, 43, 352 complementary DNA, 85, 217 complex interactions, 377 complex systems, 353 complexity, 10, 72, 89, 90, 94, 144, 147, 148, 177, 178, 241, 242, 247, 262, 268, 269, 275, 321, 329, 338, 354, 358, 376 compliance, 202 complications, 278 components, 13, 15, 12, 22, 43, 70, 121, 132, 166, 179, 188, 216, 220, 222, 225, 239, 324, 325, 326, 329, 331, 341, 350, 351, 352, 353, 355, 379, 384, 389, 390, 402, 403 composition, 14, 17, 159, 285, 288, 315, 317, 320, 321, 323, 324, 325, 326, 327, 328, 329, 330, 331, 333, 334, 335, 336, 337, 338, 339, 353, 355, 358, 409 compounds, 12, 13, 14, 15, 16, 17
Index computation, 15, 40, 134, 194, 201, 208, 341, 345, 352, 356, 359 computational performance, 404 computed tomography, 216 computer science, 253, 402 computer simulations, 160, 162 computerization, 130 computing, 64, 130, 135, 136, 137, 195, 248, 358, 392, 403, 404 concentrates, 41, 413, 418 concentration, 158, 238, 278, 279, 376, 378, 379, 380 concordance, 266 concrete, 344, 351, 354 condensation, 93, 158, 161, 196, 200, 279, 311 confidence, 48, 50, 126, 232, 253, 256, 415 confidence interval, 48, 50 confidence intervals, 48 configuration, 16, 135, 173, 238, 240, 351, 389, 390, 400 congruence, 141, 142 consciousness, 354, 355, 358 consensus, 13, 138, 190, 232, 246, 247, 248, 250, 252, 253, 276, 325, 363, 369 consent, 20 conservation, 14, 133, 312, 313, 364, 369, 370 constraints, 11, 15, 29, 99, 100, 101, 102, 103, 104, 105, 106, 108, 110, 112, 152, 155, 157, 158, 166, 167, 169, 170, 172, 173, 175, 184, 186, 211, 233, 235, 237, 238, 240, 241, 245, 246, 251, 252, 282, 341, 346, 349 construction, 12, 106, 134, 137, 138, 145, 265, 268, 328, 333, 343, 348, 349, 353, 354, 390, 391, 399, 403 consulting, 368 consumption, 11, 100, 101, 102, 123, 351 control, 12, 2, 26, 62, 63, 67, 68, 69, 70, 72, 73, 74, 75, 77, 79, 80, 82, 85, 86, 90, 92, 95, 96, 128, 147, 160, 218, 263, 346, 355, 359, 360, 361, 362, 364, 373, 392, 394, 403, 409, 416 convergence, 11, 99, 100, 106, 126 conversion, 226 convex, 37, 116 COP, 331 copolymers, 312 copper, 152 corn, 60 correlation, 10, 14, 22, 47, 51, 54, 55, 61, 70, 71, 76, 78, 79, 196, 197, 204, 205, 207, 262, 263, 277, 282, 283, 284, 285, 286, 294, 296, 300, 301, 302, 304, 305, 306, 307, 320, 323, 326, 327, 329, 331, 334, 341, 366, 370, 381, 410, 416
423
correlation coefficient, 14, 22, 277, 286, 296, 300, 301, 302, 304, 305, 306, 307, 381 correlation function, 320, 326, 327, 331 correlations, 50, 150, 157, 162, 284, 285, 305, 311, 321, 331 costs, 11, 100, 101, 103, 109, 110, 137, 382 coupling, 97, 317, 321, 324, 325, 326 covalent, 17, 195, 196, 197, 198, 352 covalent bond, 195, 196, 197, 198 covering, 55, 184, 254 CpG islands, 368, 370 CPU, 119, 135, 178 CRC, 87, 211 creativity, 354 CREB, 365 critical points, 166 critical value, 30, 31, 73, 77 criticism, 138 cross-linking, 151 cross-validation, 264, 320, 335, 336 crystal structure, 13, 14, 16, 151, 398 crystal structures, 16 crystalline, 161 crystallization, 188 crystals, 157 CSF, 376 C-terminal, 148, 283, 291, 296, 298, 299, 301, 302, 410 C-terminus, 284, 302, 303 Cuba, 44 culture, 92, 355 cumulative distribution function, 74 cyanobacteria, 93, 96 Cybernetics, 359 cycles, 135, 177 cycling, 363 cystectomy, 216, 220 cysteine, 376, 418 cystourethroscopy, 216 cytology, 216 cytomegalovirus, 92 cytosine, 3
D data analysis, 13, 15, 82, 216, 217, 218, 220, 263, 375, 377, 403 data collection, 40 data distribution, 225 data generation, 70 data mining, 12, 40, 218, 332, 409, 414 data processing, 391, 403
424
Index
data set, 70, 71, 72, 79, 145, 175, 218, 225, 252, 260, 262, 263, 266, 267, 268, 275, 328, 329, 331, 381, 382, 383 data structure, 396, 404 database, 10, 11, 15, 16, 1, 6, 7, 8, 12, 21, 39, 40, 41, 42, 43, 44, 49, 89, 91, 93, 94, 132, 140, 143, 144, 146, 175, 180, 181, 190, 191, 217, 225, 305, 312, 326, 327, 330, 331, 335, 337, 372, 373, 375, 377, 383, 384, 385, 389, 398, 399, 400, 408, 409, 411, 412, 414, 416 de novo, 166, 202 death, 94, 96 decay, 175, 178, 342, 345 decision making, 272 decision trees, 335 decisions, 94, 106, 116 decomposition, 101, 126, 128, 321, 338 deconvolution, 15, 375 defects, 23, 154, 159, 408 deficiency, 413, 414, 415, 418 definition, 9, 19, 20, 22, 31, 40, 69, 95, 139, 171, 177, 178, 181, 205, 252, 289, 316, 318, 321, 342, 347, 348, 349, 350, 354, 369, 399, 403 deformation, 12, 149, 157, 193, 194, 198, 213 degenerate, 377, 386, 388 degradation, 17 degrees of freedom, 68, 73, 74, 168, 194, 199, 200, 202, 204, 205, 208, 285 denaturation, 195, 308 dendrites, 23 density, 26, 59, 83, 151, 177, 179, 180, 182, 398 deoxyribonucleic acid, 358 deoxyribose, 162 deregulation, 265 dermatologist, 48 detection, 11, 6, 7, 15, 42, 79, 80, 81, 129, 188, 218, 345, 363, 364, 365, 366, 369, 371 determinism, 346, 354, 355 developing countries, 416 developmental process, 364 deviation, 71, 175 diabetic nephropathy, 85 diagnostic markers, 227 diet, 61, 66 dietary, 60, 61 dietary fat, 61 diets, 68 differential diagnosis, 216, 225 differentiation, 21, 90, 93, 95, 96, 216, 224, 225, 226, 227, 370, 386 diffusion, 62, 80, 85 diffusion tensor imaging (DTI), 62 digestion, 148, 150, 151, 159
dimensionality, 10, 59, 60, 263, 333, 382 dimer, 160 dimeric, 342, 345 dimerization, 16 dipeptides, 327, 328 diploid, 7, 232, 234, 235, 240, 241, 248 direct measure, 366 directionality, 198 discipline, 60 Discovery, 8, 11, 16, 81, 373 discrete variable, 246 discretization, 268 discriminant analysis, 323, 327 discrimination, 15, 21, 328, 417 discriminatory, 82 disease gene, 9, 19, 22, 407 diseases, 12 disequilibrium, 9, 19, 20, 22, 27, 371 disorder, 408, 409, 410, 411, 412, 413, 414, 416, 417, 418 displacement, 194, 195, 196, 204, 209 dissociation, 15 distribution, 15, 22, 23, 61, 64, 65, 70, 71, 73, 74, 75, 76, 77, 78, 82, 116, 121, 134, 138, 142, 172, 181, 185, 186, 209, 218, 220, 225, 233, 249, 255, 279, 280, 284, 289, 307, 325, 328, 361, 365, 366, 367, 368, 370, 371, 373, 384, 385, 394, 400, 404 distribution function, 73 divergence, 142 diversification, 24 diversity, 14, 15, 17, 18, 25, 60, 141, 146, 239, 284, 312 division, 331 DNA, 12, 1, 3, 4, 20, 21, 42, 43, 59, 60, 61, 62, 78, 81, 82, 83, 84, 85, 90, 91, 92, 93, 95, 96, 97, 142, 147, 148, 149, 150, 151, 152, 155, 156, 157, 158, 159, 160, 161, 162, 163, 216, 226, 232, 233, 252, 254, 255, 256, 257, 267, 273, 284, 352, 362, 364, 371, 372, 373, 407, 409 DNA damage, 93, 97 DNA polymerase, 256 DNA repair, 93, 97 DNA sequencing, 407 doctors, 41, 44 donor, 386 dosage, 15 down-regulation, 90, 229 draft, 173 dropouts, 240 Drosophila, 15, 158, 361 drug action, 15 drug design, 17, 45 drug discovery, 9, 11, 12, 13, 14, 16, 39
Index drug targets, 13, 14, 15 drugs, 9, 10, 11, 12, 15, 16, 43, 47, 55, 265, 417 dualism, 346 duplication, 139, 140, 145, 146 duration, 360 duties, 346
E E-cadherin, 228 ecological, 55 ecologists, 253 ecology, 232 Eden, 273 elasticity, 157, 210, 211, 212 electric charge, 155 electron, 148, 150, 151, 155, 398, 405 electron density, 398 electron microscopy, 148, 150, 155, 405 electrophoresis, 62, 81, 85, 86, 410, 414 elongation, 410 email, 231, 385 emission, 216 encoding, 20, 93, 169, 173, 179, 180, 350, 356, 382 endocrine, 272 endothelium, 2 energy, 149, 166, 167, 168, 169, 172, 173, 175, 194, 195, 196, 197, 204, 206, 211, 278, 282, 284, 286, 288, 290, 291, 292, 293, 294, 303, 304, 308, 311, 314, 327, 334, 339, 345, 348, 350, 351, 352 energy consumption, 351 energy parameters, 175 England, 272, 273, 274, 275 entropy, 177, 212, 282, 285, 287, 288, 291, 366, 367 environment, 17, 20, 278, 279, 293, 345, 349, 354, 355, 389, 391, 394, 395 environmental factors, 20 enzymatic, 13, 256 enzymatic activity, 13 enzymes, 15, 13, 14, 17, 341, 352 Epi, 93 epidemiology, 273 epigenetic, 148, 226 epigenetic alterations, 226 epigenetic code, 148 epithelial cell, 265, 275 epithelial cells, 265, 275 epithelial ovarian cancer, 95 epithelium, 226 epitopes, 15, 375, 376, 381, 383, 384, 386, 387, 388 equality, 30, 32 equilibrium, 22, 194, 196, 197, 198, 200, 205, 206, 212, 279, 280, 282, 288, 347, 350, 351, 352, 357, 359
425
equilibrium state, 196, 197, 198, 205, 206, 347, 350, 352, 357 erythrocyte, 409, 416 erythroid, 408, 415 Escherichia coli, 17, 209, 312 ESI, 409 ESN, 158 esophageal cancer, 51, 54 esophagus, 51 EST, 9, 1, 2, 3, 4, 5, 6, 7 estimating, 67, 77, 78, 83, 195, 205, 234 estimator, 68, 69, 74 estimators, 237, 238, 239, 340 estradiol, 265 estrogen, 265, 269, 273, 276 ethanol, 218 ethics, 415 ethnic groups, 25 etiology, 48, 376 eukaryote, 93, 146, 160, 371 eukaryotes, 96, 141, 371, 373 eukaryotic cell, 12, 147 evolution, 11, 15, 20, 22, 25, 129, 130, 132, 134, 138, 140, 142, 143, 144, 145, 146, 158, 256, 313, 341, 347, 353, 355, 357 evolutionary process, 354, 356 examinations, 54 exclusion, 219, 238, 369 execution, 392, 403 exercise, 357 experimental condition, 60, 66, 284 experimental design, 60, 62, 80, 83, 84 expertise, 403, 409 exposure, 20, 52, 54, 325, 386 expressed sequence tag, 5, 6, 7 external influences, 345 extracellular matrix, 210, 223 extraction, 15, 42, 361, 362, 364, 365, 368, 369, 370, 416 extrapolation, 279
F factorial, 60, 62 failure, 173, 348, 350, 351, 357, 359 false negative, 78, 187, 268 false positive, 67, 75, 174, 364 familial, 54, 252 family, 10, 14, 55, 59, 63, 67, 70, 71, 75, 76, 77, 97, 137, 166, 176, 190, 237, 239, 248, 249, 250, 252, 255, 267, 283, 311, 315, 317, 318, 364 family members, 55, 97 family structure, 252 fats, 61
426
Index
feature selection, 268, 333, 334, 340 feature subset selection, 340 feces, 232 feedback, 93, 96, 97 feeding, 352 females, 13, 51, 52, 53, 215, 216, 248, 249, 250 fertility, 233 fetal, 226, 266 FGF-2, 228 fiber, 12, 62, 147, 148, 149, 150, 151, 152, 155, 156, 157, 158, 159, 160, 161, 162 fibers, 151, 152, 155, 156, 157, 158, 160, 162, 163 fibroblast, 266, 273 fibroblasts, 93, 265, 266 fibronectin, 285, 301, 302, 305, 309 field theory, 62 filament, 159 filters, 42 fingerprinting, 232 fish, 60 fish oil, 60 fixation, 25, 355 flank, 233 flexibility, 16, 42, 155, 160, 161, 162, 205, 211, 268, 343, 386, 389, 404 floating, 394, 396, 397 flow, 12, 254, 354, 392 fluctuations, 157, 198, 199, 201, 212, 313, 366 fluid, 376 fluorescence, 62, 85, 410 focusing, 133, 267, 390 folding, 14, 148, 151, 153, 155, 156, 157, 158, 161, 162, 163, 175, 188, 189, 194, 211, 212, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 293, 294, 296, 298, 299, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 325, 326, 331, 337, 338, 353 folding intermediates, 280, 312 forests, 266, 274, 340 Fort Worth, 99 fossil, 142 France, 1, 127, 147 free energy, 149, 278, 279, 288, 289, 290, 291, 292, 293, 294, 303, 304, 306, 307, 351 freedom, 68, 73, 74, 155, 159, 168, 173, 194, 199, 200, 202, 204, 205, 208, 285 frequency distribution, 249, 353, 371 Freud, 354, 358 function values, 116, 117, 119 functional analysis, 8, 410, 414, 415 functional changes, 415 fungal, 141, 142, 146 fungi, 141, 233
Fur, 167 fusion, 3, 361
G garbage, 392, 393, 394, 395, 396 gas, 288 gastric, 2, 7, 265, 272 gastroenterologist, 48 gauge, 116 Gaussian, 198, 211, 322, 367 gel, 62, 81, 85, 86, 92, 410, 414 gels, 21 GenBank, 385 gender, 55 gene expression, 9, 10, 11, 12, 13, 1, 2, 6, 7, 59, 60, 64, 65, 66, 70, 71, 72, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 89, 90, 95, 97, 147, 215, 216, 217, 218, 220, 224, 225, 226, 227, 228, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 371, 373, 407, 408 gene promoter, 11, 89, 92, 229 gene silencing, 90 gene therapy, 411 gene transfer, 138 genealogy, 247 generalization, 40, 320, 335, 380, 381, 382 generation, 11, 6, 12, 15, 40, 70, 99, 100, 101, 106, 107, 108, 109, 119, 123, 126, 127, 128, 144, 175, 181, 191, 234, 248, 249, 252, 254, 345, 349, 355, 357, 398, 403 genetic alteration, 216, 226 genetic code, 148, 350, 352, 356, 358, 360 genetic defect, 409, 414 genetic disorders, 56 genetic factors, 48, 409 genetic information, 51, 147, 252, 352 genetic marker, 13, 231, 232, 233, 255, 256, 275 genetics, 25, 26, 143, 227, 233, 235, 247, 249, 252, 257, 267, 270, 274, 416, 417 Geneva, 410, 417 genitourinary tract, 216 genome, 9, 11, 15, 16, 1, 3, 4, 5, 6, 7, 19, 20, 22, 27, 42, 60, 89, 90, 93, 94, 97, 129, 130, 131, 137, 138, 140, 141, 142, 143, 145, 146, 148, 150, 162, 166, 224, 227, 228, 232, 234, 252, 260, 267, 274, 317, 341, 346, 361, 362, 364, 365, 366, 369, 371, 372, 373, 407, 408, 409, 415 genome sequences, 9, 1, 130, 141, 364 genome sequencing, 10, 89 genomes, 11, 44, 69, 129, 130, 138, 140, 141, 142, 143, 150, 233, 364, 366, 369, 371
Index genomic, 9, 12, 19, 20, 21, 42, 92, 130, 145, 147, 157, 162, 167, 178, 220, 225, 227, 233, 262, 268, 275, 351, 368, 373, 377, 409, 416 genomic regions, 416 genomics, 10, 14, 16, 4, 59, 62, 63, 65, 67, 70, 74, 79, 144, 146, 165, 188, 259, 264, 265, 266, 268, 269, 274, 352, 364, 365, 372, 373, 407, 408, 409, 416, 417 genotype, 21, 24, 252, 346 genotypes, 22, 232, 233, 241, 248 GGT, 411, 417 Gibbs, 363, 369, 372 GlaxoSmithKline, 43 gliomas, 260, 270 glycine, 295, 401 glycoproteins, 62 glycosylation, 414 God, 357 grading, 226, 228 grants, 5, 79, 253 graph, 171, 177, 188, 208, 220, 239, 242, 254 gravity, 401 grid environment, 403 GroEL, 194, 199, 210, 212 group variance, 65 grouping, 141 groups, 13, 1, 2, 4, 13, 17, 22, 23, 24, 25, 40, 41, 52, 60, 61, 66, 70, 71, 72, 73, 78, 130, 134, 137, 138, 139, 140, 141, 150, 155, 216, 217, 218, 220, 225, 233, 234, 235, 237, 238, 239, 240, 241, 242, 243, 244, 245, 247, 248, 252, 254, 256, 264, 265, 268, 294, 321, 325, 329, 334, 368, 370, 372 growth, 10, 2, 3, 47, 48, 95, 96, 97, 223, 229, 255, 261, 269, 342, 360 growth inhibition, 95 guanine, 3 guidance, 332 guidelines, 272
H H1, 15, 148, 153, 155, 159, 161, 162, 163 H2, 15 haemoglobin, 416, 417 Haifa, 29 Hamiltonian, 199 handling, 218, 390, 400 haplotype, 9, 19, 22, 23, 24, 26, 27 haplotypes, 9, 19, 23, 24 HapMap, 25 health, 20, 41, 409, 416 healthcare, 48 heart, 43, 151 heat, 204
427
height, 366, 367 Heisenberg, 348 Helicobacter pylori, 13 helix, 151, 161, 173, 283, 284, 296, 298, 299, 302, 309, 311, 318, 319, 320, 321, 325, 336, 338, 410, 412, 413 hemagglutinin, 386, 387 hematological, 408, 409 hematology, 16, 407, 408, 411, 415, 416 hematopoietic, 227 heme, 320, 408 hemoglobin, 16, 203, 204, 206, 207, 320, 407, 408, 409, 410, 411, 412, 413, 415, 416, 417 Hemoglobin, vii, 213, 407, 408, 416, 417 hemoglobin (Hb), 413 hemoglobinopathies, 409, 410, 411, 417 hemoglobinopathy, 408, 410, 411, 412, 413, 417 hemostasis, 417 hepatitis, 82 hepatitis C, 82 hepatocellular, 265, 266, 271, 274 hepatocellular carcinoma, 265, 266, 271, 274 hepatocytes, 266 Hessian matrix, 194, 197, 204, 205 heterochromatin, 152, 157, 159 heterogeneity, 216, 268, 269, 282 heterogeneous, 268, 269, 333, 368 heteropolymers, 310 heterozygosity, 233, 234 heuristic, 101, 128, 133, 240 high resolution, 282 high-frequency, 205 high-level, 390 high-risk, 260 Hilbert, 37 Hilbert space, 37 histogram, 61 histological, 216, 270 histology, 48, 49 histone, 92, 93, 95, 96, 148, 149, 150, 153, 155, 157, 158, 159, 160, 161, 162, 163 HIV, 211, 386, 388 HIV-1, 211, 386, 388 HLA, 387, 388 Holland, 127, 358 homogenized, 217 homolog, 96 homologous chromosomes, 3 homologous proteins, 331 homology, 12, 13, 15, 17, 132, 133, 150, 165, 167, 170, 175, 176, 185, 186, 188, 337, 339, 354, 363 homozygote, 236 Hong Kong, 310
428
Index
horizontal gene transfer, 138 hormonal therapy, 260, 269 hormone, 52, 273, 351, 364 hormones, 364 host, 146, 386, 390, 394 hot spots, 198 House, 39, 407 HPC, 145 human, 9, 10, 15, 1, 2, 3, 4, 5, 6, 7, 8, 14, 17, 19, 20, 21, 24, 25, 26, 27, 42, 45, 49, 54, 55, 56, 59, 85, 89, 90, 91, 92, 93, 94, 95, 96, 97, 140, 146, 217, 225, 227, 228, 229, 257, 265, 266, 272, 273, 274, 275, 283, 314, 355, 361, 362, 365, 370, 371, 372, 373, 385, 386, 388, 408, 409, 410, 411, 412, 414, 416 human ES, 2, 3, 5 human genome, 9, 10, 3, 4, 5, 7, 19, 27, 89, 90, 93, 94, 140, 146, 371 humans, 20, 25, 27, 62, 79, 233, 355 Hungarian, 417 hybrid, 240, 268, 335, 339 hybridization, 66, 91, 92, 96, 218, 228, 364 hydrodynamics, 200 hydrogen, 14, 146, 191, 280, 295, 310 hydrogen atoms, 295 hydrogen bonds, 14, 280, 310 hydrolysis, 17, 210 hydrophobic, 280, 308, 311, 312, 321, 325, 330, 333, 334 hydrophobic interactions, 280 hydrophobicity, 320, 321, 325, 331, 333, 334, 337 hypercycle, 347 hypermethylation, 229 hypothesis, 13, 54, 55, 59, 61, 63, 64, 65, 66, 67, 70, 73, 74, 77, 78, 81, 135, 143, 146, 152, 285, 326, 384 hypothesis test, 59, 61, 63, 65, 66, 70, 73, 81 hypoxia, 95
I id, 179, 290, 357 identification, 14, 6, 7, 17, 25, 26, 43, 48, 61, 91, 132, 216, 226, 228, 265, 267, 278, 315, 362, 372, 373, 386, 387, 388, 407, 413, 415 identity, 14, 175, 176, 179, 180, 181, 183, 184, 185, 186, 187, 188, 200, 248, 281, 283, 299, 315, 333, 335, 336 IL-1, 291, 292 IL-2, 291, 292 IL-8, 226 Illinois, 231 image analysis, 405 images, 80, 150, 155
imaging, 62, 67, 216 imaging modalities, 62 immune response, 269, 276, 386 immunoglobulin, 17, 210, 284, 309 immunological, 387 immunology, 408 immunoprecipitation, 91, 92, 96, 97, 364 immunoreactivity, 228 implementation, 40, 101, 108, 130, 133, 134, 136, 137, 139, 140, 253, 357, 390, 391, 392, 394, 400, 403 in situ, 162, 171, 218 in transition, 226 in vitro, 10, 15, 39, 40, 62, 96, 151, 156, 380, 381, 387 in vivo, 10, 15, 17, 39, 40, 148, 149, 151, 155, 156, 157, 159, 228, 265, 266, 364, 371, 409 inactive, 161 inbreeding, 252 incidence, 50, 52, 53, 55, 216 inclusion, 136 independence, 22, 66, 70, 72, 73, 74, 79, 381 independent variable, 351 indexing, 35, 378, 392 India, 339, 410, 411, 417 Indian, 417 indication, 151 indices, 333, 382, 396 indigenous, 20 individual development, 355 Indochina, 410 induction, 340 industrial, 12, 16, 390 industrial application, 390 inefficiency, 16, 195, 251 inequality, 30, 32, 67, 116, 117 infancy, 418 infection, 91 infectious, 376, 386 infinite, 346, 348, 353, 356 influenza, 386, 387 information exchange, 94 information technology, 12, 39, 41, 44 informed consent, 20 infrastructure, 12, 346, 390, 394 inheritance, 232, 233, 235, 237, 238 inherited, 143, 408, 410, 413, 414 inherited disorder, 408 inhibition, 95, 381 inhibitor, 311 inhibitors, 17, 387 initiation, 66, 80, 361, 362, 365, 370, 371, 387 injection, 60, 61
Index innovation, 11 inorganic, 346 insertion, 3, 287, 289 insight, 12, 17, 48, 108, 142, 193, 195, 198, 208, 209, 267, 326, 411, 415 inspection, 319, 372 Inspection, 137 instability, 260, 261, 262, 263, 279, 291, 410, 413 insulin, 320 integration, 4, 16, 42, 74, 264, 268, 273 integrity, 395, 414 Intel, 248 interaction, 9, 12, 19, 20, 26, 42, 43, 93, 148, 149, 160, 195, 196, 197, 223, 249, 266, 284, 326, 342, 343, 362, 376, 387, 398 interactions, 9, 14, 19, 20, 22, 43, 96, 150, 155, 157, 159, 160, 161, 168, 169, 171, 172, 196, 197, 199, 200, 213, 266, 278, 280, 282, 284, 287, 288, 291, 309, 310, 313, 371, 377, 381, 388 interdisciplinary, 41 interface, 16, 12, 139, 141, 162, 253, 384, 385, 389, 390, 394, 395, 398, 402, 404 interferon, 62, 82 interleukin, 226, 229, 309 interleukin-1, 309 internal clock, 351 internal constraints, 346 internal time, 348, 350, 351, 357 internalization, 354, 355 internalizing, 355 International Classification of Diseases, 49, 55 internet, 404 Internet, 26, 253 interphase, 150, 151 interrelations, 354 interval, 50, 67, 68, 171, 181, 185, 293, 294, 295, 299 intervention, 166 intracranial, 418 intravenous, 216 intravesical chemotherapy, 216 intrinsic, 156, 212, 285, 311, 352 intron, 362, 370 intuition, 199 invasive, 216, 226, 227, 228 invertebrates, 233 ionization, 414 ionizing radiation, 86, 151 Ireland, 165, 188 irradiation, 20, 77 ISCO, 163 island, 97, 370 isolation, 6, 217, 413
429
isomerization, 278 isomers, 148 isotropic, 198 Israel, 29 Italy, 315, 340, 416 iteration, 15, 102, 108, 239, 240, 341, 345
J jackknife, 319, 320, 321, 326, 327, 328, 329, 330, 331, 332, 333, 335, 336 Japan, 11, 20, 89, 94, 361, 389, 405 Japanese, 26 Java, 15, 239, 375, 383, 385, 404 JAVA, 42, 116 jobs, 11, 99, 100, 101, 102, 103, 104, 109, 116, 119, 126, 403 joining, 134, 144, 155 Jun, 227, 228 Jung, 73, 74, 75, 77, 82 justification, 25, 246
K kelp, 257 kernel, 322, 382 kidney, 10, 47, 54, 86, 260, 271 kin selection, 13, 231, 233 kinase, 213, 281, 284, 311, 352, 359 kinetics, 213, 278, 282, 284, 285, 286, 289, 290, 308, 310, 311, 312, 313, 314 kinship analysis, 13, 231 knockout, 265, 266 Korea, 193, 209, 259, 269
L L1, 15, 285, 354 L2, 15, 285, 354 labeling, 218 labor, 355 lactoglobulin, 150, 159 Lagrangian, 100, 101 landscapes, 211, 286, 308, 314 language, 16, 343, 354, 355, 357, 389, 390, 391, 392, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405 large-scale, 11, 129, 130, 131, 133, 135, 136, 140, 142, 144, 260, 365, 391 larvae, 255 laser, 218 lattice, 308, 325 law, 151, 345, 353, 354, 355, 358, 360 laws, 13, 166, 232, 285, 354
430
Index
learning, 41, 169, 172, 177, 178, 188, 218, 219, 225, 321, 330, 331, 335, 340, 382 left-handed, 148, 151, 162 leukemia, 62, 84, 260, 271 licenses, 250 life cycle, 223 life sciences, 39 life span, 351 ligand, 15, 17, 43, 213 ligands, 14, 15, 43, 45, 377, 387, 388 likelihood, 134, 137, 140, 144, 145, 232, 237, 238, 239, 240, 252, 253, 255, 256, 332 limitation, 138, 156, 345, 346, 362, 364, 369 limitations, 4, 135, 140, 159, 171, 210, 234 linear, 11, 61, 70, 74, 99, 101, 116, 126, 127, 128, 151, 155, 158, 159, 166, 175, 195, 199, 200, 206, 218, 220, 244, 308, 320, 321, 331, 335, 336, 340, 344, 352, 380, 382 linear function, 166 linear model, 74 linear programming, 11, 99, 101, 127 linear regression, 70, 320, 321 linkage, 20, 26, 27, 267 links, 95, 148, 287, 288, 291, 295, 306 Linux, 116, 136, 249, 383 liquid crystals, 157 liquid nitrogen, 217 liver, 48, 260 liver cancer, 48 loading, 194, 209 localization, 366, 367 location, 92, 94, 97, 100, 102, 110, 127, 150, 162, 281 locus, 21, 22, 25, 26, 232, 233, 234, 235, 236, 239, 241, 243, 244, 245, 246, 247, 248, 249, 250, 409, 416 lognormal, 71, 82 London, 146, 310, 359 long distance, 369 long period, 296 losses, 139 luciferase, 92 Luciferase, 92 lumen, 223 luminescence, 92 lung, 10, 47, 54, 91, 260, 265, 266, 271, 275 lung cancer, 54, 266 lymph, 220, 270, 382 lymph node, 220, 382 lymphocyte, 388 lymphocytes, 387 lymphoma, 55, 80, 260, 268, 271, 275 lymphomas, 50
lysine, 93, 96, 150, 413
M M.O., 211 machine learning, 15, 166, 176, 186, 188, 189, 331, 332, 340, 375, 377 machinery, 356, 362 machines, 167, 176, 178, 189, 211, 213, 335, 339, 388, 391 macromolecules, 158, 213, 391, 399 Madison, 92 magnetic, 62, 152, 160, 200, 216 magnetic resonance, 62, 216 magnetic resonance imaging (MRI), 62, 79, 80, 216 Maine, 44 maintenance, 12, 346, 351 major histocompatibility complex, 376, 387 males, 13, 51, 52, 53, 215, 216, 248, 249, 250 malignant, 5, 226, 270 mammal, 370 Mammalian, 14 mammalian cell, 90, 93, 371 mammalian cells, 90, 93, 371 mammals, 370, 371, 373 management, 177, 227, 385, 390, 391, 393, 394, 402, 403 manipulation, 10, 39, 43, 152, 155, 157, 344 mapping, 9, 19, 22, 26, 43, 96, 143, 176, 218, 220, 242, 243, 347, 349, 353, 355, 368, 373, 416 Markov, 134, 145, 195, 196, 212, 237, 238 Markov chain, 145 masking, 42 mass spectrometry, 62, 87, 409, 414, 416 Massachusetts, 225 Massachusetts Institute of Technology, 225 maternal, 226, 236 mathematical logic, 343 mathematics, 200, 342, 357 matrix, 15, 22, 62, 73, 74, 78, 91, 134, 138, 144, 169, 170, 171, 172, 174, 176, 177, 179, 194, 196, 197, 198, 199, 200, 204, 205, 207, 210, 223, 229, 244, 294, 295, 299, 324, 363, 375, 380, 383, 384, 394, 398, 401 matrix library, 91 matrix protein, 210 maturation, 15, 17, 86, 212 Maximum Likelihood, 133, 134, 136 Maya, 231 Mb, 369 measurement, 10, 59, 62, 71, 72, 85, 152, 267, 329, 341, 344, 345, 346, 366, 367, 381 measures, 22, 25, 26, 48, 74, 149, 173, 175, 225, 267, 268, 323, 382
Index mechanical energy, 194 mechanical properties, 155, 161, 212, 358 media, 213 medicine, 10, 14, 16, 39, 40, 41, 42, 43, 259, 270, 271, 273, 274, 275, 407, 408, 409, 413, 417, 418 melanin, 20, 23 melanoma, 10, 47, 48, 52, 54, 382 melanosomes, 23 melting, 285 melting temperature, 285 membership, 178, 242, 330 membranes, 351 memory, 116, 130, 135, 136, 240, 248, 249, 352, 391, 393, 394, 395, 396, 404 men, 54, 216 Mendeleev, 352 mental development, 355 mercury, 21 mesoscopic, 157 messenger RNA, 6 meta-analysis, 267, 268, 274, 275 metabolic, 13, 143, 346, 349, 350, 352, 354, 358, 359 metabolism, 12, 93, 95, 143, 145, 146, 147, 156, 157, 346, 347, 348, 349, 351, 352, 359 metabolomics, 62, 260 metaphase, 155 metastases, 260 metastasis, 217, 220, 226, 265, 266, 270, 274 metastatic, 6, 216, 267, 270, 271, 272 metastatic disease, 272 methylation, 93, 96, 216, 226 metric, 9, 29, 30, 32, 37 metric spaces, 9, 29, 37 Mg2+, 359 MHC, 376, 377, 380, 387, 388 mice, 266, 409, 416 microarray, 13, 15, 12, 59, 60, 66, 67, 69, 70, 73, 77, 78, 80, 81, 82, 83, 84, 85, 86, 87, 90, 91, 92, 94, 95, 96, 97, 216, 217, 218, 220, 225, 227, 228, 229, 259, 260, 261, 262, 263, 266, 267, 268, 270, 272, 273, 274, 275, 361, 362, 364, 365, 368, 408 microarray technology, 92, 216, 225, 267, 362 Microarrays, 81, 83, 87, 272, 416 microorganisms, 13 microRNAs, 90, 96, 98 microRNAs (miRNAs), 90 microsatellites, 232, 233, 234, 252, 254 microscope, 150 microscopy, 148, 150, 155, 195, 210, 405 migration, 224, 226 mimicry, 387
431
mining, 9, 10, 1, 3, 7, 12, 40, 47, 48, 49, 51, 54, 55, 94, 218, 332, 409, 414 Ministry of Education, 94 MIP, 250 miRNAs, 91 mirror, 406 misfolded, 282 misfolding, 287, 289 MIT, 189, 358, 360, 372 mitochondria, 143, 145 mitochondrial, 93, 95, 143, 145, 146 mitosis, 226 mitotic, 93, 96, 151 modalities, 62, 216 model reduction, 195, 200 modeling, 12, 17, 18, 43, 96, 148, 150, 158, 160, 162, 163, 193, 195, 210, 213, 267, 268, 274, 275, 307, 355, 377, 381, 404, 409, 410, 411, 413 modulation, 97 modules, 4, 210, 344, 358, 371, 394, 400 MOE, 403 molecular biology, 232, 282, 352, 405, 409 molecular dynamics, 12, 16, 18, 149, 150, 160, 168, 193, 194, 196, 210, 308, 311, 403 molecular markers, 216, 233, 256 molecular mechanisms, 4, 25 molecular structure, 16, 43, 196, 199, 200, 389, 390, 391, 399, 401, 404, 405, 411, 413 molecular weight, 21, 320, 334 molecules, 15, 43, 62, 130, 150, 208, 213, 305, 306, 307, 352, 398 Monte Carlo, 14, 65, 134, 145, 150, 166, 237, 238, 277, 287, 296, 305, 307, 313 Monte Carlo method, 134, 296 Monte-Carlo, vi, 153, 154, 155, 277, 286, 287, 289, 290, 294, 295, 296, 298, 299, 301, 302, 303, 306, 307 Monte-Carlo simulation, 153, 155, 286, 287, 296, 299, 307 morphological, 130, 133, 353 mortality, 216 Moscow, 277, 310, 357 moths, 130 motion, 194, 197, 198, 199, 200, 203, 204, 205, 210, 351 motivation, 103 motor activity, 223 mouse, 15, 265, 266, 274, 361, 365, 370 mouse model, 265, 266, 274 movement, 149 MPI, 403 mRNA, 2, 5, 229, 373 mucosa, 226
432
Index
multidimensional, 244, 282 multiple sclerosis (MS), 382, 386, 387 multiplicity, 62, 91 multiplier, 37, 127 muscle, 194, 216, 227, 314 mutagenesis, 4, 313 mutant, 43, 132, 278, 294 mutation, 62, 233, 238, 246, 266, 279, 285, 286, 294, 295, 409, 410, 411, 413, 414, 415, 417, 418 mutations, 1, 5, 23, 64, 179, 189, 232, 235, 237, 246, 252, 278, 279, 280, 283, 286, 294, 295, 312, 347, 409, 410, 414, 416 mutuality, 231 myelin, 386 myeloid, 62, 271 myosin, 21
N NA, 12, 20, 93, 147, 150, 157, 417 nanomachines, 212 nanometer, 162 nanostructures, 417 National Academy of Sciences, 83, 86, 271, 272, 273, 274, 275 National Institute of Neurological Disorders and Stroke, 386 National Institutes of Health, 40, 375 National Science Foundation, 126 natural, 13, 34, 35, 36, 68, 108, 157, 197, 205, 231, 242, 256, 265, 280, 347, 353, 376, 377, 378, 380 natural selection, 280, 347 neck, 260, 270 neglect, 282, 287, 291 nematodes, 141 neoplastic, 56, 376 neoplastic diseases, 376 nervous system, 386 network, 12, 14, 15, 12, 90, 177, 178, 189, 191, 193, 195, 198, 199, 201, 203, 204, 207, 208, 209, 211, 212, 213, 218, 225, 266, 278, 287, 291, 306, 314, 315, 317, 321, 325, 330, 332, 333, 335, 337, 338, 341, 349, 352, 356, 359, 360 networking, 100 neural network, 12, 14, 165, 167, 176, 177, 178, 184, 189, 190, 191, 218, 225, 315, 317, 320, 321, 325, 330, 331, 332, 333, 335, 337, 338 neural networks, 12, 165, 167, 176, 177, 178, 184, 189, 190, 191, 320, 331, 333, 335, 338 neuroblastoma, 275 neuroimaging, 81 New England, 81, 270, 271 New Jersey, 21, 310, 371
New York, 26, 56, 80, 84, 85, 86, 212, 309, 311, 312, 337, 339, 359, 360, 371 New Zealand, 340 next generation, 13 Nielsen, 27 NIH, 40, 79, 386, 390 nitric oxide, 17 nitric oxide synthase, 17 nitrogen, 217 NMR, 14, 62, 80, 84, 170, 179, 181, 189, 200, 277, 289, 305, 310, 311, 314, 403 nodes, 101, 106, 109, 110, 112, 117, 119, 120, 121, 122, 124, 125, 139, 141, 142, 171, 248 noise, 70, 71, 136, 170, 171, 177, 272, 332 non-Hodgkin lymphoma, 55 non-human, 55 non-native, 282, 287, 309 non-native interactions, 282, 287 normal, 9, 12, 3, 5, 6, 7, 8, 19, 20, 25, 27, 62, 64, 65, 70, 75, 76, 77, 79, 80, 83, 93, 97, 193, 194, 195, 197, 198, 202, 203, 204, 205, 206, 207, 208, 210, 211, 212, 213, 217, 331, 385, 408, 410, 411, 412, 413 normal distribution, 64, 65, 70, 76, 83, 331, 385 normalization, 218 norms, 325 North America, 338 NOS3, 224, 225 N-terminal, 162, 283, 296, 298, 299, 301, 302, 303 nuclear, 93, 145, 159, 200, 229, 255 nuclear magnetic resonance, 200 nuclease, 148, 150, 310 nucleation, 279, 311 nuclei, 14, 155, 277, 279, 280, 281, 282, 286, 300, 305, 307, 310 nucleic acid, 43, 144, 162, 210, 352 nucleoprotein, 12, 147, 148 nucleosome, 148, 149, 150, 151, 153, 155, 156, 157, 158, 159, 160, 161, 162, 163, 372, 373 nucleosomes, 148, 149, 150, 151, 155, 156, 157, 158, 159, 161, 163, 362 nucleotides, 133, 352 nucleus, 148, 156, 157, 278, 279, 280, 282, 283, 284, 286, 291, 293, 294, 302, 308, 309, 312, 313 null hypothesis, 63, 64, 65, 73, 74, 77, 78, 384 nutrition, 342
O observations, 48, 55, 63, 151, 279, 382 odds ratio, 264, 273 Oedipus, 354, 355 Oedipus complex, 354, 355 Ohio, 20
Index oil, 60 oligomers, 160 oligonucleotide arrays, 59, 80, 83 olive, 60 olive oil, 60 oncogene, 5, 90, 96, 97, 98, 216, 228, 270 Oncogenesis, 1 oncology, 273 Oncology, 48, 49, 55 online, 10, 41, 47, 146, 239 on-line, 132, 143 online information, 10, 47 operating system, 250, 383, 391, 394 operator, 200, 324, 328, 344, 348, 349 opioid, 387 optical, 152, 158, 311 optical tweezers, 152, 158 optimization, 12, 14, 29, 37, 100, 127, 158, 163, 237, 238, 325, 338, 340 organ, 217, 225 organelle, 146, 223 organelles, 143 organic, 311 organism, 14, 43, 60, 94, 142, 232, 252, 253, 351, 355, 359, 386, 407 orientation, 14, 103, 168, 171, 173 oscillation, 106 oscillator, 201, 204, 205, 206 osteomalacia, 20 outliers, 332 ovarian cancer, 6, 275 ovary, 2, 52, 151 oxidative, 410 oxidative stress, 410 oxide, 17, 228 oxygen, 410 oyster, 155 ozone, 20
P p53, 94 Pacific, 310, 405 packaging, 150 pairing, 359 pancreas, 6 paradox, 159, 309, 357 parallel implementation, 137 parallelism, 140 parallelization, 135 paramagnetic, 310 parameter, 15, 64, 68, 71, 73, 82, 134, 181, 203, 205, 207, 211, 242, 248, 249, 267, 285, 341, 345, 348, 349, 350, 351, 356, 382, 391, 396
433
parameter estimation, 267 parasite, 17 parentage, 13, 231, 232, 233, 252, 254, 255 parents, 232, 234, 235, 236, 240, 243, 248, 249 Paris, 147, 358, 359 Parkinson, 44 particles, 150, 153, 344 partition, 109, 235, 237, 238, 247, 254, 256, 293 paternal, 235 paternity, 232, 233, 252, 254, 256, 257 pathogenesis, 225, 226, 376, 409, 410, 411, 413, 414, 415 pathogenic, 13, 14 pathogens, 386, 388 pathology, 408, 409 pathways, 4, 95, 97, 211, 224, 226, 264, 265, 266, 267, 269, 273, 278, 279, 280, 282, 284, 287, 291, 292, 296, 298, 299, 306, 307, 308, 309, 311, 364 patient care, 41 patients, 10, 12, 13, 47, 48, 50, 52, 55, 82, 215, 216, 217, 220, 224, 226, 227, 228, 259, 260, 261, 262, 263, 264, 266, 267, 268, 269, 270, 271, 272, 273, 275, 386, 410, 411 PCA, 205 PCR, 4, 21, 60, 92, 226, 232, 233 pediatric, 271 pedigree, 13, 231, 237, 238, 252, 253, 255, 256 pelvis, 226 penalties, 17, 102, 103, 133, 144 penalty, 9, 11, 29, 30, 37, 99, 100, 101, 103 Pennsylvania, 59 peptide, 15, 42, 308, 327, 375, 376, 377, 378, 380, 381, 382, 383, 384, 385, 386, 387, 388 peptides, 15, 13, 18, 43, 319, 321, 323, 327, 372, 375, 376, 377, 378, 380, 381, 382, 383, 385, 386, 387, 388 periodic, 352, 400 periodic table, 400 periodicity, 93 peripheral blood, 376, 382 permit, 388 personal communication, 254 perturbation, 106, 206 pH, 283, 314 pharmaceutical, 16 pharmaceutical companies, 16 pharmacogenomics, 16 pharmacology, 408 phenotype, 3, 15, 266, 346 phenotypes, 9, 19, 20, 224, 226, 229, 268, 411 phenotypic, 25, 95, 354, 415 Philadelphia, 254 philosophy, 342
phosphorylation, 93, 96
photoreceptor, 27
phylogenetic, 11, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 142, 143, 144, 145, 146, 414
phylogenetic tree, 11, 129, 130, 133, 134, 136, 138, 139, 143, 144, 145, 146
phylogeny, 130, 133, 134, 137, 139, 140, 143, 144, 145
physical properties, 161, 382, 387
physical world, 14, 341, 342, 345, 357
physicists, 155
physicochemical, 333, 335, 336
physico-chemical properties, 14, 315, 321, 333, 335, 336
physics, 160, 308, 313, 342, 344
physiological, 12, 25, 80, 133, 150, 151, 152, 222, 223, 226, 351, 376
physiological factors, 376
pilot study, 41
pipelines, 11, 129, 136, 165
planning, 10, 59, 60, 61, 65, 69, 70, 79, 81, 84, 135
plants, 13, 231, 232, 233, 370, 371, 373
plaque, 91
plasma, 414, 418
plasma proteins, 414, 418
plasmid, 92
plastic, 152, 213
plasticity, 157, 210
platforms, 4, 225, 250, 262, 263, 267
Plato, 342, 350, 353, 357
play, 9, 19, 20, 29, 226, 242, 282, 283, 326, 332, 333, 351, 377
Pleistocene, 146
PLO, 406
plug-in, 399, 405
PLUS, 84, 383, 384
point mutation, 279, 312
polarity, 401
pollination, 232
polydactyly, 372
polygamy, 250
polygenic, 25
polymer, 147, 212, 213
polymer chains, 212, 213
polymerase, 159, 160, 232, 257, 373
polymerase chain reaction, 232, 257
polymerization, 352
polymorphism, 9, 6, 19, 25, 149, 156, 160, 162, 226, 353
polymorphisms, 9, 1, 4, 5, 6, 7, 20, 234, 257, 411
polynomial, 178, 241, 242, 244, 382
polypeptide, 14, 288, 315, 317, 321, 337, 398
polypeptides, 328
polyploid, 7
polyunsaturated fat, 80
polyunsaturated fatty acid, 80
polyunsaturated fatty acids, 80
pools, 352
poor, 4, 78, 134, 188, 216, 228, 260, 261, 263, 266, 268
poor performance, 260
population, 24, 25, 55, 232, 233, 234, 237, 238, 239, 241, 249, 250, 252, 254, 257, 267, 273, 363, 409, 416, 417
population size, 239
portability, 392
positive correlation, 53
positron, 216
positron emission tomography, 216
post-translational, 93, 96, 148, 157
post-translational modifications, 148, 157
power, 10, 13, 59, 60, 61, 62, 63, 64, 65, 67, 69, 70, 71, 72, 75, 76, 77, 78, 79, 81, 84, 86, 137, 177, 178, 195, 231, 234, 267, 268, 273, 332, 382
pragmatic, 355
pRb, 90
pRB, 97
predictability, 274
predictive accuracy, 273, 327, 328
predictive model, 264, 346, 382
predictive models, 346
predictors, 12, 165, 167, 175, 176, 182, 186, 187, 217, 266, 332, 382
pre-existing, 4
preference, 370
preprocessing, 217, 218
press, 87
prevention, 48, 409
preventive, 80
prices, 105
primary data, 40
primary tumor, 265, 271
primates, 141
primitives, 398
principal component analysis, 205
prior knowledge, 371
private, 41, 44
probability, 50, 64, 65, 67, 73, 74, 75, 76, 101, 103, 134, 173, 176, 178, 190, 209, 240, 268, 283, 289, 293, 294, 295, 321, 324, 382
probability density function, 190
probability distribution, 134, 209
probe, 218
production, 20, 140, 188, 408
progenitor cells, 274
prognosis, 216, 224, 226, 260, 261, 267, 268, 270, 271, 272, 273, 275, 276
prognostic marker, 13, 229, 259, 260, 264, 269
prognostic value, 262, 264
program, 11, 16, 6, 22, 41, 50, 99, 101, 106, 110, 117, 131, 134, 144, 160, 180, 217, 220, 225, 228, 232, 244, 246, 248, 255, 295, 335, 389, 390, 391, 392, 393, 394, 395, 398, 399, 400, 402, 403, 404
programming, 14, 16, 29, 37, 100, 128, 292, 306, 383, 389, 390, 391, 392, 394, 395, 398, 402, 403, 404, 405
proliferation, 62, 96, 224, 226, 265, 267, 269
promoter, 11, 15, 89, 90, 91, 92, 93, 95, 96, 149, 226, 229, 361, 362, 363, 364, 365, 366, 368, 369, 370, 371, 372, 373
promoter region, 11, 15, 89, 90, 361, 362, 364, 365, 368, 370
propagation, 177, 178, 183, 189, 191, 212, 330
property, iv, 9, 29, 30, 34, 43, 101, 104, 148, 170, 235, 236, 241, 242, 243, 247, 252, 321, 334, 344, 347, 348, 349, 351, 353, 399, 400, 414
prophase, 152
proposition, 30, 31, 104, 105
prostate, 2, 5, 6, 51, 52, 54, 260, 265, 272
prostate cancer, 51, 54, 272
protection, 20
protein arrays, 62
protein binding, 212
protein conformations, 167
protein engineering, 210, 309, 311, 312, 313
protein family, 143, 317
protein folding, 188, 194, 278, 279, 282, 284, 285, 286, 287, 288, 306, 307, 308, 309, 310, 311, 312, 313, 314, 325, 331, 338, 353
protein function, 316, 336, 415
protein secondary structure, 190, 191, 211, 321, 337, 338, 411
protein sequence, 12, 41, 42, 136, 145, 165, 185, 282, 316, 329, 334, 335, 337, 352, 371, 385
protein structure, 12, 166, 167, 168, 169, 170, 171, 175, 186, 187, 188, 189, 190, 191, 193, 194, 195, 196, 197, 198, 199, 200, 202, 203, 206, 208, 209, 212, 213, 280, 282, 283, 286, 287, 294, 307, 310, 317, 318, 330, 334, 337, 339, 358, 410, 411, 413
protein structure analysis, 191
protein synthesis, 146, 352
protein-coding sequences, 385
protein-protein interactions, 266
proteobacteria, 145
proteolysis, 97
proteome, 16, 85, 142, 143, 146, 407, 408, 410, 416
proteomes, 130, 144, 146
proteomics, 10, 59, 62, 260, 273, 408, 409, 413, 416, 417
prothrombin, 16, 407, 408, 413, 414, 415, 418
prothrombin complex concentrates, 413, 418
prothrombin deficiency, 413, 414, 415, 418
protocol, 20, 91, 166, 167, 173, 175, 186, 218
prototyping, 403, 404
proxy, 397
pruning, 139
PSD, 42
pseudo, 166, 172, 173, 317, 321, 329, 331, 338, 339
PSI, 144, 167, 181, 182, 188
psychology, 354
public, 11, 6, 7, 41, 42, 69, 89, 91, 132
Puerto Rican, 26
PUFA, 60, 61, 66, 68
PUFAs, 60, 66
P-value, 268
PVC, 358
Q
QSAR, 17
quadrupole, 21
quality control, 413
quality of life, 260
quantization, 330
quantum, 15, 341, 344, 345, 346, 348, 357, 359, 360
quantum computers, 359
quantum mechanics, 15, 341, 345
quantum state, 345, 346
quantum theory, 344, 357
Quebec, 51
Quercus, 257
query, 10, 7, 39, 41, 132, 179, 180, 181, 182, 183, 184, 185, 186, 187, 225, 326, 327, 329
R
R&D, 43
race, 22
racial differences, 21, 22, 23, 24, 25
racial groups, 22, 23
radiation, 20, 25, 26, 54, 86, 140, 151
radical cystectomy, 220
radiotherapy, 220
radius, 400
radius of gyration, 400
random, 11, 22, 50, 62, 64, 70, 74, 78, 99, 100, 101, 102, 103, 116, 120, 123, 169, 173, 181, 237, 238, 240, 248, 249, 262, 263, 266, 272, 274, 289, 291, 310, 321, 322, 333, 335, 347, 363, 384
random numbers, 120, 240
random walk, 173
randomness, 122
range, 11, 15, 70, 79, 129, 138, 150, 157, 162, 167, 176, 177, 183, 184, 190, 197, 204, 233, 247, 251, 253, 278, 286, 310, 361, 369, 371, 372, 408
rat, 60, 66, 80, 86, 227, 228, 266
RBF, 322, 335
reactant, 351
reaction rate, 293
reactivity, 386, 387
reading, 52, 312, 392, 397, 400
reagent, 217, 409
reality, 51, 247, 342, 344, 345, 353, 354, 355
reasoning, 332, 392
receptor sites, 43
receptor-positive, 273
receptors, 27, 377
recognition, 13, 14, 16, 17, 181, 188, 326, 338, 357, 372, 376, 377, 381, 386, 388
recombination, 43, 93, 356
reconciliation, 139
reconstruction, 13, 130, 132, 136, 137, 138, 142, 143, 144, 145, 146, 167, 170, 172, 173, 175, 178, 185, 189, 231, 232, 233, 234, 236, 237, 238, 239, 240, 241, 242, 244, 246, 247, 248, 250, 251, 252, 253, 254, 255, 256, 257, 363
recurrence, 216, 267, 268
recursion, 292
red blood cell, 409, 416
reductionism, 353, 358
redundancy, 132, 180, 181, 329
refining, 168
reflection, 342, 345, 355
reflectivity, 358
regeneration, 350
regional, 50, 52
Registry, 50
regression, 14, 69, 70, 266, 274, 315, 317, 320, 321, 322, 332, 334, 335, 336, 337, 340, 346, 382, 387
regression analysis, 387
regression method, 266, 320
regular, 41, 54, 151, 178, 218
regulation, 12, 15, 65, 74, 90, 92, 93, 95, 96, 97, 98, 147, 149, 150, 157, 222, 223, 226, 229, 267, 361, 362, 371, 373
regulators, 90, 92, 93, 226, 229, 372
rejection, 63, 69, 73, 74
relational database, 15, 375, 409
relationship, 13, 12, 54, 55, 91, 104, 132, 169, 218, 231, 235, 236, 237, 239, 246, 247, 255, 256, 262, 286, 330, 365, 380
relationships, 11, 13, 129, 130, 132, 137, 138, 139, 140, 141, 142, 169, 218, 220, 231, 232, 234, 237, 238, 246, 252, 253, 254, 256, 320, 331, 339, 353
relaxation, 11, 99, 100, 101, 103, 104, 105, 106, 107, 108, 109, 110, 112, 113, 114, 115, 119, 126, 127, 128, 159, 310
reliability, 14, 40, 185, 259, 261, 366, 369
remodeling, 149, 159, 160, 362
renal, 226, 271
renal cell carcinoma, 271
repair, 93, 97, 150, 346, 349, 353, 359, 410
repair system, 359
replacement rate, 146
replication, 83, 92, 93, 95, 96, 97, 150, 226, 353
repression, 94
reproduction, 10, 47, 48, 223, 255, 352
reputation, 102
resection, 13, 215, 216
residues, 12, 13, 14, 132, 150, 165, 167, 170, 171, 173, 176, 181, 183, 184, 198, 200, 202, 203, 204, 205, 279, 281, 285, 287, 288, 289, 290, 291, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 309, 311, 313, 314, 326, 327, 328, 334, 410
resolution, 62, 135, 146, 148, 159, 167, 170, 172, 179, 181, 186, 200, 282, 313, 319, 353
resources, 3, 4, 97, 101, 130, 132, 135, 140, 143, 402
respiratory, 93
retinoblastoma, 90
returns, 110, 111, 344
reverse transcriptase, 211
Reynolds, 372
Rho, 7
rhythms, 93, 96
ribosomal, 211, 218
ribosomal RNA, 211, 218
ribosome, 280
rice, 15, 361, 365, 369, 370
rickets, 20
rigidity, 155
risk, 12, 50, 54, 226, 229, 255, 260, 268
risk factors, 268
risks, 12, 54
RNA, 3, 42, 90, 97, 159, 160, 212, 217, 218, 310, 352, 361, 373
robustness, 94, 211, 217, 238
rodents, 141
room temperature, 217
Root Mean Square Deviation, 167
routines, 392
routing, 100, 102, 127
Russia, 309
Russian, 277, 307, 359, 360
Russian Academy of Sciences, 277
S
S phase, 93
Saccharomyces cerevisiae, 142, 228, 372
saline, 60
salmon, 239, 248, 249, 255
salt, 150, 151, 153, 155, 158, 162, 280
sample, 10, 13, 21, 59, 60, 61, 62, 63, 64, 65, 66, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 82, 83, 84, 86, 102, 185, 186, 216, 218, 219, 220, 232, 237, 239, 248, 252, 262, 263, 264, 267, 320, 332, 382, 392, 398, 400
sample mean, 64
sampling, 16, 18, 64, 73, 137, 138, 142, 194, 219, 232, 233, 240, 242, 253, 262, 372
sampling distribution, 64, 73
saturated fat, 60
saturated fatty acids, 60
savings, 4, 119
scalability, 273, 403
scalable, 146, 225
scaling, 130, 219, 285, 311
scaling law, 285
scatter, 381
scheduling, 102, 126, 127
Schmid, 44, 227, 274, 313
scientific computing, 389, 390, 391, 395, 402, 403, 404
scores, 78, 133, 268, 318, 319, 380, 383, 384, 385, 415
scripts, 389, 392, 401, 404
sea urchin, 163
seals, 148
search, 11, 15, 18, 25, 40, 42, 50, 77, 91, 92, 99, 101, 103, 106, 109, 110, 111, 112, 119, 120, 121, 130, 132, 134, 135, 139, 141, 143, 144, 145, 158, 166, 167, 168, 172, 173, 175, 181, 191, 237, 240, 260, 280, 285, 291, 310, 334, 357, 361, 363, 369, 375, 383, 384, 385
search engine, 42
searches, 41, 132, 134, 135, 155, 166, 167, 377, 385
searching, 11, 15, 40, 41, 42, 60, 89, 91, 166, 172, 267, 291, 306, 375, 411
seed, 131, 254, 363
segregation, 93
selecting, 70, 110, 136, 137, 201, 249, 264, 265
selectivity, 14
self, 213, 215, 219, 227, 333, 358
self-consistency, 319, 320, 321, 326, 330, 331, 332
self-organization, 330, 360
self-organizing, 12, 176, 215, 217, 227, 228
Self-Organizing Maps, vi, 215, 219, 227
semantic, 15, 341, 348
semantics, 130, 390, 392, 404
semiotics, 15, 341, 343, 347, 355
senescence, 97, 359
sensitivity, 14, 76, 78, 79, 84, 132, 144, 264, 265, 268, 328, 332, 369, 370, 382
separation, 218, 382
sequencing, 10, 3, 4, 89, 130, 142, 146, 233, 317, 364
series, 3, 42, 60, 236, 321, 354, 360
serum, 225, 265, 273
services, 95, 416
set theory, 332
severity, 409, 411, 413
sex, 241, 369
Shanghai, 414, 415, 418
shape, 43, 156, 159, 211, 213, 282, 357
shares, 16, 166, 391, 407, 408
sharing, 12, 16, 44, 145, 253, 285, 403, 416
sheep, 232, 255
shock, 285, 313
short period, 10, 39
short-range, 197
shrimp, 248, 255
SIB, 238
sibling, 13, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 250, 251, 252, 253, 254, 256, 257
siblings, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 247, 257
sickle cell, 409, 411, 413, 416, 417
sickle cell anemia, 409, 413
side effects, 13, 14
sign, 342, 343, 344, 345, 348
signal transduction, 195, 226
signaling, 10, 20, 25, 26, 47, 48, 210
signals, 10, 47, 48, 97, 177, 312, 371, 372, 373
signal-to-noise ratio, 136
significance level, 22, 63, 75, 385
signs, 15, 341, 343, 345, 355
silica, 21
silico methods, 150, 336
similarity, 12, 14, 13, 14, 132, 133, 135, 165, 166, 167, 176, 179, 180, 181, 182, 183, 184, 185, 218, 237, 283, 315, 317, 318, 319, 321, 328, 331, 333, 334, 335, 336, 382
simulation, 12, 16, 18, 43, 59, 63, 64, 65, 69, 70, 71, 72, 78, 84, 94, 144, 149, 150, 153, 160, 162, 193, 194, 196, 205, 209, 210, 211, 249, 250, 289, 299, 306, 312, 313, 414
simulations, 12, 14, 64, 72, 150, 155, 158, 160, 167, 168, 175, 180, 193, 194, 210, 248, 277, 280, 282, 286, 287, 293, 296, 299, 302, 305, 307, 308, 311, 314
Singapore, 310, 360
single nucleotide polymorphism, 6, 7, 20, 226, 234
single test, 75
single-nucleotide polymorphism, 5, 6
siRNA, 90
sites, 10, 8, 15, 17, 22, 43, 45, 47, 48, 51, 52, 91, 93, 94, 95, 97, 134, 135, 137, 146, 157, 159, 362, 365, 369, 370, 371, 373
skeleton, 150, 410
skills, 39, 163, 409
skin, 9, 19, 20, 21, 23, 24, 25, 26, 48
skin cancer, 48
sleep, 96
Sm, 241
smiles, 392
SNP, v, 9, 3, 6, 7, 19, 20, 21, 22, 24, 25, 26, 27, 226
SNPs, 3, 4, 5, 6, 7, 8, 20, 21, 22, 23, 24, 27, 234, 418
social group, 256
sodium, 284
software, 12, 13, 12, 13, 22, 42, 44, 50, 54, 143, 145, 215, 216, 217, 218, 219, 220, 222, 225, 232, 233, 237, 238, 239, 249, 250, 253, 256, 346, 367, 370, 371, 389, 390, 391, 394, 400, 402, 403, 404, 405
solid tumors, 6
solvation, 43
solvent, 150, 166, 179, 180, 182, 186, 190, 191, 288, 325
somatic mutations, 5
soot, 54
sorting, 368, 403
South Africa, 83
Southeast Asia, 416
spacers, 362
Spain, 129
spatial, 93, 194, 233, 309, 354
spatiotemporal, 357
speciation, 139, 140, 145
species, 11, 15, 7, 41, 42, 129, 130, 131, 137, 138, 139, 140, 141, 142, 143, 146, 232, 233, 237, 240, 252, 274, 351, 354, 360, 361, 369, 370, 372
specificity, 15, 66, 91, 95, 264, 268, 328, 366, 369, 375, 386, 387
spectroscopy, 62, 210, 273, 311
spectrum, 10, 15, 47, 94, 268, 333, 375, 377, 409
speed, 11, 99, 100, 135, 136, 404
sperm, 163
spin, 344
spindle, 93
springs, 195, 202, 205
squamous cell, 270
squamous cell carcinoma, 270
stability, 108, 149, 157, 211, 262, 279, 284, 291, 308, 309, 312, 346, 351, 358, 403, 414
stabilization, 11, 82, 99, 100, 102, 108, 112, 119, 120, 121, 122, 126, 127, 283, 308
stabilize, 103, 284, 309
stages, 13, 2, 12, 61, 140, 166, 178, 216, 226, 295, 355
standard deviation, 70, 71, 320, 378
standard model, 102
standards, 95
staphylococcal, 310
stars, 185, 186
statistical analysis, 15, 272, 285, 375
statistical mechanics, 197, 212
statistics, 9, 19, 20, 64, 67, 69, 72, 73, 76, 78, 80, 227, 239, 339
steady state, 346
stiffness, 155, 194, 197, 198, 199, 200, 202, 204, 205
stimuli, 94
stimulus, 223, 285
stochastic, vi, 11, 99, 100, 101, 102, 103, 107, 108, 109, 111, 113, 115, 116, 117, 123, 124, 125, 126, 127, 128, 170
stochastic model, 116
stock, 102, 127
stoichiometry, 163
stomach, 13, 48, 51, 52, 260
strategies, 9, 10, 1, 4, 47, 48, 54, 90, 109, 119, 120, 121, 135, 137, 181, 216, 232, 257, 386
strategy use, 54
strength, 149, 366, 367
stress, 152, 158, 410
stretching, 155, 158, 160, 161, 196, 211
stroke, 195
STRs, 232
structuring, 296, 301
students, 39
subgroups, 370
substitution, 17, 134, 332, 364, 414, 418
substrates, 14
subtraction, 2, 219
success rate, 329, 331, 332
suffering, 260, 386
sulfate, 284
Sun, 155, 162, 268, 271, 274, 275, 338, 417, 418
supercomputers, 140, 280
superficial bladder cancer, 13, 216, 220, 226
superiority, 334
superposition, 14, 281, 345
suppression, 408
suppressor, 216, 226, 228
suppressors, 90, 96, 98
supramolecular, 12, 147
surface area, 280, 286
surgery, 215, 260
surveillance, 10, 47, 48, 54, 55
survivability, 260
survival, 265, 268, 270, 271, 273
Survivin, 228
Sweden, 261, 262
switching, 236, 352
symbols, 177
symmetry, 78, 283, 311
symptoms, 413
synapse, 223
syndrome, 54, 413, 417
synergistic, 12
synergistic effect, 12
syntax, 390, 391, 392, 404
synthesis, 10, 20, 23, 59, 60, 146, 196, 212, 267, 274, 352, 387, 410
systemic biology, 43
systems, 13, 15, 4, 12, 16, 47, 94, 95, 135, 166, 167, 169, 178, 180, 181, 182, 186, 188, 212, 231, 232, 252, 255, 341, 345, 347, 349, 350, 351, 352, 353, 355, 356, 357, 358, 359, 360, 386, 388, 391, 405, 408
systomics, 43
T
T cell, 376, 377, 381, 382, 383, 386, 387, 388
T lymphocytes, 387
Taiwan, 215
tandem mass spectrometry, 409, 414, 416
tanks, 255
Tanning, 26
targets, 13, 14, 15, 16, 81, 90, 92, 93, 95, 97, 166, 218, 227, 386
taxa, 134, 139, 145, 353, 354
taxonomy, 144, 353
TCC, 226, 376, 382, 386
T-cell, 15, 375, 376, 377, 378, 381, 386, 387, 388
T-cell receptor, 377
TCR, 15, 375, 376, 377, 380, 386, 388
technology, 15, 26, 41, 80, 81, 224, 375, 377, 390, 402, 408, 410, 416
telomerase, 216
telomeres, 157
TEM, 151
temperature, 173, 175, 197, 217, 282, 288
temporal, 194, 256, 342, 354, 357
tenascin, 210
tension, 152, 160
termination codon, 410
test procedure, 334, 336, 337
test statistic, 64, 65, 67, 69, 72, 73, 74, 76
Texas, 99
Thai, 417
Thailand, 39, 407, 410, 413
thalassemia, 408, 409, 410, 416
Thalassemia, 416
theoretical assumptions, 151
theoretical biology, 12, 350, 351
therapeutic agents, 9, 11, 12
therapeutic approaches, 4, 376
therapeutic interventions, 224
therapeutics, 97, 225, 414, 418
therapy, 6, 48, 227, 260, 269, 272, 273, 411, 416
thermal denaturation, 195, 308
thermal energy, 200, 204, 205
thermodynamic, 14, 149, 157, 277, 278, 284, 309, 349, 352
thermodynamic equilibrium, 14, 277, 309
thermodynamic stability, 157
thermodynamics, 282, 313
threat, 273
three-dimensional, 280, 281, 282, 310, 312, 331, 345, 350, 357
three-dimensional space, 345
threshold, 69, 73, 170, 171, 176, 180, 239, 240, 255, 318, 363, 384, 385
threshold level, 384
thresholds, 132, 171, 172, 175, 238
thrombin, 413
thymine, 3
thymus, 358
tiger, 248, 255
time constraints, 251
time periods, 294
timing, 96, 361, 362
tissue, 10, 13, 1, 2, 3, 4, 6, 47, 48, 94, 216, 217, 220, 221, 232
title, 250, 392, 394
tobacco, 54
Tokyo, 20
tolerance, 106, 116
top-down, 264
topological, 12, 138, 139, 140, 141, 146, 147, 149, 152, 157, 158
topology, 48, 134, 135, 138, 141, 142, 143, 145, 157, 163, 175, 201, 203, 206, 208, 283, 284, 285, 288, 307, 308, 309, 313, 318
toxicity, 12, 15, 16, 411
toxicological, 80
toxicology, 408
TP53, 266
trade, 170, 282, 288, 382
trade-off, 282, 288, 382
training, 39, 172, 176, 177, 178, 182, 185, 263, 264, 320, 327, 330, 331, 332, 382, 391, 402
traits, 9, 13, 19, 20, 25, 49, 54, 55, 231, 353, 354
trajectory, 290, 294, 303, 304
trans, 278, 362
transcriptase, 211
transcription, 11, 21, 89, 90, 91, 92, 93, 95, 96, 97, 149, 150, 160, 223, 224, 226, 356, 361, 362, 369, 371, 372
transcription factor, 11, 21, 89, 90, 91, 92, 93, 95, 96, 97, 370, 372
transcription factors, 91, 92, 93, 97
transcriptional, 11, 15, 70, 89, 90, 91, 93, 94, 95, 96, 98, 228, 272, 361, 362, 370, 371, 372, 373
transcriptomics, 3
transcripts, 2, 3, 4
transducer, 223
transduction, 194
transfection, 92
transfer, 13, 138, 166, 178, 345
transformation, 65, 84, 97, 157, 196, 267, 344, 348, 349, 353, 354, 398
transformations, 354, 358
transfusion, 414, 418
transgene, 409
transgenic, 264, 266, 409, 416
transgenic mice, 266, 416
transition, 14, 95, 130, 157, 158, 195, 213, 277, 278, 279, 280, 282, 283, 284, 286, 288, 289, 291, 293, 294, 296, 298, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 325, 355
transitional cell carcinoma, 227, 229
transitions, 208, 211, 212, 213, 282, 292
translation, 90, 223, 226, 317, 343, 356, 371, 372
translational, 93, 96, 148, 157, 365
transmission, 282
transparency, 272, 335
transparent, 336
transport, 23
transportation, 126
transposon, 369
transposons, 369
transurethral resection, 13, 215, 216
traps, 308
trees, 130, 133, 134, 135, 137, 138, 139, 140, 141, 142, 143, 145, 146, 232, 335
trial, 272
trust, 108
tumor, 9, 13, 1, 2, 5, 7, 61, 79, 80, 85, 90, 96, 97, 98, 215, 216, 217, 220, 226, 227, 228, 229, 266, 268, 270, 271, 275, 382, 386, 387, 388
tumor cells, 3, 97
tumor growth, 229
tumor progression, 226, 228, 266
tumorigenesis, 270
tumorigenic, 225, 228
tumors, 12, 13, 6, 215, 216, 217, 220, 265, 273
tumour, 2, 5, 6, 7, 270
tumours, 3, 4, 275
turnover, 352
two-dimensional, 220, 410, 414
Type I error, 63, 64, 75
typology, 359
tyrosine, 23
U
ultrasonography, 216, 227
ultraviolet (UV), 20, 24, 25, 26
uncertainty, 4, 100, 102, 128, 170, 348
unfolded, 14, 167, 169, 194, 277, 279, 282, 283, 286, 287, 288, 289, 291, 293, 294, 295, 296, 301, 302, 304, 307
unification, 344
uniform, 116, 121, 181, 249, 250, 289
United States, 216, 271, 272, 273, 274, 275
universal genetic code, 360
universal law, 355
universe, 241, 356, 360
untranslated regions, 3, 5
ureter, 226
urinary, 216, 226, 229
urinary bladder, 216, 229
urinary bladder cancer, 216
urine, 85
user-defined, 42, 91
Utah, 254
UV exposure, 20
UV irradiation, 20
UV radiation, 20
V
vaccine, 386
vacuum, 218
Valencia, 91, 129, 188, 189, 190
validation, 1, 12, 60, 86, 178, 180, 182, 228, 233, 248, 264, 272, 319, 320, 325, 328, 333, 337, 382
validity, 34, 273, 345, 367
variability, 24, 25, 61, 70, 71, 73, 76, 122, 163, 267
variables, 22, 60, 101, 102, 103, 105, 106, 108, 109, 110, 111, 112, 113, 114, 115, 123, 126, 245, 246, 262, 268, 328, 335, 351, 382, 392, 394
variance, 61, 64, 68, 70, 74, 77, 82, 84, 101, 219, 384, 385
variation, 9, 3, 4, 7, 19, 20, 23, 25, 26, 27, 233, 250, 253, 255, 256, 267, 274, 280, 284, 354, 360, 408, 416
vector, 70, 74, 91, 92, 102, 103, 104, 173, 176, 189, 190, 195, 196, 198, 206, 317, 321, 322, 323, 325, 326, 330, 331, 333, 334, 335, 337, 338, 339, 382, 388, 401, 403
vertebrates, 140, 233, 373
vesicle, 23
vibration, 194
virulence, 13
virus, 91, 386, 387
virus infection, 91
visible, 350, 353
vision, 358
visualization, 26, 217, 225, 401, 403
vitamin D, 20
vocabulary, 42
voting, 325, 333
Vygotsky, 355, 360
W
water, 62, 218, 278, 283, 320
water-soluble, 320
wavelet, 150
weakness, 266
wealth, 156
web, 15, 3, 7, 42, 44, 45, 141, 143, 162, 253, 373, 375, 377, 383, 388, 404, 408
web browser, 404
web-based, 7, 44, 377, 383, 388, 408
websites, 3
Weinberg, 10, 22, 47, 48, 55, 56
wells, 378
western blot, 4
wild type, 21, 43
wild-type allele, 21
windows, 126, 127
wisdom, 357
women, 26, 216
wood, 233
workflow, 17
World Health Organization, 55
writing, 397
X
xenografts, 229
X-ray crystallography, 13, 14, 200, 403
X-ray diffraction, 150
Y
YAC, 409, 416
yang, 96
yeast, 15, 90, 95, 145, 146, 150, 161, 228, 361, 362, 372, 373
yield, 12, 113, 114, 115, 167, 186, 326, 331
yin, 96
Z
zebrafish, 27
zinc, 369, 372