Evolution after Gene Duplication


Author: Katharina Dittmar | David Liberles



Edited by

Katharina Dittmar SUNY at Buffalo Buffalo, New York

David Liberles University of Wyoming Laramie, Wyoming

A JOHN WILEY & SONS, INC., PUBLICATION

Copyright © 2010 by Wiley-Blackwell. All rights reserved. Wiley-Blackwell is an imprint of John Wiley & Sons, formed by the merger of Wiley’s global Scientific, Technical and Medical business with Blackwell Publishing. Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission. Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. 
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002. Wiley also publishes its books in a variety of electronic formats. Some Content that appears in print may not be available in electronic formats. For information about Wiley products, visit our web site at www.wiley.com. Library of Congress Cataloging-in-Publication Data: Dittmar, Katharina. Evolution after gene duplication / Katharina Dittmar and David Liberles. p. cm. Includes bibliographical references and index. Summary: “Gene duplication has long been believed to have played a major role in the rise of biological novelty through evolution of new function and gene expression patterns. The first book to examine gene duplication across all levels of biological organization, Evolution after Gene Duplication presents a comprehensive picture of the mechanistic process by which gene duplication may have played a role in generating biodiversity. Key Features: Explores comparative genomics, genome evolution studies and analysis of multi-gene families such as Hox, globins, olfactory receptors and MHC (immune system). A complete post-genome treatment of the topic originally covered by Ohno’s 1970 classic, this volume extends coverage to include the fate of associated regulatory pathways. Taps the significant increase in multi-gene family data that has resulted from comparative genomics. Comprehensive coverage that includes opposing theoretical viewpoints, comparative genomics data, theoretical and empirical evidence and the role of bioinformatics in the study of gene duplication. This up-to-date overview of theory and mathematical models along with practical examples is suitable for scientists across various levels of biology as well as instructors and graduate students”— Provided by publisher. ISBN 978-0-470-59382-0 (hardback) 1. 
Evolutionary genetics. 2. Mutation (Biology) 3. Variation (Biology) I. Liberles, David II. Title. QH390.D58 2010 572.8 38–dc22 2010031097 Printed in the United States of America 10 9 8 7 6 5 4 3 2 1


CONTENTS

Contributors

Preface

1 Understanding Gene Duplication Through Biochemistry and Population Genetics
David A. Liberles, Grigory Kolesov, and Katharina Dittmar

2 Functional Divergence of Duplicated Genes
Takashi Makino, David G. Knowles, and Aoife McLysaght

3 Duplicate Retention After Small- and Large-Scale Duplications
Steven Maere and Yves Van de Peer

4 Gene Dosage and Duplication
Fyodor A. Kondrashov

5 Myths and Realities of Gene Duplication
Austin L. Hughes and Robert Friedman

6 Evolution After and Before Gene Duplication?
Tobias Sikosek and Erich Bornberg-Bauer

7 Protein Products of Tandem Gene Duplication: A Structural View
William R. Taylor and Michael I. Sadowski

8 Statistical Methods for Detecting Functional Divergence of Gene Families
Xun Gu

9 Mapping Gene Gains and Losses Among Metazoan Full Genomes Using an Integrated Phylogenetic Framework
Athanasia C. Tzika, Raphaël Helaers, and Michel C. Milinkovitch

10 Reconciling Phylogenetic Trees
Oliver Eulenstein, Snehalata Huzurbazar, and David A. Liberles

11 On the Energy and Material Cost of Gene Duplication
Andreas Wagner

12 Fate of a Duplicate in a Network Context
Orkun S. Soyer

13 Evolutionary and Functional Aspects of Genetic Redundancy
Ran Kafri and Tzachi Pilpel

14 Phylogenomic Approach to the Evolutionary Dynamics of Gene Duplication in Birds
Chris L. Organ, Matthew D. Rasmussen, Maude W. Baldwin, Manolis Kellis, and Scott V. Edwards

15 Gene and Genome Duplications in Plants
Pamela S. Soltis, J. Gordon Burleigh, Andre S. Chanderbali, Mi-Jeong Yoo, and Douglas E. Soltis

16 Whole Genome Duplications and the Radiation of Vertebrates
Shigehiro Kuraku and Axel Meyer

Index

CONTRIBUTORS

Maude W. Baldwin, Museum of Comparative Zoology, Harvard University, Cambridge, Massachusetts
Erich Bornberg-Bauer, Evolutionary Bioinformatics Group, Institute for Evolution and Biodiversity, University of Muenster, Muenster, Germany
J. Gordon Burleigh, Department of Biology, University of Florida, Gainesville, Florida
Andre Chanderbali, Florida Museum of Natural History, University of Florida, Gainesville, Florida; Department of Biology, University of Florida, Gainesville, Florida
Katharina Dittmar, Department of Biological Sciences, SUNY at Buffalo, Buffalo, New York
Scott V. Edwards, Museum of Comparative Zoology, Harvard University, Cambridge, Massachusetts
Oliver Eulenstein, Department of Computer Science, Iowa State University, Ames, Iowa
Robert Friedman, Department of Biological Sciences, University of South Carolina, Columbia, South Carolina
Xun Gu, Department of Genetics, Development and Cell Biology, Center for Bioinformatics and Biological Studies, Iowa State University, Ames, Iowa
Raphaël Helaers, Department of Biology, Facultés Universitaires Notre-Dame de la Paix, Namur, Belgium
Austin L. Hughes, Department of Biological Sciences, University of South Carolina, Columbia, South Carolina
Snehalata Huzurbazar, Department of Statistics, University of Wyoming, Laramie, Wyoming
Ran Kafri, Department of Systems Biology, Harvard Medical School, Boston, Massachusetts
Manolis Kellis, Computer Science and Artificial Intelligence, Massachusetts Institute of Technology, Cambridge, Massachusetts
David G. Knowles, Smurfit Institute of Genetics, University of Dublin, Trinity College, Dublin, Ireland
Grigory Kolesov, Department of Molecular Biology, University of Wyoming, Laramie, Wyoming
Fyodor A. Kondrashov, Center for Genomic Regulation, Barcelona, Spain
Shigehiro Kuraku, Evolutionary Biology and Zoology, Department of Biology, University of Konstanz, Konstanz, Germany
David A. Liberles, Department of Molecular Biology, University of Wyoming, Laramie, Wyoming
Steven Maere, Department of Plant Systems Biology, VIB, Ghent, Belgium; Department of Molecular Genetics, Ghent University, Ghent, Belgium
Takashi Makino, Smurfit Institute of Genetics, University of Dublin, Trinity College, Dublin, Ireland
Aoife McLysaght, Smurfit Institute of Genetics, University of Dublin, Trinity College, Dublin, Ireland
Axel Meyer, Evolutionary Biology and Zoology, Department of Biology, University of Konstanz, Konstanz, Germany
Michel C. Milinkovitch, Department of Genetics and Evolution, Laboratory of Natural and Artificial Evolution, Sciences III, Geneva, Switzerland
Chris L. Organ, Museum of Comparative Zoology, Harvard University, Cambridge, Massachusetts
Tzachi Pilpel, Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel
Matthew D. Rasmussen, Computer Science and Artificial Intelligence, Massachusetts Institute of Technology, Cambridge, Massachusetts
Michael A. Sadowski, Division of Mathematical Biology, MRC National Institute for Medical Research, London, UK
Tobias Sikosek, Evolutionary Bioinformatics Group, Institute for Evolution and Biodiversity, University of Muenster, Muenster, Germany
Douglas E. Soltis, Department of Biology, University of Florida, Gainesville, Florida
Pamela S. Soltis, Florida Museum of Natural History, University of Florida, Gainesville, Florida
Orkun S. Soyer, Engineering, Mathematics and Physical Sciences, University of Exeter, Exeter, UK
William R. Taylor, Division of Mathematical Biology, MRC National Institute for Medical Research, London, UK
Athanasia C. Tzika, Laboratory of Natural and Artificial Evolution, Department of Genetics and Evolution, Sciences III, Geneva, Switzerland; Evolutionary Biology and Ecology, Université de Bruxelles, Brussels, Belgium
Yves Van de Peer, Department of Plant Systems Biology, VIB, Ghent, Belgium; Department of Molecular Genetics, Ghent University, Ghent, Belgium
Andreas Wagner, Department of Biochemistry, University of Zurich, Zurich, Switzerland
Mi-Jeong Yoo, Florida Museum of Natural History, University of Florida, Gainesville, Florida; Department of Biology, University of Florida, Gainesville, Florida


PREFACE

The duplication of genes and genomes was postulated to be an important process for the evolution of functional and organismal diversity long before we entered the genome sequencing era. Many of the groundbreaking intellectual concepts for this hypothesis come from the work of Susumu Ohno. However, only with the recent availability of many genome sequences are we able to gather supporting data to develop hypotheses and test them on a large scale. Current research on the topic spans scientific disciplines from bioinformatics to organismal biology and touches on different aspects of gene and genome duplication, ranging from the molecular mechanics of the duplication process to the fate of duplicated genomes. Naturally, a variety of approaches goes hand in hand with differences in opinion, and presently, a prolific and at times overwhelming body of research has accumulated. Thus, it was our idea to provide a systematic examination of current thought on gene duplication and its importance to biological diversification across multiple levels.

This edited volume began with a review article published in the Journal of Experimental Zoology. With the expansion of concepts into full book chapters, we hoped to cover approaches from a range of fields, starting with molecular and structural biology, leading to computer science and statistics to cellular and, ultimately, organismal-level biology. It is our intention to lay out a hierarchy of chapters extending from evolutionary principles and molecular details out to increasingly higher levels of biological organization. This setup is designed to make the reader appreciate the interconnectedness of these levels. It is important to us to present this work as a platform for diverse scientific approaches and reasoning.

One clear point that will emerge in reading the book is that chapters come from authors in diverse disciplines, some of whom disagree with each other on underlying evolutionary forces, on experimental procedures, and on data interpretation. Occasionally, authors use terms in different ways. In particular, concepts of redundancy (and its maintenance) differ: Genes that one set of authors consider to have been selected for retention due to redundant function others consider to have diverged and to no longer be redundant.

The first section of the book deals with models of gene duplication and retention. Five chapters provide an overview (with some overlap in description, but involving different interpretations) of the mechanisms for the retention of duplicate genes. This includes diverse perspectives on mutational opportunities, dosage compensation, evolutionary processes, and their interrelationship with molecular processes and resulting selection. These fine-scale aspects of interpreting genomic data lay the foundation for the cell/systems level of biology of duplication as well as effects on speciation and biodiversity.

The second section, on gene/protein structure and duplication, includes two chapters. Chapter 6 links the evolutionary process associated with gene duplication through structure to function, paying particular attention to adaptive processes in proteins that have not yet undergone duplication. Chapter 7 describes the link between the process of gene duplication and protein structure, concentrating on a review of the genetic mechanisms creating fused tandem duplicates.

Comparative genomic methodologies for characterizing gene duplicates are treated in the third section. Chapter 8 overviews methodology for characterizing the functional divergence of duplicate genes, Chapter 9 presents a procedure for linking changes in gene copy number through evolution to functional and gene expression evolution, and Chapter 10 presents an overview of model and parsimony-based approaches for gene tree/species tree reconciliation coupled to a detailed presentation of most parsimonious reconciliation.

The fourth section, involving systems biology considerations of gene duplication, includes three chapters. Chapter 11 describes the energetic costs of gene duplication, arguing for a nonneutral process of fixation. The next two chapters describe the interplay between systems-level constraints and duplicate gene retention as well as the complementary interplay between duplication and the structure of biological networks.

The last section builds up to species-level characterizations of gene duplication. The first two chapters in this section treat research on duplication in birds and plants, addressing their roles in the speciation process, as well as developmental and morphological novelty. Chapter 16 concludes this work with a portrayal of the role of duplication in vertebrate speciation.

As editors, we would like to extend our thanks to all the researchers who contributed to this volume. First and foremost, we thank them for their timely delivered scientific insights, but also for their good humor and patience as we worked through the publishing process. We also thank all the colleagues, students, and friends who joined our discussions on this topic. Finally, we would like to thank Karen Chambers, at John Wiley, who was enthusiastic and supportive of our idea, and thankfully, a very patient editor in the long and at times exhausting process of assembling this volume. We hope the book will be a useful tool for researchers and students alike to learn about current research on duplication and inspire continued discussion about this important topic.

David Liberles
Katharina Dittmar

Chapter 1, Figure 4 Change of binding interaction in the two-component system. The red blobs show three amino acid substitutions described by Skerker et al. (2008) that completely switch from EnvZ histidine-kinase to OmpR HK signal transduction. The HK homodimer is on the bottom (green and blue domains), and the response regulator domain (brown, top) was computationally docked to the 2C2A HK structure by Marina et al. (2005). The phosphotransfer histidine is shown in magenta.

[Figure: fitness (phenotype) plotted against protein concentration (genotype), with x-axis values 0 (aa), 1/2 (Aa), 1 (AA), and 2 (AA,AA).]

Chapter 4, Figure 2 Dependence of fitness on gene dosage (the number of gene copies or genotype). Four types of functions are displayed here: in red, a linear fitness function; in blue, a diminishing returns function; in green, a function with a well-defined optimum; and in magenta, a concave function. The area between the two dashed lines defines an area where fitness is similar to the fitness of an individual with one gene copy.

[Figure: panel A plots the number of retained duplicates against KS for the whole paranome and five GO categories, marking the SSD, γ, β, and α duplication modes and the observed KS distribution. Panel B is a heat map of decay constants by duplication mode (SSD, γ, β, α) on a color scale from α < 0.4 through 0.7 to α > 1.0, for these GO process (P) and function (F) categories: (P) metabolism, (P) response to external stimulus, (P) response to biotic stimulus, (P) secondary metabolism, (P) cell cycle, (F) nuclease activity, (P) DNA metabolism, (F) RNA binding, (F) structural constituent of ribosome, (P) flower development, (P) post-embryonic development, (P) development, (F) kinase activity, (P) signal transduction, (F) transferase activity, (F) carbohydrate binding, (P) cell communication, (F) transporter activity, (F) enzyme regulator activity, (P) protein modification, (F) transcription factor activity, (F) protein binding, (F) carrier activity.]

Best-fit decay constants shown in panel A:
whole paranome: α0 = 1.25 ± 0.02, α1 = 0.90 ± 0.03, α2 = 0.65 ± 0.01, α3 = 0.85 ± 0.01
development: α0 = 1.55 ± 0.28, α1 = 0.60 ± 0.04, α2 = 0.45 ± 0.03, α3 = 0.75 ± 0.05
TF activity: α0 = 1.25 ± 0.16, α1 = 0.60 ± 0.04, α2 = 0.30 ± 0.02, α3 = 0.55 ± 0.04
secondary metabolism: α0 = 1.40 ± 0.12, α1 = 0.45 ± 0.15, α2 = 0.15 ± 0.07, α3 = 0.55 ± 0.11
DNA metabolism: α0 = 0.85 ± 0.15, α1 = 0.80 ± 0.11, α2 = 0.85 ± 0.11, α3 = 0.75 ± 0.08
signal transduction: α0 = 3.05 ± 1.07, α1 = 0.35 ± 0.03, α2 = 0.10 ± 0.02, α3 = 0.35 ± 0.03

Chapter 3, Figure 1 (A) Duplicate retention dynamics for several GO categories in Arabidopsis. The colored areas show the simulated fraction of retained duplicates created by each duplication mode as a function of KS; the blue curve is the observed KS distribution. The best-fit decay constants for each duplication mode and their 68% confidence intervals are indicated. α0, α1, α2, and α3 correspond to the SSD, γ, β, and α modes, respectively. (B) Functional bias in duplicate retention is different after WGD and SSD. Blue, high loss; yellow, high retention; black corresponds to α = 0.7, the average decay constant of SSD duplicates across the entire Arabidopsis paranome.

Chapter 6, Figure 4 Energy landscape of two adjacent neutral networks. (A) The large plane is a two-dimensional representation of sequence space. The vertical axis represents the free energy (G) of the protein structures x1 and x2 associated with the neutral networks [here, molecular structures of the Arc-repressor are taken as an example (Cordes et al., 2000)]. The lower the energy, the higher the thermodynamic stability of the structure. Each symbol represents a protein sequence; the lines between them represent amino acid substitutions. Neutral networks are distinguished by symbol shape. Sequences that fold uniquely into one conformation are shown in black; those that are equally stable in more than one conformation are shown in white. The sequence in the middle of the two nets folds equally well into both structures, x1 and x2. A path connecting the prototype sequences (framed symbols) of the two networks is drawn. (B) Frontal view of the energy landscape, showing that the two structures x1 and x2 coexist in an equilibrium for the protein sequences lying in between the two neutral networks. The locations of the prototype sequences are indicated in parts A and B by dashed lines.


Chapter 6, Figure 5 Hypothetical sequence neighborhood of the enzyme PON1. (A) Putative neutral network of PON1 with adjacent networks associated with promiscuous functions. The symbols represent different protein sequences connected by amino acid substitutions (lines). Symbols of different shapes belong to different neutral networks. Symbols with circles around them represent sequences on maxima in the fitness landscape. Large dashed circles delimit the neutral networks of the promiscuous functions. The native function of PON1 is that of a lipolactonase (circles). Promiscuous functions are thiolactonase (hexagons), aryl-esterase (squares), phosphotriesterase (triangles), and drug resistance (stars). (B) Fitness landscape of the same neutral network. The plane represents protein sequence space and the vertical axis is the fitness of an individual expressing the corresponding protein. Over the centre of each neutral network lies a fitness maximum. Paths between the neutral networks correspond to ridges in the fitness landscape, which connect the maxima of neighboring networks. (C) Frontal view of the fitness landscape, showing the maxima in profile. A hypothetical diagram under a highlighted part of the fitness landscape shows how catalytic activities of a promiscuous function (thio-lactonase) and the native function decrease and increase along the connecting path in sequence space [see (Aharoni et al., 2005) for experimental data]. Along this path, overall fitness does not decrease much if both native and promiscuous functions can be maintained by the same enzyme. For a complete transition from one network to another, however, a gene duplication event might be necessary (see Figure 6).

Chapter 7, Figure 4 Structural options for tandem duplicates: (A) beads on a string, illustrated with the structure of titin (2r15); (B) pseudodimer, illustrated with the archaeal histone structure (1f1e); (C) domain swap, illustrated with the structure of cyanovirin (1l5b); (D) inseparable domains, illustrated with the structure of aspartate protease (1e81); (E) entangled domain, illustrated with the structure of myoglobin (101m). In each case the first domain is shown in blue, the second in red.

Chapter 7, Figure 5 Aspartyl protease 1e81. The two halves are colored red and blue to distinguish them; the two active-site aspartic acids are shown in a lighter color.

Chapter 7, Figure 6 Beta propeller structures with (A) four, (B) five, (C) six, and (D) seven blades: structures 1hxn, 1tl2, 1f8d, and 2bbk, respectively.

[Figure: two panels sharing a Generation axis (0 to 1000): fitness effect (top) and pathway size (bottom).]

Chapter 12, Figure 4 Fitness effects of different mutation types and pathway size over generations. Fitness in this case is defined as the ability of the signaling network to produce independent outputs to two incoming signals (Soyer, 2007). The fitness effects of each mutation type are averaged over the entire population. Data are collected and averaged over seven independent evolutionary simulations. Different colors indicate different mutation types. Using the notation of Figure 3, we have black for “duplicate protein,” red for “change coefficient,” blue for “protein loss,” green for “create interaction,” cyan for “protein recruitment,” and yellow for “interaction loss.”

1 Understanding Gene Duplication Through Biochemistry and Population Genetics

DAVID A. LIBERLES and GRIGORY KOLESOV
Department of Molecular Biology, University of Wyoming, Laramie, Wyoming

KATHARINA DITTMAR
Department of Biological Sciences, SUNY at Buffalo, Buffalo, New York

1 INTRODUCTION

Gene duplication has emerged as an important process supporting the functional diversification of genes. Since publication of the seminal book Evolution by Gene Duplication by Ohno (1970), the hypothesis regarding the importance of gene duplication in the generation of evolutionary novelty has steadily gained support as we have entered the genome-sequencing era. It is through the link to functional biology that an ultimate understanding of the preservation and diversification of duplicate genes will be accomplished. Genes can diverge in function through accumulation (fixation) of coding sequence changes, which may influence binding interactions and/or catalysis, through the evolution of splice variants, and through spatial, temporal, and concentration-level changes in the expression of the protein product. Governing these processes is an interplay among mutational opportunity, population dynamics, protein biochemistry, and systems and organismal biology. This interplay is described systematically in this chapter.

2 SYSTEMS BIOLOGY AND HIGHER-LEVEL ORGANIZATION

At the level of biological systems, two early but still relevant views suggested a role for gene duplication in constructing pathways. Both views depend on a new function emerging in one of the duplicates, but they differ in the manner in which it emerges. One view, patchwork evolution, involves conservation of catalytic activity coupled with the evolution of a new substrate specificity after duplication (Jensen, 1976). An alternative view, retrograde evolution, suggests that pathways are built up backward, with product becoming substrate based on recognition of the transition state in the active site, and with a new catalytic activity evolving after duplication to generate the substrate for the downstream reaction (Horowitz, 1945). In a systematic analysis in Escherichia coli, Light and Kraulis (2004) found some evidence for the retrograde evolution model but found the patchwork model to be much more common, possibly because it is easier to gain new binding specificity than to evolve a new catalytic activity. Relatedly, it has been suggested (also in bacteria) that there are secondary (moonlighting) functions, where enzymes with a given catalytic activity carry it out on multiple substrates with different specificities (Copley, 2003). This property of enzymatic activities might generally lead to quick differential optimization after duplication, especially if the different specificities were already maintained in different alleles by balancing selection before duplication. Further (as discussed in detail below), specificity is chemically and evolutionarily difficult to attain, and nonspecific binding activities may arise easily when there is no selective pressure against them. Whereas selective pressures ultimately act at the systems level, divergence occurs gene by gene and mutation by mutation. This process will be dissected.

3 MUTATIONAL DYNAMICS AND SUBSTITUTIONS

Both intramolecular and intermolecular coevolution of sites affect the probability of fixation of any individual mutation, because the genetic background (the sequence at genetically interacting positions) determines the phenotype of any given mutation. The evolutionary accessibility of different mutations from a given genetic background is therefore dictated partly by the mutation rate and the frequency of multiple segregating mutations, as well as by the population size, which dictates the strength of selection. The same evolutionary properties affect both intramolecular and intermolecular interactions, only with differing degrees of sensitivity to mutation, due to the entropic differences between the two types of interresidue interaction. For these entropic reasons, it is easier to knock out a binding interaction than to knock out proper protein folding (although this happens, too) with a single mutation. Although a greater number of sites influence proper folding, covalent attachment means that there is also a greater local effective concentration of intramolecularly interacting residues, so a lower-affinity interaction suffices to generate the same level of bound state. If one views two residues as either interacting or not interacting, the probability of interaction at any given time depends on their affinity for each other and on how many opportunities they have to interact (their concentration about each other).

So far, we have focused on the coding properties of a gene. Gene expression is another important process that is subject to phenotypic divergence through mutation. The typical gene has approximately 12 transcription factor–binding sites [the distribution of this number across genomes is not well characterized, and the estimate assumes sites of approximately six to eight base pairs (Harbison et al., 2004; Hughes and Liberles, 2007)].
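To give a rough sense of scale for sites of six to eight base pairs, the following minimal sketch counts how often one specific short site would be expected to appear by chance, and how many windows sit a single point mutation away from it. It is an illustration under simplifying assumptions that are not taken from the chapter: the four bases are equally frequent and independent, only exact matches on one strand are counted, and the 1000-bp region length and both function names are hypothetical.

```python
def expected_exact_matches(region_length: int, site_length: int) -> float:
    """Expected number of exact matches to one specific site on one strand.

    Assumes a random sequence with equally frequent, independent bases,
    so each window matches with probability 1 / 4**site_length.
    """
    if site_length <= 0 or region_length < site_length:
        return 0.0
    windows = region_length - site_length + 1  # possible start positions
    return windows / 4 ** site_length


def expected_one_mutation_away(region_length: int, site_length: int) -> float:
    """Expected number of windows exactly one point mutation from the site.

    There are 3 * site_length sequences that differ from the site at
    exactly one position (3 alternative bases at each position).
    """
    return 3 * site_length * expected_exact_matches(region_length, site_length)


REGION_LENGTH = 1000  # hypothetical regulatory region length in bp (assumption)
for k in (6, 7, 8):
    exact = expected_exact_matches(REGION_LENGTH, k)
    near = expected_one_mutation_away(REGION_LENGTH, k)
    print(f"site length {k}: exact {exact:.3f}, one mutation away {near:.3f}")
```

Under these assumptions, exact chance matches become rarer as the site lengthens, while windows one substitution away from a site stay more numerous than exact matches. Real binding sites are degenerate and base composition is not uniform, so the numbers are only indicative.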
The specificity of binding typically enables transcription factors to discriminate among many sites with single-base-pair mutations (Lusk and Eisen, 2008). Because of the small size of transcription factor–binding sites, site loss and de novo site evolution are reasonably common, and this is explored further below. Due to the periodicity of standard B-form DNA of about 10 bp, as well as changes in effective local concentration of transcription factors about each other and about the initiation site, it might be expected that spacing between sites is important in gene regulation, but evidence

EVOLUTION OF ENZYME ACTIVE CENTERS AFTER DUPLICATION

3

generated so far seems to downplay the role of these effects (Shultzaberger et al., 2007), leading to a focus on the evolution of the sites themselves. Splicing is another mechanism by which genes can diverge through mutation. There are two types of splicing, constitutive and alternative, with alternative splicing simply showing a weaker consensus to splicing regulatory sites (Churbanov et al., 2008). Like transcription factor–binding sites, splicing regulatory sequences are short and potentially subject to turnover. However, because of the lack of redundancy (unlike transcription factor–binding sites), loss in the absence of duplication may frequently be highly deleterious. It has been shown that alternative splicing itself enables a substitution burst mediated by relaxed selection on and around these regulatory sites (Xing and Lee, 2005). That gene duplication can also enable such a burst of substitution under relaxed selection suggests that gene duplication should enable enhanced rates of alternative transcript generation, and this has indeed now been demonstrated (Jin et al., 2008). Many other molecular mechanisms can contribute to mutation-driven diversification. A far from exhaustive list would include glycosylation sites, protein splicing, and RNA editing—one only needs to think of the effects of duplication and relaxed selection on any processes generating constraint described in a molecular biology textbook. Starting with a few examples of several of these molecular processes, we will then link mutational opportunity to evolutionary mechanism and process. The following section includes a series of examples of the fates of duplicate genes. These examples are meant to be illustrative, and we will ultimately address how general the various processes that underlie the examples actually are. 4 EVOLUTION OF ENZYME ACTIVE CENTERS AFTER DUPLICATION Mutations in the active center(s) of an enzyme can lead to a change of its substrate or a change in its kinetics. 
For example, Vick and Gerlt (2007) demonstrate that a single-base-pair change leading to a D-to-G substitution in the active center of the monofunctional L-Ala-D/L-Glu epimerase from E. coli introduced the ability to catalyze the o-succinylbenzoate synthase reaction while reducing the level of the original reaction (Figure 1). Four additional nucleotide substitutions led to a complete switch of specificity and kinetics to the new reaction. Consistent with the patchwork model discussed earlier, a large number of enzymes in the arginine and lysine synthetic pathways are homologous to each other (Miyazaki et al., 2001).

Mutations in the structure surrounding the active center can lead to fine-tuning the active center to different but fundamentally similar substrates. For example, residues in the active centers of Leu-tRNA synthase and Ile-tRNA synthase are mostly conserved; Leu and Ile are very close chemically. There are a number of variable residues that do not directly contact the substrate residue in the active center but, rather, shape the active center, allowing for recognition of the cognate substrate residue. Both tRNA synthases are highly similar on both the sequence and structural levels. Leu- and Ile-tRNA synthases probably arose via gene duplication (Brown and Doolittle, 1995). This demonstrates a shift in substrate specificity following gene duplication.

4.1 Change of DNA-Binding Specificity

Homeobox genes encode homeodomain-containing transcription factors that are known as principal regulators in the formation of the animal body plan during embryo

Figure 1 (A) Substitutions in the active center of the L-Ala-D/L-Glu epimerase (SP2Q filling residues). Residues labeled in the panel: Arg24, Ile19, Asp297. (B) Substitution of one nucleotide in the codon for Asp297 leads to the acquisition of o-succinylbenzoate synthase activity. Further substitutions of Arg24 and Ile19 (located on an unresolved loop) lead to a complete switch of specificity of the reaction (PDB ID 1JPD). Residues labeled in the panel: Trp24, Phe19, Gly297.

development. They are often organized in homologous gene clusters such as Hox, ParaHox, and NK (Garcia-Fernández, 2005). Hox clusters can contain different numbers of genes in different species, where new genes in the clusters arise via duplication and loss in the course of evolution. It has been shown that the DNA-binding specificity of Hox proteins is controlled by a few key positions in the homeodomain. For example, substitution of Gln with Lys in position 50 of the homeodomain alters recognition from the TAAT(T/G)(A/G) motif recognized by the Antennapedia and Engrailed classes to TAATCC (recognized by bicoid-class proteins) (Hanes and Brent, 1989; Treisman et al., 1989; Percival-Smith et al., 1990). This is shown in Figure 2. Similarly, substitutions in positions 3, 6, and 7 of the


Figure 2 X-ray and NMR structures of three Hox proteins from Drosophila melanogaster in complex with their DNA recognition sites. Mutation in crucial position 50 of the homeodomain switches the specificity to a different DNA-binding site. Gln50 in Antennapedia (A) contacts GG on the antisense strand, leading to recognition of the TAATGG core motif, while Lys50 in the bicoid homolog of Antennapedia (B) recognizes TAATCC. The recognition site can be changed mutagenically (Tucker-Kellogg et al., 1997). (C) Ultrabithorax (a tandem duplicate of Antennapedia) has the same DNA-binding motif, but it developed an interaction with the Exd homeodomain protein mediated primarily by the terminal YPWM motif of Ubx, which binds to the hydrophobic pocket in Exd (Passner et al., 1999). As a result of the interaction, the complex cooperatively recognizes the TGATTTATGG/ATAAATCA motif. The disordered flexible linker (unresolved in the structure) is shown in an extended conformation. (From Baird-Titus et al., 2006.)


N-terminus of the homeodomain alter the specificity toward the nucleotide in position 2 of the motif TTATGG → TAATGG (Ekker et al., 1994; Noyes et al., 2008). The evolution of homeotic genes in Hox-like clusters demonstrates how gene duplication followed by a single or a few mutations can create new functions that have dramatic effects on the phenotype (in the case of Hox genes, the number of body segments, limbs, etc.). An example of the rearrangement of Hox genes and their regulatory elements is shown in Figure 3.

4.2 Change of Binding Interface and Interaction Partners

Most proteins do not act alone but, instead, interact with other proteins. This is another mode of potential divergence for duplicated genes. Protein–protein interactions are in most cases highly specific and form complex protein interaction networks that execute metabolic functions and make up regulatory, signal transduction, and intercellular circuits. Mutations in the protein–protein binding interfaces have a significant impact on the function of the protein and the network in which it participates. A bacterial example will be used to illustrate this ubiquitous nature of biological systems. Two-component systems are the most common signal transduction systems in bacteria and are responsible for the sensing and adaptation of the bacteria to a variety of environments, nutrients, and stresses. The family of two-component systems is comprised of homologous proteins. The high degree of binding specificity in proteins making up a two-component system allows for virtually no detrimental crosstalk in a bacterial cell, which can contain up to 200 different two-component systems. The


Figure 3 HoxA1 and HoxB1 genes in mammals. (a) Wild type. HoxA1 has a fully functioning 3′ retinoic acid response element (RARE, circle). HoxB1 has both a Hox1 autoregulatory element (ARE, square) and a RARE, with the functionality of the latter severely reduced. (b) Swapping the coding regions of the genes in mice produces a normal phenotype. (c) Combining the fully functional ARE and RARE elements around one of the genes and knocking out the other also produces a normal phenotype. This is believed to be an ancestral form that preceded WGD and subsequently subfunctionalized so that HoxA1 lost the ARE and HoxB1 retained it but deteriorated its RARE. (From Tvrdik and Capecchi, 2006.)

two components are typically a membrane-localized sensor with kinase activity and a transcription factor that is phosphorylated. It has been shown that by substituting only three residues in the kinase, the specificity of the kinase can be switched completely to another two-component signal transduction pathway, thus drastically changing the signal transduction logic (Skerker et al., 2008) (Figure 4). Another interesting feature of the archetypical two-component system is that, similar to Hox clusters, genes coding for its components are clustered on the bacterial chromosome, comprising an operon. That allows for duplication and subsequent divergence of the entire system cooperatively.

4.3 Change of Regulatory Elements That Control Gene Expression

Changes in the way that genes are regulated affect the timing, level, and tissue specificity of gene expression. It has been shown in the case of the HoxA1 and HoxB1 genes that swapping their coding regions has no detrimental effect in mouse development (Tvrdik and Capecchi, 2006) (Figure 3). Moreover, combining elements of the regulatory region of both duplicates into one delivers a fully functional gene that carries out the function of both genes, resulting in normal mice. This work demonstrates an apparent case of historical subfunctionalization of regulatory regions. As discussed later, this appears to be common in the evolutionary trajectories of duplicated genes.

4.4 Instantaneous Change of Regulation of Duplicate Copy

The duplication event itself can radically alter the way in which gene expression is regulated. For example, in considering the fate of X-linked gene Utp14, Bradley et al. (2004) found a retrocopy (Utp14b) integrated in the intron of autosomal gene Acsl3 on mouse chromosome 1 (Figure 5). The presence of the retrocopy is essential for proper spermatogenesis in mice. The retrocopy is regulated by the promoter of the host gene, and unlike Utp14, is not affected by X inactivation during spermatogenesis.
Thus, as a result of the retrotransposition event, the second copy switched its promoter, moved


Figure 4 Change of binding interaction in the two-component system. Shown as red blobs are three amino acid substitutions described by Skerker et al. (2008) that completely switch from EnvZ histidine-kinase to OmpR HK signal transduction. The HK homodimer is on the bottom (green and blue domains), and the response regulator domain (brown, top) was computationally docked to the 2C2A HK structure by Marina et al. (2005). The phosphotransfer histidine is shown in magenta. (See insert for color representation of the figure.)

Figure 5 Retrotransposition of Utp14 from the mouse X-chromosome onto the untranslated exon of gene Acsl3 located on chromosome 1 is depicted.

to a different chromosome with different regulation of chromatin packing, and lost all introns. This can be considered a case of regulatory neofunctionalization.

4.5 How General Do We Expect the Examples Above To Be?

We have seen a collection of examples where different molecular mechanisms interplay with different evolutionary processes. Starting from the initial duplication event,


through mutational opportunities to affect different molecular processes to different mechanisms on a population and evolutionary scale, we systematically evaluate the potential for duplicate gene retention. Initial gene duplication events occur in a single individual. The rate of fixation of duplicates within a population depends on the effective population size and the degree of selection. Some treat the initial events as neutral (Force et al., 1999; Lynch and Force, 2000) with some case-specific positive selection (see below; Perry et al., 2007), whereas others view duplication events as deleterious (Wagner, 2005). Given that trisomy in humans is lethal except for chromosome 21 and the sex chromosomes (and these cases are associated with reduced fitness), duplication of a subset of genes is clearly deleterious in humans. Different lineages in the tree of life show different propensities to tolerate gene duplication, and mammals of small effective population size differ from plants of small effective population size in this regard. Even in plants, different genes and gene functions are retained differently after duplication events (Hanada et al., 2008), although this analysis does not yet sort out the role for selection in the initial duplication event, and further work is needed. All of the processes above are described as single events in a species. In actuality, these events occur in a single individual and are then subjected to population-level processes simultaneous to the process of divergence. Genes that are born identical or that have not diverged in a mechanism that affects fitness will be born in proportion to effective population size and will be fixed in inverse proportion to effective population size. Once a fitness advantage is gained (where the probability of advantageous mutation is proportional to effective population size), the probability of fixation is inversely proportional to less than the effective population size. 
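The scaling argument above can be made concrete with the standard diffusion approximation for fixation probability (Kimura's formula). That formula is not given in this chapter and is used here only as an illustrative sketch. The function names, the diploid assumption (2N gene copies), the per-gene duplication rate `mu_dup`, and the simplification that census and effective population size are equal are all assumptions introduced for the example.

```python
import math

def fixation_probability(s, n_e):
    """Kimura's diffusion approximation for a new mutation (initial
    frequency 1/(2N)) with selection coefficient s in a diploid population.
    Assumption: census size equals the effective size n_e."""
    if abs(s) < 1e-12:
        return 1.0 / (2.0 * n_e)  # neutral limit
    return (1.0 - math.exp(-2.0 * s)) / (1.0 - math.exp(-4.0 * n_e * s))

def neutral_fixation_rate(mu_dup, n_e):
    """New duplicates arise at rate 2*N*mu_dup per generation and each fixes
    with probability 1/(2N), so population size cancels out."""
    return (2.0 * n_e * mu_dup) * fixation_probability(0.0, n_e)

if __name__ == "__main__":
    for n_e in (1_000, 100_000):
        neutral = fixation_probability(0.0, n_e)
        selected = fixation_probability(0.001, n_e)  # hypothetical s
        print(n_e, neutral, selected, selected / neutral)
```

Under these assumptions, the fixation probability of a beneficial duplicate relative to a neutral one grows with population size, which is in line with the expectation that selective processes play a larger role in lineages of large effective population size.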
The degree to which the probability of fixation based on effective population size is modulated is dependent on the strength of selection. It is therefore expected that selective processes are more common in organisms of large effective population size. Selection can also be driven by mutation rate, with higher mutation rates providing a greater sampling of changes to access those of adaptive effect.

Whole-genome duplications (WGDs) have an added complexity in sexual organisms. Perhaps the reason for the rarity of whole-genome duplication events is that successful reproduction is dependent on two individuals with whole-genome duplications finding each other and mating, coupled to the interplay of population genetics involving the relative fitness of offspring with a whole-genome duplication compared to individuals without a whole-genome duplication. This scenario is dependent on the cessation of gene flow between the two subpopulations.

Moving on to the initial duplication event at the molecular level, a wide variety of processes can lead to gene duplication. At the grossest level, whole-genome duplication results in duplication of every gene in the genome. Under this process, each gene is identical upon arrival, in terms of both coding sequence and regulation. Further, every interacting partner is duplicated together with the gene itself, resulting in a doubling of the interactome. The next level down involves other large-scale (e.g., whole-chromosome) duplication events, where the gene is identical in coding sequence and regulation but does not necessarily have any or all of its interacting partners duplicated. This distinction is important for some of the underlying mechanisms for duplicate gene retention, as we will see. Other mechanisms involve duplication of a single gene at a time without interacting partners, but otherwise also involve differences. Tandem duplication is mediated


by recombination, break-and-repair processes, or polymerase error. Tandemly duplicated genes are probably identical in coding sequence and regulatory elements, but have a chance of missing a terminal domain and distal regulatory elements. Genes duplicated by DNA-level transposition will probably be identical in coding sequence, again with the chance of missing a terminal domain, but will probably be born in a new gene expression environment. There is a chance of retention of proximal expression elements. Genes duplicated by retrotransposition will be born identical in coding sequence except for the lack of introns (eliminating the possibility of splicing-level divergence). These genes will be born in a new gene expression environment. If the new environment does not result in expression of the gene, the duplicate that was created is dead upon arrival.
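As a compact summary of the duplication mechanisms just described, the state in which a duplicate is born can be tabulated in a small data structure. This is only an organizational sketch: the class and field names are hypothetical, and the entries paraphrase the text, with "probably identical" recorded simply as True and the stated caveats noted in comments.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BirthState:
    """State of a duplicate at birth, paraphrased from the text."""
    coding_identical: bool          # coding sequence (probably) identical
    regulation_retained: bool       # born in the parental expression context
    partners_duplicated: str        # "all", "some", or "none"
    may_lack_terminal_domain: bool
    introns_retained: bool

MECHANISMS = {
    "whole-genome duplication": BirthState(True, True, "all", False, True),
    # "some" stands for "not necessarily any or all" partners
    "large-scale (e.g., whole-chromosome)": BirthState(True, True, "some", False, True),
    # may also miss distal regulatory elements
    "tandem duplication": BirthState(True, True, "none", True, True),
    # proximal expression elements may be retained by chance
    "DNA-level transposition": BirthState(True, False, "none", True, True),
    # dead on arrival if the new environment gives no expression
    "retrotransposition": BirthState(True, False, "none", False, False),
}

def mechanisms_where(predicate):
    """Names of mechanisms whose birth state satisfies the predicate."""
    return [name for name, state in MECHANISMS.items() if predicate(state)]

if __name__ == "__main__":
    # Mechanisms that place the copy in a new expression environment
    print(mechanisms_where(lambda s: not s.regulation_retained))
    # Mechanisms that rule out splicing-level divergence
    print(mechanisms_where(lambda s: not s.introns_retained))
```

Keeping the birth state explicit makes it easier to see which retention mechanisms are available to which class of duplicate, for example that only whole-genome duplication guarantees that all interacting partners are duplicated together.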

5 MUTATIONAL OPPORTUNITIES AFTER DUPLICATE GENE BIRTH

Whereas the birth process itself may introduce changes to the gene that result in functional modification, subsequent to birth, random changes occur independently in each copy that lead to divergence. The opportunity to effect functional change through either gain or loss of function without creating a nonfunctional gene in either duplicate is expected to be proportional to the number of sites where such changes can possibly happen. The easiest events to envision are loss of a transcription factor binding site and loss of a binding site from the protein. The average gene has 12 transcription factor–binding sites of typical length six to eight base pairs, where one or two mutations in a site will alter or knock out transcription factor binding. The average protein interacts with one to three other proteins under a power-law distribution (Luscombe et al., 2002). The size and nature of an interaction interface ranges from two to five amino acids for modifying enzymes (Puntervoll et al., 2003), with larger sites for transient and obligatory interactions. For transient interactions, the average recognition site is widely variable in size and typically has shown a significant energetic contribution, from 12 to 15 amino acids (Chakrabarti and Janin, 2002), but fewer in other studies (Bogan and Thorn, 1998). Each amino acid site corresponds to potential changes in roughly 2.5 nucleotide positions (from the genetic code). Larger interaction interfaces, including among obligate interactions, tend to be driven by hydrophobic interactions, while smaller interfaces, including transient interaction interfaces, are more driven by electrostatic interactions (Bradford et al., 2006). The role of the remaining residues not contributing to the binding affinity is thought largely to be to exclude solvent (Bogan and Thorn, 1998). These residues are less constrained evolutionarily and do not affect specificity (Caffrey et al., 2004; Guharoy and Chakrabarti, 2005). 
Even among the binding interface residues that contribute to the binding affinity, the degree of amino acid sensitivity between similar amino acids is unclear. It has been suggested that a small subset of electrostatic residues may drive specificity in a sea of hydrophobic interactions driving affinity (Pechmann et al., 2009). Changes in untranslated regions can also affect mRNA stability, but have not been factored into the view described above. This quick back-of-the-envelope calculation with a few unknowns shows that it should be easier to change a gene expression profile than a binding profile, but not overwhelmingly so (roughly, one- to tenfold more likely). In fact, evidence suggests that subfunctionalization of gene expression is typically the first thing to happen, but followed subsequently by change (potentially neofunctionalization as well as subfunctionalization) in protein


function (He and Zhang, 2005). However, the back-of-the-envelope calculation above, based on the mean, will be sensitive to the underlying distributions, with many fold- and gene-specific effects (see below). Additionally, the affinity of the transcription factor to a regulatory region can be determined by the enrichment of different motifs rather than by singular sites (Badis et al., 2009). While the foregoing estimations deal with loss or modification of existing binding sites, surface regions can evolve new binding interactions that were not present in the ancestor. This has been suggested for leptin in primates in the absence of duplication (Gaucher et al., 2003), but represents a mechanism that should be even more accessible to duplicate genes. Although it is generally thought that binding interactions will evolve more easily than catalytic activities, many binding interfaces include residues that are buried in pockets or exposed only upon conformational shifts in binding.
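The back-of-the-envelope comparison discussed in this section can be written out explicitly. The sketch below simply multiplies the figures quoted in the text: 12 transcription factor–binding sites of six to eight base pairs per gene, one to three interaction partners, interfaces of 2 to 15 energetically important residues, and roughly 2.5 mutable nucleotide positions per residue. How these numbers should be combined is an assumption made here for illustration, so the resulting range need not reproduce the one- to tenfold estimate given in the text. Function and parameter names are hypothetical.

```python
def regulatory_target(n_sites=12, site_length_bp=7):
    """Nucleotide positions at which a mutation could alter or abolish
    transcription factor binding (assumes every site position counts)."""
    return n_sites * site_length_bp

def interface_target(n_partners, residues_per_interface, nt_per_residue=2.5):
    """Nucleotide positions at which a mutation could alter a
    protein-protein binding interface."""
    return n_partners * residues_per_interface * nt_per_residue

def target_ratio(n_partners, residues_per_interface, site_length_bp=7):
    """Regulatory target size relative to binding-interface target size."""
    regulatory = regulatory_target(site_length_bp=site_length_bp)
    return regulatory / interface_target(n_partners, residues_per_interface)

if __name__ == "__main__":
    # Scan the ranges quoted in the text.
    for partners in (1, 2, 3):
        for residues in (2, 5, 12, 15):
            ratio = target_ratio(partners, residues)
            print(f"partners={partners} residues={residues} ratio={ratio:.1f}")
```

The ratio is most sensitive to the assumed interface size and number of partners, which echoes the caveat that an estimate based on means depends heavily on the underlying distributions.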

6 EVOLUTIONARY MECHANISMS

We have mentioned several possible evolutionary mechanisms acting upon available mutations at different levels. Next, we examine these mechanisms systematically; they are summarized in Figure 6.

6.1 Pseudogenization

Pseudogenization, the most common fate for duplicate genes, arises from the random neutral accumulation of mutations, most of which are deleterious. Eventually, the gene no longer functions. For the products of small-scale duplication (SSD) events, a fraction of genes will be born without the expression elements necessary to have a function that confers fitness. The same applies to genes born missing terminal domains.

6.2 Subfunctionalization

Subfunctionalization is a mechanism that involves a combination of neutral mechanisms and negative selection to relax the redundancy of duplicate copies via complementary loss of functional attributes between the duplicates. The functions of a protein, whether expression domains, binding interactions, alternative splice forms, or other features, are viewed modularly, with evolutionary dynamics characterized by mutational opportunities for loss of different modules. Genes that have more regulatory regions, including those that regulate development, will be more prone to subfunctionalization. Because this mechanism does not involve positive selection, it has been viewed as more important in smaller effective population-size lineages. Some products of tandem duplication or DNA-level transposition will be born subfunctionalized.

6.3 Neofunctionalization

Neofunctionalization involves the development of new functions. This can include the development of de novo transcription factor–binding sites, the modification of existing sites to change the specificity, affinity, or kinetics, the modification or gain of binding interactions, the modification or gain of splice regulatory elements, and a number of other events. The frequency of neofunctionalization depends on the frequency of


Figure 6 Schematic depicting the processes of neofunctionalization, subfunctionalization, and dosage compensation. Neofunctionalization can occur either pleiotropically or nonpleiotropically, depending on whether the new function occurs in a region that also carries out the original function. Subfunctionalization can occur alone or together with neofunctionalization. In the bottom panel, the decay of binding interactions driven by changes in stoichiometry is shown.

neofunctionalizing events. Because of the complexity of interacting mutations not only within but between genes, neofunctionalization rates may show a time lag and are more complex than the simple rate of instantaneously beneficial mutations within a population. Further, a new function at the molecular level does not necessarily imply a selectable advantage and positive selection. Some new molecular functions will be evolutionarily neutral.

Timing of Neofunctionalization

The classical model (which still may be the most common) suggests that when neofunctionalization occurs, the relaxation of selective constraints associated with gene duplication paves the way. However, there is some evidence for increased substitution preceding duplication events leading to duplicate gene retention (Johnston et al., 2007). Several mechanisms associated with this have been characterized. One mechanism is fixation of selectively balanced alleles (Sato et al., 2001), where alleles that benefit the heterozygous individual are fixed individually at different loci. Another mechanism involves enzymes that catalyze side reactions, where duplication allows subfunctionalization of the main reaction and side reaction and optimization of the side reaction without pleiotropic constraint.

Selection for Increased Dosage as a Form of Neofunctionalization

In addition to changes to transcriptional (and translational) regulatory regions, gene duplication can be


a mechanism to increase the dosage of a gene, where increased dosage is beneficial. An example of this that has been suggested in the human population is salivary amylase I, which apparently varies in copy number in correlation with starch consumption (Perry et al., 2007).

Dosage Compensation

A gene duplication that instantaneously leads to a doubling of expression is potentially deleterious for several reasons (Wagner, 2005; Drummond and Wilke, 2008). Beyond any deleterious effects due to the cost of expression or mistranslation (or gain of low-affinity interactions at higher protein concentrations), it is thought that stoichiometric imbalance is deleterious (Aury et al., 2006). Thus, when two or more interacting partners are duplicated, there is expected to be a selective pressure to retain such duplicates together in a genome for long evolutionary periods. Loss of one of the copies or down-regulated expression of one copy is then expected to lead to positive selection for the loss of interacting partner duplicates (or down-regulated expression) [see Hughes et al. (2007) for a discussion]. Subfunctionalizing mutations are expected to be deleterious and also lead to loss of interacting partners to restore stoichiometric balance in interactions. Additionally, subfunctionalizing interactions have the potential to cause dominant negative effects in genes retained through dosage compensation. Thus, as we discuss subsequently, dosage compensation is expected to yield very different evolutionary signals from those generated by neofunctionalization and subfunctionalization.

Selection for Genetic Redundancy

Another mechanism that has been proposed for the retention of duplicate genes is that of serving as a backup copy and, as interactions diverge, playing a role in providing genetic redundancy to generate a more robust system. Under this mechanism, duplicated genes play a buffering role as backup copies for future mutation.
The expectation of this mechanism is strong negative selection on coding sequence and function, and it does not explain the burst of substitutions that are typically observed after a duplication event. Further, it has been argued that although the most robust systems are those in chordates, the small effective population sizes and low mutation rates in chordates would not provide strong enough selection for such a weak secondary selection type of mechanism (Forster et al., 2006; Elena et al., 2007).

6.4 Interplay Between Mechanisms

Clearly, these mechanisms are not all mutually exclusive, although some clearly are. For example, subfunctionalization of binding interactions or transcriptional domains would not be compatible with dosage compensation as a mechanism. However, if one views neofunctionalizing changes as rare, any mechanism that increases, even temporarily, the retention time of a duplicate gene has the potential to serve as a transition state for neofunctionalization. This has been established most clearly for the interplay between subfunctionalization and neofunctionalization.

7 EXPECTATIONS FOR RETENTION PROFILE AND FOR SUBSTITUTION PROFILE

The different mechanisms present different profiles expected for time-dependent retention probabilities and time-dependent substitution (dN) and selective pressure (dN/dS)


probabilities. Although relaxation of selective constraint and positive selection can be difficult to differentiate, the expectation from both the neofunctionalization and subfunctionalization models is a burst of substitution after duplication and a declining death rate with time (Figure 7). The substitution process will probably include greater levels of substitution when the events occur in the coding sequence than when they occur transcriptionally. The retention process is typically characterized by a Weibull distribution for neofunctionalization and an exponential distribution for periods between 0.02 and 0.15 dS units, followed by a concavely declining hazard function after this point, with an initial waiting time for complementary loss events that appears like neutral loss (Hughes and Liberles, 2007). In contrast, mechanisms that involve retention of the coding sequences will impose immediate negative selection and will not show a burst of substitution. The dosage compensation mechanism will show immediately high retention rates followed by cooperative loss driven by positive selection once one interacting partner is lost. The loss dynamics of the genetic robustness model are less clear but will probably show retention of duplicated genes where loss is more deleterious at higher rates. It is clear that the dynamics are slightly different between WGD events and SSD events (Maere et al., 2005; Blomme et al., 2006; Hughes and Liberles, 2008). It is unclear at this stage if this is due to the underlying mutational process or to other features of WGDs. In both cases, there does appear to have been a burst of substitution immediately following duplication, consistent with the subfunctionalization and neofunctionalization mechanisms. Following SSD, the retention pattern is clearly Weibull-like in mammalian genomes. Model-based gene family analysis will enable a more detailed description of underlying processes in different families (see Chapter 10). 
One pattern that has emerged is that subfunctionalization and/or dosage compensation might be relatively more important in chordates for whole-genome duplication events (Blomme et al., 2006; Hughes et al., 2007), whereas neofunctionalization is relatively more important in chordates for smaller-scale stochastic events (Hughes and Liberles, 2008). Further, initial subfunctionalization events result in genes that eventually neofunctionalize (He and Zhang, 2005). To complement comparative genomic data analysis, lattice and framework modeling systems have been developed to understand both the time-dependent retention profiles observed under different evolutionary mechanisms and the time-dependent dN/dS ratios observed in different protein regions during different evolutionary mechanisms. These frameworks will enable creation of better models, consistent with different evolutionary scenarios, which can then be tested in gene families and genome-specific data.
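The contrast between exponential and Weibull retention described in this section comes down to the shape of the hazard (instantaneous loss rate) function. The sketch below uses the textbook two-parameter Weibull form. The parameterization, the function names, and the numerical values are assumptions chosen for illustration, not the fitted models of Hughes and Liberles (2007).

```python
import math

def weibull_survival(t, scale, shape):
    """Fraction of duplicates still retained at age t (here t is in dS units).
    shape == 1 reduces to the exponential (constant-hazard) model."""
    return math.exp(-((t / scale) ** shape))

def weibull_hazard(t, scale, shape):
    """Instantaneous loss rate at age t > 0 for the same model."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)

if __name__ == "__main__":
    scale = 0.05  # hypothetical characteristic age, in dS units
    for t in (0.02, 0.05, 0.15, 0.30):
        h_exp = weibull_hazard(t, scale, shape=1.0)   # constant with age
        h_wei = weibull_hazard(t, scale, shape=0.5)   # declines with age
        s_wei = weibull_survival(t, scale, shape=0.5)
        print(f"dS={t:.2f} exp_hazard={h_exp:.1f} "
              f"weibull_hazard={h_wei:.1f} retained={s_wei:.3f}")
```

A shape parameter below 1 gives the declining death rate with age expected under the neofunctionalization and subfunctionalization models, whereas a shape of 1 corresponds to age-independent loss of the kind expected for neutral decay.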

8 ROLE OF PROTEIN FUNCTION AND PROTEIN FOLD

It has previously been reported that some protein functions, especially those that function extracellularly, evolve particularly rapidly after a gene duplication event, whereas other functions, such as those with various immune functions, evolve particularly rapidly after a speciation event (Seoighe et al., 2003). In addition to these protein function-specific differences, it has been observed that different protein folds present different dynamics and relative propensities to subfunctionalize and neofunctionalize. It is expected that a protein with a larger surface area (and necessarily a smaller surface

Figure 7 (A) Time (dS)-dependent decay of retained duplicate genes as fit by an exponential and a Weibull distribution in four mammalian genomes (panels: Canis familiaris, Homo sapiens, Mus musculus, Rattus norvegicus; x-axis: S, substitutions per silent site, 0 to 0.3; y-axis: number of duplicate pairs). The gray bars show the growth of gene pairs under negative selection with increasing time. (B) In the human genome, the time (dS)-dependent decay of dN/dS for duplicate genes from a relaxed level of selection to an orthologous substitution rate is modeled (x-axis: S, substitutions per silent site, 0 to 3; y-axis: dR/dS, 0 to 1). (From Hughes and Liberles, 2007.)


area/volume ratio) will have a greater opportunity to evolve new binding functions on its surface. Because chordate proteins tend to be larger than bacterial proteins, this is one possible explanation for the unexpectedly high rates of neofunctionalization in chordates of small effective population size. A possible explanation from the other perspective is that if neofunctionalization is an important process for duplicate gene retention, the folds that are more likely to neofunctionalize rather than nonfunctionalize after gene duplication will be enriched more in species of small effective population size than in species of large effective population size, where a less evolvable fold will also readily neofunctionalize. A prediction of this hypothesis is that chordate genomes will be enriched with the most evolvable folds from a natural distribution, even after correcting for differences in protein surface area. Ultimately, protein function and protein fold are intertwined, as the protein fold delimits or determines the accessible functions for a protein. From this, disentangling what selection is acting on becomes difficult, and both are clearly important.
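The geometric point made in this section, that a larger protein has more surface but a smaller surface area/volume ratio, can be checked with an idealized sphere. Treating a protein as a sphere is a deliberate simplification assumed here for illustration only; the radii and function names are hypothetical.

```python
import math

def sphere_surface_area(radius):
    """Surface area of a sphere of the given radius."""
    return 4.0 * math.pi * radius ** 2

def sphere_volume(radius):
    """Volume of a sphere of the given radius."""
    return (4.0 / 3.0) * math.pi * radius ** 3

def surface_to_volume(radius):
    """Equals 3 / radius for a sphere, so it falls as size increases."""
    return sphere_surface_area(radius) / sphere_volume(radius)

if __name__ == "__main__":
    for radius in (15.0, 30.0):  # hypothetical radii, in angstroms
        print(radius, sphere_surface_area(radius), surface_to_volume(radius))
```

In this idealization, doubling the radius quadruples the surface available for new binding interactions while halving the surface area/volume ratio.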

9 SPECIES-SPECIFIC DETAILS

Evolution also shows lineage-specific characteristics. Some of this will be due to changes in the underlying process. For example, a lineage-specific change in effective population size or a loss of a DNA repair enzyme will, respectively, alter the relative importance of selection vs. drift on a lineage and increase the mutation rate (Ota and Penny, 2003). These factors will affect the relative likelihood of different fates for genes duplicated on that lineage. Additionally, lineage-specific selection driven by changes at other loci, as well as differences in the environment leading to differential selective pressures on the organism and resulting adaptation, will create gene family-specific effects on specific lineages. This is seen in massive lineage-specific expansions or contractions of particular gene families. The olfactory receptors are now a classic example of this (Glusman et al., 2001).

10 CONCLUSIONS AND LARGER-SCALE EFFECTS

Ultimately, the interplay between population genetic dynamics and biochemistry dictates the fate of duplicate genes. This interplay occurs over many levels of biological organization, and we have tried to integrate this into a larger understanding of the patterns of duplication that we observe in genomes today as well as in their reconstructed history. Beyond genomes, duplication appears to affect speciation rates and the derivative clade-specific biodiversity. Based on fossil records showing the often rather sudden appearance of morphological variation, Mayr (1963) proposed the founder effect model of speciation, which later contributed to Eldredge and Gould’s (1972) development of the punctuated equilibrium theory. Again, these ideas were postulated before the availability of sequence or whole-genome data, and therefore relied largely on two concepts: (1) the idea that phenotypic variation is essentially an expression of underlying reproductive isolation (biological species concept) or of an independent evolutionary trajectory of a unit of organisms (other species concepts), and (2) the observation that in an evolutionary trajectory, many novel phenotypes seem to occur relatively fast and without intermediate forms (e.g., the Cambrian explosion). This could not be readily explained by the gradual evolutionary processes proposed by Darwin and others, for this would mean a slow shifting of populations from one equilibrium point to another, eventually launching subpopulations onto their own evolutionary path through a series of “intermediate” stages. Discovery of the imprint of whole-genome duplications in the sequences of a variety of organisms seemed to provide plausible molecular mechanisms for a burst of innovation (i.e., new phenotypes), and a number of authors have speculated about a link between gen(om)e duplication and radiation/speciation [see Roth et al. (2007) for a review]. The reasons for such speculation revolve primarily around the ideas of increased cladogenesis and gen(om)e diversification rates after duplication events. Yet only a few studies have actually correlated rates of speciation with rates of gen(om)e evolution after duplication. The crux of calculations that include speciation events lies in the very complicated nature of defining a species. Presently, the majority of recognized species is still based on a very narrow definition of “phenotype,” which is clearly influenced by our own perceptional biases. While the addition of subsets of molecular sequence data may increase our resolution to distinguish similarities and differences, it may also obscure relationships, due to the opportunistic and disjunct nature of our sampling (on organismal and molecular levels). Thus, it is very likely that we underestimate the number and evolutionary time frames of organismal units on their own evolutionary trajectory. By extension this will obscure the role of gen(om)e duplication as a process of speciation.
For example, recent studies on polyploid plants found that there is definitely a substantial contribution of polyploidy to cladogenesis, yet the authors acknowledge that phylogenetic uncertainties may render this result too conservative (Wood et al., 2009). Therefore, we may be better served to evaluate the role of gen(om)e duplication on the “first responder” level of populations, taking the dynamics of genealogical history into account. These dynamics are naturally influenced by repetitive population size variation and fragmentation at different spatiotemporal scales, and although these parameters cannot be observed directly on an evolutionary scale, they can be included in models of evolution. From the standpoint of population dynamics, and depending on reproductive strategies, each duplication event may affect one individual or a local set of siblings of a population. Assuming that the bearer(s) of duplication are reproductively fit (i.e., the duplication event is selectively neutral, advantageous, or only slightly deleterious), either one or both potential parents from that subpopulation may carry the duplicated set of genes. However, duplication may introduce immediate reproductive barriers in a subpopulation, thus decreasing the likelihood of an F1 from mixed parental ploidy or gene number (Kelleher et al., 2007). Furthermore, it would mean that inbreeding of siblings with duplicated genes is likely when population connectivity is high (low dispersal). In this particular case, each subsequent perpetuation of the duplicated genetic line may then best be understood and modeled as a founder effect scenario. However, far too few studies exist regarding the actual potential of reproductive isolation of individuals after duplications. It stands to reason that there are varying degrees of “severity” in duplications of genetic material, ranging from internal gene duplications to whole-genome duplications.
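The founder-effect view of a newly duplicated lineage can be illustrated with a minimal Wright–Fisher sketch. It assumes a selectively neutral duplication that arises on a single chromosome of a randomly mating diploid population of constant size; these assumptions, the function names, and the parameter values are illustrative only and are not taken from the studies cited. Under these assumptions a new duplicate is usually lost by drift and reaches fixation with a probability close to 1/(2N), so it persists far more often in a small founder group than in a large population.

```python
import random


def duplicate_fixes(pop_size, rng):
    """Follow one new neutral duplicate until it is lost or fixed.

    pop_size is the number of diploid individuals (hypothetical parameter),
    so there are 2 * pop_size chromosomes and the duplicate starts on one.
    """
    n_chrom = 2 * pop_size
    copies = 1
    while 0 < copies < n_chrom:
        freq = copies / n_chrom
        # Binomial sampling of the next generation: pure drift, no selection.
        copies = sum(rng.random() < freq for _ in range(n_chrom))
    return copies == n_chrom


def fixation_probability(pop_size, trials=2000, seed=1):
    """Estimate the fraction of new neutral duplicates that reach fixation."""
    rng = random.Random(seed)
    fixed = sum(duplicate_fixes(pop_size, rng) for _ in range(trials))
    return fixed / trials


if __name__ == "__main__":
    for n in (5, 25):
        estimate = fixation_probability(n)
        print(f"N={n}: simulated {estimate:.3f}, neutral expectation {1 / (2 * n):.3f}")
```

The sketch ignores selection, reproductive barriers, and population structure, all of which the text identifies as relevant; it is meant only to show how strongly the outcome depends on population size.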
The cessation of gene flow between segments of a population does not eliminate their competition in a population-like paradigm. If the initial duplicates are identical (as in the extreme case of whole-genome duplication), the phenotype of the organism may initially be similar. In this scenario, even if the individual(s) with duplication events are reproductively isolated, they run the risk of elimination from the “population.” Organismal groups seem to have a vastly different tolerance to polyploidization and duplication events. For instance, while many vertebrate groups are certainly paleopolyploid, they have largely returned to diploidy, and higher vertebrates (e.g., humans) seem to be particularly negatively affected by duplications. Plants, on the other hand, appear to perpetuate polyploidy on a much more frequent basis. It is currently not well understood if the process of diploidization is due to changes in master chromosome-pairing genes or through a more general loss of pairing ability between homeologs due to loss of genes (Semon and Wolfe, 2007). Although these are certainly molecular mechanisms to lose polyploidy, the time frame of the loss should again be a function of population size, and cannot be divorced from the organism’s life history parameters. For example, in slowly reproducing organisms with few offspring, the likelihood of losing selectively neutral duplicates is higher than in an organism with many offspring (which may all carry the duplication). Although still in its infancy (especially for nonmodel organisms), the simultaneous study of genomewide variation within and between species (population genomics) may reveal new mechanisms influencing the fate of genomic duplications. Particularly interesting is the additional insight into the variability of noncoding sequence across individuals, reducing the bias associated with population genetic analyses based on targeted protein-coding genes (Begun et al., 2007).
Additionally, and in line with the issues mentioned previously, individuals of populations are the first responders to duplications, and such comprehensive data may allow for a better understanding of the fitness effects associated with different levels of duplications.

Acknowledgments

We wish to thank Johan Grahnen for providing Figure 6. This work was supported by an NIH-INBRE award to University of Wyoming.

REFERENCES

Aury J, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Ségurens B, Daubin V, Anthouard V, Aiach N, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.
Badis G, Berger MF, Philippakis AA, Talukder S, Gehrke AR, Jaeger SA, Chan ET, Metzler G, Vedenko A, Chen X, et al. 2009. Diversity and complexity in DNA recognition by transcription factors. Science 324:1720–1723.
Baird-Titus JM, Clark-Baldwin K, Dave V, Caperelli CA, Ma J, Rance M. 2006. The solution structure of the native K50 bicoid homeodomain bound to the consensus TAATCC DNA-binding site. J Mol Biol 356:1137–1151.
Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh YP, Hahn MW, Nista PM, Jones CD, Kern AD, Dewey CN, et al. 2007. Population genomics: whole genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol 5:e310.
Blomme T, Vandepoele K, De Bodt S, Simillion C, Maere S, Van de Peer Y. 2006. The gain and loss of genes during 600 million years of vertebrate evolution. Genome Biol 7:R43.
Bogan AA, Thorn KS. 1998. Anatomy of hot spots in protein interfaces. J Mol Biol 280:1–9.
Bradford JR, Needham CJ, Bulpitt AJ, Westhead DR. 2006. Insights into protein-protein interfaces using a Bayesian network prediction method. J Mol Biol 362:365–386.
Bradley J, Baltus A, Skaletsky H, Royce-Tolland M, Dewar K, Page DC. 2004. An X-to-autosome retrogene is required for spermatogenesis in mice. Nat Genet 36:872–876.
Brown JR, Doolittle WF. 1995. Root of the universal tree of life based on ancient aminoacyl-tRNA synthetase gene duplications. Proc Natl Acad Sci USA 92:2441–2445.
Caffrey DR, Somaroo S, Hughes JD, Mintseris J, Huang ES. 2004. Are protein–protein interfaces more conserved in sequence than the rest of the protein surface? Protein Sci 13:190–202.
Chakrabarti P, Janin J. 2002. Dissecting protein–protein recognition sites. Proteins 47:334–343.
Churbanov A, Winters-Hilt S, Koonin EV, Rogozin IB. 2008. Accumulation of GC donor splice signals in mammals. Biol Direct 3:30.
Copley SD. 2003. Enzymes with extra talents: moonlighting functions and catalytic promiscuity. Curr Opin Chem Biol 7:265–272.
Drummond DA, Wilke CO. 2008. Mistranslation-induced protein misfolding as a dominant constraint on coding-sequence evolution. Cell 134:341–352.
Ekker SC, Jackson DG, von Kessler DP, Sun BI, Young KE, Beachy PA. 1994. The degree of variation in DNA sequence recognition among four Drosophila homeotic proteins. EMBO J 13:3551–3560.
Eldredge N, Gould SJ. 1972. Punctuated equilibria: an alternative to phyletic gradualism. In Schopf TJM (ed.), Models in Paleobiology. San Francisco: Freeman Cooper, pp. 82–115.
Elena SF, Wilke CO, Ofria C, Lenski RE. 2007. Effects of population size and mutation rate on the evolution of mutational robustness. Evolution 61:666–674.
Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J. 1999. Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151:1531–1545.
Forster R, Adami C, Wilke CO. 2006. Selection for mutational robustness in finite populations. J Theor Biol 243:181–190.
Garcia-Fernández J. 2005. The genesis and evolution of homeobox gene clusters. Nat Rev Genet 6:881–892.
Gaucher EA, Miyamoto MM, Benner SA. 2003. Evolutionary, structural and biochemical evidence for a new interaction site of the leptin obesity protein. Genetics 163:1549–1553.
Glusman G, Yanai I, Rubin I, Lancet D. 2001. The complete human olfactory subgenome. Genome Res 11:685–702.
Guharoy M, Chakrabarti P. 2005. Conservation and relative importance of residues across protein–protein interfaces. Proc Natl Acad Sci USA 102:15447–15452.
Hanada K, Zou C, Lehti-Shiu MD, Shinozaki K, Shiu SH. 2008. Importance of lineage-specific expansion of plant tandem duplicates in the adaptive response to environmental stimuli. Plant Physiol 148:993–1003.
Hanes SD, Brent R. 1989. DNA specificity of the bicoid activator protein is determined by homeodomain recognition helix residue 9. Cell 57:1275–1283.
Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, MacIsaac KD, Danford TW, Hannett NM, Tagne J, Reynolds DB, Yoo J, et al. 2004. Transcriptional regulatory code of a eukaryotic genome. Nature 431:99–104.
He X, Zhang J. 2005. Gene complexity and gene duplicability. Curr Biol 15:1016–1021.
Horowitz NH. 1945. On the evolution of biochemical syntheses. Proc Natl Acad Sci USA 31:153–157.
Hughes T, Liberles DA. 2007. The pattern of evolution of smaller-scale gene duplicates in mammalian genomes is more consistent with neo- than subfunctionalisation. J Mol Evol 65:574–588.
Hughes T, Liberles DA. 2008. Whole-genome duplications in the ancestral vertebrate are detectable in the distribution of gene family sizes of tetrapod species. J Mol Evol 67:343–357.
Hughes T, Ekman D, Ardawatia H, Elofsson A, Liberles DA. 2007. Evaluating dosage compensation as a cause of duplicate gene retention in Paramecium tetraurelia. Genome Biol 8:213.
Jensen RA. 1976. Enzyme recruitment in evolution of new function. Annu Rev Microbiol 30:409–425.
Jin L, Kryukov K, Clemente JC, Komiyama T, Suzuki Y, Imanishi T, Ikeo K, Gojobori T. 2008. The evolutionary relationship between gene duplication and alternative splicing. Gene 427:19–31.
Johnston CR, O'Dushlaine C, Fitzpatrick DA, Edwards RJ, Shields DC. 2007. Evaluation of whether accelerated protein evolution in chordates has occurred before, after, or simultaneously with gene duplication. Mol Biol Evol 24:315–323.
Kelleher ES, Swanson WJ, Markow TA. 2007. Gene duplication and adaptive evolution of digestive proteases in Drosophila arizonae female reproductive tracts. PLoS Genet 3:e148.
Light S, Kraulis P. 2004. Network analysis of metabolic enzyme evolution in Escherichia coli. BMC Bioinf 5:15.
Luscombe NM, Qian J, Zhang Z, Johnson T, Gerstein M. 2002. The dominance of the population by a selected few: power-law behaviour applies to a wide variety of genomic properties. Genome Biol 3:40.
Lusk RW, Eisen MB. 2008. Use of an evolutionary model to provide evidence for a wide heterogeneity of required affinities between transcription factors and their binding sites in yeast. Pacific Symposium on Biocomputing, pp. 489–500.
Lynch M, Force A. 2000. The probability of duplicate gene preservation by subfunctionalization. Genetics 154:459–473.
Maere S, De Bodt S, Raes J, Casneuf T, Van Montagu M, Kuiper M, Van de Peer Y. 2005. Modeling gene and genome duplications in eukaryotes. Proc Natl Acad Sci USA 102:5454–5459.
Marina A, Waldburger CD, Hendrickson WA. 2005. Structure of the entire cytoplasmic portion of a sensor histidine-kinase protein. EMBO J 24:4247–4259.
Mayr E. 1963. Animal Species and Evolution. Cambridge, MA: Harvard University Press.
Miyazaki J, Kobashi N, Nishiyama M, Yamane H. 2001. Functional and evolutionary relationship between arginine biosynthesis and prokaryotic lysine biosynthesis through alpha-aminoadipate. J Bacteriol 183:5067–5073.
Noyes MB, Christensen RG, Wakabayashi A, Stormo GD, Brodsky MH, Wolfe SA. 2008. Analysis of homeodomain specificities allows the family-wide prediction of preferred recognition sites. Cell 133:1277–1289.
Ohno S. 1970. Evolution by Gene Duplication. New York: Springer-Verlag.
Ota R, Penny D. 2003. Estimating changes in mutational mechanisms of evolution. J Mol Evol 57(Suppl 1):S233–S240.
Passner JM, Ryoo HD, Shen L, Mann RS, Aggarwal AK. 1999. Structure of a DNA-bound Ultrabithorax–Extradenticle homeodomain complex. Nature 397:714–719.
Pechmann S, Levy ED, Tartaglia GG, Vendruscolo M. 2009. Physicochemical principles that regulate the competition between functional and dysfunctional association of proteins. Proc Natl Acad Sci USA 106:10159–10164.
Percival-Smith A, Müller M, Affolter M, Gehring WJ. 1990. The interaction with DNA of wild-type and mutant fushi tarazu homeodomains. EMBO J 9:3967–3974.
Perry GH, Dominy NJ, Claw KG, Lee AS, Fiegler H, Redon R, Werner J, Villanea FA, Mountain JL, Misra R, et al. 2007. Diet and the evolution of human amylase gene copy number variation. Nat Genet 39:1256–1260.
Puntervoll P, Linding R, Gemünd C, Chabanis-Davidson S, Mattingsdal M, Cameron S, Martin DMA, Ausiello G, Brannetti B, Costantini A, et al. 2003. ELM server: a new resource for investigating short functional sites in modular eukaryotic proteins. Nucleic Acids Res 31:3625–3630.
Roth C, Rastogi S, Arvestad L, Dittmar K, Light S, Ekman D, Liberles DA. 2007. Evolution after gene duplication: models, mechanisms, systems, and organisms. J Exp Zool B Mol Dev Evol 308:58–73.
Sato A, Mayer MW, Tichy H, Grant PR, Grant BR, Klein J. 2001. Evolution of Mhc class II B genes in Darwin's finches and their closest relatives: birth of a new gene. Immunogenetics 53:792–801.
Semon M, Wolfe KH. 2007. Reciprocal gene loss between Tetraodon and zebrafish after whole genome duplication in their ancestor. Trends Genet 23:16–20.
Seoighe C, Johnston CR, Shields DC. 2003. Significantly different patterns of amino acid replacement after gene duplication as compared to after speciation. Mol Biol Evol 20:484–490.
Shultzaberger RK, Chiang DY, Moses AM, Eisen MB. 2007. Determining physical constraints in transcriptional initiation complexes using DNA sequence analysis. PLoS ONE 2:e1199.
Skerker JM, Perchuk BS, Siryaporn A, Lubin EA, Ashenberg O, Goulian M, Laub MT. 2008. Rewiring the specificity of two-component signal transduction systems. Cell 133:1043–1054.
Treisman J, Gönczy P, Vashishtha M, Harris E, Desplan C. 1989. A single amino acid can determine the DNA binding specificity of homeodomain proteins. Cell 59:553–562.
Tucker-Kellogg L, Rould MA, Chambers KA, Ades SE, Sauer RT, Pabo CO. 1997. Engrailed (Gln50→Lys) homeodomain-DNA complex at 1.9 Å resolution: structural basis for enhanced affinity and altered specificity. Structure 5:1047–1054.
Tvrdik P, Capecchi MR. 2006. Reversal of Hox1 gene subfunctionalization in the mouse. Dev Cell 11:239–250.
Vick JE, Gerlt JA. 2007. Evolutionary potential of (beta/alpha)8-barrels: stepwise evolution of a “new” reaction in the enolase superfamily. Biochemistry 46:14589–14597.
Wagner A. 2005. Energy constraints on the evolution of gene expression. Mol Biol Evol 22:1365–1374.
Wood T, Takebayashi N, Barker MS, Mayrose I, Greenspoon PB, Rieseberg L. 2009. The frequency of polyploid speciation in vascular plants. Proc Natl Acad Sci USA 106:13875–13879.
Xing Y, Lee CJ. 2005. Protein modularity of alternatively spliced exons is associated with tissue-specific regulation of alternative splicing. PLoS Genet 1:e34.

2 Functional Divergence of Duplicated Genes

TAKASHI MAKINO, DAVID G. KNOWLES, and AOIFE McLYSAGHT
Smurfit Institute of Genetics, University of Dublin, Trinity College, Dublin, Ireland

1 INTRODUCTION

Divergence of duplicated genes is often considered in terms of coding sequence divergence alone. However, most interest lies in uncovering divergences in function, which may be more accurately represented by considering secondary features of the genes, such as interaction partners, expression patterns, and alternative splice variants. The evolution of these characteristics of a gene or its products will not always be reflected in significant divergence of coding sequence and so must be studied using novel approaches. For example, it was recently shown that the daughters of some hominoid-specific gene duplications have diverged functionally through changes in their subcellular location targeting, and in at least one case this was due to a single positively selected amino acid change (Rosso et al., 2008a,b). Functional divergences such as these are not readily understood through sequence comparison alone. Here we review the evidence for the functional divergence of duplicated genes in terms of protein–protein interactions, gene expression, and alternative splicing.

2 PROTEIN–PROTEIN INTERACTION DIVERGENCE

Proteins operate through interactions with other biomolecules. In particular, protein–protein interactions (PPIs) are one of the most important components in biomolecular networks. The availability of PPI data has increased rapidly recently, mainly because of high-throughput methods, and the interactomes of several species are just emerging (Ito et al., 2000; Uetz et al., 2000; Rain et al., 2001; Gavin et al., 2002; Ho et al., 2002; Giot et al., 2003; Li et al., 2004; Butland et al., 2005; Formstecher et al., 2005; LaCount et al., 2005; Rual et al., 2005; Stelzl et al., 2005; Gavin et al., 2006; Krogan et al., 2006; Tarassov et al., 2008). Because PPIs are such an important component of biological systems, it makes sense to think of the divergence of gene function in terms of PPI divergence. The most direct approach to understanding PPI divergence is to compare an interactome in a species with other orthologous interactomes (Liang et al., 2006; Dutkowski and Tiuryn, 2007; Huang et al., 2007). However, the high-throughput data on PPIs are known to contain many false-negative and false-positive interactions (von Mering et al., 2002). In addition, it has been reported that overlap among interactomes of different species is low (Gandhi et al., 2006; Mika and Rost, 2006). It is therefore very difficult to reach consistent conclusions on PPI divergence by cross-species comparison (Kiemer and Cesareni, 2007). Comparing the PPIs of duplicated gene pairs within a single species avoids the inconsistency and difficulty of cross-species comparisons. Proteins encoded by paralogous genes often share common PPI partners because they are derived from a common ancestral gene (Wagner, 2001; Deane et al., 2002) and indeed, immediately upon duplication the PPIs must be identical (except when duplication is due to allopolyploidy or other types of hybridization). Duplicated gene pairs sharing PPI partners may compensate for loss of function by the other (DeLuna et al., 2008; Musso et al., 2008). Furthermore, PPIs of duplicated gene pairs are likely to diverge rapidly (Wagner, 2001; Beltrao and Serrano, 2007). The two observations (that is, identical interactions followed by rapid divergence of PPI partners of duplicated gene pairs) allow us to study evolutionary mechanisms of PPIs under the same initial conditions. It has been shown that PPI divergence has occurred asymmetrically after gene duplication (Wagner, 2002; Makino et al., 2006). Pastor-Satorras and colleagues (2003) propose an evolutionary model of protein interaction networks through gene duplication. Furthermore, He and Zhang (2005) propose another new model of subneofunctionalization of PPIs after gene duplication.
Gene pairs derived from whole-genome duplication (WGD) are often used for analyses of PPI divergence, and it has been shown that there was a WGD event in the yeast lineage about 100 million years ago (Mya) (Wolfe and Shields, 1997; Kellis et al., 2004). By analyzing WGD gene pairs and their PPIs in yeast, Makino and colleagues (2006) have shown that gene duplication leads to the functional differentiation of the duplicated gene pairs through the losses and/or gains of PPI partners, resulting in a change in their evolutionary rates. It is also notable that the PPI divergence mechanism of WGD gene pairs seems to differ from that of other duplicated genes. WGD gene pairs tend to share more PPI partners than do small-scale duplicated genes (Guan et al., 2007) and are often in the same protein complex (Musso et al., 2007). It will be useful to further uncover the different evolutionary patterns of PPIs between WGD and other duplicated gene pairs to better understand the evolution of the interactome. Proteins often interact with themselves: for example, to form a homodimer. A recent study has shown that duplication of genes encoding self-interacting proteins tends to result in protein complexes (heterodimers) formed from the duplicates (Pereira-Leal et al., 2007). In addition, there is evidence in support of preferential retention of WGD paralogs that interact with each other (Presser et al., 2008). The results indicate that genes encoding self-interacting proteins are preserved during evolution despite the extensive gene loss that has occurred after WGD. The duplication of genes encoding self-interacting proteins may present particular evolutionary opportunities, such as the propensity to evolve a new heterodimer, which have been important in divergence after gene duplication.
Duplicated genes offer unique features for investigating and understanding PPI divergence and the evolution of the interactome, and conversely, studying PPI divergence has yielded new insights into evolution after gene duplication.
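The comparison of interaction partners within a paralogous pair can be sketched in a few lines. The function below is an illustration only: it assumes each paralog's partners are available as a simple collection of protein identifiers, and the Jaccard index and the asymmetry score are measures chosen here for clarity, not the statistics used in the studies cited. The protein names in the usage example are hypothetical.

```python
def ppi_divergence(partners_a, partners_b):
    """Summarize how the interaction partners of two paralogs overlap."""
    a, b = set(partners_a), set(partners_b)
    shared = a & b
    union = a | b
    only_a, only_b = a - b, b - a
    specific = len(only_a) + len(only_b)
    return {
        "shared": len(shared),
        "only_a": len(only_a),
        "only_b": len(only_b),
        # Jaccard index: 1.0 for identical partner sets (as expected right
        # after duplication), lower as partners are lost or gained.
        "jaccard": len(shared) / len(union) if union else 1.0,
        # 0.0 = copy-specific partners evenly split, 1.0 = all on one copy.
        "asymmetry": abs(len(only_a) - len(only_b)) / specific if specific else 0.0,
    }


# Hypothetical partner lists for two paralogs.
result = ppi_divergence({"P1", "P2", "P3", "P4"}, {"P1", "P5"})
print(result)
```

Because high-throughput PPI data contain false positives and false negatives, any such score computed from real data would be only a rough estimate of the true overlap.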

3 EXPRESSION DIVERGENCE

After gene duplication, paralogs may have similar expression patterns if their regulatory regions were also duplicated. Under this assumption, many studies have investigated expression divergence of duplicated genes (Li et al., 2005). Gu and his colleagues (2002) have shown that expression divergence of duplicated genes in yeast correlates with their synonymous substitution rates (Ks) and thus presumably with time since duplication. The correlation is also found in duplicated genes for humans (Makova and Li, 2003) and Arabidopsis thaliana (Blanc and Wolfe, 2004). Several studies have shown a correlation between expression divergence and nonsynonymous substitution rate (Ka) for duplicated genes in yeast (Gu et al., 2002; Leach et al., 2007), Caenorhabditis elegans (Conant and Wagner, 2003), humans (Makova and Li, 2003), and A. thaliana (Ganko et al., 2007). On the other hand, it has often been observed that the evolutionary rates of coding sequences for duplicated genes do not correlate with their expression divergence (Castillo-Davis et al., 2004; Haberer et al., 2004; Ganko et al., 2007). In addition, a recent study suggests that neither Ks nor Ka correlates with gene expression divergence of yeast orthologs (not duplicated genes) (Tirosh and Barkai, 2008). The expression of a gene is likely to be influenced by the expression of neighboring genes even if the coexpression is not intended (Liao and Zhang, 2008), and a substantial fraction of duplicated genes are tandemly arrayed on the chromosome, due to the mechanism of duplication (Shoja and Zhang, 2006). Of the studies mentioned above, the analysis of expression and sequence divergence of orthologs in various yeast species is completely unaffected by any leaky expression effect of tandemly arrayed paralogs. Therefore the latter conclusion, that there is no correlation between expression divergence and coding sequence divergence, is the more credible.
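The kind of analysis discussed in this paragraph can be sketched as follows. The example assumes that expression divergence of a pair is measured as one minus the Pearson correlation of the two copies' expression profiles across conditions; this definition, the function names, and the toy values are assumptions made for illustration and do not reproduce the method of any one study cited above.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def expression_divergence(profile_a, profile_b):
    """Assumed measure: 1 - r between the paralogs' expression profiles."""
    return 1.0 - pearson(profile_a, profile_b)


# Hypothetical toy data: (Ks, profile of copy A, profile of copy B) per pair.
pairs = [
    (0.1, [1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 2.9, 4.2]),
    (0.5, [1.0, 2.0, 3.0, 4.0], [2.0, 1.5, 3.5, 3.0]),
    (1.2, [1.0, 2.0, 3.0, 4.0], [3.0, 1.0, 4.0, 2.0]),
]
ks_values = [ks for ks, _, _ in pairs]
divergences = [expression_divergence(a, b) for _, a, b in pairs]

# Correlation between sequence divergence (Ks) and expression divergence.
print(pearson(ks_values, divergences))
```

With real data, tandemly arrayed paralogs would need to be treated separately, since the text notes that neighboring genes can influence each other's expression.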
It is thought that gene expression is regulated primarily by transcription factors (TFs) binding to regulatory regions of a target gene. It has been shown that there is a correlation between the number of shared regulatory regions and Ks (Papp et al., 2003; Zhang et al., 2004; Leach et al., 2007). Duplicated genes tend to lose their coexpression partners rapidly in yeast over evolutionary time (Chung et al., 2006). Interestingly, the authors also show that duplicated genes quickly acquire new coexpression partners (Chung et al., 2006). Furthermore, the number of shared regulatory motifs also correlates with expression divergence (Castillo-Davis et al., 2004; Zhang et al., 2004). However, a recent study shows that divergence of TF binding sites does not correlate with that of gene expression among yeast and mammalian orthologs (Tirosh et al., 2008). The authors suggest that some robust mechanisms maintain gene expression patterns even when regulatory regions diverge rapidly (Tirosh et al., 2008). Further investigation is required to resolve this discrepancy. There is evidence that gene duplication itself tends to promote divergence of gene expression, probably because of genetic redundancy. It has been shown that the more times genes have duplicated, the more their expression has diverged in Drosophila melanogaster (Gu et al., 2002) and mammals (Huminiecki and Wolfe, 2004). Gene duplication also contributes to divergence of regulatory networks for gene expression. Teichmann and Babu (2004) have shown that a substantial fraction of TFs and their target genes in yeast regulatory networks consists of duplicated genes. The importance of duplication of genes encoding TFs in the divergence of the yeast regulatory network is also described in a recent study (Ward and Thornton, 2007). In particular, autoregulators, which regulate themselves, are created mainly by gene duplication in Escherichia coli (Cosentino Lagomarsino et al., 2007). As described above, many studies focus on expression divergence of duplicated genes; however, there are not yet any overarching consistent conclusions, and further research involving novel approaches is required to understand expression divergence after gene duplication. For example, Ha and his colleagues have shown that expression divergence of genes involved in response to environmental stress is greater than that of genes involved in developmental processes in A. thaliana (Ha et al., 2007). The degree of expression divergence for duplicated genes is likely to depend on their functional roles. These continued efforts are likely to unveil the details of expression divergence mechanisms for duplicated genes in the near future.

4 DIVERGENCE OF SPLICE VARIANTS OF DUPLICATED GENES

Although the process of duplication and subfunctionalization has been invoked as one of the main explanations for the retention of a large number of duplicates after large genome duplication events, in many cases our knowledge of the functions of the ancestral gene is very limited. In such cases it is not possible to distinguish subfunctionalization from any other kind of functional divergence of duplicated genes. However, the special case of subfunctionalization of alternative splice variants through alternative loss of splice variants in the duplicated copies is more amenable to analysis, as these will present a clear structural difference between the two resulting genes that reflects the division of functions even if these functions are not completely known. Alternative splicing allows the production of different products from a single gene and plays an important role in the great complexity of eukaryotes. More than 40% of vertebrate genes may undergo alternative splicing, and this has been suggested as one of the reasons for the higher complexity of mammals compared to invertebrate organisms, where gene content is not substantially different (Kim et al., 2007). Alternatively spliced genes may be particularly susceptible to subfunctionalization because a single insertion in an alternative exon can alter the reading frame and render that form completely inactive. Because the region in which such a mutation can occur is quite large (the entire exon), the number of possible mutations that would inactivate one of the functions without affecting the other is larger, and alternatively spliced genes may therefore undergo subfunctionalization at a faster rate than genes that do not have multiple splice forms. Several examples of this type of subfunctionalization by differential loss of splice variants in duplicated genes have been described in the literature, and we will review two cases in fish and one in plants.
The microphthalmia-associated transcription factor (Mitf) is a gene involved in the differentiation and survival of melanocytes. Within these cells, it influences the production of the pigment melanin, which contributes to hair, eye, and skin color. It also regulates the development of the retinal pigment epithelial cells, which are located in the eye and nourish the retina. This gene has different isoforms in birds and mammals that are generated through the use of alternative 5′ exons and promoters from a single gene. In teleost fish species two separate genes (Mitf-m and Mitf-b) exist. Each of these genes encodes a protein that corresponds to one of the bird/mammalian isoforms, and the two of them have different expression profiles. Degeneration of the first exon, which is present in the mammalian MITF-m form, is observed in the Mitf-b gene in fish but not in the fish Mitf-m form. The presence of these two genes in fish shows an alternative genomic strategy to the use of different splice forms of the same gene that is observed in mammals and birds (Altschmied et al., 2002). Another example of subfunctionalization by alternative loss of splice forms in duplicated genes can be found in the synapsin family. Synapsins are vesicle-associated phosphoproteins involved in the short-term regulation of neurotransmitter release. The human genome encodes three synapsin genes (Syn1–3); however, in the pufferfish (Takifugu rubripes) there is a second copy of Syn2, TrSyn2B. The human copy of Syn2 generates two alternatively spliced variants, whereas in the pufferfish each of these variants is encoded by one of the two different TrSyn2 genes, and both of these duplicated genes have lost the ability to produce the form encoded by the other duplicate through the accumulation of complementary degenerative mutations (Yu et al., 2003). Finally, an interesting example of this process can be found in plants. In this case the chloroplast ribosomal protein L32 gene (Rpl32) was relocated into the nuclear genome and fused with a superoxide dismutase gene (Sodcp) sometime before the divergence of the lineages leading to mangrove (Bruguiera gymnorrhiza) and poplar (Populus trichocarpa), forming the Sodcp-Rpl32 fused gene. This newly formed gene was alternatively spliced, giving rise to a transcript identical in structure to the original Sodcp and another form containing a fusion of exons 1 to 7 of Sodcp with a novel exon that corresponds to the coding region of Rpl32. This ancestral form can be found in the modern mangrove. However, in the poplar this gene became duplicated twice.
Each of these resulting duplicates encodes a single protein that corresponds to one of the alternative forms of the mangrove Sodcp-Rpl32 , and due to degenerative mutations has lost the ability to produce the other form, effectively separating this fused gene into the two original genes (Cusack and Wolfe, 2007). In these examples we can see how small changes affecting splice variants can cause rapid changes in gene structure. This allows small changes at the genome level to produce a rapid divergence of the proteomes. Currently, there are not many well-characterized examples of this type of subfunctionalization, in part due to the lack of comprehensive information on alternative splicing in most genomes with the exception of the human genome. However, the fact observed by Su et al. (2006) that duplicated genes and genes belonging to large families have fewer alternative forms than genes that are single-copy in the genome, combined with the recent evidence that a large fraction of mammalian genes possess alternative splice variants, may be an indication that this process of subfunctionalization by differential loss of alternative splice forms may be more frequent than previously believed.

5 CONCLUDING REMARKS

The evolution of duplicated genes is interesting and important not only because duplication is the major source of new genes in any genome, but also because duplicate genes give us a unique and powerful opportunity to examine the functional divergence of genes under equivalent conditions. Paralogs exist in the same genome and are therefore subjected to the same environmental pressures and mutation biases on the grand scale, and this greatly enhances the interpretation of differences that have accumulated in duplicate genes since the moment of origin as identical copies. The study of duplicated genes will continue to be important in the interpretation of new high-throughput and systems-level data for understanding the evolution and functional diversification of genes.

REFERENCES Altschmied J, Delfgaauw J, et al. 2002. Subfunctionalization of duplicate mitf genes associated with differential degeneration of alternative exons in fish. Genetics 161(1):259–267. Beltrao P, Serrano L. 2007. Specificity and evolvability in eukaryotic protein interaction networks. PLoS Comput Biol 3(2):e25. Blanc G, and Wolfe KH. 2004. Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. Plant Cell 16(7):1679–1691. Butland G, Peregrin-Alvarez JM, et al. 2005. Interaction network containing conserved and essential protein complexes in Escherichia coli . Nature 433(7025):531–537. Castillo-Davis CI, Hartl DL, et al. 2004. cis-Regulatory and protein evolution in orthologous and duplicate genes. Genome Res 14(8):1530–1536. Chung WY, Albert R, et al. 2006. Rapid and asymmetric divergence of duplicate genes in the human gene coexpression network. BMC Bioinf 7:46. Conant GC, and Wagner A. 2003. Asymmetric sequence divergence of duplicate genes. Genome Res 13(9):2052–2058. Cosentino Lagomarsino M, Jona P, et al. 2007. Hierarchy and feedback in the evolution of the Escherichia coli transcription network. Proc Natl Acad Sci USA 104(13):5516–5520. Cusack BP, Wolfe KH. 2007. When gene marriages don’t work out: divorce by subfunctionalization. Trends Genet 23(6):270–272. Deane CM, Salwinski L, et al. 2002. Protein interactions: two methods for assessment of the reliability of high throughput observations. Mol Cell Proteom 1(5):349–356. DeLuna A, Vetsigian K, et al. 2008. Exposing the fitness contribution of duplicated genes. Nat Genet 40(5):676–681. Dutkowski J, and Tiuryn J. 2007. Identification of functional modules from conserved ancestral protein–protein interactions. Bioinformatics 23(13):i149–il58. Formstecher E, Aresta S, et al. 2005. Protein interaction mapping: a Drosophila case study. Genome Res 15(3):376–384. Gandhi TK, Zhong J, et al. 2006. 
Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets. Nat Genet 38(3):285–293. Ganko EW, Meyers BC, et al. 2007. Divergence in expression between duplicated genes in Arabidopsis. Mol Biol Evol 24(10):2298–2309. Gavin AC, Bosche M, et al. 2002. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415(6868):141–147. Gavin AC, Aloy P, et al. 2006. Proteome survey reveals modularity of the yeast cell machinery. Nature 440(7084):631–636. Giot L, Bader JS, et al. 2003. A protein interaction map of Drosophila melanogaster . Science 302(5651):1727–1736. Gu Z, Nicolae D, et al. 2002. Rapid divergence in expression between duplicate genes inferred from microarray data. Trends Genet 18(12):609–613. Gu Z, Rifkin SA, et al. 2004. Duplicate genes increase gene expression diversity within and between species. Nat Genet 36(6):577–579.


Guan Y, Dunham MJ, et al. 2007. Functional analysis of gene duplications in Saccharomyces cerevisiae. Genetics 175(2):933–943. Ha M, Li WH, et al. 2007. External factors accelerate expression divergence between duplicate genes. Trends Genet 23(4):162–166. Haberer G, Hindemitt T, et al. 2004. Transcriptional similarities, dissimilarities, and conservation of cis-elements in duplicated genes of Arabidopsis. Plant Physiol 136(2):3009–3022. He X, Zhang J. 2005. Rapid subfunctionalization accompanied by prolonged and substantial neofunctionalization in duplicate gene evolution. Genetics 169(2):1157–1164. Ho Y, Gruhler A, et al. 2002. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415(6868):180–183. Huang TW, Lin CY, et al. 2007. Reconstruction of human protein interolog network using evolutionary conserved network. BMC Bioinf 8:152. Huminiecki L, Wolfe KH. 2004. Divergence of spatial gene expression profiles following species-specific gene duplications in human and mouse. Genome Res 14(10A):1870–1879. Ito T, Tashiro K, et al. 2000. Toward a protein–protein interaction map of the budding yeast: a comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc Natl Acad Sci USA 97(3):1143–1147. Kellis M, Birren BW, et al. 2004. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 428(6983):617–624. Kiemer L, Cesareni G. 2007. Comparative interactomics: Comparing apples and pears ? Trends Biotechnol 25(10):448–454. Kim E, Magen, A, et al. 2007. Different levels of alternative splicing among eukaryotes. Nucleic Acids Res 35(1):125–131. Krogan NJ, Cagney G, et al. 2006. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440(7084):637–643. LaCount DJ, Vignali M, et al. 2005. A protein interaction network of the malaria parasite Plasmodium falciparum. Nature 438(7064):103–107. 
Leach LJ, Zhang Z, et al. 2007. The role of cis-regulatory motifs and genetical control of expression in the divergence of yeast duplicate genes. Mol Biol Evol 24(11):2556–2565. Li S, Armstrong CM, et al. 2004. A map of the interactome network of the metazoan C. elegans. Science 303(5657):540–543. Li WH, Yang J, et al. 2005. Expression divergence between duplicate genes. Trends Genet 21(11):602–607. Liang Z, Xu M, et al. 2006. Comparison of protein interaction networks reveals species conservation and divergence. BMC Bioinf 7:457. Liao BY, Zhang J. 2008. Coexpression of linked genes in mammalian genomes is generally disadvantageous. Mol Biol Evol 25(8):1555–1565. Makino T, Suzuki Y, et al. 2006. Differential evolutionary rates of duplicated genes in protein interaction network. Gene 385:57–63. Makova KD, Li WH. 2003. Divergence in the spatial pattern of gene expression between human duplicate genes. Genome Res 13(7):1638–1645. Mika S, Rost B. 2006. Protein–protein interactions more conserved within species than across species. PLoS Comput Biol 2(7):e79. Musso G, Zhang Z, et al. 2007. Retention of protein complex membership by ancient duplicated gene products in budding yeast. Trends Genet 23(6):266–269. Musso G, Costanzo M, et al. 2008. The extensive and condition-dependent nature of epistasis among whole-genome duplicates in yeast. Genome Res 18(7):1092–1099.


Papp B, Pal C, et al. 2003. Evolution of cis-regulatory elements in duplicated genes of yeast. Trends Genet 19(8):417–422. Pastor-Satorras R, Smith E, et al. 2003. Evolving protein interaction networks through gene duplication. J Theor Biol 222(2):199–210. Pereira-Leal JB, Levy ED, et al. 2007. Evolution of protein complexes by duplication of homomeric interactions. Genome Biol 8(4):R51. Presser A, Elowitz MB, et al. 2008. The evolutionary dynamics of the Saccharomyces cerevisiae protein interaction network after duplication. Proc Natl Acad Sci USA 105(3):950–954. Rain JC, Selig L, et al. 2001. The protein–protein interaction map of Helicobacter pylori . Nature 409(6817):211–215. Rosso L, Marques AC, et al. 2008. Mitochondrial targeting adaptation of the hominoidspecific glutamate dehydrogenase driven by positive Darwinian selection. PLoS Genet 4(8):e1000150. Rosso L, Marques AC, et al. 2008. Birth and rapid subcellular adaptation of a hominoid-specific CDC14 protein. PLoS Biol 6(6):e140. Rual JF, Venkatesan K, et al. 2005. Towards a proteome-scale map of the human protein-protein interaction network. Nature 437(7062):1173–1178. Shoja V, Zhang L. 2006. A roadmap of tandemly arrayed genes in the genomes of human, mouse, and rat. Mol Biol Evol 23(11):2134–2141. Stelzl U, Worm U, et al. 2005. A human protein–protein interaction network: a resource for annotating the proteome. Cell 122(6):957–968. Su Z, Wang J, et al. 2006. Evolution of alternative splicing after gene duplication. Genome Res 16(2):182–189. Tarassov K, Messier V, et al. 2008. An in vivo map of the yeast protein interactome. Science 320(5882):1465–1470. Teichmann SA, Babu MM. 2004. Gene regulatory network growth by duplication. Nat Genet 36(5):492–496. Tirosh I, Barkai N. 2008. Evolution of gene sequence and gene expression are not correlated in yeast. Trends Genet 24(3):109–113. Tirosh I, Weinberger A, et al. 2008. On the relation between promoter divergence and gene expression evolution. 
Mol Syst Biol 4:159. Uetz P, Giot L, et al. 2000. A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae. Nature 403(6770):623–627. von Mering C, Krause R, et al. 2002. Comparative assessment of large-scale data sets of protein–protein interactions. Nature 417(6887):399–403. Wagner A. 2001. The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. Mol Biol Evol 18(7):1283–1292. Wagner A. 2002. Asymmetric functional divergence of duplicate genes in yeast. Mol Biol Evol 19(10):1760–1768. Ward JJ, Thornton JM. 2007. Evolutionary models for formation of network motifs and modularity in the Saccharomyces transcription factor network. PLoS Comput Biol 3(10):1993–2002. Wolfe KH, Shields DC. 1997. Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387(6634):708–713. Yu WP, Brenner S, et al. 2003. Duplication, degeneration and subfunctionalization of the nested synapsin-Timp genes in Fugu. Trends Genet 19(4):180–183. Zhang Z, Gu J, et al. 2004. How much expression divergence after yeast gene duplication could be explained by regulatory motif evolution? Trends Genet 20(9):403–407.

3 Duplicate Retention After Small- and Large-Scale Duplications

STEVEN MAERE and YVES VAN DE PEER
Department of Plant Systems Biology, VIB, Ghent, Belgium; and Department of Molecular Genetics, Ghent University, Ghent, Belgium

1 INTRODUCTION The hypothesis that gene duplication is a driving force in the evolution of increasingly complex organisms has a long history, dating back to the early twentieth century [reviewed by Taylor and Raes (2004)]. In the past decade, the analysis of complete genome sequences has revealed an unexpectedly large prevalence of gene duplications, particularly in eukaryotes. Moreover, a surprisingly high number of eukaryotic organisms have experienced one or several duplications of their entire genome at some point in their evolutionary history (Wolfe, 2001; Van de Peer, 2004). Remnants of ancient genome duplications have been uncovered in yeasts (Wolfe and Shields, 1997), ciliates (Aury et al., 2006), fish (Van de Peer et al., 2003; Jaillon et al., 2004; Vandepoele et al., 2004; Meyer and Van de Peer, 2005), vertebrates (McLysaght et al., 2002; Dehal and Boore, 2005; Putnam et al., 2008), and especially in plants (Otto and Whitton, 2000; Vision et al., 2000; Simillion et al., 2002; Blanc et al., 2003; Bowers et al., 2003; Blanc and Wolfe, 2004b; Paterson et al., 2004, 2005; Yu et al., 2005; Cui et al., 2006; Tuskan et al., 2006; Jaillon et al., 2007; Rensing et al., 2007; Velasco et al., 2007; Ming et al., 2008; Tang et al., 2008). Such large-scale duplication events are considered to be of major importance for the evolution of and increase in biological complexity (Ohno, 1970; Otto and Whitton, 2000; Aburomia et al., 2003; Holland, 2003; De Bodt et al., 2005; Freeling and Thomas, 2006; Fawcett et al., 2009; Freeling, 2009; Van de Peer et al., 2009). In his classic book Evolution by Gene Duplication, Susumu Ohno suggested that the evolution of the vertebrates was facilitated by one or more genome duplications (Ohno, 1970). Recently, De Bodt et al. 
(2005) suggested that ancient genome duplications in the angiosperm lineage may help explain Darwin’s “abominable mystery,” the sudden rise and rapid diversification of angiosperm plants during the Cretaceous [145–65 million years ago (Mya)]. Intuitively, the potential of gene and genome duplications to create raw material for evolutionary innovation is easy to grasp. However, which evolutionary mechanisms play a center-stage role in the retention and functional diversification of duplicate genes, and how they affect small- and large-scale duplications, remain a source of much controversy. In this chapter we review the current knowledge on duplicate retention and diversification, focusing on the differences between small-scale duplications and genome duplications.

Evolution After Gene Duplication, Edited by Katharina Dittmar and David Liberles Copyright © 2010 Wiley-Blackwell

2 MODES OF DUPLICATION

There are several molecular mechanisms by which stretches of genomic DNA are duplicated. The most common mechanisms are believed to be unequal crossing over (leading to tandemly arranged duplicates), slippage of the DNA polymerase during replication, and duplicative retrotransposition (Bailey et al., 2003; Kondrashov and Kondrashov, 2006), but many other mechanisms are known (Koszul et al., 2006), and some organisms (e.g., Caenorhabditis elegans) seem to harbor as yet uncharacterized duplicative processes (Katju and Lynch, 2003; Koszul et al., 2006). Most of the time, these processes result in duplication of relatively short stretches of DNA, spanning one or a few genes or only part of a gene. For instance, 70% of the most recent duplications in C. elegans, with less than 0.10 synonymous substitution per synonymous site (Ks), span less than 2.5 kb (Katju and Lynch, 2003). We refer to such duplications as small-scale duplications (SSDs). Occasionally, larger stretches are duplicated. For example, the rice genome bears evidence of a 3-Mb segmental duplication between chromosomes 11 and 12 (TRCaS Consortia, 2005). Aneuploidy and polyploidy events are a special class of large-scale duplications that give rise to the duplication of entire chromosomes or entire genomes, respectively. Aneuploids and autopolyploids result primarily from nondisjunction of chromosome pairs during mitosis or meiosis in the germ line, leading to the formation of (partially) unreduced gametes that fuse with other gametes to produce aneuploid or polyploid zygotes. Allopolyploids are typically formed through interspecific hybridization followed by chromosome doubling, or vice versa (Comai, 2005). In the remainder of this chapter we consider only one type of large-scale duplication: polyploidy or whole-genome duplication (WGD). The other types of large-scale duplication will be ignored because their effects have been less well studied.

3 DUPLICATE RETENTION MECHANISMS

The boundary conditions of duplicate gene retention after SSD and WGD are very different (Hurles, 2004; Davis and Petrov, 2005). An obvious difference is that in the SSD mode, only one or a few genes are duplicated, whereas a WGD duplicates all genes at the same time. In addition, SSD duplicates need to be fixed (i.e., they have to rise to an appreciable frequency in the population to become permanently part of the gene pool). Some theories on duplicate gene evolution assume that immediately after gene duplication, the duplicate copy is selectively neutral. Population genetic theory predicts that such duplicates most often do not get fixed in the population but are instead rapidly lost by random genetic drift. Even if a neutral duplicate rises to a high enough frequency in the population, either the duplicated or the ancestral locus will ultimately get silenced by degenerative mutations, although this may take up to 10Ne generations on average in large populations, Ne being the effective population size

(Lynch and Force, 2000). However, massive duplicate gene preservation is apparent from sequenced genomes, indicating that a substantial fraction of the newborn duplicates are either not selectively neutral from the start or somehow manage to become visible to selection. In contrast to SSD duplicates, WGD duplicates do not need to fix in the population. Instead, they have to survive the turbulent period of genome rearrangements and massive gene loss that typically follows genome duplication (Wendel, 2000; Scannell et al., 2006; Kasahara et al., 2007; Sémon and Wolfe, 2007b). This period of genomic instability has several side effects. The first is cytological diploidization of the polyploid (i.e., reestablishment of bivalent chromosome pairing), reducing the chance of detrimental meiotic catastrophes. Cytological diploidization may not be needed for allopolyploids, whose subgenomes are often divergent enough to avoid multivalent formation (Ma and Gustafson, 2005). Another side effect is the emergence of a variety of new polyploid species through generation of reproductive incompatibilities (Adams and Wendel, 2005b). Ultimately, the successful establishment of polyploid lineages depends largely on which duplicate genes are kept or lost during the turmoil immediately after WGD, which implies that at least some of the surviving duplicates need to confer a near-immediate selective advantage (De Bodt et al., 2005). However, there is a high probability that a newly established polyploid lineage also harbors a number of neutral or nearly neutral duplicates that hitch a ride with their advantageous siblings. Such neutral duplicates may have survived the initial duplicate massacre by chance, but they still need to acquire a selectable advantage in order to be preserved in the longer term. Several mechanisms have been put forward by which duplicate genes can avoid getting lost. These mechanisms are discussed below, as well as their relative importance after SSD and WGD.
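The drift argument made earlier in this section (a selectively neutral SSD duplicate is usually lost rather than fixed) can be illustrated with a small simulation. The sketch below is not taken from any of the cited studies. It assumes a textbook Wright-Fisher population of constant size and treats the new duplicate as a single presence/absence variant that starts in one of the 2N chromosome copies. The function names and parameter values are hypothetical.

```python
import random


def neutral_duplicate_fixes(n_diploid, rng):
    """Follow one new neutral duplicate until it is lost or fixed.

    Assumption: Wright-Fisher sampling, constant population size,
    no selection, no recurrent duplication.
    """
    copies = 2 * n_diploid          # chromosome copies in the population
    count = 1                       # the duplicate starts as a single copy
    while 0 < count < copies:
        p = count / copies
        # Binomial sampling of the next generation, one draw per chromosome copy.
        count = sum(1 for _ in range(copies) if rng.random() < p)
    return count == copies          # True if the duplicate reached fixation


def fixation_fraction(n_diploid, replicates, seed=0):
    """Fraction of independent replicates in which the duplicate fixes."""
    rng = random.Random(seed)
    fixed = sum(neutral_duplicate_fixes(n_diploid, rng) for _ in range(replicates))
    return fixed / replicates


if __name__ == "__main__":
    n = 20  # hypothetical, deliberately tiny population so the run is fast
    print("simulated:", fixation_fraction(n, 4000, seed=1))
    print("starting frequency 1/(2N):", 1 / (2 * n))
```

In this idealized model the fraction of runs that end in fixation should be close to the starting frequency, 1/(2N), so for any realistic population size nearly all neutral duplicates disappear. Real populations depart from these assumptions in many ways, so the sketch only conveys the direction of the effect.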
3.1 Neofunctionalization Fisher, Haldane, and Ohno put forth the idea that a duplicate gene can be preserved through positive selection if it acquires a new beneficial function (neofunctionalization) (Haldane, 1933; Fisher, 1935; Ohno, 1970). Ohno considered neofunctionalization to be the primary mechanism for preserving duplicate genes. Neofunctionalization may take on different forms. First, a duplicate gene may acquire a novel function through mutation of its coding sequence. One of the hallmarks of this type of neofunctionalization is conjectured to be an asymmetric rate of nonsynonymous substitutions (Ka /Ks ) in both duplicates, with the neofunctionalizing duplicate undergoing amino acid substitutions at a higher rate than its copy, which is supposed to retain the ancestral function. An acknowledged caveat of the Ka /Ks asymmetry hypothesis is that neofunctionalization does not necessarily require widespread changes of the protein sequence but, instead, may involve only part of the sequence, or only a few amino acid changes at specific positions (Endo et al., 1996; Van de Peer et al., 2001). Moreover, an asymmetric Ka /Ks may also be caused by asymmetric subfunctionalization (see below) (He and Zhang, 2005b). Nevertheless, Ka /Ks analysis has been widely used to assess the neofunctionalization potential of duplicate genes. According to Ohno’s original model, the neofunctionalizing copy should have an amino acid substitution rate that is higher than the neutral expectation (Ka /Ks > 1), indicating directional selection, whereas the other copy should evolve under purifying selection (Ka /Ks < 1). This is, however, rarely observed (Van de Peer et al., 2001; Conant and Wagner, 2003; Zhang et al.,


2003). A much more prevalent scenario is that both copies still evolve under purifying selection, but the selection on one of the copies is relaxed (He and Zhang, 2005b; Byrne and Wolfe, 2007). One study of recent SSD duplicates in several prokaryotic and eukaryotic species (Ks < 0.5) found little evidence for asymmetric protein evolution (Kondrashov et al., 2002). In contrast, another study on recent SSD pairs in human (Ks < 0.3) found that about 60% of the pairs evolve asymmetrically (Zhang et al., 2003). However, a substantial portion of the fast-evolving genes identified in the latter study had accumulated amino acid substitutions evenly across their sequence and at nearly neutral rates, indicating that they may be on their way to pseudogenization rather than neofunctionalization. Another study on recent SSD duplicates in Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster, and C. elegans found that 20 to 30% of the duplicate pairs exhibit asymmetric sequence divergence (Conant and Wagner, 2003). The lack of consistency between these results may be explained in part by the fact that all of these studies investigate relatively small samples, on the order of tens of genes. It is doubtful whether such small-scale studies can generate enough statistical power to detect rate asymmetry trends reliably, especially for recent duplicates (Lynch and Katju, 2004). Moreover, assessment of rate asymmetry in recent duplicates is probably not the best way to look for neofunctionalization, since the majority of recent fast-evolving duplicates are bound to be on their way to nonfunctionalization rather than neofunctionalization (Lynch and Conery, 2000). It probably makes more sense to look for traces of asymmetric protein evolution in older duplicates. One class of duplicates that has been studied extensively in this respect is duplicates retained from ancient WGDs.
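The Ka/Ks reasoning used in this section can be made concrete with a toy classifier. The following sketch only encodes the verbal criteria given in the text (Ka/Ks > 1 for directional selection, Ka/Ks < 1 for purifying selection, and a relaxed but still purifying copy). The tolerance `tol`, the fold threshold `min_fold`, and all function names are hypothetical choices for illustration. Published analyses rely on statistical tests rather than fixed cutoffs, so this is not the procedure of any study cited here.

```python
def selection_regime(ka, ks, tol=0.1):
    """Label one gene copy from its Ka and Ks estimates.

    tol is an arbitrary band around Ka/Ks = 1 (an assumption of this sketch).
    """
    if ks <= 0:
        raise ValueError("Ks must be positive to form the Ka/Ks ratio")
    omega = ka / ks
    if omega > 1 + tol:
        return "positive"      # faster than the neutral expectation
    if omega < 1 - tol:
        return "purifying"     # slower than the neutral expectation
    return "near-neutral"


def describe_pair(copy1, copy2, min_fold=2.0, tol=0.1):
    """Describe a duplicate pair given (Ka, Ks) tuples for both copies."""
    r1 = selection_regime(*copy1, tol=tol)
    r2 = selection_regime(*copy2, tol=tol)
    if {r1, r2} == {"positive", "purifying"}:
        # The pattern expected under Ohno's original model.
        return "ohno-like"
    if r1 == r2 == "purifying":
        low, high = sorted((copy1[0] / copy1[1], copy2[0] / copy2[1]))
        # One copy evolving min_fold times faster is treated as relaxed selection.
        if high > 0 and high >= min_fold * low:
            return "asymmetric-relaxed"
        return "symmetric-purifying"
    return "other"


if __name__ == "__main__":
    # Hypothetical (Ka, Ks) values, not data from any study.
    print(describe_pair((0.30, 0.20), (0.02, 0.20)))
    print(describe_pair((0.08, 0.40), (0.02, 0.40)))
    print(describe_pair((0.02, 0.40), (0.02, 0.40)))
```

As the text stresses, an asymmetric label on its own does not separate neofunctionalization from relaxed selection on a copy that is heading toward loss.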
Byrne and Wolfe (2007) found evidence of asymmetric protein evolution rates in 56% of the duplicate pairs remaining from the WGD in yeast. The faster-evolving copy tends to be the same across the post-WGD species, indicating that the asymmetry was established soon after the WGD and before speciation occurred. A study of 26 zebrafish duplicate pairs remaining from the teleost-specific WGD uncovered 13 cases in which the evolutionary rate of one of the paralogs was increased (Van de Peer et al., 2001). In a larger-scale study on Tetraodon, another teleost fish, Brunet et al. (2006) estimated that at least 36% of the WGD pairs have diverged asymmetrically. Also, Blanc and Wolfe (2004a) found evidence for asymmetric protein evolution in more than 20% of the duplicate pairs retained from the most recent WGD in the Arabidopsis ancestor. In contrast to yeast, fish, and Arabidopsis, a recent study in the allotetraploid frog Xenopus laevis found evidence for asymmetric protein sequence divergence in a mere 6% of duplicate pairs that were retained from the WGD (Chain and Evans, 2006). The differences in the extent of asymmetric divergence between species may be correlated with differences in their effective population size or that of their ancestors. Organisms with a large effective population size, such as yeasts, are predicted to be more permissive to duplicate neofunctionalization than organisms with a small effective population size (Lynch et al., 2001). Other factors may be age differences between the WGDs in the different species and differences in selective pressure (i.e., the biological need for neofunctionalization). Except in Xenopus laevis, there seems to be more evidence for asymmetric protein evolution or relaxed selection after WGD than after SSD. One possible explanation may be that WGDs facilitate the establishment of phenotypically neutral duplicates by avoiding the fixation hurdle that plagues neutral SSD duplicates. 
But even for asymmetrically evolving WGD duplicates that have been retained for tens of millions
of years, there is no guarantee that they are on their way to neofunctionalization. Scannell and Wolfe (2008) showed that the protein evolution rate for WGD duplicates in yeast has not yet dropped back to the level for single-copy genes, indicating that the neofunctionalization process may still not have come to an end some 100 million years (My) after the WGD. There are indications that many of the copies undergoing relaxed selection may never acquire a novel function and are eventually lost. In duplicate pairs remaining from the yeast WGD, the faster-evolving copy is almost never essential and has frequently been lost in several of the post-WGD lineages (Byrne and Wolfe, 2007). Similarly, Blomme et al. (2006) showed that vertebrate gene duplicates that have been maintained for hundreds of millions of years can still be lost. Protein neofunctionalization is generally believed to be a slow process. However, neofunctionalization of a gene is not necessarily accomplished by gradual mutation of the coding sequence. It can also be caused by insertion of preexisting domains, which can occur much faster (Björklund et al., 2005; Drea et al., 2006; Song et al., 2008). Genes may also acquire a novel function through changes in their expression or activation patterns (regulatory neofunctionalization), leading, for example, to their expression in other tissues or at other time points. Regulatory neofunctionalization may be faster and much more common than protein neofunctionalization. Several studies of expression divergence after duplication reported a bias toward asymmetric expression divergence, but without suitable outgroup expression patterns, the involvement of neofunctionalization is hard to assess (Gu et al., 2005; Casneuf et al., 2006; Chung et al., 2006; Conant and Wolfe, 2006; Ganko et al., 2007). Tirosh and Barkai (2007) looked for regulatory neofunctionalization of WGD duplicate pairs in S.
cerevisiae by comparing the duplicate expression profiles with those of their preduplication ortholog in Candida albicans. Starting from 96 WGD pairs showing conserved expression with C. albicans for at least one of the two genes, they found 43 WGD pairs (45%) where the other gene had diverged, consistent with regulatory neofunctionalization. Twenty-eight pairs showed significant expression conservation of both duplicates. In contrast, when investigating 46 SSD pairs, they found only one example of asymmetrical divergence, which might indicate that regulatory neofunctionalization is more common among WGD duplicates than among SSD duplicates. In contrast, Casneuf et al. (2006) found that WGD duplicates in Arabidopsis have more similar expression patterns than SSD duplicates of comparable age. Furthermore, as with evolution of the coding sequence, asymmetric expression divergence does not necessarily imply neofunctionalization. The copy experiencing relaxed selection on its expression pattern could also be on its way to being silenced. Accordingly, Casneuf et al. (2006) found that among diverging SSD duplicate pairs, one copy is frequently expressed in a much lower number of tissues than the other copy, but this could also reflect asymmetric subfunctionalization. Clear, well-studied examples of neofunctionalization are difficult to find. A fertile research area in this respect is the evolution of the primate and, in particular, the human brain. For instance, GLUD2 , a duplicate glutamate dehydrogenase gene in humans and apes important for glutamate detoxification after neuron firing, appears to have gained expression in the brain and testes after the human–Old World monkey split, and it also shows signs of directional selection on its protein sequence (Shashidharan et al., 1994; Burki and Kaessmann, 2004). 
Similarly, a cluster of tandem Ret finger protein-like genes (hRFPL1,2,3) have gained expression in the neocortex and cerebellum in humans and other primates since the divergence from their murine ortholog (Bonnefont et al., 2008). Positive selection on the protein sequence was also observed. The importance
of the novel RFPL1,2,3 function in brains may be gleaned from the presence of a tandem cluster in the large-brained Catarrhini, which is conceivably due to selection for increased dosage (see Section 4). Also, the cortical expression of these proteins is significantly higher in humans than in other primates (Bonnefont et al., 2008). 3.2 Subfunctionalization The general consensus is that protein neofunctionalization happens on such a large time scale that most duplicates are lost or degenerated through nonfunctionalizing mutations before neofunctionalization can occur, except perhaps in species with very large effective population sizes, such as yeasts (Lynch et al., 2001). Under the assumption that neofunctionalization is a slow process, there must be other mechanisms at work that preserve duplicates long enough to enable them to acquire novel functions. In the 1990s, subfunctionalization was advanced as an alternative way to achieve duplicate preservation (Hughes, 1994; Force et al., 1999; Stoltzfus, 1999). The subfunctionalization hypothesis states that after duplication, the functionality of an ancestral protein can get partitioned over the duplicates, so that both copies are needed to perform the complete ancestral function. In contrast to neofunctionalization, subfunctionalization does not necessarily require the action of selective forces. According to the duplication–degeneration–complementation (DDC) model, degenerative mutations can occur neutrally in both copies as long as the duplicates as a pair still retain all ancestral functionality (Force et al., 1999). In theory, the partitioning of subfunctions over different copies is enough to preserve duplicate genes indefinitely, even if there is no associated phenotypic advantage. However, subfunctionalization can also facilitate the development of beneficial features. 
For example, subfunctionalization may free a multifunctional gene of its pleiotropic constraints: interactions between its subfunctions that prohibit the gene from the optimal exercise of any of them (Hughes, 1994, 2005). If these constraints are lifted, natural selection can fine-tune the subfunctions separately. This subfunctionalization model is sometimes referred to as the escape from adaptive conflict (EAC) or adaptive conflict resolution model (Hittinger and Carroll, 2007; Conant and Wolfe, 2008; Des Marais and Rausher, 2008). A similar model, here referred to as the IAD (innovation, amplification, divergence) model, was proposed by several authors, who considered it more of a neofunctionalizing mechanism that amplifies and optimizes a beneficial secondary function present in trace amounts in an ancestral gene (Hendrickson et al., 2002; Francino, 2005; Bergthorsson et al., 2007). The difference with EAC is an initial amplification step in which an increase in the number of gene copies is favored by selection for increased dosage (see further). Both EAC and IAD are examples of mechanisms by which duplicate genes can get coopted for preexisting secondary functions, as discussed by Conant and Wolfe (2008). Subfunctionalization might also buy a duplicate pair the time necessary for one or both of the copies to acquire a novel function. On the basis of an analysis of protein interaction divergence and expression divergence of yeast and human duplicates, respectively, He and Zhang (2005b) proposed a “subneofunctionalization” model in which a period of rapid subfunctionalization is followed by prolonged neofunctionalization. A similar model was proposed by Rastogi and Liberles (2005) based on evolutionary simulations on duplicated lattice model proteins. Following reports of widespread asymmetric evolution and protein neofunctionalization after the WGD in

yeasts (Kellis et al., 2004; Byrne and Wolfe, 2007), Scannell et al. (2008) observed that both the fast- and slow-evolving copies of asymmetrically diverging WGD pairs underwent a burst of protein evolution soon after the WGD, consistent with initial rapid subfunctionalization. For classical subfunctionalization to occur on the protein level, the duplicated protein must have multiple separable functions, so not all proteins can be preserved in duplicate through this mechanism. The aforementioned study of X. laevis WGD duplicates (Chain and Evans, 2006) recovered evidence for complementary patterns of amino acid substitutions, indicative of enhancement or degradation of different functional domains, in only 3% of the cases. A more prevalent mechanism may be quantitative subfunctionalization, in which a single function becomes quantitatively partitioned over the duplicate copies: for example, through activity-reducing mutations in both copies (Stoltzfus, 1999; Scannell and Wolfe, 2008). Subfunctionalization, both qualitative and quantitative, can also occur on the regulatory level, and it is expected to be more frequent on this level. For example, a pair of duplicate genes could subfunctionalize through complementary loss of regulatory elements and concomitant specialization of the duplicate expression patterns in time or in space (Force et al., 1999). A myriad of studies have investigated the expression patterns or transcription factor–binding profiles of duplicate genes in different organisms, and most of them agree that there is substantial expression divergence among duplicates (Gu et al., 2002, 2005; Makova and Li, 2003; Papp et al., 2003b; Blanc and Wolfe, 2004a; Evangelisti and Wagner, 2004; Haberer et al., 2004; Zhang et al., 2004; Kim et al., 2005, 2006; Li et al., 2005; Casneuf et al., 2006; Chung et al., 2006; Morin et al., 2006; Ganko et al., 2007; Ha et al., 2007; Hellsten et al., 2007; Hughes and Friedman, 2007; Chain et al., 2008). 
Both tissue specialization and differential expression in response to stress treatments have been observed, as well as quantitative expression divergence. However, because of the lack of ancestral or outgroup expression patterns in these studies, it is difficult to assess the extent to which the expression divergence observed is due to subfunctionalization. A notable exception is a recent study comparing the tissue-specific expression of WGD pairs in X. laevis with the expression of their orthologs in the non-WGD frog X. tropicalis, in which it was estimated that 1.2 to 11% of the retained WGD pairs underwent tissue-specific expression subfunctionalization (Semon and Wolfe, 2008). A similarly conceived study in yeast (Tirosh and Barkai, 2007) found, next to a class of asymmetrically diverging duplicate expression patterns mentioned before, a class of duplicates exhibiting symmetrical divergence in expression. Although it is likely that these duplicates are undergoing some form of subfunctionalization, it is difficult to pinpoint exactly what is going on. Taking a different approach, Duarte et al. (2006) attempted to reconstruct the ancestral expression profiles of duplicated MADS-box genes in Arabidopsis and found indications for both regulatory sub- and neofunctionalization. A form of regulatory subfunctionalization that does not necessarily involve mutation is based on epigenetic repatterning of the duplicate genes. Rodin and Riggs (2003) proposed the epigenetic complementation (EC) model as a mechanism to save newborn duplicates from pseudogenization. The EC model invokes a specific subfunctionalization mechanism to mediate the exposition of both duplicate partners to purifying selection: namely, complementary tissue- or developmental stage-specific epigenetic silencing of duplicated genes via methylation or other processes involving heritable

chromatin structure. According to Rodin et al. (2005a, 2005b), one of the main conditions for EC-mediated survival of duplicates is their repositioning to ectopic sites. Epigenetic silencing, genome rearrangements and translocations have been shown to come into play soon after polyploid formation. Polyploidization events in plants are followed by intensive genomic rearrangements and enhanced activity of transposable elements (Wendel, 2000; Adams and Wendel, 2005b). Moreover, it has been shown that polyploidization can influence epigenetic silencing patterns. Studies on synthetic allopolyploid cotton and Arabidopsis thaliana have shown that reciprocal, developmentally regulated silencing of duplicates can occur during or soon after polyploid formation (Adams et al., 2003; Wang et al., 2004; Adams and Wendel, 2005a). The subfunctionalization model makes testable predictions about the types of genes that should be preferentially preserved after gene duplication. For example, it predicts that duplicates of genes with higher numbers of separable subfunctions (i.e., more complex genes) should be retained at a higher frequency. He and Zhang (2005a) investigated the relationship between gene complexity and gene duplicability in more detail. They found that duplicated genes in yeast, from both WGD and SSD, have on average longer protein sequences, more functional domains, and more regulatory motifs than singleton genes. Especially regarding domain and motif content, the difference is striking: An average duplicated yeast gene has approximately four times as many domains and twice as many motifs as a singleton. The preferential retention of SSD duplicates of longer proteins is even more striking given that shorter proteins are presumed to be more likely to generate functional duplicates through small segmental duplications. Longer proteins may more often spawn truncated duplicates that are nonfunctional or even deleterious if they give rise to dominant negative phenotypes. 
On the other hand, the production of truncated forms of longer, more complex proteins through SSD may also facilitate subfunctionalization. The observation that more complex genes are preferentially retained after duplication appears to support the subfunctionalization model. However, one caveat is that the subfunctionalization process is expected to lower the complexity of the duplicated genes, thereby eroding any difference in complexity between duplicates and singletons. He and Zhang proposed that the complexity of subfunctionalized duplicates can be maintained through subsequent neofunctionalization, arguing that the protein domains and regulatory motifs involved in subfunctionalization need not deteriorate completely but may evolve to adopt new functions (He and Zhang, 2005a). Chapman et al. (2006) also observed that longer and more complex genes were preferentially retained in duplicate after successive WGDs in Arabidopsis. However, they noticed that there were fewer SNPs in these duplicate pairs than in genes that reverted to single-copy status, and that the impact of those SNPs on protein structure was generally less severe. Moreover, they found indications that the sequence conservation of WGD duplicates is actively maintained by homogenization processes. These observations do not support the increased subfunctionalization of more complex genes. Instead, Chapman et al. conjectured that complex genes are performing crucial functions and advocated the theory of genetic buffering of these functions through redundancy.

3.3 Buffering

A possible advantage conferred by gene duplication is a buffering effect against null mutations. Several observations argue for this hypothesis. Molecular networks appear

to be extremely robust to single gene deletions: Fewer than 20% of yeast genes are essential, and deletion of one of the other genes very often has little or no phenotypic effect, at least under rich media conditions (Giaever et al., 2002). Similar observations have been made in plants and C. elegans (Kamath et al., 2003). The robustness against deletion of many genes is commonly associated with the presence of backup copies (i.e., closely related paralogs) (Gu et al., 2003; Wagner, 2008). Correlation studies have shown that S. cerevisiae genes with retained paralogs are indeed less likely to show a growth defect upon deletion (Gu et al., 2003). However, completely redundant duplicates are predicted to be evolutionarily unstable, because either one of the copies can be deleted without phenotypic consequences and is therefore invisible to selection (Brookfield, 1992; Cooke et al., 1997; Nowak et al., 1997). Although perfect redundancy is predicted to be unstable, redundant duplicates may persist in an organism for millions of generations (Nowak et al., 1997). Moreover, duplicate redundancy may be maintained even longer by sequence homogenization mechanisms such as gene conversion. Gao and Innan (2004) found several examples of slowly evolving WGD duplicates in yeast, which they attributed to gene conversion. After making similar observations on WGD duplicates in Arabidopsis, Chapman et al. (2006) suggested that the buffering of crucial functions might be one of the principal advantages of genome duplications. They went on to speculate that this effect might even cause the apparently cyclical recurrence of genome duplications in angiosperm plants. If the buffering of crucial genes were a major factor governing retention of duplicates, one would expect that important genes get duplicated more often than dispensable genes. Several studies have investigated the relationship between gene duplicability and gene essentiality or dispensability.
He and Zhang (2006) reported that less important genes in yeast, as measured by the fitness effect of their homozygous deletion, intrinsically have a higher duplicability than that of important genes. To avoid the confounding effect on deletion fitness of functional compensation among duplicates (Gu et al., 2003; Kamath et al., 2003; Conant and Wagner, 2004), He and Zhang looked only at singleton genes in S. cerevisiae and measured their duplicability by assessing whether or not any of the orthologs in four other yeast species have retained duplicates, but by doing so, they probably introduced bias as well. Moreover, their findings may be explained partially by an underrepresentation of complex-forming proteins among the genes with retained duplicates in other species. Complex-forming proteins tend to be less dispensable (Jeong et al., 2001) and frequently cause dosage effects when duplicated through SSD (see Section 3.6). Seemingly at odds with their previous finding, He and Zhang also found that singleton genes with retained duplicates in other species are more conserved in sequence than across-species singletons. Jordan et al. also reported that slowly evolving genes in yeast appear to be retained preferentially after duplication, and linked slow evolution to importance (Jordan et al., 2004). However, slowly evolving genes are not necessarily more important. As Davis and Petrov (2004) pointed out, slow evolution of genes may be caused by high codon bias, a hallmark of selection for expression efficiency, and slowly evolving genes may be preferentially duplicated because of selection for increased dosage (see Section 4). None of these studies made a distinction between SSD and WGD duplicates. In a recent very elegant study, Ihmels et al. (2007) compared the genetic interaction patterns of yeast duplicate genes and calculated that the presence of duplicates accounts for only about 25% of the robustness observed against single-gene deletions. 
Their study also revealed that even duplicate genes that can buffer for each other’s loss

typically exhibit rich genetic interaction patterns, indicative of limited backup capacity. Indeed, in the case of perfect backup, one would expect the gene to have at most one genetic interaction: with its backup copy. Moreover, most genetic interaction patterns of duplicate genes are divergent, indicating divergence in function. So although duplicate genes may often serve as backup copies under specific circumstances, this effect alone appears to be insufficient to secure their conservation. Lin et al. (2007) made a distinction between WGD and SSD duplicates and found that WGD duplicates in yeast are on average more dispensable than SSD duplicates. Moreover, whereas SSD duplicates become less dispensable as the protein sequence divergence (Ka) with their closest paralog increases, the dispensability of WGD duplicates appears to be remarkably little influenced by their sequence divergence. A similar conclusion was reached by Guan et al. (2007). These results indicate that WGD duplicates remain intrinsically more able to back up for each other, even when their sequences have diverged substantially. A recent study showed that more than one in three WGD pairs in yeast exhibits phenotypic buffering under standard laboratory conditions, and that other WGD pairs show epistatic effects only under particular stress conditions (Musso et al., 2008).

3.4 Increased Dosage

Buffering is not the only potential reason why functionally identical duplicates may be maintained indefinitely. Duplication may serve to increase the expression levels of certain gene products that are needed in large amounts, such as ribosomal proteins or histones. Seoighe and Wolfe (1999) noticed that highly expressed genes, such as ribosomal genes, were retained preferentially in duplicate after the WGD in yeasts.
Following the work of Gao and Innan (2004) mentioned above, Sugino and Innan (2006) linked selection for increased dosage to the occurrence of gene conversion in yeast WGD pairs, arguing that selection may favor gene conversion of highly dosed duplicate genes. Lin et al. (2006) reanalyzed 56 low-Ks WGD gene pairs studied by Gao and Innan (2004) and found that all of these slowly evolving WGD pairs exhibit strong codon-usage bias, a hallmark of selection for translational efficiency. Only approximately half of the pairs showed signs of gene conversion. Therefore, Lin et al. suggested that gene conversion is not so much favored by increased dosage selection, but that prolonged gene conversion is instead facilitated by the reduced rate of sequence divergence caused by codon-usage bias. Aury et al. (2006) found a strong correlation between expression levels and WGD duplicate retention rate in Paramecium. A small subset of duplicates containing, for example, ribosomal constituents, histones, and cytoskeleton components exhibited not only high expression levels but also low protein sequence divergence, low levels of synonymous substitution, and optimized codon usage, in accordance with the increased dosage hypothesis. A particularly interesting example of selection for increased dosage has been uncovered in yeast. Recently, Conant and Wolfe (2007) hypothesized that retention of specific glycolytic genes after the WGD in yeasts has caused an increased glycolytic flux that gave post-WGD yeast species a growth advantage by increasing their glucose fermentation speed.

3.5 Dosage Balance

All of the mechanisms described above, except for the increased dosage hypothesis, have in common that they assume that newborn duplicates are phenotypically neutral.

However, this is frequently not the case. Often, severe fitness defects are observed upon gene duplication. For example, many human genetic diseases are associated with duplication of single genes, larger segments, or chromosomes (Chen et al., 2005; Kondrashov and Kondrashov, 2006; Conrad and Antonarakis, 2007). In most cases, the cause of the deleterious effect is a change in dosage balance (i.e., the stoichiometry of the cellular components gets upset). Early versions of the dosage balance hypothesis focused mainly on the effects of stoichiometric imbalances in regulatory protein complexes, especially those affecting transcription (Birchler et al., 2001; Veitia, 2002, 2003). More recently, dosage balance effects have also been linked to structural proteins, signal transduction cascades, and complex-forming proteins in general (Papp et al., 2003a; Kondrashov and Koonin, 2004; Liang et al., 2008; Veitia et al., 2008). In a landmark study, Papp et al. (2003a) showed that an imbalance in the concentration of the components of protein complexes in yeast generally leads to lower fitness, demonstrating the pervasiveness of dosage balance effects. A corollary of the balance hypothesis is that duplication of individual protein complex subunits would be harmful and thus selected against. Papp et al. found that members of large gene families in yeast are indeed less frequently involved in protein complexes than members of small gene families. Another study (Yang et al., 2003) suggested that in humans, dosage sensitivity increases and subunit duplicability decreases with an increasing number of subunits in a complex. Moreover, in yeast, subunits of heterogeneous protein complexes are significantly less duplicable than homocomplex subunits, consistent with the dosage balance hypothesis (Lin et al., 2007).
A significant effect of increasing heterogeneity of heterocomplexes on the duplicability of the subunits could not be established, possibly due to the small sample size (Lin et al., 2007). Papp et al. restricted their analysis of dosage effects primarily to complex-forming genes but suggested that other classes of genes, specifically transcription factors and developmental genes, might be particularly prone to cause dosage effects. They did not find any evidence of that in yeast, which they ascribed to the fact that yeast transcription factors influence relatively few genes and that yeast lacks the long regulatory cascades that underlie multicellular development in higher eukaryotes. In contrast, Yang et al. (2003) pointed out that higher eukaryotes may be less sensitive than yeast to dosage changes, because of their higher intrinsic robustness against expression variations and more sophisticated systems to control gene and protein expression levels. Additionally, alternative splicing in higher eukaryotes might play an important role in fixing imbalance effects, and the smaller effective population size of higher eukaryotes, combined with the greater potential for duplicate subfunctionalization through tissue specialization, may facilitate the fixation and retention of duplicates of dosage-sensitive genes (Liang et al., 2008). Liang et al. (2008) examined protein underwrapping as a potential cause of dosage balance effects. The underwrapping parameter quantifies the extent to which the backbone of a protein is accessible to water. Highly underwrapped proteins are structurally unstable because the backbone hydrogen bonds that determine the structural integrity of the protein may be dissolved through solvent hydration of the polar groups. Therefore, underwrapped proteins are predicted to be part of protein complexes that shield the underwrapped backbone from the surrounding water.
When the stoichiometry of such a complex is upset, excess underwrapped proteins frequently show a tendency to aggregate, often with detrimental consequences, as in Alzheimer’s disease and Parkinson’s disease (Fernández et al., 2003; Conrad and Antonarakis, 2007). Consistent with

the dosage balance hypothesis, Liang et al. (2008) found that highly underwrapped proteins are less likely to be retained after duplication, and that they are more often retained after WGD than after SSD. They also found that the effect of protein underwrapping on gene duplicability decreases with increasing organismal complexity, consistent with the hypothesis that higher eukaryotes may be less sensitive to dosage changes (Yang et al., 2003). However, protein underwrapping is only one cause of dosage effects. Gene dosage appears to be one of the main factors influencing retention of gene duplicates, even in multicellular eukaryotes (see below).

3.6 Functional Bias

Dosage balance is the only mechanism that can adequately explain one of the most salient features of gene duplicate retention: the striking difference in functional bias seen in duplicates originating from SSD and WGD. Papp et al. (2003a) conjectured that constituents of protein complexes should be retained preferentially after WGD. Indeed, because all members of a protein complex are duplicated simultaneously by WGD, imbalance effects could be circumvented more easily. Moreover, following WGD, the members of a balance-sensitive duplicated protein complex should be lost or retained together to avoid deleterious imbalance effects caused by the loss of single complex constituents. Several studies have assessed the impact of SSD and WGD on the gene complement of an organism. Given their high frequency of (paleo)polyploidization, plants are particularly attractive study objects for this purpose. Blanc and Wolfe (2004a) found that genes retained in duplicate after the most recent WGD in A. thaliana (α) were not distributed evenly over all gene ontology (GO) categories (Ashburner et al., 2000). Regulatory genes, such as transcription factors, signal transducers, protein kinases, protein phosphatases, and developmental genes were found to be enriched in the set of duplicates retained from the α event.
Some complex-forming genes (e.g., ribosomal genes, proteasome subunits, and the photosystem II oxygen-evolving complex) were also found to be overretained. Genes involved in several highly conserved processes, such as DNA replication and repair, tRNA charging, and mitochondrial and chloroplast function, were underretained. Seoighe and Gehring (2004) also investigated the survivability of duplicates after the α duplication. They found that genes retained in duplicate after the γ or β polyploidization events are significantly more likely to have retained duplicates from the α event as well, suggesting that some genes are inherently more duplicable through WGD than are others. Similar to Blanc and Wolfe (2004a), Seoighe and Gehring (2004) found that the set of duplicates retained after α is biased toward transcriptional regulators and signal transducers. A limiting factor in these studies is that the resolution of WGD duplicates is confined to the pairs that can still be found in duplicated blocks with conserved gene content and order (Vision et al., 2000; Simillion et al., 2002; Blanc et al., 2003; Bowers et al., 2003), which compromises their use on the older events, γ and β, that have faded signatures. Also, the studies mentioned above do not compare the effects of WGD and SSD. Maere et al. (2005) developed a method to model the population dynamics of duplicate genes in Arabidopsis, taking into account both the WGD and SSD duplication modes. One of the advantages of the modeling approach is that individual gene duplications do not need to be attributed to a specific mode of duplication, circumventing the WGD identification problem.

For many GO categories, duplicate decay rates after WGD and SSD were found to be strikingly different (see Figure 1). Most notably, genes involved in transcriptional regulation, signal transduction, and development generally show high retention of duplicates in WGD modes, in accordance with previous studies, but low retention in the SSD mode. The same behavior is observed for protein-binding genes and transporter categories such as ion, channel/pore class, and electron transporters. Developmental genes show strong retention after γ and β, but not after α and SSD. Cell cycle and morphogenesis genes also show more retention after the WGDs than after SSD, but at a lower amplitude, indicating less conservation of those duplicates in general. In the same vein, genes involved in DNA metabolism and RNA binding are not well retained, regardless of the mode of duplication. The same goes for structural ribosomal genes, except after the α event, consistent with Blanc and Wolfe’s results (2004a). Overall, it seems that well-conserved basic cellular processes generally accumulate fewer duplicates. Metabolism and stress response categories generally show higher duplicate retention after SSD. Markedly, genes involved in biotic stress responses and related processes (e.g., secondary metabolism, lipid binding, oxygen binding, cell death) retain duplicates at a high rate after WGD and SSD alike. Compared with biotic stress-response genes, genes involved in the response to abiotic stimuli show lower duplicate retention after γ, α, and SSD and higher retention after the β event. It has recently been shown that the γ event was a hexaploidization rather than a tetraploidization event (Jaillon et al., 2007; Tang et al., 2008). Although this may affect the model parameters learned by Maere et al., the qualitative results, in particular the reciprocal relationship of duplicate retention after WGD and SSD for regulatory and developmental genes, are unlikely to change. 
Recently, Michael Freeling (2009) compared the retention of duplicate genes after SSD and the α event using different methods and found the same reciprocal relationship for many regulatory and complex-forming gene classes. All four studies on Arabidopsis find that regulatory genes and complex-forming genes are preferentially retained after WGDs (Blanc and Wolfe, 2004a; Seoighe and Gehring, 2004; Maere et al., 2005; Freeling, 2009). Studies on several other organisms have arrived at similar conclusions. Seoighe and Wolfe observed that signal transduction genes and ribosomal genes were overretained after the WGD that occurred in the S. cerevisiae lineage approximately 100 Mya (Seoighe and Wolfe, 1999). Later, Davis and Petrov (2005) confirmed these results and found that transcription factors were also retained in excess after the yeast WGD. Blomme et al. (2006) constructed phylogenetic trees for more than 8000 gene families in seven vertebrate species, from fish to human, and investigated the gain and loss of duplicate genes during 600 million years of vertebrate evolution. From the position of the duplication events in these trees relative to the speciation events, the duplications can be attributed to certain branches on the species tree. Not surprisingly, large amounts of duplicate genes were found to have been created on branches that coincide with proposed genome duplication events [one or two rounds (1R/2R) at the base of the vertebrate tree, and three rounds (3R) in the teleost fish lineage]. Furthermore, when the retention pattern of these duplicates was investigated, a strong bias was uncovered toward retention of regulatory genes (e.g., transcription factors, signal transducers, developmental genes), protein-binding genes, and ion transporters, both for 1R/2R and 3R and across several species. 
The enrichment of transcription factors and signaling genes among polyploidy-derived gene duplicates had already been noticed earlier in smaller-scale studies, both for the WGDs

[Figure 1. (A) Panels for the whole paranome and for the GO categories development, TF activity, secondary metabolism, DNA metabolism, and signal transduction; x-axis: KS (0 to 5); y-axis: number of retained duplicates; each panel shows the observed KS distribution, the simulated contributions of the SSD, γ, β, and α modes, and the best-fit decay constants α0 to α3. (B) Heat map of decay constants (color scale from <0.4 to >1.0, centered on 0.7) for the SSD and WGD (γ, β, α) modes across GO categories: (P) metabolism; (P) response to external stimulus; (P) response to biotic stimulus; (P) secondary metabolism; (P) cell cycle; (F) nuclease activity; (P) DNA metabolism; (F) RNA binding; (F) structural constituent of ribosome; (P) flower development; (P) post-embryonic development; (P) development; (F) kinase activity; (P) signal transduction; (F) transferase activity; (F) carbohydrate binding; (P) cell communication; (F) transporter activity; (F) enzyme regulator activity; (P) protein modification; (F) transcription factor activity; (F) protein binding; (F) carrier activity.]

Figure 1 (A) Duplicate retention dynamics for several GO categories in Arabidopsis. The colored areas show the simulated fraction of retained duplicates created by each duplication mode as a function of KS; the blue curve is the observed KS distribution. The best-fit decay constants for each duplication mode and their 68% confidence intervals are indicated. α0, α1, α2, and α3 correspond to the SSD, γ, β, and α modes, respectively. (B) Functional bias in duplicate retention is different after WGD and SSD. Blue, high loss; yellow, high retention; black corresponds to α = 0.7, the average decay constant of SSD duplicates across the entire Arabidopsis paranome. (Data from Maere et al., 2005.) (See insert for color representation of the figure.)

at the base of the vertebrate lineage (Gibson and Spring, 1998; Spring, 1997) and for the fish-specific 3R event (Van de Peer et al., 2001). By comparing the gene complement of vertebrates with that of the basal chordate amphioxus, Putnam et al. (2008) confirmed that genes involved in signal transduction, transcription, and development were retained preferentially after the 1R/2R duplications in the vertebrate lineage. A large-scale study in teleost fish confirmed that developmental, signaling, behavior, and regulatory genes were also preferentially retained after the 3R event (Brunet et al., 2006). Investigation of gene duplication and loss dynamics in the evolutionary history of 17 ascomycetous fungi revealed that many gene families involved in stress responses and peripheral transport have been subject to intensive gene duplication and loss, whereas gene families related to essential conserved processes such as protein biosynthesis, protein degradation, tRNA amino acid charging, DNA replication, and mitochondrial function, often involving large protein complexes, are generally more refractory to such turnover except after the WGD in yeasts (Wapinski et al., 2007). Also, a recent analysis of duplicated genes in 12 sequenced Drosophila genomes revealed that the SSD duplicates retained in flies are enriched in genes involved in response to external stimuli and significantly depleted in, for example, transcription factors, kinases, and morphogenesis genes (Heger and Ponting, 2007). Aury et al. (2006) observed a strong overretention of complex-forming genes from the most recent WGD in the ciliate Paramecium tetraurelia, declining to average retention of complex-forming duplicates from the older WGDs. Remarkably, duplicates of central metabolic pathway genes originating from the most recent WGD are still retained in excess (an effect not found in other organisms), but more such duplicates from the older WGDs have been lost than expected. 
These results suggest that after a period of initial preservation, the stoichiometric constraints on complex-forming duplicates and metabolic pathway components are somehow circumvented and the dosage balance effect wears off, permitting the loss of nonbeneficial duplicates (Aury et al., 2006).
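
The over- or underretention of a functional category discussed in this section can be assessed with an enrichment test. The following is a minimal sketch rather than the procedure of any study cited above: it assumes a one-sided hypergeometric test, and the function name and all gene counts are hypothetical values chosen purely for illustration.

```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """Return P(X >= k) when n genes are drawn without replacement
    from a genome of N genes, K of which belong to the category."""
    total = comb(N, n)
    upper = min(K, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible combinations contribute nothing to the sum.
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total

# Hypothetical counts, not taken from any cited study.
N = 2000   # genes in the genome
K = 150    # genes annotated to the category (e.g., transcription factors)
n = 300    # genes retained in duplicate after a WGD
k = 40     # retained genes that belong to the category

expected = n * K / N
p_value = hypergeom_upper_tail(k, K, n, N)
print(f"expected {expected:.1f}, observed {k}, P(X >= {k}) = {p_value:.3g}")
```

A small tail probability would indicate that the category is retained more often than expected by chance; when many GO categories are tested, a multiple-testing correction would also be needed.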

3.7 Expression Divergence

One would expect that dosage balance effects also influence the extent of expression divergence of balance-sensitive duplicate genes. Casneuf et al. (2006) studied the expression divergence of Arabidopsis duplicate pairs created through WGD and SSD. They found that WGD duplicates appear to diverge less in expression than SSD duplicates of approximately the same age, both in terms of their expression correlation under various conditions and their tissue-specific expression patterns, even for older WGDs. On a more detailed scale, a similar observation was made for plant polygalacturonases, a family of hydrolytic enzymes involved in several developmental programs, such as seed germination, organ abscission, fruit ripening, and pollen tube growth, but also wounding responses and host–parasite interactions (Kim et al., 2006). These observations may indicate that the expression divergence of a substantial fraction of WGD duplicates is constrained by dosage balance effects. However, alternative explanations are possible. Notably, the definition of WGD-derived duplicates in most WGD vs. SSD studies is based on their presence in duplicated blocks with conserved gene content and order. It is conceivable that the WGD-derived duplicate pairs that still lie in such blocks may indeed show higher expression pattern

46

DUPLICATE RETENTION AFTER SMALL- AND LARGE-SCALE DUPLICATIONS

conservation than SSD duplicates, since they lie in a more similar genomic context and initially are duplicated with their cis-regulatory regions intact, which is not necessarily the case for SSD duplicates. However, there are indications that a lot of WGD-derived duplicate pairs do not lie in such duplicated blocks. By comparing the number of WGD duplicates found in duplicated blocks in Arabidopsis (Simillion et al., 2002) with the number of WGD duplicates inferred by the duplicate population dynamics model discussed above (Maere et al., 2005), we estimated that the detectable duplicated blocks contain only about 60% of the 3R duplicates retained, 30% of the 2R duplicates retained, and 25% of the 1R duplicates retained. This implies that many WGD-derived duplicates in the Arabidopsis genome currently escape detection or are counted as SSD duplicates instead, even for relatively young WGDs. There are two obvious potential causes for this. First, some duplicated blocks may have degraded beyond recognition through gene loss, genome rearrangements, and fragmentation. Second, retained WGD duplicates may have been repositioned in the genome (e.g., through translocations or transposon activity) (Rodin et al., 2005a, 2005b). In both cases, these undetected WGD duplicates may have higher potential for divergence of their expression patterns because of changes in their cis-regulatory context.
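
Expression divergence between the two copies of a duplicate pair is often summarized as one minus the correlation of their expression profiles across conditions. The following is a minimal sketch of that idea, assuming Pearson correlation as the similarity measure. The profiles are invented toy values, and the helper names are hypothetical; nothing here reproduces the data or exact method of Casneuf et al. (2006).

```python
from math import sqrt
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def expression_divergence(profile_a, profile_b):
    """Divergence of a duplicate pair: 0 = identical pattern, 2 = opposite."""
    return 1.0 - pearson(profile_a, profile_b)


# Toy expression profiles (one value per condition), invented for illustration.
duplicate_pairs = {
    "WGD": [([1, 2, 3, 4], [2, 4, 6, 8]), ([5, 3, 2, 1], [4, 3, 2, 2])],
    "SSD": [([1, 2, 3, 4], [4, 3, 2, 1]), ([1, 5, 2, 4], [3, 1, 4, 2])],
}

median_divergence = {
    mode: median(expression_divergence(a, b) for a, b in pairs)
    for mode, pairs in duplicate_pairs.items()
}
print(median_divergence)
```

In a real comparison the pairs would additionally be binned by age so that WGD and SSD duplicates of similar age are compared, as the text above emphasizes.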

3.8 Genome Duplications and Increasing Complexity

The overretention of transcription factors, signal transducers, and developmental genes after WGDs and their underretention after SSD, caused by dosage balance effects, may have important evolutionary implications. Maere et al. (2005) estimated that more than 90% of the increase in regulatory genes in the Arabidopsis lineage in the last 300 million years is due to WGDs. Given that expansion and functional diversification of regulatory gene families are considered necessary to bring about an increase in morphological complexity, this suggests a key role for genome duplications in the evolution of more complex organisms (De Bodt et al., 2005; Maere et al., 2005; Fawcett et al., 2009; Freeling, 2009; Freeling and Thomas, 2006; Van de Peer et al., 2009). Links between WGDs and the evolution of complexity have been made repeatedly (Ohno, 1970; Otto and Whitton, 2000; Aburomia et al., 2003; Holland, 2003; De Bodt et al., 2005; Freeling and Thomas, 2006; Roth et al., 2007; Soltis et al., 2008). For example, De Bodt et al. (2005) suggested a link between the WGDs early in angiosperm evolution and their morphological innovation and sudden rise to ecological dominance in the Early Cretaceous. Freeling (2009) and Freeling and Thomas (2006) went one step further and proffered the theory that dosage balance effects after genome duplications cause a predictable drive toward higher complexity in plants and animals. They argue that entire functional modules (i.e., protein complexes or regulatory pathways) inherently get retained in duplicate after genome duplication through nonadaptive dosage balance effects, after which they can coadaptively evolve novel functions and cause an increase in morphological complexity. Theories linking WGDs to the evolution of complexity are extremely difficult to prove directly.
Moreover, the fact that WGDs may facilitate an increase in complexity does not mean that WGDs should always lead to more complex organisms (Sémon and Wolfe, 2007a; Van de Peer et al., 2009). Indeed, there are, for example, no indications that morphological complexity increased substantially after the youngest genome duplication in Arabidopsis or the WGD in S. cerevisiae.

3.9 Gene Duplication Versus Alternative Splicing

Gene duplication is not the only way to create novel proteins from existing ones. In higher eukaryotes, alternative splicing is another important mechanism. In fact, Kopelman et al. (2005) recently found an inverse correlation between the extent of alternative splicing and gene duplication in human and murine gene families. Hughes and Friedman (2008) observed a similar anticorrelation in C. elegans. The mechanisms underlying the apparent mutual exclusion between the two competing processes have thus far not been explained completely. The propensity toward expansion through duplication or alternative splicing is sometimes different for the same gene family in human and in mouse, so the inverse correlation seems not to be caused by inherent properties of the different gene families. This is remarkable since alternative splicing and duplication are very different processes with potentially different impacts on the functional differentiation of genes. For example, alternative splicing tends to influence protein structure more drastically than duplicate gene divergence (Talavera et al., 2007; Shakhnovich and Shakhnovich, 2008). Also, differential regulation of splice variant expression may differ substantially in mechanism and effect from expression divergence of duplicated genes. One mechanism that has been invoked to explain the anticorrelation between gene duplication and alternative splicing is, again, the dosage balance effect: Alternatively spliced genes would, upon duplication, give rise to multiple additional gene variants that could exacerbate balance effects (Talavera et al., 2007). If this hypothesis were true, one would expect that a higher number of successful duplications of alternatively spliced genes may have occurred after WGD, but this remains to be tested.
Another potential explanation for the anticorrelation observed is that alternatively spliced genes subfunctionalize upon duplication through mutually exclusive loss of isoforms (Su et al., 2006).
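
An anticorrelation of the kind described in this section can be quantified with a rank correlation between gene family size and the extent of alternative splicing per family. The sketch below assumes Spearman's rank correlation as the measure, which is an illustrative choice and not necessarily the statistic used in the cited studies. The per-family values are invented, and all names are hypothetical.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented per-family values, for illustration only.
family_size = [1, 1, 2, 3, 5, 8]                       # paralogs per family
isoforms_per_gene = [5.0, 6.5, 4.0, 4.2, 2.0, 1.5]     # mean splice variants

rho = spearman(family_size, isoforms_per_gene)
print(f"Spearman rho = {rho:.3f}")
```

A negative rho would correspond to the inverse relationship discussed above; a rank-based measure is used here because family sizes are typically very unevenly distributed.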

4 DISCUSSION

The evidence for the influence of dosage balance effects on duplicate gene retention is overwhelming. However, this does not imply that other duplicate retention mechanisms are of marginal importance. Rather, it indicates that dosage balance effects tend to take priority in determining which duplicates can be retained. Farther down the hierarchy, various sub- and neofunctionalization mechanisms come into play. In Figure 2 we prioritize the various mechanisms that determine the fate of duplicate genes after SSD and WGD. Because our current knowledge is limited, this picture is necessarily incomplete and subjective.

Figure 2 Fate of duplicated genes after WGD and SSD. Each split in the tree is a yes/no decision. Branch thickness is proportional to the duplicate flux through the branch, and the vertical axis is proportional to time (both not to scale). DB, dosage balance; D, increased dosage; EC, epigenetic complementation; S, subfunctionalization; N, neofunctionalization; P, preservation; L, loss; P/L, preservation or loss; y, yes; n, no.

After SSD, duplicates exhibiting deleterious dosage balance effects get lost quickly or never reach fixation. Some of the remaining SSD duplicates may undergo selection for increased dosage, followed by sub(neo)functionalization (IAD model) or not (dosage model). Others may get preserved through sub- or subneofunctionalization mechanisms (DDC, EAC). Of the ones that subfunctionalize through reversible mechanisms (e.g., quantitative subfunctionalization or EC), a substantial fraction may still get lost in the end. Relatively few duplicates probably get preserved through pure neofunctionalization, and most duplicates get lost. After WGD, duplicates exhibiting dosage balance effects are retained preferentially, at least initially. Some balance-sensitive complexes or parts thereof may get retained indefinitely for dosage reasons (e.g., ribosomal proteins after the yeast WGD and the most recent Arabidopsis WGD). When dosage balance effects start to wear off [e.g., through compensatory evolution of paralogous gene regulation (a form of quantitative subfunctionalization)], other sub- and/or neofunctionalization effects come into play. Many duplicates preserved initially for balance reasons may eventually get lost. Indeed, observations in yeast and vertebrates indicate that duplicates may still get lost after being retained for hundreds of millions of years (Blomme et al., 2006; Byrne and Wolfe, 2007; Scannell and Wolfe, 2008). For WGD duplicates that do not exhibit dosage balance effects, more or less the same mechanisms apply as for SSD duplicates. One particular difference is the potentially widespread occurrence of epigenetic complementation (EC)-type subfunctionalization very soon after WGD. Ohno-type neofunctionalization after WGD may be a more important mechanism than after SSD because neutral duplicates need not be fixed after WGD and may linger in the population for millions of years once a sizable effective population size has been established. Some other factors that influence duplicate retention have not been discussed in this chapter. For example, duplicate retention may be influenced by heterosis effects. Heterotic interactions can be stabilized on a haploid genome by SSD and reassortment of alleles through recombination (Spofford, 1969; Proulx and Phillips, 2006). Heterosis effects may also play a role after WGD. Among duplication mechanisms, allopolyploidization has a unique capability to increase heterozygosity and merge different sets of coadapted alleles in a genome, which may lead to hybrid vigor, immediate selective advantages, and selective retention of the duplicates involved. Such heterosis effects are not captured by any of the models discussed above, and they may be the result of more diffuse rewiring of the cellular networks rather than being traceable to single genes. It is likely that yet other factors influencing duplicate gene retention remain to be discovered, and a lot of questions remain concerning the mechanisms that are already known. Research in this area is complicated by the fact that many duplicate retention mechanisms play out over millions of years and are therefore not accessible through direct experimentation. The molecular archaeology studies needed to tease apart the mechanisms that influence duplicate retention will probably generate many more controversial and interesting insights for years to come.

Acknowledgments

We would like to thank Devin Scannell and Michael Freeling for interesting discussions on duplicate retention. Thanks also to Michael Eisen for providing S.M. with a stimulating writing environment. S.M. is a fellow of the Fund for Scientific Research–Flanders (FWO). S.M. gratefully acknowledges support from Fulbright and B.A.E.F. Postdoctoral Fellowships.

REFERENCES

Aburomia R, Khaner O, Sidow A. 2003. Functional evolution in the ancestral lineage of vertebrates or when genomic complexity was wagging its morphological tail. J Struct Funct Genom 3:45–52.
Adams KL, Wendel JF. 2005a. Novel patterns of gene expression in polyploid plants. Trends Genet 21:539–543.
Adams KL, Wendel JF. 2005b. Polyploidy and genome evolution in plants. Curr Opin Plant Biol 8:135–141.
Adams KL, Cronn R, Percifield R, Wendel JF. 2003. Genes duplicated by polyploidy show unequal contributions to the transcriptome and organ-specific reciprocal silencing. Proc Natl Acad Sci USA 100:4649–4654.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25:25–29.
Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Segurens B, Daubin V, Anthouard V, Aiach N, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.
Bailey JA, Liu G, Eichler EE. 2003. An Alu transposition model for the origin and expansion of human segmental duplications. Am J Hum Genet 73:823–834.
Bergthorsson U, Andersson DI, Roth JR. 2007. Ohno’s dilemma: evolution of new genes under continuous selection. Proc Natl Acad Sci USA 104:17004–17009.
Birchler JA, Bhadra U, Bhadra MP, Auger DL. 2001. Dosage-dependent gene regulation in multicellular eukaryotes: implications for dosage compensation, aneuploid syndromes, and quantitative traits. Dev Biol 234:275–288.
Björklund AK, Ekman D, Light S, Frey-Skott J, Elofsson A. 2005. Domain rearrangements in protein evolution. J Mol Biol 353:911–923.
Blanc G, Wolfe KH. 2004a. Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. Plant Cell 16:1679–1691.
Blanc G, Wolfe KH. 2004b. Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell 16:1667–1678.

Blanc G, Hokamp K, Wolfe KH. 2003. A recent polyploidy superimposed on older large-scale duplications in the Arabidopsis genome. Genome Res 13:137–144.
Blomme T, Vandepoele K, De Bodt S, Simillion C, Maere S, Van de Peer Y. 2006. The gain and loss of genes during 600 million years of vertebrate evolution. Genome Biol 7:R43.
Bonnefont J, Nikolaev SI, Perrier AL, Guo S, Cartier L, Sorce S, Laforge T, Aubry L, Khaitovich P, Peschanski M, et al. 2008. Evolutionary forces shape the human RFPL1,2,3 genes toward a role in neocortex development. Am J Hum Genet 83:208–218.
Bowers JE, Chapman BA, Rong J, Paterson AH. 2003. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422:433–438.
Brookfield J. 1992. Can genes be truly redundant? Curr Biol 2:553–554.
Brunet FG, Crollius HR, Paris M, Aury JM, Gibert P, Jaillon O, Laudet V, Robinson-Rechavi M. 2006. Gene loss and evolutionary rates following whole-genome duplication in teleost fishes. Mol Biol Evol 23:1808–1816.
Burki F, Kaessmann H. 2004. Birth and adaptive evolution of a hominoid gene that supports high neurotransmitter flux. Nat Genet 36:1061–1063.
Byrne KP, Wolfe KH. 2007. Consistent patterns of rate asymmetry and gene loss indicate widespread neofunctionalization of yeast genes after whole-genome duplication. Genetics 175:1341–1350.
Casneuf T, De Bodt S, Raes J, Maere S, Van de Peer Y. 2006. Nonrandom divergence of gene expression following gene and genome duplications in the flowering plant Arabidopsis thaliana. Genome Biol 7:R13.
Chain FJ, Evans BJ. 2006. Multiple mechanisms promote the retained expression of gene duplicates in the tetraploid frog Xenopus laevis. PLoS Genet 2:e56.
Chain FJ, Ilieva D, Evans BJ. 2008. Duplicate gene evolution and expression in the wake of vertebrate allopolyploidization. BMC Evol Biol 8:43.
Chapman BA, Bowers JE, Feltus FA, Paterson AH. 2006. Buffering of crucial functions by paleologous duplicated genes may contribute cyclicality to angiosperm genome duplication. Proc Natl Acad Sci USA 103:2730–2735.
Chen JM, Chuzhanova N, Stenson PD, Ferec C, Cooper DN. 2005. Meta-analysis of gross insertions causing human genetic disease: novel mutational mechanisms and the role of replication slippage. Hum Mutat 25:207–221.
Chung WY, Albert R, Albert I, Nekrutenko A, Makova KD. 2006. Rapid and asymmetric divergence of duplicate genes in the human gene coexpression network. BMC Bioin 7:46.
Comai L. 2005. The advantages and disadvantages of being polyploid. Nat Rev Genet 6:836–846.
Conant GC, Wagner A. 2003. Asymmetric sequence divergence of duplicate genes. Genome Res 13:2052–2058.
Conant GC, Wagner A. 2004. Duplicate genes and robustness to transient gene knock-downs in Caenorhabditis elegans. Proc Biol Sci 271:89–96.
Conant GC, Wolfe KH. 2006. Functional partitioning of yeast co-expression networks after genome duplication. PLoS Biol 4:e109.
Conant GC, Wolfe KH. 2007. Increased glycolytic flux as an outcome of whole-genome duplication in yeast. Mol Syst Biol 3:129.
Conant GC, Wolfe KH. 2008. Turning a hobby into a job: how duplicated genes find new functions. Nat Rev Genet 9:938–950.
Conrad B, Antonarakis SE. 2007. Gene duplication: a drive for phenotypic diversity and cause of human disease. Annu Rev Genom Hum Genet 8:17–35.
TRCaS Consortia. 2005. The sequence of rice chromosomes 11 and 12, rich in disease resistance genes and recent gene duplications. BMC Biol 3:20.

Cooke J, Nowak MA, Boerlijst M, Maynard-Smith J. 1997. Evolutionary origins and maintenance of redundant gene expression during metazoan development. Trends Genet 13:360–364.
Cui L, Wall PK, Leebens-Mack JH, Lindsay BG, Soltis DE, Doyle JJ, Soltis PS, Carlson JE, Arumuganathan K, Barakat A, et al. 2006. Widespread genome duplications throughout the history of flowering plants. Genome Res 16:738–749.
Davis JC, Petrov DA. 2004. Preferential duplication of conserved proteins in eukaryotic genomes. PLoS Biol 2:e55.
Davis JC, Petrov DA. 2005. Do disparate mechanisms of duplication add similar genes to the genome? Trends Genet 21:548–551.
De Bodt S, Maere S, Van de Peer Y. 2005. Genome duplication and the origin of angiosperms. Trends Ecol Evol 20:591–597.
Dehal P, Boore JL. 2005. Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biol 3:e314.
Des Marais DL, Rausher MD. 2008. Escape from adaptive conflict after duplication in an anthocyanin pathway gene. Nature 454:762–765.
Drea SC, Lao NT, Wolfe KH, Kavanagh TA. 2006. Gene duplication, exon gain and neofunctionalization of OEP16-related genes in land plants. Plant J 46:723–735.
Duarte JM, Cui L, Wall PK, Zhang Q, Zhang X, Leebens-Mack J, Ma H, Altman N, dePamphilis CW. 2006. Expression pattern shifts following duplication indicative of subfunctionalization and neofunctionalization in regulatory genes of Arabidopsis. Mol Biol Evol 23:469–478.
Endo T, Ikeo K, Gojobori T. 1996. Large-scale search for genes on which positive selection may operate. Mol Biol Evol 13:685–690.
Evangelisti AM, Wagner A. 2004. Molecular evolution in the yeast transcriptional regulation network. J Exp Zool B 302:392–411.
Fawcett JA, Maere S, Van de Peer Y. 2009. Plants with double genomes might have had a better chance to survive the Cretaceous–Tertiary extinction event. Proc Natl Acad Sci USA 106:5737–5742.
Fernández A, Kardos J, Scott LR, Goto Y, Berry RS. 2003. Structural defects and the diagnosis of amyloidogenic propensity. Proc Natl Acad Sci USA 100:6446–6451.
Fisher RA. 1935. The sheltering of lethals. Am Nat 69:446–455.
Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J. 1999. Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151:1531–1545.
Francino MP. 2005. An adaptive radiation model for the origin of new gene functions. Nat Genet 37:573–577.
Freeling M. 2009. Bias in plant gene content following different sorts of duplication: tandem, whole-genome segmental, or by transposition. Annu Rev Plant Biol 60:433–453.
Freeling M, Thomas BC. 2006. Gene-balanced duplications, like tetraploidy, provide predictable drive to increase morphological complexity. Genome Res 16:805–814.
Ganko EW, Meyers BC, Vision TJ. 2007. Divergence in expression between duplicated genes in Arabidopsis. Mol Biol Evol 24:2298–2309.
Gao LZ, Innan H. 2004. Very low gene duplication rate in the yeast genome. Science 306:1367–1370.
Giaever G, Chu AM, Ni L, Connelly C, Riles L, Veronneau S, Dow S, Lucau-Danila A, Anderson K, Andre B, et al. 2002. Functional profiling of the Saccharomyces cerevisiae genome. Nature 418:387–391.
Gibson TJ, Spring J. 1998. Genetic redundancy in vertebrates: polyploidy and persistence of genes encoding multidomain proteins. Trends Genet 14:46–49; discussion 49–50.

Gu Z, Nicolae D, Lu HH, Li WH. 2002. Rapid divergence in expression between duplicate genes inferred from microarray data. Trends Genet 18:609–613.
Gu Z, Steinmetz LM, Gu X, Scharfe C, Davis RW, Li WH. 2003. Role of duplicate genes in genetic robustness against null mutations. Nature 421:63–66.
Gu X, Zhang Z, Huang W. 2005. Rapid evolution of expression and regulatory divergences after yeast gene duplication. Proc Natl Acad Sci USA 102:707–712.
Guan Y, Dunham MJ, Troyanskaya OG. 2007. Functional analysis of gene duplications in Saccharomyces cerevisiae. Genetics 175:933–943.
Ha M, Li WH, Chen ZJ. 2007. External factors accelerate expression divergence between duplicate genes. Trends Genet 23:162–166.
Haberer G, Hindemitt T, Meyers BC, Mayer KF. 2004. Transcriptional similarities, dissimilarities, and conservation of cis-elements in duplicated genes of Arabidopsis. Plant Physiol 136:3009–3022.
Haldane JBS. 1933. The part played by recurrent mutation in evolution. Am Nat 67:5–19.
He X, Zhang J. 2005a. Gene complexity and gene duplicability. Curr Biol 15:1016–1021.
He X, Zhang J. 2005b. Rapid subfunctionalization accompanied by prolonged and substantial neofunctionalization in duplicate gene evolution. Genetics 169:1157–1164.
He X, Zhang J. 2006. Higher duplicability of less important genes in yeast genomes. Mol Biol Evol 23:144–151.
Heger A, Ponting CP. 2007. Evolutionary rate analyses of orthologs and paralogs from 12 Drosophila genomes. Genome Res 17:1837–1849.
Hellsten U, Khokha MK, Grammer TC, Harland RM, Richardson P, Rokhsar DS. 2007. Accelerated gene evolution and subfunctionalization in the pseudotetraploid frog Xenopus laevis. BMC Biol 5:31.
Hendrickson H, Slechta ES, Bergthorsson U, Andersson DI, Roth JR. 2002. Amplification-mutagenesis: evidence that “directed” adaptive mutation and general hypermutability result from growth with a selected gene amplification. Proc Natl Acad Sci USA 99:2164–2169.
Hittinger CT, Carroll SB. 2007. Gene duplication and the adaptive evolution of a classic genetic switch. Nature 449:677–681.
Holland PW. 2003. More genes in vertebrates? J Struct Funct Genom 3:75–84.
Hughes AL. 1994. The evolution of functionally novel proteins after gene duplication. Proc Biol Sci 256:119–124.
Hughes AL. 2005. Gene duplication and the origin of novel proteins. Proc Natl Acad Sci USA 102:8791–8792.
Hughes AL, Friedman R. 2007. Sharing of transcription factors after gene duplication in the yeast Saccharomyces cerevisiae. Genetica 129:301–308.
Hughes AL, Friedman R. 2008. Alternative splicing, gene duplication and connectivity in the genetic interaction network of the nematode worm Caenorhabditis elegans. Genetica 134:181–186.
Hurles M. 2004. Gene duplication: the genomic trade in spare parts. PLoS Biol 2:e206.
Ihmels J, Collins SR, Schuldiner M, Krogan NJ, Weissman JS. 2007. Backup without redundancy: genetic interactions reveal the cost of duplicate gene loss. Mol Syst Biol 3:86.
Jaillon O, Aury JM, Brunet F, Petit JL, Stange-Thomann N, Mauceli E, Bouneau L, Fischer C, Ozouf-Costaz C, Bernot A, et al. 2004. Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature 431:946–957.
Jaillon O, Aury JM, Noel B, Policriti A, Clepet C, Casagrande A, Choisne N, Aubourg S, Vitulo N, Jubin C, et al. 2007. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449:463–467.

Jeong H, Mason SP, Barabasi AL, Oltvai ZN. 2001. Lethality and centrality in protein networks. Nature 411:41–42.
Jordan IK, Wolf YI, Koonin EV. 2004. Duplicated genes evolve slower than singletons despite the initial rate increase. BMC Evol Biol 4:22.
Kamath RS, Fraser AG, Dong Y, Poulin G, Durbin R, Gotta M, Kanapin A, Le Bot N, Moreno S, Sohrmann M, et al. 2003. Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature 421:231–237.
Kasahara M, Naruse K, Sasaki S, Nakatani Y, Qu W, Ahsan B, Yamada T, Nagayasu Y, Doi K, Kasai Y, et al. 2007. The medaka draft genome and insights into vertebrate genome evolution. Nature 447:714–719.
Katju V, Lynch M. 2003. The structure and early evolution of recently arisen gene duplicates in the Caenorhabditis elegans genome. Genetics 165:1793–1803.
Kellis M, Birren BW, Lander ES. 2004. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 428:617–624.
Kim HS, Yu Y, Snesrud EC, Moy LP, Linford LD, Haas BJ, Nierman WC, Quackenbush J. 2005. Transcriptional divergence of the duplicated oxidative stress-responsive genes in the Arabidopsis genome. Plant J 41:212–220.
Kim J, Shiu SH, Thoma S, Li WH, Patterson SE. 2006. Patterns of expansion and expression divergence in the plant polygalacturonase gene family. Genome Biol 7:R87.
Kondrashov FA, Kondrashov AS. 2006. Role of selection in fixation of gene duplications. J Theor Biol 239:141–151.
Kondrashov FA, Koonin EV. 2004. A common framework for understanding the origin of genetic dominance and evolutionary fates of gene duplications. Trends Genet 20:287–290.
Kondrashov FA, Rogozin IB, Wolf YI, Koonin EV. 2002. Selection in the evolution of gene duplications. Genome Biol 3:R8.
Kopelman NM, Lancet D, Yanai I. 2005. Alternative splicing and gene duplication are inversely correlated evolutionary mechanisms. Nat Genet 37:588–589.
Koszul R, Dujon B, Fischer G. 2006. Stability of large segmental duplications in the yeast genome. Genetics 172:2211–2222.
Li WH, Yang J, Gu X. 2005. Expression divergence between duplicate genes. Trends Genet 21:602–607.
Liang H, Plazonic KR, Chen J, Li WH, Fernández A. 2008. Protein under-wrapping causes dosage sensitivity and decreases gene duplicability. PLoS Genet 4:e11.
Lin YS, Byrnes JK, Hwang JK, Li WH. 2006. Codon-usage bias versus gene conversion in the evolution of yeast duplicate genes. Proc Natl Acad Sci USA 103:14412–14416.
Lin YS, Hwang JK, Li WH. 2007. Protein complexity, gene duplicability and gene dispensability in the yeast genome. Gene 387:109–117.
Lynch M, Conery JS. 2000. The evolutionary fate and consequences of duplicate genes. Science 290:1151–1155.
Lynch M, Force A. 2000. The probability of duplicate gene preservation by subfunctionalization. Genetics 154:459–473.
Lynch M, Katju V. 2004. The altered evolutionary trajectories of gene duplicates. Trends Genet 20:544–549.
Lynch M, O’Hely M, Walsh B, Force A. 2001. The probability of preservation of a newly arisen gene duplicate. Genetics 159:1789–1804.
Ma XF, Gustafson JP. 2005. Genome evolution of allopolyploids: a process of cytological and genetic diploidization. Cytogenet Genome Res 109:236–249.
Maere S, De Bodt S, Raes J, Casneuf T, Van Montagu M, Kuiper M, Van de Peer Y. 2005. Modeling gene and genome duplications in eukaryotes. Proc Natl Acad Sci USA 102:5454–5459.

Makova KD, Li WH. 2003. Divergence in the spatial pattern of gene expression between human duplicate genes. Genome Res 13:1638–1645.
McLysaght A, Hokamp K, Wolfe KH. 2002. Extensive genomic duplication during early chordate evolution. Nat Genet 31:200–204.
Meyer A, Van de Peer Y. 2005. From 2R to 3R: evidence for a fish-specific genome duplication (FSGD). Bioessays 27:937–945.
Ming R, Hou S, Feng Y, Yu Q, Dionne-Laporte A, Saw JH, Senin P, Wang W, Ly BV, Lewis KL, et al. 2008. The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature 452:991–996.
Morin RD, Chang E, Petrescu A, Liao N, Griffith M, Chow W, Kirkpatrick R, Butterfield YS, Young AC, Stott J, et al. 2006. Sequencing and analysis of 10,967 full-length cDNA clones from Xenopus laevis and Xenopus tropicalis reveals post-tetraploidization transcriptome remodeling. Genome Res 16:796–803.
Musso G, Costanzo M, Huangfu M, Smith AM, Paw J, San Luis BJ, Boone C, Giaever G, Nislow C, Emili A, et al. 2008. The extensive and condition-dependent nature of epistasis among whole-genome duplicates in yeast. Genome Res 18:1092–1099.
Nowak MA, Boerlijst MC, Cooke J, Smith JM. 1997. Evolution of genetic redundancy. Nature 388:167–171.
Ohno S. 1970. Evolution by Gene Duplication. New York: Springer-Verlag.
Otto SP, Whitton J. 2000. Polyploid incidence and evolution. Annu Rev Genet 34:401–437.
Papp B, Pál C, Hurst LD. 2003a. Dosage sensitivity and the evolution of gene families in yeast. Nature 424:194–197.
Papp B, Pál C, Hurst LD. 2003b. Evolution of cis-regulatory elements in duplicated genes of yeast. Trends Genet 19:417–422.
Paterson AH, Bowers JE, Chapman BA. 2004. Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proc Natl Acad Sci USA 101:9903–9908.
Paterson AH, Bowers JE, Van de Peer Y, Vandepoele K. 2005. Ancient duplication of cereal genomes. New Phytol 165:658–661.
Proulx SR, Phillips PC. 2006. Allelic divergence precedes and promotes gene duplication. Evolution 60:881–892.
Putnam NH, Butts T, Ferrier DE, Furlong RF, Hellsten U, Kawashima T, Robinson-Rechavi M, Shoguchi E, Terry A, Yu JK, et al. 2008. The amphioxus genome and the evolution of the chordate karyotype. Nature 453:1064–1071.
Rastogi S, Liberles DA. 2005. Subfunctionalization of duplicated genes as a transition state to neofunctionalization. BMC Evol Biol 5:28.
Rensing SA, Ick J, Fawcett JA, Lang D, Zimmer A, Van de Peer Y, Reski R. 2007. An ancient genome duplication contributed to the abundance of metabolic genes in the moss Physcomitrella patens. BMC Evol Biol 7:130.
Rodin SN, Riggs AD. 2003. Epigenetic silencing may aid evolution by gene duplication. J Mol Evol 56:718–729.
Rodin SN, Parkhomchuk DV, Riggs AD. 2005a. Epigenetic changes and repositioning determine the evolutionary fate of duplicated genes. Biochemistry (Mosc) 70:559–567.
Rodin SN, Parkhomchuk DV, Rodin AS, Holmquist GP, Riggs AD. 2005b. Repositioning-dependent fate of duplicate genes. DNA Cell Biol 24:529–542.
Roth C, Rastogi S, Arvestad L, Dittmar K, Light S, Ekman D, Liberles DA. 2007. Evolution after gene duplication: models, mechanisms, sequences, systems, and organisms. J Exp Zool B 308:58–73.

Scannell DR, Wolfe KH. 2008. A burst of protein sequence evolution and a prolonged period of asymmetric evolution follow gene duplication in yeast. Genome Res 18:137–147.
Scannell DR, Byrne KP, Gordon JL, Wong S, Wolfe KH. 2006. Multiple rounds of speciation associated with reciprocal gene loss in polyploid yeasts. Nature 440:341–345.
Sémon M, Wolfe KH. 2007a. Consequences of genome duplication. Curr Opin Genet Dev 17:505–512.
Sémon M, Wolfe KH. 2007b. Rearrangement rate following the whole-genome duplication in teleosts. Mol Biol Evol 24:860–867.
Sémon M, Wolfe KH. 2008. Preferential subfunctionalization of slow-evolving genes after allopolyploidization in Xenopus laevis. Proc Natl Acad Sci USA 105:8333–8338.
Seoighe C, Gehring C. 2004. Genome duplication led to highly selective expansion of the Arabidopsis thaliana proteome. Trends Genet 20:461–464.
Seoighe C, Wolfe KH. 1999. Yeast genome evolution in the post-genome era. Curr Opin Microbiol 2:548–554.
Shakhnovich BE, Shakhnovich EI. 2008. Improvisation in evolution of genes and genomes: whose structure is it anyway? Curr Opin Struct Biol 18:375–381.
Shashidharan P, Michaelidis TM, Robakis NK, Kresovali A, Papamatheakis J, Plaitakis A. 1994. Novel human glutamate dehydrogenase expressed in neural and testicular tissues and encoded by an X-linked intronless gene. J Biol Chem 269:16971–16976.
Simillion C, Vandepoele K, Van Montagu MC, Zabeau M, Van de Peer Y. 2002. The hidden duplication past of Arabidopsis thaliana. Proc Natl Acad Sci USA 99:13627–13632.
Soltis DE, Bell CD, Kim S, Soltis PS. 2008. Origin and early evolution of angiosperms. Ann NY Acad Sci 1133:3–25.
Song N, Joseph JM, Davis GB, Durand D. 2008. Sequence similarity network reveals common ancestry of multidomain proteins. PLoS Comput Biol 4:e1000063.
Spofford J. 1969. Heterosis and the evolution of duplications. Am Nat 103:407–432.
Spring J. 1997. Vertebrate evolution by interspecific hybridisation—are we polyploid? FEBS Lett 400:2–8.
Stoltzfus A. 1999. On the possibility of constructive neutral evolution. J Mol Evol 49:169–181.
Su Z, Wang J, Yu J, Huang X, Gu X. 2006. Evolution of alternative splicing after gene duplication. Genome Res 16:182–189.
Sugino RP, Innan H. 2006. Selection for more of the same product as a force to enhance concerted evolution of duplicated genes. Trends Genet 22:642–644.
Talavera D, Vogel C, Orozco M, Teichmann SA, de la Cruz X. 2007. The (in)dependence of alternative splicing and gene duplication. PLoS Comput Biol 3:e33.
Tang H, Wang X, Bowers JE, Ming R, Alam M, Paterson AH. 2008. Unraveling ancient hexaploidy through multiply-aligned angiosperm gene maps. Genome Res 18:1944–1954.
Taylor JS, Raes J. 2004. Duplication and divergence: the evolution of new genes and old ideas. Annu Rev Genet 38:615–643.
Tirosh I, Barkai N. 2007. Comparative analysis indicates regulatory neofunctionalization of yeast duplicates. Genome Biol 8:R50.
Tuskan GA, Difazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, Putnam N, Ralph S, Rombauts S, Salamov A, et al. 2006. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313:1596–1604.
Van de Peer Y. 2004. Computational approaches to unveiling ancient genome duplications. Nat Rev Genet 5:752–763.
Van de Peer Y, Taylor JS, Braasch I, Meyer A. 2001. The ghost of selection past: rates of evolution and functional divergence of anciently duplicated genes. J Mol Evol 53:436–446.


Van de Peer Y, Taylor JS, Meyer A. 2003. Are all fishes ancient polyploids? J Struct Funct Genom 3:65–73. Van de Peer Y, Maere S, Meyer A. 2009. The evolutionary significance of ancient genome duplications. Nat Rev Genet 10:725–732. Vandepoele K, De Vos W, Taylor JS, Meyer A, Van de Peer Y. 2004. Major events in the genome evolution of vertebrates: paranome age and size differ considerably between ray-finned fishes and land vertebrates. Proc Natl Acad Sci USA 101:1638–1643. Veitia RA. 2002. Exploring the etiology of haploinsufficiency. Bioessays 24:175–184. Veitia RA. 2003. Nonlinear effects in macromolecular assembly and dosage sensitivity. J Theor Biol 220:19–25. Veitia RA. 2005. Gene dosage balance: deletions, duplications and dominance. Trends Genet 21:33–35. Veitia RA, Bottani S, Birchler JA. 2008. Cellular reactions to gene dosage imbalance: genomic, transcriptomic and proteomic effects. Trends Genet 24:390–397. Velasco R, Zharkikh A, Troggio M, Cartwright DA, Cestaro A, Pruss D, Pindo M, Fitzgerald LM, Vezzulli S, Reid J, et al. 2007. A high quality draft consensus sequence of the genome of a heterozygous grapevine variety. PLoS ONE 2:e1326. Vision TJ, Brown DG, Tanksley SD. 2000. The origins of genomic duplications in Arabidopsis. Science 290:2114–2117. Wagner A. 2008. Gene duplications, robustness and evolutionary innovations. Bioessays 30:367–373. Wang J, Tian L, Madlung A, Lee HS, Chen M, Lee JJ, Watson B, Kagochi T, Comai L, Chen ZJ. 2004. Stochastic and epigenetic changes of gene expression in Arabidopsis polyploids. Genetics 167:1961–1973. Wapinski I, Pfeffer A, Friedman N, Regev A. 2007. Natural history and evolutionary principles of gene duplication in fungi. Nature 449:54–61. Wendel JF. 2000. Genome evolution in polyploids. Plant Mol Biol 42:225–249. Wolfe KH. 2001. Yesterday’s polyploids and the mystery of diploidization. Nat Rev Genet 2:333–341. Wolfe KH, Shields DC. 1997. Molecular evidence for an ancient duplication of the entire yeast genome. 
Nature 387:708–713. Yang J, Lusk R, Li WH. 2003. Organismal complexity, protein complexity, and gene duplicability. Proc Natl Acad Sci USA 100:15661–15665. Yu J, Wang J, Lin W, Li S, Li H, Zhou J, Ni P, Dong W, Hu S, Zeng C, et al. 2005. The genomes of Oryza sativa: a history of duplications. PLoS Biol 3:e38. Zhang P, Gu Z, Li WH. 2003. Different evolutionary patterns between young duplicate genes in the human genome. Genome Biol 4:R56. Zhang Z, Gu J, Gu X. 2004. How much expression divergence after yeast gene duplication could be explained by regulatory motif evolution? Trends Genet 20:403–407.

4

Gene Dosage and Duplication FYODOR A. KONDRASHOV Center for Genomic Regulation, Barcelona, Spain

1 INTRODUCTION

Innovation at the level of the whole genotype or whole phenotype is known as speciation, a key evolutionary process. Innovation at a smaller scale, involving individual genes and their functions, is akin to speciation, as it explains functional diversity of genes within a genome. It was proposed almost a century ago that gene duplication plays a key role in such small-scale innovation (Bridges, 1935). Modern data support this assertion, and just as all current species have evolved from preexisting ones, a vast majority of current genes and gene functions have also evolved from preexisting genes or gene parts (Li, 1997; Hughes, 1999; Prince and Pickett, 2002). These empirical observations, fueled by theoretical considerations, imply that the study of all stages of evolution of gene duplications is crucial for our ability to understand where new gene functions come from, and how.

It is clear that gene duplications have the potential to produce a new gene function in the long term (Li, 1997; Hughes, 1999). Much less well understood are the early stages in the emergence and evolution of gene duplications, especially fixation of a polymorphic gene duplication. At first glance it may seem that short-term effects of gene duplications are not important for the future evolution of gene function. However, at least in theory, when a short-term disadvantage conflicts with long-term benefits (Maynard Smith, 1978), as well as when a short-term advantage leads to eventual extinction (Webb, 2003), the evolutionary outcome is usually determined by the short-term effect. Thus, short-term effects of gene duplications must be considered in order to understand the long-term evolution of new gene functions.

There are two major hypotheses concerned with the early stages in the life of a gene duplication. The traditional and more widely accepted point of view holds that gene duplications are functionally redundant and fixed by genetic drift. The alternative hypothesis holds that gene duplications increase gene dosage and can be fixed by positive selection for this increase. Here, I analyze the theoretical implications of both of these theories for the short-term fate of gene duplications and attempt to reconcile them by unifying two different perceptions of redundancy under the framework of gene dosage.

Evolution After Gene Duplication, Edited by Katharina Dittmar and David Liberles Copyright © 2010 Wiley-Blackwell

2

GENE DUPLICATIONS AND FUNCTIONAL REDUNDANCY

An overwhelming favorite among all hypotheses on how gene duplications become fixed and evolve is a conglomerate of ideas that stake their argument on the concept of functional redundancy (Haldane, 1933; Fisher, 1935; Ohno, 1970; Kimura and King, 1979; Li, 1980; Ohta, 1990; Walsh, 1995; Wagner, 1998; Force et al., 1999; Stoltzfus, 1999; Lynch and Force, 2000; Prince and Pickett, 2002). The general idea behind functional redundancy of gene duplications can be summed up in the following framework. Suppose that a single-copy gene performs all the possible functions that are required of it. In that case, an exact duplicate of this gene cannot add any functionality to the already existing gene copy and must be completely redundant from a functional and selective perspective. This argument assumes that the fitness of an individual with one original copy of the gene is exactly the same as that of an individual with two or more copies. The original definition of the concept of complete functional redundancy proposed that natural selection can distinguish between the ancestral gene copy and the derived gene copy (Ohno, 1970). The new copy was assumed to be free from the constraints of natural selection and its sequence free to wander in “genotype space,” where it may, in time, randomly stumble upon a new function. Such a random acquisition of a new function is now termed neofunctionalization (Li, 1997; Force et al., 1999; Hughes, 1999). The other side of this process is the loss of many new gene copies to degenerate mutations, that is, mutations that do not reduce fitness but are detrimental to the function of the gene copy. Nonfunctionalization (Li, 1997; Force et al., 1999; Hughes, 1999), a loss of a functional copy of the gene, is the expected result of neutral fixation of degenerate mutations, and its probability increases as the duplicated gene copies diverge (Walsh, 1995; Wagner, 1998; Stoltzfus, 1999).
With the dawn of the genomic era it became apparent that the number of gene duplications in genomes is higher than the number we can reasonably expect to be maintained through the acquisition of novel functions (Force et al., 1999; Lynch and Conery, 2000). Thus, within the framework of functional redundancy, there emerged a need for a new explanation of how a large number of gene duplicates can be maintained in genomes by natural selection. Arguing that natural selection cannot distinguish between the old and the new gene copy, Force and colleagues (1999) modified the hypothesis of genetic redundancy of gene duplicates by proposing that at the moment of duplication mutations in both copies of the gene are neutral. This modification led to the proposal of subfunctionalization, a process by which originally redundant gene copies can evolve to be maintained by selection. Subfunctionalization occurs when both gene copies accumulate slightly degenerate mutations, mutations that are selectively neutral at the time of their occurrence but are harmful to the function of one of the gene copies. If both the new and the old copies gather enough complementary slightly degenerate mutations, both will be necessary to perform the entirety of the original function and will be maintained by natural selection (Force et al., 1999; Lynch and Conery, 2000; Lynch and Force, 2000).
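The logic of subfunctionalization described above can be illustrated with a toy simulation. This is only a minimal sketch under strong simplifying assumptions that are not part of the models cited here: the duplication is already fixed, each gene copy carries the same small set of independent subfunctions, a degenerate mutation removes exactly one subfunction from one copy, and such a mutation is tolerated only while the other copy still performs that subfunction. Population-level processes (drift, mutation rates, fixation times) and neofunctionalization are ignored. The function name `simulate_duplicate_fate` and the parameter `n_subfunctions` are hypothetical.

```python
import random


def simulate_duplicate_fate(n_subfunctions=3, rng=None):
    """Follow one duplicated gene until its fate is decided (toy model).

    Each copy is a list of booleans, one per subfunction (True = intact).
    Assumption: a degenerate mutation knocks out one intact subfunction in
    one copy and is neutral only if the other copy still performs it.
    """
    rng = rng or random.Random()
    copies = [[True] * n_subfunctions, [True] * n_subfunctions]
    while True:
        # Mutations that are still neutral: the subfunction is intact in
        # the mutated copy AND in the other copy.
        neutral = [(c, f)
                   for c in (0, 1)
                   for f in range(n_subfunctions)
                   if copies[c][f] and copies[1 - c][f]]
        if not neutral:
            # Every subfunction now survives in exactly one copy, and both
            # copies retain something, so both are maintained by selection.
            return "subfunctionalization"
        c, f = rng.choice(neutral)
        copies[c][f] = False
        if not any(copies[c]):
            # One copy has lost all of its subfunctions.
            return "nonfunctionalization"


if __name__ == "__main__":
    rng = random.Random(1)
    fates = [simulate_duplicate_fate(3, rng) for _ in range(1000)]
    for fate in sorted(set(fates)):
        print(fate, fates.count(fate))
```

In this sketch a gene with a single subfunction can only be nonfunctionalized, whereas adding subfunctions makes complementary loss, and hence preservation of both copies, more likely. That is the qualitative point of the subfunctionalization argument, not a quantitative prediction.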


3 REDUNDANCY OF GENE DUPLICATIONS

Since the basis of the traditional arguments centers on the concept of redundancy, it may be worthwhile to review the origin of this idea. In the book that introduced this concept, Susumu Ohno (1970) presented his thinking on the effect of a gene duplication on fitness: “Natural selection can eliminate forbidden mutations and effectively police the base sequence of a DNA cistron only if the genome contains a single copy of each gene. Policing by natural selection becomes very ineffective when multiple copies of the gene are present.” This claim, introduced in the section “Inherent Disadvantage of Having Multiple Gene Copies of the Same Gene,” is intuitively sensible, especially for deleterious mutations of large effect. Indeed, following this logic, if a loss-of-function mutation in a single gene produces a lethal phenotype, it seems unlikely that the same effect will be produced if this mutation strikes one of two duplicated genes. Ohno developed this concept and defined genetic redundancy from the point of view of function. “Only by the accumulation of forbidden mutations at the active sites can the gene locus change its basic character and become a new gene locus. An escape from the ruthless pressure of natural selection is provided by the mechanism of gene duplication. By duplication, a redundant copy of a locus is created. Natural selection often ignores such a redundant copy, and, while being ignored, it accumulates formerly forbidden mutations and is reborn as a new gene locus with a hitherto non-existent function.” If a given gene can perform a novel function with relatively slight modifications, yet these modifications destroy its original function, the obvious and brilliant solution is to make a copy and modify it.
Since Ohno defined redundancy from a functional perspective, the implication for a gene duplication is that it is a completely neutral event, since the new copy does not add any functionality to the old one. Interestingly, Ohno recognized that there may be other issues associated with duplicating a gene, such as modifying a gene dosage: “Although the creation of a new gene loci by supplying redundancy is the most important role, there are other benefits the mechanism of gene duplication confers to organisms. When the metabolic requirement of an organism dictates the presence of an enormous amount of a particular gene product, the incorporation of multiple copies of the gene locus by the genome often fulfills that requirement.” Nevertheless, it is clear that in his view, the emergence of functional redundancy is by far a more important consequence of a gene duplication, and Ohno counters the potential benefits of increasing gene dosage with what he considers “inherent disadvantages” of having multiple copies. One of the claimed disadvantages is that even if selection favors a dosage increase, gene duplications will be lost because “slowly but surely more and more duplicates would become useless genes by mutation.” Ohno appears to have made no attempt to distinguish between mutations of large and small effects in duplicated genes, and subsequent work accepted his point of view that redundancy in duplicated genes is an all-or-nothing phenomenon [see Force et al. (1999) and references within]. The subfunctionalization model has also inherited this aspect of genetic redundancy, assuming that at the moment of their emergence, gene duplications are not subject to natural selection (Force et al., 1999): “Each level of redundancy [of a gene copy] is subject to processes of mutation and random genetic drift, which can lead to loss of function by chromosome loss, gene inactivation, or loss of individual regulatory elements.”


The subfunctionalization model, as well as other models that formalized Ohno’s ideas, does not consider the process of fixation of gene duplications; rather, it assumes that the duplication has already been fixed. Having assumed fixation, the model addresses only the “preservation” of duplicates. Although this approach seems logical in the case of a whole-genome duplication due to polyploidization, it is much less clear that the fixation of a small-scale gene duplication has no major consequences other than complete functional redundancy.

4

DEFINING THE CONCEPT OF GENETIC REDUNDANCY

Redundancy has long been applied to artificial constructions, whereby engineers add a redundant level of strength to central pieces of a construction, or redundant backup systems. For example, in airplanes the main hydraulics or the fly-by-wire systems are installed in triplicate or quadruplicate in case of the failure of one or two of these flight-crucial systems. The existence of such backup, redundant systems in no way improves the performance of the airplane, as long as the main system is performing without fault. Such redundancy is equivalent to that modeled by the traditional models of gene duplication. The internal combustion engine in a car provides a different example of redundancy. It is clear that an engine with 12 cylinders has a high level of constructional redundancy, as it will take many cylinders to break down to render the engine inoperable. Nevertheless, the main purpose of having 12 cylinders is not to be able to function in case a couple of cylinders blow out, but to provide extra power. The functionality of an airplane with one failed hydraulic system is exactly the same as that of an airplane with all functional systems, whereas a car with one failed cylinder loses in performance. Thus, redundancy is not only a qualitative but also a quantitative property, and a component with some level of redundancy can also contribute to function. In this sense, the complete redundancy of airplane systems is an extreme case among all possible functions relating performance to redundancy. The neutral view of gene duplications assumes the extreme, complete type of redundancy, whereby a new gene copy is similar to that of a backup set of wires in an airplane. However, perhaps the relationship between the gene copy number and fitness is more similar to the relationship between the number of cylinders and performance of an engine. In evolutionary terms, redundancy can be viewed in terms of genetic interactions, or epistasis between gene copies. 
Consider a fitness function that defines the interaction between genotype and phenotype (fitness). In a simple scenario, we can model fitness as a function of the sum of all individual contributions of each genotype. For example, for a genotype with a number of loci such that each can have two alleles (0 and 1), where allele 1 has a fitness advantage of s over allele 0, the fitness function can be $f(p) = (1 + s)^{p}$, where p is the number of alleles 1 in the genotype (Wolf et al., 2000). This function implies equality of each allele 1 in every locus, regardless of how many loci are occupied by allele 1, such that the addition of each allele 1 will lead to the same fitness increase of s. Simple deviations from this function, such as a faster or slower rate of fitness increase on the log scale than in the case of $(1 + s)^{p}$, correspond to synergistic or diminishing returns epistasis, respectively, such that the fitness impact of allele 1 at each locus depends on the allele states at other loci (Figure 1).
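The relationship between these fitness functions can be made concrete numerically. The sketch below is an illustration only: the nonepistatic case follows the function $(1 + s)^{p}$ given above, whereas the way epistasis is introduced (raising the copy number to the power `1 + epistasis`) is an assumed parameterization chosen for convenience, not a model taken from the literature cited here. The names `log_fitness`, `epistasis`, and `SCENARIOS` are likewise hypothetical.

```python
import math


def log_fitness(copies, s=0.1, epistasis=0.0):
    """Log fitness for a given number of gene copies (copies >= 1).

    epistasis = 0 gives the nonepistatic function f(p) = (1 + s)**p, whose
    logarithm is linear in p. The exponent copies**(1 + epistasis) is an
    illustrative assumption:
      epistasis > 0       log fitness rises faster than linearly (synergistic)
      -1 < epistasis < 0  log fitness rises more slowly (diminishing returns)
      epistasis = -1      log fitness is flat (complete redundancy)
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    effective_copies = copies ** (1.0 + epistasis)
    return effective_copies * math.log(1.0 + s)


# Hypothetical parameter values, one per type of curve discussed in the text.
SCENARIOS = {
    "nonepistatic": 0.0,
    "synergistic": 0.5,
    "diminishing returns": -0.5,
    "complete redundancy": -1.0,
}

if __name__ == "__main__":
    for name, e in SCENARIOS.items():
        row = [round(log_fitness(p, epistasis=e), 3) for p in range(1, 6)]
        print(f"{name:>20}: {row}")
```

All four curves coincide at a single copy and separate as the copy number grows, so the classical assumption of complete redundancy appears as one extreme of a continuous family rather than as a separate kind of model.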


[Figure 1 plot: log fitness (vertical axis) against number of gene copies (horizontal axis, 1 to 10), with regions labeled “Synergistic epistasis” and “Diminishing returns epistasis.”]

Figure 1 Epistasis and gene copy number. Additive (nonepistatic) fitness function is shown as a continuous line. The fitness function defined by the assumption of complete redundancy is shown as a dashed line. The area above the nonepistatic function leads to synergistic epistasis, while the area below leads to diminishing returns epistasis. The flat function shown as a dashed line is an extremity; however, it appears unlikely that a mutation in a duplicated gene can have the same impact on fitness as in a single-copy gene. Thus, few, if any, duplicated genes are expected to follow the nonepistatic fitness function or a function with synergistic epistasis.

The concept of redundancy appears to be the same as that of epistasis: The fitness impact of different genotypes, number of gene copies, or different alleles depends on the genomic background in which these events occur. Thus, within the framework of epistasis, any fitness function that shows diminishing returns epistasis can be said to exhibit redundancy, while the qualitative, complete redundancy of the classical Ohno model of gene duplication can be described with a flat function where fitness does not change with the number of copies (Figure 1). So, do gene duplications obey an alternative function to that of the flat fitness function representing complete redundancy? If so, evidence that gene duplications can affect fitness must be sought.

5 GENE DOSAGE: EVIDENCE FROM NATURE

One biologically plausible hypothesis on how gene copy number can be seen as a quantitative trait is that the number of gene duplicates affects the dosage of the active product of these genes, which may influence the relationship between gene copy number and fitness. It seems reasonable to assume that an extra copy of a gene leads to an increase in the gene product, protein or RNA, which may have measurable and substantial effects on fitness. There are several examples in the literature of a gene duplication leading to a substantial difference in fitness through dosage effects.


5.1 Deleterious Duplications

Case Studies

In the early 1990s a duplication on chromosome 17 in the p12–p11.2 region was linked to Charcot–Marie–Tooth disease type 1A (CMT1A), which is one of the most common genetic neuropathies in humans (Lupski et al., 1991). It took several years to achieve some level of certainty in the molecular basis of the disease. The appearance of a segmental duplication in the genome may be deleterious because it may somehow qualitatively disrupt the normal functioning of the same or other genes or produce a deleterious product. Alternatively, an extra copy of a gene may simply cause a deleterious gene dosage effect (Lupski et al., 1992). The latter seems most likely in the case of CMT1A, with the overexpression of the PMP22 gene being the likely cause (Lupski et al., 1992; Patel et al., 1992; King et al., 1998). CMT1A became the first disease shown to be caused by an increased gene dosage achieved through a gene duplication event.

A relationship between the disease onset time and gene copy number has been established for a different disease-associated duplication, that of α-synuclein. Some mutations in α-synuclein cause Parkinson’s disease (Polymeropoulos et al., 1997) through tighter binding to the lamp2a lysosomal membrane receptor, which inhibits its own degradation in lysosomes and increases the cellular concentration of α-synuclein (Cuervo et al., 2004). Duplication of the α-synuclein gene was found to have the same effects, both in terms of the etiology of Parkinson’s disease and in the dosage-mediated mechanism of its development (Singleton et al., 2003; Chartier-Harlin et al., 2004; Ibáñez et al., 2004). Perhaps the most convincing evidence in support of the dosage-mediated progression of Parkinson’s disease is the correlation between the age of onset and the number of copies, whereby persons heterozygous for the α-synuclein duplication show a later onset and a lower severity of the disease than do persons homozygous for the duplication (Chartier-Harlin et al., 2004; Ibáñez et al., 2004).

There are relatively few examples of genetic pathologies caused by gene overdose through a gene duplication [see Kondrashov and Kondrashov (2006), Lupski and Stankiewicz (2005), or Conrad and Antonarakis (2007) for a review]. Yet there is little doubt that the basis for the pathogenicity in these cases is mediated by a dosage response (Kondrashov and Kondrashov, 2006; Lupski and Stankiewicz, 2005; Conrad and Antonarakis, 2007). The human population appears to harbor many polymorphic gene duplications (Buckland, 2003; Sebat et al., 2004; Derti et al., 2006; Redon et al., 2006; Cooper et al., 2007), yet a connection between clinical pathology and gene copy number has not been made for many such cases. Nevertheless, it appears conceivable that other genetic disorders will be causatively linked to increasing dosage through gene duplications (Lupski and Stankiewicz, 2005).

Cancer

As with hereditary disorders, the development and progression of many cancers are highly dependent on oncogene duplication (Croce, 2008), in what is thought to be a dosage-mediated process (Schwab, 1999; Weir et al., 2004; Yasui et al., 2004). Recently, genomewide studies initiated a more comprehensive search for genomic alterations, including gene duplications, that may affect cancer development (Weir et al., 2004; Degenhardt et al., 2008). Although these studies suggest a wide role of gene duplications in cancer, the field has not yet produced reliable estimates of the total number of gene duplications that may have a role in cancers.
Interestingly, heterozygous deletion of some dosage-dependent tumor suppressor genes can also be a factor in cancer development (Payne and Kemp, 2005). Thus, both an increase and a decrease of gene dosage, through gene duplication or deletion, respectively, are factors in human cancers.

Dosage Balance

The causative effect of a gene duplication and increased gene dosage with subsequent pathogenic or carcinogenic effects serves as a proof of principle that gene duplications can decrease fitness drastically. It appears that the number of gene copies, for at least some fraction of genes, is in some sort of dosage balance, whereby both an increase and a decrease in dosage (copy number) are deleterious. This dosage balance is exemplified by the PMP22 gene, where an increase in gene dosage causes Charcot–Marie–Tooth disease, while a hereditary neuropathy with liability to pressure palsies is caused by a decreased dosage of the same gene (Lupski and Stankiewicz, 2005). Another aspect of the possible deleterious impact of a gene duplication has been observed in duplicated genes remaining after whole-genome duplications (WGD). It has been argued that a difference in dosage of isoforms of the same protein complex is deleterious (Veitia, 2002). Thus, when all isoforms are duplicated at the same time during a WGD event, there may not be such a dosage imbalance. Similarly, genes coding for isoforms from the same protein–protein complex (Papp et al., 2003) as well as other dosage-balanced genes (Veitia, 2004; Liang et al., 2006) will be retained in duplicated form after the WGD event, because a loss of an extra copy will create the same dosage imbalance. These predictions are upheld in Saccharomyces (Papp et al., 2003), Arabidopsis (Maere et al., 2005; Thomas et al., 2006), and Paramecium (Aury et al., 2006).
These observations, that genes with functions sensitive to dosage balance with other genes are retained preferentially after WGD, demonstrate the importance of dosage in determining the number of gene copies (Birchler et al., 2005; Freeling and Thomas, 2006; Birchler and Veitia, 2007; Freeling, 2008; Liang and Fernández, 2008), mostly because they encompass a much larger number of genes than those discussed in previous sections, which are known to cause pathologies directly. These data are, however, insufficient to fully appreciate the interplay between dosage and duplication, because neither these data nor the dosage balance model offer any possibility of a beneficial increase in dosage or gene copy number.

5.2 Beneficial Duplications

Case Studies

The beneficial impact of gene duplications has been shown for several different classes of genes. Perhaps the clearest example of an adaptive increase of gene dosage through a gene duplication is that of the amylase gene in humans (Perry et al., 2007). Amylase is secreted in the pancreas and saliva, and it starts digestion in the course of chewing food, with a significant portion of starch hydrolysis occurring before the food reaches the stomach (Valdez and Fox, 1991; Hoebler et al., 1998). Since this gene has long been noted as one of the polymorphic gene copy number variants in the human population (Pronk et al., 1982; Groot et al., 1989; Meisler and Ting, 1993; Iafrate et al., 2004), Perry and coauthors speculated that the number of these polymorphic amylase copies may correlate with the historical starch content in the diet of different human populations. Indeed, the number of copies of the amylase gene was found to be significantly larger among populations with a high-starch diet. In addition, the frequency of individuals with more than six copies was two times higher in high-starch-diet populations (Perry et al., 2007). Most important, there is a clear interdependence between the number of gene copies and the amount of amylase in saliva. These data, coupled with the observation of only a single copy of the amylase gene in chimps and, possibly, just a pseudogene in bonobos (Perry et al., 2007), paint the following evolutionary scenario. The increase of starch in the human diet has led to significant selection for an increased expression of amylase in saliva (Meisler and Ting, 1993; Perry et al., 2007). Part of this increase occurred through an increase in copy number, with such selection being stronger among populations relying on a high starch content in their diet.

An entire set of dosage-mediated adaptive gene duplications resulted from the pressure of recent insecticide use. A particularly well-documented case is that of the adaptive response of Culex sp. in areas treated regularly with organophosphorous insecticides (Mouchès et al., 1986; Pasteur and Raymond, 1996; Raymond et al., 1998; Guillemaid et al., 1999; Paton et al., 2000; Labbé et al., 2007a,b). The general basis behind the adaptive mechanism is simple: overproduction of esterases that play a role in the breakdown of the organophosphates. This overproduction is achieved in part by duplication of the esterase genes, which leads to an increased resistance to the organophosphate insecticides (Guillemaid et al., 1999). Other insects show a similar adaptive response when faced with the same pressure of organophosphates. Aphids (Devonshire and Moores, 1982; Field et al., 1988, 1999; Foster et al., 2003), Anopheles gambiae (Djogbénou et al., 2008), the brown planthopper (Vontas et al., 2000), the sheep blowfly (Newcomb et al., 2005), and others [see Tabashnik (1990), Hemingway et al. (2004), and Li et al. (2007)] evolved increased levels of different detoxifying enzymes through gene duplications, suggesting that dosage response through gene duplication is a general mechanism in the genetics of stress-induced adaptive response.
Indeed, insecticide resistance through gene duplication has been recognized as a major force by many authors [see Devonshire and Field (1991), Pasteur and Raymond (1996), and Hemingway (2000)], although duplications as a molecular basis of insecticide resistance have not been found in some species (Wilson, 2001). There are three major types of enzymes involved in insecticide resistance: cytochrome P450 monooxygenases, esterases, and glutathione-S-transferases (GSTs), with esterases being by far the most commonly duplicated type (Scott, 1999; Hemingway, 2000; Li et al., 2007). Several resistance-related duplications are known for GST genes (Li et al., 2007), whereas duplication of cytochrome P450 genes as a resistance mechanism has not been observed to date (Scott, 1999; Li et al., 2007), even though their duplication is known to affect toxin metabolism in humans (Buckland, 2003).

Another important aspect of insecticide resistance through gene duplication is that at least some of these duplications are actually deleterious in an environment without the pesticide (Raymond et al., 1998; Guillemaid et al., 1999; Bourguet et al., 2004; Foster et al., 2003, 2005). Thus, the impact of gene duplications on fitness can depend on the environmental conditions. Such dependence can be seen not only for gene duplications offering resistance against toxic elements, but also for gene duplications that increase fitness by improving the ability of the organism to acquire beneficial substances when they are scarce. For example, in baker’s yeast the duplication of the high-affinity hexose transport genes HXT6 and HXT7 leads to an increase in expression of those two genes and an increased fitness of the strain with extra gene copies under glucose-limited conditions (Brown et al., 1998). The basic premise of this adaptation is the same as in the previous example with esterases: positive selection for increased gene dosage can drive fixation of gene duplications.


5.3 Review of Adaptive Responses Gene duplication may cause an adaptive (beneficial) increase in gene dosage as well as a deleterious one. However, unlike the few examples of deleterious effects, the literature is full of examples of adaptive gene duplications. Several authors have reviewed this body of literature throughout the years, targeting either all cases of adaptive gene duplications (Velkov, 1982; Stark and Wahl, 1984; Sonti and Roth, 1989; Kondrashov et al., 2002; Francino, 2005; Kondrashov and Kondrashov, 2006) or a select group of taxa or genes (Anderson and Roth, 1977, 1979; Koch et al., 1981; Tabashnik, 1990; Taylor and Feyereisen, 1996; Romero and Palacios, 1997; Widholm et al., 2001; Moore and Purugganan, 2005; Craven and Neidle, 2007; Hastings, 2007). Several years ago, when compiling a review of the literature on adaptive gene duplications (Kondrashov et al., 2002), I noticed a semantic division within the literature that segregates theoretical and experimental biologists. The theoretical and evolutionary crowd tended to use the term gene duplication to describe the emergence of an extra copy of a gene in the genome. Experimental biologists, especially microbiologists, preferred to use the term gene amplification when referring to essentially the same phenomenon. In single-celled organisms there seems to be absolutely no difference between these two terms, whereas in a multicellular organism it may be appropriate to define gene duplications as mutations that are heritable, whereas gene amplifications are entirely somatic events. Unfortunately, there is no clear understanding in the community of the exact meaning of these terms, and they are often used interchangeably. 
Thus, only a handful of reviews or papers bridge this semantic divide (Hendrickson et al., 2002; Kondrashov et al., 2002; Francino, 2005; Kugelberg et al., 2006), and what has been obvious to microbiologists for decades, namely that heritable gene amplifications are often adaptive (Anderson and Roth, 1977, 1979; Koch, 1981; Velkov, 1982; Stark and Wahl, 1984; Hastings, 2007), is often ignored in the evolutionary literature on gene duplications [e.g., see Force et al. (1999), Prince and Pickett (2002), Taylor and Raes (2004), and Qian and Zhang (2008)]. Given the breadth of the past review efforts on adaptive gene duplication, only a short synopsis is presented here. Adaptive gene duplications have been observed in almost all taxa: prokaryotes, plants, mammals, fungi, protists, and so on [see Kondrashov et al. (2002)]. Only for viruses do there not appear to be any data on adaptive gene duplication, even though gene duplications are certainly a factor in the evolution of at least some types of viruses (Roossinck, 1997; Shackelton and Holmes, 2004). In contrast, there appears to be a bias in the types of functions of genes that underwent adaptive duplications (Velkov, 1982; Stark and Wahl, 1984; Kondrashov et al., 2002). Adaptive duplications are particularly common among genes responsible for the following four classes of functions: transporters, synthetases, a wide range of detoxification enzymes, and various stress-induced proteins [see references in Velkov (1982), Stark and Wahl (1984), Kondrashov et al. (2002), Francino (2005), and Kondrashov and Kondrashov (2006)]. Changes in the dosage of these proteins allow the organism to respond better under new, often stressful environmental conditions.
Duplication of transporters allows the organism to remove harmful molecules, such as heavy metals, antibiotics, or harmful toxins from pathogens, or to import necessary molecules, such as sugars and other nutrients, that are usually involved in some type of metabolic process. A higher dose of detoxification enzymes breaks down harmful molecules, such as pesticides, antibiotics, or toxins produced by pathogens,


while duplications of synthetases may be adaptive because they allow the organisms to synthesize necessary substances, such as sugars or amino acids, that are lacking in the environment from other, more readily available substrates. Duplication of other stress-response proteins, such as heat-shock proteins, may be adaptive when the organism experiences prolonged periods of such stress. Finally, it is conceivable that other classes of functions undergo adaptive gene duplications under certain conditions.

5.4 CNVs and Dosage Balance

In the last few years, a new body of literature has emerged on the study of gene duplications segregating in the population. These variants are called copy number variations or polymorphisms (CNVs or CNPs), and this literature exists virtually outside the scope of the older literature on gene duplications. “Owing to their size and gene content, CNPs are unlikely to be selectively neutral” is an a priori assumption of one team of authors (Sebat et al., 2004), made without any reference to the almost century-long history of this idea and the subsequent debates (Bridges, 1935; Rapoport, 1940; Ohno, 1970). Such a cavalier approach is mirrored by many other researchers investigating CNVs, and perhaps not without cause. Since there is no conceptual difference between studies of polymorphic vs. fixed gene duplications (Kondrashov et al., 2002), CNVs appear to be more useful than fixed duplications for detection of short-term selection acting on gene copy number. Large-scale studies that have made an effort to detect selection on CNVs generally find CNVs as a class of variation to be under negative selection (Sebat et al., 2004; Derti et al., 2006; Redon et al., 2006; Cooper et al., 2007; Dopman and Hartl, 2007; Emerson et al., 2008), as evidenced by the frequency distribution of the CNVs (Sebat et al., 2004) or by the observation of a lower content of genes and conserved elements in recent genomic segmental duplications (Derti et al., 2006; Redon et al., 2006).
Such genomewide studies have also produced evidence for positive selection on some CNVs (Moore and Purugganan, 2003; Nguyen et al., 2006; Redon et al., 2006; Cooper et al., 2007; Jiang et al., 2007; Emerson et al., 2008). The study of Moore and Purugganan (2003) is particularly exciting, owing to their method of detecting positive selection through lowered frequency of nucleotide polymorphisms around recently fixed duplicated genes, which is indicative of hitchhiking. On average, a CNV is more likely to affect fitness than a single-nucleotide polymorphism, which is hardly surprising (Cooper et al., 2007). Similar to the several examples of pathogenic and adaptive gene duplications, a genomewide study showed a clear correlation between copy number and gene expression levels in CNVs (Stranger et al., 2007), indicating a wider role for dosage in determining the fitness impact of duplicated genes.

6 ENVIRONMENTAL INTERACTION AS THE BASIS OF THE ADAPTIVE RESPONSE

There is a wealth of evidence showing that in some cases, the benefit of increased gene dosage is sufficient to drive the fixation of gene duplications. Such a benefit seems to emerge when populations of organisms face new or stressful environments: heavy or targeted pollution, nutrient limitations, pressure of pathogens, or other types of stress. A point of contention for the Ohno-like models, which assume neutral fixation of


most gene duplications, is how frequently such beneficial duplications arise in nature. Another way of phrasing the same question is how many of the recent gene duplications in modern genomes have been fixed by positive selection instead of genetic drift. The first genomewide study that measured selection acting on diverging gene copies by looking at the ratio of the rate of nonsynonymous substitutions (Ka) to the rate of synonymous substitutions (Ks) did not consider the possibility that gene duplications may be fixed by positive selection for increased gene dosage (Lynch and Conery, 2000). Nevertheless, the authors observed that very few duplicated genes appear to evolve neutrally, and that there appear to be many fewer older duplications than more recent ones. Another genomewide study analyzed not only the selective pressure (Ka/Ks) but also the functional repertoire of recent gene duplications and found similarities between the types of functions found among the recently duplicated genes in complete genomes and those functions that have been shown in smaller-scale experiments to be fixed by positive selection for increased gene dosage (Kondrashov et al., 2002). This observation spurred the authors to propose an alternative hypothesis: “Gene duplications that persist in an evolving lineage are beneficial from the time of their origin, due primarily to a protein dosage effect in response to variable environmental conditions.” Currently, relatively little has been done to resolve the question of what fraction of emerging duplications is fixed by positive selection. The assumption of complete genetic redundancy of gene duplication still dominates the field. The assumption of neutrality as a null model is pervasive throughout population genetics, and it is difficult to find evidence in support of, as opposed to against, a null model. Thus, few authors attempt to investigate the mode of selection acting on emerging gene copies on a genomewide scale.
The observation that there appear to be a very large number of very recent gene duplications, as judged by their very high (>98%) identity level, and many fewer more diverged gene copies (Lynch and Conery, 2000; Kondrashov et al., 2002) generally supports models of gene duplications that assume neutral fixation. On the other hand, these data are also consistent with a large fraction of very similar gene copies undergoing frequent gene conversion, which appears to be the case in baker’s yeast (Gao and Innan, 2004) and Drosophila (Osada and Innan, 2008). If gene conversion were found to be pervasive among gene copies in other species as well, it would undermine a major argument for neutrality of recent gene duplications. Evidence in support of the hypothesis that the majority of gene duplications are fixed by positive selection rather than genetic drift is also scarce on a genomewide level. Moore and Purugganan (2003) found a reduced level of polymorphism around recently duplicated genes in Arabidopsis thaliana, implying that their fixations occurred under positive selection. It is unfortunate that no other study attempts to examine the selection pressure acting on gene duplications in the course of their fixation because such population genetic studies are probably the best way to resolve the issue at hand. Another study observed a higher number of recently duplicated genes in the mouse genome than in the human genome, which can be explained by stronger selection for these gene copies in the mouse because of the larger effective population size in that species (Shiu et al., 2006). Finally, the observation that recent gene duplications, including CNVs, are enriched for environmentally sensitive, stress-induced, and


defensive functions in a wide variety of species (Kondrashov et al., 2002; Gu et al., 2002; Hooper and Berg, 2003; Francino, 2005; Nguyen et al., 2006; Shiu et al., 2006; Cooper et al., 2007; Emerson et al., 2008; Hanada et al., 2008; Korbel et al., 2008; Ponting, 2008; Powell et al., 2008) currently provides the broadest support for the hypothesis that most gene copies that have emerged independent of WGD events are fixed by positive selection in response to a changing environment. Two interesting observations have been made on a genomewide level that do not favor either the neutral or the adaptive view of gene duplications. The first is the observation that gene duplications that remain after WGD events are functionally different from gene duplications that occur individually (Davis and Petrov, 2005; Hakes et al., 2007). It is likely that different evolutionary mechanisms determine which genes are retained after a WGD or after a single-gene duplication, but this observation in itself does not reveal the selection regime present in the course of fixation of single-gene duplications. The second observation is that there is a substantial relaxation of selection in gene copies after a gene duplication (Lynch and Conery, 2000; Kondrashov et al., 2002). This observation is consistent with both neutral fixation and selective fixation: Genes that are fixed by selection for increased gene dosage will experience a period of relaxed selection as long as fitness after a gene duplication has not increased more than twofold (Kondrashov et al., 2002). In addition, the acquisition of a functional novelty may occur rapidly: that is, before enough substitutions accumulate in the diverging lineages to make the Ka/Ks ratio informative.
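For readers less familiar with the Ka/Ks ratio used in the genomewide studies discussed in this section, its conventional reading can be summarized as follows. The symbol ω is introduced here only for convenience and is not used elsewhere in the chapter; Ka is the rate of nonsynonymous substitutions and Ks the rate of synonymous substitutions between two gene copies.

```latex
\omega = \frac{K_a}{K_s}, \qquad
\begin{cases}
\omega < 1 & \text{purifying (negative) selection on amino acid changes} \\
\omega \approx 1 & \text{no detectable constraint (neutral evolution)} \\
\omega > 1 & \text{positive selection on amino acid changes}
\end{cases}
```

Relaxed selection after duplication would appear as ω moving toward 1 in one or both copies. Because both rates are estimated from observed substitutions, the ratio is unreliable for very recent duplicates in which few substitutions have accumulated.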

7 DOSAGE AND GENETIC DOMINANCE

As we have seen from examples described in the literature, an increase in the number of gene copies can lead to an increase or decrease in fitness. Similarly, a decrease in the number of gene copies can have different consequences for fitness. In the 1930s, Sewall Wright related the concept of gene dosage to the fitness of heterozygous deleterious mutations, which in the case of a loss-of-function mutation corresponds to a loss of one copy of a gene (Wright, 1934). The aim of Wright’s model was to explain why some alleles are recessive whereas others are dominant, and he modeled different fitness functions that showed interdependence between dosage (genotype) and fitness (phenotype). Wright’s ideas on how the action of genes and their dosage is manifested in the relationship between genotype and fitness have been developed for certain functions of genes (Kacser and Burns, 1981; Veitia, 2002, 2004; Conrad and Antonarakis, 2007) and extended beyond the concept of dominance to include gene duplications (Kondrashov and Koonin, 2004; Veitia, 2004; Conrad and Antonarakis, 2007). The mathematical intricacies of these models are important for the understanding of the reasons why and how dosage may affect fitness. Wright (1934) and others (Kacser and Burns, 1981; Veitia, 2002, 2004) have shown that within one unifying theory, two types of functions lead to different consequences of decreasing gene dosage, which correspond to recessive and dominant loss-of-function alleles. Similarly, understanding the interplay between the increased dosage of a gene and its impact on fitness provides a basis for predicting the evolutionary fate of gene duplications (Papp et al., 2003; Kondrashov and Koonin, 2004; Birchler and Veitia, 2007; Conrad and Antonarakis, 2007; Lehner, 2008).
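The link between dosage and dominance can be made concrete with a short worked definition. The notation is an assumption of this sketch rather than Wright's own: let w(d) be fitness at gene dosage d, with d = 1 for the wild-type homozygote (AA), d = 1/2 for a heterozygote carrying one loss-of-function allele (Aa), and d = 0 for the null homozygote (aa). The dominance coefficient h of the loss-of-function allele is then the fraction of the total fitness loss that is already seen in the heterozygote:

```latex
h = \frac{w(1) - w\!\left(\tfrac{1}{2}\right)}{w(1) - w(0)}
```

Under a strictly linear w(d), h = 1/2; when w(1/2) is close to w(1), as on a saturating curve, h approaches 0 and the allele is recessive; when w(1/2) is close to w(0), h approaches 1 and the allele is dominant. In the same notation, the fitness effect of a duplication is w(2) - w(1).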


8 DOSAGE THEORY AND GENE DUPLICATIONS


Wright’s idea has been generalized for gene duplications with four different types of fitness functions. Perhaps the most straightforward function is a simple linear one, where a decrease of gene dosage is deleterious and an increase is beneficial. The diminishing returns function proposed by Wright describes instances where both a decrease and an increase of gene dosage have no effect. A function with a clear optimum describes the opposite case of dosage balance (Veitia, 2002), where both decrease and increase of gene dosage are deleterious. Finally, a concave-shaped function describes the case where a decrease of gene dosage may be benign whereas an increase is deleterious (Figure 2). It seems that most genes should be classifiable within these four categories, including those genes that may be showing complete redundancy. Conversely, modeling the fates of gene duplications within a context of gene dosage is possible for a wide variety of fitness outcomes of changes in gene dosage and copy number. The theoretical treatment of gene duplications through dosage, which includes the possibility of complete redundancy as an extreme case, appears to be more universal than the neutral theory, which assumes only complete redundancy. However, two major obstacles remain to our ability to apply this theory to real data. First, we have only a vague idea of how to understand which type of a fitness function is appropriate for a particular gene. In theory, enzymes should be described by a diminishing returns type of function (Kacser and Burns, 1981), and genes encoding isoforms of a large protein should be described by a function with an optimum (Veitia, 2002), whereas proteins

Figure 2 Dependence of fitness on gene dosage (the number of gene copies or genotype). Axes: fitness (phenotype) plotted against protein concentration (genotype), with dosage marked at 0 (aa), 1/2 (Aa), 1 (AA), and 2 (AA,AA). Four types of functions are displayed: in red, a linear fitness function; in blue, a diminishing returns function; in green, a function with a well-defined optimum; and in magenta, a concave function. The area between the two dashed lines defines a region where fitness is similar to the fitness of an individual with one gene copy. (Adapted from Wright, 1934; Kondrashov and Koonin, 2004; Conrad and Antonarakis, 2007.) (See insert for color representation of the figure.)


with a propensity to aggregate should be described by the concave function (Conrad and Antonarakis, 2007). It is less clear what types of gene functions may exhibit a linear relationship between dosage and fitness; however, empirical observations point to genes with binding or regulatory function (Kondrashov and Koonin, 2004). Despite these theoretical considerations, empirical observations of many enzymes providing an adaptive response to stressful environments show that not all enzymes follow the diminishing returns dosage rule. Moreover, it is likely that for many genes the real fitness function will lie somewhere between the four characterizations displayed here. Another conceptual problem with determining the relationship between dosage and fitness for individual genes is that this relationship can depend drastically on the environment. Duplications of genes can be beneficial under stressful conditions but deleterious in a benign environment (Brown et al., 1998; Raymond et al., 1998; Guillemaud et al., 1999; Kondrashov et al., 2002; Foster et al., 2003, 2005; Bourguet et al., 2004; Francino, 2005; Lawrence, 2005); thus under stressful conditions the fitness function is closer to linear, or at least to a fast-growing diminishing returns function, whereas in a normal setting the function is closer to the concave or optimum type. It is possible that for a vast majority of genes, the relationship between dosage and fitness may resemble a very flat diminishing returns function, which is essentially the expectation of the complete redundancy model. On the other hand, there seem to be plenty of data that support a stronger dependence of fitness on dosage. However, many of the genome-level inferences of the fitness impacts of recent, small-scale duplications are obtained through indirect observations, and thus whether or not flat, diminishing returns functions describe most of the genes in the genome remains a matter of opinion.
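The four types of dosage-fitness relationship discussed in this section can be illustrated with a minimal numerical sketch. The functional forms, the parameter values (`k`, `width`, `cost`), and the tolerance `tol` are all assumptions chosen for illustration; the chapter specifies the shapes only qualitatively. Dosage follows the axis of Figure 2: 0.5 for a heterozygous loss of function, 1 for the normal genotype, and 2 for a duplication.

```python
import math

# All functional forms and parameter values are illustrative assumptions;
# the chapter describes the four shapes only qualitatively.
# Dosage d: 0 (aa), 0.5 (Aa), 1 (AA), 2 (AA,AA).

def linear(d):
    """Fitness rises in proportion to dosage."""
    return d

def diminishing_returns(d, k=0.05):
    """Saturating curve, scaled so that fitness equals 1 at d = 1."""
    return d * (1 + k) / (d + k)

def optimum(d, width=0.5):
    """Bell-shaped curve that peaks at the normal dosage d = 1."""
    return math.exp(-((d - 1) ** 2) / (2 * width ** 2))

def concave(d, k=0.05, cost=0.4):
    """Saturating benefit minus a cost for dosage in excess of d = 1."""
    return diminishing_returns(d, k) - cost * max(0.0, d - 1)

SHAPES = {
    "linear": linear,
    "diminishing returns": diminishing_returns,
    "optimum": optimum,
    "concave": concave,
}

def effect(w, d, tol=0.1):
    """Classify the fitness change at dosage d relative to d = 1.

    tol stands in for the band between the dashed lines in Figure 2.
    """
    delta = w(d) - w(1.0)
    if delta > tol:
        return "beneficial"
    if delta < -tol:
        return "deleterious"
    return "near-neutral"

def summarize():
    """Map each shape to (effect of halving, effect of doubling)."""
    return {name: (effect(w, 0.5), effect(w, 2.0)) for name, w in SHAPES.items()}

for name, (halved, doubled) in summarize().items():
    print(f"{name:20s} halved: {halved:12s} doubled: {doubled}")
```

With these assumed parameters the classification reproduces the verbal description: the linear function penalizes halving and rewards doubling, the diminishing returns function tolerates both, the optimum function penalizes both, and the concave function tolerates halving but penalizes doubling. Environmental dependence can be mimicked by changing a parameter: with a much larger `k` (for example `effect(lambda d: diminishing_returns(d, k=1.0), 2.0)`), doubling is classified as beneficial, loosely corresponding to a stressful environment in which the gene product is limiting.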
In addition, evidence of a difference in the functional repertoire of gene duplications created by recent small-scale gene duplications and those that have been retained from WGD events (Davis and Petrov, 2005; Hakes et al., 2007) suggests that different functions may be appropriate for different scales of gene duplications. How are small-scale gene duplications fixed in natural populations? Almost a century after the importance of gene duplications was first realized, we still do not have a solid answer to this question. In this chapter I presented a synopsis of evidence that supports the assertion that positive selection may play a decisive role in the fixation of many gene duplications. Many other authors support the traditional view that most gene duplications are fixed by genetic drift. It seems that only genomewide population genetics studies aimed at probing the selection pressures acting on gene duplications in the course of their fixation may provide a straightforward answer.

REFERENCES

Anderson RP, Roth JR. 1977. Tandem genetic duplications in phage and bacteria. Annu Rev Microbiol 31:473–505. Anderson RP, Roth JR. 1979. Gene duplication in bacteria: alteration of gene dosage by sister-chromosome exchanges. Cold Spring Harb Symp Quant Biol 43(Pt 2):1083–1087. Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Ségurens B, Daubin V, Anthouard V, Aiach N, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178. Birchler JA, Veitia RA. 2007. The gene balance hypothesis: from classical genetics to modern genomics. Plant Cell 19:395–402.


Birchler JA, Riddle NC, Auger DL, Veitia RA. 2005. Dosage balance in gene regulation: biological implications. Trends Genet 21:219–226. Bourguet D, Guillemaud T, Chevillon C, Raymond M. 2004. Fitness costs of insecticide resistance in natural breeding sites of the mosquito Culex pipiens. Evolution 58: 128–135. Bridges CA. 1935. Salivary chromosome maps. J Hered 26:60–64. Brown CJ, Todd KM, Rosenzweig RF. 1998. Multiple duplications of yeast hexose transport genes in response to selection in a glucose-limited environment. Mol Biol Evol 15: 931–942. Buckland PR. 2003. Polymorphically duplicated genes: their relevance to phenotypic variation in humans. Ann Med 35:308–315. Chartier-Harlin MC, Kachergus J, Roumier C, Mouroux V, Douay X, Lincoln S, Levecque C, Larvor L, Andrieux J, Hulihan M, et al. 2004. Alpha-synuclein locus duplication as a cause of familial Parkinson’s disease. Lancet 364:1167–1169. Conrad B, Antonarakis SE. 2007. Gene duplication: a drive for phenotypic diversity and cause of human disease. Annu Rev Genom Hum Genet 8:17–35. Cooper GM, Nickerson DA, Eichler EE. 2007. Mutational and selective effects on copy-number variants in the human genome. Nat Genet 39:S22–S29. Craven SH, Neidle EL. 2007. Double trouble: medical implications of genetic duplication and amplification in bacteria. Future Microbiol 2:309–321. Croce CM. 2008. Oncogenes and cancer. N Engl J Med 358:502–511. Cuervo AM, Stefanis L, Fredenburg R, Lansbury PT, Sulzer D. 2004. Impaired degradation of mutant alpha-synuclein by chaperone-mediated autophagy. Science 305:1292–1295. Davis JC, Petrov DA. 2005. Do disparate mechanisms of duplication add similar genes to the genome? Trends Genet 21:548–551. Degenhardt YY, Wooster R, McCombie RW, Lucito R, Powers S. 2008. High-content analysis of cancer genome DNA alterations. Curr Opin Genet Dev 18:68–72. Derti A, Roth FP, Church GM, Wu CT. 2006. 
Mammalian ultraconserved elements are strongly depleted among segmental duplications and copy number variants. Nat Genet 38:1216–1220. Devonshire AL, Field LM. 1991. Gene amplification and insecticide resistance. Annu Rev Entomol 36:1–23. Devonshire AL, Moores GD. 1982. A carboxylesterase with broad substrate specificity causes organophosphorus, carbamate and pyrethroid resistance in peach–potato aphids (Myzus persicae). Pestic Biochem Physiol 18:235–246. Djogbénou L, Chandre F, Berthomieu A, Dabiré R, Koffi A, Alout H, Weill M. 2008. Evidence of introgression of the ace-1(R) mutation and of the ace-1 duplication in West African Anopheles gambiae ss. PLoS ONE 3:e2172. Dopman EB, Hartl DL. 2007. A portrait of copy-number polymorphism in Drosophila melanogaster. Proc Natl Acad Sci USA 104:19920–19925. Emerson JJ, Cardoso-Moreira M, Borevitz JO, Long M. 2008. Natural selection shapes genome-wide patterns of copy-number polymorphism in Drosophila melanogaster. Science 320:1629–1631. Field LM, Devonshire AL, Forde BG. 1988. Molecular evidence that insecticide resistance in peach–potato aphids (Myzus persicae Sulz.) results from amplification of an esterase gene. Biochem J 251:309–312. Field LM, Blackman RL, Tyler-Smith C, Devonshire AL. 1999. Relationship between amount of esterase and gene copy number in insecticide-resistant Myzus persicae (Sulzer). Biochem J 339:737–742.


Fisher RA. 1935. The sheltering of lethals. Am Nat 69:446–455. Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J. 1999. Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151:1531–1545. Foster SP, Young S, Williamson MS, Duce I, Denholm I, Devine GJ. 2003. Analogous pleiotropic effects of insecticide resistance genotypes in peach–potato aphids and houseflies. Heredity 91:98–106. Foster SP, Denholm I, Thompson R, Poppy GM, Powell W. 2005. Reduced response of insecticide-resistant aphids and attraction of parasitoids to aphid alarm pheromone; a potential fitness trade-off. Bull Entomol Res 95:37–46. Francino MP. 2005. An adaptive radiation model for the origin of new gene functions. Nat Genet 37:573–577. Freeling M. 2008. The evolutionary position of subfunctionalization, downgraded. Genome Dyn 4:25–40. Freeling M, Thomas BC. 2006. Gene-balanced duplications, like tetraploidy, provide predictable drive to increase morphological complexity. Genome Res 16:805–814. Gao LZ, Innan H. 2004. Very low gene duplication rate in the yeast genome. Science 306:1367–1370. Groot PC, Bleeker MJ, Pronk JC, Arwert F, Mager WH, Planta RJ, Eriksson AW, Frants RR. 1989. The human alpha-amylase multigene family consists of haplotypes with variable numbers of genes. Genomics 5:29–42. Gu Z, Cavalcanti A, Chen FC, Bouman P, Li WH. 2002. Extent of gene duplication in the genomes of Drosophila, nematode, and yeast. Mol Biol Evol 19:256–262. Guillemaud T, Raymond M, Tsagkarakou A, Bernard C, Rochard P, Pasteur N. 1999. Quantitative variation and selection of esterase gene amplification in Culex pipiens. Heredity 83:87–99. Hakes L, Pinney JW, Lovell SC, Oliver SG, Robertson DL. 2007. All duplicates are not equal: the difference between small-scale and genome duplication. Genome Biol 8:R209. Haldane JBS. 1933. The part played by recurrent mutation in evolution. Am Nat 67:5–9. Hanada K, Zou C, Lehti-Shiu MD, Shinozaki K, Shiu SH. 2008. 
Importance of lineage-specific expansion of plant tandem duplicates in the adaptive response to environmental stimuli. Plant Physiol 148:993–1003. Hastings PJ. 2007. Adaptive amplification. Crit Rev Biochem Mol Biol 42:271–283. Hemingway J. 2000. The molecular basis of two contrasting metabolic mechanisms of insecticide resistance. Insect Biochem Mol Biol 30:1009–1015. Hemingway J, Hawkes NJ, McCarroll L, Ranson H. 2004. The molecular basis of insecticide resistance in mosquitoes. Insect Biochem Mol Biol 34:653–665. Hendrickson H, Slechta ES, Bergthorsson U, Andersson DI, Roth JR. 2002. Amplification-mutagenesis: evidence that “directed” adaptive mutation and general hypermutability result from growth with a selected gene amplification. Proc Natl Acad Sci USA 99:2164–2169. Hoebler C, Karinthi A, Devaux MF, Guillon F, Gallant DJ, Bouchet B, Melegari C, Barry JL. 1998. Physical and chemical transformations of cereal food during oral digestion in human subjects. Br J Nutr 80:429–436. Hooper SD, Berg OG. 2003. On the nature of gene innovation: duplication patterns in microbial genomes. Mol Biol Evol 20:945–954. Hughes AL. 1999. Adaptive Evolution of Genes and Genomes. New York: Oxford University Press. Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C. 2004. Detection of large-scale variation in the human genome. Nat Genet 36:949–951.


Ibáñez P, Bonnet AM, Débarges B, Lohmann E, Tison F, Pollak P, Agid Y, Dürr A, Brice A. 2004. Causal relation between alpha-synuclein gene duplication and familial Parkinson’s disease. Lancet 364:1169–1171. Jiang Z, Tang H, Ventura M, Cardone MF, Marques-Bonet T, She X, Pevzner PA, Eichler EE. 2007. Ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution. Nat Genet 39:1361–1368. Kacser H, Burns JA. 1981. The molecular basis of dominance. Genetics 97:639–666. Kimura M, King JL. 1979. Fixation of a deleterious allele at one of two “duplicate” loci by mutation pressure and random drift. Proc Natl Acad Sci USA 76:2858–2861. King PH, Waldrop R, Lupski JR, Shaffer LG. 1998. Charcot–Marie–Tooth phenotype produced by a duplicated PMP22 gene as part of a 17p trisomy-translocation to the X chromosome. Clin Genet 54:413–416. Koch AL. 1981. Evolution of antibiotic resistance gene function. Microbiol Rev 45:355–378. Kondrashov FA, Kondrashov AS. 2006. Role of selection in fixation of gene duplications. J Theor Biol 239:141–151. Kondrashov FA, Koonin EV. 2004. A common framework for understanding the origin of genetic dominance and evolutionary fates of gene duplications. Trends Genet 20:287–290. Kondrashov FA, Rogozin IB, Wolf YI, Koonin EV. 2002. Selection in the evolution of gene duplications. Genome Biol 3:R8. Korbel JO, Kim PM, Chen X, Urban AE, Weissman S, Snyder M, Gerstein MB. 2008. The current excitement about copy-number variation: how it relates to gene duplications and protein families. Curr Opin Struct Biol 18:366–374. Kugelberg E, Kofoid E, Reams AB, Andersson DI, Roth JR. 2006. Multiple pathways of selected gene amplification during adaptive mutation. Proc Natl Acad Sci USA 103:17319–17324. Labbé P, Berthomieu A, Berticat C, Alout H, Raymond M, Lenormand T, Weill M. 2007a. Independent duplications of the acetylcholinesterase gene conferring insecticide resistance in the mosquito Culex pipiens.
Mol Biol Evol 24:1056–1067. Labbé P, Berticat C, Berthomieu A, Unal S, Bernard C, Weill M, Lenormand T. 2007b. Forty years of erratic insecticide resistance evolution in the mosquito Culex pipiens. PLoS Genet 3:e205. Lawrence JG. 2005. Common themes in the genome strategies of pathogens. Curr Opin Genet Dev 15:584–588. Lehner B. 2008. Selection to minimise noise in living systems and its implications for the evolution of gene expression. Mol Syst Biol 4:170. Li WH. 1980. Rate of gene silencing at duplicate loci: a theoretical study and interpretation of data from tetraploid fishes. Genetics 95:237–258. Li WH. 1997. Molecular Evolution. Sunderland, MA: Sinauer Associates. Li X, Schuler MA, Berenbaum MR. 2007. Molecular mechanisms of metabolic resistance to synthetic and natural xenobiotics. Annu Rev Entomol 52:231–253. Liang H, Fernández A. 2008. Evolutionary constraints imposed by gene dosage balance. Front Biosci 13:4373–4378. Liang H, Plazonic KR, Chen J, Li WH, Fernández A. 2008. Protein under-wrapping causes dosage sensitivity and decreases gene duplicability. PLoS Genet 4:e11. Lupski JR, Stankiewicz P. 2005. Genomic disorders: molecular mechanisms for rearrangements and conveyed phenotypes. PLoS Genet 1:e49. Lupski JR, de Oca-Luna RM, Slaugenhaupt S, Pentao L, Guzzetta V, Trask BJ, Saucedo-Cardenas O, Barker DF, Killian JM, Garcia CA, et al. 1991. DNA duplication associated with Charcot–Marie–Tooth disease type 1A. Cell 66:219–232.


Lupski JR, Wise CA, Kuwano A, Pentao L, Parke JT, Glaze DG, Ledbetter DH, Greenberg F, Patel PI. 1992. Gene dosage is a mechanism for Charcot–Marie–Tooth disease type 1A. Nat Genet 1:29–33. Lynch M, Conery JS. 2000. The evolutionary fate and consequences of duplicate genes. Science 290:1151–1155. Lynch M, Force A. 2000. The probability of duplicate gene preservation by subfunctionalization. Genetics 154:459–473. Maere S, De Bodt S, Raes J, Casneuf T, Van Montagu M, Kuiper M, Van de Peer Y. 2005. Modeling gene and genome duplications in eukaryotes. Proc Natl Acad Sci USA 102:5454–5459. Maynard Smith J. 1978. The Evolution of Sex. Cambridge, UK: Cambridge University Press. Meisler MH, Ting CN. 1993. The remarkable evolutionary history of the human amylase genes. Crit Rev Oral Biol Med 4:503–509. Moore RC, Purugganan MD. 2003. The early stages of duplicate gene evolution. Proc Natl Acad Sci USA 100:15682–15687. Moore RC, Purugganan MD. 2005. The evolutionary dynamics of plant duplicate genes. Curr Opin Plant Biol 8:122–128. Mouchès C, Pasteur N, Bergé JB, Hyrien O, Raymond M, de Saint Vincent BR, de Silvestri M, Georghiou GP. 1986. Amplification of an esterase gene is responsible for insecticide resistance in a Californian Culex mosquito. Science 233:778–780. Newcomb RD, Gleeson DM, Yong CG, Russell RJ, Oakeshott JG. 2005. Multiple mutations and gene duplications conferring organophosphorus insecticide resistance have been selected at the Rop-1 locus of the sheep blowfly, Lucilia cuprina. J Mol Evol 60:207–220. Nguyen DQ, Webber C, Ponting CP. 2006. Bias of selection on human copy-number variants. PLoS Genet 2:e20. Ohno S. 1970. Evolution by Gene Duplication. New York: Springer-Verlag. Ohta T. 1990. How gene families evolve. Theor Popul Biol 37:213–219. Osada N, Innan H. 2008. Duplication and gene conversion in the Drosophila melanogaster genome. PLoS Genet 4:e1000305. Papp B, Pál C, Hurst LD. 2003. Dosage sensitivity and the evolution of gene families in yeast.
Nature 424:194–197. Pasteur N, Raymond M. 1996. Insecticide resistance genes in mosquitoes: their mutations, migration, and selection in field populations. J Hered 87:444–449. Patel PI, Roa BB, Welcher AA, Schoener-Scott R, Trask BJ, Pentao L, Snipes GJ, Garcia CA, Francke U, Shooter EM, et al. 1992. The gene for the peripheral myelin protein PMP-22 is a candidate for Charcot–Marie–Tooth disease type 1A. Nat Genet 1:159–165. Paton MG, Karunaratne SH, Giakoumaki E, Roberts N, Hemingway J. 2000. Quantitative analysis of gene amplification in insecticide-resistant Culex mosquitoes. Biochem J 346: 17–24. Payne SR, Kemp CJ. 2005. Tumor suppressor genetics. Carcinogenesis 26:2031–2045. Perry GH, Dominy NJ, Claw KG, Lee AS, Fiegler H, Redon R, Werner J, Villanea FA, Mountain JL, Misra R, et al. 2007. Diet and the evolution of human amylase gene copy number variation. Nat Genet 39:1256–1260. Polymeropoulos MH, Lavedan C, Leroy E, Ide SE, Dehejia A, Dutra A, Pike B, Root H, Rubenstein J, Boyer R, et al. 1997. Mutation in the alpha-synuclein gene identified in families with Parkinson’s disease. Science 276:2045–2047. Ponting CP. 2008. The functional repertoires of metazoan genomes. Nat Rev Genet 9: 689–698.

Powell AJ, Conant GC, Brown DE, Carbone I, Dean RA. 2008. Altered patterns of gene duplication and differential gene gain and loss in fungal pathogens. BMC Genom 9:147.
Prince VE, Pickett FB. 2002. Splitting pairs: the diverging fates of duplicated genes. Nat Rev Genet 3:827–837.
Pronk JC, Frants RR, Jansen W, Eriksson AW, Tonino GJ. 1982. Evidence of duplication of the human salivary amylase gene. Hum Genet 60:32–35.
Qian W, Zhang J. 2008. Gene dosage and gene duplicability. Genetics 179:2319–2324.
Rapoport IA. 1940. Mnogokratnye linejnye povtoreniya uchastkov khromosom i ikh evolyucionnoe znachenie. [Multiple linear repeats of chromosome segments and their evolutionary significance.] Zh Obshch Biol 1:235–270.
Raymond M, Chevillon C, Guillemaud T, Lenormand T, Pasteur N. 1998. An overview of the evolution of overproduced esterases in the mosquito Culex pipiens. Philos Trans R Soc Lond B 353:1707–1711.
Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, et al. 2006. Global variation in copy number in the human genome. Nature 444:444–454.
Roossinck MJ. 1997. Mechanisms of plant virus evolution. Annu Rev Phytopathol 35:191–209.
Romero D, Palacios R. 1997. Gene amplification and genomic plasticity in prokaryotes. Annu Rev Genet 31:91–111.
Schwab M. 1999. Oncogene amplification in solid tumors. Semin Cancer Biol 9:319–325.
Scott JG. 1999. Cytochromes P450 and insecticide resistance. Insect Biochem Mol Biol 29:757–777.
Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa H, Walker M, Chi M, et al. 2004. Large-scale copy number polymorphism in the human genome. Science 305:525–528.
Shackelton LA, Holmes EC. 2004. The evolution of large DNA viruses: combining genomic information of viruses and their hosts. Trends Microbiol 12:458–465.
Shiu SH, Byrnes JK, Pan R, Zhang P, Li WH. 2006. Role of positive selection in the retention of duplicate genes in mammalian genomes. Proc Natl Acad Sci USA 103:2232–2236.
Singleton AB, Farrer M, Johnson J, Singleton A, Hague S, Kachergus J, Hulihan M, Peuralinna T, Dutra A, Nussbaum R, et al. 2003. α-Synuclein locus triplication causes Parkinson’s disease. Science 302:841.
Sonti RV, Roth JR. 1989. Role of gene duplications in the adaptation of Salmonella typhimurium to growth on limiting carbon sources. Genetics 123:19–28.
Stark GR, Wahl GM. 1984. Gene amplification. Annu Rev Biochem 53:447–491.
Stoltzfus A. 1999. On the possibility of constructive neutral evolution. J Mol Evol 49:169–181.
Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, Thorne N, Redon R, Bird CP, de Grassi A, Lee C, et al. 2007. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 315:848–853.
Tabashnik BE. 1990. Implications of gene amplification for evolution and management of insecticide resistance. J Econ Entomol 83:1170–1176.
Taylor M, Feyereisen R. 1996. Molecular biology and evolution of resistance of toxicants. Mol Biol Evol 13:719–734.
Taylor JS, Raes J. 2004. Duplication and divergence: the evolution of new genes and old ideas. Annu Rev Genet 38:615–643.
Thomas BC, Pedersen B, Freeling M. 2006. Following tetraploidy in an Arabidopsis ancestor, genes were removed preferentially from one homeolog leaving clusters enriched in dose-sensitive genes. Genome Res 16:934–946.

Valdez IH, Fox PC. 1991. Interactions of the salivary and gastrointestinal systems: I. The role of saliva in digestion. Dig Dis 9:125–132.
Veitia RA. 2002. Exploring the etiology of haploinsufficiency. Bioessays 24:175–184.
Veitia RA. 2004. Gene dosage balance in cellular pathways: implications for dominance and gene duplicability. Genetics 168:569–574.
Velkov VV. 1982. Gene amplification in prokaryotic and eukaryotic systems. Genetika 18:529–543.
Vontas JG, Small GJ, Hemingway J. 2000. Comparison of esterase gene amplification, gene expression and esterase activity in insecticide susceptible and resistant strains of the brown planthopper, Nilaparvata lugens (Stål). Insect Mol Biol 9:655–660.
Wagner A. 1998. The fate of duplicated genes: loss or new function? Bioessays 20:785–788.
Walsh JB. 1995. How often do duplicated genes evolve new functions? Genetics 139:421–428.
Webb C. 2003. A complete classification of Darwinian extinction in ecological interactions. Am Nat 161:181–205.
Weir B, Zhao X, Meyerson M. 2004. Somatic alterations in the human cancer genome. Cancer Cell 6:433–438.
Widholm JM, Chinnala AR, Ryu JH, Song HS, Eggett T, Brotherton JE. 2001. Glyphosate selection of gene amplification in suspension cultures of 3 plant species. Physiol Plant 112:540–545.
Wilson TG. 2001. Resistance of Drosophila to toxins. Annu Rev Entomol 46:545–571.
Wolf JB, Brodie ED III, Wade MJ (eds.). 2000. Epistasis and the Evolutionary Process. Oxford, UK: Oxford University Press.
Wright S. 1934. Physiological and evolutionary theories of dominance. Am Nat 68:24–53.
Yasui K, Mihara S, Zhao C, Okamoto H, Saito-Ohara F, Tomida A, Funato T, Yokomizo A, Naito S, Imoto I, et al. 2004. Alteration in copy numbers of genes as a mechanism for acquired drug resistance. Cancer Res 64:1403–1410.

5

Myths and Realities of Gene Duplication

AUSTIN L. HUGHES and ROBERT FRIEDMAN
Department of Biological Sciences, University of South Carolina, Columbia, South Carolina

1 INTRODUCTION

According to Li (1983, p. 14), “gene duplication is probably the most important mechanism for generating new genes and new biochemical processes that have facilitated the evolution of complex organisms from primitive ones.” Many evolutionary theorists have emphasized the importance of gene duplication in evolution (Nei, 1969; Ohno, 1973; Kimura and Ohta, 1974; Hughes, 1999a). Although the availability of complete genome sequences, particularly of eukaryotes, has provided us with an extraordinarily rich database on the gene duplication events that have occurred over evolutionary history, the mechanisms by which new gene functions evolve in connection with gene duplication remain elusive.

In this chapter we briefly review some of the main theoretical ideas that have been proposed regarding the evolution of new gene and protein function after gene duplication. We then review some data regarding the evolution of new gene functions in the light of theory. We emphasize in particular the results of our own studies over the past decade, which illustrate the complexity of gene evolution and the difficulty of making general statements that are applicable to every case. Indeed, our view of the evolutionary process is that by its very nature, it defies easy generalization. Evolution depends on the haphazard and unpredictable raw material of mutation, filtered through such population processes as genetic drift and natural selection. What natural selection favors will be what works in terms of reproductive success, not necessarily what is “well designed” by any standard derived from human engineering. And what works may come about by many different pathways, suggesting that evolutionary biologists must always adopt a pluralistic mindset, ready to acknowledge that in biology just about every general rule has exceptions.

Unfortunately, evolutionary biology at the present time remains shackled by an outmoded way of thinking inherited from the Neo-Darwinists of the early twentieth century. The Neo-Darwinists tended to view natural selection as a kind of magic that can accomplish whatever is needed to enable an organism to achieve optimal adaptation to a given environment. This naive view was based on the oft-stated belief that most natural populations contain sufficient heritable variation to respond to any challenge that the environment might impose. The reality is far different. The ability to respond to environmental change can rarely rely on preexisting variation within a population but, rather, depends on new mutations. The numerous species that have been driven to extinction as a result of human disruption of their environments in the past few centuries provide a dramatic illustration of the impotence of natural selection to respond to environmental challenges in the absence of appropriate mutations.

Moreover, because evolutionary biology is a historical science, it depends on the type and quality of evidence available. There may be certain questions about the past history of life on Earth that we will never be able to answer with any certainty, simply because the evidence that might enable us to answer them is no longer available to us. In such cases we can hypothesize and can deem certain hypotheses to be more plausible than others on the basis of what data are available. But the elaboration and testing of evolutionary hypotheses always require a certain degree of humility, the ability to distinguish what we know for certain from what is only a plausible story.

In our view, hypothesis testing is the essence of the scientific process; and no hypothesis should be accepted without rigorous and thorough testing against alternatives. Because evolutionary biology is a historical science, scientists working in this field need to be particularly rigorous and careful in testing hypotheses in order to distinguish what is solidly established from what remains mere speculation. Unfortunately, evolutionary biologists have frequently been remiss in this regard. In a famous paper, Gould and Lewontin (1979) decried the “adaptive storytelling” rampant among Neo-Darwinists. One would hope that the situation would have improved with the advent of molecular data, but in many ways it has gotten worse.
The proliferation of numerous ill-founded statistical methods has given rise to a kind of “computer-assisted storytelling” that purports to test hypotheses but in fact does not adequately consider alternatives (Hughes et al., 2006). As a result of the uncritical acceptance of untested hypotheses, modern molecular evolutionary biology has become riddled with mythical thinking. In order to come to an accurate understanding of the role of gene duplication in evolution, it will be necessary to free ourselves from myths and accept only what is solidly established. As well as a review of some major hypotheses and relevant data in the field of gene duplication, our chapter represents a plea for critical thinking in the field of gene duplication.

2 HYPOTHESES OF NEW GENE FUNCTION

2.1 Mutation During Redundancy

The cytogeneticist Susumu Ohno (1973) proposed that the origin of a new function after gene duplication must involve a stage in which the duplicate gene is completely redundant and therefore free to vary at random. In Ohno’s description, this redundancy allows the accumulation of random mutations, which might by chance give rise to a new function. He described the proposed scenario as follows:

“The mechanism of gene duplication provides a temporary escape from the relentless pressure of natural selection to a duplicated copy of a functional gene locus. While being ignored by natural selection, a duplicated and thus redundant copy is free to accumulate all manner of randomly sustained mutations. As a result, it may become a degenerate, nonsense DNA base sequence. Occasionally, however, it may acquire a new active site sequence, therefore a new function and emerge triumphant as a new gene locus.” (Ohno, 1973, p. 259)

Ohno’s model became the dominant one over the next two decades, although occasionally other authors expressed or implied different views. Hughes (1994a) noted a number of reasons why the scenario proposed by Ohno [the mutation during redundancy (MDR) model] is unlikely to apply in most cases where a new function has evolved after gene duplication.

First, the MDR model implies an absence of constraint on duplicate genes. Yet comparison of the pattern of synonymous and nonsynonymous nucleotide substitution in duplicate genes in a polyploid animal, the frog Xenopus laevis, supported the hypothesis that duplicate genes are subject to purifying selection and thus are not freed from all constraint (Hughes and Hughes, 1993).

In addition, there were at least some cases in which credible evidence supported the hypothesis that functional divergence between members of multigene families occurred as a result of positive Darwinian selection favoring multiple amino acid changes in functionally important regions (Hughes, 1994a). Ohno’s (1973) model assumed that mutations giving rise to a new function occur at random during a period of nonfunctionality, and positive selection is not really compatible with this model. Hughes (1999a) reviewed examples of apparent positive selection leading to diversification of the members of multigene families. Since that time, there has been an explosion in the literature of cases in which positive Darwinian selection is claimed to have been involved in the diversification of gene family members. Unfortunately, most of these claims have been based on “codon-based” methods of testing for positive selection, which rest on a false assumption (Hughes, 2007). Therefore, the majority of these claims are probably not valid (see below).

In addition, there was evidence that functional differentiation might precede rather than follow gene duplication. The most striking example involved the lens crystallins of animals.
In the evolution of eyes, it appears that a wide variety of different proteins have been recruited to act as crystallins, the major structural protein of the lens. Moreover, Piatigorsky and colleagues provided evidence that this evolutionary process regularly involves a stage they called gene sharing, where a single gene encodes both a crystallin and a protein performing a completely different function (Piatigorsky and Wistow, 1991). For example, the δ-crystallins of certain birds do double duty as argininosuccinate lyases (Piatigorsky and Wistow, 1991). A pattern in which gene sharing precedes gene duplication suggests a very different model than that proposed by Ohno, which envisages the new function as a result of random exploration of functional space. It also contradicts the pronouncement of Kimura and Ohta (1974, p. 429) that “gene duplication must always precede the emergence of a gene having a new function.”

Finally, molecular biology made available the sequences of many pseudogenes. It seems obvious that most pseudogenes are too severely damaged by mutation for Ohno’s concept of the resurrection of a functionless gene to be plausible in most cases. Loss of function in the case of genes seems to be in general a one-way street.

However, since one should never say “never” when it comes to evolution, it is still possible that Ohno’s scenario or something like it has occurred in a small number of cases. One possibility might involve cases where a certain type of mutation that damages one function of a gene might thereby give rise to a new function. As a possible example, there are several known cases where a competitive inhibitor or antagonist of a given signaling molecule is encoded by a gene that is homologous to the gene encoding that signaling molecule. For example, the interleukin-1 receptor antagonist (IL-1RA) in mammals is clearly homologous to the functional interleukins IL-1α and IL-1β (Eisenberg et al., 1991). One possible scenario for the origin of an antagonist from a homologous functional cytokine would be gene duplication followed by mutation in one gene copy that abolished the encoded protein’s signaling function without damaging its ability to bind the IL-1 receptor (Hughes, 1994b).

On the other hand, it is worth noting that damage to a duplicate gene may occur during the process of gene duplication itself. Thus, the prolonged period of redundancy during which mutations accumulate (as envisaged by Ohno) may not be necessary at all. Analysis of evolutionarily recent gene duplicates in the nematode worm Caenorhabditis elegans has revealed a surprising degree of structural heterogeneity between duplicates (Katju and Lynch, 2006). Gene duplications may be partial, and the loss of one or more exons may immediately create a gene with an altered function. Similarly, duplicate genes may pick up new exons from a variety of sources, including both coding and noncoding sequences (Katju and Lynch, 2006).

2.2 Gene-Sharing Model

Hughes (1994a) pointed out that gene sharing in crystallins suggests a general model for the evolution of new gene (and protein) functions: namely, that in cases where new functions have evolved after gene duplication, a period of gene sharing typically precedes duplication (Figure 1). When a gene is bifunctional (or multifunctional), gene duplication permits partitioning of the ancestral functions between the daughter genes, allowing them to specialize for distinct subsets of the ancestral function.
Because the functions divided among the daughter genes are already present in the ancestral gene, this model requires no period of redundancy during which a new function may emerge by chance. On this model, gene sharing is not just a peculiarity of crystallins but represents the standard pathway by which new gene functions evolve.

Figure 1 Subdivision of ancestral functions between daughter genes (A′ and A″) after duplication of an ancestral bifunctional gene. [Diagram labels: A, gene duplication, A′, A″, function 1, function 2.]

The idea that a period of gene sharing might precede duplications that lead to functional differentiation was, in fact, not new. Several previous authors had developed or implied a similar scenario. For example, Goodman and colleagues (1975) hypothesized that the vertebrate hemoglobin molecule was originally a homotetramer, prior to the gene duplication that gave rise to separate α-chain and β-chain genes. Along similar lines, Jensen (1976) and Jensen and Byng (1981) proposed that two enzymes that are specialized to catalyze two separate reactions might evolve, after gene duplication, from an enzyme capable of catalyzing both reactions. And the biochemist Leslie Orgel (1977), in a brief paper, specifically proposed that functional differentiation might precede rather than follow gene duplication.

The GAL genes of brewer’s yeast, Saccharomyces cerevisiae, provide a potential example (Hughes, 1999a). These genes, which encode proteins involved in the metabolism of galactose, are inducible by galactose and repressible by glucose (Johnston, 1987). The protein product of GAL1, designated Gal1p, is a galactokinase, which catalyzes the production of galactose-1-phosphate from ATP and galactose, thereby initiating the pathway of galactose metabolism. Gal3p, the product of the GAL3 gene, shows clear evidence of a close evolutionary relationship with Gal1p but has no galactokinase activity (Bajwa et al., 1988; Bhat et al., 1990). Gal4p, the major transcription factor for the GAL genes, is inhibited by Gal80p (Figure 2A). In the presence of galactose and ATP, Gal3p acts as a coinducer for GAL gene expression, removing the inhibition by binding Gal80p and thus enabling Gal4p to activate transcription (Yano and Fukasawa, 1997; Figure 2A). In a related fungus species, Kluyveromyces lactis, there is no GAL3 gene; and Gal1p functions as both a coinducer and a galactokinase (Meyer et al., 1991). Interestingly, in S. cerevisiae itself, there are mutants lacking GAL3 expression in which Gal1p appears to be able to take over the regulatory role of Gal3p. Phylogenetic analysis supports the hypothesis that the common ancestor of S.
cerevisiae GAL1 and GAL3 encoded a bifunctional protein that, like that of K. lactis, functioned as both a coinducer and a galactokinase (Hughes, 1999a; Figure 2B). S. cerevisiae Gal3p is characterized by the deletion of two amino acid residues (Ser-Ala) relative to known functional galactokinases (Hughes, 1999a); and it has been shown that experimental reinsertion of these two residues restores galactokinase activity to Gal3p (Platt et al., 2000). Thus, after gene duplication, the functions of the ancestral gene were partitioned between GAL1 and GAL3, and the loss of enzymatic function accompanied specialization of Gal3p as a coinducer (Hughes, 1999a).

A recent series of experiments by Hittinger and Carroll (2007) reveals the complexity of the functional changes after gene duplication in this apparently simple case. First, they showed that reinsertion of the deleted Ser-Ala dipeptide into Gal3p, while restoring galactokinase activity, caused the resulting molecule to be a subpar coinducer. Both K. lactis Gal1p and S. cerevisiae Gal1p were shown to be worse coinducers than the intact Gal3p; but the modified Gal3p was a still worse coinducer than either of those two functional galactokinases (Hittinger and Carroll, 2007). These results indicate that gene duplication has been followed by specialization on the part of Gal3p and that loss of its galactokinase activity has made possible increased effectiveness as a coinducer. Moreover, S. cerevisiae GAL1 and GAL3 have specialized in their promoters as well as in their coding sequences. The promoter for GAL3 has only one binding site for Gal4p, resulting in weakly inducible expression (Hittinger and Carroll, 2007). By contrast, the promoter for GAL1 has four binding sites for Gal4p, resulting in strongly inducible expression (Hittinger and Carroll, 2007). Moreover, the promoter of S. cerevisiae GAL1 has been structurally remodeled in comparison to that of K.
lactis GAL1, which presumably represents the ancestral state prior to gene duplication in the Saccharomyces lineage. The result is a more strongly inducible Gal1p in S. cerevisiae than in K. lactis (Hittinger and Carroll, 2007). Thus, the GAL1/GAL3 example suggests that specialization by daughter genes has made possible more precise adaptation to functions that were previously shared by a bifunctional ancestral protein. Some other examples suggest that a similar process has occurred (Hughes, 2005).

Figure 2 (A) Function of Gal3p in brewer’s yeast, Saccharomyces cerevisiae. Gal3p de-represses expression of the GAL genes, a role that can be assumed by Gal1p in the absence of Gal3p (bottom). (B) Phylogeny of galactosidases. [Panel A labels: Gal3p, Gal80p, Gal4p, GAL genes repressed; ATP, galactose, Gal80p, Gal3p, Gal4p, transcription of GAL genes; galactose, ATP, Gal80p, Gal1p, Gal4p, transcription of GAL genes. Panel B labels: other eukaryotic galactosidases, K. lactis Gal1p, S. carlsbergensis Gal1p, S. cerevisiae Gal1p, S. cerevisiae Gal3p.]

Note that the gene-sharing model of the evolution of new functions does not imply that all or even most gene duplications involve bifunctional genes. Gene duplication occurs continually in the evolution of genomes, and most duplicate genes are probably eventually lost (Lynch and Conery, 2000). Rather, it predicts that in the limited number of cases where gene duplication has given rise to functionally distinct daughter genes, the ancestor performed both functions (Hughes, 1994a). Thus, the immediate ancestor of each pair of functionally distinct paralogs is predicted to have performed both of the daughter genes’ distinct functions, although not necessarily as well as the daughter genes do.

Biologists may have been reluctant to see in gene sharing a general model for gene duplication because of a tendency to expect each gene (or protein) to have a single function. However, in the present era of systems biology, such a view seems outmoded (Hughes, 2005). Data on gene expression patterns and on networks of gene and protein interaction have provided striking evidence that the function of every gene is multidimensional. Thus, partitioning of the multidimensional functional space seems a possible occurrence after the duplication of any gene (Hughes, 2005; Piatigorsky, 2007). However, it remains uncertain how widespread the gene-sharing scenario has in fact been in the history of life.

2.3 DDC Model

Lynch and Force (2000) proposed an additional model of how gene functions might be shared among daughter genes after gene duplication, a process they called subfunctionalization. Their model, called duplication–degeneration–complementation (DDC), envisages complementary loss of function in the two duplicates, thereby rendering the loss of either duplicate disadvantageous. The simplest example of this model might be to imagine a gene that is expressed in two different tissues (say, liver and kidney) because it has binding sites for two different tissue-specific transcription factors (Figure 3). After duplication, a mutation might occur in the promoter region of one gene copy that damages the liver-specific transcription factor–binding site and thus eliminates gene expression in liver. But such a mutation is not strongly deleterious because the other gene copy retains expression in liver. Thus, a loss-of-function mutation in the liver-specific promoter of one of the gene copies may drift to fixation. Meanwhile, a similar loss-of-function mutation in the kidney-specific promoter of the other gene may drift to fixation; again, such a mutation is not strongly deleterious because it complements the loss-of-function mutation in the other gene (Figure 3). The result, however, is one liver-specific gene and one kidney-specific gene. Assuming that the organism requires expression of the protein in both tissues, purifying selection will oppose any further mutation eliminating the expression or protein function of either of the two genes. Note that although the DDC model is most easily illustrated in terms of tissue-specific expression, it might apply to any type of gene or protein function.

Figure 3 How the DDC model might operate in the case of a gene with two transcription factor–binding sites (TF binding sites A and B) upstream of a coding region. After gene duplication, complementary mutations knock out each binding site in one of the daughter genes.
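The liver and kidney scenario described for the DDC model can be sketched as a small simulation. What follows is a toy illustration added here, not the Lynch and Force (2000) model itself: it assumes two duplicate copies, each with two tissue-specific binding sites and one coding region, and it assumes that a random loss-of-function mutation is fixed by drift whenever both tissues still receive expression from at least one intact copy, and is otherwise removed by purifying selection. The function name `simulate_duplicate_fate`, the target weights, and the two outcome labels are hypothetical.

```python
import random
from collections import Counter

SUBFUNCTIONS = ("liver", "kidney")


def simulate_duplicate_fate(rng, coding_weight=1.0, enhancer_weight=1.0):
    """Follow one duplicate pair until its fate is resolved (toy rules, see text)."""
    # Each copy starts with both binding sites and an intact coding region.
    copies = [{"enhancers": set(SUBFUNCTIONS), "coding": True} for _ in range(2)]
    targets = ["coding", *SUBFUNCTIONS]
    weights = [coding_weight, enhancer_weight, enhancer_weight]

    def expressed(state):
        # Tissues in which at least one copy with an intact coding region is expressed.
        tissues = set()
        for copy in state:
            if copy["coding"]:
                tissues |= copy["enhancers"]
        return tissues

    while True:
        i = rng.randrange(2)  # which copy is hit
        target = rng.choices(targets, weights=weights)[0]
        trial = [{"enhancers": set(c["enhancers"]), "coding": c["coding"]} for c in copies]
        if target == "coding":
            trial[i]["coding"] = False
        else:
            trial[i]["enhancers"].discard(target)
        # Purifying selection: reject mutations that remove expression from a required tissue.
        if expressed(trial) != set(SUBFUNCTIONS):
            continue
        copies = trial
        a, b = copies
        if not (a["coding"] and b["coding"] and a["enhancers"] and b["enhancers"]):
            return "nonfunctionalization"  # one copy has been silenced
        if len(a["enhancers"]) == 1 and len(b["enhancers"]) == 1:
            return "subfunctionalization"  # complementary single-tissue copies


rng = random.Random(0)
fates = Counter(simulate_duplicate_fate(rng) for _ in range(10_000))
print(fates)
```

Under these toy rules with equal weights, enumerating the possible mutations gives an expected subfunctionalization fraction of 2/9; the remaining pairs lose one copy. Raising `enhancer_weight` relative to `coding_weight` shifts the balance toward subfunctionalization, which is only a property of this sketch and not a quantitative claim about real genomes.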

The DDC model cannot be applied straightforwardly to several well-studied cases, including that of the yeast galactokinases discussed above. The loss of galactokinase activity in S. cerevisiae Gal3p represents an apparent example of the type of “degeneration” envisaged in the DDC scenario, as does the loss of Gal4p-binding sites in the GAL3 promoter. However, there is no evidence of a “complementary” loss of function on the part of S. cerevisiae GAL1. The evidence suggests that S. cerevisiae Gal1p is no worse as a coinducer than is the bifunctional Gal1p of K. lactis (Hittinger and Carroll, 2007). Moreover, the change in the promoter of S. cerevisiae GAL1 has involved a change in the helical phasing of Gal4p binding sites, a change that can in no way be considered “degeneration” of the sort assumed by DDC. Thus, as with other models discussed here, the DDC model may be somewhat oversimplified.

Nonetheless, the DDC model is important because it shows how shared ancestral functions can be parceled out between duplicates by processes involving only mutation and genetic drift, without requiring any role for positive Darwinian selection. If the “degenerative” mutations involved are either neutral or slightly deleterious, their chance of fixation will be greater when the effective population size is small. The effective population sizes of multicellular eukaryotes are smaller in general than those of unicellular eukaryotes or prokaryotes; and the difference in population size may be one factor contributing to the fact that gene families tend to be much larger in the former than in the latter (Lynch and Conery, 2003). However, because there are other factors at work, it is difficult to attribute the difference in gene family size between multicellular and unicellular organisms to population size difference alone.
For one thing, it seems likely that multicellularity creates more opportunities for subfunctionalization than does unicellularity; for example, gene expression can be subdivided by tissue. Second, the circular chromosomes typical of prokaryotes may impose some upper bound on genome size, resulting in some degree of purifying selection against gene duplication that is probably absent in most multicellular eukaryotes. It is interesting that in prokaryotes, the mean gene family size and the total nucleotide content of the genome are strongly correlated (Hughes et al., 2005; Figure 4). Such a close relationship seems unlikely in the case of eukaryotes, where most variation in genome size is probably due to variation in the content of repeating DNA in the genome (Hughes and Piontkivska, 2005). In prokaryotes themselves, it seems very unlikely, in fact, that species with large genomes, and thus large gene families, also have small long-term effective population sizes.

Figure 4 Mean number of genes per gene family as a function of genome size in 99 prokaryotic genomes. (From Hughes et al., 2005.)
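The statement that slightly deleterious mutations fix more readily in small populations can be made concrete with Kimura's diffusion approximation for the fixation probability of a new mutation. The sketch below is an illustration added here, not a calculation from the chapter. It assumes a diploid population whose effective size equals its census size N, an additive selection coefficient s, and an initial frequency of 1/(2N), in which case the fixation probability is (1 − e^(−2s)) / (1 − e^(−4Ns)). The function name and the example values of s and N are hypothetical.

```python
import math


def fixation_probability(s, n):
    """Kimura's approximation for a new mutation in a diploid population of size n."""
    if s == 0:
        return 1.0 / (2 * n)  # neutral limit
    # expm1(x) = e^x - 1; using it keeps precision when s is very small.
    return math.expm1(-2 * s) / math.expm1(-4 * n * s)


s = -1e-4  # slightly deleterious mutation (assumed value)
for n in (100, 1_000, 10_000, 100_000):
    relative = fixation_probability(s, n) / fixation_probability(0, n)
    print(f"N = {n:>7}: fixation probability relative to neutral = {relative:.3g}")
```

With these assumed values the mutation behaves almost neutrally when N is in the hundreds and is effectively never fixed when N is in the hundreds of thousands, which is the qualitative contrast drawn in the text between organisms with small and large effective population sizes.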

3 ROLE OF NATURAL SELECTION

In present-day organisms, we can recognize certain gene duplications that occurred in the past and gave rise to important new organismal adaptations. However, the role of positive Darwinian selection (that is, natural selection favoring advantageous mutations) in this process is not always clear. Part of the reason for this is that many such gene duplications occurred in the quite distant past, and the statistical methods often used to test for positive Darwinian selection (which involve comparing synonymous and nonsynonymous substitutions) are not applicable because synonymous sites are saturated with changes. Further confusion arises from the fact that certain widely used methods of testing for positive selection depend on seriously flawed assumptions and thus do not provide trustworthy evidence. Moreover, many (probably most) of the mutations that give rise to new protein functions leave no identifiable “signature” of positive selection.

In a few cases, natural selection may favor a series of changes in the amino acid sequence of a protein that fit it better for a specialized task. A powerful tool in testing hypotheses regarding this type of natural selection is a comparison of the patterns of synonymous and nonsynonymous (amino acid–altering) nucleotide substitution (Hughes and Nei, 1988). Unfortunately, this approach has been widely abused in recent years (Hughes, 2007), and published claims that natural selection has acted to diversify members of multigene families need to be treated with caution.

One of the earliest published cases that compared synonymous and nonsynonymous substitutions among members of a multigene family involved a highly unusual kind of multigene family, the variable region genes (or, more strictly speaking, gene segments) of mammalian immunoglobulins (Tanaka and Nei, 1989).
In the portion of these gene segments encoding the CDR region of the immunoglobulin, which binds antigens, the number of nonsynonymous nucleotide differences per nonsynonymous site (pN ) was found to exceed the number of synonymous differences per synonymous site (pS ). By contrast, in the remainder of the gene segment (encoding the framework region) the reverse pattern was seen, as in most genes (Tanaka and Nei, 1989). The fact that this highly unusual pattern of nucleotide substitution was seen in the CDR, where a diversity of amino acid sequences is likely to be advantageous because it enables the host to bind a diverse array of foreign antigens, supports the hypothesis that natural selection has acted to favor amino acid changes in the CDR region (Tanaka and Nei, 1989). Frequently, it is stated that a pattern whereby, in some set of codons, the number of nonsynonymous substitutions per nonsynonymous site (dN ) exceeds the number of synonymous substitutions per synonymous site (dS ) is a signature of positive selection, but such a statement is misleading. Note that in Tanaka and Nei’s (1989) study of immunoglobulins, as in Hughes and Nei’s (1988) study of major histocompatibility complex genes, the authors tested for a pattern of dN > dS in a set of codons where there was a biological reason for predicting that selection would favor amino acid


diversity. This is not the same thing as simply searching in a set of coding sequences for one or more codons with dN > dS using codon-based methods of testing for positive selection. Such methods no doubt are able to identify codons with this property, but a certain proportion of such codons are likely to occur by chance in most coding sequences, even under strong purifying selection (Hughes and Friedman, 2005a). Codon-based methods depend on the false assumption that the existence of one or more codons with dN > dS implies positive selection, and therefore these methods do not provide a valid test of the hypothesis of positive selection (Hughes, 2007). Unfortunately, codon-based methods have been used in the vast majority of published cases in which it has been claimed that positive selection has acted to favor amino acid changes within multigene families. Therefore, it remains unclear how widespread this phenomenon is.

One problem with the comparison of synonymous and nonsynonymous substitutions is that it is only applicable over a fairly short time frame. Suppose that immediately after gene duplication, natural selection favors a series of nonsynonymous substitutions between two genes (Figure 5). If we can compare the two genes at that point, we will find that dN exceeds dS. However, once all the amino acid changes required to adapt the daughter genes to their specialized functions have occurred, no more amino acid changes will occur; and purifying selection at the amino acid sequence level will predominate, as in most genes (Figure 5). Assuming the neutrality or near-neutrality of most synonymous substitutions, dS between the two genes will thus eventually equal and finally overtake dN (Figure 5). Thus, even in cases where a pattern of dN > dS has occurred, it may not be detectable unless the events involved were relatively recent.
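The argument about time frame can be made concrete with a toy calculation. The sketch below is an illustration only, not a model fitted to any data. It assumes that synonymous divergence accumulates linearly, that a fixed amount of selectively favored nonsynonymous change accumulates over a short period after duplication, and that nonsynonymous change thereafter proceeds at a low background rate. All function names, parameter names, and values are hypothetical, and multiple hits at the same site are ignored.

```python
def toy_divergence(t, syn_rate=0.01, adaptive_total=0.3,
                   adaptive_span=10.0, background_rate=0.001):
    """Return (dS, dN) at time t after duplication under a toy model.

    All parameters are hypothetical and in arbitrary units:
    syn_rate        -- neutral synonymous divergence per unit time
    adaptive_total  -- total dN contributed by selectively favored changes
    adaptive_span   -- time over which those favored changes accumulate
    background_rate -- nonsynonymous divergence per unit time under
                       purifying selection
    """
    ds = syn_rate * t
    favored = adaptive_total * min(t, adaptive_span) / adaptive_span
    dn = favored + background_rate * t
    return ds, dn


def crossover_time(syn_rate=0.01, adaptive_total=0.3, background_rate=0.001):
    """Time at which dS catches up with dN.

    Solves adaptive_total + background_rate * t = syn_rate * t, so it is
    valid only if syn_rate > background_rate and the result falls after
    the adaptive phase has ended.
    """
    return adaptive_total / (syn_rate - background_rate)


if __name__ == "__main__":
    for t in (5, 10, 30, 50, 100):
        ds, dn = toy_divergence(t)
        print(f"t={t:>3}  dS={ds:.3f}  dN={dn:.3f}  dN/dS={dn / ds:.2f}")
    print("dS overtakes dN at about t =", round(crossover_time(), 1))
```

Under these assumed values, dN exceeds dS during and shortly after the adaptive phase, and the ratio dN/dS then declines steadily as synonymous changes accumulate, which is the qualitative pattern described above.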
The pregnancy-associated glycoproteins (PAGs) of ruminants provide an example of a recently duplicated mammalian gene family in which a pattern of dN > dS can be observed in a potentially functionally important region of the molecule (Hughes et al., 2000). PAGs are homologous to aspartic proteinases; they have apparently lost proteinase function, although they retain the ability to bind peptides. The PAG family has undergone massive gene duplication in the ruminant lineage, resulting in 100 or more genes, which are expressed in the placenta. In most comparisons between PAG

Figure 5 Effect on dS and dN of directional selection favoring a series of amino acid differences between two genes duplicated at T0. At a certain time (T1), dN will exceed dS. But after the selectively favored amino acid changes have all been made, dS will catch up with and eventually exceed dN (T2 and T3). [Plot: curves of dS and dN against time, with time points T0 to T3 marked.]


Figure 6 Plots of dN vs. dS: (A) putative peptide-binding region; (B) remainder of pregnancy-associated glycoprotein genes of ruminants. (From Hughes et al., 2000.) [Scatter plots: y-axis, dN; x-axis, dS; both axes from 0.0 to 0.5.]

genes, dN exceeds dS in the codons encoding the putative peptide-binding region of the protein (Figure 6A). By contrast, in the remainder of the gene, dS exceeds dN, as in most genes (Figure 6B). The duplication of PAG genes has all occurred within the ruminant lineage within the past 50 million years or so (Hughes et al., 2000, 2003a). Thus, diversification of these genes has occurred recently enough that the acceleration of dN relative to dS in the putative peptide-binding region is clearly detectable. The PAGs cannot really be considered a convincing example of positive selection until the function of these molecules is understood. Only if we understand the function of PAGs can we know why repeated amino acid changes in a portion of the PAG protein might be favored. Nonetheless, the PAGs provide a vivid illustration of the fact that an excess of dN over dS is most easily detected in the case of recently diverged paralogs. Both the immunoglobulin and PAG examples illustrate another point that is worth emphasizing: Comparison of nonsynonymous and synonymous substitutions is only


suitable for testing the hypothesis that natural selection has favored repeated amino acid changes at a limited set of positions (Hughes, 2007). But there is no reason to believe that this is a particularly common mode of selection on duplicated genes. Rather, in many cases, mutations that adapt duplicate genes for specialized functions may include such mutational events as the following: (1) replacement of a single amino acid; (2) deletion or insertion of one or more codons; (3) creation of a chimeric gene either by recombination with another protein-coding locus or by “capture” into coding exons of previously noncoding sequence; (4) loss of the appropriate splice signals and thus of expression of one or more exons; and (5) changes in regulatory regions leading to changes in gene expression. However, there are no statistical methods to test for positive selection in any of these cases, which almost certainly account for the overwhelming majority of cases of positive selection on duplicate genes. It is worth remarking that even degenerative changes, such as envisaged by the DDC model, may in some cases be positively selected. Known examples of adaptive evolution at the molecular level very often involve loss-of-function mutations (Hoekstra and Coyne, 2007). Moreover, in the case of bifunctional proteins, loss of one function may sometimes be selectively favored if the presence of that function imposes a constraint that limits the effectiveness of the other function (Hughes, 2005).

It seems to us that the most reasonable course for evolutionary biologists to take is to concentrate on understanding the functional differences between duplicates and not worry about uncovering evidence of past positive selection. For example, in the case of the yeast GAL genes discussed previously, it is plausible that natural selection played a role in favoring changes in the S. cerevisiae GAL1 promoter that led to a strongly inducible expression, given the well-understood adaptive advantage this confers (Hittinger and Carroll, 2007). But if it occurred, this selection has left no signature that we are likely to be able to detect. Understanding the functional differences between gene duplicates and identifying the mutations that caused those functional differences represents a much more important contribution to the advancement of biology as a science than does any bioinformatic search for supposed signatures of positive selection.

4 EXPRESSION DIFFERENTIATION

Theoretically, there are two distinct pathways by which duplicated protein-coding genes can become differentiated: (1) by differences in expression pattern, and (2) by differences in the amino acid sequence of the encoded protein, leading to functional change or specialization at the amino acid level. Intuitively, it seems likely that both types of differentiation can occur in the course of the evolution of a given duplicate gene pair, either simultaneously or at different stages of the process of functional differentiation. But it is unclear whether these two modes of diversification tend to occur in a mutually exclusive fashion or whether, on the contrary, they tend to go hand in hand. The availability of data from a number of sources regarding the patterns of gene expression and its regulation by transcription factors has made it possible to address questions of this sort.


4.1 Duplicated Genes in Arabidopsis Root Development

Hughes and Friedman (2005b) approached the question of how coding sequence divergence and gene expression patterns relate in multigene families using data from a study of gene expression at three developmental stages in five different tissue types of the developing root in Arabidopsis thaliana (Birnbaum et al., 2003). Expression data were obtained using the ATH1 GeneChip (Affymetrix, Santa Clara, CA) covering three developmental stages (stages 1, 2, and 3) in the following cell zones: (1) stele, (2) endodermis, (3) endodermis + cortex, (4) epidermal atrichoblasts, and (5) lateral root cap. Birnbaum et al. (2003) provided a data set giving raw expression scores (mean of three replicates) for 5717 transcripts that were shown to be regulated differentially across the 15 separate subzones (three stages × five cell types). Hughes and Friedman (2005b) estimated the 15 × 15 linear correlation matrix among the subzones, and principal components were extracted from this correlation matrix. The purpose of principal components analysis is to reduce a large number of variables (in the present case, 15 variables corresponding to the 15 subzones) to a smaller number of variables that explain most of the variance in the larger set. This amounts to rotating the original coordinate system in multivariate space to define new axes. The first two principal components (PC1 and PC2) extracted from the correlation matrix of expression scores in the 15 subzones accounted together for 81.3% of the trace of the matrix, 66.3% in the case of PC1 and 15.0% in the case of PC2. PC1 appeared to be a measure of overall level of expression, as shown by nearly equal loadings on each of the 15 subzones (Table 1). PC2 evidently provided a contrast between early and late expression, as shown by negative loadings on stage 1 and positive loadings on stage 3 (Table 1). PC2 also showed negative loadings on stage 2 in all but one cell type, stele (Table 1).
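For readers unfamiliar with the procedure, the following sketch shows how principal components can be extracted from a correlation matrix and how the share of the trace accounted for by each component is obtained. It is not the software used in the original study. The data are simulated, with six hypothetical subzones rather than 15, and the eigenvectors are found by simple power iteration with deflation so that the example needs only the Python standard library; in practice a linear-algebra library would normally be used. All function names are our own.

```python
import math
import random


def standardize(rows):
    """Convert each column of rows to z-scores (sample SD, n - 1)."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(k)]
    return [[(r[j] - means[j]) / sds[j] for j in range(k)] for r in rows]


def correlation_matrix(z):
    """Correlation matrix among the columns of standardized data z."""
    n, k = len(z), len(z[0])
    return [[sum(row[a] * row[b] for row in z) / (n - 1) for b in range(k)]
            for a in range(k)]


def leading_eigenpair(matrix, iterations=1000):
    """Largest eigenvalue and unit-length eigenvector by power iteration."""
    k = len(matrix)
    vec = [1.0 + 0.01 * i for i in range(k)]  # arbitrary starting vector
    for _ in range(iterations):
        new = [sum(matrix[a][b] * vec[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in new))
        vec = [x / norm for x in new]
    value = sum(vec[a] * matrix[a][b] * vec[b]
                for a in range(k) for b in range(k))
    return value, vec


def principal_components(rows, n_components=2):
    """Return a list of (eigenvalue, loadings, share_of_trace) tuples."""
    corr = correlation_matrix(standardize(rows))
    k = len(corr)  # the trace of a correlation matrix equals k
    work = [row[:] for row in corr]
    result = []
    for _ in range(n_components):
        value, vec = leading_eigenpair(work)
        result.append((value, vec, value / k))
        # Deflation: remove the component just found before the next pass.
        for a in range(k):
            for b in range(k):
                work[a][b] -= value * vec[a] * vec[b]
    return result


def simulated_expression(n_genes=300, seed=1):
    """Hypothetical scores for 6 subzones (2 cell types x 3 stages)."""
    rng = random.Random(seed)
    stage_effect = [-1, 0, 1, -1, 0, 1]  # early, middle, late; twice
    rows = []
    for _ in range(n_genes):
        level = rng.gauss(0, 1)    # overall expression level of the gene
        timing = rng.gauss(0, 1)   # tendency toward late expression
        rows.append([2 * level + s * timing + rng.gauss(0, 0.5)
                     for s in stage_effect])
    return rows


if __name__ == "__main__":
    for i, (value, loadings, share) in enumerate(
            principal_components(simulated_expression()), start=1):
        print(f"PC{i}: {100 * share:.1f}% of trace; loadings =",
              [round(x, 3) for x in loadings])
```

In the simulated data, the first component has loadings of the same sign on every subzone (overall expression level) and the second contrasts early with late stages, mirroring the interpretation given to PC1 and PC2 in the text. The sign of any eigenvector is arbitrary, so only the contrast, not its direction, is meaningful.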

TABLE 1 Loadings for the First Two Principal Components (PC1 and PC2) on Variables Corresponding to Expression Levels in Arabidopsis Root Subzones

Cell Type                  Stage    PC1       PC2
Stele                      1        0.292    −0.166
                           2        0.294     0.032
                           3        0.178     0.498
Endodermis                 1        0.284    −0.200
                           2        0.294    −0.063
                           3        0.209     0.442
Endodermis + cortex        1        0.282    −0.190
                           2        0.296    −0.038
                           3        0.193     0.474
Epidermal atrichoblasts    1        0.280    −0.244
                           2        0.285    −0.159
                           3        0.243     0.181
Lateral root cap           1        0.229    −0.186
                           2        0.274    −0.096
                           3        0.187     0.243

Source: Data from Hughes and Friedman (2005b).


These interpretations of PC1 and PC2 based on variable loadings (Table 1) were supported by analyses comparing PC1 and PC2 with ad hoc composite variables designed to reflect similar aspects of the data. The correlation coefficient between PC1 score and the mean expression level for the 15 subzones was 0.999 (p < 0.001), supporting the hypothesis that PC1 reflects overall expression level. Similarly, PC2 was strongly positively correlated with two composite variables, reflecting a contrast between early and late developmental stages: (1) the mean score for stage 3 (across the five zones) minus the mean score for stage 1 (r = 0.902; p < 0.001), and (2) the mean score for stage 3 minus the mean score of stages 1 and 2 (r = 0.890; p < 0.001). To provide an index of divergence at the coding sequence level, Hughes and Friedman (2005b) estimated dS and dN between the two members of two-member families (428 families). The ratio dN /dS was then compared with the scores on PC1 and PC2 to examine how divergence at the amino acid sequence level relates to expression divergence. For each of the families, the range of PC1 (i.e., the absolute difference in PC1 score between the two family members) was plotted against the ratio dN /dS (Figure 7A). There was a modest negative correlation between the two variables (Spearman’s rank correlation coefficient, rS = −0.235; p < 0.001) (Figure 7A). The evident cause of this negative correlation was the fact that there were a number of families with relatively high ranges of PC1 but low dN /dS (Figure 7A). When the range of PC2 was plotted against the ratio dN /dS , a negative correlation was again observed (rS = −0.107; p = 0.027) (Figure 7B). As with PC1, this negative correlation can be explained by the occurrence of a number of families with relatively high ranges of PC2 but low dN /dS (Figure 7B). 
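The comparison just described can be sketched as follows. For each two-member family, the range of a principal component is the absolute difference between the scores of the two members, and Spearman's rank correlation is the ordinary correlation coefficient computed on ranks, with tied values given their average rank. The family values below are invented purely for illustration and are not taken from Hughes and Friedman (2005b); the function names are our own.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Ordinary product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical two-member families: (PC1 score of member 1,
# PC1 score of member 2, dN/dS between the two members).
families = [
    (12.0, -9.5, 0.05),
    (3.1, 2.4, 0.42),
    (-1.0, 4.0, 0.18),
    (0.5, 0.9, 0.61),
    (7.2, -2.0, 0.11),
    (-3.3, -2.8, 0.30),
]

if __name__ == "__main__":
    pc1_range = [abs(a - b) for a, b, _ in families]
    dn_ds = [ratio for _, _, ratio in families]
    print("rS =", round(spearman(pc1_range, dn_ds), 3))
```

A significance test for the coefficient is omitted from this sketch; statistical packages provide one.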
Examination of the individual families showed that ribosomal protein families provided the most striking cases of two-member families having low dN/dS values along with high expression divergence (Hughes and Friedman, 2005b). When the ribosomal proteins were removed from the data set, the correlation between the range of PC1 and dN/dS became more modest (rS = −0.105; p = 0.038); and the correlation between the range of PC2 and dN/dS was no longer significant (rS = −0.022; n.s.). Using the same gene expression data set, dN and dS were estimated between the members of 190 phylogenetically independent sister pairs of sequences from the 41 largest families (out of 820 families in the data set). In these comparisons, dN/dS was not significantly correlated with the absolute difference in PC1 scores between the pair members (rS = 0.059; n.s.) nor with the absolute difference in PC2 scores between the pair members (rS = 0.062; n.s.). Nonetheless, these 190 comparisons included some pairs with relatively low dN/dS but high absolute difference in PC1 or PC2 scores. Thus, in the majority of cases, Hughes and Friedman (2005b) found that duplicate genes in Arabidopsis have diverged in the coding sequence but not to a great extent in expression pattern, at least in the cell types analyzed. There were exceptions to this generalization, most notably the ribosomal proteins, for which the opposite was true. But overall, these data suggest caution regarding any model of the evolution of new functions after gene duplication that places a great deal of emphasis on differences of expression pattern.

4.2 Transcription Factor Binding

Another way to examine the change of expression after gene duplication is to compare transcription factor binding by paralogous genes. Hughes and Friedman (2007) used


Figure 7 (A) Range of PC1 (see Table 1) plotted against dN/dS for 428 Arabidopsis two-member families (rS = −0.235; p < 0.001); (B) range of PC2 (see Table 1) plotted against dN/dS for 428 Arabidopsis two-member families (rS = −0.107; p = 0.027). (From Hughes and Friedman, 2005b.) [Scatter plots: y-axis, absolute difference in PC1 (A) or PC2 (B); x-axis, dN/dS.]

data on transcription factors associated with brewer’s yeast (Saccharomyces cerevisiae) genes to compare the transcription factors targeting 190 pairs of duplicate genes. Overall, there was a negative correlation between sequence divergence, at either synonymous or nonsynonymous sites, and the degree to which the duplicated gene pair shared the same transcription factors. However, there was a great deal of scatter in the data. Some gene pairs with substantial divergence in the coding region were found to bind similar sets of transcription factors, whereas other gene pairs with highly similar coding regions were found to have essentially no overlap in the sets of transcription factors they bind (Hughes and Friedman, 2007). The genome of brewer’s yeast contains a number of apparently duplicated segments, hypothesized by some authors to be relics of an ancient polyploidization event (see below; Wolfe and Shields, 1997). It was of interest that gene pairs in duplicated segments tended to share transcription factors to a greater extent than gene pairs outside duplicated segments (Hughes and Friedman, 2007). This pattern may be explained in part by the fact that when a large genomic segment is duplicated, transcription


factor–binding sites are likely to be duplicated along with the protein-coding genes. On the other hand, when a single gene is duplicated, there is a greater chance that not all of the transcription factor–binding sites will be duplicated. Of the 190 duplicate gene pairs analyzed by Hughes and Friedman (2007), the sets of transcription factors bound matched perfectly in 17 duplicated gene pairs. Ten of these pairs were ribosomal protein genes. Some information was available from the published literature regarding the function of three of the other gene pairs. DMA1 (YHR115C) and DMA2 (YNL116W) encode proteins that function in the positioning of the mitotic spindle (Fraschini et al., 2004). Mutants lacking both genes showed aberrant spindle positioning, but there was no detectable phenotypic effect if either of the two genes was deleted but the other was not (Fraschini et al., 2004). Thus, in terms of their role in spindle positioning, the two duplicated genes seem to be functionally redundant or nearly so. Nonetheless, the two proteins are quite divergent at the amino acid level, and the correlation of their expression scores across a large number of microarray experiments was actually rather low (Hughes and Friedman, 2007). PDI1 (YCL043C) and EUG1 (YDR518W) both encode protein disulfide isomerases found in the lumen of the endoplasmic reticulum (Noiva and Lennarz, 1992; Tachibana and Stevens, 1992). Although the two proteins are functionally similar, their functions appear not to be identical. Deletion of PDI1 renders the cell nonviable, but deletion of EUG1 has no such effect (Tachibana and Stevens, 1992). Moreover, in cells in which the protein encoded by PDI1 was depleted, overexpression of EUG1 compensated only partially for the defect (Tachibana and Stevens, 1992).
Finally, SAS5 (YOR213C) and TAF14 (YPL129W) are functionally similar in that both encode subunits of multisubunit complexes that regulate transcription, but the two proteins form part of structurally and functionally very different complexes. SAS5 encodes a protein that is part of a trimeric complex involved in transcriptional silencing of heterochromatin (Sutton et al., 2003; Shia et al., 2005). By contrast, the protein encoded by TAF14 (also known as ANC1 and TFG3) forms a part of the multisubunit RNA polymerase II holoenzyme, essential for transcription of protein-coding genes (Meyer and Young, 1998). The proteins are also very divergent at the amino acid level.

These three examples of duplicate gene pairs with shared transcription factors show a range of patterns of functional divergence at the protein level, extending from apparent near-redundancy in the case of DMA1 and DMA2 to distinct and even opposing functions in the case of SAS5 and TAF14. The different ways in which these duplicate gene pairs have differentiated provide a vivid illustration of the fact that gene expression and protein sequence constitute a multidimensional space in which duplicates can differentiate along one or more axes (Hughes, 2005). Thus, different gene pairs can differentiate functionally in very different ways by exploiting different dimensions in this space of possible patterns of expression and protein function.
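The degree to which two paralogs share transcription factors can be quantified in several ways. The sketch below uses the Jaccard index (the number of shared factors divided by the number of factors bound by either gene), which is one common choice and is an assumption on our part, not necessarily the measure used by Hughes and Friedman (2007). The gene and transcription factor names are placeholders, not real yeast annotations.

```python
def jaccard(set_a, set_b):
    """Shared items divided by all items; None if both sets are empty."""
    union = set_a | set_b
    if not union:
        return None  # overlap is undefined when neither gene binds a factor
    return len(set_a & set_b) / len(union)


# Hypothetical transcription factor sets for three duplicate gene pairs.
pairs = {
    ("geneA1", "geneA2"): ({"TF1", "TF2", "TF3"}, {"TF1", "TF2", "TF3"}),
    ("geneB1", "geneB2"): ({"TF1", "TF4"}, {"TF5"}),
    ("geneC1", "geneC2"): ({"TF2", "TF3", "TF6"}, {"TF3", "TF6", "TF7"}),
}

if __name__ == "__main__":
    for (gene_1, gene_2), (tfs_1, tfs_2) in pairs.items():
        print(gene_1, gene_2, "shared fraction =", jaccard(tfs_1, tfs_2))
```

A perfectly matching pair, such as the 17 pairs described above, has an index of 1, and a pair with no overlap has an index of 0. The index can then be compared with dS or dN for the same pair.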

5 SEGMENTAL DUPLICATION AND ITS AFTERMATH

5.1 The Polyploidization Obsession

One of the most persistent (and, to our way of thinking, pernicious) myths in the field of gene duplication has been the late Susumu Ohno’s belief that the duplication


of entire genomes by polyploidization is a key to the origin of new adaptations (Ohno, 1970). Ohno probably developed this theory because at the time he was working, the only known model for gene expression in any organism was that of the lac operon in Escherichia coli (Hughes, 2000). Assuming that eukaryotic genes are similarly organized into operons, Ohno evidently reasoned that duplication of a single gene was unlikely to lead to anything productive because the regulatory region for the operon would not be duplicated as well. Ohno (1970) dismissed the lungfish as a supposed evolutionary failure, but it was problematic for his theory that the lungfish has a large genome. Thus, using circular reasoning, Ohno (1970) argued that the lungfish must have achieved its large genome size through tandem duplication rather than whole-genome duplication because, in the case of the poor lungfish, no “progress” has resulted. He did not seem to realize that a lungfish is at least as well adapted to its environment as is any tetrapod, despite its lack of recent phenotypic innovation. One idea that traces its descent to Ohno is the 2R hypothesis, the hypothesis that two rounds of genome duplication occurred early in the vertebrate lineage, before the origin of jawed vertebrates (Hughes, 2000). These supposed rounds of genome duplication have been implicated as playing a key role in the origin of a number of key vertebrate adaptations, including the vertebrate-specific (“adaptive”) immune system (Flajnik and Kasahara, 2001). Vertebrate-specific immunity involves a number of gene families unique to jawed vertebrates: namely, immunoglobulins, T-cell receptors, and the class I and II molecules of the major histocompatibility complex (MHC). It has been argued that there are four clusters of genes in the typical vertebrate genome that are homologous to a set of genes linked to the class I and II MHC genes in mammals (Flajnik and Kasahara, 2001). 
However, phylogenetic analyses show that many of the genes in these clusters were, in fact, duplicated early in the history of life, well before the origin of vertebrates (Hughes, 1998; Yeager and Hughes, 1999). But even if it is true that certain genes in these clusters were duplicated as a result of polyploidization events early in vertebrate history, those events explain nothing about the origin of the MHC, since the class I and II genes are linked to only one of the clusters, and no homolog to the MHC genes has been found in any species outside the jawed vertebrates. Advocates of the 2R hypothesis have focused on certain aspects of vertebrate genomes that are consistent with two rounds of polyploidization, notably the presence of four HOX gene clusters. But they have not subjected the hypothesis to rigorous testing. Every rigorous attempt to uncover an unambiguous signal of two rounds of genome duplication in vertebrates has failed to do so (Hughes, 1999b; Friedman and Hughes, 2001; Hughes et al., 2001; Hughes and Friedman, 2003a, 2004). In fact, all of the features of vertebrate genomes that have been attributed to 2R can be explained as easily or more easily by multiple separate events of duplication of individual genes or genomic segments. An unfortunate consequence of Ohno’s influence is the tendency to attribute to ancient polyploidization genomic features that might just as easily be explained by other mechanisms. Typically adduced as evidence of ancient polyploidization are the following: (1) regions of double synteny, that is, two or more genomic regions containing members of the same set of gene families (Wolfe and Shields, 1997); and (2) the existence of multiple pairs of duplicate genes with evidence (either from phylogenies or from evolutionary sequence distances) that they duplicated within roughly the same time frame (McLysaght et al., 2002).


Among genomes of model organisms, both brewer’s yeast and Arabidopsis undoubtedly show double synteny, but whether this is due to polyploidization remains debatable. Certainly, if polyploidization occurred in either case, the organisms have since become re-diploidized; and re-diploidization is hypothesized to have involved a massive loss of duplicated genes. In brewer’s yeast, it has been estimated that 85% of genes duplicated by a hypothetical polyploidization must subsequently have been lost (Wolfe and Shields, 1997). There has been only one rigorous test of the polyploidization hypothesis in yeast: that of Martin et al. (2007), who used standard algorithms for counting minimal numbers of rearrangements to compare the hypothesis of polyploidization with that of a series of segmental duplications. The hypothesis of multiple segmental duplications was found to provide a much more parsimonious explanation than polyploidization (Martin et al., 2007). There are several problems with studies that present evidence of a “burst” or “peak” of gene duplication within a particular time frame as evidence for polyploidization. First, many such studies are based simply on frequency distributions of dS values in comparisons of paralogs (Cui et al., 2006). But because nucleotide substitution is a discrete process, the distribution of dS will tend to form a number of peaks that do not reflect any biological reality. Peaks in the frequency distribution of dS values are especially likely to be artifactual when dS is greater than 1.0 (i.e., when synonymous sites are saturated with changes). Yet many studies have based claims of polyploidization on such high dS values. A further point is that any “peak” of gene duplication in the evolutionary past is not really a peak of gene duplication but rather a peak of retention of duplicate genes.
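The point about discreteness can be illustrated with a simple calculation. In the sketch below, the Jukes-Cantor correction is used as a stand-in for the more elaborate methods actually used to estimate dS, and the number of synonymous sites (100) and the histogram bin width (0.02) are arbitrary assumptions. Because the number of observed differences must be a whole number, only certain corrected distances can occur, and the spacing between them widens as the sites approach saturation.

```python
import math


def jc_distance(p):
    """Jukes-Cantor corrected distance for a proportion p of differing sites.

    Returns None when p >= 0.75, where the correction is undefined.
    """
    if p >= 0.75:
        return None
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def attainable_distances(n_sites):
    """All corrected distances that can be observed with n_sites sites."""
    values = []
    for k in range(n_sites + 1):  # k = whole number of observed differences
        d = jc_distance(k / n_sites)
        if d is not None:
            values.append(d)
    return values


def empty_bins(values, low, high, width):
    """Count histogram bins in [low, high) that no attainable value can fill."""
    n_bins = round((high - low) / width)
    filled = set()
    for d in values:
        if low <= d < high:
            idx = int((d - low) / width)
            if 0 <= idx < n_bins:
                filled.add(idx)
    return n_bins - len(filled)


if __name__ == "__main__":
    values = attainable_distances(100)
    print("empty bins, 0.0 to 0.5:", empty_bins(values, 0.0, 0.5, 0.02))
    print("empty bins, 1.0 to 1.5:", empty_bins(values, 1.0, 1.5, 0.02))
```

Under these assumptions, many more bins are unreachable in the range above 1.0 than in the range below 0.5, so a histogram of estimates in the saturated range would show gaps and apparent peaks even if the underlying duplication times were spread smoothly. Real data sets pool genes of different lengths, which blurs but does not necessarily remove this effect.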
It is probable that gene duplication, like other forms of mutation, occurs at a steady rate over evolutionary time (Lynch and Conery, 2003), but that at certain periods duplicate genes are more likely to be retained (Friedman and Hughes, 2003). The role of retention of gene duplicates in evolution is illustrated by comparing the genomes of brewer’s yeast Saccharomyces cerevisiae and fission yeast Schizosaccharomyces pombe (Hughes and Friedman, 2003b). These two species are only very distantly related, their last common ancestor being estimated at 420 million years ago (Sipiczki, 2000). Both have evolved independently a single-celled “yeast” lifestyle. As mentioned previously, the genome of S. cerevisiae includes a number of duplicated segments; but there is no evidence of such segmental duplication in S. pombe. Phylogenetic analyses of individual gene families showed that the same genes were duplicated independently in S. cerevisiae and S. pombe to a far greater extent than expected by chance (Figure 8; Hughes and Friedman, 2003b). Many of these are families likely to play roles in the unicellular life cycle (Hughes and Friedman, 2003b). In both S. cerevisiae and S. pombe, there were “bursts” of retention of duplicate genes, yet the bursts occurred at different times and involved different mechanisms, since segmental duplication was a factor only in the former species (Hughes and Friedman, 2003b). If duplicate genes represent an important raw material of evolution, it might be argued that their origin matters little: whether duplication of single genes, duplication of genomic segments, or duplication of entire genomes. From this perspective, debating whether or not whole-genome duplication occurred in the past history of this or that species may appear a rather pointless exercise. In a sense this is true, since we often cannot distinguish segmental duplication from whole-genome duplication followed by massive loss of duplicate genes.
Nonetheless, we feel that the recent fad for publishing claims of ancient polyploidization has had a detrimental effect on the progress of


Figure 8 Numbers of gene families with one or more duplications observed after the last common ancestor of S. pombe and S. cerevisiae, illustrating the numbers of families duplicated in each species separately and in both species. Separate analyses were conducted for a data set of all proteins (650 families) and for a data set excluding ribosomal proteins (623 families). Numbers observed were compared with numbers expected, calculated by multiplying the proportion of families with one or more duplications in S. pombe by the proportion of families with one or more duplications in S. cerevisiae. For both data sets the numbers observed and expected were significantly different (χ2 = 117.2 and 82.3, respectively; p < 0.00001 in both cases). (From Hughes and Friedman, 2003b.) [Venn diagrams: observed and expected numbers of families duplicated in S. cerevisiae only, in S. pombe only, and in both species; separate panels for all proteins and for the data set excluding ribosomal proteins.]

evolutionary genomics, both because it has caused a general lack of interest in the evolutionary importance of mechanisms for segmental duplication and because it has implied the acceptance of some seriously inappropriate models for major phenotypic change. Molecular biology has revealed a number of exciting possible mechanisms of segmental duplication that have been almost entirely ignored in the evolutionary literature. One is the role of transposable elements. The fact that duplicated segments in the genome of Arabidopsis are associated with transposable elements to a far greater degree than expected by chance suggests that transposon-mediated segmental duplication, rather than polyploidization, may have structured the genome of that species (Hughes et al., 2003b). There are a number of ways that transposable elements can potentially mediate segmental duplication in genomes. First, transposable elements can provide sites of homology for unequal crossing over (Fedoroff, 2000). Recombination between transposable elements on different chromosomes can lead to translocation of a large genomic segment from one chromosome to another (Bennetzen, 2000). If a chromosome that has received such a translocation ends up, as a result of independent assortment of chromosomes, in the same genome with the wild-type version of the


donor chromosome, the net effect will be a segmental duplication. The emphasis on polyploidization has led to an almost total disregard for the potential role of transposable elements as agents of segmental duplication both in Arabidopsis and in other species. Examination of the human genome has revealed that it differs from other sequenced mammalian genomes in possessing large blocks of interspersed duplications. These duplications have occurred fairly recently (within the primate lineage) and in a complicated fashion involving duplications within duplications, as elucidated in recent years by Eichler and colleagues (e.g., Jiang et al., 2007). Moreover, extensive gene copy number polymorphisms have been found in the human population, resulting from segmental duplications that are so recent that they have not yet been fixed in the population (Fredman et al., 2004). Thus, in our own species we can see both the recent evolution of paralogs and the raw material for evolution of future paralogs, and it seems reasonable to suppose that similar processes of segmental duplication have occurred in other lineages in the history of life.

5.2 Models of Phenotypic Innovation

Stebbins (1971, p. 132) considered that “polyploidy has contributed little to progressive evolution.” In plants, polyploids have larger cells and slower development times, traits that are probably often deleterious but may under certain ecological circumstances confer advantages (Levin, 1983). In animals it has not been possible to associate known cases of polyploidy with any major phenotypic adaptation (Otto and Whitton, 2000). Despite this overall negative balance, numerous evolutionary biologists remain infected by an Ohno-inspired enthusiasm for the supposed innovative power of polyploidy. We believe that this view is seriously in error and harmful to progress in the understanding of genomic evolution.
Here we illustrate the futility of studying polyploid organisms as models of evolutionary innovation by comparing two possible models: (1) the frogs of the genus Xenopus and (2) the hominids. The frog genus Xenopus (family Pipidae) provides a well-studied example of an animal taxon in which polyploidization has been a frequent occurrence (Cannatella and de Sá, 1993). Xenopus laevis, a widely used laboratory animal, is known to be an ancient allotetraploid, which underwent an allopolyploidization event 30 to 40 million years ago (Bisbee et al., 1977). X. laevis has become re-diploidized, and many of the duplicate genes have been lost, perhaps as many as 50 to 75% (Hughes and Hughes, 1993; Hellsten et al., 2007). The genus Xenopus also includes octoploid and dodecaploid species (Bisbee et al., 1977; Cannatella and de Sá, 1993). Hellsten et al. (2007) compared duplicated genes in X. laevis with those of X. tropicalis, a related species that has not undergone polyploidization. They showed that in many cases, one of the two X. laevis genes has diverged from the X. tropicalis gene at nonsynonymous sites somewhat more than has the other; but most such change appears to be due to the stochastic nature of the mutational process. In only 28 of 578 gene pairs (4.8%) was there significant asymmetry in amino acid evolution between the two X. laevis duplicates in comparison with a random (binomial) model (Hellsten et al., 2007). Chain and Evans (2006) detected a similar level of asymmetric divergence (18 of 290 gene pairs, or 6.2%). Thus, there is evidence for the possible occurrence of subfunctionalization among some of those duplicate gene pairs in X. laevis that have not been lost. In the case of

DUPLICATE GENES IN NETWORKS

97

the duplicated developmental gene hairy2, there is evidence that the two copies are expressed differentially, as predicted by the DDC model (Murato et al., 2007). On the other hand, the two duplicates of another developmental regulatory gene, foxi1 in X. laevis, are expressed in identical ways (Matsuo-Takasaki et al., 2005). Moreover, there is no evidence that subfunctionalization in X. laevis has given rise to any significant evolutionary novelty. X. laevis and X. tropicalis are morphologically, physiologically, and behaviorally very similar. Similarly, there are no pronounced phenotypic differences between these species and the octoploid and dodecaploid members of the genus Xenopus. Contrast this situation with that of the hominids. In the 5 to 7 million years since their last common ancestor with the chimpanzee, the hominids have undergone numerous morphological changes, including (to name but a few) reduction of the canines, the evolution of upright posture, and a massive increase in brain size. Polyploidization was not a factor in any of this, of course. But it is at least a plausible hypothesis that segmental duplication did play a major role, given the high frequency of recently retained segmental duplications in the human genome (Jiang et al., 2007). It is at least suggestive that of all the mammalian species whose genomes have been sequenced, the species that has undergone the most extensive recent adaptive change is also the one with the most recent segmental duplication. Thus, it seems obvious to us that if one is seeking a model organism for the origin of new phenotypes, our own species represents a much more appropriate model than does any recent polyploid (such as Xenopus). 
For example, there was a period of rapid morphological change in the early history of vertebrates (Carroll, 1988); to understand what happened during that period, what has gone on in the past five to seven million years in hominids seems a much more reasonable starting place than is polyploidy in organisms that have changed little in over 30 million years. But evolutionary biologists will need to liberate themselves from Ohnoist mythology if they are to appreciate the striking model of phenotypic innovation that is literally right before their eyes, at least when they look in the mirror each morning.
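
The binomial comparison cited above for the X. laevis duplicates can be illustrated with a small sketch. It assumes, purely for illustration, that each gene pair is summarized by the number of amino acid changes assigned to each copy and that symmetric evolution corresponds to a binomial model with p = 0.5. The function names, the example counts, and the 0.05 threshold are hypothetical and are not taken from Hellsten et al. (2007) or Chain and Evans (2006); no correction for multiple tests is attempted.

```python
from math import comb

def binomial_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than the observed count k out of n."""
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small tolerance so that ties with the observed outcome are included.
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))

def asymmetric(subs_a: int, subs_b: int, alpha: float = 0.05) -> bool:
    """True if the split of changes between the two copies departs
    significantly from the 50:50 expectation (hypothetical criterion)."""
    return binomial_two_sided_p(subs_a, subs_a + subs_b) < alpha

# Hypothetical per-copy counts of amino acid changes for two gene pairs.
print(asymmetric(12, 10))  # near-even split
print(asymmetric(30, 8))   # strongly uneven split

# Proportions quoted in the text.
pct_hellsten = round(100 * 28 / 578, 1)
pct_chain_evans = round(100 * 18 / 290, 1)
print(pct_hellsten, pct_chain_evans)
```

Under these assumptions a near-even split is not flagged, whereas a strongly uneven split is; with hundreds of pairs tested, a few percent of nominally significant pairs is close to what chance alone would produce at this threshold.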

6 DUPLICATE GENES IN NETWORKS

In recent years there has been a great deal of interest in biological networks, including gene interaction networks, protein interaction networks, and metabolic networks (Kanehisa, 2000; Ravasz et al., 2002). It is generally acknowledged that gene duplication has played a major role in shaping biological networks as we see them today (Wagner, 2001), and is responsible for the distinctive properties of biological networks, such as their scale-free (Barabási and Albert, 1999) and modular (Ravasz et al., 2002) nature. However, these properties will arise only if gene duplication occurs in a differential fashion, whereby certain multiply connected “hub” genes in a network remain unduplicated while less connected “spoke” genes are duplicated (Hughes and Friedman, 2005c). The genome of brewer’s yeast provides some intriguing evidence that natural selection may act to eliminate duplication of multiply connected “hub” genes. In the analysis of a yeast genetic network (Tong et al., 2004), Hughes and Friedman (2005c) found that of 68 single-member yeast families with 25 or more network connections, 28 (44.4%) were located in duplicated segments of the genome (Seoighe and Wolfe, 1999). Each
of these 28 loci was thus presumably duplicated along with the genomic segment to which it belongs. However, the fact that each of these 28 families now contains a single member implies that after segmental duplication, one duplicate member of each family was deleted from the genome. In addition, there is evidence that network connections are remarkably labile over evolutionary time (Hughes and Friedman, 2005c). Immediately after gene duplication, it seems reasonable to suppose that gene duplicates will often have the same network connections. There might be exceptions if one duplicate is partial or if an exon-shuffling event or other recombinational event has accompanied gene duplication. But data from both yeast and the nematode worm Caenorhabditis elegans suggested that those pairs of duplicated genes that have been retained in these genomes generally have quite distinct sets of connections (Hughes and Friedman, 2005c). Moreover, there is evidence that such changes can happen soon after duplication, as indicated by examples in C. elegans of closely related genes that shared no common network connections (Hughes and Friedman, 2005c). Phylogenetic analyses of multigene families were used to examine the relationship between phylogenetic relatedness and sharing of network connections in a yeast genetic network (Hughes and Friedman, 2005c). Figure 9A shows the phylogenetic tree of MAP kinases included in this network; this was the family showing the greatest within-family contrasts of all those included in the network. The two genes in this family with the highest numbers of connections, YPL031C (with 62 connections) and YHR030C (with 60 connections), were not sisters (Figure 9A). There was strong (100%) bootstrap support for clustering of YHR030C with YLR113W, which had only 24 connections (Figure 9A). When sharing of connections among these genes was examined, YHR030C was found to share only a single connection with the closely related YLR113W (Figure 9B). 
On the other hand, YHR030C shared 14 connections with YPL031C (Figure 9B). All other members of this family included in the yeast network shared at most a single connection (Figure 9B).
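
Counting shared network connections between family members, as summarized in Figure 9B, amounts to intersecting the sets of interaction partners of each pair of genes. The following is a minimal sketch under that assumption; the gene labels and partner sets are invented placeholders, not data from Tong et al. (2004) or Hughes and Friedman (2005c).

```python
from itertools import combinations

def shared_connections(partners: dict[str, set[str]]) -> dict[tuple[str, str], int]:
    """For every pair of genes, count the interaction partners they have in common."""
    return {
        (a, b): len(partners[a] & partners[b])
        for a, b in combinations(sorted(partners), 2)
    }

# Hypothetical interaction partners for three members of one gene family.
partners = {
    "paralog_A": {"g1", "g2", "g3", "g4"},
    "paralog_B": {"g3", "g4", "g5"},
    "paralog_C": {"g6"},
}

counts = shared_connections(partners)
for pair, n_shared in counts.items():
    print(pair, n_shared)
```

Comparing such counts with the branching order of the family tree shows whether close relatives share more connections than distant ones, which is the comparison made for the MAP kinases above.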

Figure 9 (A) Maximum parsimony tree of yeast MAP kinases included in the genetic interaction network. Numbers in parentheses after each gene name are numbers of network connections. Numbers on the branches show the percentage of 1000 bootstrap samples supporting the branch. (B) Network indicating numbers of network connections shared by yeast MAP kinases. Numbers in parentheses after each gene name are numbers of network connections. (From Hughes and Friedman, 2005c.)

7 CONCLUSIONS

The current era is one of exciting potential for increased understanding of evolutionary processes at the genomic level. Genomic sequences have provided a vast new source of information that is only beginning to be tapped. However, the exploitation of this information has been seriously hampered by misuse of the scientific method (storytelling instead of hypothesis testing) and by invalid statistical method (particularly the codon-based tests for positive selection). Evolutionary biologists who seek a more accurate understanding of gene duplication and its aftermath will need to liberate themselves from outmoded ideas and approaches. The most important step toward an advanced understanding in this area, as in any field of science, is to apply methods that test critically among hypotheses. The mere fact that certain data are consistent with a given hypothesis provides no real support for that hypothesis unless and until one is able to rule out reasonable alternatives. Evolution should be understood as a fundamentally opportunistic process. Present-day genomes show the accumulated effects of many generations of mutation, drift, and natural selection. Because these processes are fundamentally unpredictable, few generalizations can be made regarding the evolutionary process. Rather than seeking general laws of genomic evolution, evolutionary biologists can contribute most to advancing our knowledge of genome organization by patient reconstruction of the evolutionary events that have structured individual gene families, genomic regions, and genomes. In doing so, bioinformaticists will need to work closely with experimentalists, because much remains to be learned regarding the multidimensional functional space occupied by each gene.

REFERENCES

Bajwa W, Torchia TE, Hopper JE. 1988. Yeast regulatory gene GAL3: carbon regulation; UASGal elements in common with GAL1, GAL2, GAL7, GAL10, GAL80, and MEL1; encoded protein strikingly similar to yeast and Escherichia coli galactokinases. Mol Cell Biol 8:3439–3447. Barabási AL, Albert R. 1999. Emergence of scaling in random networks. Science 286:509–512.
Bennetzen JL. 2000. Transposable element contributions to plant gene and genome evolution. Plant Mol Biol 42:251–269. Bhat PJ, Oh D, Hopper JE. 1990. Analysis of the GAL3 signal transduction pathway activating GAL4 protein-dependent transcription in Saccharomyces cerevisiae. Genetics 125:281–291. Birnbaum K, Shasha DE, Wang JY, Jung JW, Lambert GM, Galbraith DW, Benfey PN. 2003. A gene expression map of the Arabidopsis root. Science 302:1956–1960. Bisbee CA, Baker MA, Wilson AC, Haji-Azimi I, Fischberg M. 1977. Albumin phylogeny for clawed frogs (Xenopus). Science 195:785–787. Cannatella DC, de Sá RO. 1993. Xenopus laevis as a model organism. Syst Biol 42:476–507. Carroll RL. 1988. Vertebrate Paleontology and Evolution. New York: W.H. Freeman. Chain FJ, Evans BJ. 2006. Multiple mechanisms promote the retained expression of gene duplicates in the tetraploid frog Xenopus laevis. PLoS Genet 2(4):e56. Cui L, Wall PK, Leebens-Mack JH, Lindsay BG, Soltis DE, Doyle JJ, Soltis PS, Carlson JE, Arumuganathan K, Barakat A, et al. 2006. Widespread genome duplications throughout the history of flowering plants. Genome Res 16:738–749. Eisenberg SP, Brewer MT, Verderber E, Heimdal P, Brandhuber BJ, Thompson RC. 1991. Interleukin 1 receptor antagonist is a member of the interleukin 1 gene family: evolution of a cytokine control mechanism. Proc Natl Acad Sci USA 88:5232–5236. Fedoroff N. 2000. Transposons and genome evolution in plants. Proc Natl Acad Sci USA 97:7002–7007. Flajnik MF, Kasahara M. 2001. Comparative genomics of the MHC: glimpses into the evolution of the adaptive immune system. Immunity 15:351–362. Fraschini R, Bilotta D, Lucchini G, Piatti S. 2004. Functional characterization of Dma1 and Dma2, the budding yeast homologues of Schizosaccharomyces pombe Dma1 and human Chfr. Mol Biol Cell 15:3796–3810. Fredman D, White SJ, Potter S, Eichler EE, Den Dunnen JT, Brookes AJ. 2004. Complex SNP-related sequence variation in segmental genome duplications. 
Nat Genet 36:861–866. Friedman R, Hughes AL. 2001. Pattern and timing of gene duplication in animal genomes. Genome Res 11:1842–1847. Friedman R, Hughes AL. 2003. The temporal distribution of gene duplication events in a set of highly conserved human gene families. Mol Biol Evol 20:154–161. Goodman M, Moore GW, Matsuda G. 1975. Darwinian evolution in the genealogy of haemoglobin. Nature 253:603–608. Gould SJ, Lewontin RC. 1979. The spandrels of San Marco and the Panglossian paradigm: a critique of the adaptationist programme. Proc R Soc Lond B 205:581–598. Hellsten U, Khokha MK, Grammer TC, Harland RM, Richardson P, Rokhsar DS. 2007. Accelerated gene evolution and subfunctionalization in the pseudotetraploid frog Xenopus laevis. BMC Biol 5:31. Hittinger CT, Carroll SB. 2007. Gene duplication and the adaptive evolution of a classic genetic switch. Nature 449:677–681. Hoekstra HE, Coyne JA. 2007. The locus of evolution: evo devo and the genetics of adaptation. Evolution 61:995–1016. Hughes AL. 1994a. The evolution of functionally novel proteins after gene duplication. Proc R Soc Lond B 256:119–124. Hughes AL. 1994b. Evolution of the interleukin-1 gene family in mammals. J Mol Evol 39:6–12. Hughes AL. 1998. Phylogenetic tests of the hypothesis of block duplication of homologous genes on human chromosomes 6, 9, and 1. Mol Biol Evol 15:854–870.

Hughes AL. 1999a. Adaptive Evolution of Genes and Genomes. New York: Oxford University Press. Hughes AL. 1999b. Phylogenies of developmentally important proteins do not support the hypothesis of two rounds of genome duplication early in vertebrate history. J Mol Evol 48:565–576. Hughes AL. 2000. Polyploidization and vertebrate origins: a review of the evidence. In Sankoff D, Nadeau JH (eds.), Comparative Genomics. Dordrecht, The Netherlands: Kluwer, pp. 493–502. Hughes AL. 2005. Gene duplication and the origin of novel proteins. Proc Natl Acad Sci USA 102:8791–8792. Hughes AL. 2007. Looking for Darwin in all the wrong places: the misguided quest for positive selection at the nucleotide sequence level. Heredity 99:364–373. Hughes AL, Friedman R. 2003a. 2R or not 2R: testing hypotheses of genome duplication in early vertebrates. J Struct Funct Genom 3:85–93. Hughes AL, Friedman R. 2003b. Parallel evolution by gene duplication in the genomes of two unicellular fungi. Genome Res 13:794–799. Hughes AL, Friedman R. 2004. Pattern of divergence of amino acid sequences encoded by paralogous genes in human and pufferfish. Mol Phylogenet Evol 32:337–343. Hughes AL, Friedman R. 2005a. Variation in the pattern of synonymous and nonsynonymous difference between two fungal genomes. Mol Biol Evol 22:1320–1324. Hughes AL, Friedman R. 2005b. Expression patterns of duplicate genes in the developing root in Arabidopsis thaliana. J Mol Evol 60:247–256. Hughes AL, Friedman R. 2005c. Gene duplication and the properties of biological networks. J Mol Evol 61:758–764. Hughes AL, Friedman R. 2007. Sharing of transcription factors after gene duplication in the yeast Saccharomyces cerevisiae. Genetica 129:301–308. Hughes AL, Nei M. 1988. Pattern of nucleotide substitution at MHC class I loci reveals overdominant selection. Nature 335:167–170. Hughes AL, Piontkivska H. 2005. DNA repeat arrays in chicken and human genomes and the adaptive evolution of avian genome size. BMC Evol Biol 5:12. 
Hughes AL, Green JA, Garbayo JA, Roberts RM. 2000. Adaptive diversification within a large family of recently duplicated, placentally expressed genes. Proc Natl Acad Sci USA 97:3319–3323. Hughes AL, da Silva J, Friedman R. 2001. Ancient genome duplications did not structure the human Hox-bearing chromosomes. Genome Res 11:771–780. Hughes AL, Green JA, Piontkivska H, Roberts RM. 2003a. Aspartic proteinase phylogeny and the origin of pregnancy-associated glycoproteins. Mol Biol Evol 20:1940–1945. Hughes AL, Friedman R, Ekollu V, Rose JR. 2003b. Non-random association of transposable elements with duplicated genomic blocks in Arabidopsis thaliana. Mol Phylogenet Evol 29:410–416. Hughes AL, Ekollu V, Friedman R, Rose JR. 2005. Gene family content-based phylogeny of prokaryotes: the effect of search criteria. Syst Biol 54:268–276. Hughes AL, Friedman R, Glenn NL. 2006. The future of data analysis in evolutionary genomics. Curr Genom 7:227–234. Hughes MK, Hughes AL. 1993. Evolution of duplicate genes in a tetraploid animal, Xenopus laevis. Mol Biol Evol 10:1360–1369. Jensen RA. 1976. Enzyme recruitment in the evolution of new function. Annu Rev Microbiol 30:409–425.
Jensen RA, Byng GS. 1981. The partitioning of biochemical pathways with isozyme systems. Isozymes 5:143–174. Jiang Z, Tang H, Ventura M, Cardone MF, Marques-Bonet T, She X, Pevzner PA, Eichler EE. 2007. Ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution. Nat Genet 39:1361–1368. Johnston M. 1987. A model fungal gene regulatory system: the GAL genes of Saccharomyces cerevisiae. Microbiol Rev 51:458–476. Kanehisa M. 2000. Post-genome Informatics. Oxford, UK: Oxford University Press. Katju V, Lynch M. 2006. On the formation of novel genes by duplication in the Caenorhabditis elegans genome. Mol Biol Evol 23:1056–1063. Kimura M, Ohta T. 1974. On some principles governing molecular evolution. Proc Natl Acad Sci USA 71:2848–2852. Levin DA. 1983. Polyploidy and novelty in flowering plants. Am Nat 122:1–25. Li W-H. 1983. Evolution of duplicate genes and pseudogenes. In Nei M, Koehn RK (eds.), Evolution of Genes and Proteins. Sunderland, MA: Sinauer Associates, pp. 14–37. Lynch M, Conery JS. 2000. The evolutionary fate and consequences of duplicate genes. Science 290:1151–1155. Lynch M, Conery JS. 2003. The origins of genome complexity. Science 302:1401–1404. Lynch M, Force A. 2000. The probability of duplicate gene preservation by subfunctionalization. Genetics 154:459–473. Martin N, Ruedi EA, LeDuc R, Sun F-J, Caetano-Anollés G. 2007. Gene-interleaving patterns of synteny in the Saccharomyces cerevisiae genome: Are they proof of an ancient genome duplication event? Biol Direct 2:3. Matsuo-Takasaki M, Matsumura M, Sasai Y. 2005. An essential role of Xenopus Foxi1a for ventral specification of the cephalic ectoderm during gastrulation. Development 132:3885–3894. McLysaght A, Hokamp K, Wolfe KH. 2002. Extensive genomic duplication during early chordate evolution. Nat Genet 31:200–204. Meyer J, Walker-Jonah A, Hollenberg CP. 1991. 
Galactokinase encoded by GAL1 is a bifunctional protein required for induction of the GAL genes in Kluyveromyces lactis and is able to suppress the gal3 phenotype in Saccharomyces cerevisiae. Mol Cell Biol 11:5454–5461. Meyer VE, Young RA. 1998. RNA polymerase II holoenzymes and subcomplexes. J Biol Chem 273:27757–27760. Murato Y, Nagatomo K, Yamaguti M, Hashimoto C. 2007. Two alloalleles of Xenopus laevis hairy2 gene: evolution of duplicated gene function from a developmental perspective. Dev Genes Evol 217:665–673. Nei M. 1969. Gene duplication and nucleotide substitution in evolution. Nature 221:40–42. Noiva R, Lennarz WJ. 1992. Protein disulfide isomerase: a multifunctional protein resident in the lumen of the endoplasmic reticulum. J Biol Chem 267:3553–3556. Ohno S. 1970. Evolution by Gene Duplication. New York: Springer-Verlag. Ohno S. 1973. Ancient linkage groups and frozen accidents. Nature 244:259–262. Orgel LE. 1977. Gene-duplication and the origin of proteins with novel functions. J Theor Biol 67:773. Otto SP, Whitton J. 2000. Polyploid incidence and evolution. Annu Rev Genet 34:401–437. Piatigorsky J. 2007. Gene Sharing and Evolution: the Diversity of Protein Functions. Cambridge, MA: Harvard University Press. Piatigorsky J, Wistow G. 1991. The recruitment of crystallins: new functions precede gene duplication. Science 252:1078–1079.
Platt A, Ross HC, Hankin S, Reece RJ. 2000. The insertion of two amino acids into a transcriptional inducer converts it into a galactokinase. Proc Natl Acad Sci USA 97:3154–3159. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabási AL. 2002. Hierarchical organization of modularity in metabolic networks. Science 297:1551–1555. Seoighe C, Wolfe KH. 1999. Updated map of duplicated regions in the yeast genome. Gene 238:253–261. Shia W-J, Osada S, Florens L, Swanson SK, Washburn MP, Workman JL. 2005. Characterization of the yeast trimeric–SAS acetyltransferase complex. J Biol Chem 280:11987–11994. Sipiczki M. 2000. Where does fission yeast sit on the tree of life? Genome Biol 1(2):1011. Stebbins GL. 1971. Chromosomal Evolution in Higher Plants. London: Edward Arnold. Sutton A, Shia W-J, Band D, Kaufman PD, Osada S, Workman JL, Sternglanz R. 2003. Sas4 and Sas5 are required for the histone acetyltransferase activity of Sas2 in the SAS complex. J Biol Chem 278:16887–16892. Tachibana C, Stevens TH. 1992. The yeast EUG1 gene encodes an endoplasmic reticulum protein that is functionally related to protein disulfide isomerase. Mol Cell Biol 12:4601–4611. Tanaka T, Nei M. 1989. Positive Darwinian selection observed at the variable-region genes of immunoglobulin. Mol Biol Evol 6:447–459. Tong AHY, Lesage G, Bader GD, Ding H, Xu H, Xin X, Young J, Berriz GF, Brost RL, Chang M, et al. 2004. Global mapping of the yeast genetic interaction network. Science 303:808–813. Wagner A. 2001. The yeast protein interaction network evolves rapidly and contains few redundant duplicated genes. Mol Biol Evol 18:1283–1292. Wolfe KH, Shields DC. 1997. Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387:708–713. Yano K-I, Fukasawa T. 1997. Galactose-dependent reversible interaction of Gal3p with Gal80p in the induction pathway of Gal4p-activated genes of Saccharomyces cerevisiae. Proc Natl Acad Sci USA 94:1721–1726. Yeager M, Hughes AL. 1999. 
Evolution of the mammalian MHC: natural selection, recombination, and convergent evolution. Immunol Rev 167:45–58.

6 Evolution After and Before Gene Duplication?

TOBIAS SIKOSEK and ERICH BORNBERG-BAUER
Evolutionary Bioinformatics Group, Institute for Evolution and Biodiversity, University of Muenster, Muenster, Germany

1 INTRODUCTION

1.1 Where Do Proteins Come From?

Life, as we know it today, depends on protein function. Proteins in large part constitute the phenotype that evolution acts upon through genetic mutations and natural selection. To understand how evolution works, it is fundamental to know how proteins evolve. In a constantly changing environment and under constant competition between individuals, proteins with new functions often determine how successfully an organism can reproduce. It is now generally accepted that new proteins evolve from existing ones, either through small-scale mutations (i.e., single-nucleotide substitutions that change the encoded amino acid) or, more fundamentally, larger-scale mutations (such as domain rearrangements and gene duplications). A protein domain can be considered a structurally as well as functionally independent unit or building block of a protein (unless domains catalyze codependent steps in a multistep reaction and have therefore become fused to one protein). The same domain can be found in various arrangements with other domains in different proteins. Therefore, the term protein is used below interchangeably with the term single domain (unless stated otherwise). The emergence of new proteins and, in particular, new functional protein domains is still an unsolved problem [for a review, see Vogel et al. (2004)]. Thanks to the wide availability of genomic data it has become possible to reconstruct in quite some detail how proteins have rearranged their domains. In essence, it appears that ancestral proteins were probably mostly single-domain proteins which, at a later stage, became fused in different combinations (Björklund et al., 2005). This fusion has been iterated to give rise to more complex arrangements (architectures). The second major driving force is the loss of domains, in particular at the C-termini, due to nonsense mutations (Weiner et al., 2006). 
If one tries to trace back evolution even further, one has to ask how protein domains themselves came into existence. Again, the mechanism of fusion and fission of smaller
fragments (polypeptides) is a possibility that has been investigated (Riechmann and Winter, 2000, 2006; Lupas et al., 2001; Söding and Lupas, 2003; Alva et al., 2007). Those polypeptides would have been too short to fold natively on their own, but possibly were useful as cofactors to ribozymes (Söding and Lupas, 2003). Upon fusion, however, those peptides may have led to stable folds, perhaps due primarily to their repetitiveness (Söding and Lupas, 2003). It is still not clear whether or not some domains still evolve from existing ones by fusion and fission of small peptide fragments, but it seems possible that this mechanism would facilitate the evolution of new protein functions. This is because single amino acid substitutions usually change a protein rather gradually and slowly, whereas larger-scale changes can make “jumps” in sequence space to reach more “distant” phenotypes. Since such larger-scale mutations (also called insertions/deletions, or simply indels) occur primarily during recombination events, this mechanism could provide an answer to the unsolved question of why sexual reproduction (which frequently leads to recombination) has been so “successful” during evolution, despite its costs (Barton and Charlesworth, 1998; Kouyos et al., 2007). Another source for insertions and deletions is the slippage of the DNA strand or of DNA polymerase during replication (Garcia-Diaz and Kunkel, 2006). There is evidence that the existing protein families did not evolve from one common ancestor, but that the multiple-birth model holds instead. This model states that proteins evolved by independent recombination of subdomain fragments or supersecondary structural elements (Choi and Kim, 2006). All known protein structures can be assigned to one of only four major classes [all-α, all-β, α/β, and α + β, as defined by SCOP (Murzin et al., 1995)]. 
Each of these classes is defined by the supersecondary structural elements (ββ-hairpin, αα-hairpin, and βαβ-element) it contains (Söding and Lupas, 2003). These relatively short elements occur in varying numbers of repeats, possibly due to multiple recombination events, and might therefore have been the building blocks of ancient proteins.

1.2 Effects of Point Mutations on the Emergence of New Protein Function

The problem with any kind of mutation is that in most cases genes are under negative or purifying selection, because their activity is needed by the cell. This puts a considerable constraint on evolution, because it requires proteins to remain unchanged (i.e., conserved). At the same time, organisms need to adapt to new environmental conditions by modifying proteins or generating new ones via mutations. They can do so with astounding speed: for example, when pathogens develop antibiotic resistance. Wagner (2008) discusses this apparent paradox. The extent to which point mutations are either neutral, advantageous, or detrimental is still a matter of debate. A mutation in the active site of an enzyme, for example, will almost certainly affect its function. It is more difficult to infer the fitness effects of mutations that are not directly involved in function. Most mutations alter the stability with which a protein sequence folds into its three-dimensional conformation, especially if a hydrophobic residue in the protein core is replaced by a polar residue. But there are at least two other properties of protein evolution that balance the detrimental effects of a mutation: (1) compensatory mutations (DePristo et al., 2005) and (2) the general robustness of protein structures to mutations (Wagner, 2008). Compensatory mutations can correct the destabilizing effects that a mutation has on protein structure.
This means that the fitness effect that a mutation has is always relative to its genetic background (i.e., to amino acids at other sites in the protein sequence). Mutational robustness (i.e., the ability to maintain a functioning phenotype under genetic mutations) is a property observed in proteins (Chan and Bornberg-Bauer, 2002; Xia and Levitt, 2004; Tokuriki et al., 2007) [as well as in RNAs (Wagner and Stadler, 1999)] that in itself has been proposed to be subject to adaptive evolution because it reduces the overall proportion of detrimental mutations in favor of neutral mutations. These and other aspects [such as selection for increased translational efficiency (Drummond et al., 2005)] have to be considered when estimating mutational effects on protein fitness. However, the more conserved a protein (i.e., the more important its correct, uncompromised functioning is to the organism), the higher the chance of a mutation being detrimental. On the other hand, some proteins are more dispensable than others, and therefore their malfunctioning might be more tolerable (Pál et al., 2006). Therefore, fitness effects of mutations vary not only along the nucleotide sequence of a protein-coding gene [expressed as its fitness density (Drummond et al., 2005; Pál et al., 2006)] but also between different genes [referred to as the dispensability of the encoded proteins (Pál et al., 2006)]. Still, it seems reasonable to assume that many, if not most, point mutations in a gene will have a negative effect on the fitness of the protein.

1.3 Emergence of New Proteins via Gene Duplication

The predominant view is that a gene duplication is required for biological innovation because it provides the opportunity for evolution to “try out” alternative protein designs without sacrificing an existing design. There are different mechanisms that may produce gene duplicates. 
Small-scale duplications (SSDs) of one or a few genes can happen, for example, by unequal crossing-over or retrotransposition. Unequal crossing-over occurs when two homologous sequences on different chromosomes misalign during recombination. Retrotransposition is carried out by mobile genetic elements that copy themselves into other regions of the genome, using RNA intermediates and reverse transcriptase. Occasionally, adjacent genes are copied with those mobile elements. Such SSDs have been estimated to occur quite frequently: about 0.01 per gene per million years (Lynch and Conery, 2000). Whole-genome duplications (WGDs) are another possible mechanism, although much rarer. WGDs are the result of mitotic cell divisions that only duplicate the genome but fail to separate the cytoplasm as well, or alternatively, WGDs can be caused by two rounds of replication without a cell division in between. In diploid organisms this results in a tetraploid cell. When this cell enters meiosis it produces gametes that are diploid, and a zygote produced by the fusion of two such gametes will be tetraploid. Polyploidy is very common in plants but also occurs in animals, mostly in fish and amphibians. In mammals, polyploid zygotes are not viable, probably because polyploidy interferes with the chromosome-based sex determination mechanisms. After a WGD, the organism usually returns to its original ploidy level. For example, a diploid organism that has become tetraploid will return to its diploid state (diploidization) by, for example, losing excess chromosomes or by fusing them. This has been shown for the plant Arabidopsis thaliana, for which evidence of several WGDs could be found (Bowers et al., 2003). The result of the WGD event is a number of paralogs
that have been retained for various reasons, as explained below. WGDs are thought to have played key roles in large evolutionary transitions which increased organismal complexity and led to major innovations in the larger branches of the tree of life. Above all, a WGD provides a high degree of redundancy in the genome, since every gene then exists twice. This also has consequences for the interactions between proteins: for example, in gene regulatory networks, where the duplicated version of a network can assume entirely different regulatory interactions. Potentially, this enabled the evolution of more complex developmental processes. Duplications of genomic regions can also occur in a way that cuts off parts of a gene (Figure 1A). The duplicate might then miss parts or all of its regulatory and promoter regions, which means that it is regulated differently or cannot be expressed at all. Alternatively, the duplicate might miss some part of its C-terminus: for example, when a retrotransposon lies within an intron. Gene duplications, at whatever scale, occur in single individuals. A population-wide fixation has to follow for evolution to take advantage of the innovative potential of the duplicates. (The fixation of a WGD might actually result in a new species, as reproductive incompatibilities are often found between individuals of different ploidy levels.) This can occur either randomly by drift or because the duplicates provide an immediate positive fitness effect. It is also possible, and may even be likely, that a duplication event has a negative effect on fitness. This could occur, for example, by bringing an imbalance into a metabolic equilibrium by doubling the amount of the gene’s product or, in the case of a regulatory protein, by overregulating a certain cellular process (Papp et al., 2003). These are called dosage effects (Figure 1B). 
It is debatable whether a WGD escapes negative dosage effects by doubling the amount of all gene products at the same time, thus reducing the probability of imbalances. However, since traces of gene duplications can be found abundantly in all organisms, it can be assumed that a sufficient proportion of gene duplicates reaches fixation (Force et al., 1999; Blomme et al., 2006; Brunet et al., 2006).

1.4 What Happens After Gene Duplication?

For the period after fixation, several possible fates of gene duplicates have been proposed, and they are summarized in Figure 1C. Most duplicates will be lost again (Lynch and Conery, 2000). Since they are redundant and can therefore accumulate mutations without any effect on fitness, mutations in the promoter region will eventually prevent transcription of the gene. It then becomes a pseudogene, a gene that is still homologous to its former paralogs and orthologs but that is no longer transcribed. Pseudogenes eventually lose all resemblance to their homologs because they accumulate mutations at a relatively high rate. Finally, the pseudogenization process is completed when the silenced copy becomes fixed in the population.

For retention, a gene duplicate will have to acquire properties that render it beneficial to the organism. There are several alternative, nonexclusive models for how this could happen [for an excellent review, see Conant and Wolfe (2008)]. The classical view, as presented by Ohno (1970), states that one of the gene copies remains as it was before the duplication, while the other copy develops a new function by randomly “exploring” protein sequence space (which contains all possible protein sequences) through neutral drift until it “finds” a new function that provides a fitness advantage. This hypothesis, referred to as neofunctionalization, is

INTRODUCTION

Figure 1 Possible outcomes of gene duplication. (A) A gene duplication can copy genes either in their entirety (complete duplication) or as a fragment of the original gene (incomplete duplication). An incomplete duplication can result either in a nonfunctional copy or in a copy with modified function and/or expression. (B) Dosage effects can play a role for some duplicated genes if the level of expressed protein is under selection. The second copy of the gene might therefore get lost eventually. Alternatively, mutations in the protein-coding region or in the regulatory region of a gene can compensate for increased protein levels caused by the duplication. The increase in dosage can also be advantageous or simply irrelevant for fitness. (C) In case of a complete duplication, both copies initially have redundant functions and expression patterns. According to the currently predominant models, pseudogenization, neofunctionalization, and subfunctionalization are the most common fates of duplicate genes. (Based on Hughes, 2007.)

illustrated in Figure 2. The new function of the duplicate can arise from an actual change in the amino acid sequence of the protein, but also from a mutation in the regulatory region of the gene (thus altering expression level, time, and/or location) or potentially even from a mutation that alters the splicing pattern of the gene. Since the formulation of the neofunctionalization model, examples have been found where the percentage of retention after a WGD is quite high (Ahn and Tanksley, 1993; Hughes and Hughes, 1993; Van de Peer et al., 2003; Blomme et al., 2006): for example, 15% in teleost fish (Brunet et al., 2006). This suggests that neofunctionalization cannot be the only explanation for retention, because it would take too long to acquire so many new functions via beneficial mutations, which are generally assumed to be very rare.

Another model is subfunctionalization [also known as the duplication–degeneration–complementation (DDC) model (Force et al., 1999)], where a protein with two functions hands down one function to each of its duplicates through deleterious mutations that degrade complementary regions of the two gene copies (Force et al., 1999; Lynch and Force, 2000). This model does not require the assumption of beneficial mutations. Instead, retention occurs only because the duplicates have become nonredundant in terms of function. Subfunctionalization can occur either in the regulatory region of a gene (thus altering expression patterns) or in the protein-coding region. Subfunctionalization of the coding region in a multidomain protein happens when the two duplicates accumulate changes in different domains. In one duplicate one of

Figure 2 Gene duplication event followed by neofunctionalization within a gene tree. The ancestral gene XYZ diverges via two speciation events into the orthologs X1, Y1, and Z1 (X, Y, and Z representing different species). As the gene is under purifying selection, it accumulates only neutral mutations (horizontal axis). Additionally, a duplication of the gene occurs in one of the branches after the first speciation. One copy evolves toward a new beneficial function via adaptive mutations (vertical axis), which then becomes conserved (also during the second speciation). The result is two pairs of paralogs: (X1, X2) and (Y1, Y2).

the domains keeps its original function while another domain changes, and vice versa in the other duplicate [see, e.g., Chain and Evans (2006) and Figure 1C]. The question of how the conversion from one domain into another might occur is an important one and is addressed later. Domains are likely to play an important role in the diversification of protein functions by subfunctionalization and neofunctionalization. Since domains are functionally distinct from each other, sub- and neofunctionalization can occur only for a subset of domains of the same protein while leaving the other domains unchanged. An example of this can be found in the bHLH (basic helix–loop–helix) transcription factor family in humans, where two domains are the same for all family members (the basic DNA-binding domain, as well as the helix–loop–helix dimerization domain), whereas a third domain (which is also involved in dimerization) varies among groups within the family (Amoutzias et al., 2004). These groups coincide with subnetworks of the bHLH regulatory network, because dimerization can occur only between monomers with the same dimerization domains. A very similar example can be found in regulatory networks relying on the widely spread MADS domain (Veron et al., 2007). The evolution of regulatory networks might therefore, at least in some cases, rely on subfunctionalization events involving the rearrangement and modification of domains (Amoutzias et al., 2004; Veron et al., 2007). Another possible mechanism for duplicate retention, especially after WGDs, is dosage balance. As described above, dosage effects might be less severe if all interaction partners are duplicated, as in WGDs. Those genes that are especially prone to dosage effects might therefore be dependent on the retention of their duplicates in order to maintain their newly found balance after duplication. 
Dosage effects might in some cases even be of adaptive value, in that a gene that is already transcribed at its maximum rate can further increase its overall transcript output by having additional copies of itself in the genome. However, dosage-related effects are probably relevant for only some genes, so that subfunctionalization remains the most likely candidate among the models explaining gene retention after duplication. A related mechanism of potential importance is the retention of gene duplicates due to “activity-reducing mutations” (Scannell and Wolfe, 2008). Dosage effects might be reduced relatively quickly by accumulation of slightly deleterious mutations (Figure 1B). Since most mutations are thought to reduce stability (possibly accompanied by a decrease in catalytic efficiency), the activity of both copies might be reduced to the original (preduplication) level, which will improve the chances of duplicate retention.

It is obvious that subfunctionalization by itself (i.e., the redistribution of existing functions) is not a satisfying explanation for innovation of protein function, but only for duplicate retention. More promising, therefore, is a combination of neo- and subfunctionalization [subneofunctionalization (He and Zhang, 2005; Rastogi and Liberles, 2005)], according to which subfunctionalization occurs rapidly after gene duplication and leads to retention of the duplicate, followed by neofunctionalization. At least for SSDs in mammals, neo- or subneofunctionalization seems to be the dominant fate of gene duplicates (Hughes and Liberles, 2007). Neither neofunctionalization nor subfunctionalization can yet account for the high probability of pseudogenization (i.e., the loss of function of a gene relieved from selection and therefore exposed to the accumulation of deleterious mutations). If adaptive evolution can only occur after a gene has already been duplicated, there might not be

enough time for drift to produce a mutation that eventually provides a fitness advantage before pseudogenization occurs. Directly after gene duplication the rate of pseudogenization is high but also decreases rapidly with time (Hughes and Liberles, 2007), so that a quick mechanism of subfunctionalization is required to retain a sufficient number of gene duplicates that can then undergo further neofunctionalization. This problem—the retention of the duplicate at frequencies and over time periods sufficient for the accumulation of adaptive mutations—has recently been termed Ohno’s dilemma, and the proposed solution is similar to the one outlined here (Bergthorsson et al., 2007). In this chapter we present experimental and theoretical evidence that might explain how evolution toward different protein functions can start before the corresponding gene undergoes duplication and how this “preduplication” evolution might therefore facilitate the following sub- or neofunctionalization process. A key mechanism might be the exploitation of the promiscuous (or latent) protein functions that many proteins seem to have and that are free to change and adapt. For several decades the concept of single-domain proteins performing more than one function has been known as gene sharing, and the possibility of such proteins preceding and facilitating evolution after gene duplication has been proposed (Piatigorsky and Wistow, 1989; Hughes, 1994) and is now beginning to gain some experimental support (McLoughlin and Copley, 2008). This model has, however, received little attention compared with the models of evolution after gene duplication, probably because only a few very specific examples were known. Only recently has this concept reappeared under the name escape from adaptive conflict (Hittinger and Carroll, 2007; Conant and Wolfe, 2008; DesMarais and Rausher, 2008), proposing that it might be a more common evolutionary mechanism than previously thought. 
This adaptive conflict arises when a gene is under selection pressure both to conserve an old function and to acquire a new one at the same time. The conflict might lead to gene sharing via promiscuous protein functions, at the cost of a reduced efficiency for each function. The escape from this conflict can only occur after gene duplication, when each copy specializes in only one of the ancestral functions.

The shift between protein functions is likely to be associated with changes in protein-folding stabilities. We therefore begin with a discussion of protein stability and proceed with the role that stability plays in the evolution to new functional phenotypes. We then discuss population genetic aspects as well as the possible role of phenotypic mutations in adaptation. Finally, we compare the evolution of ribozymes with the evolution of proteins to find potential similarities.
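
The timing argument behind Ohno's dilemma can be made concrete with a small calculation. The sketch below is not taken from the chapter: it assumes, purely for illustration, that silencing mutations and beneficial mutations each arrive at a constant rate (the text notes that the real pseudogenization rate declines with time), and both rate values are hypothetical.

```python
import math

def prob_beneficial_first(rate_beneficial, rate_silencing):
    """P(a beneficial mutation arrives before a silencing one).

    Assumes two independent Poisson processes with constant rates
    (events per million years); this is a simplifying assumption.
    """
    return rate_beneficial / (rate_beneficial + rate_silencing)

def prob_not_silenced(rate_silencing, time_my):
    """P(no silencing mutation has occurred within time_my million years)."""
    return math.exp(-rate_silencing * time_my)

# Hypothetical rates, chosen only to illustrate the imbalance.
RATE_SILENCING = 0.25     # silencing mutations per million years (assumed)
RATE_BENEFICIAL = 0.0025  # beneficial mutations per million years (assumed)

if __name__ == "__main__":
    print(prob_beneficial_first(RATE_BENEFICIAL, RATE_SILENCING))
    print(prob_not_silenced(RATE_SILENCING, 10.0))
```

With these assumed rates the beneficial mutation arrives first in only about 1% of cases, which is the kind of imbalance that a rapid retention mechanism such as subfunctionalization would have to offset.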

2 STABILITY OF PROTEIN STRUCTURES

Each protein folds into the conformation with the lowest free energy of all possible conformations, which is therefore the thermodynamically most stable one. This is called the native conformation, and it is traditionally associated with protein function. However, there is now ample and compelling evidence that the structural dynamics of a protein are essential for its function (Boehr et al., 2006; Vendruscolo and Dobson, 2006; Henzler-Wildman et al., 2007). Although proteins are usually thought of as having only one native conformation, it is possible for a protein sequence to have more than one native conformation (i.e., several conformations with the same free energy).
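
How strongly a free energy difference favors one conformation over another can be illustrated with the standard two-state Boltzmann relation. The following is a minimal sketch, assuming only two accessible conformations at thermal equilibrium; the function name, the units (kcal/mol), and the default temperature are choices made here for illustration.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)

def fraction_in_second_state(delta_g_kcal, temperature_k=298.15):
    """Equilibrium fraction of molecules in conformation 2.

    delta_g_kcal is G2 - G1 in kcal/mol, so a negative value means
    that conformation 2 is the more stable one. Two-state assumption.
    """
    return 1.0 / (1.0 + math.exp(delta_g_kcal / (R_KCAL * temperature_k)))

if __name__ == "__main__":
    for delta_g in (-2.0, 0.0, 2.0):
        print(delta_g, fraction_in_second_state(delta_g))
```

When the two conformations have the same free energy the populations are equal, which corresponds to the case of a sequence with more than one native conformation described above.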

A remarkable property of protein structures is their robustness toward single amino acid changes (Ferrada and Wagner, 2008; Wagner, 2008). Many amino acids in a protein sequence can be substituted by other amino acids without compromising the overall structure or function of the protein. This means that in protein sequence space (which contains all possible protein sequences) a sequence has many neutral neighbors: one-error mutants that all fold into the same conformation. The theoretical construct representing all the “mutationally connected” protein sequences folding into the same conformation is called a neutral network, a concept first used for RNA structures (Schuster et al., 1994) and then for proteins (Bornberg-Bauer, 1997). Neutral networks are often drawn as graphs in two dimensions (Figure 3); this is a strong simplification, however, as sequence space is multidimensional (one dimension for each amino acid in the peptide chain). Therefore, neutral networks, and the distances between them, usually cannot be drawn to the correct scale.

According to Bornberg-Bauer and Chan (1999), every neutral network is associated with one conformation and organized around a prototype sequence. The prototype sequence of a neutral network is, in theory, the sequence with the highest thermodynamic stability for the associated conformation and might also coincide with the consensus sequence of a protein family (Figure 3). The prototype sequence usually lies at the center of its network, and thermodynamic stability is thought to decrease smoothly when moving toward the edge of the network (Figure 4). This funnel-like distribution of thermodynamic stability in neutral networks of proteins, predicted by

[Figure 3 image. Labels: Protein family; Protein 1, Protein 2, Protein 3; Unstable, Stable. Axes: Free energy, Sequence space.]

Figure 3 Members of a protein family within the same neutral network. The nodes in the neutral network are protein sequences connected by single amino acid substitutions (edges). The plane represents sequence space, the vertical axis is the free energy of protein folding. The members of this hypothetical protein family recently evolved from the same ancestral protein via gene duplication and subsequent neutral point mutations and all fold into the same native structure. Therefore, they all inhabit the same neutral network. The prototype sequence (node surrounded by a circle) of this network has the highest number of neutral neighbors and might be equivalent to a consensus sequence derived from all sequences belonging to this protein family.
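
The graph structure described in this caption can be sketched in a few lines. The example below is an illustration under stated assumptions, not a method from the chapter: the list of sequences is hypothetical and is simply assumed to fold into one structure, and the prototype is taken to be the node with the most neutral neighbors, as in the caption.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def neutral_neighbors(sequences):
    """Map each sequence to the network members one substitution away."""
    neighbors = {s: [] for s in sequences}
    for a, b in combinations(sequences, 2):
        if len(a) == len(b) and hamming(a, b) == 1:
            neighbors[a].append(b)
            neighbors[b].append(a)
    return neighbors

def prototype(sequences):
    """Return the sequence with the largest number of neutral neighbors."""
    neighbors = neutral_neighbors(sequences)
    return max(sequences, key=lambda s: len(neighbors[s]))

# Hypothetical sequences (two-letter H/P alphabet), all assumed to
# fold into the same native structure.
NETWORK = ["HPHH", "HPHP", "PPHH", "HHHH", "HPPH", "PPHP"]

if __name__ == "__main__":
    print(prototype(NETWORK))
    print(neutral_neighbors(NETWORK))
```

In this toy network the sequence HPHH is connected to four of the five other members and therefore plays the role of the prototype.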

Figure 4 Energy landscape of two adjacent neutral networks. (A) The large plane is a two-dimensional representation of sequence space. The vertical axis represents the free energy (G) of the protein structures x1 and x2 associated with the neutral networks [here, molecular structures of the Arc-repressor are taken as an example (Cordes et al., 2000)]. The lower the energy, the higher the thermodynamic stability of the structure. Each symbol represents a protein sequence; the lines between them represent amino acid substitutions. Neutral networks are distinguished by symbol shape. Sequences that fold uniquely into one conformation are shown in black; those that are equally stable in more than one conformation are shown in white. The sequence in the middle of the two nets folds equally well into both structures, x1 and x2. A path connecting the prototype sequences (framed symbols) of the two networks is drawn. (B) Frontal view of the energy landscape, showing that the two structures x1 and x2 coexist in an equilibrium for the protein sequences lying in between the two neutral networks. The locations of the prototype sequences are indicated in parts A and B by dashed lines. (See insert for color representation of the figure.)

theoretical studies with lattice models∗ (Bornberg-Bauer and Chan, 1999), has recently been confirmed experimentally (Bloom et al., 2006). In the simple world of lattice proteins it is possible to assign the highest fitness to the prototype sequence if structural stability is to be taken as a proxy for functional efficiency (Cui et al., 2002; Wroe et al., 2007). Real proteins, however, function only within a certain window of stability. On the one hand, proteins need some conformational flexibility to bind to their ligands or interaction partners. On the other hand, too much instability leads to unfolding, aggregation, and degradation (DePristo et al., 2005). Many enzymes undergo structural changes while carrying out their normal functions. Adenylate kinase, for example, undergoes the same steps of conformational changes that it would need for processing its substrate, even in its unbound state (Henzler-Wildman et al., 2007). Most mutations alter the thermodynamic stability of proteins (Alber, 1989; Pakula and Sauer, 1989; Matthews, 1995; Wilson et al., 1992). The effect that a mutation has on protein stability, however, depends largely on the genetic context (i.e., on other stability-changing mutations). The same mutation can therefore be either neutral, advantageous, or deleterious in terms of stability. It is neutral if it does not alter protein stability in a way that compromises its correct folding or functioning. Also, an adaptive mutation might be temporarily disadvantageous regarding stability but can be compensated for by another mutation that restores the original stability of the protein (DePristo et al., 2005). For each deleterious mutation there is a number of potentially compensatory mutations, as demonstrated for the bacterium Salmonella typhimurium (Maisnier-Patin et al., 2002). 
This bacterium carried a mutation in the ribosomal protein S12 that confers antibiotic resistance but is otherwise detrimental (it slows the rate of protein synthesis, due to an increased proofreading rate). Of 80 lineages carrying this mutant, 77 independently evolved additional compensatory mutations, so that antibiotic resistance was retained while the deleterious side effects were reduced. In total, the compensatory mutations found comprised 35 different amino acid substitutions (Maisnier-Patin et al., 2002).

Several mutations can also be introduced simultaneously by crossing-over (recombination) of homologous DNA sequences. Instead of only one amino acid substitution, an entire part of a protein sequence can be exchanged, carrying all the substitutions within it. How many amino acids are changed depends, of course, on how different the homologous DNA sequences are. Crossing-over can also maintain epistatic effects between amino acids if they are copied together (Barton and Charlesworth, 1998; Kouyos et al., 2007). Recombination has been shown to speed up evolutionary transitions in computer simulations of lattice proteins by “tunneling” through sequence space, thus reaching more distant structures (Cui et al., 2002).
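
The difference between a point mutation and a recombination event can be shown with a short example. This is a minimal sketch assuming a single crossover point applied directly to two aligned protein sequences of equal length; both sequences are hypothetical and were invented for illustration.

```python
def single_crossover(parent_a, parent_b, point):
    """Return a recombinant with parent_a before `point` and parent_b after it."""
    if len(parent_a) != len(parent_b):
        raise ValueError("sequences must be aligned and of equal length")
    return parent_a[:point] + parent_b[point:]

def substitutions(original, variant):
    """List (position, old residue, new residue) for every difference."""
    return [(i, x, y) for i, (x, y) in enumerate(zip(original, variant)) if x != y]

# Hypothetical homologous sequences differing at four positions.
PARENT_A = "MKTAYIAKQR"
PARENT_B = "MRTAYLSRQR"

if __name__ == "__main__":
    recombinant = single_crossover(PARENT_A, PARENT_B, 5)
    print(recombinant)
    print(substitutions(PARENT_A, recombinant))
```

Here one crossover event carries three substitutions into the first parent at once, and the number transferred depends only on how many differences lie in the exchanged segment.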

∗ Lattice models use very simplified representations of proteins to test general properties of protein folding, which are then extrapolated to the folding of real proteins. The use of such simple models is necessary because an algorithm for sequence-to-structure mapping is still not available for proteins. In lattice models the polymer chain of a protein can assume discrete positions only on a two- or three-dimensional lattice. The monomer alphabet is usually reduced from 20 to only two monomers, with the properties “hydrophobic” and “polar” (or H and P). Those are the two properties thought to be most relevant in defining a protein’s structure. Alternative models use compact squares or cubes and contact interactions which are drawn randomly from a continuous energy distribution.
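
A lattice model of the kind described in this footnote is small enough to enumerate exhaustively for short chains. The sketch below assumes a two-dimensional square lattice and the common HP energy convention of -1 for every pair of H monomers that are lattice neighbors but not chain neighbors; the footnote does not specify an energy function, so that convention is an assumption made here.

```python
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def conformations(n):
    """Yield every self-avoiding walk of n monomers on the square lattice.

    Walks start at the origin; rotations and reflections are not merged.
    """
    def extend(path, occupied):
        if len(path) == n:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in occupied:
                path.append(nxt)
                occupied.add(nxt)
                yield from extend(path, occupied)
                path.pop()
                occupied.remove(nxt)
    yield from extend([(0, 0)], {(0, 0)})

def energy(sequence, path):
    """Contact energy: -1 per H-H pair adjacent on the lattice, not in the chain."""
    total = 0
    for i in range(len(path)):
        for j in range(i + 2, len(path)):
            if sequence[i] == "H" and sequence[j] == "H":
                (x1, y1), (x2, y2) = path[i], path[j]
                if abs(x1 - x2) + abs(y1 - y2) == 1:
                    total -= 1
    return total

def min_energy(sequence):
    """Lowest contact energy over all conformations of the sequence."""
    return min(energy(sequence, p) for p in conformations(len(sequence)))

if __name__ == "__main__":
    for seq in ("HPPH", "PPPP", "HPHPPH"):
        print(seq, min_energy(seq))
```

Exhaustive enumeration grows exponentially with chain length, so this approach is practical only for short sequences of the kind used in such toy studies.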

3 STRUCTURAL PROMISCUITY OF PROTEINS

To understand how one domain can be converted into another, as expected during neofunctionalization (see Section 1.4), it is important to consider aspects of protein structure and stability. If a domain eventually assumes a new fold by amino acid substitutions, there must be a point where enough mutations have accumulated to switch from one fold to another. Determining this point (i.e., the number of amino acid changes necessary) has been a challenge for some time now, resulting, for example, in the Paracelsus challenge, which called for a conversion between two protein folds while retaining 50% of the original amino acid composition (Rose and Creamer, 1994; Jones et al., 1996). This goal was achieved [with the Janus protein (Dalal et al., 1997)] and even exceeded recently by a pair of proteins sharing 88% of their amino acids but assuming different folds (Alexander et al., 2007). These are, however, extreme, artificially produced examples and are not necessarily the rule for structural transitions in nature. Instead of changing entire folds, smaller conformational changes might be more common during neofunctionalization. As already discussed, structural transitions via single amino acid substitutions can be drawn as neutral paths through sequence space. The more mutational steps a protein sequence takes from the center of its neutral network toward the edge (Figure 4), the less stably it folds into its native conformation∗ . It has been shown that some proteins occur in an equilibrium state of two different folds, and that a very small number of mutations is sufficient to change this equilibrium in either direction. 
This means that the two folds are thermodynamically equally (or at least similarly) stable and that the protein molecules either stay in their respective fold once they have begun one particular folding pathway [such as Arc-repressor (Cordes et al., 2000), Rop, or the human prion protein; see Meier and Ozbek (2007) for an extended list], or they switch back and forth (fluctuate) constantly between the two structures. For minor conformational changes, fluctuations would be more likely and, as mentioned above, are to a certain degree necessary for the function of many proteins. In those cases, the general fold (i.e., the relative distribution of secondary structural elements) remains the same. Examples of ambiguity in the secondary structure can be found in proteins with chameleon sequences, which are equally likely to form an α-helix or a β-sheet, such as the Janus protein (Dalal et al., 1997). This illustrates how close two very different structures can be in terms of the amino acid sequences that form them.

Interestingly, some evidence has already been gathered of in vivo protein structures that appear to lie in between neutral networks. One example is the Arc-repressor, which carries a region folding into a β-sheet in the wild-type form. One amino acid substitution in that region, however, leads to a mutant form that can form an α-helix instead. In this mutant, both conformations occur in equilibrium, switching back and forth dynamically (Cordes et al., 2000). Another example is the prion protein, which is thought to be an evolutionary intermediate of a transmembrane protein “on its way” to becoming a soluble globular protein. This would be the reason for its occasional folding into an insoluble, aggregating form, which causes the pathogenic symptoms of Creutzfeldt–Jakob disease and related diseases (Tompa et al., 2001). Similar properties have been found in proteins with cysteine-rich domains (CRDs), which provide physical strength through disulfide bridges and occur in the nematocysts of Hydra. The structure of one such domain could be transformed into the structure of another naturally occurring CRD (with a completely different pattern of disulfide bridges) by introducing only two point mutations in vitro (Meier et al., 2007). The first mutation already led to an equilibrium state of both conformations; the second then completed the transition. Also, the Rop (repressor of primer) protein folds into two different four-helix-bundle structures that form two different dimers (Levy et al., 2005). These and other examples [reviewed by Meier and Ozbek (2007)] provide a continuously growing body of evidence for the plasticity of protein structures.

As mentioned earlier, another type of mutation is that of indels (insertions/deletions). Structural transitions have recently been estimated with bioinformatics approaches, which focused primarily on insertions of small peptides into existing structures (Jiang and Blouin, 2007; Viksna and Gilbert, 2007). These studies found that structural transitions via insertions seem to be an important mechanism of protein evolution.

∗ The concept of neutrality has to be used very carefully here, because even within neutral networks, mutations are never entirely neutral. Each mutation alters stability, but not all mutations are so disruptive that they lead to a new native conformation. The fitness effects of those minor stability changes are relatively small, so they can be called neutral.

4 EVOLUTIONARY TRANSITIONS BETWEEN PROTEIN PHENOTYPES

The emergence of new protein phenotypes (in terms of structure, function, or both) has long been an unsolved problem and to some extent still is. The main difficulty in finding a solution might be the fact that most mutations are thought to be deleterious and that, consequently, adaptive mutations are much too rare to explain how a protein can evolve from one phenotype to another (let alone how a de novo protein can be created from a random sequence). As already mentioned, the crucial step is to maintain a viable protein while inventing a new one.

The evolution from one protein structure x1 to another protein structure x2 by single amino acid substitutions can be drawn as a path through sequence space. This path starts within the neutral network of x1, somewhere near its prototype sequence, and ends near the prototype sequence of the neutral network of x2. The two neutral networks have to be adjacent (i.e., one mutation can change the dominant fold of the protein) or overlapping [i.e., there is one (or more) protein sequence(s) that can fold into the conformations associated with both nets] for the transition to be smooth enough in terms of fitness. This is possible when both phenotypes provide some fitness advantage. As the sequence “moves” (by means of mutations) along the path between the two neutral networks, x1 becomes less stable and therefore ceases to be the dominant fold. At the same time the stability of x2 increases, and x2 becomes the dominant fold of the protein. This means that the protein occurs in one or the other conformation during the transition, so that the protein’s fitness cannot drop too low. Such neutral paths would be suitable for guiding the evolution of new phenotypes through regions of sequence space that are “safe enough” (i.e., the intermediate states between two phenotypes are not too detrimental and therefore remain in the population). Such phenotypic transitions have been observed in both computational simulations (Wroe et al., 2007) and laboratory evolution experiments (Amitai et al., 2007). Conceivably, there are two different “routes” that the transition from one phenotype to another can take: either via a promiscuous generalist intermediate or directly

from one specialist phenotype to another (Khersonsky et al., 2006). The route of the promiscuous intermediate is more likely when there is not a high fitness cost for having both promiscuous phenotypes at the same time. The direct route is favored when there is strong dual selection pressure that acts to promote one phenotype and demote the other one at the same time [e.g., in the context of signaling and quorum sensing (Collins et al., 2006)], so that the transition between the phenotypes is very abrupt. Both cases, generalist intermediates (Rothman and Kirsch, 2003) and dual selection (Varadarajan et al., 2005; Collins et al., 2006), have been observed, although the first type seems to be the predominant one (Khersonsky et al., 2006). After having focused on the structural implications of phenotypic transitions, we look next into aspects of protein function and the probable correlation between the two.

5 FUNCTIONAL PROMISCUITY OF ENZYMES

For some time now, enzymes have been known to catalyze reactions with substrates other than their native ones, albeit at very low rates (O’Brien and Herschlag, 1999; Khersonsky et al., 2006). This phenomenon is referred to as catalytic or functional promiscuity. One of the best-studied examples of promiscuous protein function is serum paraoxonase (PON1), for which four promiscuous activities have been described (Aharoni et al., 2005) (Figure 5A). Furthermore, these activities could be increased by factors of 10 to 100 [or even more in other examples (Khersonsky et al., 2006)] through directed laboratory evolution, without substantially reducing the native activity (Aharoni et al., 2005; Amitai et al., 2007). The native function of PON1 is that of a lipo-lactonase, which hydrolyzes esters in lactone rings. All promiscuous functions are esterase activities as well, but for slightly different substrates. As is typical for enzyme promiscuity, just a few amino acid changes in or around the active site of PON1 are needed to increase promiscuous activity by several orders of magnitude (Aharoni et al., 2005). Promiscuous activities can also be called latent when their effects are negligible but have the potential to be increased dramatically by just a few mutations in the right places.

Often, promiscuous enzymes have more than one latent activity besides their native activity. This can be interpreted as being close to more than one neighboring neutral network of a functional phenotype (as demonstrated for PON1 in Figure 5A). Experimentally, it was observed for some protein families that their members share one promiscuous function or that the native function of one member is the promiscuous function of another, and so on (Khersonsky et al., 2006). This indicates that all the functions found within a family are located within the same evolutionary neighborhood in sequence space.
In the course of evolution, different family members might have explored this neighborhood in different directions. It has been proposed that at the origin of every protein family there was a generalist that could perform some or all of the functions of its descendants, which eventually became specialists as they diverged (Jensen, 1976; Khersonsky et al., 2006): for example, through subfunctionalization. The fitness landscape of the hypothetical neutral network of PON1 (and the adjacent nets of the promiscuous functions) (Amitai et al., 2007) is shown in Figure 5B and C. In each neutral network there is a region with maximum fitness. Between

Figure 5 Hypothetical sequence neighborhood of the enzyme PON1. (A) Putative neutral network of PON1 with adjacent networks associated with promiscuous functions. The symbols represent different protein sequences connected by amino acid substitutions (lines). Symbols of different shapes belong to different neutral networks. Symbols with circles around them represent sequences on maxima in the fitness landscape. Large dashed circles delimit the neutral networks of the promiscuous functions. The native function of PON1 is that of a lipo-lactonase (circles). Promiscuous functions are thiolactonase (hexagons), aryl-esterase (squares), phosphotriesterase (triangles), and drug resistance (stars). (B) Fitness landscape of the same neutral network. The plane represents protein sequence space and the vertical axis is the fitness of an individual expressing the corresponding protein. Over the center of each neutral network lies a fitness maximum. Paths between the neutral networks correspond to ridges in the fitness landscape, which connect the maxima of neighboring networks. (C) Frontal view of the fitness landscape, showing the maxima in profile. A hypothetical diagram under a highlighted part of the fitness landscape shows how catalytic activities of a promiscuous function (thiolactonase) and the native function decrease and increase along the connecting path in sequence space [see Aharoni et al. (2005) for experimental data]. Along this path, overall fitness does not decrease much if both native and promiscuous functions can be maintained by the same enzyme. For a complete transition from one network to another, however, a gene duplication event might be necessary (see Figure 6). [(A) Adapted from Amitai et al., 2007.] (See insert for color representation of the figure.)


EVOLUTION AFTER AND BEFORE GENE DUPLICATION?

those local fitness maxima, the fitness of the corresponding protein sequences (i.e., points in sequence space) does not drop completely, but forms a fitness “ridge” that allows a path of nearly neutral mutations between two neutral networks. An important question remaining is whether or not structural and functional promiscuity generally evolve simultaneously (James and Tawfik, 2003; Tokuriki and Tawfik, 2009). Whereas evolution of promiscuous functions can be explained without major conformational changes, the opposite case seems very unlikely, because a new structure is useless without function. The transition between neutral networks would then be correlated not only with different activities (as described for PON1 above) but also with different structures. A promiscuous protein could then be considered to be located at the edge of the neutral network of its native structure, bordering onto the neutral network of another structure (Figure 4B). Each structure would then correspond to a different activity: for example, the binding of different substrates. An intriguing example of such a link between promiscuous structures and functions has been found in antibodies. As shown in vitro, the same antibody, named SPE7, could bind a small peptide as well as a large protein surface. Each substrate was bound via a different backbone conformation of the antibody (James et al., 2003). Other, very recent examples of functionally promiscuous proteins include ubiquitin (Lange et al., 2008) and cytochrome P450 (Muralidhara et al., 2008), for both of which distinct crystal structures were resolved corresponding to the conformations necessary for binding different ligands.

6 GENE DUPLICATIONS AND PHENOTYPIC TRANSITIONS AT THE POPULATION LEVEL

It is important to emphasize that gene duplications occur within single organisms. Whether such a duplication becomes a permanent part of the genetic repertoire of a species (or even leads to the formation of a new species) also depends on the dynamics at the population level. A new gene duplicate is likely to be lost soon after it arises, either because of lethal or detrimental effects or simply by chance.


Population size has a considerable effect on the fate of gene duplicates as estimated by computer simulations (Lynch and Force, 2000). In large populations, pseudogenization (also called nonfunctionalization) as well as the determination of gene duplicate fate (by subfunctionalization or neofunctionalization) take longer. This should provide an advantage for organisms that occur in smaller populations (e.g., multicellular eukaryotes). Since positive mutations are fixed more rapidly in the population, those organisms can profit from beneficial gene duplications more quickly than can organisms that occur in large populations (such as protists or bacteria) (Kimura, 1983). Also, the molecular events leading to gene duplications are usually more frequent in organisms with larger genomes, which also tend to occur in populations of smaller size. In eukaryotes there are more repetitive genomic regions suitable for recombination errors, and transposable elements are also more abundant than in prokaryotes. Furthermore, horizontal gene transfer is common in bacteria but very rare in eukaryotes. Therefore, bacteria have very different options when it comes to the acquisition of new proteins and do not rely on the mechanisms of gene duplication and divergence in the same way that eukaryotes do (Lerat et al., 2005). There are also population genetic consequences for the phenotypic transitions via latent protein functions, as described above. If the latent function of a protein provides a fitness advantage and its gene is duplicated, a very small number of amino acid changes will be sufficient to increase the efficacy of the latent function in one of the duplicates. The mechanism by which this happens might be either sub- or neofunctionalization. The result is two paralogs, one with the original native function and one with the former latent function as its native function. 
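The population-size argument above can be illustrated with standard diffusion-theory approximations. The following sketch is not the simulation of Lynch and Force (2000); it assumes a diploid Wright-Fisher population whose effective size equals its census size N, additive selection with a non-negative coefficient s, and a single initial mutant copy. The function names and example values are hypothetical.

```python
import math


def fixation_probability(n: int, s: float) -> float:
    """Kimura's approximation for the fixation probability of one new copy.

    n: number of diploid individuals (effective size assumed equal to n)
    s: selection coefficient of the new variant (0 means neutral)
    """
    if n <= 0:
        raise ValueError("population size must be positive")
    if s < 0:
        raise ValueError("this sketch only covers neutral or beneficial variants")
    if s == 0:
        # Neutral limit: the initial frequency 1/(2N).
        return 1.0 / (2 * n)
    # (1 - exp(-2s)) / (1 - exp(-4Ns)), written with expm1 for accuracy.
    return math.expm1(-2.0 * s) / math.expm1(-4.0 * n * s)


def neutral_fixation_time(n: int) -> float:
    """Approximate mean time to fixation (generations) of a neutral variant."""
    return 4.0 * n


def beneficial_fixation_time(n: int, s: float) -> float:
    """Textbook approximation 2 ln(2N) / s for a strongly beneficial variant."""
    return 2.0 * math.log(2 * n) / s


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        print(n, fixation_probability(n, 0.0), fixation_probability(n, 0.01),
              neutral_fixation_time(n), beneficial_fixation_time(n, 0.01))
```

Under these assumptions, the waiting time to fixation grows with N for both neutral and beneficial variants, which is consistent with the statement that duplicate fate is resolved more slowly in large populations.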
After a gene duplication, the duplicates have been found to accumulate (possibly adaptive) mutations at an elevated rate (Zhang et al., 2003; Johnston et al., 2007; Scannell and Wolfe, 2008). If adaptive mutations, which can be detected as an elevated ratio of nonsynonymous to synonymous substitutions, occur before or simultaneously with gene duplication (e.g., via the optimization of a latent protein function), there should be traces of adaptive evolution on the branches preceding the duplication. Indeed, this could be measured for a number of proteins in chordates, for which higher rates of adaptive mutations could be detected in those branches of the gene tree that lead toward (i.e., precede) a duplication event (Johnston et al., 2007). Another possibility for preduplication adaptation is that the divergence between native and latent functions occurs on different alleles first. In diploid organisms, alleles provide slightly different versions of the same gene (at the same locus on homologous chromosomes), much like the two copies that arise through gene duplication. The major difference is that gene duplicates always occur within the same individual in a population, whereas alleles may also occur in varying combinations. But if an allele carries a mutation that promotes the latent activity and if this activity is advantageous to the organism, the frequency of the allele will increase in the population. If a gene duplication then occurs in an individual carrying both alleles (i.e., one with low latent activity and one with elevated latent activity), so that the allele from one homologous chromosome gets copied onto the other homologous chromosome, both copies would be inherited together. Allelic divergence before gene duplication should be expected to occur under high selection pressures for functional or expressional divergence of a protein. Figure 6 shows a fitness landscape of a protein under selection for divergence. 
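As a hedged illustration of how the ratio of nonsynonymous to synonymous substitutions (often written dN/dS) is obtained, the sketch below applies a Jukes-Cantor correction to observed proportions of differences. It assumes that the numbers of synonymous and nonsynonymous sites and differences have already been counted by some other method; the function names and the counts in the example are hypothetical and are not taken from the studies cited above.

```python
import math


def jukes_cantor(p: float) -> float:
    """Convert an observed proportion of differing sites into substitutions per site."""
    if not 0.0 <= p < 0.75:
        raise ValueError("proportion must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def dn_ds(nonsyn_diffs: float, nonsyn_sites: float,
          syn_diffs: float, syn_sites: float) -> float:
    """Ratio of nonsynonymous (dN) to synonymous (dS) substitutions per site."""
    dn = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ds = jukes_cantor(syn_diffs / syn_sites)
    if ds == 0.0:
        raise ValueError("dS is zero; the ratio is undefined")
    return dn / ds


if __name__ == "__main__":
    # Hypothetical counts for one branch of a gene tree.
    print(dn_ds(nonsyn_diffs=30, nonsyn_sites=600, syn_diffs=20, syn_sites=200))
```

A ratio clearly above 1 is commonly read as a sign of positive selection, and a ratio well below 1 as a sign of purifying selection.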
Such a selection pressure applies if the function of a protein needs to be optimized for performance in


[Figure 6 plot: vertical axis, Fitness; horizontal axes, Sequence space.]

Figure 6 Mechanism for functional divergence of a protein. Two neutral networks (wide circles) associated with the protein functions and structures A and B are shown. Sequence a1 is the fittest within its network, because it is most efficient at performing its native function. However, there would be a selective advantage in performing the function of the adjacent neutral network B as well, which is not yet available. Sequence a1 therefore adapts toward sequence a2 (solid arrow), which still folds into structure A, but also into structure B, albeit with low stability. Sequence a2 therefore has a tolerably high fitness, even though it is less efficient at function A. A gene duplication would be advantageous at this point, because not only would it instantly increase the rates of functions A and B (as there is now twice the amount of the gene product), but it would also allow for further functional divergence (dashed arrows). One gene copy could complete the transition toward sequence b, which has a high efficiency for function B, and the other copy could return to the also highly efficient sequence a1. This process has been described as EAC (escape from adaptive conflict) subfunctionalization (Conant and Wolfe, 2008). The end result, however, is that of neofunctionalization.

different tissues or under different conditions (as, for example, caused by a changing environment). If those conditions do not differ much (i.e., if the pressure for divergence is low), the same protein will be optimized for performance under both conditions, probably with compromised efficiency. The more the two conditions differ, the more beneficial a gene duplication at the protein’s gene locus will be, enabling diversification of the two copies. Finally, if the selection pressure for divergence is high, a gene duplication might take too long, so that adaptation cannot take place when it would be advantageous. In this case, different alleles might adapt to the different conditions instead, so that some individuals (those that carry different alleles, i.e., that are heterozygous) will obtain a higher level of fitness. These population genetic dynamics for diploid organisms have so far been observed in computer simulations (Proulx and Phillips, 2006).
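The advantage of heterozygous individuals described above can be sketched with the classical single-locus, two-allele selection model. This is a deliberately minimal illustration and not the model of Proulx and Phillips (2006); it assumes random mating, an effectively infinite population, and constant genotype fitnesses, and all names and numerical values are hypothetical.

```python
def next_frequency(p: float, w11: float, w12: float, w22: float) -> float:
    """One generation of selection on the frequency p of allele 1."""
    q = 1.0 - p
    mean_fitness = p * p * w11 + 2.0 * p * q * w12 + q * q * w22
    return p * (p * w11 + q * w12) / mean_fitness


def long_run_frequency(w11: float, w12: float, w22: float,
                       p0: float = 0.01, generations: int = 5000) -> float:
    """Iterate the recursion from a rare starting frequency p0."""
    p = p0
    for _ in range(generations):
        p = next_frequency(p, w11, w12, w22)
    return p


if __name__ == "__main__":
    # Heterozygote carries one allele adapted to each condition.
    print(long_run_frequency(w11=0.9, w12=1.0, w22=0.8))
    # No heterozygote advantage: allele 1 simply replaces allele 2.
    print(long_run_frequency(w11=1.0, w12=0.95, w22=0.9))
```

With heterozygote advantage, both alleles are maintained at an intermediate frequency (s2 / (s1 + s2) for homozygote disadvantages s1 and s2), so a later duplication has a realistic chance of occurring in an individual that carries both.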


Another population genetic effect on phenotypic transitions has been proposed for fast-replicating organisms such as bacteria and viruses, which occur in vast numbers. In such enormous populations one would expect a high tolerance toward mutations, since the sheer number of individuals compensates for the deleterious effects of mutations within single individuals. If the genotypic diversity of such a population were plotted as the distance from the wild-type genotype, something resembling a “cloud” (rather than a “dot”) of genotypes would emerge. This has been referred to as a quasispecies, a term used originally for viruses (Eigen, 1993, 1996; Domingo, 1998; Tolou et al., 2002). It has been proposed that such quasispecies could be spread out over a rather flat region in a fitness landscape and still outcompete populations located on a high but narrow fitness peak. This hypothesis was at first demonstrated only for digital organisms (Wilke et al., 2001), but further evidence was then found in plant viroids (Codoñer et al., 2006) and viruses (van Nimwegen, 2006). The explanation could be that “flat” populations are better suited for adaptation to changing conditions than populations focusing on only one genotype with a very high level of fitness. Long-term evolvability thus outcompetes short-term adaptedness, leading to the “survival of the flattest” (Wilke et al., 2001). Whether this hypothesis also applies to organisms such as bacteria still needs to be shown, but it integrates well with the notion of protein evolution proceeding along neutral paths of single amino acid changes. Each variant protein provides a potential starting point for the transition toward a different neighboring protein phenotype. Viruses such as HIV seem to benefit greatly from the quasispecies effect, because it allows them to evade the host’s immune response very rapidly (Eigen, 1993; Ribeiro et al., 1998). 
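The “survival of the flattest” idea can be illustrated with a toy calculation. The sketch below is an assumption-laden simplification, not the digital-organism experiment of Wilke et al. (2001): each population sits on one peak, replicates at rate w, and suffers on average u mutations per replication, of which a fraction (the neutrality) leaves it on the peak while the rest are treated as lethal. Back mutation and movement between peaks are ignored, and all names and values are hypothetical.

```python
import math


def effective_growth(w: float, neutrality: float, u: float) -> float:
    """Growth rate of the on-peak class: w * exp(-u * (1 - neutrality))."""
    return w * math.exp(-u * (1.0 - neutrality))


def crossover_mutation_rate(w_fit: float, nu_fit: float,
                            w_flat: float, nu_flat: float) -> float:
    """Mutation rate above which the flatter peak grows faster than the fitter one."""
    if w_fit <= w_flat or nu_flat <= nu_fit:
        raise ValueError("expected a higher, narrower peak and a lower, flatter peak")
    return math.log(w_fit / w_flat) / (nu_flat - nu_fit)


if __name__ == "__main__":
    fit = dict(w=2.0, neutrality=0.1)    # high but narrow peak
    flat = dict(w=1.5, neutrality=0.6)   # lower but flatter region
    print("crossover:", crossover_mutation_rate(2.0, 0.1, 1.5, 0.6))
    for u in (0.1, 0.5, 1.0, 2.0):
        print(u, effective_growth(u=u, **fit), effective_growth(u=u, **flat))
```

In this toy model the fitter population wins at low mutation rates, whereas the flatter one wins once the mutation rate exceeds the crossover value.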
The replication machinery of viruses is inherently very error-prone (Preston, 1996), thus leading to the high mutation rates required to form a quasispecies. In higher eukaryotes there are also ways of increasing the phenotypic diversity under selection. The molecular chaperone Hsp90, for example, is capable of suppressing morphological diversity in Drosophila melanogaster under normal circumstances. Under environmental stress or mutations, however, Hsp90’s function is compromised and the genetic diversity accumulated (which was suppressed before) becomes apparent in the phenotype. Proteins with such properties can therefore be used as “switches” between conservation and adaptation, depending on the prevalent selection pressures (Rutherford and Lindquist, 1998; Sangster et al., 2004). In addition to the mechanism of genetic mutations presented so far, the concept of phenotypic mutations might explain, at least for some cases, how alternative protein structures can be “tested” for fitness advantages without changing the predominant phenotype of that protein (Whitehead et al., 2008) (Figure 7). To be effective, some phenotypic traits require two mutations: for example, disulfide bridges. Having only one of these mutations will not increase an organism’s fitness. Phenotypic mutations happen every time the transcription and translation machineries make a mistake that goes “unnoticed” and results in an amino acid change. In an individual carrying only one mutation in the genome, a small percentage of the proteins might carry the second mutation by mistake. If the fitness advantage of those proteins is high enough, the genotype will spread in the population and the second mutation is more likely to occur on the genotypic level. Whitehead et al. (2008) named this hypothetical mechanism the look-ahead effect of phenotypic mutations.


Figure 7 Adaptation toward a high-fitness double mutant via phenotypic mutations. This scheme illustrates how phenotypic mutations might aid evolution toward a mutant protein carrying two codependent point mutations (A and B). Both mutations are required for an increase in fitness, so it would take a long time for the double mutant to arise by neutral drift. One of the two mutations is, however, likely to arise by chance, and individuals carrying it might produce the second mutation at a low rate via phenotypic mutations (see the text). The fitness benefit resulting from this might be sufficient to drive the single mutant A to fixation. Finally, if A has become fixed, the population will eventually acquire B by drift, resulting in a high-fitness phenotype that will quickly spread through the population.

7 EVOLUTION OF RIBOZYME STRUCTURES

Proteins are not the only catalytic molecules in the cell. Ribozymes, RNA molecules that have enzymatic activity and are not translated into protein, have been found to carry out important functions, mostly regulatory in nature. Large portions of eukaryotic genomes, previously termed junk DNA, have recently been found to contain noncoding RNAs, ribozymes, and riboswitches, which are important in posttranscriptional regulation (Serganov and Patel, 2007). Those noncoding RNAs can exert their function by recognizing and binding other nucleic acid sequences (DNA or RNA) and by forming enzyme-like three-dimensional conformations with catalytically active sites. Ribozymes thus combine functions of DNA and proteins, a property that has led to the hypothesis of an ancient RNA world (Gesteland et al., 2006) where RNA molecules had both information-storing and catalytic functions. Later then, according to this hypothesis, DNA took over the role of information storage because its double helix is much more stable than that of RNA, and proteins took over the role of catalytic agents because their alphabet of 20 amino acids (compared with the four nucleotides found in RNA)


allowed for more diverse chemical activity. However, RNA remained at key positions in such basic cellular activities as transcription and translation. Because RNA molecules have the property of folding into distinct catalytically active conformations and also exhibit mutational robustness (Wagner and Stadler, 1999; Borenstein and Ruppin, 2006; Wagner, 2008), it is reasonable to assume that they, too, evolve via latent or promiscuous activities, from one phenotype to another. Computer simulations suggest that RNA structures form sparse but extended neutral networks that are likely to be in close spatial proximity. This means that several entirely different structures can be just one mutation apart. The neutral networks of proteins seem to be different in that they are more compact and more isolated from each other. [Protein neutral networks have been likened to plums in plum pudding, as opposed to RNA neutral networks, which are more like a bowl of spaghetti (Goldstein, 2008).] Also, according to current knowledge, a completely different fold is required for a new ribozyme to evolve (Curtis and Bartel, 2005), whereas in proteins the same overall conformation can be used to catalyze very different reactions (since mostly amino acids in the active site change) (Holmquist, 2000). In vivo examples of ribozyme promiscuity are still rare [e.g., the Tetrahymena group I ribozyme (Forconi and Herschlag, 2005)]. Most insights on ribozyme functional promiscuity come from in vitro experiments. It is possible to produce artificial RNA sequences that lie exactly between two neighboring neutral networks and that fold into the structures of both nets at equal frequencies (Schultes and Bartel, 2000), as illustrated for proteins in Figure 4. 
Neutral paths (i.e., via mutations that do not change fitness) that connect sequences uniquely folding into one or the other conformation can be reconstructed, showing that a gradual phenotypic transition is at least possible similar to the way proposed for proteins. It has also been shown that under directed laboratory evolution, ribozymes can evolve towards new functions with few mutations (Curtis and Bartel, 2005). Therefore, it is possible that just like proteins, ribozymes can recruit promiscuous activities but need more mutations to do so. Unfortunately, it seems more difficult to elucidate functional promiscuity in ribozymes, because compared with proteins, more mutations are needed to change from one phenotype (i.e., conformation) to another. Therefore, even though two ribozymes might share a common evolutionary origin, such relationships might not be recognized.
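The notion of sequences that are “one mutation apart” on a neutral network can be made concrete with a small sketch that enumerates all single-nucleotide neighbors of an RNA sequence and reports the fraction that keep the same phenotype. The phenotype function is supplied by the caller; the one used in the example is a hypothetical stand-in (whether the two terminal bases can form a Watson-Crick pair) and is not a real folding algorithm.

```python
from typing import Callable, Hashable, Iterator

RNA_ALPHABET = "ACGU"


def one_mutant_neighbors(seq: str, alphabet: str = RNA_ALPHABET) -> Iterator[str]:
    """Yield every sequence that differs from seq at exactly one position."""
    for i, original in enumerate(seq):
        for base in alphabet:
            if base != original:
                yield seq[:i] + base + seq[i + 1:]


def neutrality(seq: str, phenotype: Callable[[str], Hashable],
               alphabet: str = RNA_ALPHABET) -> float:
    """Fraction of one-mutant neighbors that share the phenotype of seq."""
    if not seq:
        raise ValueError("sequence must not be empty")
    target = phenotype(seq)
    neighbors = list(one_mutant_neighbors(seq, alphabet))
    same = sum(1 for neighbor in neighbors if phenotype(neighbor) == target)
    return same / len(neighbors)


def toy_phenotype(seq: str) -> bool:
    """Hypothetical stand-in for folding: can the terminal bases pair?"""
    return (seq[0], seq[-1]) in {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}


if __name__ == "__main__":
    print(neutrality("GAAAC", toy_phenotype))
```

Replacing the stand-in with a real structure predictor would allow neutral neighbors and neutral paths to be traced in the same way.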

8 CONCLUSIONS

In this contribution we have collected evidence pointing toward a mechanism of gene duplicate retention that extends the current view on sub- and neofunctionalization. Most approaches so far have focused on the processes following a gene duplication. Adaptive evolution, however, does not seem to be limited to postduplication processes, but can act on protein sequences at any time. In fast-replicating microbes, the strategy is to generate a high level of genetic diversity through random mutations, regardless of the fitness effects for the individual. For organisms with lower replication rates, however, this strategy does not work, as the survival of the individual is more important. Therefore, a mechanism that conserves a functional phenotype at all times, while still allowing for innovation, can be beneficial.


We suggest that the conformational variability of protein structures plays an important role in protein evolution. Proteins seem to have evolved to a state where different structural phenotypes are mutationally closely related. At the same time, some enzymes exhibit functional promiscuity, which can be altered with just a few mutations. Taken together, these two observations can be used to formulate a mechanism of phenotypic transitions of proteins by point mutations. Those transitions can occur gradually via functionally promiscuous intermediates. The most important aspect of the phenotypic transition is the maintenance of the native protein function while increasing another promiscuous function. In the terminology of neutral networks, this allows a protein under selection to be positioned near the neutral network of a different beneficial phenotype. This is the potential starting point for subfunctionalization, because two subfunctions are available for selection to work on. At this point, a gene duplication could be very advantageous because it would allow one copy to complete the phenotypic transition while letting the other copy return to its original native state. Therefore, the end result of this process is more consistent with the original model of neofunctionalization, because one gene copy retains its original function while the other copy adopts a new one. This applies to single-protein domains. Since phenotypic transitions are likely to occur domain independently in multidomain proteins, subfunctionalization might also be the result of neofunctionalization occurring in different “directions” for individual domains. This means that on the same gene copy (or paralog) one domain might keep its original function while another loses it, and vice versa in the other copy. The loss of a function in one domain can be accompanied by the simultaneous gain of a new function, but not necessarily. 
If complementary domains are conserved between the two copies, “pure” subfunctionalization might be the immediate fate, potentially followed by neofunctionalization later. In the absence of gene duplication and under a high selection pressure, allelic divergence might be another preduplication step toward a new advantageous phenotype. A high selection pressure is necessary in order to increase the frequency of subfunctionalized alleles in the population, so that a gene duplication is more likely to unite the two subfunctions on one chromosome. With the current advances in genomics it should soon be possible to sample entire populations under selection pressure to detect allelic divergence of protein subfunctions systematically. Until then, computer simulations yield the most promising insights into these processes (Proulx and Phillips, 2006). Also on the level of structural protein evolution, computer simulations with simple lattice models (Bornberg-Bauer and Chan, 1999) have proven to be well suited in predicting phenomena that could be shown in the lab afterward (Bloom et al., 2006). Fortunately, lattice models have helped greatly to elucidate the properties of protein neutral networks (Chan and Bornberg-Bauer, 2002; Xia and Levitt, 2004; Goldstein, 2008), so that these are no longer “terra incognita” (Meier and Ozbek, 2007). In general, simulations have the potential to provide insights into evolutionary processes and can be used to formulate research hypotheses in the lab. However, more evidence is still needed for the link between structural and functional promiscuity, and the possibility of phenotypic transitions via promiscuous intermediates. As of now, only one or the other of those properties has been demonstrated in real proteins. 
Although such transitions seem likely, as suggested by the joint computational and experimental efforts of Wroe et al. (2007) and Amitai et al. (2007), there is still a need for a “smoking gun example” (DePristo, 2007).
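To give a flavor of the simple lattice models mentioned above, the following sketch computes the energy of a chain conformation in a two-letter hydrophobic/polar (HP) model on a two-dimensional square lattice. The choice of the HP model and of the energy function (minus one per contact between hydrophobic residues that are not chain neighbors) is an assumption made for illustration; the text does not specify which lattice model the cited studies used, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def hp_energy(sequence: str, coords: Sequence[Tuple[int, int]]) -> int:
    """Energy of an HP chain: -1 for each non-bonded H-H lattice contact."""
    if len(sequence) != len(coords):
        raise ValueError("one lattice coordinate is needed per residue")
    if set(sequence) - {"H", "P"}:
        raise ValueError("sequence may contain only 'H' and 'P'")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation overlaps itself")
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must be lattice neighbors")

    index_at = {site: i for i, site in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Looking only right and up counts every adjacent pair exactly once.
        for site in ((x + 1, y), (x, y + 1)):
            j = index_at.get(site)
            if j is not None and abs(i - j) > 1 and sequence[j] == "H":
                contacts += 1
    return -contacts


if __name__ == "__main__":
    square = [(0, 0), (1, 0), (1, 1), (0, 1)]
    print(hp_energy("HPPH", square))  # chain ends touch
    print(hp_energy("HPPP", square))  # no H-H contact possible
```

Enumerating all conformations of short chains with such an energy function is one way in which sequence-to-structure maps, and hence neutral networks, can be studied exhaustively.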


REFERENCES

Aharoni A, Gaidukov L, Khersonsky O, Gould SM, Roodveldt C, et al. 2005. The “evolvability” of promiscuous protein functions. Nat Genet 37(1):73–76.
Ahn S, Tanksley SD. 1993. Comparative linkage maps of the rice and maize genomes. Proc Natl Acad Sci USA 90(17):7980–7984.
Alber T. 1989. Mutational effects on protein stability. Annu Rev Biochem 58:765–798.
Alexander PA, He Y, Chen Y, Orban J, Bryan PN. 2007. The design and characterization of two proteins with 88% sequence identity but different structure and function. Proc Natl Acad Sci USA 104(29):11963–11968.
Alva V, Ammelburg M, Söding J, Lupas AN. 2007. On the origin of the histone fold. BMC Struct Biol 7:17.
Amitai G, Gupta RD, Tawfik DS. 2007. Latent evolutionary potentials under the neutral mutational drift of an enzyme. HFSP J 1(1):67–78.
Amoutzias GD, Robertson DL, Oliver SG, Bornberg-Bauer E. 2004. Convergent evolution of gene networks by single-gene duplications in higher eukaryotes. EMBO Rep 5(3):274–279.
Barton NH, Charlesworth B. 1998. Why sex and recombination? Science 281(5385):1986–1990.
Bergthorsson U, Andersson DI, Roth JR. 2007. Ohno’s dilemma: evolution of new genes under continuous selection. Proc Natl Acad Sci USA 104(43):17004–17009.
Björklund AK, Ekman D, Light S, Frey-Skött J, Elofsson A. 2005. Domain rearrangements in protein evolution. J Mol Biol 353(4):911–923.
Blomme T, Vandepoele K, Bodt SD, Simillion C, Maere S, et al. 2006. The gain and loss of genes during 600 million years of vertebrate evolution. Genome Biol 7(5):R43.
Bloom JD, Labthavikul ST, Otey CR, Arnold FH. 2006. Protein stability promotes evolvability. Proc Natl Acad Sci USA 103(15):5869–5874.
Boehr DD, McElheny D, Dyson HJ, Wright PE. 2006. The dynamic energy landscape of dihydrofolate reductase catalysis. Science 313(5793):1638–1642.
Borenstein E, Ruppin E. 2006. Direct evolution of genetic robustness in microRNA. Proc Natl Acad Sci USA 103(17):6593–6598.
Bornberg-Bauer E. 1997. How are model protein structures distributed in sequence space? Biophys J 73(5):2393–2403.
Bornberg-Bauer E, Chan HS. 1999. Modeling evolutionary landscapes: mutational stability, topology, and superfunnels in sequence space. Proc Natl Acad Sci USA 96(19):10689–10694.
Bowers JE, Chapman BA, Rong J, Paterson AH. 2003. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422(6930):433–438.
Brunet FG, Crollius HR, Paris M, Aury JM, Gibert P, et al. 2006. Gene loss and evolutionary rates following whole-genome duplication in teleost fishes. Mol Biol Evol 23(9):1808–1816.
Chain FJJ, Evans BJ. 2006. Multiple mechanisms promote the retained expression of gene duplicates in the tetraploid frog Xenopus laevis. PLoS Genet 2(4):e56.
Chan HS, Bornberg-Bauer E. 2002. Perspectives on protein evolution from simple exact models. Appl Bioinf 1(3):121–144.
Choi IG, Kim SH. 2006. Evolution of protein structural classes and protein sequence families. Proc Natl Acad Sci USA 103(38):14056–14061.
Codoñer FM, Darós JA, Solé RV, Elena SF. 2006. The fittest versus the flattest: experimental confirmation of the quasispecies effect with subviral pathogens. PLoS Pathog 2(12):e136.


Collins CH, Leadbetter JR, Arnold FH. 2006. Dual selection enhances the signaling specificity of a variant of the quorum-sensing transcriptional activator LuxR. Nat Biotechnol 24(6):708–712.
Conant GC, Wolfe KH. 2008. Turning a hobby into a job: how duplicated genes find new functions. Nat Rev Genet 9(12):938–950.
Cordes MH, Burton RE, Walsh NP, McKnight CJ, Sauer RT. 2000. An evolutionary bridge to a new protein fold. Nat Struct Biol 7(12):1129–1132.
Cui Y, Wong WH, Bornberg-Bauer E, Chan HS. 2002. Recombinatoric exploration of novel folded structures: a heteropolymer-based model of protein evolutionary landscapes. Proc Natl Acad Sci USA 99(2):809–814.
Curtis EA, Bartel DP. 2005. New catalytic structures from an existing ribozyme. Nat Struct Mol Biol 12(11):994–1000.
Dalal S, Balasubramanian S, Regan L. 1997. Protein alchemy: changing beta-sheet into alpha-helix. Nat Struct Biol 4(7):548–552.
DePristo MA. 2007. The subtle benefits of being promiscuous: adaptive evolution potentiated by enzyme promiscuity. HFSP J 1(2):94–98.
DePristo MA, Weinreich DM, Hartl DL. 2005. Missense meanderings in sequence space: a biophysical view of protein evolution. Nat Rev Genet 6(9):678–687.
DesMarais DL, Rausher MD. 2008. Escape from adaptive conflict after duplication in an anthocyanin pathway gene. Nature 454(7205):762–765.
Domingo E. 1998. Quasispecies and the implications for virus persistence and escape. Clin Diagn Virol 10(2–3):97–101.
Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. 2005. Why highly expressed proteins evolve slowly. Proc Natl Acad Sci USA 102(40):14338–14343.
Eigen M. 1993. Viral quasispecies. Sci Am 269(1):42–49.
Eigen M. 1996. On the nature of virus quasispecies. Trends Microbiol 4(6):216–218.
Ferrada E, Wagner A. 2008. Protein robustness promotes evolutionary innovations on large evolutionary time-scales. Proc Biol Sci 275(1643):1595–1602.
Force A, Lynch M, Pickett FB, Amores A, Yan YL, et al. 1999. Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151(4):1531–1545.
Forconi M, Herschlag D. 2005. Promiscuous catalysis by the Tetrahymena group I ribozyme. J Am Chem Soc 127(17):6160–6161.
Garcia-Diaz M, Kunkel TA. 2006. Mechanism of a genetic glissando: structural biology of indel mutations. Trends Biochem Sci 31(4):206–214.
Gesteland RF, Cech TR, Atkins JF. 2006. The RNA World, 3rd ed. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory.
Goldstein RA. 2008. The structure of protein evolution and the evolution of protein structure. Curr Opin Struct Biol 18(2):170–177.
He X, Zhang J. 2005. Rapid subfunctionalization accompanied by prolonged and substantial neofunctionalization in duplicate gene evolution. Genetics 169(2):1157–1164.
Henzler-Wildman KA, Thai V, Lei M, Ott M, Wolf-Watz M, et al. 2007. Intrinsic motions along an enzymatic reaction trajectory. Nature 450(7171):838–844.
Hittinger CT, Carroll SB. 2007. Gene duplication and the adaptive evolution of a classic genetic switch. Nature 449(7163):677–681.
Holmquist M. 2000. Alpha/beta-hydrolase fold enzymes: structures, functions and mechanisms. Curr Protein Pept Sci 1(2):209–235.
Hughes AL. 1994. The evolution of functionally novel proteins after gene duplication. Proc Biol Sci 256(1346):119–124.


Hughes MK, Hughes AL. 1993. Evolution of duplicate genes in a tetraploid animal, Xenopus laevis. Mol Biol Evol 10(6):1360–1369.
Hughes T. 2007. Computational analysis of the evolutionary dynamics of proteins on a genomic scale. Ph.D. dissertation, University of Bergen.
Hughes T, Liberles DA. 2007. The pattern of evolution of smaller-scale gene duplicates in mammalian genomes is more consistent with neo- than subfunctionalisation. J Mol Evol 65(5):574–588.
James LC, Tawfik DS. 2003. Conformational diversity and protein evolution–a 60-year-old hypothesis revisited. Trends Biochem Sci 28(7):361–368.
James LC, Roversi P, Tawfik DS. 2003. Antibody multispecificity mediated by conformational diversity. Science 299(5611):1362–1367.
Jensen RA. 1976. Enzyme recruitment in evolution of new function. Annu Rev Microbiol 30:409–425.
Jiang H, Blouin C. 2007. Insertions and the emergence of novel protein structure: a structure-based phylogenetic study of insertions. BMC Bioinf 8:444.
Johnston CR, O’Dushlaine C, Fitzpatrick DA, Edwards RJ, Shields DC. 2007. Evaluation of whether accelerated protein evolution in chordates has occurred before, after, or simultaneously with gene duplication. Mol Biol Evol 24(1):315–323.
Jones DT, Moody CM, Uppenbrink J, Viles JH, Doyle PM, et al. 1996. Towards meeting the Paracelsus challenge: the design, synthesis, and characterization of paracelsin-43, an alpha-helical protein with over 50% sequence identity to an all-beta protein. Proteins 24(4):502–513.
Khersonsky O, Roodveldt C, Tawfik DS. 2006. Enzyme promiscuity: evolutionary and mechanistic aspects. Curr Opin Chem Biol 10(5):498–508.
Kimura M. 1983. The Neutral Theory of Molecular Evolution. Cambridge, UK: Cambridge University Press.
Kouyos RD, Silander OK, Bonhoeffer S. 2007. Epistasis between deleterious mutations and the evolution of recombination. Trends Ecol Evol 22(6):308–315.
Lange OF, Lakomek NA, Farès C, Schröder GF, Walter KFA, et al. 2008. Recognition dynamics up to microseconds revealed from an RDC-derived ubiquitin ensemble in solution. Science 320(5882):1471–1475.
Lerat E, Daubin V, Ochman H, Moran NA. 2005. Evolutionary origins of genomic repertoires in bacteria. PLoS Biol 3(5):e130.
Levy Y, Cho SS, Shen T, Onuchic JN, Wolynes PG. 2005. Symmetry and frustration in protein energy landscapes: a near degeneracy resolves the Rop dimer-folding mystery. Proc Natl Acad Sci USA 102(7):2373–2378.
Lupas AN, Ponting CP, Russell RB. 2001. On the evolution of protein folds: Are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? J Struct Biol 134(2–3):191–203.
Lynch M, Conery JS. 2000. The evolutionary fate and consequences of duplicate genes. Science 290(5494):1151–1155.
Lynch M, Force A. 2000. The probability of duplicate gene preservation by subfunctionalization. Genetics 154(1):459–473.
Maisnier-Patin S, Berg OG, Liljas L, Andersson DI. 2002. Compensatory adaptation to the deleterious effect of antibiotic resistance in Salmonella typhimurium. Mol Microbiol 46(2):355–366.
Matthews BW. 1995. Studies on protein stability with T4 lysozyme. Adv Protein Chem 46:249–278.


McLoughlin SY, Copley SD. 2008. A compromise required by gene sharing enables survival: implications for evolution of new enzyme activities. Proc Natl Acad Sci USA 105(36):13497–13502.
Meier S, Ozbek S. 2007. A biological cosmos of parallel universes: Does protein structural plasticity facilitate evolution? Bioessays 29(11):1095–1104.
Meier S, Jensen PR, David CN, Chapman J, Holstein TW, et al. 2007. Continuous molecular evolution of protein-domain structures by single amino acid changes. Curr Biol 17(2):173–178.
Muralidhara BK, Sun L, Negi S, Halpert JR. 2008. Thermodynamic fidelity of the mammalian cytochrome P450 2B4 active site in binding substrates and inhibitors. J Mol Biol 377(1):232–245.
Murzin AG, Brenner SE, Hubbard T, Chothia C. 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247(4):536–540.
O’Brien PJ, Herschlag D. 1999. Catalytic promiscuity and the evolution of new enzymatic activities. Chem Biol 6(4):R91–R105.
Ohno S. 1970. Evolution by Gene Duplication. New York: Springer-Verlag.
Pakula AA, Sauer RT. 1989. Genetic analysis of protein stability and function. Annu Rev Genet 23:289–310.
Pál C, Papp B, Lercher MJ. 2006. An integrated view of protein evolution. Nat Rev Genet 7(5):337–348.
Papp B, Pál C, Hurst LD. 2003. Dosage sensitivity and the evolution of gene families in yeast. Nature 424(6945):194–197.
Piatigorsky J, Wistow GJ. 1989. Enzyme/crystallins: gene sharing as an evolutionary strategy. Cell 57(2):197–199.
Preston BD. 1996. Error-prone retrotransposition: rime of the ancient mutators. Proc Natl Acad Sci USA 93(15):7427–7431.
Proulx SR, Phillips PC. 2006. Allelic divergence precedes and promotes gene duplication. Int J Org Evol 60(5):881–892.
Rastogi S, Liberles DA. 2005. Subfunctionalization of duplicated genes as a transition state to neofunctionalization. BMC Evol Biol 5(1):28.
Ribeiro RM, Bonhoeffer S, Nowak MA. 1998. The frequency of resistant mutant virus before antiviral therapy. AIDS 12(5):461–465.
Riechmann L, Winter G. 2000. Novel folded protein domains generated by combinatorial shuffling of polypeptide segments. Proc Natl Acad Sci USA 97(18):10068–10073.
Riechmann L, Winter G. 2006. Early protein evolution: building domains from ligand-binding polypeptide segments. J Mol Biol 363(2):460–468.
Rose GD, Creamer TP. 1994. Protein folding: predicting predicting. Proteins 19(1):1–3.
Rothman SC, Kirsch JF. 2003. How does an enzyme evolved in vitro compare to naturally occurring homologs possessing the targeted function? Tyrosine aminotransferase from aspartate aminotransferase. J Mol Biol 327(3):593–608.
Rutherford SL, Lindquist S. 1998. Hsp90 as a capacitor for morphological evolution. Nature 396(6709):336–342.
Sangster TA, Lindquist S, Queitsch C. 2004. Under cover: causes, effects and implications of Hsp90-mediated genetic capacitance. Bioessays 26(4):348–362.
Scannell DR, Wolfe KH. 2008. A burst of protein sequence evolution and a prolonged period of asymmetric evolution follow gene duplication in yeast. Genome Res 18(1):137–147.
Schultes EA, Bartel DP. 2000. One sequence, two ribozymes: implications for the emergence of new ribozyme folds. Science 289(5478):448–452.

REFERENCES


Schuster P, Fontana W, Stadler PF, Hofacker IL. 1994. From sequences to shapes and back: a case study in RNA secondary structures. Proc Biol Sci 255(1344):279–284.
Serganov A, Patel DJ. 2007. Ribozymes, riboswitches and beyond: regulation of gene expression without proteins. Nat Rev Genet 8(10):776–790.
Söding J, Lupas AN. 2003. More than the sum of their parts: on the evolution of proteins from peptides. Bioessays 25(9):837–846.
Tokuriki N, Tawfik DS. 2009. Protein dynamism and evolvability. Science 324(5924):203–207.
Tokuriki N, Stricher F, Schymkowitz J, Serrano L, Tawfik DS. 2007. The stability effects of protein mutations appear to be universally distributed. J Mol Biol 369(5):1318–1332.
Tolou H, Nicoli J, Chastel C. 2002. Viral evolution and emerging viral infections: what future for the viruses? A theoretical evaluation based on informational spaces and quasispecies. Virus Genes 24(3):267–274.
Tompa P, Tusnády GE, Cserzo M, Simon I. 2001. Prion protein: evolution caught en route. Proc Natl Acad Sci USA 98(8):4431–4436.
Van de Peer Y, Taylor JS, Meyer A. 2003. Are all fishes ancient polyploids? J Struct Funct Genomics 3(1–4):65–73.
van Nimwegen E. 2006. Epidemiology: influenza escapes immunity along neutral networks. Science 314(5807):1884–1886.
Varadarajan N, Gam J, Olsen MJ, Georgiou G, Iverson BL. 2005. Engineering of protease variants exhibiting high catalytic activity and exquisite substrate selectivity. Proc Natl Acad Sci USA 102(19):6855–6860.
Vendruscolo M, Dobson CM. 2006. Structural biology: dynamic visions of enzymatic reactions. Science 313(5793):1586–1587.
Veron AS, Kaufmann K, Bornberg-Bauer E. 2007. Evidence of interaction network evolution by whole-genome duplications: a case study in MADS-box proteins. Mol Biol Evol 24(3):670–678.
Viksna J, Gilbert D. 2007. Assessment of the probabilities for evolutionary structural changes in protein folds. Bioinformatics 23(7):832–841.
Vogel C, Bashton M, Kerrison ND, Chothia C, Teichmann SA. 2004. Structure, function and evolution of multidomain proteins. Curr Opin Struct Biol 14(2):208–216.
Wagner A. 2008. Robustness and evolvability: a paradox resolved. Proc R Soc Lond B 275(1630):91–100.
Wagner A, Stadler PF. 1999. Viral RNA and evolved mutational robustness. J Exp Zool 285(2):119–127.
Weiner J, Beaussart F, Bornberg-Bauer E. 2006. Domain deletions and substitutions in the modular protein evolution. FEBS J 273(9):2037–2047.
Whitehead DJ, Wilke CO, Vernazobres D, Bornberg-Bauer E. 2008. The look-ahead effect of phenotypic mutations. Biol Direct 3(18).
Wilke CO, Wang JL, Ofria C, Lenski RE, Adami C. 2001. Evolution of digital organisms at high mutation rates leads to survival of the flattest. Nature 412(6844):331–333.
Wilson KP, Malcolm BA, Matthews BW. 1992. Structural and thermodynamic analysis of compensating mutations within the core of chicken egg white lysozyme. J Biol Chem 267(15):10842–10849.
Wroe R, Chan HS, Bornberg-Bauer E. 2007. A structural model of latent evolutionary potentials underlying neutral networks in proteins. HFSP J 1(1):79–87.
Xia Y, Levitt M. 2004. Simulating protein evolution in sequence and structure space. Curr Opin Struct Biol 14(2):202–207.
Zhang P, Gu Z, Li WH. 2003. Different evolutionary patterns between young duplicate genes in the human genome. Genome Biol 4(9):R56.


7

Protein Products of Tandem Gene Duplication: A Structural View

WILLIAM R. TAYLOR and MICHAEL I. SADOWSKI
Division of Mathematical Biology, MRC National Institute for Medical Research, London, UK

1 INTRODUCTION

Since the early proposal of Ohno (1970), it has become clear from many studies in the fields of protein structure analysis and comparative genomics that one of the major mechanisms by which proteins evolve is gene duplication followed by modification of the resulting duplicated gene products. Many proteins contain domains that have clearly arisen through duplication and subsequent fusions [for a review, see an article by Bajaj and Blundell (1984)]. Duplication events often result in spatially and sequentially distinct domains which can then be regulated separately, such as when a protein expressed on the X-chromosome in humans is required in spermatozoa and must be copied to an autosomal chromosome to ensure its presence [discussed by Patthy (2008)]. In the less common circumstance that the duplication event overwrites a stop codon, a fusion between two duplicates of the same gene can be created. Depending on the subsequent fate of the two copies, this can have little or no effect on the structure or result in such large changes that the relationship between the novel protein and its ancestor can be obscured completely.

The chapter begins with a brief review of the genetic mechanisms that create tandem duplicates of this type and control their genomic fate. We then describe the effects that these events can have on protein structures and review the examples of each type of event that have so far been observed. The underlying issue of detecting symmetries in protein sequences and structures is also considered. Finally, we synthesize a general evolutionary model from these experimental findings and discuss how it can be tested.

Evolution After Gene Duplication, Edited by Katharina Dittmar and David Liberles. Copyright © 2010 Wiley-Blackwell

2 GENETIC MECHANISMS

Tandem duplications are most commonly the result of homologous recombination [see Shapiro et al. (1977) and following papers]. Other mechanisms that can result in gene duplication, such as lateral transfer by a retrotransposon, have no bias for the duplicated copy to remain adjacent and are not considered here. Short repeats generated by slippage and loop-out events are also not considered. Nor do we consider somatic duplications, since only duplication events in the germline are passed down the generations.

During meiosis, the original father and mother copies of the diploid genome are aligned side by side with their centromeres at the equator of the spindle apparatus before being pulled apart toward the poles. During separation, strands can become reconnected, either by chance breakage or by active exchange. The result is the four-way Holliday junction, which under pressure from the separating chromosomes (and aided by proteins) can migrate away from the centromeres, either reaching the end or becoming resolved back into separate strands again. The outcome is an exchange of father/mother DNA between the two separating genome copies, called recombination (Figure 1).

Recombination cannot rely simply on the rough physical positioning of the chromosomes to provide a match between strand exchange sites, and more active mechanisms have been proposed to ensure better registration between the strands. This might involve nicks cut at matching sites by a specific restriction enzyme, allowing the two free ends to cross-hybridize (with subsequent ligation). Alternatively, a nick in one strand would leave a single free end that could invade an adjacent strand without the need to have synchronized nick sites. This mechanism could equally include two free ends from

Figure 1 Homologous recombination: normal crossover events.


the same strand caused by a double-stranded DNA break (as might be prone to occur under the stress of chromosome separation).∗ All these crossover mechanisms rely on strands locating their complementary match on the opposite chromosome. Given that there will always be some uncertainty in this process, there is a chance that the strands will not find their correct locations. Errors in this process are more likely to occur if there are multiple homologous regions in the vicinity, increasing the chance of incorrect strand hybridization. Down the generations, this can lead to repeated duplications, as earlier-duplicated segments can engender further duplication. Otherwise, with respect to gene structure, strand breakage and recombination are random occurrences,† and it is a matter of chance whether an intact gene is copied, or multiple genes, along with their control elements and intervening sequences, are copied.

The result of incorrect matching depends on how the Holliday junction resolves itself (Figure 2). In one direction there is simply a reciprocal exchange of genetic material, whereas in the other, one chromosome becomes longer while the other becomes correspondingly shorter. In the longer duplex, a duplication has occurred at two positions (mother3father3 and MOTHER5FATHER5), with a symmetric loss of FATHER5 and mother3 in the shorter product. Since duplication events are initiated by incorrect strand hybridization, the regions exchanged will not be a perfect match to each other, and DNA mismatch repair mechanisms will attempt to correct any resulting base mismatches in a process known as gene conversion. In the example above, this would entail the conversion of one of the strands in the FATHER4/mother5 pairing to match the other. It is possible that the changes resulting from gene conversion might involve the loss or gain of a stop codon (or an intron splice site), leading to a dramatic change in gene structure (Figure 3).
In this example, if the father3father4 stretch corresponds to a gene with a STOP codon (at 4), conversion of this to a non-STOP codon from the MOTHER5 strand would result in a read-through into the father5 segment or beyond until another STOP codon was hit by chance, either in some intervening “junk” DNA or after a read-through of a following gene. In this situation, the two original adjacent genes would become fused as a single gene product. A number of these possibilities involve the incorporation of intergenic regions of DNA into a new coding region, where they appear as linkers between previous gene regions. The lengths of these segments might well be quite large and in the new coding region would constitute a new “random” protein sequence. As suggested previously, these novel proteins might be protected sufficiently from selection pressure by their flanking functional domains to allow some structure and function to be acquired.

∗ Double-stranded breaks in DNA where the broken ends are close in sequence can be repaired directly by enzymes that do not make use of hybridization (called nonhomologous end joining), but if there is a length mismatch between the broken ends, as might result from strand separation between two distant single-stranded breaks, the repair mechanisms rely on the hybridization of the two complementary single-stranded tails. Typically, the 3′ broken ends are trimmed back until two matching single-stranded segments are exposed, resulting in loss of genetic material. However, during replication, when there is an intact DNA copy close by, the loose ends from the strand with a staggered break can locate their complementary sequences in the newly synthesized copy (strand invasion), resulting in an arrangement that is similar to the crossover events that occur in meiosis. Indeed, the final stage in this process is a pair of Holliday junctions that can resolve either with or without crossover.
† Recombination hot-spots are observed, but except for the specialized system in the immunoglobulin region, these do not have any relation to protein structure.


Figure 2 Misaligned crossover events, potentially causing duplications.
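The length change produced by a misaligned crossover can be illustrated with a toy model. The following is a minimal sketch, not a model of the molecular mechanism: chromosomes are treated as lists of labeled segments, the function names and the six-segment chromosomes are hypothetical, and the labels are not those used in Figure 2.

```python
from collections import Counter


def crossover(top, bottom, i, j):
    """Exchange the downstream arms of two chromosomes.

    top, bottom: lists of segment labels read along each chromosome.
    i, j: number of segments kept upstream of the break on top and bottom.
    A correctly registered crossover has i == j; a misaligned one has i != j.
    """
    return top[:i] + bottom[j:], bottom[:j] + top[i:]


def copy_number(chromosome):
    """Count how often each numbered position occurs (labels like 'f3')."""
    return Counter(int(label[1:]) for label in chromosome)


# Hypothetical six-segment chromosomes; 'f' = paternal, 'm' = maternal.
father = ["f1", "f2", "f3", "f4", "f5", "f6"]
mother = ["m1", "m2", "m3", "m4", "m5", "m6"]

# Registered crossover: both products keep one copy of every position.
aligned_a, aligned_b = crossover(father, mother, 3, 3)

# Misaligned crossover: the paternal break lies two segments downstream
# of the maternal break, so positions 3 and 4 occur twice in one product.
longer, shorter = crossover(father, mother, 4, 2)

print(longer)   # one chromosome gains a second copy of positions 3 and 4
print(shorter)  # the other chromosome loses those positions
print(copy_number(longer))
```

In this sketch the duplicated segments lie in tandem in the longer product, and the shorter product carries the matching deletion, which is the asymmetry described in the text.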

They might then eventually be cut loose as a distinct protein by some future gene rearrangement. The incorporation of intergenic regions into a new open reading frame will also result in a loss of reading frame between the two genes, assuming that matching intron splice sites are not copied. Given that there is no mechanism to synchronize these frames, loss of coherence would be expected in two out of three events. As suggested above, translation of the downstream gene would produce a “random” amino acid sequence until a STOP codon was encountered by chance. Such a novel gene product might survive by chance on the back of its neighbor, but it is not impossible to imagine that the random reading frame could acquire some structure and perhaps function. Given the nature of the genetic code, a shift in reading frame does not generate a completely random sequence.

This simplified survey of genetic mechanisms has shown that there are ways to create fully or partially duplicated genes (or adjacent gene fusions) that lie in tandem. Despite extensive research into the fate of duplicated genes at the population and evolutionary levels, the importance of the various contributing factors that are involved in the initial generation of tandem duplicated copies remains poorly quantified. The contributions of errors in double-stranded break repair by homologous strand invasion relative to simple reannealing, both of which can occur in the diploid germ cells, will affect the balance between gene expansion and contraction, and both may provide different contributions in different species. In eukaryotic species, the balance


Figure 3 Overwritten stop codons: leading to fused transcripts.

of these relative to homologous recombination is largely unquantified, and it may be difficult to separate the events, as double-stranded breakage can act to promote recombination. In prokaryotes, the balance of processes will be different again as the promotion of recombination through haploid generation will be absent, with other specialized mechanisms coming into play. Other complications include the influence of introns (Whamond and Thornton, 2006) either with a passive role in facilitating duplication events (Street et al., 2006) or with a possibly more active ancient past (deRoos, 2005). It must also be remembered that the genes we observe are those that have become fixed in a population, implicating an entire variety of new factors, such as population size (Lynch et al., 2001; Lynch, 2002). This may explain why duplication is more common in eukaryotes than in prokaryotes, which typically have much larger populations and much shorter generation times.
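The two-out-of-three estimate for loss of reading frame given earlier in this section, and the typical length of a "random" read-through before a chance STOP codon, can be checked with a short calculation. This is a minimal sketch under stated assumptions: linker lengths are taken to be equally likely in each residue class modulo 3, codons are treated as uniform and independent draws from the 64 triplets of the standard genetic code, and all variable names are illustrative. As the text notes, a real frame-shifted sequence is not completely random, so the second figure is only a rough guide.

```python
from fractions import Fraction
from itertools import product

# STOP codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def frame_preserved(linker_length):
    """The downstream gene stays in frame only if the inserted linker
    has a length that is a multiple of three nucleotides."""
    return linker_length % 3 == 0


# Fraction of linker lengths (1..300 nt, assumed equally likely)
# that throw the downstream gene out of frame.
lengths = range(1, 301)
shifted = sum(1 for n in lengths if not frame_preserved(n))
frame_loss = Fraction(shifted, len(lengths))

# Probability that a uniformly random codon is a STOP codon.
all_codons = ["".join(c) for c in product("ACGT", repeat=3)]
p_stop = Fraction(len(STOP_CODONS), len(all_codons))

# Mean number of sense codons translated before the first STOP
# (mean of a geometric distribution counting failures before success).
expected_codons_before_stop = (1 - p_stop) / p_stop

print(frame_loss)
print(p_stop)
print(float(expected_codons_before_stop))
```

Under these assumptions, a frame-shifted read-through would add about twenty residues on average, which suggests why long novel sequences from this route would be uncommon.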

3 DUPLICATED PROTEINS

Following a tandem duplication there are several structural possibilities, depicted in Figure 4. If a domain that does not form a homodimer is fully duplicated, the two copies may remain as independent folding units with no structural association (“beads


Figure 4 Structural options for tandem duplicates: (A) beads on a string, illustrated with the structure of titin (2r15); (B) pseudodimer, illustrated with the archaeal histone structure (1f1e); (C) domain swap, illustrated with the structure of cyanovirin (1l5b); (D) inseparable domains, illustrated with the structure of aspartate protease (1e81); (E) entangled domain, illustrated with the structure of myoglobin (101m). In each case the first domain is shown in blue, the second in red. (Figures created in RasMol.) (See insert for color representation of the figure.)


on a string”). Duplication of a single domain that does homodimerize can form a fused homodimer (pseudodimeric domains) as a single unit or in pairs. If the functional residues in the protein become distributed between the two subunits, neither copy alone would be functional and the situation becomes one of inseparable domains. The evolutionary origin of this is particularly obscure if the functional contributions of the two domain copies become asymmetric and the original single-chain ancestor becomes extinct. Additional embellishments to the resulting fold by deletions or insertions and frozen domain swaps or deletions to particular parts lead to the generation of new domain structures, which now contain only a relic of the original duplicate as part of the core, and the evolutionary path to the modern fold is all but undetectable (entangled domains). Circular permutations can also develop, obscuring the traces of evolution still further, and this can in rare cases lead to the formation of topological knots. We deal next with each of these possibilities.

3.1 Beads on a String

The simplest but least interesting structural option for duplicated domains is to remain as isolated folding units. These might derive selective advantage by all the methods available to isolated gene copies, such as increased binding or enzymatic turnover, but have the additional option of producing a bivalent molecule. This confers the novel function of being able to cross-link, and although such a function could be contained in a dimer (from a single gene copy), the incorporation of a covalent bond into the link provides additional strength without the complexity of introducing explicit cross-links (such as disulfide bonds or shared metal-ion coordination).
Extensive tandem repeats are commonly found in proteins that perform a structural function, such as providing extensions for receptors to allow intercellular contact (e.g., ICAMs), or most dramatically, in the 1000 (mixed) repeats in titin that are required to bridge the sarcomere length. Such multiple repetition merges, at smaller repeat lengths, into the more classic fibrous class of proteins, from spectrin or the ankyrins associated with the cytoskeleton to the tripeptide repeat of collagen.

3.2 Pseudodimeric Domains

If the original protein undergoing duplication forms a functional homodimer, then given a linker of sufficient length to achieve the same orientation of the two domains, the two domains in the fusion protein will interact in the same way, giving rise to proteins known as pseudodimers. We can estimate the frequency of these events by looking at a nonredundant set of proteins containing homologous duplicate domains of known structure. From such data we find that in roughly 70% of proteins containing exactly two domains, the pair is very unlikely to be homologous (the domains do not share a SCOP class), whereas about 30% of fusions contain repeated domains (with the same fold, superfamily, or family designation). Although there are several ways to try to ensure that the sample is less biased than the PDB as a whole or the SCOP classification, three different methods produce similar results: using SCOP as a whole, using a 25% nonredundant set of chains, and using a 40% nonredundant domain set. However, in this as in most other cases, we


cannot distinguish between fixed tandem duplicates and more complicated evolutionary events, but tandem duplication is the most parsimonious explanation.

Why do such fusions become the dominant form in some cases? The principal reason must relate to the molecular function of the protein. Binding sites and the active sites of enzymes are typically associated with a depression (cleft or hole) on the surface of a protein. This allows functional side chains to be brought into contact with the substrate from a variety of angles, enabling the development of increased binding specificity and catalytic options. A natural source of a binding cleft is in the region between protein molecules, either between two subunits or between two domains of a larger single protein. A homodimer has the disadvantage that a site formed between two monomers will have constraints imposed by symmetry. As most substrates will be asymmetric, this can create problems, as a residue change to improve binding to one part of the substrate will create a symmetric change in the other copy that may be disruptive. This effect can be avoided by moving the site away from the molecular twofold, but symmetry then implies that this will create a second site, which may not always be desirable. Subtle effects may come into play, as, for example, in the triose phosphate isomerase dimer, in which only one site is active. The symmetry can be broken in this situation by the binding of the substrate itself with communication through to the other site.

An alternative is to make use of a heterodimer to avoid the constraints of symmetry, and it is not obvious that two covalently linked domains should be any different from an equally diverged pair of proteins that constitute a heterodimer. The constraint to produce stoichiometric equivalents might provide some advantage to the linked domains, as they are literally tied to be in the correct 1 : 1 ratio, but two copies under one operon would also give good control over the heterodimer composition. Perhaps the only difference is that if the domains or dimers are not greatly diverged, the heterodimer will be prone to create two less productive homodimer variants. The complexity of adding constraints to the sequence to avoid homodimer formation could hand the advantage to the covalent domain linkage.

The relative energetics of the two systems also plays a part, as it is often required that the two sides of an active site cleft should be relatively free to move in a hingelike manner to facilitate entry of substrate and release of product. To make a heterodimer interface specific (over homodimer alternatives) requires additional interactions that would militate against flexibility. By contrast, the strong flexible covalent domain linkage is well suited to allow relative motion. In addition, an entropic component must be considered. The formation of a dimer (homo or hetero) corresponds to a great decrease in the potential number of states (entropy) of the system. The two domains in the duplicated gene are already confined and do not have such a large degree of freedom to lose on adopting their functional form.

Protein structures that function as homodimers would be the most likely candidates for gene duplication into a fused protein, as they have already evolved complementary interacting surfaces. The dimers most susceptible to duplication and fusion would be those in which the two ends to be joined (the N-terminus of one subunit with the C-terminus of its symmetric half) lie close together. Without this, some unwinding of the chain at each terminus would be necessary, or an additional linking segment would be needed. Both would give rise to new interactions, with the probability of these being unfavorable.
A direct implication of this is that the remaining free ends (now the termini of the fused-gene product) must, because of the twofold symmetry, lie close


together. Interestingly, this would explain the frequent proximity of termini in protein domains (Thornton and Sibanda, 1983), a phenomenon that is largely unexplained by other effects.

3.3 Inseparable Domains

Some dimerlike domain pairs are so closely linked that it is difficult to say whether they should be treated as a single domain. This becomes a particularly difficult case to call when the domains have never been observed in isolation, since under some definitions the ability to exist in isolation is part of what qualifies a unit as a domain. Well-known examples of this type include the β-barrel folds of the serine and aspartyl proteases. Higher-order repeats have also been observed in repeat proteins such as the β-propellers.

Proteases

The aspartyl protease family has a fold consisting of two remotely related domains with considerable differences in loop lengths and subdomain packing (Figure 5). Each domain, however, contributes an aspartic acid to the active site, which in high-resolution structures can be seen to have an almost exact twofold (180°) relationship (Blundell et al., 1990). In addition, the same twofold axis corresponds closely to the symmetric relationship of the two domains, suggesting a precursor molecule that functioned as a dimer (Tang et al., 1978). For many years the double-domain form was the only known structure, but with the sequencing of the HIV genome, a possible aspartic protease active site was identified (Toh et al., 1985) and shown to be consistent with an intact half-domain (Pearl and Taylor, 1987). It was predicted that this would form a dimeric enzyme in the virus, as was later shown to be so by x-ray crystallography (Wlodawer et al., 1989). Given the uncertainties associated with viral origins and evolution, it is difficult to argue that the monomer

Figure 5 Aspartyl protease 1e81. The two halves are colored red and blue to distinguish them; the two active-site aspartic acids are shown in a lighter color. (See insert for color representation of the figure.)


was the ancestral form of the protein that duplicated in the distant past. It might equally well be argued that pressure of viral genome size induced a shift in the reverse direction to create a monomeric protein. A similar situation is found in the other large protease family, the serine proteases. In this family, two six-stranded barrels pack together to create an active site for the protein, but despite the family being very widespread, no dimeric equivalent has yet been seen. This does not mean that one will never be found, but as the number of sequenced genomes increases it becomes increasingly less likely. A reduction to a monomeric form seems possible in the aspartyl proteases because of the high symmetry in the active site between the two aspartates, whereas the serine proteases have an asymmetric catalytic triad, consisting of histidine, aspartate, and serine residues, which would not easily allow an equivalent reduction.

β-Propellers

β-Propeller structures are stacked arrays of β-sheets in which the edges of the sheets form a hub from which the sheets radiate (Figure 6). Because of the twist of the β-sheet, this gives the appearance of a ship’s propeller (Murzin, 1992). Over the years, proteins with different numbers of sheets have been found and there are now structures displaying all sheet numbers from four to eight. Each sheet in the propeller is self-contained and there is a clear sequence motif, implying that the structure arose through tandem duplication of a single sheet. Depending on the family, the repeats are variously referred to as the WD40 motif (after conserved tryptophan and aspartate residues in the motif of 40 residues) or the Kelch repeat. A recent, thorough analysis of the known members of the family has concluded that these proteins have a very active recent evolutionary history in which an entire protein has evolved from duplication of a single repeat (Chaudhuri et al., 2008).
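The idea that a tandemly duplicated unit leaves a periodic signal in the sequence can be illustrated by comparing a sequence with a shifted copy of itself. The sketch below is a toy illustration only: the function names, the 0.5 identity threshold, and the eight-residue unit are all hypothetical, and real propeller repeats are far more divergent, so practical detection relies on more sensitive profile-based comparison than simple identity counting.

```python
def offset_identity(seq, offset):
    """Fraction of positions i at which seq[i] equals seq[i + offset]."""
    pairs = len(seq) - offset
    matches = sum(1 for i in range(pairs) if seq[i] == seq[i + offset])
    return matches / pairs


def smallest_repeat_period(seq, threshold=0.5, min_period=2):
    """Return the smallest offset whose self-identity reaches the threshold,
    or None if no offset up to half the sequence length qualifies.
    The threshold is an arbitrary illustrative choice."""
    for offset in range(min_period, len(seq) // 2 + 1):
        if offset_identity(seq, offset) >= threshold:
            return offset
    return None


# Hypothetical four-copy tandem repeat of an invented eight-residue unit,
# with two point substitutions (K->R in copy 2, S->A in copy 3).
toy = "GSLWDVKT" + "GSLWDVRT" + "GALWDVKT" + "GSLWDVKT"

print(smallest_repeat_period(toy))
print(offset_identity(toy, 8))
```

Multiples of the true period also score highly in such a self-comparison, which is why the sketch reports the smallest qualifying offset.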
The β-sheet blade of the propellers has never been observed in isolation; attempts to produce single propellers artificially have found that some regions on either side of the blade are also necessary (Yadid and Tawfik, 2007). This may help to explain the mechanism by which the propellers are closed, although it is also possible that the sequence signal for stability of an independent blade has disappeared, since in the context of a multibladed protein there would be no selection pressure to maintain it.

RNA Polymerase

Unlike all viral RNA polymerases, the structure of the RNAi polymerase (Salgardo et al., 2006) is distantly homologous to the DNA polymerases, suggesting that their common structure may have been the ancestral polymerase before the shift from RNA to DNA as the prime genetic material [see also Jones (2006)]. The catalytic core in this polymerase is a pair of double ψ β-barrel domains, which by their duplication must also have had an even earlier single-domain dimeric form. This would suggest a common ancestor perhaps over 3.5 billion years ago at the earliest boundary of cellular life. The core domains of the RNA-dependent RNA polymerase provide another example of two core-conserved β-barrel domains that have never been observed apart.

3.4 Beta/Alpha Class

The β/α class of proteins contains more folds than the all-β class and all-α class together, and not surprisingly, tandem duplications are very common.


Figure 6 Beta propeller structures with (A) four, (B) five, (C) six, and (D) seven blades: structures 1hxn, 1tl2, 1f8d, and 2bbk, respectively. (See insert for color representation of the figure.)


Rossmann Fold

The Rossmann fold (Rao and Rossmann, 1973), which is found extensively throughout dinucleotide-binding proteins, consists of two subdomains arranged about a pseudo-twofold axis, often corresponding to the center of the binding cleft. It has been argued that this fold is an example of an ancient duplication event bringing about the fusion of two mononucleotide-binding domains. Each individual domain consists of three β/α units in tandem, and although this is a common enough motif in other proteins, it is not seen as an isolated domain.∗

TIM Barrel

One of the most ubiquitous β/α folds is the eightfold β/α barrel, typified by the enzyme triosephosphate isomerase (TIM) (Banner et al., 1975). The high symmetry of the eightfold barrel has been linked to an ancient duplication event, and there is evidence that there is a preferred relic twofold associated with the two barrel halves. Unlike the all-β barrel folds discussed above, which would retain their individual barrel topologies if split in two, half a TIM barrel would appear to be less structurally stable. There are some candidates for a half barrel, but it is difficult to assess whether these really have any evolutionary connection to the true barrels or are just alternative simple structural solutions that have been arrived at independently. We return to this problem below.

4 ENTANGLED DOMAINS

4.1 Domain-Swapped Dimers

In true dimeric interactions involving proteins with multiple domains, the packing between the domains within each chain forms identical contacts (as they must, being dimeric). If the linker between domains is sufficiently long, this allows domains to be swapped between monomers. Because of the equivalence of the two sets of interactions, the only determining factor in whether this rearrangement occurs is the length of the connecting loop and any “spurious” interactions it might make in facilitating the exchange (Bennet et al., 1995). The balance is so subtle that even different crystal forms of the same protein can be domain-swapped. In some systems, such as the cyanovirin, the balance can be controlled by external conditions, producing either swapped or unswapped dimers (Barrientos et al., 2002). After tandem duplication, if a dimeric protein had previously been able to form a domain swap, this can become “frozen” into the new protein structure as diversification of the copies shifts the previously fine balance away from equilibrium. This has been postulated to account for the fold of a number of proteins that appear to contain domain-swapped relics.

Histone Fold

The histone fold consists of three α-helices: two short and one long. Functional histones bind DNA as octamers, which ultimately consist of heterodimers with the basic histone fold. Since the hydrophobic patches required for assembly of both the heterodimeric and tetrameric forms are exposed, there would be no means to prevent a much larger superassembly from forming. Thus, some copies of these genes

∗ The original Rossmann fold was half a dinucleotide-binding domain. It is used here to refer to the double fold, which constitutes an intact domain.

ENTANGLED DOMAINS

147

do not have these parts and are used for chain termination. In most organisms these are found as separate copies, but in some archeal organisms, such as the methanomicrobia, only a single histone gene is found that is a pseudodimeric fusion of a dimerizationcompetent and a dimerization-incompetent gene (Sandaman and Reeve, 2006). The identification of a helix–strand–helix motif that is common to the histones and other ancient proteins with related functions is suggestive that the original core protein consisted only of this short motif, which then underwent a series of duplications combined with domain swapping. The problem with this and similar analyses is that it is impossible to verify these relationships using sequence data, as the original events must have occurred long ago. Even structural similarity, which has the capacity to probe more deeply back in time, cannot be considered significant either when the core motif is simply a pair of common secondary structures. Nevertheless, it is useful to see that, in principle, these basic folds can be related by a simple mechanism, irrespective of whether this reflects their true history. Globin Fold The globin fold might also be explained as a domain-swapped relic. The α-helixes that constitute the globin fold can be labeled A to G, but one pair of these (CD) is small and poorly conserved and can be neglected, leaving six major helixes, of which the EF pair forms the main binding cleft for the heme group, and each contributes one of the coordinating histidines. The separation between the E and F helixes required to fit the heme makes it unlikely that this pair was ever an independent protein with any structure in the absence of a heme. The existence of a protoglobin based around the EF pair is supported circumstantially by the observation that these helixes constitute a separate exon and that the EF pair may correspond to a common heme-binding ancestor of both the globins and cytochromes (Craik et al., 1981). 
Addition of sequence to the amino and carboxy ends of the EF core could convert the fold to a Greek-key four-helical bundle and, through further additions, into the modern fold. If such accretion occurred, it would predict that the Greek-key fold should remain as a core in the modern fold, but there is no nucleation center that would give rise to this motif. An alternative explanation of how the fold might have evolved can be based on the rough twofold symmetry of the globin fold. The globin fold can be viewed as a short segment of a double helix of α-helixes, which is more easily seen in greatly simplified representations. This equates the BE hairpin with the GH hairpin and places the A and F helixes in equivalent positions across the end of each hairpin (Figure 7). A superposition of the myoglobin halves based on this correspondence superposes 60 residues in each half within 4.5-Å RMSD. With a limited number of helixes, this correspondence might well be due to chance; however, additional support can be found in folding studies, which have predicted the early formation of the BE and GH hairpins (Ptitsyn and Rashin, 1975; Bashford et al., 1988). This model implies that the protoglobin was equivalent to half the modern fold but with the difference that a single heme would be held between a dimer, with coordination just to the B helix. The location of the A and F helixes relative to this core would suggest that their positions were swapped across the dimer interface, possibly being acquired as later stabilizing additions to the core. Tandem duplication of this dimer would produce the modern fold except that the heme coordination would need to shift from the H to the F helix. As predicted by Go (1978), a third intron was found in the gene of a leghemoglobin (Jensen et al., 1981) that splits the E and F helices. It

Figure 7 Proposed duplicated structural relics in myoglobin (panels a to f, with helixes labeled by letter).

could be argued that this exon junction is the oldest and has been lost in the other globins, but more recent observations indicate that intron gain is more prevalent than was originally believed, making this argument considerably weaker without the support of a quantitative evolutionary model.

Proteasome Fold The αββα-layer protein 1ryp1 from the proteasome contains an internal structural duplication. This symmetry runs through three of the four layers of secondary structure, and although it is not clear to the eye, it was identified as a twofold repeat using the SAPit program. The sequence identity over the repeats is less than 10%, which is too low to be detected by any sequence-based method. A nontrivial set of relationships has been analyzed for a small family of proteins with the αββα-architecture called the DOM-fold (Cheng and Grishin, 2005). The structure of the molybdenum cofactor binding protein (1jroB) in this family can be explained by a series of two duplication events with domain swapping from a core αβ-domain which has the same architecture as the αβ-layers in the other members of the family. At the sequence level, these αβ-subdomains can be aligned with two other members of the family, giving a good correspondence of secondary structure elements and hydrophobic positions but no highly conserved residues. Similar highly divergent relationships can also be seen in smaller all-β folds (Theobald and Wuttke, 2006).


4.2 Membrane Proteins Extensive duplication and subsequent modifications have resulted in many membrane proteins being functional heteromers of subunits which are only minor modifications of one another (the most obvious examples being the very large and ubiquitous families of voltage- and ligand-gated ion channels). In some cases the subunits are found in separate genes; in others, proteins containing several subunits within a single transcript are observed (Yu and Catterall, 2003). These may be the consequence either of tandem duplication events or of fusions of paralogous proteins after some divergence has occurred. The recent accumulation of high-resolution three-dimensional structures across several membrane protein families has led to several interesting observations of symmetry, which build on the observations described above.

Structural Observations of Symmetry in Membrane Proteins Since high-resolution structural data have become available for several families of helical membrane proteins, there have been many observations of twofold pseudosymmetry in the orientations of regions of the tertiary structure. The first such observation was made in the aquaporin family of water-transporting pores. These are part of the large Major Intrinsic Protein (MIP) family, whose members generally have six transmembrane (TM) segments. It was recognized early that a conserved motif (GAXNPAX[ST][AG]) occurs twice in the sequences of several family members, suggesting a duplication in the sequence (Wistow et al., 1991). Subsequently, with larger numbers of sequences available, it was found that a second motif, known as the AEF-box, could also be suggested to have been duplicated, although in the second repeat the motif has degenerated to a conserved glutamate (Zardoya and Villalba, 2001). A more recent analysis has broadened the scope of this motif to the closely related glycerol transporter family (Zardoya, 2005).
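As a minimal sketch of how a twice-occurring motif of this kind might be located in a sequence, the snippet below translates GAXNPAX[ST][AG] into a regular expression, reading X as "any residue" (an assumption about the notation). The function name and the toy sequence are hypothetical and are not taken from any real MIP family member.

```python
import re

# The motif quoted in the text, with X interpreted as any residue (assumption).
MIP_MOTIF = re.compile(r"GA.NPA.[ST][AG]")

def find_motif_positions(sequence, pattern=MIP_MOTIF):
    """Return the 0-based start of every non-overlapping motif match."""
    return [m.start() for m in pattern.finditer(sequence)]

# Toy sequence (hypothetical): two motif copies separated by filler residues.
toy = "MKLV" + "GAHNPAVTA" + "LLIVFWLLGG" + "GASNPARSG" + "KK"
positions = find_motif_positions(toy)
print(positions)  # two hits would be consistent with an internal repeat
```

Two hits alone do not distinguish duplication from convergence, as the text goes on to note; the scan only flags candidates for closer inspection.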
Interpretation of the meaning of these repeated motifs is not entirely straightforward, however, since they are directly involved in the function of the channel and may possibly be the result of convergence (Zardoya and Villalba, 2001). Regardless of whether the sequence evidence was indeed interpreted correctly, the publication of high-resolution structures for aquaporin (Fu et al., 2000; Murata et al., 2000) provided strong evidence that the protein could be separated into two 3-TM halves that are rotated copies of one another, and validated the symmetrical location of the functional residues observed from sequence analyses. This strongly implies that an ancient duplication event occurred in the evolution of this “superfamily” of membrane proteins. The structures of ClC chloride channels from Salmonella enterica typhimurium and Escherichia coli also show patterns consistent with duplication and structural embellishment: these channels use 16 TM regions to span the membrane, with a twofold pseudosymmetry apparent between two subcomponents of 7 TM regions each, the other two presumably having arisen subsequently (Dutzler et al., 2002). In the publication describing the structure, the authors observe that some weak sequence similarities exist between the putative duplicated regions, but in the absence of structural evidence it would be impossible to distinguish these from chance similarity. Similar reports have emerged at almost the same rate as new membrane protein structures. The list presently includes the antiporter from E. coli (Hunte et al., 2005), the BtuCD vitamin B12 transporter from E. coli (Locher et al., 2002), the bacterial homolog of the human neurotransmitter uptake proteins (Yamashita et al., 2005), the


Sec61/SecY-facilitated diffusion channel for protein translocation (Van den Berg et al., 2004), and members of the hydroxycarboxylate family of secondary transporters. In the latter case there is also substantial sequence-based evidence for the existence of a duplication event in the ancestor of the 12-TM family members (Lolkema et al., 2005; Sobczak and Lolkema, 2005) and structural information by direct assessment of membrane topology (Saaf et al., 2001). Dual-Topology Membrane Proteins Clearly, the observations above suggest that duplications or fusions have occurred commonly in the evolution of membrane proteins. However, a major difficulty with this is that in many cases the repeated segment contains an odd number of transmembrane helices arranged in an antiparallel fashion for a twofold pseudosymmetrical relationship. If a transcript containing this repeat has evolved by duplication, this implies an ancestor that was indifferent to its membrane orientation. The observation by von Heijne (1992) that positively charged amino acids are found in loops on the inside of the cell would predict that this could occur only if the ancestral protein had an equal distribution of such amino acids on either side. A recent survey of the membrane topologies of the E. coli membrane proteome (Daley et al., 2005) using a dual-reporter assay to determine the location of the termini found a candidate for this, a protein known as EmrE. This is a four-helix protein that is responsible for the efflux of a variety of toxins in E. coli and many other bacteria. As such, it is extremely interesting as a potential drug target and has been the subject of a considerable amount of biochemical experimentation. The results of the global study indicated that EmrE could adopt either of two topologies, with the N-terminus inside or outside the cell. 
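The charge-balance argument above can be illustrated with a small sketch. It assumes the loops and tails of a protein are already known and listed in sequence order, so that consecutive segments lie on opposite sides of the membrane, and it counts lysine and arginine on each side. The function names, the threshold of 2, and the toy loop sequences are all hypothetical choices for illustration, not values from the studies cited.

```python
def charge_bias(loops):
    """Count K+R on the two sides of the membrane.

    `loops` lists the tail/loop segments in sequence order; segments at even
    indices share a side with the N-terminus, odd indices lie on the other side.
    """
    def kr(segment):
        return sum(segment.count(res) for res in "KR")
    n_side = sum(kr(seg) for seg in loops[0::2])
    other_side = sum(kr(seg) for seg in loops[1::2])
    return n_side, other_side

def predicted_n_terminus(loops, threshold=2):
    """Apply the positive-inside rule; report 'ambiguous' when the bias is small."""
    n_side, other_side = charge_bias(loops)
    if abs(n_side - other_side) < threshold:  # threshold is an arbitrary assumption
        return "ambiguous"
    return "in" if n_side > other_side else "out"

# Toy 4-TM proteins (hypothetical loops): N-tail, three loops, C-tail.
biased = ["MKR", "GS", "KRK", "DE", "RK"]
balanced = ["MK", "RS", "KG", "KD", "R"]
print(predicted_n_terminus(biased), predicted_n_terminus(balanced))
```

An "ambiguous" result corresponds to the roughly equal charge distribution that the text expects of a protein indifferent to its membrane orientation.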
On the basis of this result and an earlier study (Saaf et al., 1999), the authors proposed that EmrE forms a functional homodimer in an antiparallel organization, with the two subunits in opposite membrane orientations (Figure 8). Although this was a controversial suggestion, later structural studies were consistent with this model (Tate et al., 2001; Ma and Chang, 2004; Pornillos et al., 2005) and the authors followed this up with a second assay on EmrE aimed specifically at testing the proposal that it could form a functional heterodimer, which was found to be positive (Rapp et al., 2007a). However, the controversy has continued since it has been argued that their assay may disrupt formation of the native transmembrane topology (Schuldiner, 2007a) and that earlier biochemical results contradict these findings. The retraction of the x-ray structures that same year (Chang et al., 2006) added some weight to these contentions (Schuldiner, 2007b), although the authors argued that the finding of duplicate copies with opposite topologies and concomitant charge distributions in the same family presented a strong case for this mechanism (Rapp et al., 2007b). Subsequent biochemical studies have been reported that both support (Nara et al., 2007; Lehner et al., 2008) and refute (Steiner-Mordoch et al., 2008; McHaourab et al., 2008) the proposition that EmrE functions as an antiparallel dimer, leaving the question without definitive resolution for the present. Publication of the corrected x-ray structures has also been found to favor the antiparallel organization (Chen et al., 2007), but it remains difficult to reconcile the conflicting nature of the evidence in this specific case. Another, indirect line of argument advanced by the authors is that Bacillus subtilis encodes two homologs of EmrE, ebrA and ebrB, on the same operon. These are 4-TM proteins with charge distributions consistent with opposite topologies. They

Figure 8 (A) Parallel and (B) antiparallel dimer configurations of EmrE.

therefore propose that this is a case where a duplication has occurred, with the copies being coexpressed to form the functional transporter. Since EmrE has four helixes, it would require an extra helix to adopt an antiparallel membrane orientation, which suggests that separate copies in the same operon would be a more likely route. Another E. coli protein identified by Rapp et al. belonging to the PFam DUF606 family has been found to have several homologs which are fusions of two copies of the 5-TM domain in either order, each copy having adopted a charge bias opposite to that of the other (Lolkema et al., 2008). Whatever the status of EmrE in particular, there is now a large body of evidence that may support the general evolutionary mechanism proposed by Rapp et al. (2007a) (Figure 9). As described above, it has frequently been observed that the two parts of the transmembrane structure of a helical membrane protein are related by twofold pseudosymmetry. Although in general the evolutionary mechanisms that have actually occurred are at present invisible, having taken place very early, this strongly suggests that the proteins are either fusions of separate genes or tandem duplicates. Generality of Dual-Topology Fusions So far the majority of evidence for ancestral fusions by “flip-flopping” membrane proteins has been found in proteins with

Figure 9 Evolution of topological preferences. A possible chain of events leading to pseudosymmetric membrane proteins is shown. An ancestor with unbiased positive charge distribution (A) duplicates and can adopt either of two topologies (B). Selective loss of positive charges from one set of loops can then lead to a choice of one or the other topology (C). Panel labels: (A) topologically indifferent ancestor; (B) topologically indifferent duplicate; (C) duplicates with defined topology.

transporter functions. Whether or not these events have happened in other large membrane protein families, such as those with receptor or enzymatic functions, is not apparent. An argument for ancient duplication in the formation of the very common seven-transmembrane architecture, which would account for similarities between bacteriorhodopsin and eukaryotic G-protein-coupled receptors, was advanced many years ago (Taylor and Agarwal, 1993); the publication of the rhodopsin structure has been suggested to offer some support for this (Palczewski et al., 2000). However, whether this represents duplication or exon shuffling remains unclear. Although the fusion of a repeated polytopic motif with no strong topology preference may yet prove to underlie the evolution of many or most families of helical membrane proteins, it is possible that it has only been used where there is a clear functional advantage for doing so. A recent article raises this possibility: The serotonin transporter SERT is proposed to have a mechanism in which it alternately exposes its 5-HT binding site to either side of the membrane, thereby enabling controlled cross-membrane transport. On the basis of the crystal structure for the bacterial homolog, LeuT, the authors argued that orientation of the symmetric parts of the structure can be used to deduce the residues that would be exposed to the cytoplasm following the conformational change, which is not observed in the structure. Cysteine mutagenesis followed by labeling to determine accessible residues on the cytoplasmic face of the transporter supports this prediction (Forrest et al., 2008). If this observation stands and similar observations are made for other transporters, it may prove that the large number of such symmetries observed could be selected for functional reasons. 
Alternatively, it is possible to argue that this mechanism is quite general, but symmetry has been maintained only in cases where it is functionally useful, although how such a hypothesis would be tested is difficult to imagine.


The absence of observations of potential fusions for some families may also be of interest. Given the growing evidence that the extremely large and diverse family of G-protein-coupled receptors are functional dimers, it is curious to note that no GPCR fusion proteins have been observed. It is not possible to preclude their existence, but it is possible to explain this observation and predict that it will remain the case for two reasons. First, the argument applied by Rapp et al. also applies to these proteins, albeit in reverse: GPCRs dimerize in a parallel arrangement, but since they have an odd number of transmembrane segments, to maintain their topological relationships they would require an extra membrane-spanning region. This does not rule out their existence (see above); however, it makes finding them more difficult. A second possible objection is that it may be necessary (or convenient) to maintain separate copies to allow for a large number of possible dimerization events to occur, rather than constraining interaction partners by fusion. Nonetheless, so far, this has not been studied explicitly. The existence of duplicate repeats in membrane proteins has reached the status of general acceptance given the number of published structures, which cover many apparently unrelated lineages. This accords well with the observations already discussed for globular proteins, since it would be difficult to justify the existence of separate mechanisms in the two cases; although embedding a protein structure into a membrane introduces additional constraints, it remains a protein nonetheless and is subject to the same general rules of evolutionary change. To what extent the dual-topology evolutionary trajectory will be supported by future observations is, of course, uncertain, but at present it holds a great deal of promise as an explanation for how modern membrane protein folds came into existence. 
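The parity argument used here and for EmrE can be made explicit with a short sketch. It assumes an idealized chain in which each transmembrane helix crosses the membrane exactly once, so the second copy in a tandem fusion starts on whichever side the first copy (plus any extra linker helices) ends. The function names are hypothetical.

```python
def c_terminus_side(n_tm, n_terminus_side):
    """Side ('in' or 'out') on which the chain ends after n_tm membrane crossings."""
    flip = {"in": "out", "out": "in"}
    return n_terminus_side if n_tm % 2 == 0 else flip[n_terminus_side]

def fused_second_copy_orientation(n_tm, linker_tm=0):
    """Orientation of the second repeat relative to the first in a tandem fusion.

    The second copy begins after n_tm + linker_tm crossings, so an even total
    keeps it parallel to the first copy and an odd total inverts it.
    """
    crossings = n_tm + linker_tm
    return "parallel" if crossings % 2 == 0 else "antiparallel"

# 7-TM repeat (GPCR-like): direct fusion inverts the second copy;
# one extra membrane-spanning helix restores a parallel arrangement.
print(fused_second_copy_orientation(7), fused_second_copy_orientation(7, linker_tm=1))
# 4-TM repeat (EmrE-like): the reverse holds.
print(fused_second_copy_orientation(4), fused_second_copy_orientation(4, linker_tm=1))
```

Under these assumptions, repeats with an odd number of helices fuse naturally into antiparallel pairs, while parallel dimers of odd-numbered units need an additional membrane-spanning segment.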
4.3 Knotted Folds KARI Family The class II ketol-acid reductoisomerase (KARI) has a 250-residue αβ nucleotide-binding domain at its amino terminus with a large all-α domain following. It was found that in the plant protein acetohydroxy acid isomeroreductase (1yve and homologs 1yrl and 1qmg), this C-terminal domain contains a figure-of-eight knot (Taylor, 2000). The knot is the most deeply embedded known, with an amino-terminal domain on one side and 70 residues trailing on its carboxy terminus. It was proposed that the knotted domain can best be explained by a duplication followed by a helix swap (Taylor, 2000) (with deletion of the second αβ domain). Unlike the dimers discussed above, where it was suggested that short N-C connections would be preferred, the duplicated halves are connected by a long loop through which the C-terminus must pass before folding is complete. A dimeric precursor of the knotted domain can be found in the class I KARI structure (Ahn et al., 2003) (PDB code: 1np3), in which the terminal segments of the monomers in the dimer make a closest approach of just over 10 Å, suggesting that a simple duplication (which requires these ends to join) would be quite viable. The subsequent deletion of one of the nucleotide-binding domains would then leave a larger gap of over 20 Å, but this can be closed easily by remodeling a few of the residues on each terminus at either end of the gap. The resulting (knotted) dimeric fusion has an RMSD with a true knotted domain of 5 Å over 260 residues.

Other duplication, swapping, and deletion (DSD) events have led to a number of related folds, including glycerol-3-phosphate dehydrogenase, 6-phosphogluconate dehydrogenase (PGDH), and similar oxidoreductases (Andreeva and Murzin, 2006),


none of which are knotted. These enzymes, collectively referred to as the PGDH-like oxidoreductases, all contain a conserved nucleotide-binding domain, without which the relationships among the all-α catalytic domains would be very difficult to deconvolute. The all-α domain of PGDH and the corresponding knotted domain in the KARI class II structure both contain a clear internal duplication, but in the knotted domain, the core helixes are swapped across the pseudo-twofold. The relationship between these domains is not just a simple exchange of two helix positions, and there is no single rearrangement that would transform one fold to the other. However, an indirect link can be traced through the all-α dimerization domains of two further dehydrogenases, UDP-glucose dehydrogenase (1dliA, UDPGDh) and GDP-mannose dehydrogenase (1mv8A, GPDMDh), which are related by a swapped pair of helixes (Andreeva and Murzin, 2006). When a dimeric fusion is constructed from the GPDMDh structure, there is a plausible superposition over the core. This comprises the long central helixes at the start of each duplication and two following helixes, but the link between these symmetric halves has no correspondence.

4.4 Cyclic Permutation The combinations of duplications, deletions, and domain swaps discussed above can make it very difficult to discern the sequence of evolutionary events and can obscure relationships between protein families. Another duplication-based mechanism whereby cryptic relationships of this type can evolve is through the cyclic permutation (CP) of a sequence over the structure. Such a shift is difficult to rationalize by any single evolutionary mechanism, but if the fold is duplicated and partially deleted from each terminus, the remaining core appears as a CP. This process appears to be more probable if it is accompanied by domain swapping. Consider a compact domain formed by two subdomains, A and B.
After duplication (giving the order A-B-A-B), a subdomain swap could create a compact domain from the central B and A subdomains, and deletion of the terminal subdomains (the first A and the last B) would leave the domain B-A, which would appear as a CP in the sequence. The extent of rotation through the sequence would depend on the relative sizes of A and B.

Globular Proteins Using a sequence alignment algorithm designed to detect cyclic permutation, Weiner and colleagues (Weiner et al., 2005) describe some novel examples and make the distinction between proper CPs as described above and incomplete CPs, where the deletion has been made at only one of the termini. The permutations examined in this way all occur at the level of intact domains in multidomain proteins, and it is always possible that other mechanisms of gene rearrangement (such as exon shuffling) might produce the same change. When the cyclic permutations occur within a domain, sequence-based methods are often not sensitive enough to detect them. This level of permutation has been investigated using a novel structure comparison algorithm (FASE), with several new examples being reported (Vesterstrom and Taylor, 2006). These include not only new examples of the familiar permutation of strands in a β-barrel architecture but also examples in βα proteins. One has a shift through the topology by one strand position in the β-sheet (32145 to 21345), while the other has the more dramatic shift of strand positions 12534 to 45312. Studies have shown that many possible intermediate steps created by cyclic permutations can be functional, allowing for the possibility that this is a general mechanism by which proteins can evolve different structures (Peisajovich et al., 2006).
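The duplication-and-deletion route to a cyclic permutation described in this section can be sketched on plain strings. The example below is a toy model in which subdomains are runs of characters; it ignores structure and domain swapping entirely, and the function names are hypothetical.

```python
def duplicate_and_trim(sequence, cut):
    """Tandem-duplicate `sequence`, then delete `cut` residues from the
    N-terminus and len(sequence) - cut residues from the C-terminus."""
    n = len(sequence)
    if not 0 <= cut <= n:
        raise ValueError("cut must lie within the sequence")
    doubled = sequence + sequence
    return doubled[cut:cut + n]

def is_cyclic_permutation(a, b):
    """True if b is a rotation of a (same length and found within a + a)."""
    return len(a) == len(b) and b in a + a

# Toy domain: subdomain A (4 residues) followed by subdomain B (6 residues).
domain = "AAAA" + "bbbbbb"
permuted = duplicate_and_trim(domain, cut=4)  # removes the first A and the last B
print(permuted, is_cyclic_permutation(domain, permuted))
```

The `cut` position plays the role of the A/B boundary, so the extent of rotation follows directly from the relative sizes of the two subdomains, as stated in the text.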


Membrane Proteins The question of cyclic permutations in the evolution of membrane proteins has received surprisingly little attention. Since these are well known to exist in globular proteins and it is possible for antiparallel copies to fuse, it seems that there is nothing to prevent these from forming should a partial deletion occur following a duplication event. One piece of supporting evidence is also the demonstration that at least for one superfamily (the GPCRs) it is possible to cut the chain at a certain point and coexpress the two fragments to create a functional receptor (Schoneberg et al., 1995; Ridge et al., 1996), which is also the case for globular domains shown to permute circularly (Carey et al., 2007). Since the topology of the permuted variant with respect to the membrane would be liable to change in many cases as a consequence of changing the charge distribution in the loop regions, it is possible that the range of duplication–deletion pairs which are potentially functional as circularly permuted membrane proteins is narrower than for globular proteins. On this basis, units of two membrane-spanning domains with partial loops would be more likely as a basic unit than single helixes; additionally, where the N-terminal region has accepted other domains that function at a particular location (cytoplasmic or extracellular), it is likely to be less probable. However, apparently, some proteins (such as the M1 muscarinic receptor from Drosophila melanogaster) are able to accept extremely large substitutions between TM regions, so even this apparently unlikely possibility cannot be ruled out entirely. One limitation may well be that the termini of the protein need to be close, which would be impossible for proteins with odd numbers of TM regions. 
Circular permutants of the β-barrel transmembrane protein OmpX (in which the termini are close) have been used in bacterial display experiments (Rice et al., 2006), demonstrating the possibility in the case of β-barrel membrane proteins. Whether helical membrane proteins will prove to be different seems an interesting unresolved question.

5 INTRINSIC PROTEIN SYMMETRIES Most of the cases discussed earlier are considered to be the consequence of gene duplication events on the basis that their structures and/or sequences are symmetrical. Although there is frequently other evidence that supports this, it is not clear how strongly the observation of symmetry in protein structures indicates an earlier duplication event. An internal symmetry that appears to have arisen by duplication may be due to intrinsic physical constraints on protein folds as a consequence of preferences for chirality and compactness. It is therefore worth considering the results of analyses of symmetry in protein structures from a theoretical standpoint. 5.1 β/α Proteins The clear chiral preference in connections between secondary structure units (the connection βαβ is almost never left-handed) can provide a strong bias toward symmetric structures. These structural constraints imply that the different βα folds we see cannot be random, as there will always be the possibility of finding some symmetric arrangements of secondary structures by chance. This is particularly likely when both the


degree of symmetry (two-, three-, fourfold) is not specified beforehand and there is scope to neglect arbitrary “disordered” parts of the structure.

5.2 All-β Proteins Symmetries can be found in the all-β structure class. Typically, these are seen in structures consisting of a β-sheet (or sheets) with a closed connection forming a barrel structure. If the barrel were opened up (as in a Mercator projection of the world), the whole could be depicted in two dimensions. In this representation, one of the chiral symmetries resembles the decorative motif commonly used in classical Greece and was accordingly named the Greek key (Richardson, 1977). The extension of this spiral has been called a jelly roll and consists of eight strands in a closed barrel, with two connections across the top and two below. It has been suggested that the Greek-key motif (and the jelly roll) might have arisen from the symmetric folding of an elongated hairpin β-structure in the form of a double helix.

5.3 All-α Proteins Folding symmetries are also found in the α/α class, but their relationship to the local chiral preferences of the substructures is less clear. Much of the apparent symmetry within this class probably results simply from the more limited packing arrangements available with fewer secondary structures. A bundle of four or five helixes will have some regularity almost no matter how they are packed. Symmetry becomes more apparent in the all-α superhelixes and barrels, which have a simple solenoid fold. In contrast to the superhelixes, which have clear sequence repeats and function primarily as structural proteins, the barrels are all enzymes and do not have a repeating motif.

5.4 Fourier Analysis of Structural Symmetry The cases above have generally been established on the basis of multiple sources of evidence for duplication, including structural symmetry, sequence repeats, conserved positioning of functional residues, and the maintenance of unusual structural features.
Since structural symmetry has in many cases led to the observation of previously undetected sequence repeats and therefore identified plausible new evolutionary relationships involving duplications and fusions, it is interesting to search for structural repeats on a large scale. The SAP program for structural comparison (Taylor and Orengo, 1989; Taylor, 1999) provides the option to align a structure to itself and find self-similarities indicative of symmetry. Using the technique of Fourier analysis, the periodicity of “ridges” that such self-similarities create in the comparison matrix can be identified and used to define repeat boundaries (Taylor et al., 2002). After calibration of this method on real and artificial repetitive proteins, a search of a subset of the PDB found a significant fraction (17%) that were highly repetitive, dominated by the β-propeller and TIM-barrel folds. Once obvious sequence repeats are removed from this list, the remainder are almost exclusively globular β/α class proteins. This is a surprising result, as there is no obvious structural reason why these should be more likely to generate such repetitive folds.
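To illustrate the general idea of reading a repeat length from the periodicity of a self-comparison, the sketch below works on a toy string of secondary-structure states instead of a real structure. It collapses the self-comparison matrix to one score per offset and takes the strongest non-zero Fourier component of that profile. This is a deliberately simplified stand-in and not the SAP implementation; the function names, the circular shift, and the toy input are assumptions.

```python
import cmath

def self_match_profile(states):
    """Fraction of positions that agree when the sequence is compared with
    itself shifted circularly by each offset k (one value per diagonal)."""
    n = len(states)
    return [sum(states[i] == states[(i + k) % n] for i in range(n)) / n
            for k in range(n)]

def dominant_period(profile, tol=1e-9):
    """Period of the strongest non-zero Fourier component, or None if flat."""
    n = len(profile)
    best_f, best_power = None, 0.0
    for f in range(1, n // 2 + 1):
        coeff = sum(p * cmath.exp(-2j * cmath.pi * f * k / n)
                    for k, p in enumerate(profile))
        power = abs(coeff) ** 2
        if power > best_power + tol:  # tol guards against rounding noise
            best_f, best_power = f, power
    return round(n / best_f) if best_f else None

# Toy input (hypothetical): four copies of a strand-then-helix unit of 8 states.
toy_states = "EEEEHHHH" * 4
period = dominant_period(self_match_profile(toy_states))
print(period)
```

With real coordinates the per-offset scores would come from a structural self-alignment, and the recovered period would then be used to place repeat boundaries.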


One possible explanation of the predominant symmetries of the globular β/α proteins might be based on the relative sizes and degrees of structural freedom that are available to the various supersecondary structure types. All-β proteins have a geometric regularity imposed by the plane of the β-sheet but are otherwise relatively topologically unconstrained, thus giving rise to few symmetries by chance. The all-α protein structures lack the spatial register imposed by a hydrogen-bonded sheet and so will naturally be less symmetric in their packing. However, as the α-helix is a relatively large structure, smaller proteins (with fewer than six helixes) will stand a good chance of having a symmetric arrangement. The β/α unit combines symmetry-inducing attributes of the previous types, having the spatial register of the β-sheet, while being relatively large, so there will not be too many unsymmetric arrangements in a protein of typical size. Alternatively, the reason for this bias might be a result of the evolutionary history of these folds (Phillips et al., 1978; Lupas et al., 2001). The most obviously repeating structures are of relatively recent origin (within the last 500 million years) and so retain their sequence signal, whereas those in the βα class tend to be ancient metabolic enzymes often common to all known life. This suggests that their structural symmetry may be a relic of duplications in the far distant past, far enough back in time that no trace of detectable sequence similarity remains. Such ideas are difficult, if not impossible, to prove (Phillips et al., 1978). However speculative, they nonetheless provide one of the few glimpses into the distant origins of protein structures.

6 DUPLICATE AND DESTROY

The survey above indicates that tandem duplication, optionally followed by partial deletion, is a key mechanism for the generation of structural novelty. Mechanisms such as circular permutation in particular are key to generating new topologies in a nearly neutral fashion. Another mechanism, not relevant here but likely to be of high importance to the question of fold change in structures generally, is the existence of “chameleon sequences,” which equilibrate between two very different conformations and may therefore form evolutionary bridges between apparently unrelated folds (reviewed in Taylor, 2007). Mutations occur at different rates in different proteins (Luz and Vingron, 2006) and at different locations within a protein sequence; the same is true for insertions and deletions. These events can lead to the gradual accretion (Pan and Bardwell, 2006) or embellishment of substructure around a conserved core (Reeves et al., 2006).

One major unresolved question in the study of protein evolution is the extent to which these mechanisms have operated in generating the structural diversity that we observe at present. Since “jumps” involving chameleon sequences are clearly possible, it is not simple to determine this (not least because we cannot at present predict when and where they can happen). However, we can suggest two reasons why gene duplications and deletions may be a better explanation. First, the mechanisms that involve gradual mutations, accretions around structural cores, and chameleon sequences are likely to operate very slowly. Mutational studies of proteins have only rarely observed radical changes, which suggests that in a given protein at most only a handful of such sequences can exist. Additionally, such large-scale structural changes as can be introduced by chameleon sequences are more


likely to have negative than positive functional consequences. On this basis, duplication and deletion, which can generate structural transitions much more quickly, are more probable. Second, it is not yet clear whether protein structure space is sufficiently well connected for enough accessible paths to exist between folds to permit such mechanisms to make the journey between them (see Whitehead et al., 2008 and Chapter 6).

The concept that proteins evolve primarily by the duplicate-and-destroy method (in which agglomerations of proteins are built up and then unnecessary components gradually removed) has the advantage that it is less risky: If two units that fold independently are joined together, we expect the initial result to be two joined, independently folding units. If an independently folding unit experiences an insertion, deletion, or point mutation, a disruption is far more likely. The only risk in the former mechanism is that posed by an increased “dosage” of the duplicated domain.

The “periodic table” representation of protein fold space (Taylor, 2002) provides one way to represent and visualize structural transitions. In this representation a duplication event would be represented as a leap forward, with deletion events corresponding to steps back to smaller structures, not unlike radioactive decay in the more familiar table of elements. It would be possible to tune such a model to recapitulate known evolutionary events; this would provide a tool to evaluate the probability of an evolutionary relationship between two proteins, not unlike the work of Shakhnovich and colleagues at the residue level using simple lattice models (Dokholyan et al., 2003; Zeldovich et al., 2006).

We have seen how some relationships between proteins can be traced back through a series of duplications and domain rearrangements to very basic elements of protein structure, often incorporating only a pair or a few secondary structure elements.
Although difficult to prove, it is tempting to speculate that these basic cores once corresponded to the earliest functional units and that all known protein folds can be derived through the mechanisms of duplication and deletion that have been described above.

Acknowledgment

The work was supported by the Medical Research Council (UK).

REFERENCES

Ahn HJ, Eom SJ, Yoon HJ, Lee BI, Cho H, Suh SW. 2003. Crystal structure of class I acetohydroxy acid isomerase from Pseudomonas aeruginosa. J Mol Biol 328:505–515.
Andreeva A, Murzin AG. 2006. Evolution of protein fold in the presence of functional constraints. Curr Opin Struct Biol 16:399–408.
Bajaj M, Blundell T. 1984. Evolution and the tertiary structure of proteins. Annu Rev Biophys Bioeng 13:453–492.
Banner DW, Bloomer AC, Petsko GA, Phillips DC, Pogson CI, Wilson IA. 1975. Structure of chicken muscle triose phosphate isomerase determined crystallographically at 2.5 Å resolution. Nature 255:609–614.
Barrientos LG, Louis JM, Botos I, Mori T, Han Z, O’Keefe BR, Boyd MR, Wlodawer A, Gronenborn AM. 2002. The domain-swapped dimer of cyanovirin-n is in a metastable folded state. Structure 10:673–686.


Bashford D, Cohen FE, Karplus M, Kuntz ID, Weaver DL. 1988. Diffusion–collision model for the folding kinetics of myoglobin. Protein Struct Funct Genet 4:211–227.
Bennet MJ, Schlunegger MP, Eisenberg D. 1995. 3D domain swapping: a mechanism for oligomer assembly. Protein Sci 4:2455–2468.
Blundell TL, Jenkins JA, Sewell BT, Pearl LH, Cooper JB, Tickle IJ, Veerapandian B, Wood SP. 1990. X-ray analyses of aspartic proteinases: the 3-dimensional structure at 2.1 Å resolution of endothiapepsin. J Mol Biol 211:919–941.
Bofkin L, Goldman N. 2007. Variation in evolutionary processes at different codon positions. Mol Biol Evol 24:513–521.
Carey J, Lindman S, Bauer M, Linse S. 2007. Protein reconstitution and three-dimensional domain swapping: benefits and constraints of covalency. Protein Sci 16:2317–2333.
Chang G, Roth CB, Reyes CL, Pornillos O, Chen Y, Chen AP. 2006. Retraction of Pornillos et al., Science 310(5756) 1950-1953; retraction of Reyes and Chang, Science 308(5724) 1028-1031; retraction of Chang and Roth, Science 293(5536) 1793-1800. Science 314:1875.
Chaudhuri I, Soding J, Lupas AN. 2008. Evolution of the beta-propeller fold. Protein Struct Funct Genet 71:795–803.
Chen YJ, Pornillos O, Lieu S, Ma C, Chen AP, Chang G. 2007. X-ray structure of emre supports dual topology model. Proc Natl Acad Sci USA 104:18999–19004.
Cheng H, Grishin NV. 2005. DOM-fold: a structure with crossing loops found in DmpA ornithine acetyltransferase and molybdenum cofactor-binding protein domain. Protein Sci 14:1902–1910.
Craik CS, Buchman SR, Beychok S. 1981. O2 binding properties of the product of the central exon of beta globin gene. Nature 291:87–90.
Daley DO, Rapp M, Granseth E, Melen K, Drew D, von Heijne G. 2005. Global topology analysis of the Escherichia coli inner membrane proteome. Science 308:1321–1323.
deRoos ADG. 2005. Origins of introns based on the definition of exon modules and their conserved interfaces. Bioinformatics 21:2–9.
Dokholyan NV, Deeds EJ, Shakhnovich EI. 2003. Protein evolution within a structural space. Biophys J 85:2962–2972.
Dutzler R, Campbell EB, Cadene M, Chait BT, MacKinnon R. 2002. X-ray structure of a clc chloride channel at 3.0 Å reveals the molecular basis of anion selectivity. Nature 415:287–294.
Forrest LR, Zhang Y-W, Jacobs MT, Gesmonde J, Xie L, Honig BH, Rudnick G. 2008. Mechanism for alternating access in neurotransmitter transporters. Proc Natl Acad Sci USA 105:10338–10343.
Fu D, Libson A, Miercke LJ, Weitzman C, Nollert P, Krucinski J, Stroud RM. 2000. Structure of a glycerol-conducting channel and the basis for its selectivity. Science 290:481–486.
Go M. 1978. Correlation of DNA exonic regions with protein structural units in haemoglobin. Nature 291:90–92.
Hunte C, Screpanti E, Venturi M, Rimon A, Padan E, Michel H. 2005. Structure of a Na+/H+ antiporter and insights into mechanism of action and regulation by pH. Nature 435:1197–1202.
Jensen EO, Paludan K, Hyldig-Nielsen JJ, Jorgensen P, Marcker KA. 1981. The structure of a chromosomal leghaemoglobin gene from soybean. Nature 291:677–679.
Jones R. 2006. RNA silencing sheds light on the RNA world. PLoS Biol 4:1.
Lecomte JTJ, Vuletich DA, Lesk AM. 2005. Structural divergence and distant relationships in proteins: evolution of the globins. Curr Opin Struct Biol 15:290–301.


Lehner I, Basting D, Meyer B, Haase W, Manolikas T, Kaiser C, Karas M, Glaubitz C. 2008. The key residue for substrate transport (glu(14)) in the emre dimer is asymmetric. J Biol Chem 283:3281–3288.
Locher KP, Lee AT, Rees DC. 2002. The E. coli btucd structure: a framework for abc transporter architecture and mechanism. Science 496:1091–1098.
Lolkema JS, Sobczak I, Slotboom D-J. 2005. Secondary transporters of the 2hct family contain two homologous domains with inverted membrane topology and trans re-entrant loops. FEBS Lett 272:2334–2344.
Lolkema JS, Dobrowolski A, Slotboom D. 2008. Evolution of antiparallel two-domain membrane proteins: tracing multiple gene duplications events in the duf606 family. J Mol Biol 378:596–606.
Lupas AN, Ponting CP, Russell RB. 2001. On the evolution of protein folds: Are similar motifs in different protein folds the result of convergence insertion or relics of an ancient peptide world? J Struct Biol 134:191–203.
Luz H, Vingron M. 2006. Family specific rates of protein evolution. Bioinformatics 22:1106–1171.
Lynch M. 2002. Gene duplication and evolution. Science 297:945–947.
Lynch M, O’Hely M, Walsh B, Force A. 2001. The probability of preservation of a newly arisen gene duplication. Genetics 159:1789–1804.
Ma C, Chang G. 2004. Structure of the multidrug resistance efflux transporter emre from Escherichia coli. Proc Natl Acad Sci USA 101:2852–2857.
McHaourab BS, Mishra S, Koteiche HA, Amadi SH. 2008. Role of sequence bias in the topology of the multidrug transporter EMRE. Biochemistry 47:7980–7982.
Murata K, Mitsuoka K, Hirai T, Walz T, Agre P, Heymann JB, Engel A, Fujiyoshi Y. 2000. Structural determinants of water permeation through aquaporin-1. Nature 407:599–605.
Murzin AG. 1992. Structural principles for the propeller assembly of β-sheets: the preference for seven-fold symmetry. Protein Struct Funct Genet 14:191–201.
Nara T, Kouyama T, Kurata Y, Kikukawa T, Miyauchi S, Kamo N. 2007. Anti-parallel membrane topology of a homo-dimeric multidrug transporter, emre. J Biol Chem 142:621–625.
Ohno S. 1970. Evolution by Gene Duplication. New York: Springer-Verlag.
Palczewski K, Kumasaka T, Hori T, Behnke CA, Motoshima H, Fox BA, LeTrong I, Teller DC, Okada T, Stenkamp RE, et al. 2000. Crystal structure of rhodopsin: a G-protein coupled receptor. Science 289:739–745.
Pan JL, Bardwell JCA. 2006. The origami of thioredoxin-like folds. Protein Sci 15:2217–2227.
Patthy L. 2008. Protein Evolution, 2nd ed. Oxford: Blackwell.
Pearl LH, Taylor WR. 1987. A structural model for the retroviral proteases. Nature 329:351–354.
Peisajovich SG, Rockah L, Tawfik DS. 2006. Evolution of new protein topologies through multistep gene rearrangements. Nat Genet 38:168–174.
Phillips DC, Sternberg MJE, Thornton JM, Wilson IA. 1978. An analysis of the structure of triose phosphate isomerase and its comparison with lactate dehydrogenase. J Mol Biol 119:329–351.
Pornillos O, Chen Y, Chen AP, Chang G. 2005. X-ray structure of the emre multidrug transporter in complex with a substrate. Science 310:1950–1953.
Ptitsyn OB, Rashin AA. 1975. A model of myoglobin self-organisation. Biophys Chem 3:1–20.
Rao ST, Rossmann MG. 1973. Comparison of super-secondary structures in proteins. J Mol Biol 76:241–256.


Rapp M, Seppala S, Granseth E, von Heijne G. 2007a. Emulating membrane protein evolution by rational design. Science 315:1282–1284.
Rapp M, Seppala S, Granseth E, von Heijne G. 2007b. Reply to Schuldiner 2007. Science 317:748–751.
Reeves GA, Dallman TJ, Redfern OC, Akpor A, Orengo CA. 2006. Structural diversity of domain superfamilies in the CATH database. J Mol Biol 360:725–741.
Rice JJ, Schohn A, Bessette PH, Boulware KT, Daugherty PS. 2006. Bacterial display using circularly permuted outer membrane protein ompx yields high affinity peptide ligands. Protein Sci 15:825–836.
Richardson JS. 1977. β-Sheet topology and the relatedness of proteins. Nature 268:495–500.
Ridge KD, Lee SSJ, Abdulaev NG. 1996. Examining rhodopsin folding and assembly through expression of polypeptide fragments. J Biol Chem 271:7860–7867.
Saaf A, Baars L, von Heijne G. 2001. The internal repeats in the Na+/Ca2+ exchanger-related Escherichia coli protein yrbg have opposite membrane topologies. J Biol Chem 276:18905–18907.
Saaf A, Johansson M, Wallin E, von Heijne G. 1999. Divergent evolution of membrane protein topology: the Escherichia coli RnfA and RnfE homologues. Proc Natl Acad Sci USA 96:8540–8544.
Salgado PS, Koivunen MRL, Makeyev EV, Bamford DH, Stuart DI, Grimes JM. 2006. The structure of an RNAi polymerase links RNA silencing and transcription. PLoS Biol 4:2274–2281.
Sandaman K, Reeve JN. 2006. Archaeal histones and the origin of the histone fold. Curr Opin Microbiol 9:520–525.
Schoneberg T, Liu J, Wess J. 1995. Plasma-membrane localization and functional rescue of truncated forms of a G-protein coupled receptor. J Biol Chem 270:18000–18006.
Schuldiner S. 2007a. Controversy over emre structure. Science 317:748–751.
Schuldiner S. 2007b. When biochemistry meets structural biology: the cautionary tale of emre. TIBS 32:252–258.
Shapiro JA, Adhya SL, Bukhari AI. 1977. Introduction: New pathways in the evolution of chromosome structure. In Bukhari AI, Shapiro JA, Adhya SL (eds.), DNA Insertion Elements, Plasmids and Episomes. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press, pp. 3–11.
Sobczak I, Lolkema JS. 2005. The 2-hydroxycarboxylate transporter (2hct) family: physiology structure and mechanism. Microbiol Mol Biol Rev 69:665–695.
Steiner-Mordoch S, Soskiine M, Solomon D, Rotem D, Gold A, Yechieli M, Adam Y, Schuldiner S. 2008. Parallel topology of genetically fused emre homodimers. EMBO J 27:17–26.
Street TO, Rose GD, Barrick D. 2006. The role of introns in repeat protein gene formation. J Mol Biol 360:258–266.
Tang J, James MNG, Hsu IN, Jenkins JA, Blundell TL. 1978. Structural evidence for gene duplication in the evolution of the acid proteases. Nature 271:619–621.
Tate CG, Kunji ERS, Lebendiker M, Schuldiner S. 2001. The projection structure of emre, a proton-linked multidrug transporter from Escherichia coli, at 7 Å resolution. EMBO J 20:77–81.
Taylor WR. 1999. Protein structure alignment using iterated double dynamic programming. Protein Sci 8:654–665.
Taylor WR. 2000. A deeply knotted protein and how it might fold. Nature 406:916–919.
Taylor WR. 2002. A periodic table for protein structure. Nature 416:657–660.
Taylor WR. 2007. Evolutionary transitions in protein fold space. Curr Opin Struct Biol 17:354–361.


Taylor EW, Agarwal A. 1993. Sequence homology between bacteriorhodopsin and G-protein coupled receptors: exon shuffling or evolution by duplication? FEBS Lett 325:161–166.
Taylor WR, Orengo CA. 1989. Protein structure alignment. J Mol Biol 208:1–22.
Taylor WR, Heringa J, Baud F, Flores TP. 2002. A Fourier analysis of symmetry in protein structure. Protein Eng 15:79–89.
Theobald DL, Wuttke DS. 2006. Divergent evolution within protein superfolds inferred from profile-based phylogenetics. J Mol Biol 354:722–737.
Thornton JM, Sibanda BL. 1983. Amino and carboxy-terminal regions in globular proteins. J Mol Biol 167:443–460.
Toh H, Ono M, Saigo K, Miyata T. 1985. Retroviral protease-like sequence in the yeast transposon ty1. Nature 315:691.
Van den Berg B, Clemons WM, Collinson I, Modis Y, Hartmann E, Harrison SC, Rapoport TA. 2004. X-ray structure of a protein-conducting channel. Nature 427:36–44.
Vesterstrom J, Taylor WR. 2006. Flexible secondary structure based protein structure comparison applied to the detection of circular permutation. J Comput Biol 13:43–62.
von Heijne G. 1992. Membrane-protein structure prediction: hydrophobicity analysis and the positive-inside rule. J Mol Biol 225:487–494.
Weiner J, Thomas G, Bornberg-Bauer E. 2005. Rapid motif-based prediction of circular permutations in multi-domain proteins. Bioinformatics 21:932–937.
Whamond GS, Thornton JM. 2006. An analysis of intron positions in relation to nucleotides amino acids and protein secondary structure. J Mol Biol 359:238–247.
Whitehead DJ, Wilke CO, Vernazobres D, Bornberg-Bauer E. 2008. The look-ahead effect of phenotypic mutations. Biol Direct 3:18.
Wistow GJ, Pisano MM, Chepelinsky AB. 1991. Tandem sequence repeats in transmembrane channel proteins. TIBS 16:170–171.
Wlodawer A, Miller M, Jaskolski M, Sathyanarayana BK, Baldwin E, Weber IT, Selk LM, Clawson L, Schneider J, Kent SBH. 1989. Conserved folding in retroviral proteases: crystal structure of a synthetic HIV-1 protease. Science 245:616–621.
Yadid I, Tawfik DS. 2007. Reconstruction of functional β-propeller lectins via homo-oligomeric assembly of shorter fragments. J Mol Biol 365:10–17.
Yamashita A, Singh SK, Kawate T, Jin Y, Gouaux E. 2005. Crystal structure of a bacterial homologue of Na+/Cl−-dependent neurotransmitter transporters. Nature 437:215–223.
Yu FH, Catterall WA. 2003. Overview of the voltage-gated sodium channel family. Genome Biol 4:207.
Zardoya R. 2005. Phylogeny and evolution of the major intrinsic protein family. Biol Cell 97:397–414.
Zardoya R, Villalba S. 2001. A phylogenetic framework for the aquaporin family in eukaryotes. J Mol Evol 52:391–404.
Zeldovich KB, Berezovsky IN, Shakhnovich EI. 2006. Physical origins of protein superfamilies. J Mol Biol 357:1335–1343.

8 Statistical Methods for Detecting Functional Divergence of Gene Families

XUN GU
Department of Genetics, Development and Cell Biology, Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, Iowa

1 INTRODUCTION

Many organisms, from yeast to human, have undergone genomewide or local chromosome duplication events during their evolution (Ohno, 1970; Lundin, 1993; Holland et al., 1994; Spring, 1997; Wolfe and Shields, 1997). After gene duplication, one gene copy maintains the original function, whereas the other copy is free to accumulate amino acid changes toward functional divergence (Li, 1983). As a result, many genes are represented as several paralogs in the genome with related but distinct functions. Since gene family proliferation is thought to have provided the raw materials for functional innovations, it is desirable, from sequence analysis, to identify amino acid sites that are responsible for the functional diversity. This approach has great potential for functional genomics because it is cost-effective, and these predictions can be tested further by experimentation. Since most amino acid changes are not related to functional divergence but represent neutral evolution, it is crucial to develop appropriate statistical methods to distinguish between these two possibilities.

Indeed, when sequences of a gene family are available, the identification of functionally important residues can be approached computationally (e.g., Casari et al., 1995; Lichtarge et al., 1996; Livingstone and Barton, 1996; Gu, 1999, 2001; Landgraf et al., 1999). In particular, Gu (1999, 2001) has developed a novel probabilistic model, based on the underlying principle that functional divergence after gene duplication is correlated strongly with the change of evolutionary rate. This correlation is a complement to a fundamental rule in molecular evolution: Functional importance is correlated strongly with evolutionary conservation (Kimura, 1983). A site-specific profile based on the posterior probability was then developed to predict critical residues for functional divergences between two gene clusters. Many authors (e.g., Wang and Gu, 2001; Gu et al., 2002; Mathews, 2005) have applied this newly developed method successfully to the study of functional diversity in gene families. For example, Wang and Gu (2001) studied the caspase gene family and found that our predictions are supported by experimental data. In this chapter we review the statistical basis for testing functional divergence after gene duplication and predicting the amino acid residues that are responsible for these divergences.

Evolution After Gene Duplication, Edited by Katharina Dittmar and David Liberles. Copyright © 2010 Wiley-Blackwell

2 TWO-STATE MODEL FOR FUNCTIONAL DIVERGENCE

Consider a multiple alignment of a gene family with two sets of homologous genes, 1 and 2 (Figure 1). Although various terminologies were used previously (e.g., Casari et al., 1995; Lichtarge et al., 1996; Livingstone and Barton, 1996; Gu, 1999; Landgraf et al., 1999), amino acid patterns can be classified tentatively as follows (Gu, 2001). Type 0 represents amino acid patterns that are universally conserved through the entire gene family, implying that these residues are important for the common function shared by all member genes. Type I represents amino acid patterns that are highly conserved in gene 1 but highly variable in gene 2, or vice versa, implying that these residues have experienced altered functional constraints. Type II represents amino acid patterns that are highly conserved in both genes but whose biochemical properties are very divergent (e.g., charge positive vs. negative), implying that these residues may be responsible for functional specification. Finally, amino acid patterns at many residues are not clear-cut and have to be regarded as unclassified (type U).

After gene duplication, functional divergence between duplicates is likely to occur in the early stage. There are two basic types of functional divergence. The first type results in site-specific altered functional constraints (i.e., different evolutionary rates) between duplicate genes. We named it type I functional divergence, as it typically generates type I amino acid patterns. The second type results in no altered functional constraints but a radical change in amino acid property between duplicates (e.g., charge, hydrophobicity). We named it type II functional divergence, as it typically generates type II amino acid patterns.

Figure 1 Two gene clusters after gene duplication. [Tree diagram with node labels x0–x6 in cluster 1 and y0–y6 in cluster 2.]

For two gene clusters generated by the gene duplication, the two-state model assumes that in each cluster, one site has two possible states, S0 (functional divergence-unrelated, or functional constraint) and S1 (functional divergence). When a site is under S0, the evolutionary rate at this site is virtually the same between two clusters (i.e., λ1 = λ2). In contrast, under state S1 we have statistical independence between λ1 and λ2 (Gu, 1999). The assumption of rate independence for functional divergence means that the evolutionary rate at such a site in one cluster carries no information for predicting the intensity of functional constraint in the other cluster.

3 TESTING TYPE I FUNCTIONAL DIVERGENCE AFTER GENE DUPLICATION

Gu (1999, 2001) developed statistical approaches to estimating type I functional divergence, which have been implemented in the software DIVERGE (Gu and Vander Velden, 2002). The principal difference between these two models is that the method of Gu (2001) is based on the Markov chain model, whereas that of Gu (1999) is based on the Poisson model. Figure 2 outlines the pipeline of statistical analysis.

3.1 Markov Chain Model

Under the Markov chain model, the likelihood for sequence evolution can be derived as follows (Felsenstein, 1981; Kishino et al., 1990). First, the transition probability matrix for a given time period t can be computed as P = exp(λRt), where the rate matrix R represents the pattern of amino acid substitutions, which can be determined empirically by, for example, the Dayhoff model (Dayhoff et al., 1978). The evolutionary rate (λ)

may vary among sites because of different functional constraints. Usually, λ is treated as a random variable that follows a gamma distribution; that is,

$$\phi(\lambda) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, \lambda^{\alpha-1} e^{-\beta\lambda} \tag{1}$$

(Uzzell and Corbin, 1971). The shape parameter, α, describes the strength of rate variation among sites (i.e., a small value of α means strong rate heterogeneity among sites, and α = ∞ means no rate variation among sites), whereas β is a scale constant (Gu et al., 1995).

Figure 2 Chart for statistical analysis of functional divergence. [Flowchart; boxes in order:]
1. Input: aligned amino acid sequences of two clusters (A, B) and the phylogeny.
2. Probabilistic model of a site in each cluster: f(XA|λA), f(XB|λB), where XA and XB are amino acid configurations in A and B. In the fast algorithm of Gu (1999), XA and XB are simplified to the expected number of substitutions (Gu and Zhang, 1997) so that f(XA|λA) and f(XB|λB) are Poisson processes. See Gu (2001) for a formal likelihood treatment.
3. Conditional joint probability: assume rates λA and λB vary among sites according to a gamma distribution. Under S0, λA = λB = λ, so f(X|S0) = E[f(XA|λ) f(XB|λ)]. Under S1, λA and λB are independent, so f(X|S1) = E[f(XA|λA)] E[f(XB|λB)], where X = (XA, XB) and E denotes expectation.
4. The joint probability of X = (XA, XB) is f(X) = (1 − θ) f(X|S0) + θ f(X|S1), and the likelihood over all sites (k) is L = Πk f(Xk).
5. Site-specific profile. Posterior analysis: P(S1|X) = θ f(X|S1)/f(X).

Consider the phylogenetic tree in Figure 1. Let X = (x1, x2, x3, x4) and Y = (y1, y2, y3, y4) be the amino acid patterns observed at a site in clusters 1 and 2, respectively. For the (unrooted) subtree for cluster 1 or 2, the conditional probability of observing X or Y at a site can be written as follows:

$$\begin{aligned}
f(X \mid \lambda) &= \sum_{x_5=1}^{20} \sum_{x_6=1}^{20} b_{x_5} P_{x_5 x_1} P_{x_5 x_2} P_{x_5 x_6} P_{x_6 x_3} P_{x_6 x_4} \\
f(Y \mid \lambda) &= \sum_{y_5=1}^{20} \sum_{y_6=1}^{20} b_{y_5} P_{y_5 y_1} P_{y_5 y_2} P_{y_5 y_6} P_{y_6 y_3} P_{y_6 y_4}
\end{aligned} \tag{2}$$

where Pij = Pij(vij) is the transition probability from node i to node j, vij is the branch length between them, and bi is the frequency of amino acid i. By integrating out the random variable λ, the probability of observing X or Y at a site is given by

$$\begin{aligned}
p(X) &= \int_0^{\infty} f(X \mid \lambda)\, \phi(\lambda)\, d\lambda \\
p(Y) &= \int_0^{\infty} f(Y \mid \lambda)\, \phi(\lambda)\, d\lambda
\end{aligned} \tag{3}$$

respectively. Let P(S1) = θI be the probability of a site being in state S1 (functional divergence) and P(S0) = 1 − θI be the probability of a site being in state S0 (functional constraint). We call θI the coefficient of type I functional divergence between clusters 1 and 2 (Gu, 1999). Let X and Y be the amino acid patterns of a site in clusters 1 and 2, respectively. Since evolutionary rates (λ1 and λ2) at an S1 site (i.e., a site under S1) are statistically independent between two clusters, whereas they are completely correlated (λ1 = λ2, without loss of generality) at an S0 site, the joint probability of subtrees conditional on S0 or S1 is given by

$$\begin{aligned}
f(X, Y \mid S_0) &= \int_0^{\infty} f(X \mid \lambda)\, f(Y \mid \lambda)\, \phi(\lambda)\, d\lambda = E[f(X \mid \lambda)\, f(Y \mid \lambda)] \\
f(X, Y \mid S_1) &= p(X)\, p(Y) = E[f(X \mid \lambda_1)] \times E[f(Y \mid \lambda_2)]
\end{aligned} \tag{4}$$

where f(X|λ1) or f(Y|λ2) is the likelihood of each unrooted subtree, respectively [e.g., it is given by Eq. (2) for the phylogeny in Figure 1], and E means taking expectation. From the two-state model, one can easily show that the joint probability of two subtrees can be written as

$$p(X, Y) = (1 - \theta_I)\, f(X, Y \mid S_0) + \theta_I\, f(X, Y \mid S_1) \tag{5}$$
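The expectations in Eqs. (3) to (5) can be approximated numerically once the subtree likelihoods are available as functions of λ. The sketch below is an illustration under stated assumptions, not the DIVERGE implementation: β is treated as the rate-type constant appearing in Eq. (1), α is assumed to be at least 1 so that a simple midpoint rule is adequate, and the two toy likelihood functions are hypothetical stand-ins for Eq. (2).

```python
import math


def gamma_pdf(lam, alpha, beta):
    """Gamma density of Eq. (1): shape alpha, constant beta in exp(-beta * lam)."""
    return beta ** alpha * lam ** (alpha - 1) * math.exp(-beta * lam) / math.gamma(alpha)


def expectation(func, alpha, beta, upper=60.0, steps=60000):
    """Midpoint-rule approximation of E[func(lambda)] over the gamma density."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h
        total += func(lam) * gamma_pdf(lam, alpha, beta)
    return total * h


def two_state_joint(f_x, f_y, theta, alpha, beta):
    """Return f(X,Y|S0), f(X,Y|S1) and p(X,Y) following Eqs. (4) and (5)."""
    f_s0 = expectation(lambda lam: f_x(lam) * f_y(lam), alpha, beta)  # shared rate
    f_s1 = expectation(f_x, alpha, beta) * expectation(f_y, alpha, beta)  # independent rates
    return f_s0, f_s1, (1 - theta) * f_s0 + theta * f_s1


# Hypothetical stand-ins for the subtree likelihoods of Eq. (2):
# cluster 1 shows no change over total branch length t, cluster 2 shows at least one.
t = 1.5


def conserved(lam):
    return math.exp(-lam * t)


def variable(lam):
    return 1.0 - math.exp(-lam * t)


f_s0, f_s1, p_xy = two_state_joint(conserved, variable, theta=0.3, alpha=2.0, beta=2.0)
print(f_s0, f_s1, p_xy)
```

For this toy site, which is conserved in one cluster and variable in the other, the independent-rate term exceeds the shared-rate term, which is the kind of signal that type I functional divergence is meant to capture.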

Then, under the assumption of site independence, the likelihood function over all sites (gaps excluded) is given by

$$L(x \mid \text{data}) = \prod_{k} p\left(X^{(k)}, Y^{(k)}\right) \tag{6}$$

where k indexes the sites and x is the set of unknown parameters.

3.2 Poisson Model

Gu (1999) developed a Poisson-based model to estimate the coefficient of functional divergence, which is computationally efficient. At a given site, the number of amino acid changes (Xi, i = 1, 2 for gene clusters 1 and 2, respectively) follows a Poisson distribution; that is, the probability that Xi = k is given by

$$p_i(k) = \frac{(\lambda_i T_i)^k}{k!}\, e^{-\lambda_i T_i}, \qquad i = 1, 2 \tag{7}$$

where T1 and T2 are the total evolutionary times of clusters 1 and 2, respectively. The joint distribution of the number of changes, P(X1, X2), can be derived as follows. For any S1 site, the evolutionary rate is statistically independent between two clusters, whereas it is completely correlated at an S0 site. Thus, the probability of X1 = i in cluster 1 and X2 = j in cluster 2 under state S0 or S1 is given by

$$\begin{aligned}
P(X_1 = i, X_2 = j \mid S_1) &= Q_1(i)\, Q_2(j) \\
P(X_1 = i, X_2 = j \mid S_0) &= K_{12}(i, j)
\end{aligned} \tag{8}$$

The analytical forms of Q1, Q2, and K12 were derived by Gu (1999). Then the joint distribution can be expressed as

$$P(X_1, X_2) = (1 - \theta_I)\, K_{12} + \theta_I\, Q_1 Q_2 \tag{9}$$
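As an illustration of Eqs. (7) to (9), the sketch below uses the closed forms obtained by integrating the Poisson probabilities of Eq. (7) over the gamma density of Eq. (1). These negative-binomial-type expressions are derived here for the example and may be parameterized differently from those in Gu (1999); the function names and all parameter values are hypothetical.

```python
import math


def q_marginal(i, t, alpha, beta):
    """E[Poisson(i; lambda * t)] over the gamma density: a negative binomial term."""
    log_q = (math.lgamma(i + alpha) - math.lgamma(alpha) - math.lgamma(i + 1)
             + i * math.log(t) + alpha * math.log(beta)
             - (i + alpha) * math.log(beta + t))
    return math.exp(log_q)


def k_joint(i, j, t1, t2, alpha, beta):
    """E[Poisson(i; lambda * t1) * Poisson(j; lambda * t2)] with one shared rate (S0)."""
    log_k = (math.lgamma(i + j + alpha) - math.lgamma(alpha)
             - math.lgamma(i + 1) - math.lgamma(j + 1)
             + i * math.log(t1) + j * math.log(t2) + alpha * math.log(beta)
             - (i + j + alpha) * math.log(beta + t1 + t2))
    return math.exp(log_k)


def joint_probability(i, j, t1, t2, alpha, beta, theta):
    """Eq. (9): mixture of the shared-rate (S0) and independent-rate (S1) terms."""
    k12 = k_joint(i, j, t1, t2, alpha, beta)
    q1q2 = q_marginal(i, t1, alpha, beta) * q_marginal(j, t2, alpha, beta)
    return (1 - theta) * k12 + theta * q1q2


# Hypothetical parameter values, chosen only for illustration.
params = dict(t1=2.0, t2=3.0, alpha=0.8, beta=1.0)
total = sum(joint_probability(i, j, theta=0.3, **params)
            for i in range(200) for j in range(200))
print(total)
```

Working on the log scale with `math.lgamma` avoids overflow in the gamma functions when the counts are large.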

To estimate θI we need to know the number of changes at each site for each gene cluster (i.e., X1 and X2). Since X1 and X2 cannot be observed directly from the sequence data, a conventional solution is to use the minimum number of required changes (m) as an approximation, which can be inferred by parsimony under a known phylogenetic tree (Fitch, 1971). However, m is a biased estimate of the true number of changes because it does not consider the possibility of multiple hits. This problem has been solved by using a combination of ancestral sequence inference and maximum likelihood estimation (Gu and Zhang, 1997). Extensive computer simulation has shown that the estimate of the mean of the expected number of changes, as well as that of the variance, is asymptotically unbiased and robust to inaccuracy in ancestral amino acid inference.
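Once the per-site probabilities under S0 and S1 are in hand, the coefficient θI can be estimated by maximizing the likelihood of Eq. (6) with each site written as a two-state mixture. The sketch below uses generic expectation-maximization updates for a mixture weight; this is one standard approach and is not claimed to be the estimator used by Gu (1999, 2001). The per-site values are hypothetical, and the other model parameters are assumed fixed.

```python
import math


def log_likelihood(theta, f0, f1):
    """Log of Eq. (6) with each site term written as the two-state mixture."""
    return sum(math.log((1 - theta) * a + theta * b) for a, b in zip(f0, f1))


def estimate_theta(f0, f1, start=0.5, iterations=2000):
    """EM updates for the mixture weight theta; other parameters are held fixed."""
    theta = start
    for _ in range(iterations):
        posteriors = [theta * b / ((1 - theta) * a + theta * b)
                      for a, b in zip(f0, f1)]
        theta = sum(posteriors) / len(posteriors)
    return theta


# Hypothetical per-site probabilities under S0 (f0) and S1 (f1).
f0 = [0.30, 0.25, 0.02, 0.28, 0.01]
f1 = [0.10, 0.12, 0.20, 0.09, 0.15]
theta_hat = estimate_theta(f0, f1)
print(round(theta_hat, 3))
```

Because the log-likelihood is concave in θ, the updates converge to the global maximum in the interval from 0 to 1.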

4 PREDICTING CRITICAL RESIDUES FOR TYPE I FUNCTIONAL DIVERGENCE

It is of great interest to predict (statistically) which sites are likely to be responsible for functional differences. Indeed, these sites can be tested further by experimentation using molecular, biochemical, or transgenic approaches. We have developed site-specific profiles for this purpose, which can be obtained using posterior analysis.

4.1 Markov Chain Model

For the simple two-cluster case, there are only two states: S0 and S1. We wish to know the probability of S1 for a given site when the amino acid configuration (X, Y) is observed [i.e., P(S1|X, Y)]. The prior probability of S1 is P(S1) = θI. According to Bayes' rule, we have

$$P(S_1 \mid X, Y) = \frac{\theta_I\, f(X, Y \mid S_1)}{p(X, Y)} \tag{10}$$

where f(X, Y|S1) and p(X, Y) are given by Eqs. (4) and (5), respectively.

4.2 Poisson Model

In the case of strong statistical evidence supporting the functional divergence after gene duplication (i.e., θI > 0), it is of great interest to predict which sites are likely to be responsible for these (type I) functional differences. Indeed, these sites can be tested further using molecular, biochemical, or transgenic approaches. Remember that in the two-state model, each site has two possible states, S0 (functional constraint) and S1 (functional divergence), with the (prior) probabilities P(S1) = θI and P(S0) = 1 − θI, respectively. To provide a statistical basis for predicting which state is more likely at a given site, we need to compute the (posterior) probability of state S1 at this site with X1 (and X2) changes in cluster 1 (and 2), P(S1|X1, X2). Obviously, P(S0|X1, X2) = 1 − P(S1|X1, X2). According to Bayes' rule, one can show that

$$P(S_1 \mid X_1, X_2) = \frac{\theta_I\, Q_1 Q_2}{(1 - \theta_I)\, K_{12} + \theta_I\, Q_1 Q_2} \tag{11}$$
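For readers who want the intermediate step, Eq. (11) follows directly from Bayes' rule once the conditional probabilities of Eq. (8) and the priors P(S1) = θI and P(S0) = 1 − θI are substituted:

```latex
\begin{align*}
P(S_1 \mid X_1, X_2)
  &= \frac{P(S_1)\, P(X_1, X_2 \mid S_1)}
          {P(S_0)\, P(X_1, X_2 \mid S_0) + P(S_1)\, P(X_1, X_2 \mid S_1)} \\
  &= \frac{\theta_I\, Q_1 Q_2}{(1 - \theta_I)\, K_{12} + \theta_I\, Q_1 Q_2}.
\end{align*}
```

The denominator is the joint distribution of Eq. (9), so the posterior has the same form as Eq. (10) for the Markov chain model.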

We may use this formula to identify those amino acid sites that may be responsible for the functional divergence given a cutoff value. In practice, the choice of a cutoff value is somewhat arbitrary, from P(S1|X1, X2) > 0.5 (Rij > 1) to P(S1|X1, X2) > 0.95 (or Rij > 20). As will be seen below, it depends on how much information we can obtain.
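The effect of the cutoff can be illustrated with a short sketch. It assumes that Rij denotes the posterior odds P(S1|X1, X2)/P(S0|X1, X2), which the text does not state explicitly; under that reading a posterior of 0.5 corresponds to odds of 1 and a posterior of 0.95 to odds of 19, close to the value of 20 quoted above. The profile values and function names are hypothetical.

```python
def posterior_odds(posterior):
    """Posterior odds P(S1|X) / P(S0|X) for a posterior probability below 1."""
    return posterior / (1.0 - posterior)


def select_sites(profile, cutoff):
    """Return 1-based alignment positions whose posterior exceeds the cutoff."""
    return [pos for pos, p in enumerate(profile, start=1) if p > cutoff]


# Hypothetical site-specific profile of P(S1 | X1, X2).
profile = [0.12, 0.97, 0.55, 0.31, 0.99, 0.72]
lenient = select_sites(profile, 0.5)
strict = select_sites(profile, 0.95)
print(lenient, strict, posterior_odds(0.95))
```

A lenient cutoff returns more candidate sites at the cost of more false positives, so the stricter value is the safer choice when experimental follow-up is expensive.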

5 IMPLEMENTATION AND CASE STUDY These methods have been implemented in the software Diverge, which is available at www.xgu.gdcb.iastate.edu. Diverge is a GUI-based, user-friendly software package that provides an integrated analytical tool for functional prediction from protein sequence data and runs under both the Windows and Linux operating systems (Figure 3). Using Diverge, Wang and Gu (2001) analyzed the caspase gene family to explore the structural–functional basis for site-specific rate shifts (type I functional divergence) of protein sequences between major caspase subfamilies. The key component of the apoptotic (programmed cell death) machinery is a cascade of cysteine aspartyl proteases (caspases). To date, 14 members of the caspase gene family have been identified in mammals, which can be classified into two major subfamilies, CED-3 (including caspase-2, -3, -6, -7, -8, -9, -10, and -14) and ICE (including caspase-1, -4, -5, -11, -12, and -13). CED-3-type caspases are essential for most apoptotic pathways, whereas the major function of the ICE-type caspases is to mediate the immune response. Based on the inferred tree of caspases (Figure 4), Wang and Gu (2001) found that type I functional divergence is statistically significant between the two major subfamilies, CED-3 and ICE (θI = 0.29). The posterior profile (Figure 5) predicts crucial amino acid residues that are responsible for functional divergence between them. Of the 21 amino acid residues predicted (for type I functional divergence between CED-3 and ICE), 4 have been verified by experimental or structural evidence.

Figure 3 Interface of the software Diverge.

Figure 4 Phylogenetic tree of the caspase gene family. Clade labels: CED-3 subfamily (CASP-3, CASP-7, CASP-6, CASP-8, CASP-10, CASP-9, CASP-2, and CASP-14, with E-Casp and I-Casp groupings marked) and ICE subfamily (CASP-1, CASP-4, CASP-5, CASP-13, CASP-11, and CASP-12). Scale bar: 0.05.


Figure 5 (A) Site-specific profile for predicting critical amino acid residues responsible for functional divergence between the CED-3 and ICE subfamilies, measured by the posterior probability of being functional divergence-related at each site [P(S1|X)]. The arrows point to four amino acid residues at which functional divergence between the two subfamilies has been verified by experimentation. (B) Four predicted sites that have been verified by experimentation.

Panel (A) axes: P(S1|X), from 0 to 1 (vertical); alignment position, tick marks from 11 to 191 (horizontal).

Panel (B), by site:
Site 161. Sequence conservation: an invariant Trp (W) in CED-3; highly variable in ICE. Structural features: in CED-3, forms a narrow pocket with an extra loop and forms an H-bond network with a group of amino acids; in ICE, no extra loop, a shallow depression is found. Substrate specificity: hydrophilic side chains in CED-3; hydrophobic side chains in ICE.
Sites 86/88. Structural features: no surface loop in CED-3; lie in an extra surface loop in ICE.
Site 131. Sequence conservation: highly variable in CED-3; highly conserved in ICE. Structural features: not a cleavage site in CED-3; cleavage site for proenzyme processing in ICE.

REFERENCES
Casari G, Sander C, Valencia A. 1995. A method to predict functional residues in proteins. Struct Biol 2:171–178.
Dayhoff MO, Schwartz RM, Orcutt BC. 1978. A model of evolutionary change in proteins. In Dayhoff MO (ed.), Atlas of Protein Sequence Structure, Vol. 5, Suppl. 3. Washington, DC: National Biomedical Research Foundation, pp. 342–352.
Felsenstein J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368–376.
Fitch WM. 1971. Toward defining the course of evolution: minimum change for a specific tree topology. Syst Zool 20:406–416.
Gu X. 1999. Statistical methods for testing functional divergence after gene duplication. Mol Biol Evol 16:1664–1674.
Gu X. 2001. Maximum likelihood approach for gene family evolution under functional divergence. Mol Biol Evol 18:453–464.
Gu X, Vander Velden K. 2002. DIVERGE: Phylogeny-based analysis for functional–structural divergence of a protein family. Bioinformatics 18:500–501.
Gu X, Zhang J. 1997. A simple method for estimating the parameter of substitution rate variation among sites. Mol Biol Evol 14:1106–1113.
Gu X, Fu YX, Li WH. 1995. Maximum likelihood estimation of the heterogeneity of substitution rate among nucleotide sites. Mol Biol Evol 12:546–557.
Gu J, Wang Y, Gu X. 2002. Pattern of functional divergence in JAK tyrosine protein kinase family. J Mol Evol 54:725–733.
Holland PWH, Garcia-Fernández J, Williams NA, Sidow A. 1994. Gene duplication and the origins of vertebrate development. Development 1994 Suppl. pp. 125–133.
Kimura M. 1983. The Neutral Theory of Molecular Evolution. Cambridge, UK: Cambridge University Press.
Kishino H, Miyata T, Hasegawa M. 1990. Maximum likelihood inference of protein phylogeny and the origin of chloroplasts. J Mol Evol 31:151–160.
Landgraf R, Fischer D, Eisenberg D. 1999. Analysis of heregulin symmetry by weighted evolutionary tracing. Protein Eng 12:943–951.
Li WH. 1983. Evolution of duplicated genes. In Nei M, Koehn RK (eds.), Evolution of Genes and Proteins. Sunderland, MA: Sinauer Associates.
Lichtarge O, Bourne HR, Cohen FE. 1996. An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 257:342–358.
Livingstone CD, Barton G. 1996. Identification of functional residues and secondary structure from protein sequence alignment. Methods Enzymol 266:497–512.
Lundin LG. 1993. Evolution of the vertebrate genome as reflected in paralogous chromosomal regions in man and the house mouse. Genomics 16:1–19.
Mathews S. 2005. Analytical methods for studying the evolution of paralogs using duplicate gene datasets. Methods Enzymol 395:724–745.
Ohno S. 1970. Evolution by Gene Duplication. New York: Springer-Verlag.
Spring J. 1997. Vertebrate evolution by interspecific hybridisation: Are we polyploid? FEBS Lett 400:2–8.
Uzzel T, Corbin KW. 1971. Fitting discrete probability distribution to evolutionary events. Science 172:1089–1096.
Wang Y, Gu X. 2001. Functional divergence in the caspase gene family and altered functional constraints: statistical analysis and prediction. Genetics 158:1311–1320.
Wolfe KH, Shields DC. 1997. Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387:708–713.

9 Mapping Gene Gains and Losses Among Metazoan Full Genomes Using an Integrated Phylogenetic Framework

ATHANASIA C. TZIKA Laboratory of Natural and Artificial Evolution, Department of Genetics and Evolution, Sciences III, Geneva, Switzerland; Evolutionary Biology and Ecology, Université Libre de Bruxelles, Brussels, Belgium

RAPHAËL HELAERS Department of Biology, Facultés Universitaires Notre-Dame de la Paix, Namur, Belgium

MICHEL C. MILINKOVITCH Laboratory of Natural and Artificial Evolution, Department of Genetics and Evolution, Sciences III, Geneva, Switzerland

1 INTRODUCTION Although a rough increase in maximum phenotypic complexity across the entire range of evolutionary time is indisputable, this general trend is not distributed homogeneously throughout the tree of life. Multiple lineages, such as myzostomes, flatworms, and tunicates, even exhibit simplified body plans that are probably derived rather than ancestral. Conversely, many branches in the tree of life at diverse phylogenetic scales are characterized by an accelerated acquisition of new and complex physiological and morphological characters [e.g., Aburomia et al. (2003), but see Donoghue and Purnell (2005)], some of which had a major impact on the ability of these lineages to diversify and thrive. The temptation to correlate phenotypic complexity with genomic complexity is both obvious and unsubstantiated. Although the absolute amount of DNA in a haploid cell is poorly correlated with organismal complexity (Gregory, 2002), notable and gradual increases in gene number (resulting from the retention of duplicated genes) and more abrupt increases in the abundance of spliceosomal introns and mobile genetic elements through evolutionary time have been suggested (Lynch and Conery, 2003). More generally, it is possible that the emergence of new genes [through one or a combination of processes involving exon shuffling, gene duplication, mobile elements, lateral


gene transfer, gene fusion/fission, and de novo origination; see Long et al. (2003) for a review] is involved in the development of phenotypic novelties in organismal evolution. Furthermore, the single or multiple round(s) of full-genome duplication [see Van de Peer (2004) for a review] in several major lineages, such as yeast, vertebrates, and plants, might explain major evolutionary leaps and adaptive radiations [Ohno, 1970; Aburomia et al., 2003; Van de Peer, 2004; but see Donoghue and Purnell (2005)]. A recent major competing hypothesis is that phenotypic transitions are explained by shifts in the precise spatial and temporal expression patterns of genes [see, e.g., Carroll (2001, 2005) and Carroll et al. (2004)] rather than by changes in their protein-coding regions: Gains or losses of cis-regulatory noncoding modules (CRMs) would cause shifts in the regulation of discrete tissue-specific and developmental-stage-specific expression of genes while avoiding deleterious pleiotropic effects of protein sequence modification. The structural mutation (mutation within the coding region) and regulatory mutation (mutations outside the coding region) models (Hoekstra and Coyne, 2007) are, however, by no means incompatible. For example, compartmentation of specialized gene functions can be brought about by duplication of the protein-coding sequence with its regulatory module(s) followed by subfunctionalization (Lynch and Conery, 2000; Lynch and Force, 2000); that is, the two gene copies specialize to perform complementary functions, for example, through protein sequence changes and/or evolution of the respective sets of CRMs (Force et al., 1999; Greer et al., 2000). 
Note that the increased probability of survival of subfunctionalized duplicates (Lynch and Conery, 2000) provides an extended time period during which neofunctionalization [i.e., one copy acquiring a new function whereas the other retains the ancestral function (Ohno, 1970)] can occur through coding sequence modifications (He and Zhang, 2005; Rastogi and Liberles, 2005). Furthermore, several studies corroborate the importance of lineage-specific positive selection (hence, potential neofunctionalization) for the retention of duplicates (Hurles, 2004; Kondrashov and Kondrashov, 2006; Shiu et al., 2006). Hoekstra and Coyne (2007) recently provided an extensive and articulated discussion suggesting that embracing the recent cis-regulatory paradigm of adaptive evolution as the single dominating mechanism for explaining the emergence of adaptations is theoretically unsubstantiated and is not supported experimentally. Rather, they suggest that “changes in both the structure and regulation of genes have been important in adaptation, that their relative importance will not be known for a considerable time, and that the role of structural mutations in morphological evolution—and other adaptive change—is unlikely to be trivial.” The increasing number of fully sequenced genomes and large-scale expression studies, accompanied by a constantly growing number of software and databases for better integration and exploitation of this wealth of data, should help investigate correlations between genome and phenotype evolution. However, whole-genome comparisons among eukaryotic species have proven more problematic than among prokaryotes, not only due to extensive gene duplication events and the multidomain structure of most proteins, but also because of the low-coverage sequencing of several genomes (Milinkovitch et al., 2010a). Furthermore, the broad field of comparative genomics currently suffers from two major biases. 
First, a striking taxonomic bias in the choice of model species and genome sequencing projects is noteworthy (Milinkovitch and Tzika, 2007); for example, only 3% of full-genome sequencing projects use the localization of the corresponding species in the tree of life as a primary


motivation (Liolios et al., 2006). As a result, a database such as Ensembl (Hubbard et al., 2007), which generates and maintains automatic annotation of selected eukaryotic genomes (www.ensembl.org), includes 21 mammalian and five teleost fish genomes, but only one bird and no reptile (v45). Current proposals for full-genome sequencing (www.genome.gov/10002154) correct the problem only very partially. Second, many of the methods and databases available for identifying duplication events and assessing orthology relationships of genetic elements among genomes avoid the heavy computational cost of phylogenetic trees inference and the difficulties associated with their interpretation, even though phylogeny-based orthology/paralogy identification is widely accepted as the most valid approach (Li et al., 2003; Alexeyenko et al., 2006). Recently, the problem has, however, been largely recognized and partially addressed by the comparative genomics community. For example, Ensembl (Hubbard et al., 2007) and the Human Phylome (Huerta-Cepas et al., 2007) are automated pipelines in which orthologs and paralogs are identified through the estimation of gene family phylogenetic trees. Furthermore, the recently developed MANTiS relational database (www.mantisdb.org) (Tzika et al., 2008) integrates phylogeny-based orthology/paralogy assignments with functional and expression data, allowing users to explore phylogeny-driven (focusing on any set of branches), gene-driven (focusing on any set of genes), function/process-driven, and expression-driven questions (Milinkovitch et al., 2010b). 
Application systems that integrate into an explicit evolutionary framework the mapping of gene gains and losses with functional and expression data should help in investigating whether the gene duplication phenomenon is generally relevant to adaptive evolution (i.e., beyond the well-known examples of diversification in globins, olfactory receptors, opsins, and transcription factors) and might even provide a means of investigating the causal relationship between genome evolution and an increase in phenotypic complexity. Furthermore, even if adaptations involve gene duplication, structural mutations, and regulatory mutations at drastically different relative frequencies (in the phylogenetic tree as a whole), many evolutionists will still be interested in identifying the genetic basis of adaptive traits at specific lineages of interest. Here we compare the efficiency of MANTiS against those of InParanoid (O’Brien et al., 2005), MultiParanoid (Alexeyenko et al., 2006), OrthoMCL (Li et al., 2003), and RoundUp (DeLuca et al., 2006) for the localization of gene gains and losses and duplication events within the metazoan phylogeny. First, InParanoid is a program that identifies putative ortholog clusters seeded by a reciprocally best-matching ortholog pair, around which in-paralogs are gathered and out-paralogs are excluded on the basis of their similar pairwise scores resulting from NCBI BLAST (Remm et al., 2001; O’Brien et al., 2005). InParanoid clusters generated from different pairs of genomes can then be merged using MultiParanoid. InParanoid was one of the first programs to refine best reciprocal hits for ortholog clustering. 
Second, OrthoMCL is an algorithm that groups putative ortholog protein sequences by (1) distinguishing between putative in-paralog and ortholog pairs through comparisons of reciprocal best hits within and between genomes, (2) correcting for differences in evolutionary distances between pairs of sequences, and (3) using the Markov clustering algorithm to split megaclusters. OrthoMCL was the first database to allow detection of genes present in a set of genomes and absent from another set. Third, the RoundUp database detects putative orthologs using the reciprocal smallest distance algorithm (RSD) based on global sequence alignment and maximum likelihood estimation of evolutionary distances. RoundUp incorporates the greatest number of sequenced genomes. Note that contrary


to the three resources mentioned above, MANTiS (1) incorporates the mapping of gains and losses of genes as well as of duplication events into an explicit phylogenetic framework, and (2) allows the user to perform elaborate queries combining parameters pertaining to gene identity, phylogeny, function, and expression.
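The reciprocal-best-hit idea that seeds InParanoid clusters, and that the other BLAST-based resources refine, can be sketched in a few lines of Python. This is a toy illustration and not the InParanoid implementation: the score dictionaries, the gene identifiers, and the function names are hypothetical, and ties are resolved by keeping the first hit encountered.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores maps (query, subject) pairs to a similarity score,
    for example a BLAST bit score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


if __name__ == "__main__":
    # Hypothetical scores between genes of species A (hA*) and B (mB*).
    a_vs_b = {("hA1", "mB1"): 900, ("hA1", "mB2"): 400,
              ("hA2", "mB2"): 700, ("hA3", "mB2"): 650}
    b_vs_a = {("mB1", "hA1"): 880, ("mB2", "hA2"): 690, ("mB2", "hA3"): 640}
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In the toy data, hA3 finds mB2 as its best hit but is not mB2's best hit, so it is left unpaired. A pairwise criterion of this kind cannot by itself say on which branch a gene was gained or lost, which is the motivation for the tree-based mapping discussed in this chapter.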

2 DATA MINING

Data mining and the construction of the relational database were accomplished using the MANTiS (v1.0.15) pipeline (Tzika et al., 2008), available at www.mantisdb.org. MANTiS performs automated downloads from Ensembl (www.ensembl.org), extracts information relevant to protein family trees from the Compara database (Vilella et al., 2009), and defines characters for the generation of a full data set that includes orthologous gene presence/absence information for all species selected. Note that orthology is not assigned on the basis of simple best-reciprocal BLAST hits (BRH). Indeed, the existence of a BRH does not guarantee that orthology is inferred correctly in all cases (Theissen, 2002), because it ignores gene loss and differential rates of evolution. Here, orthology/paralogy are assigned in Ensembl (Hubbard et al., 2007, 2008) through the use of a pipeline (www.ensembl.org/info/data/compara/homology_method.html) that includes (1) the identification of gene families (using gene-relation graphs based on BRH), (2) tree inference after multiple protein sequence alignment within each gene family, and (3) identification of duplication and speciation events through gene tree vs. species tree reconciliation. MANTiS builds two data sets: the “with duplications” data set, combining all characters (de novo gains and duplication events), and the “families only” data set, which excludes characters corresponding to duplication events (i.e., we merge the characters within each protein tree). After each duplication event, the “ancestral” (vs. “derived”) character is identified as the child subtree with the smallest mean distance between the duplication node and all leaf nodes (Tzika et al., 2008). 
As functional and expression data are associated with a single specific Ensembl gene but a MANTiS character can correspond to a set of several Ensembl orthologous genes, all the relevant orthologs are assigned to a single MANTiS character corresponding to an Ensembl gene (and associated functional data) from the species with the largest amount of functional and expression data available (called priority species). All nonpriority species genes associated with a given character are considered as synonyms of the corresponding MANTiS character except when functional information is available via the Panther database (e.g., for Mus musculus, Rattus norvegicus, and Drosophila melanogaster genes). See an article by Tzika et al. (2008) for details on the character assignment method. Orthology assignment problems are expected to decrease as genome assembly and annotation improve. Note, however, that the annotation quality of a given genome does not depend solely on genome sequence coverage but also on its phylogenetic proximity with model species for which experimental data assisting in genome annotation (e.g., EST and SAGE data) are available. For example, the high-quality annotation of the human genome is more easily exploited for annotation of the macaque genome than for annotation of the opossum genome.
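The rule used above to label the two copies after a duplication (the “ancestral” character is the child subtree with the smallest mean distance between the duplication node and its leaves) can be sketched as follows. The tuple-based tree encoding, the function names, and the tie-breaking choice are assumptions made for illustration; they are not the MANTiS data structures.

```python
from statistics import mean


def leaf_distances(node, depth=0.0):
    """Distances from the duplication node to every leaf below `node`.

    A node is a tuple (name, branch_length, children); a leaf has an
    empty children list. `depth` is the path length accumulated so far.
    """
    _name, length, children = node
    here = depth + length
    if not children:
        return [here]
    distances = []
    for child in children:
        distances.extend(leaf_distances(child, here))
    return distances


def label_duplicates(child_a, child_b):
    """Label the two child subtrees of a duplication node.

    The subtree with the smaller mean leaf distance is called
    "ancestral"; on a tie the first subtree is chosen (an assumption).
    """
    mean_a = mean(leaf_distances(child_a))
    mean_b = mean(leaf_distances(child_b))
    if mean_a <= mean_b:
        return {child_a[0]: "ancestral", child_b[0]: "derived"}
    return {child_a[0]: "derived", child_b[0]: "ancestral"}


if __name__ == "__main__":
    # Hypothetical branch lengths, in substitutions per site.
    copy_a = ("copyA", 0.1, [("human_a", 0.1, []), ("mouse_a", 0.2, [])])
    copy_b = ("copyB", 0.4, [("human_b", 0.1, []), ("mouse_b", 0.1, [])])
    print(label_duplicates(copy_a, copy_b))
```

In this toy case the mean distances are about 0.25 for copyA and 0.5 for copyB, so copyA is labeled ancestral, reflecting the intuition that the slower-evolving copy is more likely to have kept the original role.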


3 CHARACTER MAPPING Gains and losses of orthologs are mapped by MANTiS v1.0.15 (www.mantisdb.org) on the “true” species tree [i.e., the topology best supported (Halanych, 2004; Springer et al., 2004; Bashir et al., 2005)]. MANTiS maps characters as follows: (1) the character presence/absence matrix for all species (built in the character-mining phase; see above) is used for computing a distance matrix following a modified Jukes–Cantor model; (2) the distance matrix is used to compute the branch lengths of the true species topology using the least-squares approach under minimum evolution; (3) the gain of a character is assigned to the corresponding internal or tip branch of the true species tree; and (4) a recursive maximum likelihood approach is used to identify, for each character, the exact most likely combination of branch(es) on which gene loss(es) is (are) assigned. Once gains and losses have been mapped, MANTiS builds the genome content of each internal node. See the article by Tzika et al. (2008) for details on the character mapping method and genome content view of MANTiS.
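Step (1) of the mapping procedure can be illustrated with a short Python sketch that turns a presence/absence matrix into pairwise distances. The exact form of the modified Jukes–Cantor model used by MANTiS is not given in this section, so the code substitutes the standard correction for a symmetric two-state model, d = −(1/2) ln(1 − 2p), where p is the proportion of characters that differ between two species. The species names and the toy matrix are hypothetical.

```python
import math
from itertools import combinations


def two_state_distance(row_a, row_b):
    """Corrected distance between two presence/absence (0/1) profiles.

    Uses the symmetric two-state correction d = -0.5 * ln(1 - 2p) as a
    stand-in for the modified Jukes-Cantor model named in the text.
    """
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("profiles must be non-empty and of equal length")
    p = sum(a != b for a, b in zip(row_a, row_b)) / len(row_a)
    if p == 0.0:
        return 0.0
    if p >= 0.5:
        return math.inf  # saturation: the correction is undefined
    return -0.5 * math.log(1.0 - 2.0 * p)


def distance_matrix(profiles):
    """Pairwise distances for a dict mapping species to 0/1 profiles."""
    return {(a, b): two_state_distance(profiles[a], profiles[b])
            for a, b in combinations(sorted(profiles), 2)}


if __name__ == "__main__":
    # Hypothetical matrix: rows are species, columns are characters.
    profiles = {
        "human": [1, 1, 1, 0, 1, 0, 1, 1],
        "mouse": [1, 1, 1, 0, 1, 0, 1, 0],
        "chicken": [1, 0, 1, 1, 1, 0, 0, 0],
    }
    for pair, dist in distance_matrix(profiles).items():
        print(pair, round(dist, 3))
```

A matrix of this kind is what a least-squares procedure would then use to fit branch lengths on the fixed species topology, as in step (2).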

4 COMPARISONS WITH OTHER DATABASES FOR THE LOCALIZATION OF GAINS AND LOSSES We compared the character mapping generated by MANTiS against similar information extracted from InParanoid (O’Brien et al., 2005), MultiParanoid (Alexeyenko et al., 2006), OrthoMCL (Li et al., 2003), and RoundUp (Deluca et al., 2006). We focused on genes present in human, mouse, and rat because these are the only three mammalian species present in all the databases mentioned above, and their genomes are well annotated. Mapping of specific genes was retrieved using the queries system available within MANTiS. Indeed, MANTiS allows building elaborate queries [performed on one or several “statement(s)” executed following priorities and logical operators in a user-friendly interface] concerning gene identity, mapping, and function parameters (biological processes, molecular functions, and gene expression) [see the articles by Tzika et al. (2008) and Milinkovitch et al. (2010b) for details]. The data sets originating from other databases were extracted as follows. First, MultiParanoid was used to merge the SQL tables of orthologs generated by InParanoid (v5.0) for the pairwise species comparisons Homo–Mus, Homo–Rattus, and Mus–Rattus. We retained only the clusters with a confidence value of 1 within, and no discrepancy among, the three comparisons of species pairs. All proteins of the three species were converted to MANTiS characters both for the “with duplications” and “families only” data sets. Second, using the “phyletic pattern form” view of OrthoMCL (v1), we extracted ortholog groups present in H. sapiens, M. musculus, and R. norvegicus and absent from all other species in the database. All cluster representatives were converted to MANTiS characters for the “with duplications” and “families only” data sets. Third, transitively closed phylogenetic profiles were retrieved from the RoundUp (July 2007) orthology database for H. sapiens, M. musculus, and R. 
norvegicus, using their most stringent conditions (BLAST e-value < 1 × 10⁻⁵ and divergence threshold = 0.2). The results were exported and GI accession numbers were converted to Entrez IDs using the Gene ID Conversion tool in the DAVID database (Dennis et al., 2003). Entrez IDs


were then converted to MANTiS characters. Finally, the manually curated data set of Alexeyenko et al. (2006) was retrieved from http://multiparanoid.cgb.ki.se/stats.html and the Ensembl gene IDs were obtained using NCBI, SwissProt, and Ensembl. The three data sets (i.e., human, mouse, and rat genes identified as sets of orthologs by OrthoMCL, RoundUp, or InParanoid/MultiParanoid) were compared to the genes identified by MANTiS as appearing in any branch between the most recent common ancestor of mammals and the most recent common ancestor of human, mouse, and rat (and not lost in any of these three species). Given that RoundUp and InParanoid do not allow us to exclude the sets of orthologs also present in nonmammalian species, the number of clusters generated by these two databases greatly exceeds those inferred by OrthoMCL and MANTiS. Among these false positives, we found that about half (stacks with light gray or large dots in Figure 1A) correspond to genes gained in lineages older than the mammalian common ancestor, whereas the other half (stack with horizontal lines in Figure 1A) are genes gained within mammals but lost in one or two of the three species of interest. The remaining few false positives correspond to genes specifically gained in the human or the Glires (mouse + rat) lineage (black and dark gray stacks, respectively, in Figure 1A). Although OrthoMCL exhibits a very small number of false positives, it also exhibits the largest number of false negatives (dashed horizontal lines, Figure 1A): It fails to find 79% of the valid (human + mouse + rat)-specific genes. On the other hand, although InParanoid exhibits the largest number of false positives, it also shows the smallest number (19%) of false negatives, whereas RoundUp yields intermediate values. Figure 1B shows the results of similar analyses when duplications are disregarded and only de novo characters (origin of new gene families) are considered. 
In this case, the number of false positives remains highest for InParanoid/MultiParanoid, smallest for OrthoMCL, and intermediate for RoundUp. In all cases these false positives are mostly old gene families gained prior to the origin of the mammalian ancestor (light gray stack), while mammalian families lost in one of the three focal species (stack with horizontal lines) become minor (Figure 1B). One could argue that the many characters we qualified as false positives in InParanoid/MultiParanoid, and RoundUp analyses could be false negatives in MANTiS. This hypothesis is very unlikely because the false positive nature of many RoundUp and InParanoid characters was confirmed by manual inspection (results not shown) as well as by their absence in the set of genes assigned as (human + mouse + rat)-specific by OrthoMCL. Similarly, one could argue that the many characters we qualified as false negatives in OrthoMCL and RoundUp analyses could be false positives in MANTiS, but this hypothesis is contradicted by the InParanoid analyses. These results demonstrate the interest of combining orthology/paralogy inference based on phylogenetic trees and performing gene mapping in a phylogenetic framework.
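The false positive and false negative counts discussed above amount to set comparisons between the characters reported by one resource and a reference set (here, the MANTiS mapping). A minimal Python sketch, using hypothetical gene identifiers and function names, is:

```python
def compare_to_reference(candidate, reference):
    """Compare a candidate set of gains with a reference set.

    Returns the false positives (reported but not in the reference),
    the false negatives (in the reference but not reported), and the
    false negative rate relative to the size of the reference set.
    """
    candidate, reference = set(candidate), set(reference)
    false_positives = candidate - reference
    false_negatives = reference - candidate
    rate = len(false_negatives) / len(reference) if reference else 0.0
    return {
        "shared": candidate & reference,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "false_negative_rate": rate,
    }


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    reference = {"g1", "g2", "g3", "g4", "g5"}
    strict_tool = {"g1", "g6"}
    result = compare_to_reference(strict_tool, reference)
    print(sorted(result["false_positives"]), result["false_negative_rate"])
```

In this toy case the strict tool reports one false positive (g6) and misses four of the five reference genes, a false negative rate of 0.8, which mirrors the pattern described for OrthoMCL: few false positives at the cost of many false negatives.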

5 REANALYSIS OF PREVIOUSLY INVESTIGATED GENE FAMILIES

Below, using published examples pertaining to (1) gains of chicken-specific genes, and (2) gains of genes involved in neural crest development, we demonstrate that a relational database such as MANTiS that integrates gene mapping and functional/expression data into a phylogenetic framework makes it possible, easily and accurately, to perform comparative genomic analyses that were very tedious in the past.


Figure 1 Comparison of the ortholog clustering and character mapping pipelines of OrthoMCL, RoundUp, InParanoid/MultiParanoid, and MANTiS for gains between the most recent common ancestor (MRCA) of mammals and the MRCA of mouse, rat, and human: (A) including duplications; (B) gene families only. The stacks with large dots or horizontal lines refer to genes lost in one or two of the three focal species. Arrows indicate genes gained before the MRCA of mammals. Horizontal dashed lines indicate the relative number of false negative gains. See the text for details. (Color representation of the figure may be viewed at www.lanevol.org/LANE/publications.html.)

Bars in both panels: OrthoMCL, RoundUp, InParanoid, MANTiS. Vertical axis ticks: 0 to 20000 in (A) and 0 to 9000 in (B). Legend: Gained before mammals; Gained before mammals & lost; Mammalian-specific but lost; Gained in Primates; Gained in Glires; Fits with MANTiS.

Using the query system of MANTiS, we verified the mapping of genes recognized by the International Chicken Genome Sequencing Consortium (ICGSC, 2004) as (1) strictly chicken specific, (2) present in Homo but absent from chicken, and (3) present in chicken but absent from Eutheria. All our analyses confirm the results reported by ICGSC (2004) and indicate that (1) feather, scale, and claw keratin genes along with ovocleidin-116 and avidins were gained in the Gallus lineage, (2) κ-casein, enamelin, statherin, and histatin do not have homologs in the chicken genome, and (3) chicken


vitellogenin and indigoidine synthase A protein have no counterparts in any mammalian genome (whereas CPD-photolyase is found in the marsupial Monodelphis domestica). To determine the origin of neural crest specification genes, Martinez-Morales et al. (2007) phylogenetically classified 646 neural crest genes (compiled using Mammalian Phenotype Ontology terms) into seven temporal emergence categories (Prokaryota, Eukaryota, Metazoa, Deuterostomia, Chordata, Vertebrata, and Mammalia). Their results suggest that the evolution of the neural crest is associated with the recruitment of ancestral regulatory genes and with the emergence of signaling peptides. To test Martinez-Morales et al.’s results, we input the list of 640 neural crest genes (six of the 646 genes used originally by the authors are no longer in the Ensembl database) into MANTiS to map their origin in the phylogeny of eukaryotes. MANTiS confirms Martinez-Morales et al.’s mapping for 37.0% (237/640) of the genes, most of which appeared before or at the metazoan branch (Figure 2). On the other hand, MANTiS infers a shallower (i.e., more recent) origin for 60.6% of the genes, whereas it infers a deeper (i.e., more ancient) origin for only 2.3% of the genes (see the legend of Figure 2 for details). A

Figure 2 Distribution of temporal emergence of 646 neural crest specification genes according to MANTiS (wide black/gray/white columns) is strongly shifted in comparison to the mapping performed by Martinez-Morales et al. (2007) (narrow patterned columns). For genes shifted to younger branches by MANTiS (gray stacks), the node of origin cited by Martinez-Morales et al. (2007) is indicated. For example, 169 of the 190 genes gained at the vertebrate branch have been mapped to an older node cited by Martinez-Morales et al. (2007), and among these 169 genes, 64%, 31%, and 5% had been mapped on the prokaryote/eukaryote branch (large dots), the metazoan branch (horizontal lines), and the deuterostome/chordate branch (small dots), respectively. MANTiS confirms Martinez-Morales et al.’s mapping for 37.0% of the genes. (Color representation of the figure may be viewed at www.lanevol.org/LANE/publications.html.)

Horizontal axis categories: Prokaryotes/Eukaryotes (P/E); Metazoa (Met); Deuterostomes/Chordates (D/C); Vertebrates (Vert); Tetrapodes/Mammals. Vertical axis ticks: 0 to 400. Legend: Martinez-Morales (M-M) mapping; MANTiS mapping (same level as in M-M; from younger node in M-M; from older node in M-M).


closer look at these differences of first appearance indicates that many of the prokaryota/eukaryota novelties cited by Martinez-Morales et al. (2007) might be false positives, probably due to the use of simple BLAST searches with a relaxed threshold (versus the more elaborate tree-based orthology/paralogy assignment of Ensembl/MANTiS). For example, the enamelin and dentin sialophosphoprotein genes (ENSMUSG00000029286 and ENSMUSG00000053268) identified by MANTiS as arising in Theria and Eutheria, respectively, are assigned a prokaryotic origin by Martinez-Morales et al. (2007) because these two genes generate low-stringency BLAST hits in bacterial genomes. Finally, we compared results obtained by MANTiS to those provided in the manually curated “clusters of orthologs” from Homo, Drosophila, and Caenorhabditis available at http://multiparanoid.cgb.ki.se/stats.html. Conversion to Ensembl gene IDs was possible for 200 of the 221 clusters. Using the “families only” data set, MANTiS fully validated 67.5% (135/200) of these clusters. The remaining manually curated clusters were recognized by MANTiS as agglomerates of different gene families (i.e., they contain genes that are not considered homologous). Note that Alexeyenko et al. (2006) had already recognized that a similar proportion of manually curated clusters were not recognized as orthologous groups by InParanoid/MultiParanoid either. Although it is difficult to determine whether the homology criteria used in the manually curated database and by Martinez-Morales et al. (2007) are too permissive or that of the Ensembl/MANTiS database is too strict when grouping gene family members, the latter approach might be preferable because it is conservative.
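The comparison of emergence categories reported above for the neural crest genes (same level, shallower, or deeper origin) reduces to comparing ranks along an ordered list of categories. The Python sketch below uses the seven categories of Martinez-Morales et al. (2007) as listed in the text; treating both mappings as using exactly these categories is a simplification, and the gene names and function names are hypothetical.

```python
from collections import Counter

# Ordered from the most ancient to the most recent category.
LEVELS = ["Prokaryota", "Eukaryota", "Metazoa", "Deuterostomia",
          "Chordata", "Vertebrata", "Mammalia"]
RANK = {name: index for index, name in enumerate(LEVELS)}


def compare_origin(published_level, remapped_level):
    """Classify a remapped origin relative to the published one."""
    published, remapped = RANK[published_level], RANK[remapped_level]
    if remapped == published:
        return "same"
    return "shallower" if remapped > published else "deeper"


def summarize(published, remapped):
    """Percentage of genes in each class, rounded to one decimal."""
    shared = sorted(set(published) & set(remapped))
    counts = Counter(compare_origin(published[g], remapped[g]) for g in shared)
    return {label: round(100.0 * n / len(shared), 1) for label, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical genes and assignments, for illustration only.
    published = {"geneA": "Prokaryota", "geneB": "Metazoa",
                 "geneC": "Vertebrata", "geneD": "Eukaryota"}
    remapped = {"geneA": "Mammalia", "geneB": "Metazoa",
                "geneC": "Chordata", "geneD": "Vertebrata"}
    print(summarize(published, remapped))
```

Applied to the counts given in the text, the same arithmetic yields 237/640 ≈ 37.0% for the genes whose mapping is confirmed.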

6 CONCLUSIONS

Morphological novelties abound in the history of animal evolution, but increases in complexity and acquisition of novelties are not distributed homogeneously across the phylogenetic tree of life. Although morphological evolution might have been partly driven by the evolution of cis-regulatory modules (Carroll et al., 2004; Carroll, 2005), there is little doubt that both gene duplications and adaptive structural mutations in protein-coding genes have also contributed substantially to the evolution of forms and physiologies [see, e.g., references in works of Li (1997) and Hoekstra and Coyne (2007)]. Hence, we think that one of the biggest challenges of comparative genomics lies in the identification, and mapping on a robust phylogeny, of changes in genome content that had significant functional implications. Such an endeavor may become possible by the integration of genome content and functional data into an explicit phylogenetic framework (Tzika et al., 2008) and should complement (1) analyses of evolutionary conservation [e.g., the characterization of ultraconserved nongenic sequences (Dermitzakis et al., 2003; Bejerano et al., 2004)] and (2) identification of protein-coding genes experiencing accelerated sequence evolution (e.g., Clark et al., 2003). The systematic phylogenetic mapping of gene gains and losses and associated functional data should also prove complementary to the screening of gene expression in target structures at specific stages of their development. Indeed, the latter approach requires prior identification of structures and genes of interest such that it has so far remained mostly restricted to morphological (vs. physiological, metabolic, etc.) characters and to genes likely involved in the development of these structures. Furthermore, these methods of observing spatiotemporal patterns of gene expression do not prove a causal relationship between gene expression and phenotype (Hoekstra and Coyne, 2007). The comparative genomic approach, on the other hand, will require highly accurate genome sequence information and their exhaustive annotation. Still, despite imperfect annotation of the genomes available, our analyses demonstrate the possibility to broaden the “evo-devo” perspective efficiently to comparative genomics [see also Milinkovitch et al. (2010b)]. Although we mostly focused here on portions of the metazoan phylogeny, the availability of additional genomes in the future (and improvement of their annotation) will allow expansion and fine-tuning of such analyses. Indeed, the availability of additional genomes from a wider taxonomic breadth will greatly improve the mapping of gains and losses of genetic elements, although the low coverage of some genomes might prove to generate more problems than solutions (Milinkovitch et al., 2010b). Further analyses, such as (1) the identification of convergent events, (2) the characterization of possible nonphylogenetic covariation among characters (e.g., whether a loss i is more likely to be convergent if a loss j has occurred as well), and (3) systematic functional analyses of branch-specific changes, should provide an improved understanding of the importance of gene gains and losses in the evolution of organisms.

Acknowledgments

This work was supported by grants from the University of Geneva (Switzerland), the Swiss National Science Foundation (FNSNF, grant 31003A_125060), the Société Académique de Genève (Switzerland), the Georges and Antoine Claraz Foundation (Switzerland), the Ernst and Lucie Schmidheiny Foundation (Switzerland), and the National Fund for Scientific Research Belgium (FNRS). A.T. is a postdoctoral fellow at the FNRS.

REFERENCES Aburomia R, Khaner O, Sidow A. 2003. Functional evolution in the ancestral lineage of vertebrates or when genomic complexity was wagging its morphological tail. J Struct Funct Genom 3:45–52. Alexeyenko A, Tamas I, Liu G, Sonnhammer EL. 2006. Automatic clustering of orthologs and inparalogs shared by multiple proteomes. Bioinformatics 22: e9–e15. Bashir A, Ye C, Price AL, Bafna V. 2005. Orthologous repeats and mammalian phylogenetic inference. Genome Res 15:998–1006. Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, Haussler D. 2004. Ultraconserved elements in the human genome. Science 304:1321–1325. Carroll SB. 2001. Chance and necessity: the evolution of morphological complexity and diversity. Nature 409:1102–1109. Carroll SB. 2005. Evolution at two levels: on genes and form. PLoS Biol 3:1159–1166. Carroll SB, Grenier JK, Weatherbee SD. 2004. From DNA to Diversity: Molecular Genetics and the Evolution of Animal Design. Hoboken, NJ: Wiley-Blackwell. Clark AG, Glanowski S, Nielsen R, Thomas PD, Kejariwal A, Todd MA, Tanenbaum DM, Civello D, Lu F, Murphy B, et al. 2003. Inferring nonneutral evolution from human–chimp–mouse orthologous gene trios. Science 302:1960–1963. DeLuca TF, Wu IH, Pu J, Monaghan T, Peshkin L, Singh S, Wall DP. 2006. Roundup: a multigenome repository of orthologs and evolutionary distances. Bioinformatics 22:2044–2046.


Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA. 2003. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 4:P3. Dermitzakis ET, Reymond A, Scamuffa N, Ucla C, Kirkness E, Rossier C, Antonarakis SE. 2003. Evolutionary discrimination of mammalian conserved non-genic sequences (CNGs). Science 302:1033–1035. Donoghue PC, Purnell MA. 2005. Genome duplication, extinction and vertebrate evolution. Trends Ecol Evol 20:312–319. Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J. 1999. Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151:1531–1545. Greer JM, Puetz J, Thomas KR, Capecchi MR. 2000. Maintenance of functional equivalence during paralogous Hox gene evolution. Nature 403:661–665. Gregory TR. 2002. Genome size and developmental complexity. Genetica 115:131–146. Halanych KM. 2004. The new view of animal phylogeny. Annu Rev Ecol Evol Syst 35:229–256. He X, Zhang J. 2005. Rapid subfunctionalization accompanied by prolonged and substantial neofunctionalization in duplicate gene evolution. Genetics 169:1157–1164. Hoekstra HE, Coyne JA. 2007. The locus of evolution: evo devo and the genetics of adaptation. Evol Int J Org Evol 61:995–1016. Hubbard TJ, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, et al. 2007. Ensembl 2007. Nucleic Acids Res 35:D610–D617. Hubbard TJ, Aken BL, Ayling S, Ballester B, Beal K, Bragin E, Brent S, Chen Y, Clapham P, Clarke L, et al. 2008. Ensembl 2009. Nucleic Acids Res 37:D690–697. Huerta-Cepas J, Dopazo H, Dopazo J, Gabaldon T. 2007. The human phylome. Genome Biol 8:R109. Hurles M. 2004. Gene duplication: the genomic trade in spare parts. PLoS Biol 2:e206. ICGSC. 2004. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432:695–716. Kondrashov FA, Kondrashov AS. 2006. Role of selection in fixation of gene duplications. 
J Theor Biol 239:141–151. Li W-H. 1997. Molecular Evolution. Sunderland, MA: Sinauer Associates. Li L, Stoeckert CJ Jr, Roos DS. 2003. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13:2178–2189. Liolios K, Tavernarakis N, Hugenholtz P, Kyrpides NC. 2006. The Genomes On Line Database (GOLD) v.2: a monitor of genome projects worldwide. Nucleic Acids Res 34:D332–D334. Long M, Betran E, Thornton K, Wang W. 2003. The origin of new genes: glimpses from the young and old. Nat Rev Genet 4:865. Lynch M, Conery JS. 2000. The evolutionary fate and consequences of duplicate genes. Science 290:1151–1155. Lynch M, Conery JS. 2003. The origins of genome complexity. Science 302:1401–1404. Lynch M, Force A. 2000. The probability of duplicate gene preservation by subfunctionalization. Genetics 154:459–473. Martinez-Morales JR, Henrich T, Ramialison M, Wittbrodt J. 2007. New genes in the evolution of the neural crest differentiation program. Genome Biol 8:R36. Milinkovitch MC, Tzika A. 2007. Escaping the mouse trap: the selection of new evo-devo model species. J Exp Zool B 308B:337–346. Milinkovitch MC, Helaers R, Depiereux E, Tzika AC, Gabaldon T. 2010a. 2× genomes - depth does matter. Genome Biol 11 (2):R16.


Milinkovitch MC, Helaers R, Tzika AC. 2010b. Historical constraints on vertebrate genome evolution. Genome Biol Evol 2010:13–18. O’Brien KP, Remm M, Sonnhammer EL. 2005. Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res 33:D476–D480. Ohno S. 1970. Evolution by Gene Duplication. New York: Springer-Verlag. Rastogi S, Liberles DA. 2005. Subfunctionalization of duplicated genes as a transition state to neofunctionalization. BMC Evol Biol 5:28. Remm M, Storm CE, Sonnhammer EL. 2001. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol 314:1041–1052. Shiu SH, Byrnes JK, Pan R, Zhang P, Li WH. 2006. Role of positive selection in the retention of duplicate genes in mammalian genomes. Proc Natl Acad Sci USA 103:2232–2236. Springer MS, Stanhope MJ, Madsen O, de Jong WW. 2004. Molecules consolidate the placental mammal tree. Trends Ecol Evol 19:430–438. Theissen G. 2002. Secret life of genes. Nature 415:741–741. Tzika A, Helaers R, Van de Peer Y, Milinkovitch MC. 2008. MANTiS: a phylogenetic framework for multi-species genome comparisons. Bioinformatics 24:151–157. Van de Peer Y. 2004. Computational approaches to unveiling ancient genome duplications. Nat Rev Genet 5:752–763. Vilella AJ, Severin J, Ureta-Vidal A, Durbin R, Heng L, Birney E. 2009. EnsemblCompara GeneTrees: analysis of complete, duplication aware phylogenetic trees in vertebrates. Genome Res 19:327–335.

10

Reconciling Phylogenetic Trees OLIVER EULENSTEIN Department of Computer Science, Iowa State University, Ames, Iowa

SNEHALATA HUZURBAZAR Department of Statistics, University of Wyoming, Laramie, Wyoming

DAVID A. LIBERLES Department of Molecular Biology, University of Wyoming, Laramie, Wyoming

1 INTRODUCTION

Assembling the tree of life is one of the grand challenges in computational biology. The vast majority of phylogenetic analyses for the tree of life combine genomic sequences from presumably orthologous loci (loci whose homology is the result of speciation) into gene trees. These analyses largely have to neglect vast amounts of indispensable genetic sequence information in which complex evolutionary events such as gene duplication and loss, lateral gene transfer, or deep coalescence (lineage sorting) yield gene trees that are inconsistent with the actual species tree (Wolf et al., 2002). For example, gene duplication and loss are known to have played a pivotal role in the evolution of nearly all life on Earth. Analyses of genomic data from numerous plants, such as grasses (Vandepoele et al., 2003; Guyot and Keller, 2004; Paterson et al., 2004; Schlueter et al., 2004; Wang et al., 2005; Yu et al., 2005), Arabidopsis or other Brassicaceae (Vision et al., 2000; Simillion et al., 2002; Blanc et al., 2003; Bowers et al., 2003; Schlueter et al., 2004; Cannon et al., 2006; Schranz and Mitchell-Olds, 2006), poplar (Sterck et al., 2005), cotton (Blanc and Wolfe, 2004; Rong et al., 2004), and Physcomitrella (Rensing et al., 2007), among others, have revealed evidence of ancient gene duplications and losses. However, in many cases such phylogenetic information can be utilized by gene tree reconciliation (Goodman et al., 1979; Page, 1994; Mirkin et al., 1995; Eulenstein, 1998; Bonizzoni et al., 2005), which is an approach to resolving topological inconsistencies between a gene tree and a trusted species tree by invoking gene duplications and losses. The resulting gene duplications and losses provide a rich source of information on which a wide variety of evolutionary applications is based.
These applications include the ortholog/paralog annotation of genes, rooting and adjusting gene trees (Chen et al., 2000; Górecki and Tiuryn, 2004; Berglund-Sonnhammer et al., 2006), resolving apparent or soft polytomies in gene and species trees (Berglund-Sonnhammer et al., 2006; Chang and Eulenstein, 2006; Vernot et al., 2008), inferring gene trees (Arvestad et al., 2003, 2004; Behzadi and Vingron, 2006a; Durand et al., 2006), locating episodes of gene duplications in phylogenetic history (Guigó et al., 1996; Fellows et al., 1998; Page and Cotton, 2002; Bansal and Eulenstein, 2008b; Burleigh et al., 2008, 2009), reconstructing domain decompositions (Behzadi and Vingron, 2006b), inferring species supertrees based on gene duplications (Page, 1998; Ma et al., 2000; Bansal et al., 2007; Bansal and Eulenstein, 2008a, 2008c), gene duplications and losses (Guigó et al., 1996; Page, 1998; Ma et al., 2000; Fellows et al., 2003), or deep coalescence (Zhang, 2000).

Gene tree reconciliation has mostly been based on the principle of parsimony, which is the focus of this chapter. This principle involves the identification of a reconciliation that minimizes the number of evolutionary events. The events to be minimized include, in various implementations, duplication events, the sum of duplication and loss events, and duplication events followed by a minimization of loss events in those resolutions that are equally parsimonious with regard to the number of duplication events. The advantage of including loss events is that they are usually almost as numerous as duplication events (given that loss is a common fate for duplicate genes) and can increase the signal. However, loss events can be problematic, in that it is impossible to differentiate between gene loss and missing data. The future development of methods that add pseudo-counts to loss data based on the expectation of missing data given the ratio of genes in a data set to genome size may improve the use of loss data in gene tree reconciliation.

The earliest algorithms for gene tree reconciliation required fully binary gene and species trees and have polynomial running times (Zhang, 1997; Eulenstein, 1998; Ma et al., 2000; Zmasek and Eddy, 2001).
However, due to the uncertainty in species relationships, to lineage sorting, and to other biological factors, species and/or gene trees may not always be best represented by binary relationships at nodes. Polynomial time algorithms (Chen et al., 2000; Chang and Eulenstein, 2006; Vernot et al., 2008) and heuristics have subsequently been developed and released that can accommodate nonbinary species as well as gene trees in the reconciliation process (Berglund-Sonnhammer et al., 2006). However, as has been learned in the field of phylogenetic reconstruction, parsimony suffers from several known flaws. Parsimony treats all branches equally, whereas the underlying process generating duplication and loss events is time dependent. Therefore, as branch length increases, the probability of multiple events per branch increases and thereby the probability that parsimony will undercount the number of events that took place. Similarly, the most parsimonious reconciliation may place an event on a short branch, whereas it is more likely to have occurred with a less parsimonious solution involving an event on an older, longer branch with a corresponding loss event on another downstream longer branch. For these reasons, there is an increasing interest in the development of model-based maximum likelihood and Bayesian methods. An additional question to consider as we turn our attention temporarily to likelihood methods is the distribution of likelihoods of different reconciliations that are more or less different from the most parsimonious reconciliation. It is not well understood what the landscape looks like (how rugged it is) and if the most parsimonious reconciliation is often close in solution space or identical to the most likely reconciliation(s) or if it is so distant as to be of little value. This characterization awaits exploration as the field moves forward.

GENE DUPLICATION MODEL


A major pitfall of model-based methods is that they typically require solving computationally difficult problems. To get around this, Hahn et al. (2005) developed a model-based approximation of gene tree reconciliation that does not consider gene trees. Instead, this approach models changes in gene family size over species tree lineages. How well this approximation performs is still an open question. With model-based approaches lagging behind parsimony-based approaches, we first direct our attention to parsimony methods for gene tree reconciliation. After providing the necessary introductory definitions and notation, we describe the parsimony-based gene duplication model and variants of it. We then survey methods that cluster gene duplications to predict episodes in which multiple genes were duplicated at once. Supertree problems that are based on gene duplication and loss, and heuristic approaches and properties for these problems, are also discussed. We conclude by surveying model-based reconciliation approaches.

2 BASIC DEFINITIONS AND NOTATION

A tree T is a connected graph with no cycles, consisting of a node set V(T) and an edge set E(T). We denote by I(T) the internal nodes in V(T). T is rooted if it has exactly one distinguished node called the root, which we denote by Ro(T). Let T be a rooted tree. We define ≤T to be the partial order on V(T), where x ≤T y if y is a node on the path between Ro(T) and x. If x ≤T y, we call x a descendant of y and y an ancestor of x. We also define x

tree (e.g., the trees G and S in Figure 1), which are both rooted, by invoking gene duplications. To resolve these topological differences, all gene trees, here referred to as inferred gene trees, are considered that could have evolved by invoking gene duplication and speciation events along the topology of the given species tree. Formally, D is an inferred gene tree of species tree S, if there exists a function f : V (D) → V (S), such that (1) every leaf in D maps to a leaf in S, and (2) every internal node u in D is either a gene duplication node or a speciation node. The node u is a duplication node (or d-node) if f (ChD (u)) = {f (u)}. The node u is a speciation node (or s-node) if the function f on its restricted domain ChD (u) is a bijection into ChS (f (u)). Further, f (u) is referred to as the species that contains u. As an example, Figure 1 depicts the inferred gene tree R2 , which evolved only by speciation events along the topology of the species tree S. Thus, R2 consists only of s-nodes and exhibits the same topology as the species tree S. An example of a gene tree that is inferred by gene duplication from the species tree S is the tree R1 , which is shown in Figure 1. The root X of species tree S contains the gene d that is duplicated into the copies d1 , d2 , and d3 . Each of the copies evolves only through speciation along the topology of the species tree S. The inferred gene tree consists of exactly one d-node contained in species X, and s-nodes otherwise. Any number of further gene duplications can be invoked in the same way. Observe that there always exists an inferred gene tree, say D, which is evolutionarily compatible with the given gene tree G. In this case the tree G can be embedded in the tree D. The embedding is a function Emb: V (G) → V (D) that satisfies the following two properties. First, the function Emb preserves the pairwise least common ancestor relations given in G. 
Second, for every leaf l ∈ Le(G), the species from which it was sampled and the species of Emb(l) are identical. Inferred gene trees that allow such an embedding of the gene tree are called explanation trees. Explanation trees are evolutionarily compatible with the gene tree, and therefore their d-nodes explain topological inconsistencies of the gene tree with respect to the species tree. An example is depicted in Figure 1, where the solid edges of the explanation tree R2

[Figure 1 image: gene tree G with leaves a, b, c; species tree S with leaves A, B, C; reconciled trees R1 and R2.]

Figure 1 The gene tree G and its comparable species tree S are inconsistent. If the polytomy r ∈ V (G) is true, the embedding of the gene tree G into the reconciled tree R1 explains the inconsistency. The embedding of G into R1 is highlighted by solid edges. In particular, the polytomy r := Ro(G) that is embedded into the d-node d ∈ V (R1 ) truly duplicates into the copies d1 , d2 , and d3 in species X. If the polytomy r ∈ V (G) is apparent, the embedding of the gene tree G into the reconciled tree R2 explains the inconsistency. In this case the apparent polytomy is replaced by the topology of the species tree S and does not invoke any d-nodes in the reconciled tree R2 .


show its embedded gene tree G. Further, maximal subtrees in an explanation tree that have no embedding from the gene tree can be caused by an extinction of the genes represented by the root nodes of these subtrees. Therefore, these root nodes are referred to as losses (Page, 1994). For example, in Figure 1 the node y3 in the explanation tree R1 is a loss. Note that a loss can also be caused by genes that do exist but are not accounted for in the gene tree. However, there is always an infinite set of explanation trees, and parsimony criteria are used to select biologically meaningful explanation trees, which are called reconciled trees. Of particular interest are node reconciled trees, which are explanation trees with the minimum overall number of nodes. Node reconciled trees are unique (Eulenstein, 1998; Bonizzoni et al., 2005), which is a desirable property in practice. However, node reconciled trees are not unique for certain generalizations of the GD model (Chang and Eulenstein, 2006). Another type of reconciled tree is the dup-loss reconciled tree, which minimizes the number of d-nodes and losses. In the case of binary trees the definitions for node reconciled trees and dup-loss reconciled trees are equivalent (Górecki and Tiuryn, 2004; Chang and Eulenstein, 2006). In the following we describe (1) conditions that are necessary for gene tree reconciliation, (2) an efficient approach to locating the gene duplications, and (3) reconciliation measurements.

3.1 Comparability

To reconcile a gene tree with a species tree, the trees have to be comparable; that is, every leaf of the gene tree has to be mapped to the leaf in the species tree representing the species from which it was sampled. A mapping that establishes comparability between a gene tree and a species tree is called a leaf mapping. Note that in theory a leaf mapping always exists, although it might not be meaningful in practice.
Definition 3.1 (Leaf Mapping) A leaf mapping $L_{G,S}: \mathrm{Le}(G) \to \mathrm{Le}(S)$ specifies for each gene the species from which it was sampled.

Definition 3.2 (Comparability) Given a gene tree $G$ and a species tree $S$, we say that $G$ is comparable to $S$ if there exists a leaf mapping $L_{G,S}$. A collection of gene trees $\mathcal{G}$ is comparable to $S$ if each gene tree in $\mathcal{G}$ is comparable to $S$.
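Definitions 3.1 and 3.2 can be illustrated with a minimal sketch. It assumes that trees are written as nested Python tuples whose leaves are strings and that a candidate leaf mapping is a dictionary from gene names to species names. The tuple encoding, the gene names, and the helper names `leaves` and `is_leaf_mapping` are illustrative assumptions and do not come from the chapter.

```python
def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def is_leaf_mapping(leaf_map, G, S):
    """Check that leaf_map sends every leaf of G to a leaf of S."""
    species = leaves(S)
    return all(leaf_map.get(gene) in species for gene in leaves(G))


# Hypothetical example: three genes sampled from species A, B, and C.
S = (("A", "B"), "C")
G = (("a1", "c1"), "b1")
leaf_map = {"a1": "A", "b1": "B", "c1": "C"}
print(is_leaf_mapping(leaf_map, G, S))
```

In this encoding, G is comparable to S as soon as one dictionary passes the check; a dictionary that omits a gene or names a species outside S does not qualify as a leaf mapping.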

3.2 Gene Duplication

A duplicated gene or d-node is a gene in an inferred gene tree that is duplicated in a species into several copies (e.g., gene d is duplicated into the copies d1, d2, and d3 in species X in Figure 1). It is known that every d-node in the node reconciled tree has an embedding from a gene of the gene tree (Eulenstein, 1998). Therefore, every d-node in the node reconciled tree can be observed in the gene tree. For convenience we identify d-nodes in the node reconciled tree with their embedded nodes from the gene tree. Furthermore, the d-nodes in the gene tree can be described by a mapping from the gene tree to the species tree, independent of the node reconciled tree.

Definition 3.3 (Mapping) The extension $M_{G,S}: V(G) \to V(S)$ of $L_{G,S}$, written $\mathrm{lca}_{G,S}$ below, is the mapping defined by $\mathrm{lca}_{G,S}(g) = \mathrm{lca}_S(L_{G,S}(\mathrm{Le}(G_g)))$.


Theorem 3.1 (Eulenstein, 1998) A node $v \in V(G)$ is a d-node if and only if at least one of the following properties is satisfied.
1. $\mathrm{lca}_{G,S}(v) \in \mathrm{lca}_{G,S}(\mathrm{Ch}_G(v))$.
2. There are distinct $u, w \in \mathrm{Ch}_G(v)$ where $\mathrm{lca}_S(\{\mathrm{lca}_{G,S}(u), \mathrm{lca}_{G,S}(w)\}) <_S \mathrm{lca}_{G,S}(v)$.
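The mapping of Definition 3.3 and the two properties of Theorem 3.1 can be sketched in a few lines. The sketch below is a naive illustration, not the linear-time method cited in the text. It assumes nested-tuple trees with string leaves, names each species-tree node by the set of species below it (its cluster), and uses hypothetical helper names (`clusters`, `lca`, `lca_map`, `duplication_nodes`).

```python
def leaves(tree):
    """Leaf labels below a node of a nested-tuple tree, as a frozenset."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clusters(S):
    """Leaf sets (clusters) of all nodes of the species tree S."""
    if isinstance(S, str):
        return [leaves(S)]
    return [leaves(S)] + [c for child in S for c in clusters(child)]


def lca(species_clusters, species):
    """Smallest cluster containing `species`: their least common ancestor."""
    return min((c for c in species_clusters if species <= c), key=len)


def lca_map(g, leaf_map, species_clusters):
    """The mapping of Definition 3.3, with species-tree nodes named by cluster."""
    return lca(species_clusters, frozenset(leaf_map[x] for x in leaves(g)))


def duplication_nodes(G, S, leaf_map):
    """Internal nodes of G that satisfy property 1 or 2 of Theorem 3.1."""
    sc = clusters(S)
    found = []

    def visit(g):
        if isinstance(g, str):
            return
        m = lca_map(g, leaf_map, sc)
        kids = [lca_map(child, leaf_map, sc) for child in g]
        prop1 = m in kids
        # A proper-subset test on clusters plays the role of the strict order <_S.
        prop2 = any(
            lca(sc, kids[i] | kids[j]) < m
            for i in range(len(kids))
            for j in range(i + 1, len(kids))
        )
        if prop1 or prop2:
            found.append(g)
        for child in g:
            visit(child)

    visit(G)
    return found


# Hypothetical example data.
S = (("A", "B"), "C")
leaf_map = {"a": "A", "b": "B", "c": "C"}
print(duplication_nodes((("a", "c"), "b"), S, leaf_map))
```

With this species tree, the binary gene tree ((a, c), b) has a d-node at its root by property 1, and the polytomy (a, b, c), read as a true polytomy, is a d-node by property 2.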

Definition 3.4 (Duplication Nodes) We define the set of all d-nodes with respect to a gene tree $G$ and a comparable species tree $S$ to be $\mathrm{dup}(G, S) = \{g \in V(G) \mid g \text{ is a d-node}\}$.

Mappings are linear-time computable on a PRAM (Zhang, 1997) through a reduction from the least common ancestor problem (Bender and Farach-Colton, 2000). Thus, with Theorem 3.1 it follows that $\mathrm{dup}(G, S)$ can be computed in linear time. Observe that given $\mathrm{dup}(G, S)$, the node reconciled tree can be inferred in linear time in the size of the reconciled tree, which can be at most quadratic in the input size.

3.3 Reconciliation Cost

The literature provides a variety of approaches to describing a reconciliation cost from a gene tree $G$ to a comparable species tree $S$. The simplest definition of such a cost might be the duplication cost $d(G, S) := |\mathrm{dup}(G, S)|$. An extension of this cost is the duplication-loss cost, which counts the number of losses in addition to the duplication cost, as described by Page (1994). Note that the duplication cost and the duplication-loss cost are asymmetric measurements. This reflects that reconciliation costs are determined between two different types of trees: namely, a gene tree and a species tree. Guigó et al. (1996) introduced the mutation cost for binary gene and species trees, where the leaf mapping is a one-to-one mapping.

Definition 3.5 (Guigó et al., 1996) Let $G$ be a gene tree that is comparable to the species tree $S$ such that both trees are rooted and binary and $L_{G,S}$ is a one-to-one leaf mapping. Further, let $p(a, b)$ be the number of nodes on the path between $\mathrm{lca}_{G,S}(a)$ and $\mathrm{lca}_{G,S}(b)$, for $a \le_G b$. Consider $g \in I(G)$, $\mathrm{Ch}_G(g) = \{e, f\}$, and the mappings $g' := \mathrm{lca}_{G,S}(g)$, $e' := \mathrm{lca}_{G,S}(e)$, and $f' := \mathrm{lca}_{G,S}(f)$. Then the loss of a node $g$ is defined to be

$$l(g) := \begin{cases} 0, & \text{if } g' = e' \text{ and } g' = f' \\ p(g, e) + 1, & \text{if } e' <_S g' \text{ and } f' = g' \\ p(g, f) + 1, & \text{if } f' <_S g' \text{ and } e' = g' \\ p(g, e) + p(g, f), & \text{otherwise} \end{cases}$$

The mutation measure is defined as

$$m(G, S) := d(G, S) + \sum_{g \in I(G)} l(g)$$

and we call $\sum_{g \in I(G)} l(g)$ the loss cost.
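As a hedged illustration of Definition 3.5, the sketch below computes the duplication cost, the loss cost, and the mutation measure for binary trees with a one-to-one leaf mapping. It assumes nested-tuple trees with string leaves, names species-tree nodes by their clusters, and reads p as the number of species-tree nodes strictly between the two mapped nodes; that reading, like the helper names, is an assumption of this sketch.

```python
def leaves(tree):
    """Leaf labels below a node of a nested-tuple tree, as a frozenset."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clusters(S):
    """Leaf sets (clusters) of all nodes of the species tree S."""
    if isinstance(S, str):
        return [leaves(S)]
    return [leaves(S)] + [c for child in S for c in clusters(child)]


def lca(species_clusters, species):
    """Smallest cluster containing `species`: their least common ancestor."""
    return min((c for c in species_clusters if species <= c), key=len)


def mutation_cost(G, S, leaf_map):
    """Return (duplications, losses, mutation measure) for binary G and S."""
    sc = clusters(S)

    def m(g):
        return lca(sc, frozenset(leaf_map[x] for x in leaves(g)))

    def p(high, low):
        # Assumption: count species-tree nodes strictly between the two images.
        return sum(1 for c in sc if low < c < high)

    dups = losses = 0

    def visit(g):
        nonlocal dups, losses
        if isinstance(g, str):
            return
        e, f = g
        gm, em, fm = m(g), m(e), m(f)
        if gm in (em, fm):  # property 1 of Theorem 3.1
            dups += 1
        if em == gm and fm == gm:
            pass  # first case of l(g): no loss
        elif em < gm and fm == gm:
            losses += p(gm, em) + 1
        elif fm < gm and em == gm:
            losses += p(gm, fm) + 1
        else:
            losses += p(gm, em) + p(gm, fm)
        visit(e)
        visit(f)

    visit(G)
    return dups, losses, dups + losses


# Hypothetical example data.
S = (("A", "B"), "C")
leaf_map = {"a": "A", "b": "B", "c": "C"}
print(mutation_cost((("a", "c"), "b"), S, leaf_map))
```

For the gene tree ((a, c), b) and the species tree ((A, B), C), this reading gives one duplication at the root and three losses, hence a mutation measure of 4; a gene tree with the species-tree topology costs 0.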


A generalized version of the mutation measure, where the leaf mapping is not required to be a one-to-one mapping, is linear-time computable (Ma et al., 2000). Furthermore, the duplication-loss cost under the “setting” of the mutation measure is equivalent to the mutation cost (Eulenstein, 1998). Another equivalent formulation of the mutation cost has been given by Eulenstein and Vingron (1998) and Zhang (1997).

4 APPARENT POLYTOMIES

The GD model has been widely accepted but is limited to true polytomies. In graph-theoretic terms a polytomy is a node with more than two children. Biologically, a polytomy can be either true or apparent (Maddison, 1989; Slowinski, 2001). A polytomy is true if its children diverged from it at the same time. A polytomy is apparent if the subtree consisting of the polytomy and its children could not be fully resolved in evolutionary history. Since true polytomies are rare evolutionary events (Slowinski, 2001), the literature typically describes the basic GD model only for binary gene and species trees. Note that property 2 in Theorem 3.1 becomes obsolete in the case of binary trees. However, in practice phylogenetic trees frequently have weakly supported or completely unresolved evolutionary relationships, which may be represented most accurately by apparent polytomies. In this section we survey tree reconciliation problems for rooted trees with apparent polytomies.

The apparent tree reconciliation problem is: Given a gene tree G and a comparable species tree S, find, among all ordered pairs of complete binary refinements of G and S, one with the minimum reconciliation cost. Constrained variants of this problem are the apparent gene tree reconciliation problem and the apparent species tree reconciliation problem, where the given species tree and the given gene tree are binary, respectively.

Chang and Eulenstein (2006) addressed the apparent gene tree reconciliation problem. They modified the original GD model by relaxing the embedding function Emb: V(G) → V(D): that is, preserving only the tree order rather than the least common ancestor relations. Figure 1 depicts an example for an embedding and a relaxed embedding. An embedding that preserves the least common ancestor relations is shown by solid lines in the node reconciled tree R1. A relaxed embedding that preserves the tree order and not the least common ancestor relations is shown by solid lines in the node reconciled tree R2. The relaxed embedding function modifies the definition of node and dup-loss reconciled trees. In contrast to the original GD model, node and dup-loss reconciled trees are not necessarily unique, and their definitions are not equivalent (Chang and Eulenstein, 2006). However, node and dup-loss reconciled trees can be computed in polynomial time (Chang and Eulenstein, 2006) under the modified GD model.

Vernot et al. (2008) addressed a variant of the apparent gene tree reconciliation problem. Their algorithm runs in O(|V(G)|(h + d)) time, where h is the height of S and d is the maximum out-degree of a node in V(S). An implementation of the algorithm can be found in the program package NOTUNG (Chen et al., 2000). Berglund-Sonnhammer et al. (2006) introduced a variant of the general apparent tree reconciliation problem. The authors’ approach to this problem includes solving an NP-complete subproblem. However, their approach features several refinements for application in practice, accepts unrooted trees as input, and is implemented in the software package Softparsmap (Berglund-Sonnhammer et al., 2006). Note that to our knowledge, the computational hardness of the general apparent reconciliation problem remains open.
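The apparent gene tree reconciliation problem can be made concrete with a brute-force sketch that enumerates every complete binary refinement of a gene tree and keeps one with the minimum duplication cost. This is exponential in the size of the polytomies and is not the polynomial-time approach of the cited works; it only illustrates the problem statement. It assumes nested-tuple trees, a binary species tree, and hypothetical helper names.

```python
from itertools import combinations, product


def leaves(tree):
    """Leaf labels below a node of a nested-tuple tree, as a frozenset."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clusters(S):
    """Leaf sets (clusters) of all nodes of the species tree S."""
    if isinstance(S, str):
        return [leaves(S)]
    return [leaves(S)] + [c for child in S for c in clusters(child)]


def dup_cost(G, S, leaf_map):
    """Duplication cost of a binary gene tree G (property 1 of Theorem 3.1)."""
    sc = clusters(S)

    def m(g):
        species = frozenset(leaf_map[x] for x in leaves(g))
        return min((c for c in sc if species <= c), key=len)

    def count(g):
        if isinstance(g, str):
            return 0
        here = 1 if m(g) in [m(child) for child in g] else 0
        return here + sum(count(child) for child in g)

    return count(G)


def binary_trees(items):
    """Yield every rooted binary tree whose leaves are the given subtrees."""
    if len(items) == 1:
        yield items[0]
        return
    first, rest = items[0], items[1:]
    # `first` always goes left, so each unordered split is produced once.
    for k in range(len(rest)):
        for chosen in combinations(range(len(rest)), k):
            left = [first] + [rest[i] for i in chosen]
            right = [rest[i] for i in range(len(rest)) if i not in chosen]
            for l, r in product(binary_trees(left), binary_trees(right)):
                yield (l, r)


def refinements(G):
    """Yield every complete binary refinement of the gene tree G."""
    if isinstance(G, str):
        yield G
        return
    for kids in product(*(list(refinements(child)) for child in G)):
        yield from binary_trees(list(kids))


# Hypothetical data mirroring the polytomy discussed for Figure 1.
S = (("A", "B"), "C")
leaf_map = {"a": "A", "b": "B", "c": "C"}
G = ("a", "b", "c")
best = min(refinements(G), key=lambda t: dup_cost(t, S, leaf_map))
print(best, dup_cost(best, S, leaf_map))
```

For the three-leaf polytomy, the refinement that matches the species tree needs no duplication, which agrees with the reading of an apparent polytomy given for the reconciled tree R2.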

5 UNROOTED TREES

In this section we consider only binary trees. The unrooted tree reconciliation problem is: Given an unrooted gene tree G and a comparable species tree S, find, among all ordered pairs of a rooted version of G and a rooted version of S, one with the minimum reconciliation cost. Constrained variants of this problem are the unrooted gene tree reconciliation problem and the unrooted species tree reconciliation problem, where the root of the given species tree and the root of the given gene tree are fixed, respectively. In practice, only the unrooted gene tree reconciliation problem may be of interest, since species trees can be rooted using outgroup rooting or midpoint rooting approaches. However, these approaches may be problematic for rooting gene trees that have a history of gene duplication and loss. Furthermore, several standard phylogenetic reconstruction methods infer unrooted gene trees. The unrooted gene tree reconciliation problem was first addressed by Chen et al. (2000), who gave a linear time algorithm for this problem. Later, Górecki and Tiuryn (2007a) presented a more refined algorithm and showed that choosing any optimal rooting results in correct biological scenarios. Berglund-Sonnhammer et al. (2006) follow the dynamic programming approach from Chen et al. (2000), but return gene tree rootings that have the minimum number of losses among all rootings with the minimum number of gene duplications. Implementations for the unrooted gene tree reconciliation problem can be found in the software packages NOTUNG (Chen et al., 2000), URec (Górecki, 2006; Górecki and Tiuryn, 2007b), and Softparsmap (Berglund-Sonnhammer et al., 2006).
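The unrooted gene tree reconciliation problem can be illustrated by trying every rooting. The sketch below roots an unrooted binary gene tree on each of its edges and keeps a rooting with the minimum duplication cost. It is a quadratic brute-force illustration, not the linear-time algorithm cited above, and it assumes an edge-list encoding (string leaves, integer internal nodes) together with hypothetical helper names.

```python
from collections import defaultdict


def leaves(tree):
    """Leaf labels below a node of a nested-tuple tree, as a frozenset."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clusters(S):
    """Leaf sets (clusters) of all nodes of the species tree S."""
    if isinstance(S, str):
        return [leaves(S)]
    return [leaves(S)] + [c for child in S for c in clusters(child)]


def dup_cost(G, S, leaf_map):
    """Duplication cost of a rooted binary gene tree G."""
    sc = clusters(S)

    def m(g):
        species = frozenset(leaf_map[x] for x in leaves(g))
        return min((c for c in sc if species <= c), key=len)

    def count(g):
        if isinstance(g, str):
            return 0
        here = 1 if m(g) in [m(child) for child in g] else 0
        return here + sum(count(child) for child in g)

    return count(G)


def rootings(edges):
    """Yield the rooted tree obtained by placing the root on each edge."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    def subtree(node, parent):
        kids = [n for n in adj[node] if n != parent]
        if not kids:
            return node  # a leaf, given by its gene name
        return tuple(subtree(k, node) for k in kids)

    for u, v in edges:
        yield (subtree(u, v), subtree(v, u))


# Hypothetical unrooted quartet ab|cd; internal nodes are the integers 1 and 2.
edges = [("a", 1), ("b", 1), (1, 2), ("c", 2), ("d", 2)]
S = (("A", "B"), ("C", "D"))
leaf_map = {"a": "A", "b": "B", "c": "C", "d": "D"}
best = min(rootings(edges), key=lambda t: dup_cost(t, S, leaf_map))
print(best, dup_cost(best, S, leaf_map))
```

In this example only the rooting on the internal edge reproduces the species tree and needs no duplication; each rooting on a leaf edge forces one duplication at the root. A tie-breaking rule on losses, as used by Berglund-Sonnhammer et al. (2006), is not included here.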

6 EPISODES OF GENE DUPLICATIONS

Understanding the evolution of gene families and genomes is a key problem in evolutionary biology. Gene families and genomes have been shaped by ancient genome duplication events that have taken place during the evolution of species (Stebbins, 1950; Grant, 1981). The location and timing of these genome duplications can provide invaluable information for understanding how gene families and genomes have evolved. Unfortunately, the detection of genome duplications is greatly complicated by gene loss and gene rearrangements, which frequently follow genome duplication events (Simillion et al., 2002; Lynch and Conery, 2003). However, single gene duplications have left many traces in genomes, and they can represent a larger duplication event at the genome level if combined into an episode of gene duplications. Episode-based gene duplication problems seek, given a collection of gene trees and a comparable species tree, to identify and locate episodes of gene duplications on the species tree. It is assumed that the gene duplications for this problem result from gene tree reconciliation. Note that gene tree reconciliation can identify gene duplications, but it does not necessarily map them accurately onto the species tree. Specific episode-based gene duplication problems are determined by (1) specifying locations where gene duplications can be placed on the species tree, and (2) the objective that identifies a best placement of the gene duplications.


Guigó et al. (1996) introduced the first episode-based gene duplication problem, which is referred to as the multiple gene duplication problem. The authors addressed this problem with a heuristic approach that was then refined and restated in more formal terms by Page and Cotton (2002). In essence, the heuristic approach of Page and Cotton for the multiple gene duplication problem aimed to solve a somewhat similar problem, called the episode clustering problem. In 2008 the multiple gene duplication problem was restated by Bansal and Eulenstein (2008b) and called the minimum episode problem. An alternative version of the multiple gene duplication problem was introduced by Fellows et al. (1998). In the following, we describe the episode clustering problem and the minimum episode problem, after we have introduced their solution space. Finally, we survey the multiple gene duplication problem from Fellows et al. (1998). For the remainder of this section we assume that all trees are binary and rooted.

6.1 Solution Space

Both the episode clustering problem and the minimum episode problem have the same allowed placements of gene duplications onto the species tree. The placements allowed for a gene are nodes on a subpath of a path that runs between the root and a leaf in the species tree. We refer to these subpaths as tree intervals, which are associated with genes by the tree interval mapping.

Definition 6.1 (Tree Interval Mapping) Let G be a gene tree that is comparable with a species tree S. The tree interval mapping for G and S is defined as

\[
\mathrm{int}_{G,S}(g) :=
\begin{cases}
[\mathrm{lca}_{G,S}(g),\ \mathrm{Ro}(S)], & \text{if } g = \mathrm{Ro}(G) \\
[\mathrm{lca}_{G,S}(g),\ \mathrm{lca}_{G,S}(g)], & \text{if } \mathrm{lca}_{G,S}(g) = \mathrm{lca}_{G,S}(\mathrm{Pa}_G(g)) \\
[\mathrm{lca}_{G,S}(g),\ \mathrm{lca}_{G,S}(\mathrm{Pa}_G(g))], & \text{otherwise}
\end{cases}
\]

An example of a tree interval mapping with respect to a gene tree G and a comparable species tree S is depicted in Figure 2. The gene duplications in dup(G, S) are represented by the three bold nodes in G, which are associated with their tree intervals under $\mathrm{int}_{G,S}$. For example, the interval [5, 3] represents the path 5, 4, 3 in the species tree S. Let g denote the node corresponding to the interval [5, 3]. Species 5 is the most recent species that could have contained g, and the parent of species 3 (i.e., 2) is the most recent species that could have contained the parent of g. Thus, the solution space for both problems can be described through all possible mappings of gene duplications onto their tree intervals, which we call mapping scenarios.

Figure 2 A gene tree G and a comparable species tree S are depicted. The bold nodes in G are gene duplications that are associated with their tree intervals. Each tree interval represents the locations of the corresponding gene duplication allowed in the species tree S.

Definition 6.2 (Mapping Scenario) Let G be a gene tree that is comparable with a species tree S. We say that a function $M_{G,S} : V(G) \to V(S)$ is a mapping scenario if the following hold:

• $M_{G,S}(g) \in \mathrm{int}_{G,S}(g)$, if $g \in \mathrm{dup}(G, S)$.
• $M_{G,S}(g) = \mathrm{lca}_{G,S}(g)$, otherwise.

Let $\mathcal{G}$ be a collection of gene trees that is comparable with a species tree S, and let $V(\mathcal{G}) := \bigcup_{G \in \mathcal{G}} V(G)$. We say that a function $M_{\mathcal{G},S} : V(\mathcal{G}) \to V(S)$ is a mapping scenario if, for every $G \in \mathcal{G}$, its restriction $M_{G,S}$ to V(G) is a mapping scenario.
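The solution space can be enumerated directly when the tree intervals are already known. The sketch below assumes, purely for illustration, that the species tree is given as a child-to-parent dictionary, that each duplication comes with an explicit interval `(low, high)` where `high` is an ancestor of (or equal to) `low`, and that `lca` holds the LCA mapping of all gene nodes. The function names are hypothetical, and the toy species path 5, 4, 3, 2, 1 only loosely echoes the intervals mentioned for Figure 2.

```python
from itertools import product


def interval_nodes(parent, low, high):
    """List the species on the tree interval [low, high], from low up to high."""
    nodes = [low]
    while nodes[-1] != high:
        nodes.append(parent[nodes[-1]])
    return nodes


def mapping_scenarios(parent, lca, intervals):
    """Yield every mapping scenario as a dict gene -> species.

    Duplications (the keys of `intervals`) may sit anywhere on their interval;
    every other gene stays at its LCA image.
    """
    dups = sorted(intervals)
    choices = [interval_nodes(parent, *intervals[g]) for g in dups]
    for placement in product(*choices):
        scenario = dict(lca)
        scenario.update(zip(dups, placement))
        yield scenario


# Toy input: a path-shaped fragment of a species tree, child -> parent.
parent = {2: 1, 3: 2, 4: 3, 5: 4}
intervals = {"g1": (2, 1), "g2": (2, 2), "g3": (5, 3)}
lca = {"g1": 2, "g2": 2, "g3": 5, "h": 4}  # "h" is a speciation node
scenarios = list(mapping_scenarios(parent, lca, intervals))
```

The number of scenarios is the product of the interval lengths, which is why the problems in the following subsections are stated as optimization problems rather than solved by enumeration.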

6.2 Episode Clustering Problem

The episode clustering problem seeks a mapping scenario that maps the gene duplications onto the minimum number of species. Equivalently, the episode clustering problem can be defined based on minimum tree interval covers.

Problem 6.1 (Episode Clustering)
Instance: A gene tree collection $\mathcal{G}$ and a comparable species tree S.
Find: A minimum cover of $\bigcup_{G \in \mathcal{G},\, g \in \mathrm{dup}(G,S)} \{\mathrm{int}_{G,S}(g)\}$ in the order $\le_S$.

Page and Cotton (2002) approached the episode clustering problem by first reducing it to the intrinsically difficult set-cover problem (Garey and Johnson, 1979) and then applying a heuristic to the reduced problem. The first exact polynomial time algorithm was given by Burleigh et al. (2008) and later improved by Luo et al. (2009).

6.3 Minimum Episode Problem

The minimum episode problem relies on scoring mapping scenarios based on the evolutionary relations of gene duplications that map to the same species. The subgraph induced by these gene duplications in $\mathcal{G}$ forms a forest.

Definition 6.3 (Forest) Let $M_{\mathcal{G},S}$ be a mapping scenario and $s \in V(S)$. We call the subgraph of $\mathcal{G}$ that is induced by $M_{\mathcal{G},S}^{-1}(s)$ the forest of s under $M_{\mathcal{G},S}$, and denote it by $F(M_{\mathcal{G},S}, s)$.

Each forest located at a node in the species tree is scored by the minimum number of gene duplication episodes necessary to create the forest, which is the height of the forest. The score of a mapping is the overall sum of the forest heights for each species in the species tree. The minimum episode problem seeks a mapping scenario with minimum score.

Definition 6.4 (Episodes) Let $M_{\mathcal{G},S}$ be a mapping scenario and $s \in V(S)$. We define $\mathcal{E}(F(M_{\mathcal{G},S}, s))$ to be the height of the forest $F(M_{\mathcal{G},S}, s)$ and

\[
\mathcal{E}(F(M_{\mathcal{G},S})) = \sum_{s \in V(S)} \mathcal{E}(F(M_{\mathcal{G},S}, s)).
\]

Problem 6.2 (Minimum Episode)
Instance: A gene tree collection $\mathcal{G}$ and a comparable species tree S.
Find: A mapping scenario $M^{*}_{\mathcal{G},S}$, where $\mathcal{E}(F(M^{*}_{\mathcal{G},S})) = \min\{\mathcal{E}(F(M_{\mathcal{G},S})) \mid M_{\mathcal{G},S} \text{ is a mapping scenario}\}$.

The first polynomial time algorithm for the minimum episode problem was given by Bansal and Eulenstein (2008b). Later, Luo et al. (2009) described a linear time algorithm for this problem.
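The two objectives of Sections 6.2 and 6.3 can be illustrated with a short sketch. It rests on several assumptions that are not taken from the cited papers: a "cover" is read as a set of species such that every tree interval contains at least one of them, the species tree is a child-to-parent dictionary, intervals are `(low, high)` pairs with `high` an ancestor of (or equal to) `low`, and the height of a forest is counted in nodes, so that a single duplication needs one episode. The greedy routine is a standard interval-stabbing idea and is not claimed to be the algorithm of Burleigh et al. (2008) or Luo et al. (2009).

```python
def stab_intervals(parent, intervals):
    """Greedily choose species so that every tree interval contains one of them."""
    def depth(node):
        d = 0
        while node in parent:
            node, d = parent[node], d + 1
        return d

    def path(low, high):
        nodes = [low]
        while nodes[-1] != high:
            nodes.append(parent[nodes[-1]])
        return nodes

    chosen = set()
    # Handle intervals with the deepest upper end first. Any later interval
    # that meets the current one must pass through its upper end, so the
    # upper end is the most useful species to pick.
    for low, high in sorted(intervals, key=lambda iv: depth(iv[1]), reverse=True):
        if not chosen.intersection(path(low, high)):
            chosen.add(high)
    return chosen


def episode_score(gene_parent, dups, mapping):
    """Sum, over species, of the height of the forest of duplications mapped there."""
    chain_length = {}

    def chain(g):
        # Number of duplications on the chain from g upward that share g's species.
        if g not in chain_length:
            p = gene_parent.get(g)
            same = p in dups and mapping[p] == mapping[g]
            chain_length[g] = 1 + (chain(p) if same else 0)
        return chain_length[g]

    height = {}
    for g in dups:
        s = mapping[g]
        height[s] = max(height.get(s, 0), chain(g))
    return sum(height.values())
```

For example, with the species path `{2: 1, 3: 2, 4: 3, 5: 4}` and the intervals `(2, 1)`, `(2, 2)`, and `(5, 3)`, `stab_intervals` picks the species 2 and 3, which is the fewest possible because the last two intervals are disjoint.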

6.4 Multiple Gene Duplication Problem II

Fellows et al. (1998) introduced multiple gene duplication problem II* in 1998. In this problem the locations for gene duplication placements are much less restrictive than in the episode clustering or minimum episode problem. Furthermore, the event status of a gene can change from a speciation event to a gene duplication event. The cost of a placement is based on multiple gene duplication events, each of which duplicates at most one gene of every given gene tree at once. In the following we describe in detail the allowed locations of gene duplications, the event status change for a gene from speciation to duplication, and the term multiple gene duplication. We then formulate multiple gene duplication problem II and survey complexity results for this problem.

Let G be a gene tree that is comparable with a species tree S. The location and the event status of the genes in G are described by the two functions $\mathrm{loc}_{G,S} : V(G) \to V(S)$ and $\mathrm{event}_{G,S} : V(G) \to \{\mathrm{dup}, \mathrm{spec}\}$, respectively. The initial settings for these functions are $\mathrm{loc}_{G,S} := \mathrm{lca}_{G,S}$ and

\[
\mathrm{event}_{G,S}(g) :=
\begin{cases}
\mathrm{dup}, & \text{if } g \in \mathrm{dup}(G, S) \\
\mathrm{spec}, & \text{otherwise}
\end{cases}
\]

All other settings for these functions result from repeatedly applying a "move-up rule" that moves the locations of genes toward the root with respect to the tree order of G. The move-up rule can only be applied to genes $u \in V(G) \setminus \{\mathrm{Ro}(G)\}$ where $\mathrm{event}_{G,S}(u) = \mathrm{dup}$. Let u be a node where the rule can be applied and v be the parent of u. The move-up rule modifies the functions $\mathrm{loc}_{G,S}$ and $\mathrm{event}_{G,S}$ as follows:

1. Modifying $\mathrm{loc}_{G,S}$: $\mathrm{loc}_{G,S}(u) := \mathrm{loc}_{G,S}(v)$.
2. Modifying $\mathrm{event}_{G,S}$: If $\mathrm{event}_{G,S}(v) = \mathrm{spec}$, then $\mathrm{event}_{G,S}(v) := \mathrm{dup}$.

The tuple $(\mathrm{loc}_{G,S}, \mathrm{event}_{G,S})$ is called valid if its functions can result from applying the move-up rule zero or more times to the initial setting.

Now we introduce the term multiple gene duplication. Let $\mathcal{G}$ be a collection of rooted and binary gene trees that is comparable with a species tree S, and let $(\mathrm{loc}_{G,S}, \mathrm{event}_{G,S})$ be a valid tuple for each $G \in \mathcal{G}$. Consider a node $u \in V(S)$, and let D(u) be the set of genes $g \in V(\mathcal{G})$, taken over all $G \in \mathcal{G}$, such that $\mathrm{loc}_{G,S}(g) = u$ and $\mathrm{event}_{G,S}(g) = \mathrm{dup}$. The set D(u) can be partitioned into maximal sets, such that each set contains at most one gene from each gene tree in $\mathcal{G}$. Each set of this partition is thought to have resulted from one larger duplication event and is called a multiple gene duplication.

Problem 6.3 (Multiple Gene Duplication Problem II)
Instance: A gene tree collection $\mathcal{G}$ consisting of rooted and binary trees, a comparable species tree S, and an integer c.
Question: Do valid tuples $(\mathrm{loc}_{G,S}, \mathrm{event}_{G,S})$ exist for every $G \in \mathcal{G}$ such that the overall number of multiple gene duplication events is at most c?

Fellows et al. (1998) show that multiple gene duplication problem II is NP-complete and W[1]-hard. The authors show further that several restricted variants of this decision problem are fixed-parameter tractable.

* This problem is a restricted version of the multiple gene duplication problem introduced by Fellows et al. (1998), which is a supertree problem that is based on minimizing multiple gene duplication events.
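The move-up rule and the resulting count of multiple gene duplications can be sketched as follows. The representation is an assumption made for this example: each gene tree carries its own `loc` and `event` dictionaries plus a child-to-parent dictionary, and the function names are hypothetical. The count relies on one reading of the partition described above, namely that the number of multiple gene duplications at a species equals the largest number of duplications that any single gene tree places there.

```python
from collections import Counter


def move_up(u, parent, loc, event):
    """Apply the move-up rule to gene u, updating loc and event in place."""
    if u not in parent or event[u] != "dup":
        raise ValueError("the rule needs a non-root gene whose event is 'dup'")
    v = parent[u]
    loc[u] = loc[v]           # step 1: u moves to the location of its parent
    if event[v] == "spec":    # step 2: a speciation parent becomes a duplication
        event[v] = "dup"


def count_multiple_duplications(settings):
    """Count multiple gene duplications for a list of (loc, event) pairs.

    Each set of the partition at a species holds at most one gene per tree,
    so the tree with the most duplications there fixes the number of sets.
    """
    per_species = Counter()
    for loc, event in settings:
        in_tree = Counter(loc[g] for g in loc if event[g] == "dup")
        for species, count in in_tree.items():
            per_species[species] = max(per_species[species], count)
    return sum(per_species.values())


# Toy input: tree 1 has a speciation root r1 with a duplication child g;
# tree 2 has two duplications h1 and h2 located at species "s1".
parent1 = {"g": "r1"}
loc1, event1 = {"r1": "s1", "g": "s2"}, {"r1": "spec", "g": "dup"}
loc2, event2 = {"h1": "s1", "h2": "s1"}, {"h1": "dup", "h2": "dup"}

before = count_multiple_duplications([(loc1, event1), (loc2, event2)])
move_up("g", parent1, loc1, event1)
after = count_multiple_duplications([(loc1, event1), (loc2, event2)])
```

In this toy input, moving `g` up turns its parent into a duplication but lets both duplications of tree 1 share events with those of tree 2, so the count drops from `before` to `after`.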

7 SUPERTREES BASED ON TREE RECONCILIATION

Supertrees assemble a collection of phylogenetic trees with not necessarily identical taxa (Gordon, 1986; Sanderson et al., 1998; Bininda-Emonds, 2002, 2004) into one tree, and make evolutionary statements about the joint taxa of the input trees. Thus, supertrees provide a way to synthesize many small trees into comprehensive phylogenies representing large sections of the tree of life. Supertree studies have assembled the first complete family-level phylogeny of flowering plants (Davies et al., 2004) and the first nearly complete phylogeny of extant mammals (Bininda-Emonds et al., 2007). However, there is a fundamental debate in biological classification about how a supertree should assemble input trees to represent their common evolutionary information optimally. The wide variety of available supertree methods reflects the difficulty of establishing criteria for assembling phylogenetic trees into supertrees that optimally support the needs of evolutionary biologists. Most supertree problems require that their input trees are species trees, which are typically derived directly from their corresponding gene trees. As discussed earlier, evolutionary events often cause gene trees to be inconsistent with an accurate species tree. Below we first describe supertree problems for rooted binary trees that address this inconsistency based on gene tree reconciliation. Given that the supertree problems described are intrinsically hard, we then survey heuristics that aim to solve them. For the remainder of the section we assume that all trees are rooted and binary.


7.1 Supertree Problems

A supertree problem based on tree reconciliation is to find, for a given collection of gene trees $\mathcal{G}$, a comparable species tree $S^{*}$ with the minimum reconciliation cost. The reconciliation cost from $\mathcal{G}$ to a comparable species tree S is the sum of the reconciliation costs for each ordered pair in $\mathcal{G} \times \{S\}$. Here we consider the reconciliation cost to be either the duplication cost or the dup-loss cost, and refer to the corresponding supertree problems as the duplication problem and the dup-loss problem, respectively. The decision versions of both of these problems and some of their characterizations are NP-complete (Ma et al., 2000; Fellows et al., 2003). Furthermore, the duplication problem is W[2]-hard when parameterized by the number of gene duplications and is hard to approximate to better than a logarithmic factor (Bansal and Shamir, submitted for publication).

7.2 Hill-Climbing Heuristics

A major objective of supertree analyses, and in particular of analyses based on tree reconciliation, is to infer large supertrees. At the same time, computing large supertrees based on the duplication problem and the dup-loss problem is intrinsically difficult, as discussed earlier. However, efficient and effective hill-climbing heuristics that search the solution space for these problems have been adopted successfully. The solution space for a given collection of gene trees is the set of species trees that are comparable with the given tree collection. Every tree in the solution space can be labeled with its reconciliation score based on the given tree collection. The heuristics' objective is to search for a tree in the solution space with the minimum reconciliation score using a road map, which guarantees that such a tree can be reached. In graph-theoretic terms the search space and its road map correspond to a connected, undirected, and node-weighted graph G := (V, E, γ), called the tree search graph.
In this graph the node set V represents the solution space, the edges in E establish the road map of the search space, and γ: V → ℕ assigns the reconciliation score to each node. In many tree search graphs, an edge between two distinct nodes in the graph G is drawn if the corresponding trees can be transformed into each other by one tree edit operation. A heuristic search in the tree search graph G is initialized by a given tree T ∈ V and follows a path of steepest descent with respect to the node weights until a local minimum is reached. This path is found by repeatedly solving instances of a local search problem, that is, finding a node with the minimum weight in the neighborhood of a given node. Typically, the neighborhood of a node is defined as the set of its adjacent nodes. However, other neighborhoods are noted in the literature [e.g., neighborhoods formed by the nodes with a distance of at most k to the given node (Bansal and Eulenstein, 2008a)]. Search paths may include several thousand nodes, and for each node on the path an instance of the local search problem has to be solved. Consequently, the time complexity of the local search problem determines in part the runtime of the heuristic. The time efficiency of the local search problem depends on the size and structure of the problem's neighborhood. We first survey typical neighborhoods and then time complexities of local search problems that correspond to these neighborhoods.

Neighborhoods

Neighborhoods that have been applied successfully in supertree heuristics are based on a special type of tree edit operation called a subtree transfer operation. Using a subtree transfer operation, a tree is edited by pruning and reconnecting a chosen subtree of it. We describe three standard types of subtree transfer operations for rooted trees (Bordewich and Semple, 2004) in the following. Let T be a tree, e = (u, v) an edge in E(T), and $C_u$ and $C_v$ be the components in T − e where $u \in C_u$ and $v \in C_v$. Following Baroni et al. (2005), we define $T^p := T + f$ to be the planted tree of T, where $f := \{x, \mathrm{Ro}(T)\}$ with $x \notin V(T)$ is called the root edge of $T^p$.

1. The tree bisection and reconnection (TBR) operation. Let $T'$ be the tree obtained from tree $T^p$ by the following operations: (1) delete edge e; (2) subdivide an edge in each of the components $C_u$ and $C_v$ by a node, and add an edge between those nodes; (3) suppress any degree 2 node; and (4) remove the root edge. If a component of T − e consists of a single node, the added edge is attached to this node. The resulting tree $T'$ is said to be obtained from T by a TBR operation. Figure 3 depicts a general form of the TBR operation.
2. The subtree pruning and regrafting (SPR) operation. Let $T'$ be the tree obtained from tree T by a TBR operation, where in step (2) of this operation the subdivided edge in component $C_v$ is incident to node v. The resulting tree $T'$ is said to be obtained from T by an SPR operation. Figure 4 depicts a general form of the SPR operation.
3. The nearest-neighbor interchange (NNI) operation. Let $T'$ be obtained from tree T by an SPR operation, where in step (2) of the corresponding TBR operation the subdivided edge in component $C_u$ is on a path of length 2 from u to a leaf. The resulting tree $T'$ is said to be obtained from T by an NNI operation. Figure 5 depicts a general form of the NNI operation.

Figure 3 TBR operation. Consider a tree T that is changed into the tree $T'$ by applying a TBR operation. First, the planted tree $T^p$, shown in the figure, is constructed by adding the root edge {x, r} to T. Then the planted tree $T^p$ is changed into the tree $T'^{p}$ shown in the figure by deleting edge e, and rejoining the resulting components $C_u$ and $C_v$ by adding edge f. To add edge f = {s, t}, an edge (drawn in bold) is selected in each of the two components and divided by the additional nodes, s and t. A special case of rejoining the components is to join the component $C_v$ above the root r by adding edge g. Finally, the root edge {x, g′} or {x, r} is removed from $T'^{p}$ to produce the (binary and rooted) tree $T'$.

Figure 4 SPR operation. The SPR operation is a special case of the TBR operation. In contrast to the TBR operation, the edge in component $C_v$ that is used to rejoin the components has to be one of the edges {v, k} or {v, l}.

Figure 5 NNI operation. The NNI operation is a particular SPR operation where the edge in component $C_u$ that is used for rejoining the components has to be on a path (drawn in bold) of at most length 2 from node u to a leaf.

Each of the operations defined above that transforms a tree T into a tree $T'$ can be reversed by an operation of the same type. Consequently, the edges in the tree search graphs based on the operations NNI, SPR, and TBR can be undirected. Furthermore, tree search graphs based on NNI operations are connected, which follows from Robinson's result (Robinson, 1971) that any two unrooted binary trees over the same leaf set can be transformed into each other by a sequence of NNI operations for unrooted trees. Observe that every NNI operation is an SPR operation and every SPR operation is a TBR operation. Therefore, every tree search graph based on SPR and TBR operations is connected. In summary, tree search graphs based on the edit operations NNI, SPR, and TBR are well defined.

Local Search Problems

The type of edit operation considered determines the number of trees that are in the neighborhood of a given tree. Let T be a species tree over n leaves. The numbers of trees in the neighborhoods of T for the edit operations NNI, SPR, and TBR are asymptotically bounded by $O(n)$, $O(n^2)$, and $O(n^3)$, respectively.


From these bounds and the result that reconciliation costs based on duplication and dup-loss can be computed in linear time, the following time bounds for the local search problems follow directly. The local search problem for the duplication cost and the dup-loss cost based on the operations NNI, SPR, and TBR can be solved in time $O(n^2)$, $O(n^3)$, and $O(n^4)$, respectively. Heuristics based on these results are implemented in the program GeneTree (Page, 1998). These results were improved for the duplication cost by exploiting the similarity of trees in neighborhoods. Bansal and Eulenstein (2008a) provided a nearly linear time algorithm for the local search problem based on NNI operations. Further, the running times for the local search problems based on the operations SPR and TBR were improved by factors of $n$ and $n^2 / \log n$, respectively (Bansal et al., 2007; Bansal and Eulenstein, 2008c). These improvements allowed, for the first time, truly large-scale phylogenetic studies based on the GD model. Heuristics with respect to the local search problem for the duplication loss cost based on the SPR operation are implemented in the program DupTree (Wehe et al., 2008). Heuristics implemented in this program also accept unrooted gene trees as input.
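The steepest-descent search over an SPR neighborhood can be sketched in a few lines for small inputs. This is the naive version, in which every neighbor is scored from scratch, and it does not reproduce the speed-ups of Bansal et al. (2007) or the implementations in GeneTree or DupTree. The representation is an assumption made for this example: rooted binary trees are nested tuples, gene tree leaves are labeled directly by species names, each species appears once in the species tree, and all function names are hypothetical.

```python
def clusters(tree):
    """Return the set of leaf sets (clusters) of a rooted tree of nested tuples."""
    found = set()

    def walk(node):
        if isinstance(node, tuple):
            leaves = frozenset().union(*(walk(child) for child in node))
        else:
            leaves = frozenset([node])
        found.add(leaves)
        return leaves

    walk(tree)
    return found


def duplication_cost(gene, species_clusters):
    """Number of duplications of one gene tree under the LCA mapping."""
    def walk(node):
        if not isinstance(node, tuple):
            return frozenset([node]), 0
        (m_left, d_left), (m_right, d_right) = (walk(child) for child in node)
        m = min((c for c in species_clusters if m_left | m_right <= c), key=len)
        return m, d_left + d_right + int(m in (m_left, m_right))

    return walk(gene)[1]


def subtrees(tree):
    yield tree
    if isinstance(tree, tuple):
        for child in tree:
            yield from subtrees(child)


def prune(tree, part):
    """Remove the subtree `part` from `tree` and suppress its parent node."""
    if not isinstance(tree, tuple):
        return tree
    left, right = tree
    if left == part:
        return right
    if right == part:
        return left
    return (prune(left, part), prune(right, part))


def regrafts(tree, part):
    """Yield every tree obtained by attaching `part` to an edge of `tree`."""
    yield (tree, part)  # attach above the current (sub)tree root
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in regrafts(left, part):
            yield (new_left, right)
        for new_right in regrafts(right, part):
            yield (left, new_right)


def spr_neighbors(tree):
    """Rooted SPR neighborhood; it may also contain the tree itself."""
    for part in subtrees(tree):
        if part != tree:
            yield from regrafts(prune(tree, part), part)


def hill_climb(start, gene_trees):
    """Follow a path of steepest descent until no neighbor improves the score."""
    def score(species_tree):
        cl = clusters(species_tree)
        return sum(duplication_cost(g, cl) for g in gene_trees)

    current, current_score = start, score(start)
    while True:
        best = min(spr_neighbors(current), key=score)
        best_score = score(best)
        if best_score >= current_score:
            return current, current_score  # local minimum reached
        current, current_score = best, best_score
```

As with any hill-climbing heuristic, the result is only a local minimum of the tree search graph; different starting trees can end at different scores.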

8 MODEL-BASED APPROACHES

Lagergren and co-workers have pioneered the development of full model-based approaches for gene tree reconciliation (Arvestad et al., 2003–2009; Akerborg et al., 2009). They use a birth–death process over the species tree, where the probability of duplication is dependent on species branch lengths reported in the literature, as is the probability of loss, modeled using an exponential distribution as a constant-rate process. Ongoing efforts include models to enable differentiation between gene duplication and lateral transfer, as lateral transfer will be fit as a duplication event coupled to a large number of loss events in the absence of an appropriate model. The base model is described here. In general, probabilistic gene evolution models provide a framework for the evolution of gene trees inside species trees. In such models, duplications and losses of genes modeled using birth–death processes draw on the stochastic processes literature [reviewed by Novozhilov et al. (2006)]. Given that a gene tree is modeled within a species tree, the probabilities of reconciliations of gene trees with a species tree can also be obtained. Arvestad et al. (2003–2009) provide such modeling within a Bayesian framework, which allows for computation of posterior distributions for various probabilities, including those of reconciled trees. As an introduction to such modeling, consider the simple example from Arvestad et al. (2009) where the species tree consists of a single lineage resulting in a genes following duplication events. The probability of reconciliation of a gene tree with a children with the species tree is given by P {S(G)|D(a)}P {D(a)}: namely, the product of the conditional probability that a generated tree is isomorphic to the tree G, given that the gene tree has a children, multiplied by the probability of a gene tree evolving to have a children. 
The second part of this, P{D(a)}, is modeled via a linear birth–death process, which has a long history in stochastic processes. The birth–death process is parameterized via two rate parameters, λ for the births or duplications and μ for the deaths or losses. The process is then governed by two probabilities: separate but related expressions for $P_0(t)$, the probability of having zero genes at time t, and $P_n(t)$, the probability of having n genes at time t. Given expressions for these probabilities,

\[
P_0(t) = \frac{\mu\left(1 - e^{-(\lambda-\mu)t}\right)}{\lambda - \mu e^{-(\lambda-\mu)t}}
\quad\text{and}\quad
P_n(t) = \frac{\lambda-\mu}{\lambda - \mu e^{-(\lambda-\mu)t}},
\]

P{D(a)} is given by

\[
q_a(t) = P_n(t)\,[1 - P_0(t)]\,P_0(t)^{a-1}\, I_{(0,\infty)}(a) + [1 - P_n(t)]\, I_{\{0\}}(a),
\]

where I(·) is an indicator function. Similarly, the probability of a generated tree being isomorphic to a tree given that there are a children is obtained via a result from Harding (see Harding, 1971; Arvestad et al., 2004, 2009). Breaking up probabilities into products of conditional and marginal probabilities allows for computation via Markov chain Monte Carlo (MCMC).

The more general model is formulated by considering the simple model above for its subtrees. The probability of reconciliation of the more general gene tree with a general species tree is computed from the accumulated reconciliation probabilities of the subtrees.

The approach described above is a general framework that can be extended to account for various aspects of molecular genetics. Modeling the birth and death processes as dependent on reported species branch lengths, originating either from the fossil record or from molecular dating involving a collection of external genes, raises the question of how gene- and chromosome region-specific the duplication process is. If this rate varies across genes and genomes, one alternative is to calibrate the clock with dS from the gene of interest (sums of gene tree branch lengths) rather than species-specific values. Another aspect of the model involves the use of the exponential distribution and a constant rate of loss. Comparative genomic modeling suggests that the exponential distribution is a reasonable approximation for loss events that occur between 0.02 and 0.15 dS unit (branches up to this length) in the absence of positive selection or selection for dosage compensation, but that fixation of complementary loss of function events (subfunctionalization) or of neofunctionalization requires a Weibull or related distribution with a declining loss rate with time (Hughes and Liberles, 2007).
The rate of decay of loss appears to differ between models. Of course, as phylogenetics has moved from parsimony, to simple models (Felsenstein, 2004), to gamma-distributed rates (Yang, 1994), to covarion models (Fitch and Markowitz, 1970), to other more sophisticated models (Rodrigue et al., 2006; Stern and Pupko, 2006), there are added levels of biological complexity that may (or may not) be important to incorporate into models. If population genetics parameters such as mutation rate or effective population size change across a phylogeny, it may be important to consider the effects that these changes have on resulting patterns of duplication and loss (Lynch and Conery, 2003). Treating whole-genome duplication differently from stochastic smaller-scale duplication, both in duplication patterns within and across gene families and in loss rates, may be important (Hughes and Liberles, 2008). Finally, while parameterization in a reconciliation may be gene family-specific, extension to lineage-specific modeling may give increased realism in characterizing historical events.
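As a numerical companion to the birth–death expressions for $P_0(t)$, $P_n(t)$, and $q_a(t)$ given earlier in this section, the sketch below evaluates them as they are written there. It is a direct transcription for experimentation and has not been checked against the derivations in Arvestad et al.; the function names are hypothetical, and the code assumes λ ≠ μ so that the denominators do not vanish.

```python
import math


def p_zero(lam, mu, t):
    """P_0(t) as written in this section (assumes lam != mu)."""
    decay = math.exp(-(lam - mu) * t)
    return mu * (1.0 - decay) / (lam - mu * decay)


def p_n(lam, mu, t):
    """P_n(t) as written in this section (assumes lam != mu)."""
    decay = math.exp(-(lam - mu) * t)
    return (lam - mu) / (lam - mu * decay)


def q(a, lam, mu, t):
    """q_a(t): the indicator functions select the a = 0 and a > 0 branches."""
    if a == 0:
        return 1.0 - p_n(lam, mu, t)
    survive = p_n(lam, mu, t)
    zero = p_zero(lam, mu, t)
    return survive * (1.0 - zero) * zero ** (a - 1)


# Example with a duplication rate above the loss rate.
lam, mu, t = 0.3, 0.1, 2.0
total = sum(q(a, lam, mu, t) for a in range(0, 200))
```

With these expressions, $P_n(t)$ equals $1 - P_0(t)$ algebraically, and the values $q_a(t)$ over a = 0, 1, 2, ... form a geometric series that sums to 1, which gives a quick sanity check on any parameter choice.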

9 CONCLUSIONS

With increasing efforts to assemble the tree of life it has become apparent that inconsistencies between gene trees and a trusted species tree are widespread. To support resolving these inconsistencies, research on the GD model is gaining momentum. The GD model has been refined and extended in many ways, which we have surveyed in part. Furthermore, research efforts to assemble the tree of life are assisted by a variety of methods that are based on gene tree reconciliation. Reconciliation is also gaining ground in various problems in gene family analysis. However, inherent problems with the GD model and applications that are based on it remain unresolved. The computational hardness of tree reconciliation for gene and species trees with soft polytomies is still open (Vernot et al., 2008). Extensions of the GD model that simultaneously consider gene duplications together with other evolutionary events have been developed (Hallett and Lagergren, 2001; Hallett et al., 2004; Berglund-Sonnhammer et al., 2006; Górecki, 2006; Vernot et al., 2008). On the other hand, extensions that consider all major evolutionary events at the same time remain to be developed. Future work on extending the GD model might also allow us to include sequence information directly rather than first inferring gene trees from the sequences. In general, supertree problems based on tree reconciliation can be defined using extensions of the GD model. Several of the desirable properties for such supertree problems in the consensus setting are known to be satisfied. Yet, fast hill-climbing heuristics are only available for supertree problems that are based solely on the duplication cost. Further, since gene duplication events can be a part of larger multiple-gene duplication episodes, it might be more helpful to formulate supertree problems by adopting episode-based gene duplication optimality criteria (Fellows et al., 1998), although episode-based gene duplication optimality criteria are still in their infancy and have to be better adapted to more general applications in practice (Burleigh et al., 2008).
Finally, model-based approaches have lagged behind parsimony-based approaches but appear to be poised for rapid growth in development and adoption to address various biological problems.

Acknowledgments

We thank Pawel Górecki, Anke Konrad, Harris T. Lin, and Wen-Chieh Chang for careful reading of the manuscript. This work was supported by the National Science Foundation (awards 0830012, 334832, and 743374).

REFERENCES

Akerborg O, Sennblad B, Arvestad L, Lagergren J. 2009. Simultaneous Bayesian gene tree reconstruction and reconciliation analysis. Proc Natl Acad Sci USA 106:5714–5719.
Arvestad L, Berglund AC, Lagergren J, Sennblad B. 2003. Bayesian gene/species tree reconciliation and orthology analysis using MCMC. Bioinformatics 19(Suppl. 1):i7–i15.
Arvestad L, Berglund AC, Lagergren J, Sennblad B. 2004. Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution. In Bourne PE, Gusfield D (eds.), Proceedings of the Eighth Annual International Conference on Computational Molecular Biology, San Diego, CA, Mar. 27–31, 2004, pp. 326–335.
Arvestad L, Lagergren J, Sennblad B. 2009. The gene evolution model and computing its associated probabilities. J ACM 56:7.
Bansal MS, Eulenstein O. 2008a. The gene-duplication problem: near-linear time algorithms for NNI based local searches. In Mandoiu II, Sunderraman R, Zelikovsky A (eds.), ISBRA, Vol. 4983 of Lecture Notes in Computer Science. New York: Springer-Verlag, pp. 14–25.


Bansal MS, Eulenstein O. 2008b. The multiple gene duplication problem revisited. Bioinformatics 24:i132–i138.
Bansal MS, Eulenstein O. 2008c. An Ω(n²/log n) speed-up of TBR heuristics for the gene-duplication problem. IEEE/ACM Trans Comput Biol Bioinf 5:514–524.
Bansal MS, Burleigh JG, Eulenstein O, Wehe A. 2007. Heuristics for the gene-duplication problem: a Θ(n) speed-up for the local search. In Speed TP, Huang H (eds.), RECOMB, Vol. 4453 of Lecture Notes in Computer Science. New York: Springer-Verlag, pp. 238–252.
Bansal MS, Shamir R. Submitted for publication. A note on the fixed parameter tractability of the gene-duplication problem. IEEE/ACM Trans Comput Biol Bioinf.
Baroni M, Grünewald S, Moulton V, Semple C. 2005. Bounding the number of hybridisation events for a consistent evolutionary history. J Math Biol 51:171–182.
Behzadi B, Vingron M. 2006a. An improved algorithm for the macro-evolutionary phylogeny problem. In Lewenstein M, Valiente G (eds.), CPM, Vol. 4009 of Lecture Notes in Computer Science. New York: Springer-Verlag, pp. 177–187.
Behzadi B, Vingron M. 2006b. Reconstructing domain compositions of ancestral multi-domain proteins. In Bourque G, El-Mabrouk N (eds.), Comparative Genomics, Vol. 4205 of Lecture Notes in Computer Science. New York: Springer-Verlag, pp. 1–10.
Bender MA, Farach-Colton M. 2000. The LCA problem revisited. In Gonnet GH, Panario D, Viola A (eds.), LATIN, Vol. 1776 of Lecture Notes in Computer Science. New York: Springer-Verlag, pp. 88–94.
Berglund-Sonnhammer AC, Steffansson P, Betts MJ, Liberles DA. 2006. Optimal gene trees from sequences and species trees using a soft interpretation of parsimony. J Mol Evol 63:240–250.
Bininda-Emonds ORP. 2004. The evolution of supertrees. Trends Ecol Evol 19:315–322.
Bininda-Emonds ORP, Gittleman JL, Steel MA. 2002. The (super)tree of life: procedures, problems, and prospects. Annu Rev Ecol Syst 33:265–289.
Bininda-Emonds ORP, Cardillo M, Jones KE, MacPhee RDE, Beck RMD, Grenyer R, Price SA, Vos RA, Gittleman JL, Purvis A. 2007. The delayed rise of present-day mammals. Nature 446:507–512. Blanc G, Wolfe KH. 2004. Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell 16:1667–1678. Blanc G, Hokamp K, Wolfe KH. 2003. A recent polyploidy superimposed on older large-scale duplications in the Arabidopsis genome. Genome Res 13:137–144. Bonizzoni P, Della Vedova G, Dondi R. 2005. Reconciling a gene tree to a species tree under the duplication cost model. Theor Comput Sci 347:36–53. Bordewich M, Semple C. 2004. On the computational complexity of the rooted subtree prune and regraft distance. Ann Combinatorics 8409—423. Bowers JE, Chapman BA, Rong J, Paterson AH. 2003. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422:433–438. Burleigh JG, Bansal MS, Wehe A, Eulenstein O. 2008. Locating multiple gene duplications through reconciled trees. In Vingron M, Wong L (eds.), RECOMB , Vol. 4955 of Lecture Notes in Computer Science. New York: Springer-Verlag, pp. 273–284. Burleigh J, Bansal M, Wehe A, Eulenstein O. 2009. Locating large-scale gene duplication events through reconciled trees: implications for identifying ancient polyploidy events in plants. J Comput Biol 16:1071–1083. Cannon SB, Sterck L, Rombauts S, Sato S, Cheung F, Gouzy J, Wang X, Mudge J, Vasdewani J, Schiex T, et al. 2006. Legume genome evolution viewed through the Medicago truncatula and Lotus japonicus genomes. Proc Nat Acad Sci USA 103:14959–14964.

204

RECONCILING PHYLOGENETIC TREES

Chang WC, Eulenstein O. 2006. Reconciling gene trees with apparent polytomies. In Chen DZ, Lee DT (eds.), COCOON , Vol. 4112 of Lecture Notes in Computer Science. New York: Springer-Verlag, pp. 235–244. Chen K, Durand D, Farach-Colton M. 2000. NOTUNG: a program for dating gene duplications and optimizing gene family trees. J Comput Biol 7:429–447. Davies TJ, Barraclough TG, Chase MW, Soltis PS, Soltis DE, Savolainen V. 2004. Darwin’s abominable mystery: insights from a supertree of the angiosperms. Proc Natl Acad Sci USA 101:1904–1909. Durand D, Halld´orsson BV, Vernot B. 2006. A hybrid micro-macroevolutionary approach to gene tree reconstruction. J Comput Biol 13:320–335. Eulenstein O. 1998. Vorhersage von Genduplikationen und deren Entwicklung in der Evolution. Ph.D. dissertation, Rheinische Friedrich-Wilhelms-Universit¨at Bonn. Eulenstein O, Vingron M. 1998. On the equivalence of two tree mapping measures. Discrete Appl Math Comb Oper Res Comput Sci 88:103–128. Fellows MR, Hallett MT, Stege U. 1998. On the multiple gene duplication problem. In Chwa KY, Ibarra OH (eds.), ISAAC Vol. 1533 of Lecture Notes in Computer Science. New York: Springer-Verlag, pp. 347–356. Fellows MR, Hallett MT, Stege U. 2003. Analogs and duals of the MAST problem for sequences and trees. J Algorithms 49:192–216. Felsenstein J. 2004. Inferring Phylogenies. Sunderland, MA: Sinauer Associates. www.loc.gov/ catdir/toc/ecip043/2003008942.html. Fitch WM, Markowitz E. 1970. An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochem Genet 4:579–593. Garey MR, Johnson DS. 1979. Computers and Intractability: A Guide to the Theory of NPCompleteness. San Francisco: W.H. Freeman. Goodman M, Czelusniak J, Moore GW, Romero-Herrera AE, Matsuda G. 1979. Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst Zool 28:132–163. Gordon AD. 1986. 
Consensus supertrees: the synthesis of rooted trees containing overlapping sets of labelled leaves. J Classification 3:335–348. G´orecki P. 2006. Detection of horizontal gene transfer. Ph.D. dissertation, Warsaw University. G´orecki P, Tiuryn J. 2004. On the structure of reconciliations. In Lagergren J (ed.), Comparative Genomics, Vol. 3388 of Lecture Notes in Computer Science. New York: Springer-Verlag, pp. 42–54. G´orecki P, Tiuryn J. 2007a. Inferring phylogeny from whole genomes. Bioinformatics 23:e116–e122. 10.1093/bioinformatics/btl296. G´orecki P, Tiuryn J. 2007b. Urec: a system for unrooted reconciliation. Bioinformatics 23:511–512. Grant V. 1981. Plant Speciation. New York: Columbia University Press. Guig´o R, Muchnik IB, Smith TF. 1996. Reconstruction of ancient molecular phylogeny. Mol Phylogenet Evol 6:189–213. Guyot R, Keller B. 2004. Ancestral genome duplication in rice. Genome 47:610–614. Hahn MW, De Bie T, Stajich JE, Nguyen C, Cristianini N. 2005. Estimating the tempo and mode of gene family evolution from comparative genomic data. Genome Res 15:1153–1160. Hallett MT, Lagergren J. 2001. Efficient algorithms for lateral gene transfer problems. In RECOMB , Vol. of Lecture Notes in Computer Science. New York: Springer-Verlag, pp. 149–156.

REFERENCES

205

Hallett MT, Lagergren J, Tofigh A. 2004. Simultaneous identification of duplications and lateral transfers. In Bourne PE, Gusfield D. (eds.), Proceedings of the Eighth Annual International Conference on Computational Molecular Biology, San Diego, CA, Mar. 27–31, 2004, pp. 347–356. Harding EF. 1971. The probabilities of rooted tree-shapes generated by random bifurcation. Adv Appl Prob 3:44–77. Hughes T, Liberles DA. 2007. The pattern of evolution of smaller-scale gene duplicates in mammalian genomes is more consistent with neo- than subfunctionalisation. J Mol Evol 65:574–588. Hughes T, Liberles D. 2008. A characterisation of the effects of speciation and whole genome duplication on the distribution of gene family sizes and its application to detecting large scale duplication. J Mol Evol 67:343–357. Luo CW, Chen MC, Chen YC, Yang RW, Liu HF, Chao KM. 2009. Linear-time algorithms for the multiple gene duplication problems. IEEE/ACM Trans Comput Biol Bioinf 15:1545–5963. Lynch M, Conery JS. 2003. The evolutionary demography of duplicate genes. J Struct Funct Genom 3:35–44. Ma B, Li M, Zhang L. 2000. From gene trees to species trees. SIAM J Comput 30:729–752. Maddison WP. 1989. Reconstructing character evolution on polytomous cladograms. Cladistics Int J Willi Hennig Soc 5:365–377. Mirkin B, Muchnik IB, Smith TF. 1995. A biologically consistent model for comparing molecular phylogenies. J Comput Biol 2:493–507. Novozhilov AS, Karev GP, Koonin EV. 2006. Biological applications of the theory of birthand-death processes. Brief Bioinf 7:70–85. Page RDM. 1994. Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas. Syst Biol 43:58–77. Page RDM. 1998. GeneTree: comparing gene and species phylogenies using reconciled trees. Bioinformatics 14:819–820. Page RDM, Cotton JA. 2002. Vertebrate phylogenomics: reconciled trees and gene duplications. Pacific Symposium on Biocomputing, pp. 536–547. Paterson AH, Bowers JE, Chapman BA. 
2004. Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proc Natl Acad Sci USA 101:9903–9908. Rensing SA, Ick J, Fawcett JA, Lang D, Zimmer A, Van de Peer Y, Reski R. 2007. An ancient genome duplication contributed to the abundance of metabolic genes in the moss physcomitrella patens. BMC Evol Biol 7130. Robinson DF. 1971 Comparison of labeled trees with valency three. J Comb Theory 11:105–119. Rodrigue N, Philippe H, Lartillot N. 2006. Assessing site-interdependent phylogenetic models of sequence evolution. Mol Biol Evol 23:1762–1775. Rong J, Abbey C, Bowers JE, Brubaker CL, Chang C, Chee PW, Delmonte TA, Ding X, Garza JJ, Marler BS, et al. 2004. A 3347-locus genetic recombination map of sequence-tagged sites reveals features of genome organization, transmission and evolution of cotton (gossypium). Genetics 166:389–417. Sanderson MJ, Purvis A, Henze C. 1998. Phylogenetic supertrees: assembling the trees of life. Trends Ecol Evol 13:105–109. Schlueter JA, Dixon P, Granger C, Grant D, Clark L, Doyle JJ, Shoemaker RC. 2004. Mining EST databases to resolve evolutionary events in major crop species. Genome 47:868–876.

206

RECONCILING PHYLOGENETIC TREES

Schranz ME, Mitchell-Olds T. 2006. Independent ancient polyploidy events in the sister families Brassicaceae and Cleomaceae. Plant Cell 18:1152–1165. Simillion C, Vandepoele K, Van Montagu MCE, Zabeau M, Van de Peer Y. 2002. The hidden duplication past of Arabidopsis thaliana. Proc Natl Acad Sci USA 99:13627–13632. Slowinski JB. 2001. Molecular polytomies. Mol Phylogenet Evol 19:114–120. Stebbins GL. 1950. Variation and Evolution in Plants. New York: Columbia University Press. Sterck L, Rombauts S, Jansson S, Sterky F, Rouze P, Van de Peer Y. 2005. EST data suggest that poplar is an ancient polyploid. New Phytol 167:165–170. Stern A, and Pupko T. 2006. An evolutionary space–time model with varying among-site dependencies. Mol Biol Evol 23:392–400. Vandepoele K, Simillion C, Van de Peer Y. 2003. Evidence that rice and other cereals are ancient aneuploids. Plant Cell 15:2192–2202. Vernot B, Stolzer M, Goldman A, Durand D. 2008. Reconciliation with non-binary species trees. J Comput Biol 15:981–1006. Vision TJ, Brown DG, Tanksley SD. 2000. The origins of genomic duplications in arabidopsis. Science 290:2114–2117. Wang X, Shi X, Hao B, Ge S, Luo J. 2005. Duplication and DNA segmental loss in the rice genome: implications for diploidization. New Phytol 165:937–946. Wehe A, Bansal MS, Burleigh JG, Eulenstein O. 2008. DupTree: a program for large-scale phylogenetic analyses using gene tree parsimony. Bioinformatics 24:1540–154. Wolf YI, Rogozin IB, Grishin NV, Koonin EV. 2002. Genome trees and the tree of life. Trends Genetics 18:472–479. Yang Z. 1994. Maximum likelihood phylogenetic estimation from dna sequences with variable rates over sites: approximate methods. J Mol Evol 39:306–314. Yu J, Wang J, Lin W, Li S, Li H, Zhou J, Ni P, Dong W, Hu S, Zeng C, et al. 2005. The genomes of Oryza sativa: a history of duplications. PLoS Biol 3e38. Zhang L. 1997. On a Mirkin–Muchnik–Smith conjecture for comparing molecular phylogenies. J Comput Biol 4:177–187. Zhang L. 2000. 
Inferring a species tree from gene trees under the deep coalescence cost. RECOMB 2000 poster. Zmasek CM, Eddy SR. 2001. A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics 17:821–828.

11 On the Energy and Material Cost of Gene Duplication

ANDREAS WAGNER
Department of Biochemistry, University of Zurich, Zurich, Switzerland; The Santa Fe Institute, Santa Fe, New Mexico; Swiss Institute of Bioinformatics, Lausanne, Switzerland

1 INTRODUCTION

A gene duplication first occurs in a single individual of an evolving population. The duplicate may then increase in frequency or become extinct again. Genetic drift and natural selection may be responsible for either fate. If natural selection is involved, one must distinguish two principal contributors to this fate: duplication benefits and duplication costs.

Gene duplication has long- and short-term evolutionary benefits. Among the long-term benefits is the ability to facilitate evolutionary innovation through the evolution of new molecular activities in one of the gene copies, a notion first popularized by Ohno (1970). However, such long-term benefits may be irrelevant for the immediate fate of a gene duplicate after it first arises. Shorter-term benefits include advantages of increased gene dosage and thus increased gene expression. Such advantages may exist both for gene products that are in extremely high demand in a cell and for genes that are expressed at very low levels when in single copy. In the latter case, noisy gene expression is at the root of the benefit. Noisy gene expression is ubiquitous, but especially prevalent for lowly expressed genes (Bar-Even et al., 2006). For such genes, the amount of gene product in a cell can show dramatic fluctuations, and for long periods of time the cell may contain little or none of the product. If the product is important to the life cycle of a cell, it is advantageous to alleviate these fluctuations via an increase in the average expression level (Cook et al., 1998). Gene duplication is one avenue to such an increase.

Another short-term benefit arises in cases where a gene’s duplicate is not equal in sequence and function to the original. If the new function is beneficial to the cell, its carrier may rise in frequency through natural selection.
Both anecdotal evidence (Long and Langley, 1993) and systematic work on genome-scale data (Katju and Lynch, 2003; Vinckenbosch et al., 2006) show that new genes can indeed originate in this way.

The second factor influencing a gene duplication’s fate through natural selection is the cost of a duplication. A duplication will generally result in an increase in a cell’s genome size. This may increase the time needed for DNA replication (and cell division), as well as the energy and material needed for DNA replication. As a result, cells with only a single copy of any one gene might be able to divide slightly faster. This cost component, however, is likely to play only a minor role. The generally small increase in genomic DNA associated with a single-gene duplication might cause a small replication delay in prokaryotes with a single replication origin, but not so in eukaryotes, where DNA replication is initiated simultaneously at thousands of replication origins in the genome. For example, the genome of Xenopus laevis is approximately 1000 times larger than that of Escherichia coli. Nonetheless, it can replicate in some 30 minutes, not much longer than the minimum cell division time of E. coli (Alberts, 2002). In addition, the energy and material cost of synthesizing the added DNA is negligible compared to that of gene expression. For example, dividing yeast (Saccharomyces cerevisiae) cells can double their biomass every 90 minutes. Fifty percent of this biomass consists of protein and RNA, but only 0.4% consists of DNA (Forster et al., 2003).

Two other cost components are likely to be more important than a gene duplication’s influence on genome size. Both stem from the increase in expression caused by a duplication. While cells may compensate for changes in gene dosage by adjusting expression levels (Kafri and Pilpel, 2004), for example through negative feedback regulation of the duplicated gene or via limited availability of transcription factors, such mechanisms may not be prevalent (Wong and Roth, 2005; He and Zhang, 2006). In the absence of such mechanisms, one would expect an approximate doubling of a gene’s expression level after duplication, if a regulatory region is duplicated in its entirety along with the coding region.
Increased gene expression may interfere with cellular life in a variety of ways. For example, the newly expressed gene product may bind to proteins that are then no longer available for other, necessary protein interactions. This is one of several ways in which increased gene expression may be toxic to a cell. Second, gene expression requires both energy (in the form of ATP) and materials (nucleotides and amino acids) which incur a cost on a cell’s energy budget or material budget, whenever this budget is limited. It is very difficult to disentangle the relative contributions of expression toxicity and its energy or material cost, partly because toxicity has many faces. I will discuss recent evidence from the yeast S. cerevisiae that gene expression cost alone—even disregarding potentially toxic effects of increased expression—can affect the fate of most duplicates, at least in organisms with large population sizes. Before that, however, I need to ask how small an expression cost can be visible to natural selection.

2 COSTS VISIBLE TO NATURAL SELECTION

The fitness cost of any mutation, including gene duplications, is typically expressed in terms of a selection coefficient $s$, a fitness reduction relative to the wild type that its carrier suffers. In a diploid organism, the magnitude of $s$ below which genetic drift influences a mutation’s fate more strongly than natural selection is $s < 1/(4N_e)$ (Kimura, 1983). Here $N_e$ is the effective size of a population, which can be estimated from the nucleotide diversity at synonymous sites. Existing nucleotide diversity data show that for yeast, the critical $s$ below which drift is stronger than selection is smaller than $5 \times 10^{-7}$ (Wagner, 2005, 2007; Bragg and Wagner, 2007, 2009). This means that minute effects of mutations, many orders of magnitude smaller than could be detected in the laboratory, can affect the fate of a mutation.
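The threshold described in this section can be made concrete with a short calculation. The following is a minimal sketch, not part of the original analysis: the function names are hypothetical, and the effective population size used in the example is an illustrative value chosen so that the threshold equals the upper bound of 5 × 10^-7 quoted for yeast.

```python
def critical_selection_coefficient(effective_population_size: float) -> float:
    """Return the drift threshold 1 / (4 * Ne) for a diploid population."""
    if effective_population_size <= 0:
        raise ValueError("effective population size must be positive")
    return 1.0 / (4.0 * effective_population_size)


def is_visible_to_selection(s: float, effective_population_size: float) -> bool:
    """True if the magnitude of s exceeds the drift threshold."""
    return abs(s) > critical_selection_coefficient(effective_population_size)


# Hypothetical Ne, chosen so that the threshold equals 5e-7.
ne = 500_000
s_crit = critical_selection_coefficient(ne)

print(f"critical s = {s_crit:.1e}")
print(is_visible_to_selection(1e-5, ne))  # cost well above the threshold
print(is_visible_to_selection(1e-8, ne))  # cost well below the threshold
```

Because the threshold scales inversely with the effective population size, a tenfold smaller population in this sketch would tolerate costs ten times larger before selection outweighs drift.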

3 ENERGY COST OF GENE EXPRESSION IN THE YEAST Saccharomyces cerevisiae

A dividing cell needs a certain amount of energy per division cycle, much of it invested in building cell biomass. It is reasonable to assume that the production of such energy is one of the limiting factors in cell proliferation. If so, then increasing the expression of any one gene leaves less of this energy for growing the remaining biomass, which would delay cell proliferation by an amount corresponding to the fraction of energy diverted to the gene’s expression. Thus, the fractional energy cost of expressing any one gene is an indicator of the fitness effect s that a duplication of this gene has, in situations where cell growth rate is proportional to fitness. I note that gene expression itself is responsible for a substantial fraction of biomass production. As mentioned above, in yeast, RNA and protein comprise fully half of a cell’s biomass (Forster et al., 2003).

The energy cost of gene expression has many components. First, nucleotide precursors need to be synthesized, which carries a cost in terms of both material and energy. Second, these nucleotide precursors need to be strung together in transcription to make messenger RNA. Third, amino acids need to be synthesized. Fourth, these amino acids need to be polymerized in translation.

In addition, one needs to take into account different rates of protein and RNA turnover. Both kinds of molecules are constantly synthesized and degraded at molecule-specific rates that vary over several orders of magnitude (Wang et al., 2002; Belle et al., 2006). The absolute steady-state concentration of an RNA or protein molecule is thus not very informative about its expression cost. A molecule might experience fast synthesis and high decay rates, or slow synthesis and low decay rates, both of which might yield the same steady state, but at very different cost to a cell.
In sum, to estimate the expression cost of genes, we need information about precursor synthesis costs, synthesis rates, and half-life. Currently, such information is available on a genome scale for only one organism, the yeast S. cerevisiae. By integrating a vast amount of genome-scale information on mRNA and protein levels, mRNA and protein half-lives, nucleotide composition of genes, and nucleotide and amino acid synthesis costs, one can determine what fraction of a cell’s gene expression energy cost goes into the expression of any one gene. The result is a distribution of selection coefficients associated with doubling the expression for each of thousands of yeast genes (Figure 1) (Wagner, 2005, 2007). Strikingly, all yeast genes for which expression information is available have expression costs vastly greater than the critical s discussed above. This holds regardless of whether the cells grow under fermentative or respiratory conditions. This means that for yeast genes expressed at any level, duplication would generally carry a cost visible to selection. To be sure, this assertion relies on some assumptions, among them that the energy cost of producing RNA and protein biomass is not vastly different from that of the cell’s remaining biomass (among it, many lipids and sugars). However, even if all selection coefficients in Figure 1 were overestimated tenfold, duplication of most yeast genes would still be subject to costs visible to selection.
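The point that steady-state abundance alone says little about expression cost can be illustrated with a small calculation. This is a minimal sketch under stated assumptions, not the procedure used in the cited studies: it assumes first-order decay, ignores dilution by cell growth, and uses invented numbers for abundance, half-life, per-molecule cost, and the total energy budget. All function and variable names are hypothetical.

```python
import math


def synthesis_rate(steady_state_level: float, half_life: float) -> float:
    """Molecules that must be made per unit time to hold a steady-state level.

    Assumes first-order decay, so the decay constant is ln(2) / half_life
    and synthesis exactly balances decay at steady state.
    """
    return steady_state_level * math.log(2) / half_life


def fractional_cost(level, half_life, cost_per_molecule, total_budget_rate):
    """Fraction of the cell's energy budget (per unit time) spent on one product."""
    return synthesis_rate(level, half_life) * cost_per_molecule / total_budget_rate


# All numbers below are hypothetical and purely illustrative.
budget = 1.0e9  # energy units the whole cell spends per minute

# Two products with the same steady-state level but different half-lives (minutes).
stable = fractional_cost(level=1000, half_life=100.0,
                         cost_per_molecule=500.0, total_budget_rate=budget)
unstable = fractional_cost(level=1000, half_life=5.0,
                           cost_per_molecule=500.0, total_budget_rate=budget)

print(f"stable product:   fractional cost = {stable:.2e}")
print(f"unstable product: fractional cost = {unstable:.2e}")
print(f"cost ratio (unstable / stable) = {unstable / stable:.1f}")
```

In this sketch the two products are equally abundant, yet the short-lived one costs twenty times more, because the cost tracks the synthesis rate and not the standing amount.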

[Figure: histogram of log10(s) (x-axis, −7.5 to −1.5) against number of genes (y-axis, 0 to 100), with a shaded region labeled "Neutral Zone".]

Figure 1 Distribution of the fractional energy cost s of doubling gene expression for the yeast S. cerevisiae. The gray zone indicates a region where the cost is too small to be visible to natural selection, based on effective population size estimates of yeast. (After Wagner, 2007.)

4 MATERIAL COST OF GENE EXPRESSION IN THE YEAST SACCHAROMYCES CEREVISIAE

Some elements are major components of the biomass produced in gene expression. Specifically, RNA contains carbon, nitrogen, and phosphorus. Protein contains carbon, nitrogen, and sulfur. These elements can severely restrict the growth of organisms when their availability is limited. Such limitation can also foster fierce competition. In an environment where any one element is limiting, an increase in expression of any one gene will divert elemental nutrients to the gene product and may thus reduce the rate of cell proliferation.

Because the chemical compositions of amino acids and nucleotides are known, and because we have complete genome sequence information, we can determine the amount of any one element invested into a single RNA or protein molecule. In combination with the known biomass composition of yeast, and with available information on mRNA and protein expression levels and half-lives, we can thus determine, for each element and gene, the material cost of doubling gene expression. This cost can be expressed as a fraction s of a cell’s estimated total material budget. By relating s to a critical selection coefficient, as outlined above, one can determine whether a given cost increase is visible to natural selection (Bragg and Wagner, 2007, 2009).

With this approach, one finds that for more than 97% of yeast genes and for the elements carbon, nitrogen, and sulfur, the cost of doubling expression is a factor of 10 greater than the critical selection coefficient. The effect of phosphorus limitation is less dramatic, being visible for only 94% of duplicated genes. These numbers change if any one element is not strongly but weakly limiting. For example, if a fractional increase in expression cost by x causes a reduction in fitness not by x but merely by x/4, a doubling of expression would still be visible to selection for more than 90% of genes.
In sum, for any element that is growth limiting, gene duplication causes significant material costs for the vast majority of genes, similar to what I discussed earlier for energy costs.
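The first step of the material-cost calculation, counting the atoms of an element invested in one protein molecule, can be sketched as follows. This is an illustrative sketch only: the atom table lists the carbon, nitrogen, and sulfur atoms of the 20 standard amino acids (these counts are unchanged by peptide-bond formation, which removes only water), while the function names, the example peptide, and the `sensitivity` parameter used to model weak limitation are assumptions introduced here.

```python
# (carbon, nitrogen, sulfur) atoms per amino acid, keyed by one-letter code.
ATOMS = {
    "A": (3, 1, 0), "R": (6, 4, 0), "N": (4, 2, 0), "D": (4, 1, 0),
    "C": (3, 1, 1), "Q": (5, 2, 0), "E": (5, 1, 0), "G": (2, 1, 0),
    "H": (6, 3, 0), "I": (6, 1, 0), "L": (6, 1, 0), "K": (6, 2, 0),
    "M": (5, 1, 1), "F": (9, 1, 0), "P": (5, 1, 0), "S": (3, 1, 0),
    "T": (4, 1, 0), "W": (11, 2, 0), "Y": (9, 1, 0), "V": (5, 1, 0),
}


def element_counts(protein: str) -> dict:
    """Count carbon, nitrogen, and sulfur atoms in one protein molecule."""
    carbon = nitrogen = sulfur = 0
    for residue in protein.upper():
        c, n, s = ATOMS[residue]
        carbon += c
        nitrogen += n
        sulfur += s
    return {"C": carbon, "N": nitrogen, "S": sulfur}


def visible_under_limitation(fractional_cost: float, critical_s: float,
                             sensitivity: float = 1.0) -> bool:
    """True if the fitness effect of a cost exceeds the critical coefficient.

    sensitivity = 1.0 models strong limitation (fitness drops by the full
    fractional cost); sensitivity = 0.25 models the weaker x/4 case.
    """
    return fractional_cost * sensitivity > critical_s


print(element_counts("MKC"))  # hypothetical three-residue peptide
print(visible_under_limitation(1e-6, 5e-7))                    # strong limitation
print(visible_under_limitation(1e-6, 5e-7, sensitivity=0.25))  # weak limitation
```

A full estimate would multiply such per-molecule counts by synthesis rates and divide by the cell's total budget for the element, as the text describes; those steps need the genome-scale expression and half-life data and are not reproduced here.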


Energy cost and material cost of a gene’s expression are highly positively correlated (Bragg and Wagner, 2007, 2009). Genes with a high energy cost of expression also tend to have a high material cost. It is easy to see why. A substantial part of both costs comes from the rate of synthesis for mRNA and protein molecules, which enter the calculation of both energy and material in identical ways. An additional contribution to this correlation comes from the fact that chemically complex amino acids, containing more atoms of a given type, tend to consume more energy in biosynthesis than simpler amino acids. (The cost differences among different nucleotides are much smaller than those among different amino acids and are thus less important.)

5 THE LAC OPERON AS AN EXPERIMENTAL SYSTEM TO STUDY EXPRESSION COSTS

I now highlight some recent experimental work on the lac operon that sheds light on the cost of expression for very highly expressed genes. The lac operon is one of the best-studied regulatory systems inside cells (Alberts, 2002). Its three gene products are a β-galactosidase (product of the lacZ gene), a permease (lacY), and a transacetylase (lacA). The first two of these products are necessary to metabolize the sugar lactose. The expression of the lac operon is highly regulated and turned on only if lactose is available in the cell’s environment. In such environments, the operon is expressed at very high levels.

The advantage of this system is that its regulation can be manipulated either through mutations or through artificial inducers. One such inducer is isopropyl-β-D-thiogalactoside (IPTG). IPTG induces the lac operon, but the cell does not gain any benefit from this induction, because unlike lactose, IPTG cannot feed into energy metabolism. A recent study (Dekel and Alon, 2005) took advantage of this property to measure the cost of expressing the lac operon at various levels of induction. It concluded that full induction of the lac operon with IPTG leads to a reduction in the cell division rate of 4.5%. Although this type of approach cannot strictly exclude the possibility that the cost of expression reflects toxicity of the gene products, this seems unlikely in the case of the lac operon. The reason is that the high expression state is not just induced in the laboratory under unphysiological conditions with an artificial inducer, but is also vital under physiological conditions in lactose-containing environments.

Another study took advantage of mutations that render lacZ expression constitutive (Stoebel et al., 2008). It estimated that lac operon expression in lactose-free environments leads to a 10% reduction in growth rate. Most of this cost comes from expressing β-galactosidase (Stoebel et al., 2008). Tagging the β-galactosidase product with a peptide that decreases its half-life and thus recycles its amino acids reduces this cost dramatically. This suggests that the bulk of the cost for expressing this protein does not come from the biosynthesis of the protein and its amino acids. Aside from the possibility that the cost of transcription is of major importance, it is also conceivable that the extremely high lac expression sequesters RNA polymerases or ribosomes, rendering them unavailable for expressing other genes at appropriate levels.

Experimental approaches like these are powerful, because they can demonstrate the effects of gene expression on cell growth directly. However, they can detect the expression costs of only the most highly expressed genes, because experiments are able to resolve selection coefficients only to a lower limit of approximately $10^{-3}$.

In organisms with large effective population size, much smaller selection coefficients are still visible to selection. Importantly, most genes have small selection coefficients associated with a doubling of gene expression. In yeast, doubling the expression of most genes would lead to expression costs much smaller than $10^{-3}$. The example just discussed also shows that for the enormous changes in expression that occur in the lac operon, factors independent of material or energy cost, such as the sequestering of polymerases or ribosomes, may come into play. These factors may play a smaller role for more lowly expressed genes and for smaller expression changes, such as those observed in a gene’s duplication.
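The gap between what experiments can resolve and what selection can see can be summarized in a small classifier. This is a hedged sketch: the two constants are the approximate values quoted in the text (a laboratory resolution of about 10^-3 and an upper bound of 5 × 10^-7 on the critical selection coefficient in yeast), while the function name and category labels are hypothetical.

```python
LAB_RESOLUTION = 1e-3     # approximate lower limit resolved by experiments
CRITICAL_S_YEAST = 5e-7   # upper bound on the drift threshold quoted for yeast


def classify_cost(s: float,
                  critical_s: float = CRITICAL_S_YEAST,
                  lab_resolution: float = LAB_RESOLUTION) -> str:
    """Place a fractional expression cost s into one of three regimes."""
    if s >= lab_resolution:
        return "measurable in the laboratory"
    if s > critical_s:
        return "visible to selection but not to experiment"
    return "effectively neutral"


# 0.045 is the growth-rate reduction reported for full lac induction;
# the other two values are hypothetical.
for cost in (0.045, 1e-5, 1e-8):
    print(f"s = {cost:g}: {classify_cost(cost)}")
```

The middle regime spans more than three orders of magnitude in this sketch, which is why evolutionary signatures, and not direct measurement, are needed to study the costs of most genes.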

6 EVOLUTIONARY COST SIGNATURES

Where experiments cannot reach, patterns of evolutionary change may inform us about the impact of duplication costs. A genome-scale analysis of gene duplicates in yeast shows that genes with high carbon and nitrogen expression cost have fewer surviving duplicates (Bragg and Wagner, 2007). In such an analysis, it is important to correct for gene expression levels, because genes with high expression may also evolve a nucleotide composition with low elemental or energy cost (Akashi and Gojobori, 2002; Fauchon et al., 2002; Elser et al., 2006; Heizer et al., 2006). However, the association persists when differences in expression levels are taken into account (Bragg and Wagner, 2007). In addition to this example pertaining to gene duplications, a number of studies have demonstrated that energetic and material costs of expression shape the composition of proteins. For example, Akashi and Gojobori (2002) showed that in E. coli highly expressed proteins show increased abundance of energetically cheap amino acids. In addition, proteins needed to assimilate carbon tend to contain fewer carbon-costly amino acids than other proteins (Baudouin-Cornu et al., 2001). A similar pattern holds for proteins involved in sulfur assimilation (Baudouin-Cornu et al., 2001). These patterns probably reflect an evolutionary adaptation which ensures that nutrient assimilation can remain active if a nutrient becomes scarce. As discussed earlier, expression cost is only one of multiple factors affecting the fate of duplicate genes. That it can leave genomic signatures at all is thus astounding. It suggests that expression cost has a strong influence on molecular evolution. Benefits of duplication, however, can also leave genomic signatures. For example, highly active metabolic enzymes (i.e., metabolic enzymes with high metabolic flux) tend to be encoded by a greater number of duplicate genes than are less active enzymes (Papp et al., 2004; Vitkup et al., 2006). 
This pattern probably reflects the advantage of increased gene dosage for such enzymes, an advantage that may override their large expression cost. The types of signatures gene duplication leaves in a genome reflect whether a duplicate’s fate is dominated by either benefit or cost.

7 CONCLUSIONS

In microbial organisms, the doubling of expression associated with many gene duplications carries significant energetic and material costs. Such duplications thus do not go to fixation neutrally. Because most genomes contain large numbers of duplicate genes, one can infer that gene duplication often confers adaptive advantages that outweigh these costs. To investigate the nature of these advantages is one part of a promising research program that will yield insight into the evolutionary forces shaping genomes. Another part is the investigation of expression costs in higher, multicellular organisms. Because of their smaller effective population sizes, selection is a weaker evolutionary force in these organisms. It is currently unclear whether the observations discussed here apply to higher organisms.

Acknowledgments

I thank the Swiss National Foundation for support through SNF grant 315200-116814 and through the YeastX program from SystemsX.ch.

REFERENCES Akashi H, Gojobori T. 2002. Metabolic efficiency and amino acid composition in the proteomes of Escherichia coli and Bacillus subtilis. Proc Natl Acad Sci USA 99:3695–3700. Alberts B. 2002. Molecular Biology of the Cell . New York: Garland Science. Bar-Even A, Paulsson J, et al. 2006. Noise in protein expression scales with natural protein abundance. Nat Genet 38:636–643. Baudouin-Cornu P, Surdin-Kerjan Y, et al. 2001. Molecular evolution of protein atomic composition. Science 293:297–300. Belle A, Tanay A, et al. 2006. Quantification of protein half-lives in the budding yeast proteome. Proc Natl Acad Sci USA 103:13004–13009. Bragg J, Wagner A. 2007. Protein carbon content evolves in response to carbon availability and may influence the fate of duplicate genes. Proc R Soc Lond Ser B 274:1063–1070. Bragg J, Wagner A. 2009. Protein material costs: single atoms can make an evolutionary difference. Trends Genet 25:5–8. Cook DL, Gerber LN, et al. 1998. Modeling stochastic gene expression: implications for haploinsufficiency. Proc Natl Acad Sci USA 95:15641–15646. Dekel E, Alon U. 2005. Optimality and evolutionary tuning of the expression level of a protein. Nature 436:588–592. Elser JJ, Fagan WF, et al. 2006. Signatures of ecological resource availability in the animal and plant proteomes. Mol Biol Evol 23:1946–1951. Fauchon M, Lagniel G, et al. 2002. Sulfur sparing in the yeast proteome in response to sulfur demand. Mol Cell 9:713–723. Forster J, Famili I, et al. 2003. Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network. Genome Res 13:244–253. He XL, Zhang JZ. 2006. Transcriptional reprogramming and backup between duplicate genes: Is it a genomewide phenomenon? Genetics 172:1363–1367. Heizer EM, Raiford DW, et al. 2006. Amino acid cost and codon-usage biases in 6 prokaryotic genomes: a whole-genome analysis. Mol Biol Evol 23:1670–1680. Kafri RB-E, Pilpel Y. 2004. Transcription control reprogramming in genetic backup circuits. 
Nat Genet 37:295–299. Katju V, Lynch M. 2003. The structure and early evolution of recently arisen gene duplicates in the Caenorhabditis elegans genome. Genetics 165:1793–1803. Kimura M. 1983. The Neutral Theory of Molecular Evolution. Cambridge; UK: Cambridge University Press.


Long MY, Langley CH. 1993. Natural-selection and the origin of jingwei, a chimeric processed functional gene in Drosophila. Science 260:91–95. Ohno S. 1970. Evolution by Gene Duplication. New York: Springer-Verlag. Papp B, Pál C, et al. 2004. Metabolic network analysis of the causes and evolution of enzyme dispensability in yeast. Nature 429:661–664. Stoebel D, Dean A, et al. 2008. The cost of expression of Escherichia coli lac operon proteins is in the process, not the products. Genetics 178:1653–1660. Vinckenbosch N, Dupanloup I, et al. 2006. Evolutionary fate of retroposed gene copies in the human genome. Proc Natl Acad Sci USA 103:3220–3225. Vitkup D, Kharchenko P, et al. 2006. Influence of metabolic network structure and function on enzyme evolution. Genome Biol 7:R39. Wagner A. 2005. Energy constraints on the evolution of gene expression. Mol Biol Evol 22:1365–1374. Wagner A. 2007. Energy costs constrain the evolution of gene expression. J Exp Zool B 308B:322–324. Wang YL, Liu CL, et al. 2002. Precision and functional specificity in mRNA decay. Proc Natl Acad Sci USA 99:5860–5865. Wong SL, Roth FP. 2005. Transcriptional compensation for gene loss plays a minor role in maintaining genetic robustness in Saccharomyces cerevisiae. Genetics 171:829–833.

12

Fate of a Duplicate in a Network Context ORKUN S. SOYER Engineering, Mathematics and Physical Sciences, University of Exeter, Exeter, UK

1 INTRODUCTION Biology is a result of evolution. Genetic and epigenetic events occurring at the DNA level manifest themselves as phenotypic variation, upon which selection operates. This process leads, in neutral or adaptive fashion, to small functional changes or innovations. Accumulation of these at the molecular level eventually results in changes at higher phenotypic levels, possibly leading to speciation. One well-characterized genetic event in this regard is gene duplication. The origin of many genes can be linked to an ancient duplication event in the genomes analyzed (Zhang, 2003) [see also Brenner et al. (1995) and Vogel and Chothia (2006)]. At first sight this observation might not sound surprising. One might think that organisms frequently need to adapt to new conditions and duplications provide a very good way for doing so; an already functioning gene is generated which only needs to be tweaked with subsequent mutations to achieve a new, desired function. However, for us to observe a duplicate in a given genome millions of years after its birth, it had to spread and fix in the population after arising in one (or a few) individual(s) and had to remain there without losing its function. From the point of evolutionary and population dynamics, this is such an unlikely event that retention of duplicates became one of the most important biological questions still to be answered fully. The question can be separated into two parts: (1) How does a duplicate increase in frequency right after its birth? and (2) how could a duplicate be maintained in the population over a long time? The answer to the first question relates closely to the fitness effects of a duplicate. If the duplication event does not harm the reproductive success of an individual, the duplicate might spread in the population through genetic drift. 
On the other hand, a duplicate would be selected against if it incurs a fitness cost above a critical selective coefficient s = 4/Ne, where Ne refers to the effective population size. It is estimated that energy costs associated with a single duplicate would easily make its fitness effects surpass s if one assumes energy constraints to be an important part of fitness (Wagner, 2005b). However, such energy costs associated with a duplicate could be compensated by other (positive) effects so as to result in overall neutrality or a fitness benefit (Moore and Purugganan, 2003). One intuitive explanation for the latter would be dosage effects.


A high gene dosage would be beneficial if the original gene is weakly expressed and hence subject to stochastic fluctuations (Cook et al., 1998) or if its protein product is in high demand (Papp et al., 2003b). On the other hand, dosage effects could also add to the deleterious effects of a duplicate. It has been suggested, for example, that duplication of a gene whose product functions as part of a complex would disrupt complex formation and be deleterious (Papp et al., 2003a). There is experimental evidence that this is indeed the case, with insufficient complex formation being the probable explanation for the deleterious effects (Deutschbauer et al., 2005; Sopko et al., 2006). Interestingly, following whole-genome duplication, this effect might act as a selective force to retain preferentially, at least initially, those genes that are involved in complexes (Aury et al., 2006; Hakes et al., 2007). Despite the possible mechanistic effects that would doom a single-gene duplicate to “death” right at its birth, let us assume that some duplicates can spread in the population and become fixed as exact copies of their origin. At this point we face the second question posed above: What would be the long-term fate of such a duplicate–original gene pair? We can envision three possible fates: (1) the original gene maintains the original function, whereas the duplicate becomes a pseudogene as a result of deleterious mutations (nonfunctionalization); (2) the original gene maintains the original function, whereas the duplicate acquires a new function as a result of beneficial mutations (neofunctionalization); and (3) the original and duplicate genes each lose some of their function, so that both are required to achieve the original function (subfunctionalization). For duplicate retention, neofunctionalization or subfunctionalization must arise and fix in the population before all duplicates are nonfunctionalized.
A substantial body of theoretical work that uses population genetic models shows that neofunctionalization alone cannot explain duplicate retention unless population size is unrealistically large (Watterson, 1983) or there is a substantial selective advantage for the new function (Walsh, 1995). On the other hand, subfunctionalization emerges as the most likely fate for a duplicate–original gene pair under decreasing population size and increasing relative rate of loss-of-subfunction mutations (Force et al., 1999; Lynch and Force, 2000). This result becomes intuitive when one considers the assumption behind subfunctionalization: it results from nonlethal, nearly neutral mutations that reduce functional efficiency or result in loss of subfunction(s). Since these types of mutations are believed to be more frequent than the beneficial mutations required for neofunctionalization, it is plausible to assume that subfunctionalization would make it easier to escape nonfunctionalization. This view also explains why subfunctionalization should dominate in small (or realistically sized) populations. In such populations, subfunctionalized duplicates can fix quickly, before they can be nonfunctionalized. The larger the population size, the longer the fixation time and hence the greater the risk of becoming nonfunctionalized. This relation between population size and subfunctionalization leads to a possible explanation for the observed complexity (defined as genome size) of higher organisms, which have increasingly smaller effective population sizes (Lynch and Conery, 2003). Despite the elegance and intuition behind the population genetic models, their predictions have mixed support from empirical studies. In fungi, genomic analyses suggest that a larger fraction of duplicates is retained through subfunctionalization than through neofunctionalization (Wapinski et al., 2007), while the opposite trend emerges from the analysis of small-scale duplications in mammals (Hughes and Liberles, 2007).
Furthermore, there are indications that subfunctionalization alone cannot explain all of the duplicate retention observed in yeast (Papp et al., 2003b; Evangelisti and Wagner, 2004). This is supported by other studies which show that evolutionary trajectories to retention are not easily separable into subfunctionalization and neofunctionalization; rather, a more complicated mix of the two processes is in action (He and Zhang, 2005). The distinction between subfunctionalization and neofunctionalization is even more difficult to make in the case of regulatory divergence, which is found to underlie the fate of many duplicates (Gu et al., 2005; Duarte et al., 2006; Tirosh and Barkai, 2007).

2 GENES AS PART OF A LARGER SYSTEM It is possible that some of the discrepancy between empirical evidence and theoretical predictions stems from the fact that population genetic models consider a duplicate–original gene pair in isolation. In reality, almost every gene product functions in the context of a larger network, such as a metabolic, regulatory, or signaling network, and it is usually the output of this larger system that is related to fitness (see Figures 1 and 2). This fact has two important consequences for the fate of duplicate genes.

[Figure 1 graphic: panels (A) and (B), each showing the network at successive points along a Time axis; nodes are labeled R, 1, 1′, 2, and E.]

Figure 1 Hypothetical evolutionary routes following a duplication event in a signaling network. Shown from left to right are representations of a given signaling network over evolutionary time. Each node represents a protein that can activate (indicated by a solid arrow) or deactivate others. Nodes labeled R and E denote a receptor (which receives a signal, i.e., input) and an effector (which mediates a physiological response, i.e., output). Activation and deactivation of proteins in the network can take different forms, including phosphorylation, methylation, and structural interaction. For example, the intermediary protein 1 can be thought of as being phosphorylated by the receptor and transferring this phosphoryl group to an effector protein, as seen, for example, in two-component signaling systems or kinase cascades. Duplication of the intermediary protein initially results in another route of phosphoryl flux to the effector. In scenario A, mutations lead to a phospho-relay where only one of the duplicates is phosphorylated by the receptor and only one of them phosphorylates the effector (dashed arrows represent intermediary interactions attained by mutations). In scenario B, recruitment of a new protein results in lengthening of the cascade. In both scenarios it is expected that the qualitative nature of the response (i.e., the time course of effector concentration in the presence of a signal) would remain unchanged over evolution.

[Figure 2 graphic: networks drawn from R to E, with the labels “Network dynamics D = f([P], k, c, ...)”, “Time course”, “Selection: Network fitness w = f([P], t, ...)”, and “Replication with mutations”.]

Figure 2 Generic approach used in modeling the evolution and dynamics of biological networks. A population of networks is evolved via replication with mutations and under a biologically plausible selection pressure. The latter can be determined according to network dynamics, which is derived from a function of network element concentrations, the kinetic rates governing their interaction, and an initial state. Shown are representations of signaling networks, where each node represents a protein that can activate or deactivate others, and two nodes define the points of signal input (R for receptor) and network output (E for effector). Mutations can be modeled as changes in the set of interactions defining the network (i.e., network topology); shown is an interaction deletion.

First, although mutations happen at the single-gene level, they will have end effects at the network level, changing both system dynamics and stability. Population genetic models fail to capture such effects and make ad hoc assumptions about the effects of mutations at the single-gene level. In reality, a mutation that would be considered deleterious at the protein level might be neutral at the network level (or vice versa). Second, and more important, mutations happen independently at all genes that are part of a larger system. As a result, interactions can be lost and gained, and functions can be changed not only due to mutations in a given gene but also due to mutations in other genes with which it interacts. Further, an interaction lost between two genes can be compensated by another interaction forming among other genes somewhere else in the network (Figure 1). This larger network picture reveals the inherent plasticity in the system, which has direct consequences for the fate of single duplicates. In general, one could argue that being part of a larger network would have two opposing effects on a gene duplicate. On the one hand, it would be less likely that duplication would be a neutral event, and hence the initial increase in duplicate frequency would be more difficult. On the other hand, there could be more chances for the duplicate to diverge and be maintained, as this would happen not only through mutations in the duplicate itself but also through mutations in other genes in the network. The latter idea is supported by empirical analyses of gene regulatory networks, which indicate that consecutive neutral mutations can change the network structure substantially and possibly lead to a new role and retention of a duplicate (Teichmann and Babu, 2004; Tsong et al., 2006). Such plasticity of the gene regulatory network might be due to the ease of generating new regulatory binding sites with few point mutations (Stone and Wray, 2001). It is also important to note that the network view calls into question the concepts of sub- and neofunctionalization. Although it makes sense to use such definitions from the standpoint of single genes, their meaning becomes increasingly hazy when considering network-level changes. For example, duplicate genes in a network can diverge in their interaction set, and while not maintaining the set of interactions of the ancestral gene (so that they would not be considered subfunctionalized), they might still function in a complementary manner, due to changes elsewhere in the network structure (Figure 1). In other words, a true characterization of a duplicate’s fate as subfunctionalized or neofunctionalized would require analysis of its function in the context of that of the entire system. There have already been successful attempts to analyze genomic data using such a network view and complex scenarios for the fate of duplicates (Teichmann and Babu, 2004). Here, I concentrate on theoretical approaches that aim to capture the evolutionary trajectories leading to such scenarios and predict the fate of a duplicate in a network context.

3 STUDYING DYNAMICS AND EVOLUTION AT THE NETWORK LEVEL Studying the fate of a duplicate gene in the context of a larger system of which it is part requires modeling the dynamics and evolution of such systems. Usually, the system considered is that behind a single gene: a set of interacting genes as found in metabolic, regulatory, or signaling networks. The general approach to model evolution at this level is to use the network as the unit under selection. In other words, these models consider not the fitness contribution of a single gene but, rather, that of an entire network. As such, they require a model for network dynamics and a formalism relating this to the fitness of the organism. Further, they need to capture the effects of mutational events at the network level. Figure 2 depicts this generic approach. The dynamics of each single interaction in the network is captured by a mathematical function. This function is then extended to include the entire set of the interactions in the network, resulting in a full dynamic network model that describes the time course of each entity in the network (i.e., activated proteins, expressed genes, etc.). This time course describes the overall behavior of the network, which can be assumed to relate to the overall fitness of an organism. For example, in the case of a regulatory network model, the viability of the organism would be coupled to its ability to reach a certain steady-state expression level for the genes it controls. Based on such arguments, it would then be possible to formulate a function relating network dynamics to fitness. This coupling between fitness and network dynamics allows in silico evolution; a homogeneous or heterogeneous initial population can be generated from a selected (or random) founder individual (containing a given network model), which is then allowed to “evolve” under the effect of mutations and a specific selective pressure. 
The mutations can be incorporated in a variety of ways, but usually they are modeled as changes in the interaction set (i.e., network structure) or interaction parameters. When modeled in this way, these changes are thought to correspond to end effects of DNA-level mutations. This generic approach to model network evolution and dynamics has been utilized successfully in recent years to address a variety of questions, ranging from evolution of certain traits (Wagner, 1996, 2005a; Stern, 1999; Bergman and Siegal, 2003; Pfeiffer et al., 2005; Azevedo et al., 2006; Soyer and Bonhoeffer, 2006; Goldstein and Soyer, 2008) to the analysis of specific network dynamics (Bray and Lay, 1994; Deckard and Sauro, 2004; Francois and Hakim, 2004; Paladugu et al., 2006; Soyer et al., 2006; Francois and Siggia, 2008). Here we consider its application to study questions relating to the fate of duplicate genes in a network context.
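The generic approach of Figure 2 (network dynamics, a fitness function derived from the dynamics, and replication with mutations under selection) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from any of the cited studies: the three-node network, the clipped linear update rule, the exponential fitness function, the target level of 0.5, the mutation rate, and fitness-proportional sampling.

```python
import math
import random

NODES = ("R", "1", "E")  # receptor, intermediary protein, effector (hypothetical)


def time_course(network, steps=10):
    """Toy network dynamics: effector activity over time under a constant signal."""
    activity = {node: 0.0 for node in NODES}
    course = []
    for _ in range(steps):
        activity["R"] = 1.0  # the signal keeps the receptor active
        updated = {}
        for target in NODES:
            total = sum(k * activity[src]
                        for (src, tgt), k in network.items() if tgt == target)
            updated[target] = min(1.0, max(0.0, total))  # keep activity in [0, 1]
        activity = updated
        course.append(activity["E"])
    return course


def fitness(network, target_level=0.5):
    """Toy fitness: closeness of the final effector level to an assumed target."""
    return math.exp(-abs(time_course(network)[-1] - target_level))


def mutate(network, rng, rate=0.2):
    """Return a copy of the network, mutated with probability `rate`."""
    child = dict(network)
    if rng.random() >= rate:
        return child
    kind = rng.choice(("change", "delete", "create"))
    if kind == "change" and child:
        edge = rng.choice(sorted(child))
        child[edge] += rng.gauss(0.0, 0.1)          # change an interaction parameter
    elif kind == "delete" and child:
        del child[rng.choice(sorted(child))]        # lose an interaction
    elif kind == "create":
        src, tgt = rng.sample(NODES, 2)
        child[(src, tgt)] = rng.uniform(-1.0, 1.0)  # gain an interaction
    return child


def evolve(founder, pop_size=50, generations=100, seed=0):
    """Evolve a homogeneous founder population; return it with mean fitness per generation."""
    rng = random.Random(seed)
    population = [dict(founder) for _ in range(pop_size)]
    mean_fitness = []
    for _ in range(generations):
        offspring = [mutate(net, rng) for net in population]   # replication with mutations
        weights = [fitness(net) for net in offspring]           # selection on network fitness
        population = rng.choices(offspring, weights=weights, k=pop_size)
        mean_fitness.append(sum(fitness(net) for net in population) / pop_size)
    return population, mean_fitness


founder = {("R", "1"): 1.0, ("1", "E"): 0.8}  # interaction strengths are arbitrary
population, mean_fitness = evolve(founder)
```

Negative interaction strengths stand in for deactivation. The network, rather than any single gene, is the unit whose fitness is evaluated, which is the point of the approach.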

4 RECONSIDERING FITNESS EFFECTS OF DUPLICATIONS IN THE CONTEXT OF NETWORKS Besides the mechanistic effects of a duplication that are discussed in Section 1, the duplication of any gene that is part of a larger system of interacting genes (or proteins) would lead to a perturbation in the dynamics of that system. Such a perturbation can even lead to the system becoming unstable. The first study of the effects of gene duplication in a network context was on gene regulatory networks. It attempted to understand the relation between the number of interactions in which a gene participates and the fitness effects of its duplication (Wagner, 1994). To construct a tractable gene network model, it is assumed that expression of each gene g is given by a sigmoidal curve that is a function of the expression levels of other genes in the network and their effects on g. Use of the sigmoidal function allows labeling genes as “on” or “off,” based on their expression level. The overall dynamics of the network (i.e., the expression state of the genes) depends on the set of interactions between the genes, which can be listed as an interaction matrix W. The steady-state gene activity (i.e., expression) pattern of the network constitutes the trait that is linked to fitness. The analysis starts by creating an ensemble of networks that are capable of reaching a stable expression pattern. Then a given number of genes in the network are duplicated, and the ability of the resulting network to reach the same expression state as before the duplication is assessed. Note that such a comparison is qualitative by its nature; a network is either capable of reaching the preduplication expression state or it is not. The results of this analysis showed that duplicating a single gene or the entire regulatory network has the smallest fitness effects, while duplicating an intermediate number of genes in the network has the greatest effects.
In other words, the effects of duplications on the network were defined by a unimodal function of r, the percentage of genes in a network that was duplicated. Repeating the analysis with network ensembles of different connectivity c (which defines the mean interactions per gene or the “density” of W ), it was shown that the effect of gene duplication decreases with decreasing c. These results suggest that duplication events would have weak fitness effects in real networks, especially if we assume that real gene regulatory networks are sparsely connected (i.e., each gene product controls expression of one or few genes). However, they provide neither quantitative measures for such effects nor any comparison between them and the effects of other mutational events. It might be possible, for example, that gene duplications are inherently less deleterious than other mutational events. If so, they could be accepted more frequently even though they happen rarely, effectively increasing the number of duplication events over all accepted mutations. This possibility has been tested in signaling networks using a variant of the generic approach outlined above (Soyer and Bonhoeffer, 2006).
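The regulatory network analysis described earlier in this section (sigmoidal expression governed by an interaction matrix W, duplication of a gene, and a qualitative check of whether the preduplication expression pattern is still reached) can be sketched as follows. This is a sketch in the spirit of that description only. The tanh form of the sigmoid, its steepness, the Gaussian interaction weights, the convergence tolerance, and all function names are assumptions and are not the parameters of Wagner (1994).

```python
import math
import random


def step(W, state, steepness=10.0):
    """One update: each gene's expression is a sigmoid of its total regulatory input."""
    n = len(state)
    return [math.tanh(steepness * sum(W[g][j] * state[j] for j in range(n)))
            for g in range(n)]


def steady_state(W, initial, max_steps=100, tol=1e-6):
    """Return the on/off pattern (+1/-1) if the network settles, else None."""
    state = list(initial)
    for _ in range(max_steps):
        new = step(W, state)
        if max(abs(a - b) for a, b in zip(new, state)) < tol:
            return [1 if x > 0 else -1 for x in new]
        state = new
    return None


def duplicate_gene(W, g):
    """Copy gene g: the copy has the same targets (new column) and regulators (new row)."""
    new_W = [row + [row[g]] for row in W]
    new_W.append(new_W[g][:])
    return new_W


def fraction_preserved(n_genes=5, connectivity=0.4, n_networks=200, seed=1):
    """Fraction of stable random networks whose pattern survives one gene duplication."""
    rng = random.Random(seed)
    stable = preserved = 0
    for _ in range(n_networks):
        W = [[rng.gauss(0.0, 1.0) if rng.random() < connectivity else 0.0
              for _ in range(n_genes)] for _ in range(n_genes)]
        initial = [rng.choice((-1, 1)) for _ in range(n_genes)]
        before = steady_state(W, initial)
        if before is None:
            continue  # keep only networks that reach a stable expression pattern
        stable += 1
        g = rng.randrange(n_genes)
        after = steady_state(duplicate_gene(W, g), initial + [initial[g]])
        if after is not None and after[:n_genes] == before:
            preserved += 1
    return preserved / stable if stable else None


frac = fraction_preserved()
```

Calling `fraction_preserved` with different `connectivity` values gives a rough way to explore how the effect of a duplication depends on c in this toy setting. No particular outcome is claimed here.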


In the study by Soyer and Bonhoeffer (2006), the dynamics of the signaling network is captured using a specific model that describes a network as a collection of interacting proteins, each of which can be in an active or inactive state. Each protein, when active, can influence the state of other proteins in the network with which it interacts. Two arbitrary proteins are taken as receptor and effector, allowing the network dynamics to be quantified as changes in the active effector concentration, in response to incoming signals received at the receptor. Biologically, this model corresponds to a signaling cascade, such as those made of phosphatases and kinases. Several biological mechanisms, such as methylation, phosphorylation, and direct protein contact, could correspond to the activation and deactivation processes in the model. In summary, this network model captures the basic biochemistry of biological signaling networks and allows quantifying their function using the dynamic response to incoming signals. The evolutionary model assumes that the signaling network described is under selective pressure to produce a specific (e.g., a transient) response to an incoming signal. A homogeneous population of a single “founder” network or a random population is generated and allowed to “evolve” under such a selective pressure. The founder or random networks were chosen to have the minimal structure needed to achieve the required function. In other words, the evolutionary simulations are assumed to start at the dawn of an “ancestral” minimal network that is capable of achieving the response dynamics required. Mutations are modeled as random events occurring with a certain probability and leading to formation and deletion of interactions, as well as changes in their strength, and deletion and duplication of proteins. Among all these plausible mutational events, it is found that duplications are usually the most tolerated.
In other words, network function (and hence fitness) is least disrupted by duplication events. Consequently, a neutral growth in network size is observed despite the higher probability for mutations leading to the loss of proteins and interactions. Interestingly, the fitness effects of each mutational event are found to be coupled to network size and structure. Early in evolution, when network size is minimal, mutations leading to deletion of proteins or interactions are highly costly. On the other hand, duplications and mutations resulting in new interaction formation have low fitness effects during the same period. This imbalance leads to an increased retention of duplicates and an early growth in network size. As evolution proceeds and networks grow, the fitness effects of different mutational events reach a delicate balance, and network growth stops. In other words, duplications and deletions are accumulating at comparable rates in the population so that, on average, no duplicate can be retained. These results are not only qualitatively robust under biologically plausible parameter ranges, but they also hold under a variety of selection criteria. Networks were evolved for a variety of functions, ranging from the simplest (i.e., the ability to respond to an incoming signal) to the more complex (i.e., ability to give a switchlike response). Under each selection criterion, the above-described early network growth is observed, and final network sizes were always larger than the minimal requirement for functionality. Interestingly, it is found that the equilibrium size that networks evolve toward is dependent on the selective pressure they experience. Further quantitative evidence for the low fitness effects of duplication events at the network level comes from subsequent analyses with a similar model (Soyer, 2007). Addressing the evolution of modularity in signaling networks, this study uses an entirely different selection criterion that is quantitative. 
Such a quantitative fitness function allows more exact analysis of the fitness effects of mutational events that change network structure and dynamics. The early network growth and significantly low fitness effects for duplications early in evolution are observed in this study as well. Further, the distribution of fitness effects over many evolutionary simulations (data reproduced in Figure 3) clearly shows that duplications have the least negative fitness effects on network dynamics, followed by changes in kinetic parameters. The most disruptive events at the network level are mutations leading to the loss of proteins. The dependency of these effects on network structure and size is shown in Figure 4. While these studies provide only a first glimpse of the complicated relations between selective pressure, network topology, and the effect of mutations, they suggest that although duplications happen rarely, they have a high chance of accumulating in the population, due to their low fitness effects on network dynamics. This fulfills a prerequisite for subsequent divergence and retention; namely, it provides the initial establishment of mutants with duplicates in the population. So far, these theoretical studies do not tackle the relation between the functional role of a gene in the network and the fitness effect of its duplication. That such relations exist is already suggested by an analysis of duplicates in metabolic networks (Diaz-Mejia et al., 2007).

[Figure 3 graphic: six panels titled Protein recruitment, Interaction loss, Protein loss, Interaction formation, Protein duplication, and Coefficient change; axes: Fitness effect and N Mutations.]

Figure 3 Distribution of fitness effects for each mutation type. Fitness in this case is defined as the ability of the signaling network to produce independent outputs to two incoming signals (Soyer, 2007). The fitness effects of each mutation type are averaged over the entire population. Data are collected and averaged over seven independent evolutionary simulations. Each panel shows the distribution for a different mutational mechanism indicated on the top of the panel.

[Figure 4 graphic: two panels plotted against Generation (0 to 1000): Fitness effect and Pathway size.]

Figure 4 Fitness effects of different mutation types and pathway size over generations. Fitness in this case is defined as the ability of the signaling network to produce independent outputs to two incoming signals (Soyer, 2007). The fitness effects of each mutation type are averaged over the entire population. Data are collected and averaged over seven independent evolutionary simulations. Different colors indicate different mutation types. Using the notation of Figure 3, we have black for “duplicate protein,” red for “change coefficient,” blue for “protein loss,” green for “create interaction,” cyan for “protein recruitment,” and yellow for “interaction loss.” (See insert for color representation of the figure.)
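The comparison of mutation types discussed in this section can be illustrated with a toy Python calculation that applies each type of mutation to a minimal network and records the mean change in fitness. This is not the model of Soyer and Bonhoeffer (2006) or Soyer (2007). The clipped linear dynamics, the use of final effector activity as fitness, the interaction strengths, and all function names are assumptions. In this toy the response saturates at 1, so duplication is neutral by construction; the sketch illustrates how such effects are measured rather than reproducing the published result.

```python
import random


def response(nodes, edges, steps=20):
    """Toy fitness: final effector activity under a constant signal at the receptor."""
    act = {n: 0.0 for n in nodes}
    for _ in range(steps):
        act["R"] = 1.0
        new = {}
        for tgt in nodes:
            total = sum(k * act[src] for (src, t), k in edges.items() if t == tgt)
            new[tgt] = min(1.0, max(0.0, total))  # activity saturates at 1
        act = new
    return act["E"]


def inner(nodes):
    return [n for n in nodes if n not in ("R", "E")]


def duplicate_protein(nodes, edges, rng):
    if not inner(nodes):
        return nodes, edges
    p = rng.choice(inner(nodes))
    copy = p + "'"
    while copy in nodes:
        copy += "'"
    new = dict(edges)
    for (src, tgt), k in edges.items():  # the copy inherits every interaction of p
        if src == p:
            new[(copy, tgt)] = k
        if tgt == p:
            new[(src, copy)] = k
    return nodes + [copy], new


def lose_protein(nodes, edges, rng):
    if not inner(nodes):
        return nodes, edges
    p = rng.choice(inner(nodes))
    return [n for n in nodes if n != p], {e: k for e, k in edges.items() if p not in e}


def lose_interaction(nodes, edges, rng):
    if not edges:
        return nodes, edges
    gone = rng.choice(sorted(edges))
    return nodes, {e: k for e, k in edges.items() if e != gone}


def form_interaction(nodes, edges, rng):
    src, tgt = rng.sample(nodes, 2)
    new = dict(edges)
    new[(src, tgt)] = rng.uniform(-1.0, 1.0)  # negative values model deactivation
    return nodes, new


def change_coefficient(nodes, edges, rng):
    if not edges:
        return nodes, edges
    new = dict(edges)
    new[rng.choice(sorted(edges))] += rng.gauss(0.0, 0.2)
    return nodes, new


MUTATIONS = {
    "protein duplication": duplicate_protein,
    "protein loss": lose_protein,
    "interaction loss": lose_interaction,
    "interaction formation": form_interaction,
    "coefficient change": change_coefficient,
}


def mean_fitness_effects(nodes, edges, trials=200, seed=0):
    """Mean fitness change (mutant minus parent) for each mutation type."""
    rng = random.Random(seed)
    base = response(nodes, edges)
    return {name: sum(response(*op(nodes, edges, rng)) - base
                      for _ in range(trials)) / trials
            for name, op in MUTATIONS.items()}


# Minimal cascade R -> 1 -> E with arbitrary unit strengths.
effects = mean_fitness_effects(["R", "1", "E"], {("R", "1"): 1.0, ("1", "E"): 1.0})
```

In this minimal cascade every protein or interaction loss abolishes the response, which mirrors the observation that deletions are costly when network size is minimal.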

5 FATE OF A GENE DUPLICATE IN A NETWORK CONTEXT While it is well recognized that most genes do not operate in isolation, attempts to study the fate of duplicate genes in a network context were made only very recently and only in regulatory networks. In a regulatory network, duplication and subsequent diversification of transcription factors and their target genes lead to changes in the topology of the network. Such change could be driven by adaptation, leading to improvements in the dynamics or qualitative features of gene regulation, or alternatively, it could be driven by nonadaptive processes (Tsong et al., 2006; Lynch, 2007). Although discriminating between these different driving forces is not always possible, large-scale empirical analysis of transcription factors and their target genes reveals that the topology of regulatory networks is highly dynamic. Analysis of gene regulatory networks in Escherichia coli and Saccharomyces cerevisiae (Evangelisti and Wagner, 2004; Teichmann and Babu, 2004) has shown that most pairs of duplicate transcription factors maintain at least one common interaction (i.e., target gene), while most pairs of duplicate target genes diverge to be regulated by different transcription factors. These findings suggest that both neo- and subfunctionalization of duplicates play a key role in the evolution of regulatory networks. A more detailed look at these processes using protein–protein interaction data from S. cerevisiae concluded that a complicated mix of neo- and subfunctionalization is responsible for the topology observed for this network (He and Zhang, 2005). To better understand how evolution alters the topology of regulatory networks and to elucidate the role of neo- and subfunctionalization in this process, MacCarthy and Bergman (2007) used a Boolean gene regulatory model to simulate network evolution and trace the fate of duplicated genes. In particular, they analyzed how duplicate genes in a regulatory network diversify in their interaction set following genome duplication (i.e., when the entire network exists in duplicate). In their model the network is assumed to be under stabilizing selection: The fitness of each organism is inversely proportional to the deviation of its gene expression dynamics from that of the initial network (before duplication). The biological justification behind this assumption is clear; the “founder” had the “perfect” gene expression pattern for the environmental conditions, and any deviations from it should be costly in terms of fitness. The model incorporates mutational events only in terms of their end effects on network structure. In particular, mutations are modeled as random events occurring with a given probability and causing loss or generation of regulatory interactions among genes. Results from these evolutionary simulations suggest that neofunctionalization is more important than subfunctionalization for the retention of duplicate genes. The frequency of the former increases monotonically as evolution proceeds, while that of the latter peaks right after the duplication event and then decreases significantly.
These results hold even under different assumptions regarding the frequency of mutational events leading to interaction formation and deletion. In particular, subfunctionalization is found to decrease in time even when deletion of interactions in the network was assumed to be much higher than generation of new interactions. This would correspond to increasing the relevant rate of loss-of-subfunction mutations in the population genetic models described above and would be predicted to lead to increased subfunctionalization (Force et al., 1999). Although this study provides a direct way to study the fate of duplicate genes in the network, it is not entirely clear how robust its results are with respect to certain modeling choices. In particular, the analysis uses a myopic description of neofunctionalization and labels every duplicate gene with an interaction set different from the original as neofunctionalized. Hence, duplicate genes that lose interactions (i.e., degenerate) would be labeled as neofunctionalized in this model, whereas they would have been considered as subfunctionalized under classic population genetic models. Another, possibly more severe issue with this model is that it does not consider lethal (i.e., unstabilizing) mutations. In other words, the only possibility for genes to become nonfunctional in the model is by losing all their interactions one by one. This decreases the probability of nonfunctionalization as a fate for duplicate genes, artificially increasing that of suband neofunctionalization. These modeling considerations also demonstrate the difficulty of applying the concepts of neo- and subfunctionalization, which were developed for single genes, in a more realistic network context. For example, in the context of signaling networks, it could easily be imagined that a new protein is incorporated in a cascade to act between


duplicates (Figure 1). In such a scenario, both duplicates could be considered as neofunctionalized from a single-gene perspective, but subfunctionalized from a response dynamics perspective. Hence, it may not be surprising that resolving the prevalence of neo- and subfunctionalization from the analysis of network data has been difficult (He and Zhang, 2005).
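The labeling problem discussed above can be illustrated with a small sketch that compares interaction sets. The two functions below are hypothetical: `model_label` mimics the rule described for the network model (any change counts as neofunctionalization), whereas `classic_label` is one possible reading of the population genetic categories, in which complementary loss of ancestral interactions counts as subfunctionalization. The gene and regulator names are invented for the example.

```python
def model_label(ancestral, copy):
    """Network-model style label: any changed interaction set is 'neofunctionalized'."""
    return "neofunctionalized" if copy != ancestral else "unchanged"

def classic_label(ancestral, copy_a, copy_b):
    """Pair-level label in the spirit of classic population genetic models (an assumption)."""
    if not copy_a or not copy_b:
        return "nonfunctionalization"      # one copy lost every interaction
    if (copy_a | copy_b) - ancestral:
        return "neofunctionalization"      # an interaction absent from the ancestor
    if copy_a == ancestral and copy_b == ancestral:
        return "redundant"                 # nothing has changed yet
    if copy_a < ancestral and copy_b < ancestral and (copy_a | copy_b) == ancestral:
        return "subfunctionalization"      # complementary, degenerative losses
    return "partial loss"                  # losses that are not (yet) complementary

# Hypothetical ancestral gene regulated by three transcription factors.
ancestral = {"TF1", "TF2", "TF3"}
copy_a = {"TF1", "TF2"}   # lost TF3
copy_b = {"TF2", "TF3"}   # lost TF1

print(model_label(ancestral, copy_a), model_label(ancestral, copy_b))
print(classic_label(ancestral, copy_a, copy_b))
```

Here both copies have only lost interactions, yet the first rule labels each of them as neofunctionalized while the second labels the pair as subfunctionalized. This is the discrepancy that makes the reported frequencies sensitive to the definition used.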

6 DUPLICATE RETENTION AND ROBUSTNESS

The few network-level studies performed so far suggest that accumulation and eventual retention of duplicates might be closely linked to the concept of robustness. Borrowed from engineering, the term robustness refers to the ability of a system to maintain its function in the face of perturbations. As mutations provide a steady source of perturbation for any biological system, robustness becomes a highly relevant concept in biology (Wagner, 2005c). This relevance extends to questions relating to duplicate retention. It is possible, for example, to interpret the aforementioned results from signaling networks in light of robustness. As minimal systems are highly nonrobust to deletion of proteins and interactions, mutations with such effects are highly costly (i.e., lethal). This allows duplications to accumulate and lead to network growth. As networks grow, their robustness possibly improves and allows them to tolerate all types of mutations more or less equally. In fact, a need for increased robustness has been implicated as a cause for observed gene redundancy (i.e., retention of duplicate genes) (de Visser et al., 2003). Theoretical works show that gene redundancy can evolve under high mutation rates and large populations (Nowak et al., 1997; Wagner, 2000). Although these studies suggest that it is unlikely that robustness against mutational load can be a strong selective force for duplicate gene retention under realistic conditions, considering environmental (Harrison et al., 2007) and other ecological factors (Salathé and Soyer, 2008) might result in a different conclusion.

7 TOWARD A COMPLETE MODEL OF GENE DUPLICATION

The early population genetic models provided important insight into the process of duplicate gene retention. The elegance and simplicity of these models provide a clear advantage for both mathematical and computational analyses and allow drawing clear conclusions. However, such simplicity comes with a cost: loss of system-level interactions and dynamics. To overcome this limitation and achieve a more complete understanding of gene duplication, we need toy models at the system level. Although such system-level models, at least in a simple form, have been available for several years now, their application to the study of gene duplication has happened only very recently. The conclusions of these studies provide the first glimpse into the complex nature of the duplication process and hint that models that are even more detailed will be required to capture it fully. The existing network models are still far from perfect and have possible drawbacks when used to address gene retention. For example, the aforementioned gene regulatory network model does not consider interaction strength and accounts only crudely for dosage effects. While resolving these issues, the model for signaling networks considers only a simplified version of protein


interaction. A possibly more problematic weakness of all the existing models (including population genetic models) is the ad hoc nature of the treatment of mutations. Considering only the end effect of DNA-level mutations, the network models define ad hoc rates for events that are thought to change network structure. Although it might be possible to infer rates for these events from the data, this may not be possible in the foreseeable future except perhaps for protein interaction networks (Berg et al., 2004). On the other hand, population genetic models consider DNA-level mutations directly, but they also use ad hoc assumptions regarding the effects of mutations on gene function. Such assumptions can more easily be based on estimates from real data, but the single-gene view of these models does not allow carrying the effects of mutations at the gene level to the network level. Obviously, any modeling approach is bound to make simplifying assumptions, and the robustness of results against such assumptions and key parameter choices can always be controlled. Still, it would be of highest priority to develop models where the effects of mutations are increasingly an emergent property of the model rather than defined a priori. Achieving this will require inclusion of sequence, folding, and structure spaces in gene retention models and combining this level of information with dynamics at the network level. Recently, there have been attempts to include structural information at the protein level in models of gene retention (Rastogi and Liberles, 2005). Similarly, network models that incorporate sequence space have been developed to address questions other than gene retention (Watson et al., 2004; Quayle and Bullock, 2006). Eventually, such models will have to be combined with models of network dynamics and put in a coherent evolutionary context. 
It will be such hybrid models that will allow us to achieve a fuller understanding of gene duplication and its effect on the evolution of complex biological systems.

REFERENCES

Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Ségurens B, Daubin V, Anthouard V, Aiach N, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.
Azevedo RB, Lohaus R, Srinivasan S, Dang KK, Burch CL. 2006. Sexual reproduction selects for robustness and negative epistasis in artificial gene networks. Nature 440:87–90.
Berg J, Lassig M, Wagner A. 2004. Structure and evolution of protein interaction networks: a statistical model for link dynamics and gene duplications. BMC Evol Biol 4:51.
Bergman A, Siegal ML. 2003. Evolutionary capacitance as a general feature of complex gene networks. Nature 424:549–552.
Bray D, Lay S. 1994. Computer simulated evolution of a network of cell-signaling molecules. Biophys J 66:972–977.
Brenner SE, Hubbard T, Murzin A, Chothia C. 1995. Gene duplications in H. influenzae. Nature 378:140.
Cook DL, Gerber AN, Tapscott SJ. 1998. Modeling stochastic gene expression: implications for haploinsufficiency. Proc Natl Acad Sci USA 95:15641–15646.
Deckard A, Sauro HM. 2004. Preliminary studies on the in silico evolution of biochemical networks. ChemBioChem 5:1423–1431.
Deutschbauer AM, Jaramillo DF, Proctor M, Kumm J, Hillenmeyer ME, Davis RW, Nislow C, Giaever G. 2005. Mechanisms of haploinsufficiency revealed by genome-wide profiling in yeast. Genetics 169:1915–1925.


de Visser JA, Hermisson J, Wagner GP, Ancel Meyers L, Bagheri-Chaichian H, Blanchard JL, Chao L, Cheverud JM, Elena SF, Fontana W, et al. 2003. Perspective: Evolution and detection of genetic robustness. Evol Int J Org Evol 57:1959–1972.
Diaz-Mejia JJ, Perez-Rueda E, Segovia L. 2007. A network perspective on the evolution of metabolism by gene duplication. Genome Biol 8:R26.
Duarte JM, Cui L, Wall PK, Zhang Q, Zhang X, Leebens-Mack J, Ma H, Altman N, dePamphilis CW. 2006. Expression pattern shifts following duplication indicative of subfunctionalization and neofunctionalization in regulatory genes of Arabidopsis. Mol Biol Evol 23:469–478.
Evangelisti AM, Wagner A. 2004. Molecular evolution in the yeast transcriptional regulation network. J Exp Zoolog B 302:392–411.
Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J. 1999. Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151:1531–1545.
Francois P, Hakim V. 2004. Design of genetic networks with specified functions by evolution in silico. Proc Natl Acad Sci USA 101:580–585.
Francois P, Siggia ED. 2008. A case study of evolutionary computation of biochemical adaptation. Phys Biol 5:26009.
Goldstein RA, Soyer OS. 2008. Evolution of taxis responses in virtual bacteria: non-adaptive dynamics. PLoS Comput Biol 4:e1000084.
Gu X, Zhang Z, Huang W. 2005. Rapid evolution of expression and regulatory divergences after yeast gene duplication. Proc Natl Acad Sci USA 102:707–712.
Hakes L, Pinney JW, Lovell SC, Oliver SG, Robertson DL. 2007. All duplicates are not equal: the difference between small-scale and genome duplication. Genome Biol 8:R209.
Harrison R, Papp B, Pal C, Oliver SG, Delneri D. 2007. Plasticity of genetic interactions in metabolic networks of yeast. Proc Natl Acad Sci USA 104:2307–2312.
He X, Zhang J. 2005. Rapid subfunctionalization accompanied by prolonged and substantial neofunctionalization in duplicate gene evolution. Genetics 169:1157–1164.
Hughes T, Liberles DA. 2007. The pattern of evolution of smaller-scale gene duplicates in mammalian genomes is more consistent with neo- than subfunctionalisation. J Mol Evol 65:574–588.
Lynch M. 2007. The evolution of genetic networks by non-adaptive processes. Nat Rev Genet 8:803–813.
Lynch M, Conery JS. 2003. The origins of genome complexity. Science 302:1401–1404.
Lynch M, Force A. 2000. The probability of duplicate gene preservation by subfunctionalization. Genetics 154:459–473.
MacCarthy T, Bergman A. 2007. The limits of subfunctionalization. BMC Evol Biol 7:213.
Moore RC, Purugganan MD. 2003. The early stages of duplicate gene evolution. Proc Natl Acad Sci USA 100:15682–15687.
Nowak MA, Boerlijst MC, Cooke J, Smith JM. 1997. Evolution of genetic redundancy. Nature 388:167–171.
Paladugu SR, Chickarmane V, Deckard A, Frumkin JP, McCormack M, Sauro HM. 2006. In silico evolution of functional modules in biochemical networks. Syst Biol (Stevenage) 153:223–235.
Papp B, Pál C, Hurst LD. 2003a. Dosage sensitivity and the evolution of gene families in yeast. Nature 424:194–197.
Papp B, Pál C, Hurst LD. 2003b. Evolution of cis-regulatory elements in duplicated genes of yeast. Trends Genet 19:417–422.
Pfeiffer T, Soyer OS, Bonhoeffer S. 2005. The evolution of connectivity in metabolic networks. PLoS Biol 3:e228.


Quayle AP, Bullock S. 2006. Modelling the evolution of genetic regulatory networks. J Theor Biol 238:737–753.
Rastogi S, Liberles DA. 2005. Subfunctionalization of duplicated genes as a transition state to neofunctionalization. BMC Evol Biol 5:28.
Salathé M, Soyer OS. 2008. Parasites lead to evolution of robustness against gene loss in host signaling networks. Mol Syst Biol 4:202.
Sopko R, Huang D, Preston N, Chua G, Papp B, Kafadar K, Snyder M, Oliver SG, Cyert M, Hughes TR, et al. 2006. Mapping pathways and phenotypes by systematic gene overexpression. Mol Cell 21:319–330.
Soyer OS. 2007. Emergence and maintenance of functional modules in signaling pathways. BMC Evol Biol 7:205.
Soyer OS, Bonhoeffer S. 2006. Evolution of complexity in signaling pathways. Proc Natl Acad Sci USA 103:16337–16342.
Soyer OS, Pfeiffer T, Bonhoeffer S. 2006. Simulating the evolution of signal transduction pathways. J Theor Biol 241:223–232.
Stern MD. 1999. Emergence of homeostasis and “noise imprinting” in an evolution model. Proc Natl Acad Sci USA 96:10746–10751.
Stone JR, Wray GA. 2001. Rapid evolution of cis-regulatory sequences via local point mutations. Mol Biol Evol 18:1764–1770.
Teichmann SA, Babu MM. 2004. Gene regulatory network growth by duplication. Nat Genet 36:492–496.
Tirosh I, Barkai N. 2007. Comparative analysis indicates regulatory neofunctionalization of yeast duplicates. Genome Biol 8:R50.
Tsong AE, Tuch BB, Li H, Johnson AD. 2006. Evolution of alternative transcriptional circuits with identical logic. Nature 443:415–420.
Vogel C, Chothia C. 2006. Protein family expansions and biological complexity. PLoS Comput Biol 2:e48.
Wagner A. 1994. Evolution of gene networks by gene duplications: a mathematical model and its implications on genome organization. Proc Natl Acad Sci USA 91:4387–4391.
Wagner A. 1996. Does evolutionary plasticity evolve? Evolution 50:1008–1023.
Wagner A. 2000. The role of population size, pleiotropy and fitness effects of mutations in the evolution of overlapping gene functions. Genetics 154:1389–1401.
Wagner A. 2005a. Circuit topology and the evolution of robustness in two-gene circadian oscillators. Proc Natl Acad Sci USA 102:11775–11780.
Wagner A. 2005b. Energy constraints on the evolution of gene expression. Mol Biol Evol 22:1365–1374.
Wagner A. 2005c. Robustness and Evolvability in Living Systems. Princeton Studies in Complexity. Princeton, NJ: Princeton University Press.
Walsh JB. 1995. How often do duplicated genes evolve new functions? Genetics 139:421–428.
Wapinski I, Pfeffer A, Friedman N, Regev A. 2007. Natural history and evolutionary principles of gene duplication in fungi. Nature 449:54–61.
Watson J, Geard N, Wiles J. 2004. Towards more biological mutation operators in gene regulation studies. Biosystems 76:239–248.
Watterson GA. 1983. On the time for gene silencing at duplicate loci. Genetics 105:745–766.
Zhang J. 2003. Evolution by gene duplication: an update. Trends Ecol Evol 18:292–298.

13

Evolutionary and Functional Aspects of Genetic Redundancy

RAN KAFRI
Department of Systems Biology, Harvard Medical School, Boston, Massachusetts

TZACHI PILPEL
Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel

Evolution After Gene Duplication, Edited by Katharina Dittmar and David Liberles Copyright © 2010 Wiley-Blackwell

1 INTRODUCTION

Genetic networks demonstrate a remarkable capacity to carry out precise regulatory programs in the face of habitat variations, the stochasticity of the internal cellular environment, and genetic variability. Possibly not unrelated, genomes contain tremendous numbers of redundancies, typically associated with duplicated genes (paralogs). Specific examples include redundant components of signal transduction, developmental regulators, and isozymes. Contrary to evolutionary predictions, recent evidence demonstrates that many of these functional overlaps are evolutionarily stable and far more prevalent than expected previously. More intriguingly, recent advances in the systematic identification and characterization of redundant duplicates now suggest that redundant duplicates are preferentially associated with key functions in cellular regulation. Collectively, these new findings challenge the view that redundancies are simply leftovers of ancient duplications and suggest them as an additional component of the sophisticated machinery of cellular regulation.

2 GENETIC REDUNDANCY: A WORKING DEFINITION

The general and accepted framework for an understanding of the evolution of duplicated genes is the paradigm of duplication and divergence. In short, randomly occurring gene duplication events generate functionally redundant copies of an ancestral gene. The functional overlap of these newly generated duplicates allows for rapid evolutionary change, consequently leading to a rapid loss of the functional overlap. Thus, at the end of this process, while genes may still have similarity in DNA or amino acid sequence, they no longer perform the same functions. These conceptual paradigms have been outlined by many excellent reviews and supported by unquestionable evidence (Wagner, 2001; Kondrashov and Koonin, 2004; Cusack and Wolfe, 2007; Semon


and Wolfe, 2007). Nevertheless, whereas this general paradigm outlines the fate of the majority of duplicated genes, a minority of duplicated pairs seem to have escaped this fate (Kafri et al., 2006, 2008; DeLuna et al., 2008; Musso et al., 2008). For this minority, the functional overlap or redundancy that was generated by the duplication event seems to be conserved evolutionarily. In this review we take the road “less traveled” and discuss not the majority of duplicates that have undergone complete subfunctionalization, neofunctionalization, or loss (Lynch and Conery, 2000) but, rather, those that have retained their functional overlap. We ask what may have been the selective advantage that allowed these functional overlaps to escape the claws of evolutionary selection. To discuss the conservation of functional redundancy of duplicated genes, the concept of redundancy must be better defined. In fact, one may correctly object that true functional redundancies do not actually exist in genomes and that what is observed are merely partial overlaps in gene functions. Admittedly, knockout phenotypes have been described for many genes reported to have redundant partners. For these reasons it has previously been suggested to define redundancy as a measure of correlated, rather than degenerate, gene functions (Tautz, 1992). Here and elsewhere (Kafri et al., 2008) we propose considering genes redundant if they fulfill two criteria simultaneously:

1. Both genes must perform the same molecular function. For pleiotropic genes, a functional overlap is required such that at least one of the molecular functions of these genes is shared.
2. Knockout of either of the two genes must produce a phenotype that is far smaller than that predicted by the function they perform.

3 DISPENSABILITY OF DUPLICATED GENES

The central and recurring observation associating duplicate genes with redundant functions is that the knockout of a single gene from a duplicate pair or family often results in a smaller effect on phenotype than would have been expected from its biological function. Examples illustrating this phenomenon have been collected systematically in Saccharomyces cerevisiae (DeLuna et al., 2008; Kafri et al., 2008; Musso et al., 2008) and are further illustrated by numerous anecdotal examples, some of which are discussed in this review. MyoD and Myf-5 are master developmental regulators that autonomously specify the myogenic fate of somites in skeletal muscle development. Redundancy in the functions of these regulators was established by the fact that null mutations in either of these two myogenic regulatory factors resulted in apparently normal skeletal muscle (Rudnicki et al., 1992, 1993). Moreover, it has been shown that either MyoD or Myf-5 can separately induce muscle development (Rudnicki et al., 1993). In strong contrast, mice lacking both MyoD and Myf-5 lack skeletal muscle altogether and die soon after birth (Rudnicki et al., 1993). Thus, the general theme that has emerged is that duplicate genes are somewhat more dispensable than singletons and that this dispensability is due to their overlapping functions, which act to compensate for mutations. Although this notion has accumulated slowly from numerous anecdotal examples, it was not demonstrated systematically until 2003 (Gu, 2003; Gu et al., 2003). In the latter work the authors obtained data


from systematic measurements of the growth phenotypes of 1147 single-gene deletions in the yeast S. cerevisiae. By comparing the effect of gene deletion for duplicates and for singletons, the authors demonstrated the expected trend that the proportion of genes that are essential for cell viability is significantly greater among singletons than among duplicates. A similar study later reconfirmed these results for the worm Caenorhabditis elegans (Conant and Wagner, 2004). Equally important, these studies allowed, for the first time, a quantitative assessment of the proportion of redundant gene duplicates, resulting in a lower-bound estimate of 25% in S. cerevisiae (Gu et al., 2003) and 7% in C. elegans. Later attempts to better quantify the proportion of redundant duplicates have resulted in a variety of estimates, ranging from as few as 9% redundant duplicates (Lin et al., 2006) to 55% (DeLuna et al., 2008). The rationale was that the difference between the proportion of dispensable duplicates and the proportion of dispensable singletons is the outcome of redundancy and reflects its prevalence quantitatively. As a word of caution, the term gene dispensability should not be interpreted literally, as it reflects a conceptual simplification. Genes are not dispensable. Had they been so, they would not have survived evolutionary pressures. Results such as those described by Gu et al. (2003) demonstrate that under the experimental conditions employed in the artificial laboratory setting, mutations in duplicates are typically associated with a less severe consequence on organism phenotype. Such results were taken to suggest that functional overlaps of duplicates reduce the phenotypic cost of mutations by means of compensation.
A more recent computational study that simulated a metabolic network has shown that with the addition of growth conditions and challenges, the list of apparently dispensable genes shortens, presumably since more functionalities become essential (Papp et al., 2004). In this review we adopt the term dispensability and ask the reader to keep this simplification in mind. An immediate consequence of genetic redundancy is that it decouples the essentiality of a gene’s function from the essentiality of the gene itself. For example, the pair of duplicate genes GSL1 and GSL5 has been reported to have “essential yet redundant roles for plant and pollen development” (Enns et al., 2005). In other words, whereas single-gene knockout experiments seem to suggest that the gene GSL1 is somewhat dispensable, double knockouts of GSL1 together with its duplicate GSL5 demonstrate that the function encoded by this pair is essential. Another example illustrating this point is the pair of genes Vav1 and Vav3, which were reported to have “critical but redundant roles in mediating platelet activation” (Pearce et al., 2004). These genes, while performing functions that are “critical” or “essential” for platelet activation, may themselves appear dispensable, due to functional overlap. More generally, vital biochemical functions may be performed redundantly by several genes, each of which, separately, appears dispensable. Here, we wish to focus on an interesting and relatively unexplored question: Considering the tremendous number of redundant duplicates observed in all organisms studied, could genetic networks have evolved means to specifically utilize functional overlaps? A positive answer might contribute to explaining the extended conservation of some redundancies in biology (Tischler et al., 2006).
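The rationale behind these estimates, namely that the excess of dispensable genes among duplicates relative to singletons reflects compensation, can be sketched numerically. The Python example below is only an illustration and is not the estimator used in any of the cited studies: the function name, the "share" statistic, the pooled two-proportion z-test, and the gene counts are all assumptions.

```python
from math import erfc, sqrt

def excess_dispensability(dup_disp, dup_total, sing_disp, sing_total):
    """Compare the proportion of dispensable genes among duplicates and singletons."""
    p_dup = dup_disp / dup_total        # dispensable fraction of duplicates
    p_sing = sing_disp / sing_total     # dispensable fraction of singletons
    excess = p_dup - p_sing             # difference attributed to compensation
    # Share of dispensable duplicates exceeding the singleton baseline
    # (a simple illustrative statistic, not a published estimator).
    share = excess / p_dup if p_dup > 0 else 0.0
    # Pooled two-proportion z-test for the difference.
    pooled = (dup_disp + sing_disp) / (dup_total + sing_total)
    se = sqrt(pooled * (1 - pooled) * (1 / dup_total + 1 / sing_total))
    z = excess / se if se > 0 else 0.0
    p_value = erfc(abs(z) / sqrt(2))    # two-sided normal tail probability
    return {"p_dup": p_dup, "p_sing": p_sing, "excess": excess,
            "share": share, "z": z, "p_value": p_value}

# Hypothetical counts: 900 of 1000 duplicates and 800 of 1000 singletons
# show no growth defect when deleted.
result = excess_dispensability(900, 1000, 800, 1000)
print(result)
```

Such a difference is informative only to the extent that duplicates and singletons are otherwise comparable. If genes with unimportant functions are duplicated more often, part of the excess would be unrelated to compensation.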


4 DISPENSABILITY OF DUPLICATES: REDUNDANCY OR UNIMPORTANT FUNCTIONS?

From systematic experiments of gene deletion it has been noted that most genes (e.g., 80% in S. cerevisiae) appear dispensable regardless of whether they do or do not have a duplicate partner. As mentioned above, this apparent dispensability is due at least partially to a dispensability of gene functions under the experimental test conditions (Gu et al., 2003; Papp et al., 2004). For example, in yeast, genes responsible for sporulation would not affect fitness under optimal growth conditions. It is thus a challenge to distinguish between duplicates whose dispensability is associated intrinsically with their duplicated state and those whose dispensability is coincidental. Specifically, we consider a gene coincidentally dispensable if its ancestral singleton, prior to the duplication event, was also dispensable. In other words, the organism can live without this gene not because its function is compensated for but simply because its function is not needed (under the experimental test conditions). Previously it was conjectured that the relative proportions of duplicates that are dispensable due to unimportant functions and of those that are dispensable due to compensation are indicative of the significance of genetic redundancy (Wong and Roth, 2005). Regarding this, one extreme point of view has suggested that most if not all of the dispensability of duplicated genes is due to nonessential functions (He and Zhang, 2005a, b). Specifically, it was argued that genes associated with unimportant functions duplicate more frequently (or that duplicates of these genes are more frequently fixed in the population). Although the latter possibility is intellectually attractive, it has become somewhat implausible in light of recent evidence (Davis and Petrov, 2004; Jordan et al., 2004; Kafri et al., 2005, 2006, 2008; Tischler et al., 2006; DeLuna et al., 2008).
For example, it has been demonstrated that dispensable duplicates are far more prevalent among hubs in the protein interaction network (Kafri et al., 2008). Such hubs, in turn, have been shown to be associated with functions that are typically highly essential for cell viability (Jeong et al., 2001). Although it is theoretically possible that protein network hubs perform essential functions only if they are singletons, such a possibility is highly unlikely and not supported by evidence. Furthermore, several recent studies have demonstrated that redundant interactions (1) are far more prevalent than previously expected (Kafri et al., 2006; DeLuna et al., 2008), (2) are often associated with important functions (DeLuna et al., 2008; Kafri et al., 2008), and (3) are controlled by feedback regulation that seems to exploit their functional overlap (Kafri et al., 2005, 2006). An additional important line of evidence suggesting the functional importance of “dispensable” duplicates stems from the unexpectedly long evolutionary period of conservation, as explained in the following section.

5 EVOLUTION OF REDUNDANT DUPLICATES: CONTRASTING THEORY AND OBSERVATIONS

From an evolutionary perspective, redundancies are thought to buffer phenotypes from genomic variations by reducing the phenotypic cost of mutations and, consequently, increasing an organism’s evolvability (Gerhart and Kirschner, 1997; Kirschner and Gerhart, 1998). But on the other hand, this very fact renders these redundancies unstable


on evolutionary time scales (Ohno, 1970; Nowak et al., 1997; Wolfe and Shields, 1997; Lynch and Conery, 2000; Brookfield, 2003; Conant and Wagner, 2003; Gu et al., 2003; Makova and Li, 2003). Specifically, if a gene’s function can be compensated for perfectly by a redundant partner, mutations in that gene would have no consequence on the phenotype of that individual. Such mutations could not, therefore, be selected against and would tend to accumulate, leading either to loss of function of one of the duplicates (nonfunctionalization) or, in the case of pleiotropy, to a partitioning of the functions of the ancestral gene between the two duplicates (subfunctionalization) (Lynch and Conery, 2000). A third and rarer possibility is that one or both of the duplicate partners will acquire novel functions not present in the ancestral gene (neofunctionalization). In Molecular Biology of the Cell (Alberts, 2002) this notion has been captured by the section title “Genetic Redundancy Is a Problem for Geneticists, But It Creates Opportunities for Evolving Organisms.” Thus, redundancy generated by gene duplication is predicted to be short-lived on evolutionary time scales. Consistent with this, evidence suggests that for the majority of gene duplicates, functional redundancy generated by the duplication event is indeed lost shortly after the duplication event (Wolfe and Shields, 1997; Lynch and Conery, 2000; Kellis et al., 2004). At the same time, recent observations suggest that this evolutionary process, eliminating functional overlaps, is not the exclusive fate of all redundant duplicates. In fact, recent evidence suggests that for a significant proportion of duplicates, redundancy is actually stably maintained throughout evolution (ref). For example, in C. elegans, 14 duplicate gene pairs were found to have conserved redundant functions for over 80 million years of evolution (Tischler et al., 2006). Similarly, in S. 
cerevisiae, redundant functions were detected systematically for over 50% of the duplicates originating from a whole-genome duplication event dating back 100 million years (DeLuna et al., 2008). Although such redundant interactions exist for only a minority of all duplicate gene pairs, the evolutionary conservation of these functional overlaps may be suggestive of their importance to organism fitness. Anecdotally illustrating such conservation of redundant duplicates is the pair of O-acyltransferase isozymes that redundantly catalyze the conjugation of sterols to fatty acids. The functional significance and evolutionary conservation of these isozymes is demonstrated by the fact that redundancy of this enzyme pair has been conserved all the way from yeast (Are1 and Are2) to mammals (ACAT1 and ACAT2) (Yang et al., 1996; Yu et al., 1996; Cases et al., 1998). From numerous examples such as the above, it has been concluded that although retention of redundancy is much less frequent than its loss, its widespread existence is nontrivial and cannot be dismissed as leftovers of recent duplication events (Nowak et al., 1997; Kafri et al., 2006). One interesting recent observation regarding the conservation of redundant duplicates was obtained from an examination of the distribution of redundant duplicates in the yeast protein interaction network (Kafri et al., 2008). Findings from this analysis demonstrated that the proportion of redundant interactions increases for duplicates involved in higher numbers of protein–protein interactions. This suggests that redundant interactions are maintained selectively for duplicates whose biological functions are mediated by numerous physically interacting protein partners. Examining the age dependency of these “protected” hubs indicated three stages in a continuum of the evolution of duplicates following the duplication event.
In the first of these evolutionary stages, shortly after the duplication event, duplicates are both tightly coregulated and highly dispensable. It is thus conceivable that the capacity of these duplicates to


compensate for mutations stems from the fact that they have not yet accumulated significant evolutionary change to either their regulatory control or functions. In the second phase, duplicates were both less tightly coregulated and less dispensable. This trend has been widely expected and is the simple outcome of evolutionary divergence (Lynch and Conery, 2000; Gu et al., 2002, 2004; Wagner, 2002; Brookfield, 2003; Conant and Wagner, 2003; Makova and Li, 2003). Surprisingly, it is only in the third evolutionary phase, reflecting the most ancient duplicates, that we notice the overrepresentation of redundant interactions among duplicates that are highly involved in protein–protein interactions (Kafri et al., 2008). This finding may suggest that compensation of protein network hubs by their duplicates is not a simple epiphenomenon of gene duplication but, rather, represents a functionality that has evolved through purifying selection. To place these findings in biological context, it is interesting to ask about the identity and function of these protein network hubs that have been awarded redundant partners. These redundant duplicates largely constitute a variety of posttranscriptional regulators, such as kinases (e.g., Mrk1 and Rim11, which are homologs of the mammalian GSK-3 involved in Wnt pathway regulation), phosphatases (e.g., Ppz2 and Ppz1), and ubiquitin ligases (e.g., Bul1 and Bul2). Fundamentally, this is not surprising, since redundancies in such gene families have long been observed, ranging from redundancy of cyclins (Berthet and Kaldis, 2007) to redundancies of cytokines (Taniguchi, 1995).

6 EXPLAINING THE CONSERVATION OF REDUNDANT DUPLICATES

Contrary to early thought, there is now well-established theoretical argumentation explaining that evolutionary instability is not the inevitable fate of genetic redundancy (Gerhart and Kirschner, 1997; Nowak et al., 1997; Kirschner and Gerhart, 1998, 2005; Krakauer and Nowak, 1999; Kafri et al., 2005, 2006, 2008; Landry et al., 2006). In fact, multiple routes allowing the conservation of redundancy have been suggested (Nowak et al., 1997; Kafri et al., 2006). Before describing these theories, one introductory remark is noteworthy. Typically, when evolutionary notions are translated into formal mathematical treatment, a simplification is employed whereby genes are assumed to be independent. Specifically, gene interactions, and in particular regulatory interactions, are often not taken into account. In simple terms, assume that we have a cost function, s, evaluating the “importance” of a particular gene (the fitness cost associated with its knockout). Most mathematical treatments would simplistically assume that mutations in a given gene, g1, have no effect on the fitness cost of mutations in a different gene, g2. In the context of duplicate genes, this would mean that mutations in a given duplicate would not affect the function of its redundant partner. Although such simplifications have proven useful in predicting many evolutionary trends, outliers or exceptions should not be unexpected. Thus, when considering the evolution of redundant duplicates, two different cases must be distinguished. The first of these describes pairs of duplicates where redundant partners are independent and are not bound by various forms of cross-regulatory interactions (Figure 1A). The second possibility accounts for duplicates that, by means of regulatory interactions, are not independent (Figure 1B and C). In the latter cases, mutations of a given gene would affect the function of its duplicate.
For these duplicated gene pairs it is conceivable that redundancy may be favored over the nonredundant state. In other words, the redundant state offers an evolutionarily selectable advantage. We discuss each of

[Figure 1: panels (A), (B), and (C), each showing a circuit after a step labeled "Duplication" and after a step labeled "Evolution".]

Figure 1 Regulatory dependencies generated by duplication events. A duplication of genes that are under negative (A) or positive (B) feedback regulation may result spontaneously in a regulatory circuit in which duplicates respond in level or activity to mutation of their duplicate partners. During the course of evolution, a portion of these interactions can be lost, producing a variety of possible circuitries.

these possibilities separately and show how the conservation of redundant duplicates can be explained by both scenarios.

6.1 Conservation of Noninteracting Redundant Duplicates

In 1997, Nowak and colleagues were the first to challenge the view that redundancy cannot be stable evolutionarily (Nowak et al., 1997). In this work the authors relied on a simple mathematical formalism describing a population of animals in which some essential function is performed redundantly by genes at either of two loci, A and B. They further considered mutations at rates μa and μb generating, from A and B, the nonfunctional alleles a and b. Following this formalism, they described three possible scenarios in which both redundant alleles can coexist stably in the population. In the first scenario, one of the two alleles, say A, functions at a slightly higher efficiency level but is also exposed to higher mutation rates. In such cases, the authors show that the functionally more efficient allele, A, can coexist stably with allele B by selection for the Ab genotype. The second scenario describes duplicates that are redundant only with respect to a certain function, while both genes are maintained by selection because of another, independent function. This scenario may be perceived as a form of partial subfunctionalization. In the third scenario, the authors consider situations where a certain gene fails to perform its function correctly at significant frequencies (although the defects are not heritable). In these cases it is evolutionarily advantageous for this function to be compensated for by a redundant duplicate.

6.2 Evolution of Interacting Redundant Duplicates: A Systems Biology Perspective

The birth of systems biology marks a shift in interests and focus from the functions of individual genes to functions that emerge from the regulatory circuitries and networks


EVOLUTIONARY AND FUNCTIONAL ASPECTS OF GENETIC REDUNDANCY

connecting them (Kirschner, 2005). Redundant duplicates are often governed by regulatory interactions whereby one duplicate is under the regulatory influence of its redundant partner (Kafri et al., 2005, 2006). In these cases, mutations in a single gene may affect the fitness contribution of its duplicate in a variety of ways. Such regulatory interactions between duplicate genes are not unexpected and can arise as a direct and immediate consequence of the duplication event (Figure 1). Furthermore, it is plausible to assume that circuitries may evolve that exploit functional overlap to obtain evolutionarily advantageous functionalities (Kafri et al., 2006). One such functionality was hypothesized to be the exploitation of cross-regulated redundant duplicates to dampen stochastic noise in protein levels (Kafri et al., 2006). It is thus likely that, at least in some pathways, redundancies have been selected for based on some evolutionary advantage that they confer on the wild-type organism (Kafri et al., 2006). In particular, we propose that there are regulatory designs that exploit redundancy to achieve functionalities such as control of noise in gene expression or extreme flexibility in gene regulation. In such cases, compensation for gene loss could merely be a side product of design principles utilizing functional redundancy. Such circuitries that have evolved to utilize redundant functions have been termed responsive backup circuits (Kafri et al., 2005, 2006; Semon and Wolfe, 2007).

6.3 Responsive Backup Circuits

Clues for regulatory designs controlling redundancy were first obtained in a recent study that explored the dispensability of gene duplicates with various degrees of coregulation (Kafri et al., 2005). The preliminary assumption of this study was that for one duplicate copy to compensate for the loss of its partner, both duplicates must not only perform the same function but do so at the same place and time.
In other words, coregulation of duplicates was perceived as a prerequisite for functional compensation (Gu et al., 2003; Kafri et al., 2005). In reality, however, paralogs expressed similarly were almost never found to back each other up, as evident from their high essentiality (Figure 2A, D–F, H). In fact, tightly coregulated gene duplicates were found to be more essential for viability even than singleton genes (Figure 2A and D). Inference based on gene dispensability (see Section 11) suggested that functional redundancy and compensation are restricted almost exclusively to duplicates that in the wild type are regulated differentially (Kafri et al., 2005). These results were later corroborated by several independent sources of evidence, including careful measurements of epistatic interactions (DeLuna et al., 2008) between duplicates (Figure 2F). Further insight has been provided by the observation that some differentially regulated duplicates maintain the ability to become coregulated under certain environmental conditions (Musso et al., 2008; Quezada et al., 2008; Sanchez-Perez et al., 2008). Such conditional coregulation of these genes within the transcriptional network was shown to be very strongly negatively correlated with the severity of the knockout phenotypes of these genes. Thus, the paradigm that has emerged is that genes that are functionally redundant are often not controlled independently but, rather, are involved in feedback regulation that often results in one duplicate being up-regulated in response to mutational inactivation of its partner. As a side note, Figure 2D illustrates a trend reporting two distinct and separate phenomena. On the one hand, as pointed out, the proportion of genes essential for viability is significantly lower (compared with singleton genes) among duplicates that

[Figure 2, panels (A)–(F): bar charts. Panel titles recoverable from the original layout: "Differential regulation of isozymes", "Evolutionary loss of redundancy", "Dispensability of isozymes", "Differential regulation of redundant duplicates", and "Conditional co-regulation of redundant duplicates". The bars compare single-copy enzymes or singleton genes with co-regulated and differentially regulated duplicates, recent with ancient duplications, and duplicates with low CCR with duplicates with high CCR.]

Figure 2 Prevalence and evolutionary loss of redundancies: (A) number of enzymes in S. cerevisiae that appear in single copy vs. the number of enzymes having co-regulated or differentially regulated redundant partners (isozymes); (B) proportion of redundant duplicates originating from ancient duplications (Ks > 1) vs. duplicates originating from recent duplications; (C) proportion of dispensable single-copy enzymes vs. the proportion of dispensable isozymes (enzymes with redundant duplicates). Redundancies inferred by increased dispensability: (D, E) proportion of dispensable genes among singletons vs. coregulated and differentially regulated duplicates. Coregulation is quantified by expression similarity (D, F, H) or conditional coregulation (E, G, I). Redundancies inferred by direct epistasis measurements: (F, G) proportion of coregulated vs. differentially regulated duplicate pairs having epistatic interaction. Redundancies inferred by curating literature: (H, I) proportion of coregulated or differentially regulated duplicate pairs that are functionally redundant as inferred from literature curation.

[Figure 2 (continued), panels (G)–(I): bar charts of the proportion of redundant duplicates, comparing differentially regulated with co-regulated pairs and duplicates with low CCR with duplicates with high CCR.]

are differentially regulated. This was interpreted by assuming compensation by cross-regulated duplicates (Kafri et al., 2005, 2006). In sharp contrast, duplicate genes that are consistently coregulated are far more essential than singletons. To explain this, Zhang has insightfully pointed out (He and Zhang, 2006) that the increased essentiality of coregulated duplicates coincides with their increased involvement in protein–protein interactions. It can further be generalized that such tight coregulation indicates that these duplicates are required simultaneously for a given functionality and, consequently, cannot substitute for each other’s absence. Thus, while the trend in Figure 2D is the result of these two separate phenomena, it is specifically the dispensability of differentially regulated duplicates that implies redundancy. Two lines of evidence could indicate a function’s direct benefit from existing redundancy: the evolutionary conservation of the functional overlap and a nontrivial regulatory design that utilizes it. Many well-known examples meet both criteria, one of which is that of the 1,3-β-glucan synthase catalytic subunit in yeast, which is encoded by two alternative, functionally redundant, and synthetically lethal genes, FKS1 and FKS2 (Zhao et al., 1998). The evolutionarily selectable advantage of this redundancy can be inferred from the fact that both isozymes are found as duplicates in 12 sequenced yeast species, with the exception of Yarrowia lipolytica (Leon et al., 2002). Furthermore, in S. cerevisiae, these two genes obey a particular regulation whereby FKS2 transcriptionally responds to the intactness of FKS1 and is up-regulated upon FKS1 mutational


inactivation (Garcia-Rodriguez et al., 2000). Numerous other examples describing such responsive backup circuits exist and cover a wide variety of organisms, ranging from bacteria to mammals (Kafri et al., 2006). In fact, the observed prevalence of this particular regulatory design for control of genetic redundancy may be indicative of specific selectable functions that it performs.

7 CONDITIONAL COREGULATION AND THE MAINTENANCE OF METABOLIC FLUXES

Within the metabolic network, fluxes are governed and regulated by the concentration of active enzymes catalyzing the various reactions. Although the detailed contribution of functional redundancy to this regulation is not fully established, such a contribution is widely anticipated given the large number of isozymes and other redundancies that exist within these networks. As Figure 2C shows, individual isozymes are less essential and produce fewer deleterious effects upon deletion than do enzymes existing as single copies. An interesting twist to this account comes from the fact that isozymes, although redundant and consequently dispensable, obey different regulatory programs and are transcribed at different times in response to environmental pressures (Gasch et al., 2000; Ihmels et al., 2004; Kafri et al., 2005). A recent finding that reconciles these two seemingly opposing observations shows that many differentially regulated genes can be induced for coexpression when particular environmental stress stimuli are applied (Kafri et al., 2005). Moreover, gene pairs that maintain this capacity for conditional coexpression were shown to be the most likely candidates for compensating against deletion mutations (Kafri et al., 2005). This conditional coregulation [referred to as PCoR by Kafri et al. (2005)] may provide essential clues to the function of these redundancies in the regulation of metabolic fluxes. The model that emerges is that whereas many isozymes are specialized for different environmental regimes (Sanchez-Perez et al., 2008), alarm signals induced by particular stress stimuli may call for their synergistic coexpression. Here, responsive backup circuits provide functional specialization together with extreme flexibility in gene control that could be activated when sufficient stress has been applied.
For example, in yeast, glucose serves as a regulatory input for alternating between aerobic and anaerobic growth. Its presence is detected by two separate and independent signaling pathways, one probing intracellular glucose concentrations and the other probing extracellular concentrations (Ozcan, 2002). This differential sensing enables some genes to be separately regulated by either intracellular or extracellular glucose. One consequence of this is seen in the responsive backup circuit composed of Hxt1 and Hxt2 (Figure 3). Here feedback is made possible by having Hxt2 controlled by two opposing signals. One is its induction by extracellular glucose and the second is its repression by intracellular glucose (Ozcan, 2002). The consequence of this is that while high glucose concentrations result in repression of Hxt2 expression, its induction could be triggered by low environmental sugar or, alternatively, by mutations in genes responsible for glucose influx (Ozcan, 2002). Similar examples include the isocitrate dehydrogenases idp2 and idh, where the glucose repression of idp2 is reversed in the idh mutant (McCammon and McAlister-Henn, 2003), and the pair Acs1 and Acs2, where Acs2 expression is induced in the Acs1 mutant (Van den Berg et al., 1996). In all these cases, the common denominator is that one of the two

[Figure 3: circuit diagram with elements labeled Extracellular glucose, Snf3/Rgt2, Hxt1, Hxt2, and Intracellular glucose.]

Figure 3 Hxt1/Hxt2 responsive backup circuit. Extracellular glucose concentration is sensed by two membrane receptors on the outer yeast membrane, Rgt2 and Snf3. These, once activated by glucose, initiate a signal cascade that induces the transcription of the Hxt gene family of hexose transporters encoding membrane channels for glucose intake. The flux of incoming glucose generates an increase in intracellular glucose concentration, which, in turn, represses the transcription of Hxt2.

duplicates is under repression in the wild type, and that this repression is relieved upon its partner’s mutation.
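A toy steady-state sketch can make the logic of the Hxt1/Hxt2 circuit concrete. It is not taken from the cited studies: the functional forms (saturating induction, Hill-type repression), every parameter value, and the names `induction`, `repression`, and `steady_state` are assumptions chosen purely for illustration. The sketch treats Hxt1 as induced by extracellular glucose, and Hxt2 as induced by extracellular glucose but repressed by intracellular glucose, as described in the text.

```python
def induction(g_ext, k_ind=0.1):
    """Assumed saturating induction by extracellular glucose (range 0..1)."""
    return g_ext / (k_ind + g_ext)


def repression(g_int, k_rep=1.0, hill=4):
    """Assumed Hill-type repression by intracellular glucose (range 0..1)."""
    return 1.0 / (1.0 + (g_int / k_rep) ** hill)


def steady_state(g_ext, hxt1_intact=True, flux_gain=2.0):
    """Return (intracellular glucose, Hxt2 expression) at steady state.

    Intracellular glucose is assumed proportional to extracellular glucose
    times the total transporter level (Hxt1 + Hxt2). Because Hxt2 falls as
    intracellular glucose rises, the balance equation has a single root,
    which is located here by bisection.
    """
    hxt1 = induction(g_ext) if hxt1_intact else 0.0

    def imbalance(g_int):
        hxt2 = induction(g_ext) * repression(g_int)
        return flux_gain * g_ext * (hxt1 + hxt2) - g_int

    lo, hi = 0.0, 2.0 * flux_gain * g_ext  # influx can never exceed this bound
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if imbalance(mid) > 0:
            lo = mid
        else:
            hi = mid
    g_int = 0.5 * (lo + hi)
    return g_int, induction(g_ext) * repression(g_int)


if __name__ == "__main__":
    for label, g_ext, intact in [
        ("wild type, high glucose", 10.0, True),
        ("wild type, low glucose", 0.5, True),
        ("hxt1 mutant, high glucose", 10.0, False),
    ]:
        g_int, hxt2 = steady_state(g_ext, hxt1_intact=intact)
        print(f"{label}: intracellular glucose={g_int:.3f}, Hxt2={hxt2:.4f}")
```

Under these assumed parameters, Hxt2 stays essentially off at high glucose in the wild type, is expressed at low glucose, and is induced at high glucose once Hxt1 is removed, because the reduced influx lowers intracellular glucose and relieves the repression.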

8 REDUNDANCIES OF DEVELOPMENTAL REGULATORS

The extent to which genomic functional redundancies have influenced the way that we think about biology can be appreciated simply by inspecting the vast number of times the word redundancy is used in the biomedical literature (convince yourself by examining PubMed). Particularly interesting is the abundance with which it is addressed in studies of developmental biology. In fact, it is here that concepts such as genetic buffering and canalization (Waddington, 1942) were first suggested. Furthermore, the robustness of developmental phenotypes such as body morphologies and patterning has been demonstrated repeatedly (Gerhart and Kirschner, 1997). So the question is: Are these redundancies simply leftovers of ancient duplications, or are they an additional component to the sophisticated machinery of cellular regulation? One may argue critically that many of the redundancies reported do not actually represent functionally equivalent genes but, rather, reflect only partial functional overlap. In fact, knockout phenotypes have been described for a number of developmental genes that have presumably redundant partners (Qiu et al., 1995, 1997; Enns et al., 2005). For these reasons it has been suggested that redundancy be defined as a measure of correlated, rather than degenerate, gene functions (Tautz, 1992). Although this may


suggest that these redundancies have not evolved for the sake of buffering mutations, it has, in our opinion, little relevance to the question of whether or not they serve a functional role. The interesting question, then, is: Can such a functional role for a duplicated state be inferred from the way the two genes are regulated? For most cases of developmental redundancies, redundant partners are either temporally or spatially distinct in their expression patterns (Kafri et al., 2006); however, some level of expression overlap is usually observed. Cross-regulation of redundancies has been tested in only a relatively small number of cases, yet from these a few persuasive recurring themes have emerged. One of the better known cases of cross-regulated developmental regulators is that of the four master regulators of vertebrate skeletal muscle development mentioned above: MyoD, Myf-5, myogenin, and MRF4, collectively known as the MRF gene family (Sabourin and Rudnicki, 2000). These four basic helix–loop–helix transcription factors specify and execute the process through which naive mesoderm cells differentiate to form distinct skeletal muscles [for a review, see Olson and Klein (1994)] and are activated sequentially during myogenesis. The myogenic pathway consists of two separate phases. In the first phase, MyoD and Myf-5 specify the myogenic progenitors in the somites into myoblasts, cells that are committed to becoming muscle fibers. In the second phase, myoblasts develop into myofibers, a process initiated by myogenin and MRF4 (Zhang et al., 1995). Sequence similarity between the myogenic transcription factors suggests that these have evolved through multiple gene duplication events early in the evolution of vertebrates, approximately with the appearance of fish (Krause et al., 1990; Michelson et al., 1990; Venuti et al., 1991; Holland et al., 1992; Atchley et al., 1994).
Interestingly, despite their long evolutionary separation, these regulators have largely conserved their functional redundancy. In fact, experiments on mice in which MyoD was completely inactivated resulted in viable and fertile mice that exhibited phenotypically normal skeletal muscles (Rudnicki et al., 1992). In strong contrast, mice lacking both MyoD and Myf-5 lack skeletal muscle altogether and die soon after birth (Rudnicki et al., 1993). From the perspective of this review, myogenesis is a particularly interesting process, as it harbors two responsive backup circuits. Specifically, while MyoD and Myf-5 were recently shown to be expressed in separate cell lineages (Haldar et al., 2008), mutations in MyoD elicit an up-regulation of its redundant isoform Myf-5 by increasing proliferation of the Myf-5 cell lineage. Thus, in this case extracellular signals regulate a responsive circuitry that effectively buffers against MyoD mutations. In other words, MyoD and Myf-5 form a responsive circuit that is controlled by extracellular regulation. In the case of MRF4 and myogenin, responsive circuitry is formed by the induction of myogenin in response to mutations in MRF4 (Zhang et al., 1995). Importantly, such responsive circuits are not unique to the myogenic pathway and are reported continuously for numerous developmental and signaling pathways. An additional interesting feature of the MRF responsive backup circuits is what we term dosage-dependent linear response. By this we account for the observation that the up-regulatory response induced by a heterozygous mutation is approximately half that of the homozygous mutation. In particular, for MyoD and Myf-5, mutation of one of the two MyoD alleles results in a 1.8-fold up-regulatory response of Myf-5, while disruption of both alleles results in a 3.5-fold response (Rudnicki et al., 1992). This type of linearity may hold clues as to both the function and regulation of these genetic circuits.
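The "approximately half" relationship can be checked directly from the two fold changes quoted above. The snippet below is only an arithmetic illustration; the function name `response_ratio` is hypothetical, and the calculation takes the reported fold values at face value rather than subtracting a baseline.

```python
def response_ratio(heterozygote_fold, homozygote_fold):
    """Ratio of the heterozygote response to the homozygote response."""
    return heterozygote_fold / homozygote_fold


# Fold up-regulation of Myf-5 quoted in the text (Rudnicki et al., 1992).
HET_FOLD = 1.8  # one MyoD allele disrupted
HOM_FOLD = 3.5  # both MyoD alleles disrupted

ratio = response_ratio(HET_FOLD, HOM_FOLD)
print(f"heterozygote / homozygote response = {ratio:.2f}")  # prints 0.51
```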
One attractive possibility that may be suggested by this linearity is that the

[Figure 4: circuit diagram with elements labeled Signal, MyoD (inducer), Negative feedback, Myf5 (responder), and Response.]

Figure 4 Redundancy of two duplicate genes (duplicate α and duplicate β) utilized by a responsive circuit. Although duplicates α and β perform the same molecular function, they differ in how they are regulated. Specifically, duplicate β is repressed by α (in a direct or indirect manner) such that its protein level responds reciprocally to changes in the levels of its partner. In such cases, an inductive signal up-regulating α will result in down-regulation of its partner, β. Since both duplicates perform the same molecular function, output is defined as the sum of levels of the two proteins. In this design, fluctuations of α may be balanced by a reciprocal response from β, generating a nonfluctuating output.

process carried out by these redundant regulators benefits from the constancy of the sum of their protein concentrations. In other words, while the concentration of MyoD may fluctuate, due, for example, to noise in gene expression or false induction, the sum of MyoD + Myf-5 may have evolved to remain constant (Figure 4). According to this model, the transcriptional compensation by Myf-5 seen upon mutation in MyoD may represent a by-product of the system’s ability to respond to temporary fluctuations. An additional example illustrating dosage-dependent linear response is provided by the Pax1 and Pax9 regulators of sclerotome development. Here functional redundancy has been established at the phenotypic level from mutant mouse experiments showing that Pax1 can fully rescue Pax9 mutants and, conversely, Pax9 can offset the Pax1-null phenotype to a substantial degree (Peters et al., 1999). In line with what seems to be the general case for numerous examples of developmental redundancies, Pax1 and Pax9 have partially overlapping expression domains during early development, particularly in the sclerotomes (Peters et al., 1999), although this overlap decreases in the later


stages of development. The responsive circuitry of these regulators was established using the lacZ/gal system to show an up-regulation and spatial expansion of Pax9 expression in the sclerotomes of the Pax1 mutant (Peters et al., 1999). Thus, Pax9 expression in the Pax1 mutants was observed in cells that in the wild type exhibit only Pax1 expression. Dosage dependency was observed by comparing phenotypes of combinations of the wild-type, heterozygous, and homozygous mutants of Pax1 and Pax9 (Peters et al., 1999). It is worth noting that functional redundancy was also suggested for other members of the Pax gene family, in particular for the two pairs Pax2/Pax5 (Schwarz et al., 1997) and Pax3/Pax7 (Mansouri and Gruss, 1998). All nine family members of the Pax transcription factors play roles in the genetic control of mammalian organogenesis [for a review, see Dahl et al. (1997)]. Other examples of responsive circuits of redundant developmental regulators include the closely related homeobox gene pair Gsh1 and Gsh2, for which mutational inactivation of Gsh1 resulted in a pronounced expansion of Gsh2 expression in the cerebral cortex and olfactory bulb of mice with an apparently normal phenotype (Toresson and Campbell, 2001). For the vertebrate distal-less-related regulators dlx3 and dlx7, morpholino-induced inactivation of dlx3 resulted in a strong induction of dlx7 mRNA expression in zebrafish embryos (Solomon and Fritz, 2002). For the two functionally overlapping E3 ligases Smurf-1 and Smurf-2 (Yamashita et al., 2005), knockouts of Smurf-1 were shown to result in an up-regulatory response of Smurf-2 (Kavsak et al., 2000; Lin et al., 2000; Zhang et al., 2001). For the midkine and pleiotrophin cytokines, functional redundancy was observed, and a strong up-regulatory response of pleiotrophin was shown to result from the double knockout of the midkine gene (Herradon et al., 2005).
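The constant-sum idea of Figure 4 can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a model from the cited work: the controller (α, e.g., MyoD) is assumed to fluctuate with Gaussian noise, the responder (β, e.g., Myf-5) is assumed to respond linearly and reciprocally to deviations of α from its mean, and all names (`simulate_output`, `feedback_gain`) and numbers are hypothetical.

```python
import random
import statistics


def simulate_output(feedback_gain, steps=2000, alpha_mean=10.0, beta_mean=5.0,
                    noise_sd=1.0, alpha_knockout=False, seed=0):
    """Return the summed output (alpha + beta) at each time step.

    alpha fluctuates around alpha_mean; beta is lowered when alpha is above
    its mean and raised when alpha is below it, scaled by feedback_gain.
    A gain of 0 means no cross-regulation between the duplicates.
    """
    rng = random.Random(seed)
    outputs = []
    for _ in range(steps):
        noisy_alpha = max(0.0, rng.gauss(alpha_mean, noise_sd))
        alpha = 0.0 if alpha_knockout else noisy_alpha
        beta = max(0.0, beta_mean - feedback_gain * (alpha - alpha_mean))
        outputs.append(alpha + beta)
    return outputs


unbuffered = statistics.pvariance(simulate_output(feedback_gain=0.0))
buffered = statistics.pvariance(simulate_output(feedback_gain=0.8))
knockout = simulate_output(feedback_gain=0.8, alpha_knockout=True)

print(f"output variance without feedback: {unbuffered:.3f}")
print(f"output variance with feedback:    {buffered:.3f}")
print(f"output after alpha knockout:      {knockout[0]:.1f}")
```

In this toy setting the same reciprocal response that damps fluctuations in the summed output also raises β when α is removed altogether, which mirrors the suggestion that compensation for gene loss may be a by-product of noise buffering.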

9 REDUNDANCIES AND REGULATION

The abundance of redundancies occurring in genes related to developmental processes and their functional role as master regulators may be taken to suggest their utilization in either the flexibility or robustness of regulatory control. Although for most examples regulatory interactions correlating redundant developmental genes were not tested, some have been identified specifically as displaying negative cross-regulatory inhibitions. One such example existing in E. coli is that of the pair stpA/H-NS, which regulate genome-scale transcriptional response to DNA damage (Dorman, 2004). This pair of regulators displays an additional complexity, in that its regulation involves pairwise associations that form either homodimers composed of one of the pair members or heterodimers containing both (Dorman, 2004). Nevertheless, mutational inactivation of H-NS induced an up-regulatory response of its partner, stpA, with only a marginal effect on phenotype (Zhang et al., 1996; Free and Dorman, 1997). A more recent example indicates that the multidrug resistance phenomenon in S. cerevisiae is also regulated by a responsive backup circuit (RBC), in which the transcription factor YRR1 is up-regulated in response to the deletion of its partner, YRM1 (Lucau-Danila et al., 2003; Onda et al., 2004).

9.1 Recurring Regulatory Patterns

Two architectures of cross-regulated redundancies may exist. According to the first, inactivation of each of the redundant genes from a given pair would result in the

[Figure 5: three circuit diagrams of Gene A and Gene B, panels (A)–(C); in panel (C), Gene A is labeled inducer and Gene B is labeled responder.]

Figure 5 Bidirectional (A,B) and unidirectional (C) responsive backup circuits.

induction or repression of the other (Figure 5A and B), and according to the second, only one of the pair members is responsive (Figure 5C). We have therefore suggested the terminology bidirectional responsive backup circuits and unidirectional responsive backup circuits. This distinction is important because, in the current literature review, all but two of the examples seem to fall into the unidirectional category. In light of this, for unidirectional responsive backup circuits, we suggest a further distinction between the responsive gene and the controller gene. The pattern above is but one of several asymmetries and regulatory patterns that recur systematically throughout the literature. An additional example is the classification of redundant pair members into a ubiquitously expressed gene partner and a sporadically expressed partner. One of the most profound of these recurring regulatory themes is that although both genes are capable of some functional compensation, disruption of the responder produces a significantly less deleterious phenotype than does disruption of the controller (Kafri et al., 2006). A telling example is the pair of genes Fks1 and Fks2, redundantly encoding the catalytic subunit of the yeast 1,3-β-glucan synthase (Douglas et al., 1994; Inoue et al., 1995). This enzyme is responsible for the generation of cross-links within the 1,3-β-glucan matrix, comprising the major structural component of the yeast cell wall. As such, this process requires very tight regulation of cell-wall degradation and cell cycle to enable budding and isotropic cell growth. This is also suggested by the numerous associations of FKS1 within the genetic interaction network, illustrating its linkage to processes such as cell cycle control, environmental stress responses, and mating (Lesage et al., 2004).
The surprising aspect of this story is that despite the high connectivity of FKS1, FKS2 is only sparsely connected (Lesage et al., 2004). Also, whereas deletion of FKS1 induces an up-regulatory response of FKS2 with mild phenotypic effects, deletion of FKS2 induces no regulatory response of FKS1 and also no detectable effect on the phenotype (Douglas et al., 1994). This result may seem counterintuitive, as it is FKS2 that is up-regulated to rescue against deletion of FKS1 and not vice versa, yet FKS2 is the more dispensable gene within this pair (see Table 2). A simple potential interpretation may suggest that while the controller is the key player performing some essential biological role, the responder is merely a less efficient substitute. Yet, accepting the notion that redundancy could not have evolved for the sake of buffering mutations, this interpretation is still severely lacking. A different and more biologically reasonable hypothesis accounting for these asymmetries is that one of the functions of the responder is to buffer dosage fluctuations of


the controller. This buffering capacity requires a functional overlap that also manifests itself in compensations against the rarer event of gene loss. Other models accounting for this are discussed further in this book, but our main point of argument is that this complex regulation of functionally redundant, yet evolutionarily conserved genes strongly indicates utilization of redundancy.

9.2 Regulatory Designs

What regulatory design could account for a gene sensing and responding to its redundant partner’s intactness? From the most general perspective there are three possible regulatory schemes that could answer this question. Scheme A entails a direct negative regulation of a gene by its functionally redundant partner (Figure 6A). Scheme B utilizes the substrate abundance as a proxy for its partner’s activity. In other words, overaccumulation of substrate, potentially caused by reduced or abolished efficiency of one of the responsive backup circuit pair members, signals for overproduction of the second member. Scheme C employs end-product inhibition. Assuming that an end product may inhibit both redundant partners, the lack of function of one of the partners would result in an absence of the product and hence relief of repression from the second partner. Conceptually, schemes B and C are symmetric. One instance of a responsive backup circuit that relies on direct regulatory interaction between redundant partners, without involving either substrate or end-product regulation, is that of the two vertebrate distal-less-related regulators, dlx3 and dlx7 (Figure 6B). These paralogous transcriptional regulators are both expressed in embryonic development and are involved in the development of auditory and olfactory placodes (Ekker et al., 1992; Akimenko et al., 1994; Ellies et al., 1997).
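The three schemes described in Section 9.2 can be compared in a minimal steady-state sketch. Nothing here comes from the cited studies: the rate laws, the parameters `k` and `supply`, and the function names `fixed_point` and `responder_level` are assumptions made only to show that each scheme, in its simplest form, raises the responder when the controller is lost.

```python
def fixed_point(rhs, lo=0.0, hi=1.0, iterations=60):
    """Solve r = rhs(r) by bisection, assuming rhs is non-increasing on [lo, hi]."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if rhs(mid) > mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def responder_level(scheme, controller, k=0.5, supply=1.0):
    """Steady-state responder expression (0..1) for a given controller level.

    Scheme A: the controller represses the responder directly.
    Scheme B: substrate accumulates as supply / (total enzyme activity) and
              induces the responder.
    Scheme C: end product is proportional to total enzyme activity and
              represses the responder.
    In B and C the responder contributes to total activity, so its level is
    defined implicitly and found with fixed_point.
    """
    if scheme == "A":
        return 1.0 / (1.0 + controller / k)
    if scheme == "B":
        # substrate / (k + substrate), rewritten to avoid dividing by zero
        return fixed_point(lambda r: supply / (k * (controller + r) + supply))
    if scheme == "C":
        return fixed_point(lambda r: 1.0 / (1.0 + (controller + r) / k))
    raise ValueError(f"unknown scheme: {scheme!r}")


if __name__ == "__main__":
    for scheme in "ABC":
        wild_type = responder_level(scheme, controller=1.0)
        knockout = responder_level(scheme, controller=0.0)
        print(f"scheme {scheme}: wild type={wild_type:.3f}, "
              f"controller knockout={knockout:.3f}")
```

In all three toy versions the responder level is higher in the controller knockout than in the wild type, and schemes B and C differ only in whether the signal read by the responder increases or decreases with total activity.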
By injecting anti-dlx3 and anti-dlx7 morpholino oligonucleotides (MO) in zebrafish it was shown that whereas the simultaneous inhibition of both genes (dlx3 + 7-MO) resulted in embryos having

[Figure 6: (A) pathway diagram with feedback lines labeled Scheme A, Scheme B, and Scheme C; (B) wiring diagram with elements labeled Dlx3, Dlx7, and Downstream genes.]

Figure 6 (A) Three possibilities for feedback in responsive backup circuits. For one duplicate gene to sense and respond to its partner’s intactness, feedback mechanisms must be at play. In this diagram duplicates are represented as circles that lie embedded within a reaction pathway illustrated by the consecutive arrows. Lines A, B, and C represent three feedback possibilities: simple negative regulation (scheme A), substrate induction (scheme B), and end-product regulation (scheme C). (B) Regulatory wiring for the two distal-less developmental regulators Dlx3 and Dlx7 as deduced from morpholino antisense translation inhibitions.

246

EVOLUTIONARY AND FUNCTIONAL ASPECTS OF GENETIC REDUNDANCY

severe defects in the auditory and olfactory placodes, dlx7 loss-of-function embryos appeared phenotypically normal and dlx3 MO embryos exhibited only smaller auditory placodes and inner ear structures than those of normal (or wild-type) embryos (Solomon and Fritz, 2002). Increase in dlx7 mRNA was observed in the dlx3 MO embryos (Solomon and Fritz, 2002). The regulatory relationships between dlx3 and dlx7 were tested by measuring the mRNA content of the various MO-treated embryos (Solomon and Fritz, 2002) and are summarized in a network diagram (Figure 5B) featuring both cross- and autoregulation. Thus, the lesson is that for this case, redundancy is embedded within a more complex interaction network that includes a unidirectional responsive circuit in which the controller (dlx3 ) also represses its own transcription, while the responder (dlx7 ) is a positive autoregulator. Another interesting example for which the regulatory pathway leading to induction was well characterized constitutes the unidirectional RBC of Fks1 and Fks2 in yeast (discussed above in a different context). Here the responder (Fks2 ), in addition to being activated in the fks1 mutant, is also induced by heat shock, cell-wall damage, pheromone, and Ca2+ (Garcia-Rodriguez et al., 2000). The intricate design of this circuitry is realized by the fact that there are two different, alternative signaling pathways that operate synergistically to control Fks2 expression. While response to Fks1 deletion is activated through a calcineuerin/Ca2+ -dependent pathway, response to heat shock and cell-wall damage in induced by both the former and the Roh1 -dependent cell integrity pathway (Weiss et al., 1998). Even more interesting, it was found that these two pathways induce different and complementary dynamics of the Fks2 response (Weiss et al., 1998). 
Specifically, whereas the calcineurin-dependent pathway induces a rapid but transient response, the Rho1-dependent pathway induces a delayed response that is sustained over longer time scales.

An end-product-activated feedback mechanism is demonstrated by the hexose transporters Hxt1 and Hxt2 in yeast, where the expression of both genes is repressed by the level of intracellular glucose. Thus, once the flux of glucose from the environment to the yeast’s cytoplasm decreases, an additional glucose pump is induced.

10 FUNCTIONS OF RESPONSIVE BACKUP CIRCUITS

What functions could be associated with redundant duplicates regulated by responsive circuitries as described in this chapter? One interesting possibility, raised by Kafri et al. (2006), is that such responsive circuitries allow genetic redundancy to be used to control stochastic noise in gene expression effectively. The idea is that random, undesired fluctuations of a given gene generate reciprocal fluctuations of its redundant duplicate such that the summed function remains stable. By modeling different responsive circuitries, we have shown that with such regulatory circuits genetic redundancy offers noise control that is far more efficient than that obtained by simple negative feedback regulation (Kafri et al., 2006). It is noteworthy that this suggested function, in which redundancy is utilized for noise control, is but one option and remains to be corroborated. Alternative functions include ecoparalogs (Sanchez-Perez et al., 2008).
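The reciprocal-fluctuation idea can be illustrated with a toy calculation. This is only a minimal static sketch, not the model of Kafri et al. (2006): it assumes Gaussian fluctuations, and the `coupling` parameter (the fraction of the controller's deviation that the responder offsets) is a hypothetical quantity introduced here for illustration. It also does not reproduce the comparison with simple negative feedback.

```python
import random
import statistics


def summed_output_variance(coupling, n_samples=10_000, seed=0):
    """Variance of the summed expression of two duplicates x and y.

    `coupling` is a hypothetical parameter: the fraction of x's deviation
    from its set point that y offsets (0 = no response, 1 = full offset).
    """
    rng = random.Random(seed)
    set_point = 1.0
    totals = []
    for _ in range(n_samples):
        noise_x = rng.gauss(0.0, 0.2)  # fluctuation of the controller
        noise_y = rng.gauss(0.0, 0.2)  # intrinsic fluctuation of the responder
        x = set_point + noise_x
        # The responder shifts in the opposite direction to x's deviation.
        y = set_point - coupling * noise_x + noise_y
        totals.append(x + y)
    return statistics.pvariance(totals)


independent = summed_output_variance(coupling=0.0)
responsive = summed_output_variance(coupling=0.8)
print(f"no response: {independent:.4f}, responsive: {responsive:.4f}")
```

Under these assumptions the summed output of the responsive pair varies less than that of two independently fluctuating duplicates, because part of the controller's noise is cancelled by the responder.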


11 METHODOLOGIES: INFERRING REDUNDANT INTERACTIONS

To date, different studies have relied on different means to infer redundant interactions. Generally, inference of genetic redundancy has relied on one of four methods: (1) increased dispensability of duplicates, (2) measured synergistic interactions, (3) literature curation, and (4) synthetic lethal interactions.

11.1 Increased Dispensability of Duplicates

Systematic knockout studies have revealed that the proportion of genes that are essential for cell viability is far greater among singleton genes than among duplicated genes. This difference between the proportion of essential singletons and the proportion of essential duplicates has been used to estimate the proportion of redundant interactions. It is noteworthy that this estimate is based on the assumption that the difference between the essentiality rates of duplicates and singletons is solely a consequence of compensatory interactions stemming from redundant functions. A recent extension of this method relies on the fact that the degree, k, of a gene in the physical protein interaction network (i.e., the number of proteins with which the gene product interacts) is positively correlated with the probability that the encoding gene carries out essential functions (Jeong et al., 2001). Relying on this association, we later showed that the difference between the proportion of dispensable duplicates and that of dispensable singletons increases steadily with increasing connectivity in the protein interaction network (Kafri et al., 2005). For example, duplicates with a degree higher than 10 contain a five times higher proportion of dispensable genes than do singletons of the same degree, while those with a low degree are largely as dispensable as singletons. From the perspective of the inferred essentiality of gene functions, this result suggests that among duplicates, dispensability is more likely to be associated with essential functions.
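The comparison described above amounts to tabulating the fraction of dispensable genes separately for duplicates and singletons within each connectivity class. The following is a minimal sketch of that bookkeeping. The record layout, the gene names, and the toy values are invented for illustration and are not data from the cited studies; only the degree threshold of 10 is taken from the text.

```python
from collections import defaultdict


def dispensable_fraction(genes, degree_cutoff=10):
    """Fraction of non-essential genes per (is_duplicate, degree_bin) group."""
    counts = defaultdict(lambda: [0, 0])  # key -> [dispensable, total]
    for gene in genes:
        degree_bin = "high" if gene["degree"] > degree_cutoff else "low"
        key = (gene["is_duplicate"], degree_bin)
        counts[key][1] += 1
        if not gene["essential"]:
            counts[key][0] += 1
    return {key: disp / total for key, (disp, total) in counts.items()}


# Hypothetical toy records; real input would come from knockout screens
# and a protein interaction network.
toy_genes = [
    {"name": "g1", "is_duplicate": True, "degree": 12, "essential": False},
    {"name": "g2", "is_duplicate": True, "degree": 15, "essential": False},
    {"name": "g3", "is_duplicate": False, "degree": 14, "essential": True},
    {"name": "g4", "is_duplicate": False, "degree": 11, "essential": False},
    {"name": "g5", "is_duplicate": True, "degree": 2, "essential": False},
    {"name": "g6", "is_duplicate": False, "degree": 3, "essential": False},
]

fractions = dispensable_fraction(toy_genes)
for key in sorted(fractions):
    print(key, fractions[key])
```

Grouping by degree bin before comparing duplicates with singletons keeps the comparison between genes of similar connectivity, which is the point of the extension described in the text.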
11.2 Epistatic Synergism

For a given pair of duplicate genes, x and y, let $S_x$ represent the fitness cost of the deletion of x and $S_y$ the fitness cost of the deletion of y. Further, let $S_{xy}$ represent the fitness cost of the double mutant xy. The relation $\varepsilon_{xy} = S_{xy} - S_x - S_y$ defines the degree of epistatic synergism between the duplicates x and y and is a quantitative measure of their capacity to compensate for each other’s loss of function. Kishony and co-workers (DeLuna et al., 2008) relied on careful measurements of synergistic epistasis to infer redundant interactions among the set of duplicates originating from the yeast whole-genome duplication 100 million years ago.

11.3 Literature Curation

Possibly the most reliable method of obtaining a list of functionally redundant duplicates is to collect well-characterized examples from the literature. Exhaustive literature curation was employed to obtain such a list by Kafri et al. (2008). Specifically, the authors collected a list of duplicates by performing standard BLASTP on the entire yeast genome. A script was then applied to collect, for each


such pair, all references in PubMed in which both pair members were cited together. The authors then manually inspected the resulting list of more than 2000 abstracts and publications. Genes were classified as “redundant” if they met the following criteria: (1) clear documentation in the literature, from non-high-throughput studies, of their functional overlap; and (2) experimental validation of compensatory interactions between the pair members. This search yielded 112 highly validated redundant paralogous pairs.

11.4 Synthetic Lethal Interactions

A possible alternative method of determining redundant interactions is the use of synthetic lethal interactions. For a given pair of genes, a synthetic lethal interaction occurs when the knockout of either pair member alone does not produce lethality but the double knockout does. In other words, either of the two genes is separately sufficient to allow viability. With advances in high-throughput screening technologies, synthetic lethal databases are becoming readily accessible. This accessibility has made it tempting to rely on these data sets as proxies for redundant interactions. It should be noted, however, that interpreting synthetic lethal interactions as proxies for redundant functions for nonduplicate gene pairs may be greatly misleading. Specifically, close inspection reveals that the vast majority of synthetic lethal pairs are far from being even remotely functionally equivalent. For example, the list of synthetic lethals of the FKS1 gene in yeast (Lesage et al., 2004) includes the gene FEN1. This gene encodes an enzyme that lies on the pathway of sphingolipid biosynthesis. Its genetic interaction with FKS1 is thought to result from an accumulation of its substrate, phytosphingosine, in the plasma membrane of the FEN1 deletion mutant.
Phytosphingosine, in turn, is thought to repress the interaction between Fks1p and Rho1p, possibly by forming a microdomain around Fks1p and physically preventing its association with its regulatory subunit, Rho1p (Abe et al., 2001). Thus, despite the genetic interaction, no redundancy can be inferred.
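The quantitative definitions in Sections 11.2 and 11.4 can be written down directly. The sketch below is illustrative only: the function names are hypothetical, the epistasis symbol is taken to be epsilon, and the numerical values are invented rather than measured.

```python
def epistatic_synergism(s_x, s_y, s_xy):
    """epsilon_xy = S_xy - S_x - S_y, where S is the fitness cost of a deletion."""
    return s_xy - s_x - s_y


def is_synthetic_lethal(viable_x, viable_y, viable_xy):
    """True when each single knockout is viable but the double knockout is not."""
    return viable_x and viable_y and not viable_xy


# Invented example values: each single deletion costs little,
# the double deletion costs much more than the two costs combined.
epsilon = epistatic_synergism(s_x=0.125, s_y=0.125, s_xy=0.5)
lethal_pair = is_synthetic_lethal(viable_x=True, viable_y=True, viable_xy=False)
print(epsilon, lethal_pair)
```

A positive epsilon indicates that the double mutant is costlier than the sum of the single mutants. As the FKS1 and FEN1 example shows, a pair can satisfy the synthetic lethal test without being functionally redundant, so neither function by itself establishes redundancy.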

REFERENCES

Abe M, Nishida I, et al. 2001. Yeast 1,3-beta-glucan synthase activity is inhibited by phytosphingosine localized to the endoplasmic reticulum. J Biol Chem 276(29):26923–26930.
Akimenko MA, Ekker M, et al. 1994. Combinatorial expression of three zebrafish genes related to distal-less: part of a homeobox gene code for the head. J Neurosci 14(6):3475–3486.
Alberts B. 2002. Molecular Biology of the Cell. New York: Garland Science.
Atchley WR, Fitch WM, et al. 1994. Molecular evolution of the MyoD family of transcription factors. Proc Natl Acad Sci USA 91(24):11522–11526.
Berthet C, Kaldis P. 2007. Cell-specific responses to loss of cyclin-dependent kinases. Oncogene 26(31):4469–4477.
Brookfield JF. 2003. Gene duplications: the gradual evolution of functional divergence. Curr Biol 13(6):R229–R230.
Cases S, Novak S, et al. 1998. ACAT-2, a second mammalian acyl-CoA:cholesterol acyltransferase: its cloning, expression, and characterization. J Biol Chem 273(41):26755–26764.
Conant GC, Wagner A. 2003. Asymmetric sequence divergence of duplicate genes. Genome Res 13(9):2052–2058.


Conant GC, Wagner A. 2004. Duplicate genes and robustness to transient gene knock-downs in Caenorhabditis elegans. Proc R Soc Lond B 271(1534):89–96.
Cusack BP, Wolfe KH. 2007. When gene marriages don’t work out: divorce by subfunctionalization. Trends Genet 23(6):270–272.
Dahl E, Koseki H, et al. 1997. Pax genes and organogenesis. Bioessays 19(9):755–765.
Davis JC, Petrov DA. 2004. Preferential duplication of conserved proteins in eukaryotic genomes. PLoS Biol 2(3):E55.
DeLuna A, Vetsigian K, et al. 2008. Exposing the fitness contribution of duplicated genes. Nat Genet 40(5):676–681.
Dorman CJ. 2004. H-NS: a universal regulator for a dynamic genome. Nat Rev Microbiol 2(5):391–400.
Douglas CM, Foor F, et al. 1994. The Saccharomyces cerevisiae FKS1 (ETG1) gene encodes an integral membrane protein which is a subunit of 1,3-beta-d-glucan synthase. Proc Natl Acad Sci USA 91(26):12907–12911.
Ekker M, Akimenko MA, et al. 1992. Regional expression of three homeobox transcripts in the inner ear of zebrafish embryos. Neuron 9(1):27–35.
Ellies DL, Stock DW, et al. 1997. Relationship between the genomic organization and the overlapping embryonic expression patterns of the zebrafish dlx genes. Genomics 45(3):580–590.
Enns LC, Kanaoka MM, et al. 2005. Two callose synthases, GSL1 and GSL5, play an essential and redundant role in plant and pollen development and in fertility. Plant Mol Biol 58(3):333–349.
Free A, Dorman CJ. 1997. The Escherichia coli stpA gene is transiently expressed during growth in rich medium and is induced in minimal medium and by stress conditions. J Bacteriol 179(3):909–918.
Garcia-Rodriguez LJ, Trilla JA, et al. 2000. Characterization of the chitin biosynthesis process as a compensatory mechanism in the fks1 mutant of Saccharomyces cerevisiae. FEBS Lett 478(1–2):84–88.
Gasch AP, Spellman PT, et al. 2000. Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 11(12):4241–4257.
Gerhart J, Kirschner M. 1997. Cells, Embryos, and Evolution: Toward a Cellular and Developmental Understanding of Phenotypic Variation and Evolutionary Adaptability. Malden, MA: Blackwell Science.
Gu X. 2003. Evolution of duplicate genes versus genetic robustness against null mutations. Trends Genet 19(7):354–356.
Gu Z, Nicolae D, et al. 2002. Rapid divergence in expression between duplicate genes inferred from microarray data. Trends Genet 18(12):609–613.
Gu Z, Steinmetz LM, et al. 2003. Role of duplicate genes in genetic robustness against null mutations. Nature 421(6918):63–66.
Gu Z, Rifkin SA, et al. 2004. Duplicate genes increase gene expression diversity within and between species. Nat Genet 36(6):577–579.
Haldar M, Karan G, et al. 2008. Two cell lineages, myf5 and myf5-independent, participate in mouse skeletal myogenesis. Dev Cell 14(3):437–445.
He X, Zhang J. 2005a. Gene complexity and gene duplicability. Curr Biol 15(11):1016–1021.
He X, Zhang J. 2005b. Higher duplicability of less important genes in yeast genomes. Mol Biol Evol.
He X, Zhang J. 2006. Transcriptional reprogramming and backup between duplicate genes: Is it a genomewide phenomenon? Genetics 172(2):1363–1367.


Herradon G, Ezquerra L, et al. 2005. Midkine regulates pleiotrophin organ-specific gene expression: evidence for transcriptional regulation and functional redundancy within the pleiotrophin/midkine developmental gene family. Biochem Biophys Res Commun 333(3):714–721.
Holland PW, Holland LZ, et al. 1992. An Amphioxus homeobox gene: sequence conservation, spatial expression during development and insights into vertebrate evolution. Development 116(3):653–661.
Ihmels J, Levy R, et al. 2004. Principles of transcriptional control in the metabolic network of Saccharomyces cerevisiae. Nat Biotechnol 22(1):86–92.
Inoue SB, Takewaki N, et al. 1995. Characterization and gene cloning of 1,3-beta-d-glucan synthase from Saccharomyces cerevisiae. Eur J Biochem 231(3):845–854.
Jeong H, Mason SP, et al. 2001. Lethality and centrality in protein networks. Nature 411(6833):41–42.
Jordan IK, Wolf YI, et al. 2004. Duplicated genes evolve slower than singletons despite the initial rate increase. BMC Evol Biol 4:22.
Kafri R, Bar-Even A, et al. 2005. Transcription control reprogramming in genetic backup circuits. Nat Genet 37(3):295–299.
Kafri R, Levy M, et al. 2006. The regulatory utilization of genetic redundancy through responsive backup circuits. Proc Natl Acad Sci USA 103(31):11653–11658.
Kafri R, Dahan O, et al. 2008. Preferential protection of protein interaction network hubs in yeast: evolved functionality of genetic redundancy. Proc Natl Acad Sci USA 105(4):1243–1248.
Kavsak P, Rasmussen RK, et al. 2000. Smad7 binds to Smurf2 to form an E3 ubiquitin ligase that targets the TGF beta receptor for degradation. Mol Cell 6(6):1365–1375.
Kellis M, Birren BW, et al. 2004. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 428(6983):617–624.
Kirschner MW. 2005. The meaning of systems biology. Cell 121(4):503–504.
Kirschner M, Gerhart J. 1998. Evolvability. Proc Natl Acad Sci USA 95(15):8420–8427.
Kirschner M, Gerhart J. 2005. The Plausibility of Life: Resolving Darwin’s Dilemma. New Haven, CT: Yale University Press.
Kondrashov FA, Koonin EV. 2004. A common framework for understanding the origin of genetic dominance and evolutionary fates of gene duplications. Trends Genet 20(7):287–290.
Krakauer DC, Nowak MA. 1999. Evolutionary preservation of redundant duplicated genes. Semin Cell Dev Biol 10(5):555–559.
Krause M, Fire A, et al. 1990. CeMyoD accumulation defines the body wall muscle cell fate during C. elegans embryogenesis. Cell 63(5):907–919.
Landry CR, Oh J, et al. 2006. Genome-wide scan reveals that genetic variation for transcriptional plasticity in yeast is biased towards multi-copy and dispensable genes. Gene 366(2):343–351.
Leon M, Sentandreu R, et al. 2002. A single FKS homologue in Yarrowia lipolytica is essential for viability. Yeast 19(12):1003–1014.
Lesage G, Sdicu AM, et al. 2004. Analysis of beta-1,3-glucan assembly in Saccharomyces cerevisiae using a synthetic interaction network and altered sensitivity to caspofungin. Genetics 167(1):35–49.
Lin X, Liang M, et al. 2000. Smurf2 is a ubiquitin E3 ligase mediating proteasome-dependent degradation of Smad2 in transforming growth factor-beta signaling. J Biol Chem 275(47):36818–36822.
Lin YS, Hwang JK, et al. 2006. Protein complexity, gene duplicability and gene dispensability in the yeast genome. Gene.


Lucau-Danila A, Delaveau T, et al. 2003. Competitive promoter occupancy by two yeast paralogous transcription factors controlling the multidrug resistance phenomenon. J Biol Chem 278(52):52641–52650.
Lynch M, Conery JS. 2000. The evolutionary fate and consequences of duplicate genes. Science 290(5494):1151–1155.
Makova KD, Li WH. 2003. Divergence in the spatial pattern of gene expression between human duplicate genes. Genome Res 13(7):1638–1645.
Mansouri A, Gruss P. 1998. Pax3 and Pax7 are expressed in commissural neurons and restrict ventral neuronal identity in the spinal cord. Mech Dev 78(1–2):171–178.
McCammon MT, McAlister-Henn L. 2003. Multiple cellular consequences of isocitrate dehydrogenase isozyme dysfunction. Arch Biochem Biophys 419(2):222–233.
Michelson AM, Abmayr SM, et al. 1990. Expression of a MyoD family member prefigures muscle pattern in Drosophila embryos. Genes Dev 4(12A):2086–2097.
Musso G, Costanzo M, et al. 2008. The extensive and condition-dependent nature of epistasis among whole-genome duplicates in yeast. Genome Res.
Nowak MA, Boerlijst MC, et al. 1997. Evolution of genetic redundancy. Nature 388(6638):167–171.
Ohno S. 1970. Evolution by Gene and Genome Duplication. New York: Springer-Verlag.
Olson EN, Klein WH. 1994. bHLH factors in muscle development: dead lines and commitments, what to leave in and what to leave out. Genes Dev 8(1):1–8.
Onda M, Ota K, et al. 2004. Analysis of gene network regulating yeast multidrug resistance by artificial activation of transcription factors: involvement of Pdr3 in salt tolerance. Gene 332:51–59.
Ozcan S. 2002. Two different signals regulate repression and induction of gene expression by glucose. J Biol Chem 277(49):46993–46997.
Papp B, Pál C, et al. 2004. Metabolic network analysis of the causes and evolution of enzyme dispensability in yeast. Nature 429(6992):661–664.
Pearce AC, Senis YA, et al. 2004. Vav1 and vav3 have critical but redundant roles in mediating platelet activation by collagen. J Biol Chem 279(52):53955–53962.
Peters H, Wilm B, et al. 1999. Pax1 and Pax9 synergistically regulate vertebral column development. Development 126(23):5399–5408.
Qiu M, Bulfone A, et al. 1995. Null mutation of Dlx-2 results in abnormal morphogenesis of proximal first and second branchial arch derivatives and abnormal differentiation in the forebrain. Genes Dev 9(20):2523–2538.
Qiu M, Bulfone A, et al. 1997. Role of the Dlx homeobox genes in proximodistal patterning of the branchial arches: mutations of Dlx-1, Dlx-2, and Dlx-1 and -2 alter morphogenesis of proximal skeletal and soft tissue structures derived from the first and second arches. Dev Biol 185(2):165–184.
Quezada H, Aranda C, et al. 2008. Specialization of the paralogue LYS21 determines lysine biosynthesis under respiratory metabolism in Saccharomyces cerevisiae. Microbiology 154(Pt. 6):1656–1667.
Rudnicki MA, Braun T, et al. 1992. Inactivation of MyoD in mice leads to up-regulation of the myogenic HLH gene Myf-5 and results in apparently normal muscle development. Cell 71(3):383–390.
Rudnicki MA, Schnegelsberg PN, et al. 1993. MyoD or Myf-5 is required for the formation of skeletal muscle. Cell 75(7):1351–1359.
Sabourin LA, Rudnicki MA. 2000. The molecular regulation of myogenesis. Clin Genet 57(1):16–25.


Sanchez-Perez G, Mira A, et al. 2008. Adapting to environmental changes using specialized paralogs. Trends Genet 24(4):154–158.
Schwarz M, Alvarez-Bolado G, et al. 1997. Conserved biological function between Pax-2 and Pax-5 in midbrain and cerebellum development: evidence from targeted mutations. Proc Natl Acad Sci USA 94(26):14518–14523.
Semon M, Wolfe KH. 2007. Consequences of genome duplication. Curr Opin Genet Dev 17(6):505–512.
Solomon KS, Fritz A. 2002. Concerted action of two dlx paralogs in sensory placode formation. Development 129(13):3127–3136.
Taniguchi T. 1995. Cytokine signaling through nonreceptor protein tyrosine kinases. Science 268(5208):251–255.
Tautz D. 1992. Redundancies, development and the flow of information. Bioessays 14(4):263–266.
Tischler J, Lehner B, et al. 2006. Combinatorial RNA interference in C. elegans reveals that redundancy between gene duplicates can be maintained for more than 80 million years of evolution. Genome Biol 7(8):R69.
Toresson H, Campbell K. 2001. A role for Gsh1 in the developing striatum and olfactory bulb of Gsh2 mutant mice. Development 128(23):4769–4780.
Van den Berg MA, de Jong-Gubbels P, et al. 1996. The two acetyl-coenzyme A synthetases of Saccharomyces cerevisiae differ with respect to kinetic properties and transcriptional regulation. J Biol Chem 271(46):28953–28959.
Venuti JM, Goldberg L, et al. 1991. A myogenic factor from sea urchin embryos capable of programming muscle differentiation in mammalian cells. Proc Natl Acad Sci USA 88(14):6219–6223.
Waddington CH. 1942. Nature 150:563–565.
Wagner A. 2001. Birth and death of duplicated genes in completely sequenced eukaryotes. Trends Genet 17(5):237–239.
Wagner A. 2002. Asymmetric functional divergence of duplicate genes in yeast. Mol Biol Evol 19(10):1760–1768.
Weiss K, Stock D, et al. 1998. Perspectives on genetic aspects of dental patterning. Eur J Oral Sci 106 Suppl. 1:55–63.
Wolfe KH, Shields DC. 1997. Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387(6634):708–713.
Wong SL, Roth FP. 2005. Transcriptional compensation for gene loss plays a minor role in maintaining genetic robustness in Saccharomyces cerevisiae. Genetics 171(2):829–833.
Yamashita M, Ying SX, et al. 2005. Ubiquitin ligase Smurf1 controls osteoblast activity and bone homeostasis by targeting MEKK2 for degradation. Cell 121(1):101–113.
Yang H, Bard M, et al. 1996. Sterol esterification in yeast: a two-gene process. Science 272(5266):1353–1356.
Yu C, Kennedy NJ, et al. 1996. Molecular cloning and characterization of two isoforms of Saccharomyces cerevisiae acyl-CoA:sterol acyltransferase. J Biol Chem 271(39):24157–24163.
Zhang W, Behringer RR, et al. 1995. Inactivation of the myogenic bHLH gene MRF4 results in up-regulation of myogenin and rib anomalies. Genes Dev 9(11):1388–1399.
Zhang A, Rimsky S, et al. 1996. Escherichia coli protein analogs StpA and H-NS: regulatory loops, similar and disparate effects on nucleic acid dynamics. EMBO J 15(6):1340–1349.
Zhang Y, Chang C, et al. 2001. Regulation of Smad degradation and activity by Smurf2, an E3 ubiquitin ligase. Proc Natl Acad Sci USA 98(3):974–979.
Zhao C, Jung US, et al. 1998. Temperature-induced expression of yeast FKS2 is under the dual control of protein kinase C and calcineurin. Mol Cell Biol 18(2):1013–1022.

14

Phylogenomic Approach to the Evolutionary Dynamics of Gene Duplication in Birds

CHRIS L. ORGAN* Museum of Comparative Zoology, Harvard University, Cambridge, Massachusetts

MATTHEW D. RASMUSSEN* Computer Science and Artificial Intelligence, Massachusetts Institute of Technology, Cambridge, Massachusetts

MAUDE W. BALDWIN Museum of Comparative Zoology, Harvard University, Cambridge, Massachusetts

MANOLIS KELLIS Computer Science and Artificial Intelligence, Massachusetts Institute of Technology, Cambridge, Massachusetts

SCOTT V. EDWARDS Museum of Comparative Zoology, Harvard University, Cambridge, Massachusetts

* These two authors contributed equally to this work.

Evolution After Gene Duplication, Edited by Katharina Dittmar and David Liberles. Copyright © 2010 Wiley-Blackwell

1 INTRODUCTION

New genes are thought to arise primarily through a process of gene duplication. Genes that are homologous as a result of divergence across lineages via speciation are said to be orthologous, whereas genes that are homologous as a result of gene duplication are paralogous (Li, 2006). When paralogous genes are functionally redundant and selectively nearly neutral, one copy can be mutated into a functionless sequence called a pseudogene or can be deleted altogether. On the other hand, some duplicated genes can be beneficial from their time of origin because of dosage effects (Kondrashov et al., 2002) and may ultimately be important for speciation. In Passeriformes (perching birds) this may be the case for growth


hormone (GH) paralogs, which have undergone differential selection since their divergence (Yuri et al., 2008). In a process called subfunctionalization, each paralog adopts part of the function of the ancestral gene (Nowak et al., 1997; Lynch and Force, 2000). Changes in gene expression immediately following duplication as a result of subfunctionalization appear to be common (Gu et al., 2002). Duplicated genes can also diverge to produce novel functions in a process known as neofunctionalization (Zhang, 2003). For example, some duplicated members of the RNaseA gene superfamily in primates evolved a novel antibacterial function that was not present in the common ancestral gene or its descendants (Zhang et al., 1998). Through the acquisition of novel functions, gene duplication plays a vital role in generating diversity at both molecular (Eirin-Lopez et al., 2004) and organismal (Hittinger and Carroll, 2007) levels and can increase the phenotypic complexity and diversity of animals (Ohno, 1970; Maniatis and Tasic, 2002). Protein-coding genes, regulatory genes, and noncoding RNA genes are all subject to gene duplication. Gene duplication occurs in all three major domains of life: Bacteria, Archaea, and Eukarya (Zhang, 2003), and comparative analyses have suggested an average origin rate of new gene duplicates on the order of 0.1 per gene per million years (Gao and Innan, 2004; Osada and Innan, 2008). However, Gao and Innan (2004) and Osada and Innan (2008) have shown that using silent-site divergence, as Lynch and Conery (2000) did, can seriously overestimate the rate of duplication because gene conversion produces spurious similarity between paralogs. More robust rates can be estimated from phylogenetic trees in which the role of gene conversion can be properly assessed, and these authors have suggested rates of duplication for yeast and Drosophila one to three orders of magnitude lower than Lynch and Conery’s rates.
The ubiquity of gene duplication and its power to generate material on which selection may act is an especially interesting topic in birds because of their uniquely structured genomes. Birds have smaller genomes than those of any other amniote group (Gregory, 2002), and they also have a substantially reduced density of active repetitive elements (Shedlock, 2006), segmental duplications, and pseudogenes (Hillier et al., 2004). Birds may also have fewer protein-coding genes in their genomes than mammals have, with roughly 18,000 in chickens compared with approximately 22,000 in humans (Hillier et al., 2004). Comparison of the chicken genome with those of humans and pufferfish (Fugu) suggests that the difference in gene count can be explained by significantly less lineage-specific gene duplication along the lineage of birds. For example, gene duplication by retrotransposition appears to be rare in birds compared with mammals, which have roughly 300 times as many retrotransposed gene duplicates as chicken (over 15,000 compared with 51) (Hillier et al., 2004). Of the 51 duplicates detected in the chicken genome by Hillier et al. (2004), 36 appear to be pseudogenes. The disparity is probably due in part to the abundance of long interspersed nuclear elements (LINEs) in mammal genomes, which are probably responsible for the reverse transcription of retrotransposed gene duplicates. Chicken repeat 1 (CR1) is the dominant active transposable element in chicken and other archosaurs (Shedlock et al., 2007), a feature that may account for the lack of retrotransposed gene duplicates due to this element’s inability to copy polyadenylated mRNA (Haasa et al., 2001; Hillier et al., 2004). The whole-genome sequencing of the chicken (Gallus gallus) provided an unprecedented window into the architecture of bird genomes (Hillier et al., 2004). This study revealed a number of important details of gene duplication. 
For example, most of the expansions in gene families within chicken are associated with the immune system and host defense against parasites (Ota and Nei, 1994; Nei et al., 1997). The


chicken genome project found an expansion of the scavenger receptor cysteine-rich (SRCR) domain (Hillier et al., 2004), a highly conserved protein module involved in the innate immune system (Sarrias et al., 2004). Certain olfactory receptors are linked with the major histocompatibility complex (MHC) class I gene cluster in humans; some subgroups, such as the γ-c subgroup, also underwent a lineage-specific expansion within the chicken (Steiger et al., 2009). These olfactory receptor genes, related to two orthologs in human (OR5U1 and OR5BF1), appear to have expanded within birds relatively recently to constitute the majority of the over 200 olfactory receptor genes in the chicken genome. The expansions of these genes could conceivably be part of the genetic mechanism linking the immune and olfactory systems with mate choice, kin recognition, and social interactions in birds (Zelano and Edwards, 2002). Other gene expansions were probably also important for key innovations in birds, such as expansions in the keratin gene family, whose products are the proteinaceous building blocks of feathers and are therefore vital for thermoregulation, sexual display, and flight. However, even in the absence of selection (most gene duplicates are thought to be pseudogenized during evolution), gene duplication may still play a critical role in creating postmating reproductive barriers that aid in speciation (Lynch and Conery, 2000). The age of gene duplications within lineages is also of fundamental interest in the study of gene family evolution. Analysis of animal, plant, and fungal genomes has shown that the majority of duplications are recent, because most duplicated genes are thought to be pseudogenized shortly after the duplication event (Lynch and Conery, 2000). It is currently unknown whether the small, streamlined genomes of birds deviate from this pattern, given the lower number of total genes, paucity of transposable elements, and highly recombinant microchromosomes found in chicken.
Here we explore the evolution of chicken gene families within the larger context of amniote evolution. Phylogenomics (Eisen, 1998) attempts to reconstruct systematically the phylogeny of gene families across multiple complete genomes so as to deduce putative protein homologies and functions as well as the specific gene duplications and losses responsible for gene family diversity. The phylogenomic approach to multigene family evolution has shown recent success in many different clades, such as vertebrates (Zmasek and Eddy, 2002; Storm and Sonnhammer, 2003; Li et al., 2006; Huerta-Cepas et al., 2007) and 16 fungal species (Wapinski et al., 2007). Recently, Rasmussen and Kellis (2007) deduced a pattern of substitution rates within and among Drosophila and yeast species by using machine learning to compare the gene trees in multigene families to the presumed species tree. In addition, Heger and Ponting (2007) developed a comprehensive database of orthologous and paralogous gene sets for several clades, including amniotes. Here we follow a complementary approach to characterize the dynamics and evolution of gene duplication within birds. Whereas many previous approaches to recognizing paralogs use reciprocal BLAST hits and pairwise comparison of orthologs, we apply Ensembl phylogenies and a variety of phylogenetic and comparative analyses within a clade of two dozen sequenced amniote species to analyze the dynamics of gene duplication in the complete chicken (Gallus gallus) genome.
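For readers unfamiliar with the reciprocal-hit criterion mentioned above, the following is a minimal sketch of how reciprocal best hits can be extracted from a table of pairwise similarity scores. It is a simplified illustration, not the procedure used in this chapter (which relies on Ensembl gene trees); the gene names and scores are hypothetical.

```python
def reciprocal_best_hits(scores):
    """Return gene pairs that are each other's top-scoring hit.

    `scores` maps (query, subject) -> similarity score, for example a
    BLAST bit score. Self-hits are ignored.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query == subject:
            continue
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    pairs = set()
    for query, (subject, _) in best.items():
        if subject in best and best[subject][0] == query:
            pairs.add(tuple(sorted((query, subject))))
    return pairs


# Hypothetical scores: a1 and b1 pick each other; a2 prefers b1,
# but b1 does not prefer a2, so a2 is left unpaired.
toy_scores = {
    ("a1", "b1"): 200, ("b1", "a1"): 195,
    ("a2", "b1"): 150, ("b1", "a2"): 140,
    ("a1", "a2"): 90, ("a2", "a1"): 90,
}
rbh_pairs = reciprocal_best_hits(toy_scores)
print(rbh_pairs)
```

Because the criterion keeps only mutual top hits, it considers one pair at a time and carries no information about the order of duplication and speciation events, which is the kind of information a gene tree supplies.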

2 METHODOLOGY AND COMPUTATIONAL APPROACH

For our analysis, we obtained 25,363 gene trees from Ensembl’s (v50) gene tree database. Ensembl’s gene trees were produced with their custom computational pipeline. Initially, family clusters were defined by single-linkage clustering of best reciprocal BLAST hits between amino acid translations of all genes within the Ensembl database (Hubbard et al., 2007). Each cluster was then processed by Ensembl to create a multiple amino acid alignment using the alignment algorithm MUSCLE (Edgar, 2004) and a maximum likelihood phylogenetic tree using PHYML (Guindon and Gascuel, 2003). Our database of gene trees assembled from Ensembl contains 611,441 genes from 39 species, including 25 mammals, Gallus gallus, Xenopus tropicalis, five fish, six invertebrates, and Saccharomyces cerevisiae. There are 17,487 annotated chicken (Gallus gallus) genes in the Ensembl database, 13,649 (78%) of which belong to one of 7785 gene trees.
To study gene duplication along the bird lineage, we identified the divergence between birds and mammals in each gene tree (Figures 1 and 2). This event is represented as a node that is parental to one clade of chicken-specific genes and another clade of orthologs and paralogs found only in mammals. Although we describe the chicken paralogs as chicken-specific, we do so only because chicken is the only bird in our analysis; presumably, sampling of additional genomes within Reptilia would reveal many of these genes and gene duplications to be shared with other birds and nonavian reptiles. If duplications occurred prior to the common ancestor of birds and mammals, a single family may have diversified significantly into multiple orthologs through speciation. In addition, some of these ancient gene duplications can be partially obscured when there is complete gene loss in either the bird or mammalian lineage. We were able to identify 12,094 unambiguous and datable divergences due to the bird–mammal divergence (using Gallus and Homo). For each such divergence we isolated the tree of amniote genes rooted at the amniote ancestor for further analysis. Outgroup species (fish; Fugu) were used to help position the root of the gene tree. We expect this protocol to yield chicken gene families that differ from those circumscribed, for example, solely by reciprocal BLAST hits; sometimes our definition of chicken gene families includes genes not detected by nonphylogenetic methods, whereas in other instances our approach will miss some genes that manual inspection and curation would have revealed. On the other hand, our approach has the advantage of being objective and repeatable, and can easily be extended to study the dynamics of gene duplication in other taxa.

Figure 1 Phylogenetic pattern that gene duplication takes in multispecies comparisons. The star indicates duplication before species divergence and a gray triangle indicates possible duplication after species diversification.

Figure 2 Olfactory receptor gene tree as defined by single-linkage clustering of best reciprocal BLAST hits between amino acid translations within the Ensembl database. It includes four chicken genes related to chicken olfactory receptor 7 (COR7). In Table 1 we give statistics for a gene tree including the three Gallus COR7 family members that cluster together, as the fourth Gallus member split before the amniote divergence, and thus was part of a second gene tree. These three chicken genes, together with the nine mammalian paralogs that form a monophyletic group after the amniote speciation, are an example of a gene family used in various analyses throughout this chapter. The tip names are designated as Ensembl proteins.
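The node-finding step described above can be sketched in a few lines. The following is only an illustration under stated assumptions, not the pipeline used in this chapter: the nested-tuple tree encoding, the "Species|gene" tip labels, and the function names `tips`, `species_of`, and `amniote_split_nodes` are all hypothetical. It looks for bifurcations whose two child clades contain only chicken genes and only mammalian genes, respectively.

```python
from typing import Iterator, List, Union

# Hypothetical encoding: a leaf is "Species|gene_id"; an internal node is a
# tuple of child trees.
Tree = Union[str, tuple]


def tips(tree: Tree) -> List[str]:
    """Return all tip labels below `tree`, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [tip for child in tree for tip in tips(child)]


def species_of(tree: Tree) -> set:
    """Return the set of species represented among the tips of `tree`."""
    return {tip.split("|")[0] for tip in tips(tree)}


def amniote_split_nodes(tree: Tree, bird: str, mammals: frozenset) -> Iterator[tuple]:
    """Yield bifurcations with one bird-only child clade and one mammal-only child clade."""
    if isinstance(tree, str):
        return
    if len(tree) == 2:
        a, b = species_of(tree[0]), species_of(tree[1])
        if (a == {bird} and b <= mammals) or (b == {bird} and a <= mammals):
            yield tree
            return  # both children are single-group clades, so no split lies deeper
    for child in tree:
        yield from amniote_split_nodes(child, bird, mammals)


# Toy gene tree: a fish outgroup, three chicken paralogs, three mammalian genes.
toy_tree = (
    "Fugu|g0",
    (
        (("Gallus|g1", "Gallus|g2"), "Gallus|g3"),
        ("Homo|g4", ("Homo|g5", "Mus|g6")),
    ),
)
splits = list(amniote_split_nodes(toy_tree, "Gallus", frozenset({"Homo", "Mus"})))
chicken_paralogs = [t for t in tips(splits[0]) if t.startswith("Gallus|")]
print(len(splits), chicken_paralogs)
```

In this toy tree the subtree below the fish outgroup is the single qualifying node, and its chicken clade holds three paralogs. A real analysis would read the Ensembl trees with a phylogenetics library and would also have to handle multifurcations and gene loss, which this sketch ignores.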

3 RESULTS: DYNAMICS OF CHICKEN-SPECIFIC GENE DUPLICATION

We found that the distribution of the number of paralogs per chicken gene family was heavily skewed toward small families (Figure 3). Whereas nearly 30 gene families had three members, only six families had more than nine. These figures focus only on those gene families that duplicated after the divergence of chicken from the amniote ancestor (by comparing Gallus with Homo). Thus, very ancient gene families (many potentially with large numbers of family members) do not figure into this calculation. Still, we were surprised by the small number of very large (>20) gene families in the chicken. Gene family size has been found to follow a power-law distribution in animals, a pattern thought to be produced by differential rates of pseudogenization among families (Huynen and van Nimwegen, 1998; Hughes and Liberles, 2008). Moreover, birth–death models and purifying selection appear to account for much of the conservation seen within lineage-specific paralogs (Ota and Nei, 1994; Piontkivska et al., 2002; Piontkivska and Nei, 2003; Eirin-Lopez et al., 2004).

Figure 3 Count of gene families vs. the number of chicken genes in amniote gene trees.

Substitution rates within amniote lineages are known to be quite variable, and we expect similar rate variation among paralogous members of chicken gene families. We estimated dates of divergence for each node in our chicken multigene family trees by using a penalized likelihood model, as implemented in the r8s rate analysis program (Sanderson, 2003), and by using 310 million years as an estimate for the Gallus/Homo common ancestor (Benton and Donoghue, 2007). The penalized likelihood model finds the optimal trade-off between maximizing the likelihood of a Poisson process for nucleotide substitution and a penalty term for rate variation between neighboring branches. The weighting of these two terms is determined by a coefficient λ, such that higher values of λ greatly penalize rate variation in favor of a clocklike model, and lower values allow a large amount of rate variation. Using a cross-validation procedure (Sanderson, 2003), we determined the optimal choice of λ from the possible values 10^-2, 10^-1, 1, 10, 10^2, and 10^3. Within our amniote gene trees we found that over 54% of them were best explained by a λ value of 10^-2, indicating that a majority of gene families exhibit substantial rate variation among lineages (Figure 4). This rate variation probably stems from two sources: natural deviations in the clock as commonly found, for example, in phylogenetic analyses of different species; and bursts of adaptive evolution among newly evolved gene family members.
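To make the role of λ concrete, here is a minimal sketch of a penalized objective of the kind described above. It is a simplified illustration, not the r8s implementation: the function names, the per-branch data layout, and the toy numbers are assumptions, and the penalty used here is simply the sum of squared rate differences between each branch and its parent branch.

```python
import math


def poisson_log_likelihood(subs, rates, durations):
    """Log-likelihood of observed substitution counts, one Poisson term per branch."""
    total = 0.0
    for k, rate, time in zip(subs, rates, durations):
        mean = rate * time  # expected substitutions on this branch
        total += k * math.log(mean) - mean - math.lgamma(k + 1)
    return total


def roughness_penalty(rates, parent):
    """Sum of squared rate differences between each branch and its parent branch."""
    return sum((rates[i] - rates[p]) ** 2
               for i, p in enumerate(parent) if p is not None)


def penalized_log_likelihood(subs, rates, durations, parent, lam):
    """Poisson log-likelihood minus lambda times the rate-variation penalty."""
    return (poisson_log_likelihood(subs, rates, durations)
            - lam * roughness_penalty(rates, parent))


# Toy tree (hypothetical numbers): branch 0 is the parent of branches 1 and 2.
parent = [None, 0, 0]
subs = [10, 30, 5]                 # observed substitutions per branch
durations = [100.0, 100.0, 100.0]  # branch durations in Myr
variable_rates = [0.10, 0.30, 0.05]   # each branch fitted to its own data
clocklike_rates = [0.15, 0.15, 0.15]  # one shared rate

for lam in (1e-2, 1e3):
    v = penalized_log_likelihood(subs, variable_rates, durations, parent, lam)
    c = penalized_log_likelihood(subs, clocklike_rates, durations, parent, lam)
    print(lam, "variable" if v > c else "clocklike")
```

With the small λ the branch-specific rates score higher, whereas with the large λ the penalty dominates and the single shared rate is preferred, which is the trade-off the text describes. In practice both the rates and the node ages are optimized jointly, and λ is chosen by cross-validation rather than fixed by hand.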
Under the first hypothesis, it might be expected that larger gene families would exhibit more rate variation than would small ones, and that the incidence of rate variation would increase with family size. However, we did not find this trend when we regressed λ on gene family size (Figure 4). Many of the gene families best explained by a log λ value of −2 showed levels of divergence that suggest duplication during the Cenozoic era, especially during the Neogene period. For this reason we suspect that much of the rate variation among gene family members may in fact be due to adaptive bursts, because generation time effects among different lineages of birds are expected to influence rate variation only for those gene families that duplicated prior to the chicken’s divergence from other lineages. Of the other categories of substitution rate variation among gene family members, the class best explained by log λ = 3 was the next most common. Substitution rates among gene family members in this category are fairly clocklike.

Figure 4 Branch length rate variation (molecular clock; λ) for amniote gene trees (25 mammals and chicken). (A) Distribution of branch length rate variation binned in different λ values. Low values of λ represent highly variable rates, and high values of λ represent clocklike rates. (B) Comparison of each gene tree’s best-fit λ to its size shows little correlation (r^2 = 0.0003, p = 0.08, n = 6,901). Values of λ were visually dithered to illustrate density.

The ages of gene duplications in chicken are distributed exponentially, with most duplications occurring recently (Figure 5). This pattern is consistent with previous analyses, regardless of whether synonymous substitutions or phylogenetic analyses are used (Lynch and Conery, 2000), and suggests that, assuming a relatively constant rate of gene duplication, most genes are pseudogenized or eliminated from the genome soon after duplication. However, these results are also consistent with widespread gene conversion between paralogs (Gao and Innan, 2004; Osada and Innan, 2008), in which the duplication event between highly similar sequences would be older than direct sequence comparisons would suggest. With the full genome sequence of two birds it is difficult to untangle the relative contributions of gene loss or gene conversion in producing the skewed age distribution of gene duplications. Moreover, this pattern could also suggest that concerted evolution might be more common among chicken gene families than in other groups.
For example, concerted evolution among major histocompatibility complex (MHC) paralogs in birds is thought to occur more frequently, and over a shorter time scale, than in mammals (Hess and Edwards, 2002), and the phylogenetic scale over which MHC orthologs can be identified may be smaller in birds than in mammals (but see Burri et al., 2008).

Figure 5 Age of paralogs on the lineage leading to Gallus gallus that evolved after the divergence between Gallus and Homo. Pivotal events during the evolution of this lineage are noted on the figure. Pg, Paleogene; Ng, Neogene.
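The link between a constant duplication rate, steady loss, and an exponential age distribution can be illustrated with a short calculation. This is a sketch under simple assumptions that are not estimated from the chicken data: duplicates arise at a constant rate, each is lost independently at a constant per-gene rate, and the rate values and function names are hypothetical.

```python
import math


def expected_survivors(age_mya, birth_rate, loss_rate):
    """Expected surviving duplicates per Myr-wide age class under constant birth and loss."""
    return birth_rate * math.exp(-loss_rate * age_mya)


def half_life(loss_rate):
    """Age (in Myr) at which half of a duplicate cohort is expected to remain."""
    return math.log(2) / loss_rate


# Hypothetical parameters: 2 duplications per Myr, 2% of duplicates lost per Myr.
birth, loss = 2.0, 0.02
ages = [0, 50, 100, 200, 300]
profile = [expected_survivors(a, birth, loss) for a in ages]
for age, n in zip(ages, profile):
    print(f"{age:>3} Mya: {n:.3f} surviving duplicates per Myr")
print(f"half-life: {half_life(loss):.1f} Myr")
```

The expected number of surviving duplicates falls off exponentially with age, so most observed duplicates are young even though duplications arose at the same rate throughout. Gene conversion would produce a similar excess of apparently young duplicates, which is why this pattern alone cannot separate the two explanations.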

4 EXAMPLES OF FAMILIES WITH CHICKEN-SPECIFIC DUPLICATIONS

Gene family composition is shaped both by gene gain and loss, yet as other researchers have noted (Furlong, 2005), gene family expansion is easier to detect, especially when annotation is not complete and gaps remain in recent genome builds. We examined several gene families containing lineage-specific expansions in the chicken, using the amniote gene tree rooted at the chicken–human divergence. In Table 1 we describe the dynamics of five representative families: Toll-like receptors, hemoglobin, ovalbumin-related serpins, four subfamilies of olfactory receptors, and keratin. These families were selected for their variety in size, age, and function and because the annotation and family membership could be at least partially cross-validated with recent studies.

4.1 Toll-like Receptors
Temperley and colleagues (2008) describe the evolutionary history and chromosomal location of chicken Toll-like receptors (TLRs), a family that is part of the innate immune system and is characterized by an ancient, highly conserved pathogen-recognition

TABLE 1 Summary of Properties for Specific Gene Families in the Amniotes^a

| Family Name | Number of Amniote Paralogs | Number of Chicken Paralogs | Mean Branch Length from Tips to Duplication in Chicken | Estimated Duplication Time (Mya) | Sequence Divergence Before Chicken Duplications After Amniote Divergence | Log λ |
|---|---|---|---|---|---|---|
| Toll-like receptors (TLR2A and TLR2B) | 18 | 2 | 0.075 | 67 | 0.260 | −2 |
| Ovalbumin B serpins (gene X, gene Y, and ovalbumin) | 73 | 3 | 0.141 | 107 | 0.227 | −2 |
| Ovalbumin B serpins (MENT, serpinb10) | 24 | 2 | 0.183 | 203 | 0.094 | −2 |
| Hemoglobin β-globin (βH, ρ, ε) | 69 | 3 | 0.075 | 122 | 0.086 | −2 |
| Olfactory receptors (orthologous to OR5U1 and OR5BF1 in Homo) | 511 | 196 | 0.205 | 281 | 0.320 | −2^b |
| Olfactory receptors (related to COR7 in Gallus) | 12 | 3 | 0.014 | 15 | 0.394 | −2 |
| Olfactory receptors (related to COR1–6 in Gallus) | 60 | 4 | 0.275 | 238 | 0.084 | 2 |
| Olfactory receptors (small cluster on chromosome 1 in Gallus) | 207 | 3 | 0.049 | 30 | 0.421 | −2 |
| β-keratin | 117 | 117 | 0.635 | NA^c | NA | NA |

a Number of amniote paralogs is defined as the number of genes across all species in the amniote tree (25 mammals and chicken; see Ensembl for exact species). Mean branch length is the average path length (in substitutions/site) between chicken paralogs (tips) and their common ancestor (the first chicken duplication). Across all families this length is on average 0.235 substitution/site. Estimated duplication time is the time of first chicken duplication in millions of years. Sequence divergence before the duplications in chicken is given in substitutions/site between the root of the amniote tree (Gallus/Homo common ancestor) and the first chicken duplication. λ is a molecular-rate variation parameter (low values are highly variable rates; high values are clocklike).
b This λ is calculated from Homo and Gallus for computational limitations due to large family size.
c NA, cannot be computed because of chicken-only expansions.


domain that triggers an inflammatory response. These authors discovered that while chickens and humans both have 10 receptors, only four genes in chicken maintain one-to-one orthology with mammalian genes; much gain and loss has occurred in every lineage. In the chicken, their study suggested that a duplication event estimated at 66 million years ago (Mya) gave rise to TLR2A and TLR2B, orthologs of the single TLR2 in mammals. Three other genes, two that duplicated in tandem (TLR1LA and TLR1LB, estimated duplication time 147 Mya), as well as TLR15, have no mammalian counterpart. Other mammalian members have been pseudogenized or fully lost in chicken. Using our automated phylogenetic approach, we are able to analyze one of these chicken-specific expansions: the duplication event that gave rise to TLR2A and TLR2B in chickens. We obtained nearly the same estimate of duplication time (67 Mya) as did Temperley et al. (2008). As with all of the families that we examined in detail, there was considerable rate variation among gene lineages (log λ = −2). The sequence divergence in the Toll-like receptor family is greater than the sequence divergence found for 50.8% of other chicken gene families. We could not analyze the other duplication in chicken, as our data set from Ensembl was missing one of the chicken-specific genes (TLR1LB). TLR15, another gene unique to birds, had no mammalian ortholog, so it also was not included in our amniote gene tree.

4.2 Ovalbumin-Related Serpins
Another gene family with documented chicken-specific expansions is the ov-serpin family, also called ovalbumin-related serpins or clade B serpins. Benarafa and Remold-O’Donnell (2005) examine the phylogenetic relationship between the chicken members (some of which function as egg-white storage proteins) and their mammalian counterparts (involved in diverse roles such as embryogenesis, inflammation regulation, and angiogenesis).
The initial duplication is thought to have occurred very early in the vertebrate lineage; and, like TLRs, the family is also marked by recent lineage-specific expansions and losses. Chickens have 10 members and humans have 13 members. Three genes in chicken (ovalbumin and ovalbumin-like genes X and Y) are paralogs and lack a human ortholog. Another gene, with a single human ortholog, seems to have duplicated to produce the chicken genes Serpinb10 and MENT (mature erythrocyte nuclear termination state-specific protein). The remaining family members from chicken each have single human orthologs. Among the two subfamilies with chicken-specific expansions, rate variation is substantial (log λ = −2). Moreover, the sequence divergence in the subfamily containing ovalbumin and ovalbumin-like genes X and Y is greater than the sequence divergence found for 69.6% of other chicken gene families that have duplicated since the chicken–mammalian split. The sequence divergence in the other ovalbumin subfamily (serpinb10 and MENT) is greater than the sequence divergence found for 77.6% of other chicken gene families that have duplicated since the chicken–mammalian split.

4.3 Hemoglobin
Metabolic rate is an important trait that governs many organismal characters, from growth strategies to sustained physical activity. In amniotes, an elevated metabolism (endothermy) has only evolved within two extant groups, birds and mammals, although paleontologists suspect that many extinct dinosaurian lineages possessed endothermy (de Ricqlès et al., 2001; Horner et al., 2001). Whereas the typical mammalian and avian condition is homeothermy (roughly constant body temperature), some birds, such as swifts, hummingbirds, and nightjars, are facultatively poikilothermic, a condition in which their usually elevated body temperature can vary over a wider range than that seen in mammals (e.g., Lane et al., 2004). The hemoglobin multigene family is closely associated with metabolism and the respiratory system. Hemoglobin, a multidomain protein, has rapidly diversified within vertebrate lineages (Gribaldo et al., 2003; Cooper et al., 2006; Opazo et al., 2008; Alev et al., 2009). For example, α-globin underwent rapid duplication and deletion in mammals (Hoffmann et al., 2008). Based on an analysis of the platypus genome, which incorporated information from flanking loci, a recent model (Patel et al., 2008) proposes that the β-globin paralogs arose from a single transposition in the amniote ancestor followed by independent duplication in birds and mammals. From our data set, the β-globin paralogs βH, ρ, and ε appear as a chicken-specific expansion, consistent with this model. Rate diversity is high in this family as well, and sequence evolution (point mutations) in β-globin paralogs is similar to that seen in the Toll-like receptor family (the sequence divergence in the β-globin family is greater than the sequence divergence found for 50.8% of other chicken gene families).

4.4 Olfactory Receptors
Olfaction has recently gained much recognition as an important sensory modality for birds (Nevitt et al., 2008; O’Dwyer et al., 2008; Steiger et al., 2008, 2009, 2010; Warren et al., 2010). Historically, birds were assumed to communicate primarily via the visual or auditory systems, but behavioral and genomic data suggest that chemosensory perception plays a larger role.
The chicken genome paper remarked on the surprisingly large group of avian-specific olfactory receptors (218 genes were identified), whereas Steiger et al. (2009) found 479 genes, including 111 pseudogenes. Our bioinformatic approach detected 196 genes belonging to this subfamily; the discrepancy is perhaps in part due to different genome builds, but also no doubt to our fully automated approach involving no manual inspection. As in other recent work (Lagerström et al., 2006), we also identified other families of olfactory receptors with small expansions in chicken (three to four genes in our analysis). These include a cluster associated with the previously identified COR1–6 genes (chicken olfactory receptor genes) on chromosome 5, a second cluster on chromosome 10 related to COR7 (Figure 2), and a third cluster on chromosome 1. Interestingly, the subfamilies have very different extents of sequence divergence, ranging from greater than that of 26.7% of other chicken gene families (family containing COR7) to greater than that of 88.4% (family containing COR1–6). The latter gene family also had a more clocklike rate than other families, which together with the large sequence divergence suggests that it is among the oldest gene duplications in chicken.

4.5 Keratin
The evolution of feathers in theropod dinosaurs was a major innovation that probably provided insulation for metabolically active animals, ornamentation for display, and, in one lineage, transformed arms into wings (Ji et al., 1998; Zhang and Zhou, 2000; Currie and Chen, 2001; Norell et al., 2002; Sawyer and Knapp, 2003). β-Keratins differ from the keratins found in nonavian reptiles and are the basic structural elements of feathers and therefore a gene family vital for the success of birds. In the publication of the first genome draft, the International Chicken Genome Sequencing Consortium (2004) noted the large expansion of the avian-specific keratin gene family, estimated at around 150 members. This avian keratin family, which encodes proteins forming feathers and scales (Sawyer et al., 2000), is functionally and evolutionarily distinct from the mammalian hair-specific α-keratin, but recent work suggests that chickens possess α-keratin genes and that these genes are expressed in avian digits (Eckhart et al., 2008). Other components of hair, the keratin-associated proteins, have no members in the chicken genome (Wu et al., 2008). Within nonavian reptiles, β-keratins probably duplicated by retrotransposition, resulting in the loss of introns in some paralogs; in birds, all paralogs have lost introns. Unequal crossover is also thought to expand and contract keratin gene arrays in birds, resulting in a tandem organization of multiple paralogs (Toni et al., 2007). We find that the amount of sequence evolution in keratin is among the highest for any chicken gene family (greater than the sequence divergence found for 98% of other chicken gene families). This high divergence is likely a consequence of the absence of comparisons with other reptiles, but it could also be due to the adaptive significance of these proteins within birds.
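The family-by-family comparisons in this section all take the form "greater than the sequence divergence found for X% of other chicken gene families." A minimal sketch of that ranking follows; the function name is hypothetical, and the nine values are only the mean branch lengths listed in Table 1, used as a stand-in for the genome-wide set. The percentages it prints therefore will not match the genome-wide figures quoted in the text, although the ordering of families is the same.

```python
def percent_exceeded(value, all_values):
    """Percentage of the other families whose divergence is smaller than `value`."""
    others = list(all_values)
    others.remove(value)  # drop one copy: the focal family itself
    if not others:
        return 0.0
    return 100.0 * sum(v < value for v in others) / len(others)


# Mean branch lengths (substitutions/site) from Table 1, in table order.
mean_branch_lengths = [0.075, 0.141, 0.183, 0.075, 0.205, 0.014, 0.275, 0.049, 0.635]

keratin_rank = percent_exceeded(0.635, mean_branch_lengths)
cor7_rank = percent_exceeded(0.014, mean_branch_lengths)
tlr_rank = percent_exceeded(0.075, mean_branch_lengths)
print(keratin_rank, cor7_rank, tlr_rank)
```

Among these nine families, β-keratin exceeds all the others and the COR7-related olfactory receptors exceed none. Ties (here the Toll-like receptors and β-globin, both 0.075) are not counted as smaller, which is one of several reasonable conventions.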

5 PROSPECTS AND CONCLUSIONS

Reptilia, including birds and nonavian reptiles, is the sister group of mammals and as such holds an important phylogenetic position for shedding light on patterns of gene duplication in amniotes (Wang et al., 2006). Reptiles are arguably more diverse than mammals in many traits; with about 17,000 species (about 10,000 in birds and 7000 nonavian reptiles) they are substantially more species-rich than mammals (about 5000 species) and possess a greater diversity of sex chromosome and sex determination systems (Organ and Janes, 2008). The chicken is currently the sole member of Reptilia with a draft genome and as such provides the only point of comparison of genome dynamics between mammals and their sister group. A greater understanding of genome and multigene family dynamics in mammals will undoubtedly require greater genome sampling and characterization in Reptilia. Gene duplication and the families they produce are vital for generating the thread with which evolution weaves new adaptations and species. We have developed a pipeline for phylogenomic analysis of gene duplication in the chicken lineage, but our approach can be applied easily to any particular clade of interest. Our approach rests on the assumption that gene orthology and paralogy are best identified through phylogenetic analysis, and we delimit chicken-specific gene duplications by an approach (Figure 1) that combines initial identification and collection of gene copies across many vertebrates that show significant sequence similarity in Ensembl, followed by phylogenetic analysis of these gene sets; identification of particular nodes in these gene trees that correspond to gene duplications, in our case the mammal–bird divergence; identification of those gene clusters that diversify from these particular nodes; and statistical analysis of the gene trees collected. 
Many of the duplications we have identified here as “chicken-specific” in fact will be found to have duplicated in ancestors of the chicken, since orthologs of many chicken genes will no doubt be discovered in other reptile genomes as they emerge. Nonetheless, using an approximate time scale (Figure 5) we can estimate which chicken gene paralogs might be found in upcoming reptilian genome projects based on their estimated timing of duplication relative to the divergence times of species whose genomes are being compared. Our approach has the advantage of providing an objective means of identifying chicken-specific gene duplications, but of course when conducted on a genome-wide scale, it will miss some gene family members that manual curation will identify; we have illustrated this with some specific examples (Table 1). The loss of detail for some gene families is offset by the ability to study genome-wide distributions of multigene family dynamics; both approaches are required to provide an informed view of the dynamics of multigene family evolution in birds and relatives.
Phylogenomic approaches such as those presented here have only just begun to provide a window into the dynamics and importance of gene duplication within organisms. For example, nonprotein coding RNA paralogs are dispersed throughout the chicken genome; this, along with an unusual paucity of nonprotein coding RNA pseudogenes, suggests that they may not undergo the same processes of duplication (unequal crossover and retrotransposition) that characterize protein coding genes (Hillier et al., 2004). Currently available data are insufficient to address this and other hypotheses, because as of the time of this writing the genome of only one reptile species has been sequenced. But progress is quickly being made with the publication of the zebra finch (Taeniopygia guttata) genome (Warren et al., 2010) and the release of the anole lizard (Anolis carolinensis) genome. An increase in the number of genomes will permit more detailed quantitative comparison of the evolutionary dynamics of gene duplication in amniotes and other lineages and will help clarify the role of these gene duplications in organismal diversification.

REFERENCES

Alev C, et al. 2009. Genomic organization of zebra finch alpha and beta globin genes and their expression in primitive and definitive blood in comparison with globins in chicken. Dev Genes Evol 219:353–360.
Benarafa C, Remold-O’Donnell E. 2005. The ovalbumin serpins revisited: perspective from the chicken genome of clade B serpin evolution in vertebrates. Proc Natl Acad Sci USA 102(32):11367–11372.
Benton MJ, Donoghue PCJ. 2007. Paleontological evidence to date the tree of life. Mol Biol Evol 24(1):26–53.
Burri R, et al. 2008. Evolutionary patterns of MHC class II B in owls and their implications for the understanding of avian MHC evolution. Mol Biol Evol 25:1180–1191.
Cooper SJB, et al. 2006. The mammalian alpha(D)-globin gene lineage and a new model for the molecular evolution of alpha-globin gene clusters at the stem of the mammalian radiation. Mol Phyl Evol 38:439–448.
Currie PJ, Chen P-J. 2001. Anatomy of Sinosauropteryx prima from Liaoning, northeastern China. Can J Earth Sci 38(12):1705–1727.
de Ricqlès AJ, et al. 2001. The bone histology of basal birds in phylogenetic and ontogenetic perspectives. In Gauthier J, Gall LF (eds.), New Perspectives on the Origin and Early Evolution of Birds: Proceedings of the International Symposium in Honor of John H. Ostrom. New Haven, CT: Peabody Museum of Natural History, pp. 411–426.
Eckhart L, et al. 2008. Identification of reptilian genes encoding hair keratin-like proteins suggests a new scenario for the evolutionary origin of hair. Proc Natl Acad Sci 105:18419–18423.

Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797.
Eirin-Lopez JM, et al. 2004. Birth-and-death evolution with strong purifying selection in the histone H1 multigene family and the origin of orphon H1 genes. Mol Biol Evol 21(10):1992–2003.
Eisen JA. 1998. Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res 8:163–167.
Furlong RF. 2005. Insights into vertebrate evolution from the chicken genome sequence. Genome Biol 6(2).
Gao L-Z, Innan H. 2004. Very low gene duplication rate in the yeast genome. Science 306:1367–1370.
Gregory TR. 2002. A bird’s-eye view of the C-value enigma: genome size, cell size, and metabolic rate in the class Aves. Evolution 56(1):121–130.
Gribaldo S, et al. 2003. Functional divergence prediction from evolutionary analysis: a case study of vertebrate hemoglobin. Mol Biol Evol 20(11):1754–1759.
Gu Z, et al. 2002. Rapid divergence in expression between duplicate genes inferred from microarray data. Trends Genet 18(12):609–613.
Guindon S, Gascuel O. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52(5):696–704.
Haas NB, et al. 2001. Subfamilies of CR1 non-LTR retrotransposons have different 5′ UTR sequences but are otherwise conserved. Gene 265:175–183.
Heger A, Ponting CP. 2007. Evolutionary rate analyses of orthologs and paralogs from 12 Drosophila genomes. Genome Res 17(12):1837–1849.
Hess CM, Edwards SV. 2002. The evolution of the major histocompatibility complex in birds. Bioscience 52(5):423–431.
Hillier LW, et al. 2004. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432(7018):695–716.
Hittinger CT, Carroll SB. 2007. Gene duplication and the adaptive evolution of a classic genetic switch. Nature 449:677–681.
Hoffmann FG, et al. 2008. Rapid rates of lineage-specific gene duplication and deletion in the alpha-globin gene family. Mol Biol Evol 25(3):591–602.
Horner JR, et al. 2001. Comparative osteology of some embryonic and perinatal archosaurs: developmental and behavioral implications for dinosaurs. Paleobiology 27(1):39–58.
Hubbard TJP, et al. 2007. Ensembl 2007. Nucleic Acids Res 35(database issue):D610–D617.
Huerta-Cepas J, et al. 2007. The human phylome. Genome Biol 8:R109.
Hughes T, Liberles DA. 2008. The power-law distribution of gene family size is driven by the pseudogenisation rate’s heterogeneity between gene families. Gene 414(1–2):85–94.
Huynen MA, van Nimwegen E. 1998. The frequency distribution of gene family sizes in complete genomes. Mol Biol Evol 15(5):583–589.
Ji Q, et al. 1998. Two feathered dinosaurs from northeastern China. Nature 393:753–761.
Kondrashov FA, et al. 2002. Selection in the evolution of gene duplications. Genome Biol 3(2):1–9.
Lagerström MC, et al. 2006. The G protein–coupled receptor subset of the chicken genome. PLoS Comput Biol 2(6):493–507.
Lane JE, et al. 2004. Daily torpor in free-ranging whip-poor-wills (Caprimulgus vociferus). Physiol Biochem Zool 77:297–304.
Li W-H. 2006. Molecular Evolution. Sunderland, MA: Sinauer Associates.

Li H, et al. 2006. TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res 34:D572–D580.
Lynch M, Conery JS. 2000. The evolutionary fate and consequences of duplicate genes. Science 290:1151–1155.
Lynch M, Force A. 2000. The probability of duplicate-gene preservation by subfunctionalization. Genetics 154:459–473.
Maniatis T, Tasic B. 2002. Alternative pre-mRNA splicing and proteome expansion in metazoans. Nature 418(6894):236–243.
Nei M, Gu X, Sitnikova T. 1997. Evolution by the birth-and-death process in multigene families of the vertebrate immune system. Proc Natl Acad Sci 94(15):7799–7806.
Nevitt GA, et al. 2008. Evidence for olfactory search in wandering albatross, Diomedea exulans. Proc Natl Acad Sci 105(12):4576–4581.
Norell MA, et al. 2002. “Modern” feathers on a non-avian dinosaur. Nature 416:36–37.
Nowak MA, et al. 1997. Evolution of genetic redundancy. Nature 388(6638):167–171.
O’Dwyer TW, et al. 2008. Examining the development of individual recognition in a burrow-nesting procellariiform, the Leach’s storm-petrel. J Exp Biol 211(3):337–340.
Ohno S. 1970. Evolution by Gene Duplication. New York: Springer-Verlag.
Opazo JC, Hoffmann FG, Storz JF. 2008. Differential loss of embryonic globin genes during the radiation of placental mammals. Proc Natl Acad Sci 105:12950–12955.
Organ CL, Janes DE. 2008. Evolution of sex chromosomes in Sauropsida. Integrative Comparative Biol 48(4):512–519.
Osada N, Innan H. 2008. Duplication and gene conversion in the Drosophila melanogaster genome. PLoS Genet 4(12):e1000305.
Ota T, Nei M. 1994. Divergent evolution and evolution by the birth-and-death process in the immunoglobulin V-H gene family. Mol Biol Evol 11(3):469–482.
Patel VS, et al. 2008. Platypus globin genes and flanking loci suggest a new insertional model for beta-globin evolution in birds and mammals. BMC Biol 6(34).
Piontkivska H, Nei M. 2003. Birth-and-death evolution in primate MHC class I genes: divergence time estimates. Mol Biol Evol 20(4):601–609.
Piontkivska H, et al. 2002. Purifying selection and birth-and-death evolution in the histone H4 gene family. Mol Biol Evol 19(5):689–697.
Rasmussen MD, Kellis M. 2007. Accurate gene-tree reconstruction by learning gene- and species-specific substitution rates across multiple complete genomes. Genome Res 17(12):1932–1942.
Sanderson MJ. 2003. r8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock. Bioinformatics 19(2):301–302.
Sarrias MR, et al. 2004. The scavenger receptor cysteine-rich (SRCR) domain: an ancient and highly conserved protein module of the innate immune system. Crit Rev Immunol 24(1):1–37.
Sawyer RH, Knapp LW. 2003. Avian skin development and the evolutionary origin of feathers. J Exp Zool B 298B(1):57–72.
Sawyer RH, et al. 2000. The expression of beta (β) keratins in the epidermal appendages of reptiles and birds. Am Zool 40(4):530–539.
Shedlock AM. 2006. Phylogenomic investigation of CR1 LINE diversity in reptiles. Syst Biol 55(6):902–911.
Shedlock AM, et al. 2007. Phylogenomics of non-avian reptiles and the structure of the ancestral amniote genome. Proc Natl Acad Sci 104:2767–2772.


Steiger SS, et al. 2008. Avian olfactory receptor gene repertoires: evidence for a well-developed sense of smell in birds? Proc R Soc B 275(1649):2309–2317.
Steiger SS, et al. 2009. A comparison of reptilian and avian olfactory receptor gene repertoires: species-specific expansion of group gamma genes in birds. BMC Genomics 10:446.
Steiger SS, et al. 2010. Evidence for adaptive evolution of olfactory receptor genes in 9 bird species. J Hered 101(3):325–333.
Storm CEV, Sonnhammer ELL. 2003. Comprehensive analysis of orthologous protein domains using the HOPS database. Genome Res 13:2353–2362.
Temperley ND, et al. 2008. Evolution of the chicken Toll-like receptor gene family: a story of gene gain and gene loss. BMC Genomics 9:62.
Toni M, et al. 2007. Hard (beta-) keratins in the epidermis of reptiles: composition, sequence, and molecular organization. J Proteome Res 6(9):3377–3392.
Wang Z, et al. 2006. Tuatara (Sphenodon) genomics: BAC library construction, sequence survey, and application to the DMRT gene family. J Hered 97(6):541–548.
Wapinski I, et al. 2007. Natural history and evolutionary principles of gene duplication in fungi. Nature 449:54–61.
Warren C, et al. 2010. The genome of a songbird. Nature 464:757–762.
Wu DD, et al. 2008. Molecular evolution of the keratin associated protein gene family in mammals, role in the evolution of mammalian hair. BMC Evol Biol 8:241.
Yuri T, et al. 2008. Duplication of accelerated evolution and growth hormone gene in passerine birds. Mol Biol Evol 25(2):352–361.
Zelano B, Edwards SV. 2002. An Mhc component to kin recognition and mate choice in birds: predictions, progress, and prospects. Am Nat 160:S225–S237.
Zhang J. 2003. Evolution by gene duplication: an update. Trends Genet 18(6):292–298.
Zhang F, Zhou Z. 2000. A primitive enantiornithine bird and the origin of feathers. Science 290(5498):1955–1959.
Zhang J, et al. 1998. Positive Darwinian selection after gene duplication in primate ribonuclease genes. Proc Natl Acad Sci USA 98:3708–3713.
Zmasek CM, Eddy SR. 2002. RIO: analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinf 3:14.

15

Gene and Genome Duplications in Plants

PAMELA S. SOLTIS Florida Museum of Natural History, University of Florida, Gainesville, Florida

J. GORDON BURLEIGH Department of Biology, University of Florida, Gainesville, Florida

ANDRE S. CHANDERBALI and MI-JEONG YOO Florida Museum of Natural History, University of Florida, Gainesville, Florida; Department of Biology, University of Florida, Gainesville, Florida

DOUGLAS E. SOLTIS Department of Biology, University of Florida, Gainesville, Florida

Evolution After Gene Duplication, Edited by Katharina Dittmar and David Liberles. Copyright © 2010 Wiley-Blackwell

1 INTRODUCTION

Many plants have large and complex genomes, with genome size varying 1000-fold (Bennett and Leitch, 2004). Although large genome size in conifers has been attributed to expansion of retrotransposons, the large size and complexity of most plant genomes appear to be due to gene duplication, both through expansion of gene families [e.g., SKP1, which shows extensive gene birth and death (Kong et al., 2004); MADS-box genes, which are more numerous in plants than in other eukaryotes (Becker and Theissen, 2003)] and through whole-genome duplication (polyploidy; e.g., Vision et al., 2000). Current research is exploring the role of gene and genome duplication in genetic interactions, floral development, morphological diversification, speciation, and adaptation. These once disparate areas have recently been unified by the availability of genomic data and by conceptual and informatic developments that allow genomic methods to be applied to nonmodel plants. The result is a paradigm shift in our understanding of the importance and pervasiveness of gene and genome duplication.

Gene duplication has long been recognized as the ultimate source for evolutionary change (Ohno, 1970), and recent theoretical developments have clarified the possible fates of duplicate genes (Lynch and Conery, 2000). Both members of a duplicate gene pair may be maintained, retaining their original function, or they may diverge. Divergence may follow any of the following paths: (1) retention of one copy and loss/silencing of the other, leading to a pseudogene; (2) retention of the ancestral function by one copy and acquisition of a new function by the other (neofunctionalization; Lynch and Conery, 2000); or (3) partitioning of the original function or domain of expression between the two copies (subfunctionalization; Lynch and Conery, 2000; Lynch and Force, 2000; e.g., if the ancestral copy had functioned in both flowers and leaves, one copy of the duplicate pair might function in the flower and the other might function in the leaves). Clarification of hypotheses of gene fate has helped to stimulate new areas of research in gene expression and gene function (e.g., Adams et al., 2003, 2004; Adams and Wendel, 2005a).

But how do gene duplications arise? What is the ultimate source of novel genetic material? Some plants, such as Clarkia (Onagraceae, evening primrose family), are prone to extensive chromosomal rearrangements (Lewis and Lewis, 1955), which have generated dozens of gene duplications (e.g., Gottlieb, 1974, 1977; Soltis et al., 1987). Such duplicate gene pairs are typically unlinked because they arose via translocation of one chromosomal segment onto another chromosome. Linked duplicates may arise via tandem duplication; the evolutionary dynamics of linked duplicate genes would be expected to differ from those involving unlinked genes. Perhaps the greatest source of duplicate genes in plants is whole-genome duplication (WGD; i.e., polyploidy). Evolutionary dynamics of genes duplicated via polyploidy are also likely to differ from those accompanying either tandem duplications or small-scale chromosomal duplications, most importantly because the stoichiometry of duplicate genes is maintained following WGD, allowing for divergence and modification not only of single genes but, conceivably, of gene networks. Polyploidy has long been considered an important mechanism of speciation and of genomic change (e.g., Darlington, 1937; Clausen et al., 1945; Stebbins, 1950; Grant, 1981), but its fundamental role in genome evolution, not only in plants but also in other eukaryotes such as yeast (Kellis et al., 2004) and vertebrates [reviewed by Furlong and Holland (2004)], is only beginning to be appreciated.

In this chapter we provide an overview of WGD in plants: How prevalent is it? How have our views changed with the acquisition of genomic data? How can it be detected? We follow these questions with an exploration of the consequences of gene duplication (regardless of its origin), with regard to (1) retention or loss of duplicate genes and (2) changes in expression or function. We address particularly the role of duplicate genes in development, with an eye to possible shifts in developmental programs and to morphological novelty associated with gene duplication. We conclude with a brief perspective on the relative contributions of duplicate gene retention, gene loss/silencing, neofunctionalization, and subfunctionalization in plant genomes. We focus on angiosperms because there are more data for this large clade than for any other, but we encourage the investigation of gene duplication and genome structure and evolution in nonangiosperms as well.

2 WHOLE-GENOME DUPLICATION (POLYPLOIDY) IN PLANTS

2.1 Traditional Views

It has long been recognized that polyploidy is an important evolutionary force in plants, particularly ferns and angiosperms. Polyploidy has been studied in plants for 100 years (Lutz, 1907; Gates, 1909; Kuwada, 1911), with early investigators of the topic now comprising a “who’s who” of prominent plant evolutionists and geneticists (e.g., Winge, 1917; Müntzing, 1936; Darlington, 1937; Clausen et al., 1945; Stebbins, 1947, 1950; Löve and Löve, 1949; Lewis, 1980a; Grant, 1981). Following Stebbins’s seminal (1940, 1947, 1950) overviews of polyploidy, considerable research effort was focused on polyploid complexes and groups of hybridizing species. (Allopolyploids are those formed via hybridization between two parental species, coupled with genome doubling; autopolyploids form from within a single species.) Hence, polyploidy represented a major portion of biosystematic research during this “classic period,” up until the late 1980s. At that time, the ease of application of DNA methods resulted in fewer and fewer studies of polyploidy, as plant systematists focused much more attention on phylogeny reconstruction [reviewed by Soltis et al. (2003)].

The considerable research conducted on polyploid systems during the classic period resulted in the establishment of what may now best be termed the traditional tenets of polyploid evolution. A prominent early view was that, although polyploidy was relatively common, the genetic fate of polyploid species was not promising, leading to the view of polyploids as evolutionary dead-ends. For example, the parental genomes were considered largely static following polyploidy. In addition, autopolyploids were considered rare and maladaptive, and each polyploid species, whether allo- or autopolyploid, was considered to have a single origin. All of these views have been turned on their heads: Today, polyploidy is not viewed as maladaptive, polyploid genomes are dynamic, autopolyploids are common and evolutionarily diverse, and nearly all polyploid species examined to date show evidence of recurrent formation.

There was also considerable effort during the classic period to provide estimates of the frequency of polyploidy in plants, particularly within angiosperms.
A number of prominent authors attempted this exercise by using published chromosome numbers and establishing hypotheses for the presumed cutoff between “diploid” and “polyploid” chromosome numbers. Thus, estimates varied depending on the base chromosome number cutoff used as well as on the taxa considered. For example, both Müntzing (1936) and Darlington (1937) suggested that about 50% of all angiosperm species were polyploid, while Stebbins (1950) later estimated the frequency of polyploidy in angiosperms at 30 to 35%. Using a cutoff point of n = 14, Grant (1963, 1981) inferred that 47% of all flowering plants were of polyploid origin and proposed that 58% of monocots and 43% of “dicots” (his usage) were polyploid. Using additional chromosome counts and the same methods and cutoff as Grant, Goldblatt (1980) subsequently recalculated the frequency of polyploidy in the monocots to be 55%. Goldblatt also suggested that Grant’s (1963) estimate was too conservative; he thought that taxa with chromosome numbers above n = 9 or 10 probably have polyploidy in their evolutionary history. Using these lower numbers, he calculated that at least 70%, and perhaps 80%, of monocots are of polyploid origin. Lewis (1980b) applied an approach similar to Goldblatt’s to dicots and estimated that 70 to 80% were polyploid.

More recently, Masterson (1994) used the novel approach of comparing leaf guard cell size in fossil and extant taxa from a few angiosperm families (Platanaceae, Lauraceae, Magnoliaceae) to estimate polyploid occurrence through time. Because guard cell size is often much larger in polyploids than in diploids, this provided a method for estimating whether the fossil taxa were diploid (smaller guard cells than extant taxa) or polyploid (the same or larger guard cell sizes vs. extant species). From these comparisons, Masterson (1994) estimated that 70% of all angiosperms had experienced one or more episodes of polyploidy in their ancestry.

2.2 Genetic and Genomic Approaches to Understanding Polyploidy

Genomic data have provided unprecedented new insights into the genetic and genomic consequences of genome doubling, dramatically changing our views of polyploid evolution and resulting in the formulation of a new polyploid paradigm. For example, genomic data have provided novel insights into the frequency and timing of ancient polyploid events in angiosperms. Genomic investigations reveal that flowering plants possess genomes with considerable gene redundancy; much of this redundancy may be the result of ancient episodes of polyploidy. Complete sequencing of the very small genome of Arabidopsis thaliana, long considered the archetype of diploidy, revealed numerous duplicate genes and suggested two or three rounds of WGD (Vision et al., 2000; Bowers et al., 2003). Because Arabidopsis has only five chromosomes, it was not previously classified as a polyploid using the common cutoff criteria noted above. However, genomic data clearly indicate a recent round of duplication, perhaps during the early evolution of the Brassicaceae, with an earlier round of duplication that occurred deeper in Brassicales, and a third event that may coincide with the early diversification of eudicot angiosperms [reviewed by D.E. Soltis et al. (2009)]. Significantly, all other angiosperms whose entire nuclear genomes have been sequenced show evidence of WGD events: Oryza (Paterson et al., 2004), Populus (Tuskan et al., 2006), Vitis (Jaillon et al., 2007; Velasco et al., 2007), and Carica (Ming et al., 2008). These genomic investigations provide evidence for a number of phylogenetically important ancient genome doubling events, including a proposed paleohexaploid event that may have occurred close to the origin of the eudicots [reviewed by D.E. Soltis et al. (2009)].
ESTs (expressed sequence tags), now available for many angiosperm species, are another major source of genomic data that can be used to infer ancient polyploidy. The thousands of ESTs available provide a useful genomic “snapshot,” permitting determination of ancient genome duplication events as well as a rough approximation of the timing of those events. Lynch and Conery (2000) developed a method that evaluates the frequency distribution of per-site synonymous divergence levels (Ks) for pairs of duplicate genes (see below). A genomewide duplication event results in thousands of paralogous pairs that are all duplicated simultaneously. Evidence of past genome duplications can be seen as peaks in the distribution of Ks values for sampled paralogous pairs (Lynch and Conery, 2000). Importantly, this method does not require information on the position of genes within the genome and can therefore be applied to any species for which there are moderate-to-large EST sets.

When the Ks approach was applied to ESTs from diverse angiosperms, most species showed evidence of ancient polyploidy, and sometimes there was evidence for multiple events. For example, Blanc and Wolfe (2004) used Ks values and found evidence of ancient polyploidy in Zea (maize), Glycine (soybean), Gossypium (cotton), and Solanum (tomato and potato). Similarly, Schlueter et al. (2004) found evidence in eight major crop species, including Glycine, Medicago (alfalfa), Solanum, Zea, Sorghum, Oryza, and Hordeum (barley), and inferred multiple independent genome duplications in Fabaceae (legumes), Solanaceae (potatoes and tomatoes), and Poaceae (grasses). A recent survey of ESTs from Asteraceae suggests that members of this family are also ancient polyploids (Barker et al., 2008).
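The peak-finding logic behind a Ks distribution can be illustrated with a minimal Python sketch. It is not the procedure of Lynch and Conery (2000) or of any study cited here; the function names (ks_histogram, find_ks_peaks), the bin width, the min_rise threshold, and every parameter of the synthetic data are assumptions chosen only for illustration. Background duplicates are given exponentially distributed Ks values, mimicking steady gene birth and death, and a hypothetical WGD adds a cluster of pairs centered at Ks = 0.75. Evenly spaced quantiles replace random draws so that the result is reproducible.

```python
import math
from statistics import NormalDist


def ks_histogram(ks_values, bin_width=0.1, max_ks=2.0):
    """Count paralog pairs per Ks bin; values outside [0, max_ks) are ignored."""
    n_bins = int(round(max_ks / bin_width))
    counts = [0] * n_bins
    for ks in ks_values:
        if 0 <= ks < max_ks:
            counts[min(int(ks / bin_width), n_bins - 1)] += 1
    return counts


def find_ks_peaks(counts, min_rise):
    """Return indices of bins that are local maxima and rise at least
    min_rise above the lowest younger (smaller-Ks) bin."""
    peaks = []
    for i in range(1, len(counts) - 1):
        is_local_max = counts[i] > counts[i - 1] and counts[i] >= counts[i + 1]
        if is_local_max and counts[i] - min(counts[:i]) >= min_rise:
            peaks.append(i)
    return peaks


def background_ks(n, rate):
    """Hypothetical background: n Ks values at evenly spaced quantiles of an
    exponential distribution (steady duplication and loss)."""
    return [-math.log(1 - (i + 0.5) / n) / rate for i in range(n)]


def wgd_ks(n, mean, sd):
    """Hypothetical WGD: n Ks values at evenly spaced quantiles of a normal
    distribution centered on the age of the duplication event."""
    dist = NormalDist(mean, sd)
    return [dist.inv_cdf((i + 0.5) / n) for i in range(n)]


# Invented example: 2000 background pairs plus 600 pairs from one WGD.
ks_values = background_ks(2000, rate=2.0) + wgd_ks(600, mean=0.75, sd=0.1)
counts = ks_histogram(ks_values)
peaks = find_ks_peaks(counts, min_rise=50)

for i in peaks:
    print(f"candidate WGD peak near Ks = {(i + 0.5) * 0.1:.2f} ({counts[i]} pairs)")
```

With these invented parameters the only bin reported is the one covering Ks = 0.7 to 0.8, whereas the background alone yields no peak. Real EST-derived Ks distributions are far noisier than this synthetic case and would call for a statistical treatment rather than a fixed threshold.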


Cui et al. (2006) applied the Ks approach to ESTs from several basal angiosperms and found evidence for episodes of ancient polyploidy in Nuphar advena (Nymphaeaceae; water lilies), the sister to all other living angiosperms following Amborella (Soltis et al., 2005). They also found the signature of ancient polyploidy in two members of the magnoliid clade [Persea americana (avocado; Lauraceae) and Liriodendron tulipifera (Magnoliaceae)] and in Saruma henryi (Aristolochiaceae). In addition, Cui et al. (2006) detected WGDs in the basal eudicot Eschscholzia californica (California poppy; Papaveraceae) and the basal monocot Acorus americanus (Acoraceae). In fact, several genomewide duplication events appear to have occurred in Nuphar (Cui et al., 2006). One of these events appears to be restricted to Nymphaeaceae, but an older event is evident, and this may date to the common ancestor of all angiosperms except Amborella. Significantly, however, Amborella lacks evidence of ancient polyploidy, despite the use of very large EST data sets (D.E. Soltis et al., 2009). The analysis of Ks values by Cui et al. (2006) also provided weak and inconclusive evidence of a still older polyploidization in Persea that may correspond to the old event suggested for the common ancestor of all angiosperms except Amborella. Alternatively, it could perhaps even predate the angiosperms, but testing these hypotheses will require comprehensive transcriptome sequencing for additional basal angiosperms and a complete Amborella genome sequence (Soltis et al., 2008).

Investigation of other taxa using other genetic methods suggests still more ancient polyploidy events.
For example, “diploid” members of Brassica are, at the least, ancient tetraploids (Kowalski, 1994; Lan et al., 2000; Quiros, 2001) and perhaps ancient hexaploids based on analyses of linkage maps: a number of genes (and blocks of genes) are clearly represented multiple times (e.g., Lagercrantz and Lydiate, 1996; Lukens et al., 2004). Genomic data provide evidence for other lineage-specific duplications: one within the legumes (Schlueter et al., 2004; Pfeil et al., 2005; Cannon et al., 2006) and another occurring in Capparaceae (Schranz and Mitchell-Olds, 2006).

Genomic studies now raise the question: Are there really any true diploids (D.E. Soltis et al., 2009a)? Furthermore, the classic question of what percentage of angiosperms is of polyploid origin now appears moot. The evidence strongly suggests that all angiosperms may be ancient polyploids. The major question is no longer “How many angiosperms are polyploid?” but rather, “How many episodes of genome duplication have various angiosperm lineages experienced?” Finally, the role that polyploidy is envisioned to play in angiosperm (and plant) evolution has changed dramatically over the past 15 to 20 years. For example, in a recent study, Fawcett et al. (2009) estimated that ancient polyploid events occurred at the same time (about 65 million years ago) in several diverse angiosperm lineages, suggesting the possibility of a shared causal factor. Interestingly, this estimate corresponds with the Cretaceous–Tertiary (K-T) boundary. Hence, the authors propose that genome doubling was a catalyst for the survival and/or diversification of these lineages through the extinction event associated with the K-T boundary.
Similarly, the correspondence of ancient polyploid events to the origin of many species-rich plant clades, including Fabaceae, Asteraceae, eudicots, monocots, and even angiosperms as a whole, has also prompted speculation about the role of polyploidy in stimulating major bursts of plant diversification (D.E. Soltis et al., 2009a). In the light of such speculation, it is worthwhile recalling that only 25 years ago, polyploids were commonly viewed as “evolutionary dead-ends” [reviewed by Soltis and Soltis (1993–2000)].


2.3 Detecting Ancient Whole-Genome Duplications

Analyses of large-scale genomic data have provided evidence of many previously undetected, ancient whole-genome duplication (WGD) events (e.g., Blanc and Wolfe, 2004; Schlueter et al., 2004; Cui et al., 2006); however, there is little consensus on the number and timing of ancient WGDs in plants. The task of identifying and placing ancient WGDs is greatly complicated by the diploidization process following a WGD, during which rapid gene loss and chromosomal rearrangements obscure, if not erase, evidence of the WGD. Still, numerous promising approaches have been developed to identify the remaining signals of ancient WGDs from different types of genomic data, including gene maps, pairs of duplicated genes, and gene trees. Although these methods are not designed specifically for plants, the frequency of polyploidy in plants and the wealth of genomic data make plants among the most useful systems for testing methods to detect ancient WGDs.

The presence of large, syntenic (duplicated) blocks within a genome provided the first evidence of ancient WGDs in Arabidopsis and Brassicaceae (e.g., Vision et al., 2000; Simillion et al., 2002; Blanc et al., 2003; Bowers et al., 2003; Schranz and Mitchell-Olds, 2006), rice (e.g., Guyot and Keller, 2004; Paterson et al., 2004; Wang et al., 2005; Yu et al., 2005), legumes (Shoemaker et al., 1996; Cannon et al., 2006), poplar (Tuskan et al., 2006), and Vitis (Jaillon et al., 2007; Velasco et al., 2007). Although these duplicated chromosomal segments may provide direct and unambiguous evidence of a past WGD, in practice, rapid gene losses and rearrangements after polyploidy can make it extremely difficult to detect such duplications (Lynch and Conery, 2000; Eckhardt, 2001; Simillion et al., 2002).
Different methods of detecting duplicated blocks and use of different criteria for defining a syntenic block can greatly affect interpretations of the history of large-scale duplications (see Durand and Hoberman, 2006). For example, the gene-order data in rice have been interpreted as ancient aneuploidy (Vandepoele et al., 2003) and as ancient polyploidy (e.g., Yu et al., 2005). Once duplicated chromosomal blocks have been identified, the timing of the WGD event(s) that created the duplication can be estimated using the sequence divergence of paralogous genes on each block, usually based on silent (synonymous) substitution rates. Variation in rates of molecular evolution among both genes and taxa can also make it extremely difficult to date the corresponding WGD events accurately or precisely. However, perhaps the greatest current limitation to identifying WGDs from genetic map data in plants is the lack of such data across a phylogenetically diverse sampling of taxa.

Still, even without evidence of duplicated blocks of genes, or any genetic map data at all, it is possible to detect ancient polyploidy based on the age distributions of pairs of paralogous genes throughout a genome (e.g., Lynch and Conery, 2000; Vision et al., 2000). If gene duplication and loss occur at a constant rate, the frequency of duplicated genes in a genome will decrease exponentially with time. In contrast, a large-scale duplication event such as a WGD should result in an overrepresentation of duplicated gene pairs at the time corresponding to the large-scale duplication event. Thus, if one plots the age distribution, as represented by sequence divergence, of duplicated genes from a genome, peaks in the age distribution curves may indicate WGDs (Figure 1). As with the map-based methods for detecting WGDs, the date of the large-scale duplication (peak in the graph) is usually estimated from the molecular divergence of the overrepresented gene pairs. Since the age is most often represented by the synonymous (or silent) substitutions between duplicate genes, the age-distribution plots are often called Ks plots.

[Figure 1: plot of frequency (vertical axis) against age, measured as Ks divergence (horizontal axis), with the peak labeled “Genome duplication.”]

Figure 1 Example of Ks curve. Under a constant rate of gene duplication and loss, the frequency of duplicate (paralogous) genes should decrease exponentially with age, as shown in the bottom line. In the thinner, top line, a WGD event will result in an overabundance of duplicate genes at the age of the WGD.

Ks plots can be built from any large gene-sequence data set. In fact, as noted above, the Ks plot approach has identified ancient WGDs from EST data sets in many angiosperm lineages (Blanc and Wolfe, 2004; Schlueter et al., 2004; Sterck et al., 2005; Cui et al., 2006; Barker et al., 2008), and even in the gymnosperm Welwitschia (Cui et al., 2006) and the moss Physcomitrella (Rensing et al., 2007). Still, there are several disadvantages to inferring WGDs from Ks plots. First, the age distribution plots can be difficult to interpret, and in some cases, analyses of Ks plots have failed to detect known WGDs (e.g., Blanc and Wolfe, 2004; Paterson et al., 2004). Furthermore, the age distribution of duplicate genes is indirect evidence for WGDs, and in theory, a period of reduced gene loss (Lynch, 2007) or a large-scale, but not whole-genome, duplication event can be mistaken for a WGD on a Ks plot. Finally, as with the map-based analyses, it is difficult to place a duplication event precisely just from divergence time estimates from duplicated genes. Fawcett et al. (2009) addressed this issue by using a rate-smoothing technique that does not assume a molecular clock to date the divergences (Sanderson, 2002).

Examining gene sequences from multiple taxa in a phylogenetic context provides an approach to date WGD events in relation to speciation events rather than the potentially problematic divergence time estimates (e.g., Bowers et al., 2003; Langkjaer et al., 2003; Vandepoele et al., 2003; Chapman et al., 2004). For example, in the simple three-taxon case, a gene tree is constructed with a pair of paralogous genes from the test taxon, and the best homologs from a second taxon and from an outgroup taxon (Bowers et al., 2003; Langkjaer et al., 2003; Vandepoele et al., 2003; Chapman et al., 2004).
If the paralogs from the test taxon form a clade, they diverged after the common ancestor of the test taxon and the second taxon; if they do not, they diverged before the last common ancestor. This three-taxon phylogenetic approach has been used to determine the timing of WGDs in Arabidopsis relative to its divergence from pines, rice, and other eudicots (Bowers et al., 2003) and in rice relative to its divergence from pines, Arabidopsis, and other monocots (Vandepoele et al., 2003; Chapman et al., 2004).

There also has been much interest in identifying WGDs by mapping gene duplications from a collection of gene trees, containing genes from many taxa, onto a species tree (e.g., Guigó et al., 1996; Fellows et al., 1998; Page and Cotton, 2002; Burleigh et al., 2008). A gene in the gene tree can be interpreted as a duplication if it has a child with the same lowest common ancestor mapping (LCA mapping) on a species tree (Figure 2; Eulenstein, 1998; see Bansal et al., 2007). The LCA mapping associates every gene in the gene tree to the most recent species in the species tree that could have contained the preduplication ancestral gene. However, this does not mean that the LCA mapping on the species tree indicates the location of the duplication event; in many cases, the duplication event could, in theory, predate the location of the LCA mapping.

[Figure 2: a gene tree (left, labels a to d) and a species tree (right, labels A to D).]

Figure 2 Example of LCA mapping to identify gene duplications. In lowest common ancestor (LCA) mapping, each gene (node) of the gene tree on the left is mapped to the lowest node in the species tree on the right that could have included the gene. A duplication exists when parent and child nodes on the gene tree map to the same node on the species tree, as shown with the arrows. The node on the species tree indicated by an asterisk marks the lowest possible location of the gene duplication event.

There are several proposed ways to define the range of possible location(s) of a gene duplication on a species tree (e.g., Guigó et al., 1996; Fellows et al., 1998; Page and Cotton, 2002). Because there is often a range of possible locations for each duplication, the number of possible locations for the set of all duplications can be exponentially large in the size of the input trees. The challenge is to identify a mapping, or set of locations for all duplications, that will highlight WGDs. One such approach is to seek a mapping that implies the minimum number of locations in the species tree where all duplications in the gene trees can be placed. Burleigh et al. (2008) demonstrated that this approach can be used to identify WGDs across angiosperms with relatively few gene trees. Alternatively, Bansal and Eulenstein (2008) described an efficient algorithm to find the mapping that minimizes the number of gene duplication events, or episodes, that can include all gene duplications, but this has not been tested in plants.
Furthermore, it may be useful to identify WGDs based on the size (number of duplications) of an episode, but such approaches are not yet developed. Although there appears to be much potential for gene mapping approaches to identifying WGDs, they all rely on the accuracy of the gene tree topologies. Specifically, error in the gene trees often appears as duplications toward the root of the species tree. Therefore, it may be difficult to distinguish WGDs near the root of the species tree from gene tree error (Hahn, 2007; Burleigh et al., 2008).
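The LCA mapping rule described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the algorithm of any cited study: trees are assumed to be rooted and written as nested tuples, the helper names (clades, lca, lca_map) are hypothetical, and the three-species example with genes a1, a2, b1, b2, and c1 is invented rather than taken from Figure 2. Each species-tree node is represented by the set of species below it, and a gene-tree node is flagged as a duplication when it maps to the same species-tree node as one of its children.

```python
def clades(tree):
    """List one frozenset of leaf names per node of a nested-tuple species
    tree; the last entry is always the clade of `tree` itself."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    result, leaves = [], set()
    for child in tree:
        sub = clades(child)
        result.extend(sub)
        leaves |= sub[-1]
    result.append(frozenset(leaves))
    return result


def lca(species_clades, species):
    """Smallest species-tree clade that contains every species in `species`."""
    return min((c for c in species_clades if species <= c), key=len)


def lca_map(gene_tree, gene_to_species, species_clades, duplications):
    """Return the LCA mapping of the root of `gene_tree`. The mapping of each
    internal node inferred to be a duplication is appended to `duplications`."""
    if isinstance(gene_tree, str):
        return frozenset([gene_to_species[gene_tree]])
    child_maps = [
        lca_map(child, gene_to_species, species_clades, duplications)
        for child in gene_tree
    ]
    node_map = lca(species_clades, frozenset().union(*child_maps))
    if node_map in child_maps:  # parent and child map to the same node
        duplications.append(node_map)
    return node_map


# Invented example: species tree ((A,B),C) and a gene tree in which the
# ancestor of A and B carried two gene copies.
species_tree = (("A", "B"), "C")
gene_to_species = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
gene_tree = ((("a1", "b1"), ("a2", "b2")), "c1")

duplications = []
root_map = lca_map(gene_tree, gene_to_species, clades(species_tree), duplications)
for d in duplications:
    print("duplication mapped to the ancestor of", sorted(d))
```

In this invented example the single duplication maps to the ancestor of A and B. As noted above, that node is only the lowest possible location of the event; the sketch does not explore earlier placements, nor does it account for gene tree error.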


Because the process of diploidization rapidly erases evidence of WGDs, identifying and locating ancient WGDs is an inherently difficult task. Our current understanding of WGDs throughout the evolution of plants will undoubtedly be improved with large-scale genomic data from a wider range of taxa and refinement of the current methods, perhaps through the development of hypothesis-testing frameworks, which are unfortunately uncommon in WGD analyses. Yet the fact that we still find evidence of WGDs from even hundreds of millions of years ago in current, relatively simple analyses of gene maps, distribution of duplicated genes, and topology of gene trees from plants demonstrates the profound influence of WGDs on plant genome structure. The future challenge for studying WGDs in plants is not only to seek more accurate estimates of the number and location of WGDs, but to better characterize the effects of WGDs on plant genomes through time.

3 FATES AND CONSEQUENCES OF GENE DUPLICATION IN PLANTS

3.1 Homoeolog Loss vs. Retention in Polyploids

It is generally agreed that the majority of homoeologs (genes duplicated via polyploidy) will be under relaxed selective constraints due to redundancy, allowing mutations to accumulate in one duplicate copy (Ohno, 1970; Nowak et al., 1997). Most mutations will be deleterious, causing a loss of function, so nonfunctionalization and gene silencing are expected to be the ultimate fate of one homoeolog for the majority of duplicate gene pairs. For populations with effective sizes less than the reciprocal of the loss-of-function mutation rate, nonfunctionalization is expected to occur in less than approximately 1 million generations (Lynch and Force, 2000). However, the observed frequency of homoeologs retained in polyploid organisms is higher than would be expected under this model (Wendel, 2000), suggesting that natural selection may be acting to maintain gene duplicates (Shiu et al., 2006).
Selection for retention of homoeologs could occur via several mechanisms. First, if gene copy number correlates with transcript production (“dosage dependence”), selection for a certain level of expression can act upon gene copy number. Such selection might be especially important in networks requiring precise stoichiometry of interacting gene products (Birchler et al., 2005; Maere et al., 2005; Aury et al., 2006; Thomas et al., 2006). Second, retention of homoeologs can provide fixed heterozygosity in allopolyploids, providing interlocus heterodimers and increased protein diversity in an individual (Roose and Gottlieb, 1976; Levin, 1983; Soltis and Rieseberg, 1986; Hedrick, 1987; Soltis and Soltis, 1989, 1993; Udall and Wendel, 2006) and perhaps rendering it more resistant to the deleterious effects of repeated self-fertilization (Barrett and Shore, 1987; Soltis and Soltis, 1990; Pannell et al., 2004). Third, duplicates of a gene with more than one function or expressed in more than one tissue may both degenerate to carry out complementary subsets of the ancestral gene’s role, so that both copies are maintained by selection (subfunctionalization; Force et al., 1999a, 1999b). Fourth, gene duplication may provide an “escape from adaptive conflict,” where two functions of an ancestral singleton gene are both freed to improve in different gene duplicates (Des Marais and Rausher, 2008). Fifth, rare beneficial mutations could cause one homoeolog to carry out a new function that is favored by natural selection (neofunctionalization; Ohno, 1970; Ohta, 1988; Walsh, 1995; Lynch and Conery, 2000).
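The fates named above can be made concrete with a toy classifier. The sketch below is only an illustration, not a method from the studies cited: it assumes that each gene copy is summarized by the set of tissues in which it is expressed, and it treats expression in a tissue outside the ancestral set as a crude stand-in for a new function. The function name, labels, and tissue names are all hypothetical.

```python
def classify_duplicate_fate(ancestral, copy_a, copy_b):
    """Toy classification of a duplicate pair from tissue-expression sets.

    ancestral: tissues in which the pre-duplication gene was expressed.
    copy_a, copy_b: tissues in which each duplicate is expressed now.
    Assumption: expression domain is used as a proxy for gene function.
    """
    ancestral = frozenset(ancestral)
    copy_a = frozenset(copy_a)
    copy_b = frozenset(copy_b)

    if not copy_a and not copy_b:
        return "both copies silenced"
    if not copy_a or not copy_b:
        # One copy is silent everywhere; the other carries on alone.
        return "nonfunctionalization"
    if (copy_a | copy_b) - ancestral:
        # At least one copy is expressed somewhere the ancestor was not.
        return "possible neofunctionalization"
    if copy_a == ancestral and copy_b == ancestral:
        return "both copies retained"
    if (copy_a | copy_b) == ancestral and copy_a != ancestral and copy_b != ancestral:
        # Neither copy covers the ancestral role alone, but together they do.
        return "possible subfunctionalization"
    return "unclassified"


ancestral = {"leaf", "root", "flower"}
print(classify_duplicate_fate(ancestral, {"leaf"}, {"root", "flower"}))
print(classify_duplicate_fate(ancestral, set(), ancestral))
```

Real data rarely sort this cleanly: dosage-based retention and escape from adaptive conflict, for example, cannot be distinguished from expression domains alone, which is why the sketch returns hedged labels and an "unclassified" fallback.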

Recently, it has been suggested that for certain genes, selection could favor rapid loss of one homoeolog (Paterson et al., 2006). This scenario could occur due to dosage-dependent effects. First, some gene products may be effective in low concentrations, but tend to form nonfunctional aggregates at high concentrations (Conrad and Antonarakis, 2007). Second, disrupted stoichiometry of gene products can occur if one gene is duplicated but another is not; this might occur within the nucleus if a segment of the genome is duplicated, or between nuclear and cytoplasmic genes in the case of a whole (nuclear)-genome duplication. In Homo sapiens, gene duplications cause a significant number of genetic diseases that tend to involve dosage-sensitive genes and genes encoding proteins with a propensity to aggregate (Conrad and Antonarakis, 2007). Selection for singleton status could also act when whole-genome duplication accompanies hybridization (allopolyploidy), in which case an organism’s genome comprises two divergent parental genomes. For example, products of genes A and B in one parental genome form a protein heterodimer and have coevolved, and genes A′ and B′ have also coevolved in the other genome so that their products can form a dimer with one another. If A′–B and A–B′ interlocus heterodimers fail, selection could favor the loss or silencing of genes A and B or A′ and B′ (Comai et al., 2003).

Comparative studies in putative paleopolyploids provide some evidence that certain genes, if duplicated, are predisposed for a return to singleton status. In a study comparing genome duplication in Oryza (rice), Arabidopsis, Tetraodon (puffer fish), and Saccharomyces (yeast), Paterson et al. (2006) found 16 protein family (Pfam) domains in Arabidopsis and 12 Pfam domains in Oryza that contain a higher-than-expected percentage of singleton genes.
In five cases, these Pfam domains were the same in the two species, and in one case a singleton-enriched Pfam domain in Oryza was the same as a singleton-enriched Pfam domain in Tetraodon. Paterson et al. (2006) suggested that this convergent reversion to singleton status points to certain domain-containing proteins being maladaptive in duplicate copy. Similarly, Leebens-Mack et al. (2006) estimated that 727 strict ortholog sets exist as singletons in the Arabidopsis, Oryza, and Populus genomes, but that if reversion to singleton status were independent in the three lineages, the number of strict ortholog sets should be 99. Again, this high rate of convergence could suggest the action of natural selection driving duplicate gene loss and return of the same genes to singleton status.

A few studies have documented homoeolog loss in recent polyploids. In the allopolyploid species Tragopogon miscellus, loss of homoeologs is found in about fortieth-generation natural populations (Tate et al., 2006; Buggs et al., 2009), but is not found in F1 hybrids (Tate et al., 2006) or first-generation synthetics (Buggs et al., 2009). Loss of coding (Kashkush et al., 2002) and noncoding (Liu et al., 1998b; Shaked et al., 2001; Ozkan et al., 2002) DNA sequence has been shown in synthetic Triticum allopolyploids. Genomic changes in synthetic allopolyploid lines in Brassica (Song et al., 1995; Gaeta et al., 2007) and genome downsizing in several polyploid species (Leitch and Bennett, 2004; Eilam et al., 2009) also suggest that rapid homoeolog loss may occur. One mechanism by which such rapid losses may occur is homoeologous nonreciprocal transpositions (Udall et al., 2005; Gaeta et al., 2007) or activation of transposons due to genomic shock (McClintock, 1984; Comai, 2000).

Loss of some nonprotein coding sequences seems to follow a more concerted mechanism. In Tragopogon and Nicotiana, homogenization of rDNA repeats through gene conversion occurs gradually after allopolyploidization (Matyášek et al., 2003, 2007; Kovařík et al., 2004, 2005).
In Triticum allopolyploids, low-copy noncoding DNA, some of which is chromosome specific, is eliminated in a well-orchestrated and reproducible fashion in F1 hybrids and synthetic allopolyploid lines (Liu et al., 1998b; Shaked et al., 2001; Ozkan et al., 2002).

3.2 Transposon Activation in Polyploids

Hybridization and polyploidization may cause “genomic shock,” activating mobile elements in the genome (McClintock, 1984; Comai, 2000; Comai et al., 2003). These may cause genome restructuring (Lonnig and Saedler, 2002) and/or changes in gene expression (McClintock, 1984; Weil and Martienssen, 2008). Increased transposon activity has been found in allopolyploids of Nicotiana (Petit et al., 2007), Triticum (Kashkush et al., 2003; Dong et al., 2005), and Arabidopsis (Madlung et al., 2002, 2005). It is not known to what extent this is due to hybridization as opposed to genome doubling, as homoploid hybridization can activate mobile elements (Shan et al., 2005; Ungerer et al., 2006). In early-generation autopolyploid Arabidopsis, the transposon Sunfish was activated, but its transcription was repressed in advanced autopolyploid generations, whereas it remained active in allotetraploids (Madlung et al., 2002).

3.3 Gene Expression Changes in Polyploids

Changes in gene expression due to polyploidization have been the subject of several recent reviews (Osborn et al., 2003; Adams and Wendel, 2005b; Chen and Ni, 2006; Adams, 2007; Chen, 2007; Hegarty et al., 2008; Hegarty and Hiscock, 2008). Most studies compare expression in different species, such as polyploids versus parental diploids, homoploid hybrids versus allopolyploids, or autopolyploids versus allopolyploids.
Comparisons may be between (1) expression of genes without distinguishing between homoeologs at each locus (Hegarty et al., 2005, 2006; Wang et al., 2006a); or (2) expression of each homoeolog at individual loci (Adams et al., 2003, 2004; Adams and Wendel, 2005a; Tate et al., 2006; Liu and Adams, 2007; Flagel et al., 2008; Buggs et al., 2009). Comparison 1 may help us understand the extent and timing of the effects of polyploidy on the transcriptome, whereas comparison 2 allows us to understand the contributions of the individual genomes to those effects, and the influence that this may have on the evolution of duplicated genes.

In synthetic Arabidopsis thaliana and A. suecica polyploids, microarray studies showed changes in gene expression in over 5% of 26,090 genes in the allopolyploids relative to their parents (Wang et al., 2006a). Of these genes, 68% were expressed at different levels in the parental species (Wang et al., 2006a), and only 41% of changes were common to two independent allopolyploid lineages. An autopolyploid of A. thaliana showed altered expression in only 0.3% of genes relative to its parent (Wang et al., 2006a). In synthetic Senecio cambrensis allopolyploids, anonymous microarrays with about 6000 cDNA clones showed that gene expression differs extensively in diploid hybrids compared to the parental species, but that these differences are ameliorated by polyploidization (Hegarty et al., 2005, 2006). Patterns of expression in the synthetic allopolyploid were maintained over five generations (Hegarty et al., 2006). Both of these studies suggest that hybridization rather than polyploidization has a greater instantaneous effect on global patterns of gene expression.
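The difference between comparison 1 and comparison 2 can be illustrated with a small calculation. The sketch below is a hypothetical example, not the analysis used in any of the studies cited: it assumes that expression of each homoeolog at a locus is available as a separate count, and the 0.75 bias threshold and all names are arbitrary choices made only for illustration.

```python
def summarize_pair(count_1, count_2, bias_threshold=0.75):
    """Summarize one homoeolog pair from two expression counts.

    'total' is what comparison 1 sees (homoeologs pooled);
    'fraction_1' and 'call' are what comparison 2 adds.
    The threshold is an illustrative assumption, not a published cutoff.
    """
    total = count_1 + count_2
    if total == 0:
        return {"total": 0, "fraction_1": None, "call": "not expressed"}

    fraction_1 = count_1 / total
    if count_1 == 0 or count_2 == 0:
        call = "one homoeolog silenced"
    elif fraction_1 >= bias_threshold or fraction_1 <= 1 - bias_threshold:
        call = "biased"
    else:
        call = "balanced"
    return {"total": total, "fraction_1": fraction_1, "call": call}


# Two loci with the same pooled expression but different homoeolog contributions.
print(summarize_pair(50, 50))
print(summarize_pair(100, 0))
```

Both loci in the usage example have the same pooled total, so they would look identical under comparison 1; only the per-homoeolog view separates balanced expression from silencing of one copy.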

Differences in expression between homoeologs are likely to have important consequences, and studies that distinguish between homoeologs show this to be occurring. In natural cotton polyploids, a microarray study of 1383 genes found biased expression in 70% of homoeolog pairs, only 24% of which seem to have biased expression immediately on diploid hybridization of the parental species (Flagel et al., 2008). Detailed study of expression of homoeologs in cotton shows examples of environment-specific expression (Liu and Adams, 2007) and tissue-specific expression (Adams et al., 2003, 2004), one example of which occurs in diploid hybrids (Adams and Wendel, 2005a). Silencing of homoeologs, including tissue-specific homoeologs, has also been shown in hexaploid wheat (Bottley et al., 2006). In Tragopogon miscellus, silencing of homoeologs in leaf tissue has been found in natural allopolyploids about 80 years old, but not in diploid hybrids or synthetic allopolyploids (Tate et al., 2006; Buggs et al., 2009). Homoeolog silencing has also been found in the natural allopolyploid Arabidopsis suecica (Lee and Chen, 2001). Genes thought to have been duplicated by polyploidy before the split of A. thaliana and A. arenosa show higher levels of expression divergence between the two species than singleton genes, and a greater proportion of them were nonadditively expressed in resynthesized and natural allotetraploids formed from the two species (Ha et al., 2009).

Together, these studies suggest that modulation of gene expression occurs at temporally distinct stages. Some effects are instantaneous effects of hybridization (Adams and Wendel, 2005a; Hegarty et al., 2006; Flagel et al., 2008), whereas others evolve in generations subsequent to polyploidization (Tate et al., 2006; Flagel et al., 2008; Buggs et al., 2009).
Thus, it appears that polyploidy may provide an initial saltation in gene expression, followed by gradual changes in expression permitted by the presence of genes in duplicate form. Changes in gene expression between homoeologs may lead to divergence of gene sequence. Silenced genes will not be selected for, and so may accumulate deleterious mutations and so be nonfunctionalized (see above). Homoeologs that are expressed in different tissues may be showing incipient subfunctionalization (see above).

3.4 Epigenetics in Polyploids

Polyploidy appears to have epigenetic effects that are responsible for changes in gene expression that are stable across generations (Comai, 2000; Liu and Wendel, 2002, 2003; Osborn et al., 2003; Rapp and Wendel, 2005; Chen and Ni, 2006). Ni and colleagues (2009) have recently provided intriguing evidence that growth vigor and increased biomass in hybrid and allopolyploid plants of Arabidopsis are caused by epigenetic modulation of parental alleles and homologous loci of the internal circadian clock regulators: this alters the amplitude of downstream gene expression and metabolic flux in clock-mediated photosynthesis and carbohydrate metabolism. A theoretical study (Rodin and Riggs, 2003) suggests that epigenetic tissue-specific silencing may enhance the evolution of genes to divergent functions, especially in small populations, promoting subfunctionalization. Below we review the various types of epigenetic changes that have been found in polyploids.

Methylation

Methylation-sensitive AFLP analysis in Brassica (Song et al., 1995), Arabidopsis (Madlung et al., 2002), and Triticum (Liu et al., 1998a,b; Shaked et al., 2001) showed widespread changes in genomic methylation patterns
upon polyploidization. In 20 accessions of allotetraploid Gossypium hirsutum, methylation-polymorphism diversity was greater than genetic diversity (Keyte et al., 2006). A study of 49 synthetic first-generation Brassica napus allopolyploids and their parental diploids found methylation changes to be much more common (35 of 73 markers) in the allopolyploids than insertions and deletions in the DNA (3 of 76 markers) (Lukens et al., 2006). It seems likely that the causes of these epigenetic changes are themselves epigenetic. Fulnecek and colleagues (2009) examined three major DNA methyltransferase families (MET1, CMT3, and DRM) in Nicotiana tabacum and found that both homoeologs of each gene were retained and expressed in the allopolyploid.

Small Interfering RNA (siRNA)

Recent evidence suggests that siRNA may play a role in controlling methylation of DNA in polyploids. Chen and colleagues (2008) found that accumulation of centromeric siRNA in A. suecica correlated with centromere methylation. When Preuss and colleagues (2008) knocked out two genes required for the biogenesis of siRNAs (RDR2 and DCL3) in A. suecica, nucleolar dominance was disrupted, suggesting that the methylation of 45S rRNA is directed by siRNA.

Chromatin Remodeling

Chromatin modification appears to play a significant role in changes in gene expression in polyploids (reviewed in Chen and Tian, 2007). Histone acetylation, together with DNA methylation, has been shown to play a role in repressing rRNA genes in Brassica (Chen and Pikaard, 1997) and Arabidopsis (Lawrence et al., 2004; Earley et al., 2006) allopolyploids, causing nucleolar dominance (Pikaard, 1999). Late flowering in Arabidopsis synthetic allotetraploids is correlated with activation of the flowering repressor gene FLC by histone acetylation and methylation (Wang et al., 2006b).
Ni and colleagues (2009) suggest that in allotetraploid Arabidopsis the expression of clock regulators is altered by chromatin modifications, including rhythmic changes in histone acetylation. It has been suggested that histone modification could be influenced in polyploids by dosage effects of the gene products involved in histone modifier complexes (Birchler et al., 2005).

Alternative Splicing

Recent evidence suggests that allopolyploidy in wheat can affect gene regulation via changes in alternative splicing efficiency. Terashima and Takumi (2009) compared levels of the alternatively spliced forms of WDREB2, a transcription factor involved in abiotic stress response, among wheats of different ploidal levels. In diploids, the level of the nonfunctional transcript gradually decreased due to splicing in response to drought stress, but in hexaploid wheat lines, including both cultivars and synthetic lines, the nonfunctional form failed to decrease, suggesting that allopolyploidization inhibited efficient alternative splicing of the transcripts.

4 DUPLICATIONS IN THE MADS-BOX GENE FAMILY AND THEIR ROLES IN FLORAL DEVELOPMENT

Gene duplication and diversification are among the most important genetic raw materials for evolutionary change. Many genes involved in floral development have undergone duplication over the course of angiosperm evolution. For example, gene families such as the MADS-box genes and the TCP genes were duplicated multiple times in the history of flowering plants (Howarth and Donoghue, 2006; reviewed by P.S. Soltis et al., 2009).

MADS-box genes encode transcription factors that contain a DNA-binding domain (the MADS domain) and regulate a wide variety of developmental processes. They are found in three eukaryotic kingdoms (plants, animals, and fungi) but have undergone a significant amount of gene duplication in plants, which, coupled with the recruitment of duplicate genes to new roles, is likely to have played a fundamental role in plant evolution (Theissen et al., 2000; Parenicova et al., 2003; Irish and Litt, 2005; Martinez-Castilla and Alvarez-Buylla, 2005). In flowering plants, especially, MADS-box genes have a wide range of functions, including roles in the transition from vegetative growth to flowering and in the development of flowers themselves. Over 70 MADS-box genes are present in the genomes of the angiosperm genetic models Arabidopsis and rice, while far fewer MADS-box genes have been found in the gymnosperms (Nam et al., 2003). Phylogenetic reconstructions suggest that the MADS-box gene family has diversified in angiosperms, with duplications of many gene lineages in angiosperm ancestors or within specific angiosperm clades (Becker and Theissen, 2003; Figure 3).

Figure 3 Phylogenetic distribution of MADS-box gene duplications across the angiosperms, as shown by the shaded bars on the branches of the tree. The shades of the bars correspond to the gene class of the ABCE model, which is shown at the lower left of the figure. Sep, sepals; Pet, petals; Stm, stamens; Car, carpels.

4.1 Gene Duplications and Diversification in MADS-Box Floral Organ Identity Genes

According to the ABCE model, the overlapping influences of four functions (A, B, C, and E) regulate floral organ identity. In A. thaliana, A function is involved in the specification of sepals and petals, B function in petal and stamen specification, C function in stamen and carpel specification, and E function participates in the specification of
all floral organs (Coen and Meyerowitz, 1991; Colombo et al., 1995; Pelaz et al., 2000; Ditta et al., 2004). These functions are all encoded by members of separate MADS-box gene lineages, each of which shows evidence of multiple duplication events (Theissen, 2001). On the basis of functional studies, most notably in the model plant A. thaliana, it has been demonstrated that ancestral role retention, role swapping, and the acquisition of novel roles in floral development are among the functional consequences of these duplication events.
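The combinatorial logic of the ABCE model can be written down as a small lookup table. The sketch below is a simplification made for illustration: it reads the description above literally (sepals need A and E; petals A, B, and E; stamens B, C, and E; carpels C and E) and ignores any regulatory interactions between the functions, so it should not be read as a prediction of real mutant phenotypes. All names in the code are hypothetical.

```python
# Organ identity as a function of which ABCE functions are active in a whorl.
ORGAN_BY_FUNCTIONS = {
    frozenset("AE"): "sepal",
    frozenset("ABE"): "petal",
    frozenset("BCE"): "stamen",
    frozenset("CE"): "carpel",
}

# Functions active in whorls 1 to 4 of a wild-type flower under this reading.
WILD_TYPE_WHORLS = ["AE", "ABE", "BCE", "CE"]


def organ_identity(active_functions):
    """Return the organ specified by a set of active functions, if any."""
    return ORGAN_BY_FUNCTIONS.get(frozenset(active_functions), "unspecified")


def flower(lost=""):
    """Organ identities of the four whorls after removing the 'lost' functions.

    Assumption: removing a function changes nothing else, i.e. the
    remaining functions keep their wild-type domains.
    """
    return [organ_identity(set(whorl) - set(lost)) for whorl in WILD_TYPE_WHORLS]


print(flower())          # wild type
print(flower(lost="B"))  # loss of B function
print(flower(lost="E"))  # loss of E function
```

Under these assumptions, removing B leaves sepal identity in the two outer whorls and carpel identity in the two inner ones, and removing E leaves no whorl with a specified organ identity.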

A-Function and APETALA1

The A function lineage is represented by APETALA1 (AP1), CAULIFLOWER (CAL), and FRUITFUL (FUL) in Arabidopsis, of which AP1 is the de facto A-function gene and has undergone at least two duplications in angiosperm history (Litt and Irish, 2003). The most recent is probably confined to the Brassicaceae (Lowman and Purugganan, 1999) and produced the Arabidopsis paralogs AP1 and CAL, which are nearly identical in sequence and redundant for specifying floral meristem identity (Bowman et al., 1993; Kempin et al., 1995). A more ancient duplication event at the base of the core eudicots produced two distinct core eudicot gene clades, the euAP1 clade, which includes AP1 (and CAL), and the euFUL clade, which includes FUL (Litt and Irish, 2003; Litt, 2007). FUL has no known role in organ identity, but shares a function in floral meristem specification with AP1 and CAL (Mandel and Yanofsky, 1995; Ferrandiz et al., 2000) and also has a unique function in fruit development (Ferrandiz, 2002). Thus, there are three paralogous members of the A-function lineage in the Arabidopsis genome, each making varying contributions to floral meristem identity.

Efforts to evaluate the functions of orthologs and paralogs AP1/CAL and FUL in other species suggest that members of this lineage may play a conserved role in floral meristem identity. However, despite similar expression patterns of euAP1 orthologs, the AP1 function of A. thaliana is not conserved. For example, orthologs of AP1 (members of the euAP1 clade) in Antirrhinum, pea (Pisum sativum), and tomato (Solanum lycopersicum) do not appear to have a role in sepal and petal specification. Instead, mutants display decreased flowering and increased inflorescence branching, indicating a role in determining floral meristem identity (Huijser et al., 1992; Berbel et al., 2001; Taylor et al., 2002; Vrebalov et al., 2002).
Within the euFUL gene clade there have been numerous duplications within various core eudicot groups, but few FUL orthologs have been functionally characterized. Reported expression patterns suggest little evidence for a conserved role in meristem identity and/or fruit development, but data are still limited (Litt, 2007). Angiosperm taxa that diverged before the core eudicot duplication “below” the euAP1 and euFUL duplication have genes with greater sequence similarity to euFUL genes than euAP1 genes. These FUL-like genes have undergone numerous duplications in different angiosperm clades, and as with euFUL genes, functional data are generally lacking. Expression patterns tend to be broad and varied, but might support a conserved role in floral meristem identity. Expression levels of the FUL-like genes in the basal angiosperms Persea, Nuphar, and Magnolia were greater in leaves than mature floral organs (Kim et al., 2005; Chanderbali et al., 2006; Yoo et al., unpublished data), but the highest levels were measured in the emerging inflorescences of Persea and pre-meiotic buds of Nuphar, developmental stages enriched for floral meristems (Chanderbali et al., 2009; Yoo et al., unpublished data).

B-Function and APETALA3 and PISTILLATA

The two B-function genes APETALA3 (AP3) and PISTILLATA (PI) belong to two paralogous gene lineages that resulted from a duplication event prior to the origin of the angiosperms (Kim et al., 2004). In addition, the AP3 lineage underwent another duplication event at the base of the core eudicots, giving rise to two AP3 sublineages: the euAP3 and the TOMATO MADS BOX GENE6 (TM6) gene lineages (Kramer et al., 1998). As with the FUL-like genes in noncore eudicot angiosperms, the angiosperms that diverged prior to this duplication event have “paleoAP3” genes, which share greater sequence similarity with TM6 than with euAP3. TM6 genes have been independently lost in Arabidopsis and Antirrhinum, but most core eudicots have both paralogs. Functional studies suggest functional diversification in these paralogous lineages. For example, TM6 is involved in the development of stamens but not petals, while the ortholog of AP3 regulates both stamen and petal development in tomato (de Martino et al., 2006). In contrast, in petunia, petals are transformed into sepals with little effect on stamens after loss of the AP3 ortholog, and stamens are only affected when mutations afflict orthologs of both AP3 and TM6 (Rijpkema et al., 2006). Outside the core eudicots, duplicate paleoAP3 genes in poppy have apparently been specialized into promoting the development of either stamens or petals, but not both. Similarly, three paleoAP3 paralogs in Aquilegia appear to have temporal as well as spatial partitioning in their roles in stamen and/or petal development (Kramer et al., 2007). The only other functionally characterized paleoAP3 genes are from the monocots rice and maize, which do not seem to have a history of duplications, and function in both stamen and petal (= lodicule in grasses) development (Whipple et al., 2004, 2007).
The expression patterns of paleoAP3 genes in basal angiosperms suggest an ancestral role in stamen and perianth specification (Kim et al., 2005; Chanderbali et al., 2006; Soltis et al., 2006, 2007; Yoo et al., in preparation), although duplications may have resulted in neo- or subfunctionalization. For example, one of two paleoAP3 paralogs in Nuphar advena is restricted to stamens during early development while the other is expressed in both stamens and inner tepals (petals), although both paralogs are detected throughout the flower during mature floral stages (Kim et al., 2005). Functional diversification following duplication is also suggested by expression shifts of three paleoAP3 paralogs in Illicium floridanum. One paralog is expressed in the outer tepals, inner tepals, and stamens (the typical paleoAP3 expression), the second is restricted to the inner tepals and stamens, and the third is limited to the inner tepals (Kim et al., 2005).

Unlike the AP3 lineage, the PI lineage has not undergone a duplication event at the base of the core eudicots, but there are ample examples of dynamic patterns of evolution in other angiosperms. For instance, there is clear evidence for ancient duplications in the magnoliid clade of basal angiosperms (Stellari et al., 2003), but expression data do not suggest functional diversification (Chanderbali et al., 2006, 2009), although functional data are unavailable. Numerous recent duplications have occurred in the Ranunculaceae (Kramer et al., 2003) and monocots (Winter et al., 2002), but the limited expression data do not suggest functional diversification.

C-Function and AGAMOUS

The evolutionary history of the C-function gene lineage demonstrates recent duplications in various angiosperm taxa, an older duplication event placed early in the history of the core eudicots, and an even more ancient duplication
event early in angiosperm history after the divergence of the angiosperms and gymnosperms (Kramer et al., 2004; Zahn et al., 2006). Three sequential gene duplications in the C-function lineage have resulted in four paralogs in the Arabidopsis genome. The first occurred before the radiation of flowering plants and gave rise to the SEEDSTICK (STK) and AGAMOUS (AG) lineages. The second duplication event occurred at the base of the core eudicots, producing the PLENA lineage as sister to the euAG lineage. Duplicate members of the PLENA lineage in Arabidopsis, SHATTERPROOF1 (SHP1) and SHATTERPROOF2 (SHP2), appear to have resulted from a recent duplication in the Brassicaceae. AG functions in floral meristem determinacy as well as stamen and carpel development, and is expressed in the floral meristem from developmental stage 3 (as defined by Smyth et al., 1990), in stamens and carpels from primordial to mature stages, and later in the developing seed coat (Bowman et al., 1991; Drews et al., 1991). SHP1 and SHP2 are expressed in the ovules and in the developing pistil and fruit, where they share largely redundant functions in specifying the fruit dehiscence zone required for seed-pod shattering, by controlling the formation of specialized valve margin cells that are found only in fruits of the Brassicaceae (Liljegren et al., 2000). STK is expressed in the developing ovule primordia and seeds and functions in specifying ovule identity (D function) along with the AG and SHP1/SHP2 genes (Rounsley et al., 1995; Favaro et al., 2003; Pinyopich et al., 2003). Remarkably, in a demonstration of how gene function can be unpredictably partitioned between products of a gene-duplication event, AG and PLENA (PLE), its functional counterpart in Antirrhinum, respectively, belong to the paralogous euAG and PLENA lineages that descended from the core eudicot duplication event.
Thus, PLE is orthologous to SHP1/SHP2 while functionally more similar to AG (Bradley et al., 1993; Davies et al., 1999). Other functionally characterized members of the euAG lineage from petunia and morning glory function similarly to AG, even though these species are more closely related to Antirrhinum than to Arabidopsis. The Antirrhinum AG ortholog, FARINELLI (FAR), has a lesser role in organ identity and meristem determinacy than AG or PLE, and instead, plays a greater role in late stamen development than PLE (Davies et al., 1999). Members of paralogous euAG and PLE lineages in the core eudicots therefore display redundancy, subfunctionalization, and/or neofunctionalization in different taxa, and demonstrate that the functional roles of duplicate genes can vary considerably.

Outside the core eudicots, a duplication event in the monocots has also produced paralogous AG lineages that may have become subfunctionalized. For example, the maize gene ZAG1 is more strongly expressed in carpels, while the paralogous ZMM2 is restricted to stamens (Mena et al., 1996). Similarly, duplication events have produced three AG homologs in the magnoliid Persea, one of which is restricted to late-stage stamens and carpels while the other two are expressed in all floral organs at induction and maturity (Chanderbali et al., 2006, 2009).

E-Function and SEPALLATA

Sequential duplications similar to those in the AG lineage gave rise to four SEPALLATA (SEP1 to SEP4) genes, which may be largely functionally redundant in Arabidopsis. All are flower-specific and expressed in all floral organs, and only the quadruple mutant lacking all four genes exhibits a complete loss of floral organ identity (Ditta et al., 2004). The first duplication event predates the radiation of extant angiosperms and produced the SEP3 and SEP1/2/4 lineages. A second duplication in the latter lineage occurred at the base of the core eudicots and
separated the SEP4 and SEP1/2 lineages, while a subsequent duplication resulted in duplicate SEP1 and SEP2 in the Brassicaceae. The expression of SEP homologs in angiosperms is generally conserved, and it appears that the entire SEP subfamily has a potentially conserved function in controlling the identity of all floral organs. Additionally, members of the SEP lineage may have a conserved role in floral meristems and ovules (Zahn et al., 2005). However, lineage-specific duplications followed by functional diversification are evident. Several additional gene duplication events were detected within the SEP1/2/4 and SEP3 lineages in monocots and eudicots. Zahn et al. (2005) detected at least five distinct grass clades in the monocots: three in the SEP1/2/4 lineage and two in the SEP3 lineage. After the origin of the eudicots but before the radiation of the core eudicots, an early duplication event in the SEP1/2 lineage (prior to the Brassicaceae duplication event) produced the FLORAL BINDING PROTEIN9 (FBP9) lineage that has apparently been lost in Arabidopsis (Zahn et al., 2005). The expression patterns in the basal angiosperms are similar to those of Arabidopsis, although duplication events have also occurred. For example, a recent duplication in the SEP3 lineage has produced two paralogs in Persea (Chanderbali et al., 2006, 2009).

Despite this dynamic evolutionary history of gene duplications and diversification, SEP genes may generally be conserved in specifying meristem and floral organ identity in angiosperms. However, it is noteworthy that they range from being developmentally redundant, as in Arabidopsis, to having unique roles in Gerbera of the sunflower family (Asteraceae). The Gerbera GRCD1 and GRCD2 genes, of the SEP3 and SEP1/2/4 lineages, respectively, are expressed in all floral organs but have become subfunctionalized.
Down-regulation of GRCD1 results in transformation of the staminodes of female flowers into petals, while GRCD2 is needed for carpel development (Ulmari et al., 2004).

4.2 Gene Duplications and Morphological Novelty: Is There a Connection?

It is tempting to hypothesize that gene and genome duplications, while providing new raw material for evolution, also provide the catalyst for morphological innovation. Although such hypotheses are probably oversimplifications of the underlying genetic requirements for morphological evolution and data are limited, recent evidence from Aquilegia (Kramer et al., 2007; Rasmussen et al., 2009) suggests that varying patterns of expression of three AP3 paralogs and PI control petaloidy in these flowers. That is, different expression patterns of the duplicated genes contribute to the novel features of columbine flowers. Although not extensive evidence for the hypothesis that morphological novelty can arise through the action of duplicate genes, the Aquilegia example is certainly intriguing and suggests that other cases should be investigated.

5

CONCLUSIONS

Plant genes and genomes are replete with duplications, arising from both local and whole-genome duplication events. These duplications span nearly the age of the angiosperms themselves: some WGDs date back to the early nodes of angiosperm phylogeny, and more recent WGDs are superimposed on these ancient events. The result is a complex genomic structure in all species investigated to date. The gene pairs resulting from various processes of duplication provide an immense data set for exploring the consequences of gene and genome duplication and the fate of duplicate genes. As predicted by theory, some duplicate genes are retained in duplicate, each continuing to perform the ancestral function. Other genes are silenced, or nonfunctionalized. Still other pairs undergo neofunctionalization, in which one member of the pair acquires a new function. Finally, data are beginning to accumulate in support of the concept of subfunctionalization, the parsing of ancestral function between members of a duplicate pair. One of the most surprising observations is the propensity for homoeolog loss, as observed in hexaploid wheat and Tragopogon allotetraploids in particular. Such homoeolog loss is more rapid than processes that rely on the accumulation of point mutations for gene silencing or changes in gene function; loss may occur very soon after gene or genome duplication (e.g., Tate et al., 2006; Buggs et al., 2009). The fate of duplicate genes may be, to some extent, lineage-specific, as some allotetraploids undergo homoeolog loss and others do not. The factors that contribute to homoeolog loss are unknown. Following stabilization after homoeolog loss, longer-term processes involving point mutations take over, and duplicate gene pairs within the same genome may experience alternative fates, from silencing of one copy to neofunctionalization to subfunctionalization. The genomic attributes and selective pressures that result in one fate versus another have not been addressed. As patterns of duplicate gene fate begin to emerge, we should turn our attention next to those genomic features that may lead to one path versus another. For example, are genes duplicated by polyploidy more likely to undergo loss than those duplicated via tandem duplication? Do duplicate genes involved in the same pathway or network respond in the same way? And ultimately, what are the links between duplicate genes, morphological novelty, and organismal diversification?
These questions are just beginning to be addressed (see, e.g., DeBodt et al., 2005; Maere et al., 2005; Freeling and Thomas, 2006; Semon and Wolfe, 2007; Fawcett et al., 2009; Freeling, 2009; Van de Peer et al., 2009). The next few years offer extremely exciting opportunities for further study of duplicate genes in plants.

Acknowledgments

This work was supported in part by National Science Foundation grants MCB-0346437, DEB-0608268, EF-0431266, PGR-0115684, and DBI-0638595 and by the University of Florida. We appreciate the contributions and helpful discussion of R. J. A. Buggs. We thank two anonymous reviewers for their comments on an earlier draft of the chapter.

REFERENCES

Adams KL. 2007. Evolution of duplicate gene expression in polyploid and hybrid plants. J Hered 98:136–141. Adams KL, Wendel JF. 2005a. Allele-specific, bidirectional silencing of an alcohol dehydrogenase gene in different organs of interspecific diploid cotton hybrids. Genetics 171:2139–2142. Adams KL, Wendel JF. 2005b. Novel patterns of gene expression in polyploid plants. Trends Genet 21:539–543. Adams KL, Cronn R, Percifield R, Wendel JF. 2003. Genes duplicated by polyploidy show unequal contributions to the transcriptome and organ-specific reciprocal silencing. Proc Natl Acad Sci USA 100:4649–4654.


Adams KL, Percifield R, Wendel JF. 2004. Organ-specific silencing of duplicated genes in a newly synthesized cotton allotetraploid. Genetics 168:2217–2226. Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178. Bansal MS, Eulenstein O. 2008. The multiple gene duplication problem revisited. Bioinformatics 24:i132–i138. Bansal MS, Burleigh JG, Eulenstein O, Wehe A. 2007. Heuristics for the gene-duplication problem: a Θ(n) speed-up for the local search. In Speed TP, Huang H (eds.), Proceedings of the 11th Annual International Conference on Research in Computational Molecular Biology (RECOMB’07), Vol. 4453 of Lecture Notes in Computer Science. New York: Springer-Verlag, pp. 238–252. Barker MS, Kane NC, Matvienko M, Kozik A, Michelmore RW, Knapp SJ, Rieseberg LH. 2008. Multiple paleopolyploidizations during the evolution of the Compositae reveal parallel patterns of duplicate gene retention after millions of years. Mol Biol Evol 25:2445–2455. Barrett SCH, Shore JS. 1987. Variation and evolution of breeding systems in the Turnera ulmifolia complex (Turneraceae). Evolution 41:340–354. Becker A, Theissen G. 2003. The major clades of MADS-box genes and their role in the development and evolution of flowering plants. Mol Phylogenet Evol 29:464–489. Bennett MD, Leitch IJ. 2004. Plant DNA C-values database (release 3.0, Dec. 2004). www.rbgkew.org.uk/cval/homepage.html. Berbel A, Navarro C, Ferrandiz C, Canas LA, Madueno F, Beltran JP. 2001. Analysis of PEAM4, the pea AP1 functional homologue, supports a model for AP1-like genes controlling both floral meristem and floral organ identity in different plant species. Plant J 25:441–451. Birchler JA, Riddle NC, Auger DL, Veitia RA. 2005. Dosage balance in gene regulation: biological implications. Trends Genet 21:219–226.
Blanc G, Wolfe KH. 2004. Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell 16:1667–1678. Blanc G, Hokamp K, Wolfe KH. 2003. A recent polyploidy superimposed on older large-scale duplications in the Arabidopsis genome. Genome Res 13:137–144. Bottley A, Xia GM, Koebner RMD. 2006. Homoeologous gene silencing in hexaploid wheat. Plant J 47:897–906. Bowers JE, Chapman BA, Rong J, Paterson AH. 2003. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422:433–438. Bowman JL, Drews GN, Meyerowitz EM. 1991. Expression of the Arabidopsis floral homeotic gene AGAMOUS is restricted to specific cell types late in flower development. Plant Cell 3:749–758. Bowman JL, Alvarez J, Weigel D, Meyerowitz EM, Smyth DR. 1993. Control of flower development in Arabidopsis thaliana by APETALA1 and interacting genes. Development 119:721–743. Bradley D, Carpenter R, Sommer H, Hartley N, Coen E. 1993. Complementary floral homeotic phenotypes result from opposite orientations of a transposon at the plena locus of Antirrhinum. Cell 72:85–95. Buggs RJA, Doust AN, Tate JA, Koh J, Soltis K, Feltus FA, et al. 2009. Gene loss and silencing in Tragopogon miscellus (Asteraceae): comparison of natural and synthetic allotetraploids. Heredity 103:73–81. Burleigh JG, Bansal MS, Wehe A, Eulenstein O. 2008. Locating multiple gene duplications through reconciled trees. RECOMB, LNBI 4955:273–284. Cannon SB, Sterck L, Rombauts S, Sato S, Cheung F, Gouzy J, et al. 2006. Legume genome evolution viewed through the Medicago truncatula and Lotus japonicus genomes. Proc Natl Acad Sci USA 103:14959–14964.


Chanderbali AS, Kim S, Buzgo M, Zheng Z, Oppenheimer DG, Soltis DE, Soltis PS. 2006. Genetic footprints of stamen ancestors guide perianth evolution in Persea (Lauraceae). Int J Plant Sci 167:1075–1089. Chanderbali AS, Albert VA, Leebens-Mack J, Altman NS, Soltis DE, Soltis PS. 2009. Transcriptional signatures of ancient floral developmental genetics in avocado (Persea americana; Lauraceae). Proc Natl Acad Sci USA 106:8929–8934. Chapman BA, Bowers JE, Schulze SR, Paterson AH. 2004. A comparative phylogenetic approach for dating whole genome duplication events. Bioinformatics 20:180–185. Chen ZJ. 2007. Genetic and epigenetic mechanisms for gene expression and phenotypic variation in plant polyploids. Annu Rev Plant Biol 58:377–406. Chen ZJ, Ni Z. 2006. Mechanisms of genomic rearrangements and gene expression changes in plant polyploids. Bioessays 28:240–252. Chen ZJ, Pikaard CS. 1997. Epigenetic silencing of RNA polymerase I transcription: a role for DNA methylation and histone modification in nucleolar dominance. Genes Dev 11:2124–2136. Chen ZJ, Tian L. 2007. Roles of dynamic and reversible histone acetylation in plant development and polyploidy. Biochim Biophys Acta 1769:295–307. Chen M, Ha M, Lackey E, Wang JL, Chen ZJ. 2008. RNAi of met1 reduces DNA methylation and induces genome-specific changes in gene expression and centromeric small RNA accumulation in Arabidopsis allopolyploids. Genetics 178:1845–1858. Clausen J, Keck DD, Hiesey WM. 1945. Experimental studies on the nature of species: II. Plant evolution through amphiploidy and autopolyploidy, with examples from the Madiinae. Washington, DC: Carnegie Institute of Washington. Coen ES, Meyerowitz EM. 1991. The war of the whorls: genetic interactions controlling flower development. Nature 353:31–37. Colombo L, Franken J, Koetje E, van Went J, Dons HJ, Angenent GC, van Tunen AJ. 1995. The petunia MADS box gene FBP11 determines ovule identity. Plant Cell 7:1859–1868. Comai L. 2000. 
Genetic and epigenetic interactions in allopolyploid plants. Plant Mol Biol 43:387–399. Comai L, Madlung A, Josefsson C, Tyagi A. 2003. Do the different parental “heteromes” cause genomic shock in newly formed allopolyploids? Philos Trans R Soc Lond B 358:1149–1155. Conrad B, Antonarakis SE. 2007. Gene duplication: a drive for phenotypic diversity and cause of human disease. Annu Rev Genom Hum Genet 8:17–35. Cui L, Wall PK, Leebens-Mack J, Lindsay BG, Soltis D, Doyle JJ, et al. 2006. Widespread genome duplications throughout the history of flowering plants. Genome Res 16:738–749. Darlington CD. 1937. Recent Advances in Cytology, 2nd ed. Philadelphia: P. Blakiston’s Son and Co. Davies B, Motte P, Keck E, Saedler H, Sommer H, Schwarz-Sommer Z. 1999. PLENA and FARINELLI: redundancy and regulatory interactions between two Antirrhinum MADS-box factors controlling flower development. EMBO J 18:4023–4034. DeBodt S, Maere S, van de Peer Y. 2005. Genome duplication and the origin of angiosperms. Trends Ecol Evol 20:591–597. de Martino G, Pan I, Emmanuel E, Levy A, Irish VF. 2006. Functional analyses of two tomato APETALA3 genes demonstrate diversification in their roles in regulating floral development. Plant Cell 18:1833–1845. Des Marais DL, Rausher MD. 2008. Escape from adaptive conflict after duplication in an anthocyanin pathway gene. Nature 454:762–765. Ditta G, Pinyopich A, Robles P, Pelaz S, Yanofsky MF. 2004. The SEP4 gene of Arabidopsis thaliana functions in floral organ and meristem identity. Curr Biol 14:935–940.


Dong Y, Liu Z, Shan X, Qiu T, He M, Liu B. 2005. Allopolyploidy in wheat induces rapid and heritable alterations in DNA methylation patterns of cellular genes and mobile elements. Russ J Genet 41:890–896. Drews GN, Bowman JL, Meyerowitz EM. 1991. Negative regulation of the Arabidopsis homeotic gene AGAMOUS by the APETALA2 product. Cell 65:991–1002. Durand D, Hoberman R. 2006. Diagnosing duplications: Can it be done? Trends Genet 22:156–164. Earley K, Lawrence RJ, Pontes O, Reuther R, Enciso AJ, Silva M, et al. 2006. Erasure of histone acetylation by Arabidopsis HDA6 mediates large-scale gene silencing in nucleolar dominance. Genes Dev 20:1283–1293. Eckhardt N. 2001. A sense of self: the role of DNA sequence elimination in allopolyploidization. Plant Cell 13:1699–1704. Eilam T, Anikster Y, Millet E, Manisterski J, Feldman M. 2009. Genome size in natural and synthetic autopolyploids and in a natural segmental allopolyploid of several Triticeae species. Genome 52:275–285. Eulenstein O. 1998. Vorhersage von Genduplikationen und deren Entwicklung in der Evolution. GMD Research Series, Vol. 20. Sankt Augustin, Germany. Favaro R, Pinyopich A, Battaglia R, Kooiker M, Borghi L, Ditta G, et al. 2003. MADS-box protein complexes control carpel and ovule development in Arabidopsis. Plant Cell 15:2603–2611. Fawcett JA, Maere S, van de Peer Y. 2009. Plants with double genomes might have had a better chance to survive the Cretaceous–Tertiary extinction event. Proc Natl Acad Sci USA 106:5737–5742. Fellows M, Hallett M, Stege U. 1998. On the multiple gene duplication problem. ISAAC’98, LNCS 1533:347–357. Ferrandiz C. 2002. Regulation of fruit dehiscence in Arabidopsis. J Exp Bot 53:2031–2038. Ferrandiz C, Gu Q, Martienssen R, Yanofsky MF. 2000. Redundant regulation of meristem identity and plant architecture by FRUITFULL, APETALA1 and CAULIFLOWER. Development 127:725–734. Flagel LE, Udall J, Nettleton D, Wendel J. 2008.
Duplicate gene expression in allopolyploid Gossypium reveals two temporally distinct phases of expression evolution. BMC Biol 6:16. Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J. 1999a. Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151:1531–1545. Force A, Lynch M, Postlethwait J. 1999b. Preservation of duplicate genes by subfunctionalization. Am Zool 39:460. Freeling M. 2009. Bias in plant gene content following different sorts of duplication: tandem, whole-genome segmental, or by transposition. Annu Rev Plant Biol 60:433–533. Freeling M, Thomas BC. 2006. Gene-balanced duplications, like tetraploidy, provide predictable drive to increase morphological complexity. Genome Res 16:805–814. Fulneček J, Matyášek R, Kovařík A. 2009. Faithful inheritance of cytosine methylation patterns in repeated sequences of the allotetraploid tobacco correlates with the expression of DNA methyltransferase gene families from both parental genomes. Mol Genet Genom 281:407–420. Furlong RF, Holland PW. 2002. Were vertebrates octoploid? Philos Trans R Soc Lond B 357:531–544. Gaeta RT, Pires JC, Iniguez-Luy F, Leon E, Osborn TC. 2007. Genomic changes in resynthesized Brassica napus and their effect on gene expression and phenotype. Plant Cell 19:3403–3417. Gates RR. 1909. The stature and chromosomes of Oenothera gigas De Vries. Arch Zellforsch 3:525–552.


Goldblatt P. 1980. Polyploidy in angiosperms: monocotyledons. In Lewis WH (ed.), Polyploidy: Biological Relevance. New York: Plenum Press, pp. 219–239. Gottlieb LD. 1974. Gene duplication and fixed heterozygosity for alcohol dehydrogenase in the diploid plant Clarkia franciscana. Proc Natl Acad Sci USA 71:1816–1818. Gottlieb LD. 1977. Evidence for duplication and divergence of the structural gene for phosphoglucoisomerase in diploid species of Clarkia. Genetics 86:289–307. Grant V. 1963. The Origin of Adaptations. New York: Columbia University Press. Grant V. 1981. Plant Speciation, 2nd ed. New York: Columbia University Press. Guigó R, Muchnik I, Smith TF. 1996. Reconstruction of ancient molecular phylogeny. Mol Phylogenet Evol 6:189–213. Guyot R, Keller B. 2004. Ancestral genome duplication in rice. Genome 47:610–614. Ha M, Kim E-D, Chen ZJ. 2009. Duplicate genes increase expression diversity in closely related species and allopolyploids. Proc Natl Acad Sci USA 106:2295–2300. Hahn M. 2007. Bias in phylogenetic tree reconciliation methods: implications for vertebrate genome evolution. Genome Biol 8:R141. Hedrick PW. 1987. Genetic load and the mating system in homosporous ferns. Evolution 41:1282–1289. Hegarty M, Jones J, Wilson I, Barker G, Sanchez-Baracaldo P, et al. 2005. Development of anonymous cDNA microarrays to study changes to the Senecio floral transcriptome during hybrid speciation. Mol Ecol 14:2493–2510. Hegarty M, Barker G, Wilson I, Abbott RJ, Edwards KJ, Hiscock SJ. 2006. Transcriptome shock after interspecific hybridization in Senecio is ameliorated by genome duplication. Curr Biol 16:1652–1659. Hegarty MJ, Hiscock SJ. 2008. Genomic clues to the evolutionary success of polyploid plants. Curr Biol 18:R435–R444. Hegarty MJ, Barker GL, Brennan AC, Edwards KJ, Abbott RJ, Hiscock SJ. 2008. Changes to gene expression associated with hybrid speciation in plants: further insights from transcriptomic studies in Senecio.
Philos Trans R Soc Lond B 363:3055–3069. Howarth DG, Donoghue MJ. 2006. Phylogenetic analysis of the “ECE” (CYC/TB1) clade reveals duplications predating the core eudicots. Proc Natl Acad Sci USA 103:9101–9106. Huijser P, Klein J, Lönnig WE, Meijer H, Saedler H, Sommer H. 1992. Bracteomania, an inflorescence anomaly, is caused by the loss of function of the MADS-box gene squamosa in Antirrhinum majus. EMBO J 11:1239–1249. Irish VF, Litt A. 2005. Flower development and evolution: gene duplication, diversification and redeployment. Curr Opin Genet Dev 15:454–460. Jaillon O, Aury JM, Noel B, Policriti A, Clepet C, Casagrande A, et al. 2007. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449:463–467. Kashkush K, Feldman M, Levy AA. 2002. Gene loss, silencing and activation in a newly synthesized wheat allotetraploid. Genetics 160:1651–1659. Kashkush K, Feldman M, Levy AA. 2003. Transcriptional activation of retrotransposons alters the expression of adjacent genes in wheat. Nat Genet 33:102–106. Kellis M, Birren BW, Lander ES. 2004. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 428:617–624. Kempin S, Savidge B, Yanofsky MF. 1995. Molecular basis of the cauliflower phenotype in Arabidopsis. Science 267:522–525. Keyte AL, Percifield R, Liu B, Wendel JF. 2006. Intraspecific DNA methylation polymorphism in cotton (Gossypium hirsutum L.). J Hered 9:444–450.


Kim S, Soltis DE, Albert V, Yoo MJ, Farris JS, Soltis PS, Soltis DE. 2004. Phylogeny and diversification of B-function MADS-box genes in angiosperms: evolutionary and functional implications of a 260-million-year-old duplication. Am J Bot 91:2102–2118. Kim S, Koh J, Yoo MJ, Kong H, Hu Y, Ma H, et al. 2005. Expression of floral MADS-box genes in basal angiosperms: implications for the evolution of floral regulators. Plant J 43:724–744. Kong H, Leebens-Mack J, Ni W, dePamphilis CW, Ma H. 2004. Highly heterogeneous rates of evolution in the SKP1 gene family in plants and animals: functional and evolutionary implications. Mol Biol Evol 21:117–128. Kovařík A, Matyášek R, Lim KY, Skalicka K, Koukalova B, Knapp S, et al. 2004. Concerted evolution of 18-5.8-26S rDNA repeats in Nicotiana allotetraploids. Biol J Linn Soc 82:615–625. Kovařík A, Pires JC, Leitch AR, Lim KY, Sherwood AM, Matyášek R, et al. 2005. Rapid concerted evolution of nuclear ribosomal DNA in two Tragopogon allopolyploids of recent and recurrent origin. Genetics 169:931–944. Kowalski S, Lan T-H, Feldmann K, Paterson A. 1994. Comparative mapping of Arabidopsis thaliana and Brassica oleracea chromosomes reveals islands of conserved gene order. Genetics 138:499–510. Kramer EM, Dorit RL, Irish VF. 1998. Molecular evolution of petal and stamen development: gene duplication and divergence within the APETALA3 and PISTILLATA MADS-box gene lineages. Genetics 149:765–783. Kramer EM, Di Stilio VS, Schlüter PM. 2003. Complex patterns of gene duplication in the APETALA3 and PISTILLATA lineages of the Ranunculaceae. Int J Plant Sci 164:1–11. Kramer EM, Jaramillo MA, Di Stilio VS. 2004. Patterns of gene duplication and functional evolution during the diversification of the AGAMOUS subfamily of MADS box genes in angiosperms. Genetics 166:1011–1023. Kramer EM, Holappa L, Gould B, Jaramillo MA, Setnikov D, Santiago PM. 2007.
Elaboration of B gene function to include the identity of novel floral organs in the lower eudicot Aquilegia. Plant Cell 19:750–766. Kuwada Y. 1911. Meiosis in the pollen mother cells of Zea mays L. Bot Mag 25:163–181. Lagercrantz U, Lydiate DJ. 1996. Comparative genome mapping in Brassica. Genetics 144:1903–1910. Lan TH, DelMonte TA, Reischmann KP, Hyman J, Kowalski SP, McFerson J, Kresovich S, Paterson AH. 2000. An EST-enriched comparative map of Brassica oleracea and Arabidopsis thaliana. Genome Res 10:776–788. Langkjaer RB, Cliften PF, Johnston M, Piskur J. 2003. Yeast genome duplication was followed by asynchronous differentiation of duplicated genes. Nature 421:848–852. Lawrence RJ, Earley K, Pontes O, Silva M, Chen ZJ, Neves N, et al. 2004. A concerted DNA methylation/histone methylation switch regulates rRNA gene dosage control and nucleolar dominance. Mol Cell 13:599–609. Lee HS, Chen ZJ. 2001. Protein-coding genes are epigenetically regulated in Arabidopsis polyploids. Proc Natl Acad Sci USA 98:6753–6758. Leebens-Mack JH, Wall K, Duarte J, Zheng Z, Oppenheimer D, dePamphilis C. 2006. A genomics approach to the study of ancient polyploidy and floral developmental genetics. Adv Bot Res 44:528–549. Leitch IJ, Bennett MD. 2004. Genome downsizing in polyploid plants. Biol J Linn Soc 82:651–663. Levin DA. 1983. Polyploidy and novelty in flowering plants. Am Nat 122:1–25. Lewis WH. 1980a. Polyploidy in species populations. In Lewis WH (ed.), Polyploidy: Biological Relevance. New York: Plenum Press, pp. 103–144.


Lewis WH. 1980b. Polyploidy in angiosperms: dicotyledons. In Lewis WH (ed.), Polyploidy: Biological Relevance. New York: Plenum Press, pp. 241–268. Lewis H, Lewis ME. 1955. The genus Clarkia. Univ Calif Publ Bot 20:241–392. Liljegren SJ, Ditta GS, Eshed Y, Savidge B, Bowman JL, Yanofsky MF. 2000. SHATTERPROOF MADS-box genes control seed dispersal in Arabidopsis. Nature 404:766–770. Litt A. 2007. An evaluation of A-function: evidence from the APETALA1 and APETALA2 gene lineages. Int J Plant Sci 168:73–91. Litt A, Irish VF. 2003. Duplication and diversification in the APETALA1/FRUITFULL floral homeotic gene lineage: implications for the evolution of floral development. Genetics 165:821–833. Liu Z, Adams KL. 2007. Expression partitioning between genes duplicated by polyploidy under abiotic stress and during organ development. Curr Biol 17:1669–1674. Liu B, Wendel JF. 2002. Non-mendelian phenomena in allopolyploid genome evolution. Curr Genom 3:489–505. Liu B, Wendel JF. 2003. Epigenetic phenomena and the evolution of plant allopolyploids. Mol Phylogenet Evol 29:365–379. Liu B, Vega JM, Feldman M. 1998a. Rapid genomic changes in newly synthesized amphiploids of Triticum and Aegilops: II. Changes in low-copy coding DNA sequences. Genome 41:535–542. Liu B, Vega JM, Segal G, Abbo S, Rodova M, Feldman M. 1998b. Rapid genomic changes in newly synthesized amphiploids of Triticum and Aegilops: I. Changes in low-copy noncoding DNA sequences. Genome 41:272–277. Lonnig WE, Saedler H. 2002. Chromosome rearrangements and transposable elements. Annu Rev Genet 36:389–410. Löve A, Löve D. 1949. The geobotanical significance of polyploidy: I. Polyploidy and latitude. Portugaliae Acta Biol Ser A: 273–352. Lowman AC, Purugganan MD. 1999. Duplication of the Brassica oleracea APETALA1 floral homeotic gene and the evolution of domesticated cauliflower. Genetics 90:514–520. Lukens LN, Quijada PA, Udall J, Pires JC, Schranz ME, Osborn TC. 2004.
Genome redundancy and plasticity within ancient and recent Brassica crop species. Biol J Linn Soc 82:665–674. Lukens LN, Pires JC, Leon E, Vogelzang R, Oslach L, Osborn T. 2006. Patterns of sequence loss and cytosine methylation within a population of newly resynthesized Brassica napus allopolyploids. Plant Physiol 140:336–348. Lutz AM. 1907. A preliminary note on the chromosomes of Oenothera lamarckiana and one of its mutants, O. gigas. Science 26:151–152. Lynch M. 2007. The Origins of Genome Architecture. Sunderland, MA: Sinauer Associates. Lynch M, Conery JS. 2000. The evolutionary fate and consequences of duplicate genes. Science 290:1151–1155. Lynch M, Force A. 2000. The probability of duplicate gene preservation by subfunctionalization. Genetics 154:459–473. Madlung A, Masuelli RW, Watson B, Reynolds SH, Davison J, Comai L. 2002. Remodeling of DNA methylation and phenotypic and transcriptional changes in synthetic Arabidopsis allotetraploids. Plant Physiol 129:733–746. Madlung A, Tyagi AP, Watson B, Jiang HM, Kagochi T, Doerge RW, et al. 2005. Genomic changes in synthetic Arabidopsis polyploids. Plant J 41:221–230. Maere S, De Bodt S, Raes J, Casneuf T, Van Montagu M, Kuiper M, et al. 2005. Modeling gene and genome duplications in eukaryotes. Proc Natl Acad Sci USA 102:5454–5459. Mandel MA, Yanofsky MF. 1995. The Arabidopsis AGL8 MADS box gene is expressed in inflorescence meristems and is negatively regulated by APETALA1. Plant Cell 7:1763–1771.


Martinez-Castilla LP, Alvarez-Buylla ER. 2003. Adaptive evolution in the Arabidopsis MADS-box gene family inferred from its complete resolved phylogeny. Proc Natl Acad Sci USA 100:13407–13412. Masterson J. 1994. Stomatal size in fossil plants: evidence for polyploidy in majority of angiosperms. Science 264:421–423. Matyášek R, Lim KY, Kovařík A, Leitch AR. 2003. Ribosomal DNA evolution and gene conversion in Nicotiana rustica. Heredity 91:268–275. Matyášek R, Tate JA, Lim YK, Srubarova H, Koh J, Leitch AR, et al. 2007. Concerted evolution of rDNA in recently formed Tragopogon allotetraploids is typically associated with an inverse correlation between gene copy number and expression. Genetics 176:2509–2519. McClintock B. 1984. The significance of responses of the genome to challenge. Science 226:792–801. Mena M, Ambrose BA, Meeley RB, Briggs SP, Yanofsky MF, et al. 1996. Diversification of C-function activity in maize flower development. Science 274:1537–1540. Ming R, Hou S, Feng Y, Yu Q, Dionne-Laporte A, Saw JH, et al. 2008. The draft genome of the transgenic tropical fruit tree papaya (Carica papaya L.). Nature 452:991–996. Müntzing A. 1936. The evolutionary significance of autopolyploidy. Hereditas 21:263–378. Nam J, dePamphilis CW, Ma H, Nei M. 2003. Antiquity and evolution of the MADS-box gene family controlling flower development in plants. Mol Biol Evol 20:1435–1447. Ni Z, Kim E-D, Ha M, Lackey E, Liu J, Zhang Y, et al. 2009. Altered circadian rhythms regulate growth vigour in hybrids and allopolyploids. Nature 457:327–331. Nowak MA, Boerlijst MC, Cooke J, Smith JM. 1997. Evolution of genetic redundancy. Nature 388:167–171. Ohno S. 1970. Evolution by Gene Duplication. New York: Springer-Verlag. Ohta T. 1988. Time for acquiring a new gene by duplication. Proc Natl Acad Sci USA 85:3509–3512. Osborn TC, Pires JC, Birchler JA, Auger DL, Chen JZ, Lee HS, et al. 2003. Understanding mechanisms of novel gene expression in polyploids. Trends Genet 19:141–147.
Ozkan H, Levy AA, Feldman M. 2002. Rapid differentiation of homeologous chromosomes in newly-formed allopolyploid wheat. Isr J Plant Sci 50:S65–S76. Page RDM, Cotton JA. 2002. Vertebrate phylogenomics: reconciled trees and gene duplications. Pacific Symposium on Biocomputing, pp. 536–547. Pannell JR, Obbard DJ, Buggs RJA. 2004. Polyploidy and the sexual system: What can we learn from Mercurialis annua? Biol J Linn Soc 82:547–560. Parenicova L, de Folter S, Kieffer M, Horner DS, Favalli C, Busscher J, et al. 2003. Molecular and phylogenetic analyses of the complete MADS-box transcription factor family in Arabidopsis: new openings to the MADS world. Plant Cell 15:1538–1551. Paterson AH, Bowers JE, Chapman BA. 2004. Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proc Natl Acad Sci USA 101:9903–9908. Paterson AH, Chapman BA, Kissinger JC, Bowers JE, Feltus FA, Estill JC. 2006. Many gene and domain families have convergent fates following independent whole-genome duplication events in Arabidopsis, Oryza, Saccharomyces and Tetraodon. Trends Genet 22:597–602. Pelaz S, Ditta GS, Baumann E, Wisman E, Yanofsky MF. 2000. B and C floral organ identity functions require SEPALLATA MADS-box genes. Nature 405:200–203. Petit M, Lim KY, Julio E, Poncet C, Dorlhac de Borne F, Kovarik A, et al. 2007. Differential impact of retrotransposon populations on the genome of allotetraploid tobacco (Nicotiana tabacum). Mol Genet Genom 278:1–15.


Pfeil BE, Schlueter JA, Shoemaker RC, Doyle JJ. 2005. Placing paleopolyploidy in relation to taxon divergence: a phylogenetic analysis in legumes using 39 gene families. Syst Biol 54:441–454. Pikaard CS. 1999. Nucleolar dominance and silencing of transcription. Trends Plant Sci 4:478–483. Pinyopich A, Ditta GS, Savidge B, Liljegren SJ, Baumann E, Wisman E, Yanofsky MF. 2003. Assessing the redundancy of MADS-box genes during carpel and ovule development. Nature 424:85–88. Preuss SB, Costa-Nunes P, Tucker S, Pontes O, Lawrence RJ, Mosher R, et al. 2008. Multimegabase silencing in nucleolar dominance involves siRNA-directed DNA methylation and specific methylcytosine-binding proteins. Mol Cell 32:673–684. Quiros CF, Grellet F, Sadowski J, Suzuki T, Li G, Wroblewski T. 2001. Arabidopsis and Brassica comparative genomics: sequence, structure and gene content in the ABI1-Rps2-Ck1 chromosomal segment and related regions. Genetics 157:1321–1330. Rapp RA, Wendel JF. 2005. Epigenetics and plant evolution. New Phytol 168:81–91. Rasmussen DA, Kramer EM, Zimmer EA. 2009. One size fits all? Molecular evidence for a commonly inherited petal identity program in Ranunculales. Am J Bot 96:96–109. Rensing SA, Ick J, Fawcett JA, Lang D, Zimmer A, Van de Peer Y, Reski R. 2007. An ancient genome duplication contributed to the abundance of metabolic genes in the moss Physcomitrella patens. BMC Evol Biol 7:130. Rijpkema AS, Royaert S, Zethof J, van der Weerden G, Gerats T, Vandenbussche M. 2006. Analysis of the Petunia TM6 MADS box gene reveals functional divergence within the DEF/AP3 lineage. Plant Cell 18:1819–1832. Rodin SN, Riggs AD. 2003. Epigenetic silencing may aid evolution by gene duplication. J Mol Evol 56:718–729. Roose ML, Gottlieb LD. 1976. Genetic and biochemical consequences of polyploidy in Tragopogon. Evolution 30:818–830. Rounsley SD, Ditta GS, Yanofsky MF. 1995. Diverse roles for MADS box genes in Arabidopsis development. Plant Cell 7:1259–1269. Sanderson MJ. 2002.
Estimating absolute rates of molecular evolution and divergence times: a penalized likelihood approach. Mol Biol Evol 19:101–109. Schlueter JA, Dixon P, Granger C, Grant D, Clark L, Doyle JJ, Shoemaker RC. 2004. Mining EST databases to resolve evolutionary events in major crop species. Genome 47:868–876. Schranz EM, Mitchell-Olds T. 2006. Independent ancient polyploidy events in the sister families Brassicaceae and Cleomaceae. Plant Cell 18:1152–1165. Semon M, Wolfe KH. 2007. Consequences of genome duplication. Curr Opin Genet Dev 17:505–512. Shaked H, Kashkush K, Ozkan H, Feldman M, Levy AA. 2001. Sequence elimination and cytosine methylation are rapid and reproducible responses of the genome to wide hybridization and allopolyploidy in wheat. Plant Cell 13:1749–1759. Shan X, Liu Z, Dong Z, Wang Y, Chen Y, Lin X, Long L, Han F, Dong Y, Liu B. 2005. Mobilization of the active MITE transposons mPing and Pong in rice by introgression from wild rice (Zizania latifolia Griseb.). Mol Biol Evol 22:976–990. Shiu S-H, Byrnes JK, Pan R, Zhang P, Li W-H. 2006. Role of positive selection in the retention of duplicate genes in mammalian genomes. Proc Natl Acad Sci USA 103:2232–2236. Shoemaker RC, Polzin K, Labate J, Specht J, Brummer EC, Olson T, et al. 1996. Genome duplication in soybean (Glycine subgenus soja). Genetics 144:329–338. Simillion C, Vandepoele K, Van Montagu MC, Zabeau M, Van de Peer Y. 2002. The hidden duplication past of Arabidopsis thaliana. Proc Natl Acad Sci USA 99:13627–13632.


Smyth DR, Bowman JL, Meyerowitz EM. 1990. Early flower development in Arabidopsis. Plant Cell 2:755–767. Soltis DE, Rieseberg LH. 1986. Autopolyploidy in Tolmiea menziesii (Saxifragaceae): genetic insights from enzyme electrophoresis. Am J Bot 73:310–318. Soltis DE, Soltis PS. 1989. Genetic consequences of autopolyploidy in Tolmiea (Saxifragaceae). Evolution 43:586–594. Soltis PS, Soltis DE. 1990. Evolution of inbreeding and outcrossing in ferns and fern-allies. Plant Species Biol 5:1–12. Soltis DE, Soltis PS. 1993. Molecular data and the dynamic nature of polyploidy. Crit Rev Plant Sci 12:243–273. Soltis DE, Soltis PS. 1999. Polyploidy: recurrent formation and genome evolution. Trends Ecol Evol 14:348–352. Soltis PS, Soltis DE. 2000. The role of genetic and genomic changes in the success of polyploids. Proc Natl Acad Sci USA 97:7051–7057. Soltis PS, Soltis DE, Gottlieb LD. 1987. Phosphoglucomutase gene duplications and their phylogenetic implications in Clarkia (Onagraceae). Evolution 41:667–671. Soltis DE, Soltis PS, Tate J. 2003. Advances in the study of polyploidy since Plant Speciation. New Phytol 161:173–191. Soltis DE, Soltis PS, Endress P, Chase MW. 2005. Phylogeny and Evolution of Angiosperms. Sunderland, MA: Sinauer Associates. Soltis PS, Soltis DE, Kim S, Chanderbali A, Buzgo M. 2006. Expression of floral regulators in basal angiosperms and the origin and evolution of the ABC model. Adv Bot Res 44:483–506. Soltis DE, Chanderbali AS, Kim S, Buzgo M, Soltis PS. 2007. The ABC model and its applicability to basal angiosperms. Ann Bot 100:155–163. Soltis DE, Albert VA, Leebens-Mack J, Palmer J, Wing R, dePamphilis C, et al. 2008. The Amborella genome initiative: a genome for understanding the evolution of angiosperms. Genome Biol 9:402. Soltis DE, Albert VA, Leebens-Mack J, Bell CD, Paterson A, Zheng C, et al. 2009a. Polyploidy and angiosperm diversification. Am J Bot 96:336–348. Soltis PS, Brockington SF, Yoo M-J, Piedrahita A, Latvis M, Moore MJ, et al. 
2009b. Floral variation and floral genetics in basal angiosperms. Am J Bot 96:110–128. Song K, Lu P, Tang K, Osborn TC. 1995. Rapid genome change in synthetic polyploids of Brassica and its implications for polyploid evolution. Proc Natl Acad Sci USA 92:7719–7723. Stebbins GL. 1940. The significance of polyploidy in plant evolution. Am Nat 74:54–66. Stebbins GL. 1947. Types of polyploids: their classification and significance. Adv Genet 1:403–429. Stebbins GL. 1950. Variation and Evolution in Plants. New York: Columbia University Press. Stellari G, Jaramillo MA, Kramer E. 2004. Evolution of the APETALA3 and PISTILLATA lineages of MADS-box-containing genes in the basal angiosperms. Mol Biol Evol 21:506–519. Sterck L, Rombauts S, Jansson S, Sterky F, Rouzé P, Van de Peer Y. 2005. EST data suggest that poplar is an ancient polyploid. New Phytol 167:165–170. Tate JA, Ni ZF, Scheen AC, Koh J, Gilbert CA, Lefkowitz D, et al. 2006. Evolution and expression of homeologous loci in Tragopogon miscellus (Asteraceae), a recent and reciprocally formed allopolyploid. Genetics 173:1599–1611. Taylor SA, Hofer JMI, Murfet IC, Sollinger JD, Singer SR, Knox MR, Ellis THN. 2002. PROLIFERATING INFLORESCENCE MERISTEM, a MADS-box gene that regulates floral meristem identity in pea. Plant Physiol 129:1150–1159.


Terashima A, Takumi S. 2009. Allopolyploidization reduces alternative splicing efficiency for transcripts of the wheat DREB2 homolog, WDREB2. Genome 52:100–105. Theissen G. 2001. Development of floral organ identity: stories from the MADS house. Curr Opin Plant Biol 4:75–85. Theissen G, Becker A, Di Rosa A, Kanno A, Kim JT, Munster T, Winter KU, Saedler H. 2000. A short history of MADS-box genes in plants. Plant Mol Biol 42:115–149. Thomas BC, Pedersen B, Freeling M. 2006. Following tetraploidy in an Arabidopsis ancestor, genes were removed preferentially from one homeolog leaving clusters enriched in dose-sensitive genes. Genome Res 16:934–946. Tuskan GA, DiFazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, et al. 2006. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313:1596–1604. Udall JA, Wendel JF. 2006. Polyploidy and crop improvement. Crop Sci 46:S3–S14. Udall JA, Quijada PA, Osborn TC. 2005. Detection of chromosomal rearrangements derived from homeologous recombination in four mapping populations of Brassica napus L. Genetics 169:967–979. Uimari A, Kotilainen M, Elomaa P, Yu D, Albert VA, et al. 2004. Integration of reproductive meristem fates by a SEPALLATA-like MADS-box gene. Proc Natl Acad Sci USA 101:15817–15822. Ungerer MC, Strakosh SC, Zhen Y. 2006. Genome expansion in three hybrid sunflower species is associated with retrotransposon proliferation. Curr Biol 16:R872–R873. Van de Peer Y, Maere S, Meyer A. 2009. The evolutionary significance of ancient genome duplications. Nat Rev Genet 10:725–732. Vandepoele K, Simillion C, Van de Peer Y. 2003. Evidence that rice and other cereals are ancient aneuploids. Plant Cell 15:2192–2202. Velasco R, Zharkikh A, Troggio M, Cartwright DA, Cestaro A, Preuss D, et al. 2007. A high quality draft consensus sequence of the genome of a heterozygous grapevine variety. PLoS ONE 2:e1326. Vision TJ, Brown DG, Tanksley SD. 2000. The origins of genomic duplications in Arabidopsis. 
Science 290:2114–2117. Vrebalov J, Ruezinsky D, Padmanabhan V, White R, Medrano D, Drake R, Schuch W, Giovannoni J. 2002. MADS-box gene necessary for fruit ripening at the tomato ripening-inhibitor (rin) locus. Science 296:343–346. Walsh JB. 1995. How often do duplicated genes evolve new functions? Genetics 139:421–428. Wang X, Shi X, Hao B, Ge S, Luo J. 2005. Duplication and DNA segmental loss in the rice genome: implications for diploidization. New Phytol 165:937–946. Wang J, Tian L, Lee H-S, Wei NE, Jiang H, Watson B, et al. 2006a. Genomewide nonadditive gene regulation in Arabidopsis allotetraploids. Genetics 172:507–517. Wang JL, Tian L, Lee HS, Chen ZJ. 2006b. Nonadditive regulation of FRI and FLC loci mediates flowering-time variation in Arabidopsis allopolyploids. Genetics 173:965–974. Weil C, Martienssen R. 2008. Epigenetic interactions between transposons and genes: lessons from plants. Curr Opin Genet Dev 18:188–192. Wendel JF. 2000. Genome evolution in polyploids. Plant Mol Biol 42:225–249. Whipple CJ, Ciceri P, Padilla CM, Ambrose BA, Bandong SL, Schmidt RJ. 2004. Conservation of B-class floral homeotic gene function between maize and Arabidopsis. Development 131:6083–6091. Whipple CJ, Zanis MJ, Kellogg EA, Schmidt RJ. 2007. Conservation of B class gene expression in the second whorl of a basal grass and outgroups links the origin of lodicules and petals. Proc Natl Acad Sci USA 104:1081–1086.


Winge O. 1917. The chromosomes: their number and general importance. C R Trav Lab Carlsberg 13:131–275. Winter KU, Weiser C, Kaufmann K, Bohne A, Kirchner C, Kanno A, Saedler H, Theissen G. 2002. Evolution of class B floral homeotic proteins: obligate heterodimerization originated from homodimerization. Mol Biol Evol 19:587–596. Yu J, Wang J, Lin W, Li S, Li H, et al. 2005. The genomes of Oryza sativa: a history of duplication. PLoS Biol 3:266–281. Zahn LM, Kong H, Leebens-Mack JH, Kim S, Soltis PS, Landherr LL, et al. 2005. The evolution of the SEPALLATA subfamily of MADS-box genes: a pre-angiosperm origin with multiple duplications throughout angiosperm history. Genetics 169:2209–2223. Zahn LM, Leebens-Mack JH, Arrington JM, Hu Y, Landherr LL, dePamphilis CW, et al. 2006. Conservation and divergence in the AGAMOUS subfamily of MADS-box genes: evidence of independent sub- and neofunctionalization events. Evol Dev 8:30–45.

16 Whole Genome Duplications and the Radiation of Vertebrates

SHIGEHIRO KURAKU
Evolutionary Biology and Zoology, Department of Biology, University of Konstanz, Konstanz, Germany

AXEL MEYER
Evolutionary Biology and Zoology, Department of Biology, University of Konstanz, Konstanz, Germany; Center for Advanced Study, Berlin, Germany

1 INTRODUCTION

Almost 40 years ago, Susumu Ohno (1970) famously wrote: “Duplication created where selection merely modified.” Ohno made this statement, which most researchers still consider rather heretical, virtually in the absence of empirical data, at least by the standards of the genomic age in which we now work (see Meyer and Van de Peer, 2003). However, during the last four decades, and particularly the last 10 years, both the profusion and the importance of all kinds of duplications in the genome have become more widely recognized. Duplications as construction principles of evolution have gained acceptance even outside the field of genomics. Researchers in the field of “evo-devo” increasingly see duplication as an important mechanism that permits organisms to experiment with the evolution of novel gene functions without having to do so slowly or having to lose the original function of a gene (copy) altogether. Gene and genome duplications might increase the potential of evolutionary lineages to produce diverse phenotypes (Ohno, 1970).

One could classify genomic duplications by their size or by the mechanism that produced them. The first category would be duplications of individual nucleotides, followed by duplications of small numbers of base pairs, such as the dinucleotide motifs of microsatellites (e.g., CA repeats). A further category could be duplications of small sets of functionally contiguous nucleotides, such as enhancers and promoters. This category might be followed by exon duplications and entire gene duplications, which can occur through tandem duplication or retroposition. Duplications of chromosomal regions larger than a single gene might include chromosome arms or even entire chromosomes. The largest, and presumably rarest, form of duplication would be duplication of the whole genome.
Whole genome duplications (WGDs) can apparently be recognized in the genomes of “higher” organisms that have been sequenced completely. Up to three WGDs (but, interestingly, never more than that) have usually been inferred so far in the genomes of higher animals, fungi, and plants. WGD might be a major force that not only changes the genome content dramatically but also creates a surplus of newly duplicated genes, which might in turn result in new genetic networks and, possibly, evolutionary phenotypic novelties [reviewed by Semon and Wolfe (2007a)].

Before whole genome sequences became available, whole genome duplications had been analyzed primarily for early vertebrate evolution, based on evidence from molecular phylogenetic analyses of biologically important gene families (e.g., Hox genes, genes involved in the adaptive immune system). Such analyses commonly reveal an increase in the number of genes in various gene families (Holland et al., 1994; Kasahara et al., 1996; Wittbrodt et al., 1998). More recently, analyses of larger sets of gene sequences, from both complete and incomplete genome sequences, showed that similarly arranged sets of genes (synteny) are located on different chromosomes within a single genome (e.g., Pebusque et al., 1998) and that many chromosomal segments are similar within a genome (reviewed in Kasahara, 2007). Intragenome redundancy was later revealed for teleost fishes as well (reviewed in Meyer and Van de Peer, 2005).

In this chapter we briefly describe current knowledge regarding the large-scale gene duplications that occurred in the basal lineages of vertebrates. We focus on the teleost-specific genome duplication (TSGD), which is also called the third-round (3R) genome duplication because the basal lineage of vertebrates had already experienced two rounds (1R and 2R) of whole genome duplication. Throughout, we give specific attention to the 1R, 2R, and 3R WGDs in the basal vertebrate lineages.
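The intragenomic redundancy described above (the same gene families recurring on different chromosomes of one genome) can be illustrated with a minimal sketch. It is not the method of any study cited here: it ignores gene order and statistical significance, and the function names, the `min_shared` threshold, and the toy data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def shared_families(genes):
    """Map each chromosome pair to the gene families found on both.

    genes: iterable of (chromosome, gene_family) tuples.
    """
    families_by_chrom = defaultdict(set)
    for chrom, family in genes:
        families_by_chrom[chrom].add(family)

    shared = {}
    for a, b in combinations(sorted(families_by_chrom), 2):
        common = families_by_chrom[a] & families_by_chrom[b]
        if common:
            shared[(a, b)] = common
    return shared


def candidate_paralogons(genes, min_shared=3):
    """Keep chromosome pairs sharing at least min_shared gene families."""
    return {
        pair: families
        for pair, families in shared_families(genes).items()
        if len(families) >= min_shared
    }


# Hypothetical toy genome: chromosome names and family labels are invented.
toy_genes = [
    ("chr1", "famA"), ("chr1", "famB"), ("chr1", "famC"),
    ("chr2", "famA"), ("chr2", "famB"), ("chr2", "famC"),
    ("chr3", "famA"),
]

for pair, families in candidate_paralogons(toy_genes).items():
    print(pair, sorted(families))
```

In this toy case only the chr1/chr2 pair passes the threshold, because chr3 shares a single family with each of the others. A real analysis would also require conserved gene order and a test against chance co-occurrence.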

2 TELEOST-SPECIFIC GENOME DUPLICATION

2.1 Background

The idea of a teleost-specific whole genome duplication in the actinopterygian lineage was proposed much later than that of the 1R/2R WGDs. This is probably because molecular-level studies of tetrapod models (e.g., mouse, chicken, and Xenopus) go back further than those of actinopterygians. The identification of higher numbers of genes in some gene families in teleost fishes than in tetrapods was the first DNA-based evidence on this issue (Wittbrodt et al., 1998). Some fish lineages (e.g., salmonids and cyprinids) and amphibian lineages, including Xenopus laevis, have experienced additional, independent whole genome duplications more recently (see Gregory, 2005; Semon and Wolfe, 2008). The clustering of duplicated genes in specific genomic segments of modern fishes suggested a genome doubling that is not shared by tetrapods (e.g., Amores et al., 1998; reviewed by Meyer and Van de Peer, 2005). More recently, this fish-specific genome duplication has been firmly established through analyses of large-scale sequence data (Taylor et al., 2001a, 2003), including the draft genome sequences of the teleost models pufferfish and medaka (Aparicio et al., 2002; Jaillon et al., 2004; Kasahara et al., 2007).

The question of how far back in the evolution of fishes the 3R duplication occurred remained open because the initial comparative genomic analyses were based on the genomes of rather modern “model” teleost fishes. Studies of more basal lineages of Actinopterygii revealed that their divergences from the stem lineage of fishes preceded this genome duplication event (Crow et al., 2006; Hoegg et al., 2004). It is now thought that the TSGD occurred in the lineage leading to all extant teleost fishes after the separation of the more basal actinopterygian lineages: the Polypteriformes (bichirs), Acipenseriformes (sturgeons and paddlefish), Amiiformes (bowfin), and Semionotiformes (gars) (Hoegg et al., 2004) (Figure 1). The event was formerly called the fish-specific genome duplication (FSGD); given the more precise knowledge of its phylogenetic timing, it is now more correctly called the teleost-specific genome duplication (TSGD; Kuraku and Meyer, 2009). The latter term is also more accurate because the word fish is applied to a paraphyletic assemblage that includes cyclostomes, chondrichthyans, lungfishes, and coelacanths, none of which experienced this genomic event.

Figure 1 Phylogenetic and genomic properties of key lineages representing pre- and post-WGD conditions for the teleost-specific whole genome duplication. Phylogenetic relationships are based on Kikugawa et al. (2004) and Inoue et al. (2003). Divergence times are based on Azuma et al. (2008). Animal groups with a pre-WGD state are shown in white boxes, while those with a post-WGD state are shown in black boxes. Information regarding C-values and chromosome numbers was retrieved from the Animal Genome Size Database (www.genomesize.com; Gregory et al., 2007). Note that multiple entries for the same species in this genome size database are included in the graphs.

The phylogenetic timing of the TSGD has been estimated using two different approaches. First, in a gene family tree approach, in which the timing of gene duplications is estimated directly, an analysis of duplicated sets of paralogs in the Fugu genome derived from the TSGD suggested that it occurred 320 ± 67 million years ago (Mya) (Vandepoele et al., 2004). Second, the timing of the TSGD was bracketed using the absolute timing of the splits of nonteleost actinopterygians, and was inferred to have occurred approximately 380 to 300 Mya. This second approach has been applied to whole mitochondrial DNA (mtDNA) genome sequences (Azuma et al., 2008) as well as to a combination of mtDNA and nuclear protein-coding genes; the latter study produced a more recent estimate (316 to 226 Mya; Hurley et al., 2007).
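The three TSGD date estimates quoted above can be compared directly. The sketch below finds the window common to all of them; it rests on assumptions not made in the text: the ± 67 million years is treated as a plain range, each estimate is reduced to an (older, younger) pair in Mya, and the labels are shorthand for this illustration only.

```python
def common_window(intervals):
    """Return the (older, younger) window shared by all intervals, or None.

    Each interval is an (older_bound, younger_bound) pair in Mya.
    """
    older = min(o for o, _ in intervals)
    younger = max(y for _, y in intervals)
    return (older, younger) if older >= younger else None


tsgd_estimates = {
    # 320 +/- 67 Mya, treated here as a plain range (an assumption).
    "gene family trees": (320 + 67, 320 - 67),
    # Ranges quoted for the split-based (bracketing) approach.
    "bracketing splits, mtDNA": (380, 300),
    "bracketing splits, mtDNA + nuclear": (316, 226),
}

print(common_window(list(tsgd_estimates.values())))
```

Under these assumptions the estimates share the window 316 to 300 Mya, which suggests that they are compatible with one another despite their different midpoints.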


2.2 Evolution Before the TSGD

As already mentioned, the TSGD occurred after several ancient lineages had separated from the actinopterygian stem lineage that led to the teleosts. Phylogenetic relationships among these pre-TSGD lineages have been explored using mitochondrial genes, nuclear genes, and morphological characters (Inoue et al., 2003; Kikugawa et al., 2004; reviewed by Meyer and Zardoya, 2003). Both types of genes produced consistent results: they identified the Polypteriformes as the most basal actinopterygian lineage and found that bowfin and gars are more closely related to each other than to any other group. Holostei had previously been considered a monophyletic group on the basis of morphological observations (Nelson, 1969). However, the position of the Amia-gar group relative to the Acipenseriformes is still not resolved consistently (Inoue et al., 2003; Kikugawa et al., 2004) (see Figure 1). According to the Animal Genome Size Database (www.genomesize.com; Gregory et al., 2007), some members of the Polypteriformes have much larger genomes than most other actinopterygian fishes, except for some species of the Acipenseriformes. Many sturgeons have hugely increased genome sizes and chromosome numbers as a result of repeated lineage-specific polyploidization events (Gregory, 2005; Peng et al., 2007). It is interesting to note that all of these pre-TSGD actinopterygian lineages show low species diversity: only about 40 extant species belong to the four most basal actinopterygian lineages (Figure 1). The finding that the ancestors of some basal actinopterygian lineages did not experience the TSGD is valuable because it allows us to study their genomes as examples of a pre-TSGD (i.e., 2R) genomic condition. 
Although the amount of data is still limited, the pre-TSGD condition of gene repertoires, or of selected genomic segments harboring them, has been confirmed in several further cases (e.g., Chiu et al., 2004; Hoegg and Meyer, 2007). For example, for the ParaHox gene family, Amia calva, an extant member of one of the lineages that diverged immediately before the TSGD, seems to possess a gene organization (Gsx, Xlox, and Cdx, in the same transcriptional orientation) similar to that of human and amphioxus (Mulley et al., 2006). The ParaHox gene repertoire of teleost fishes, whose ancestor experienced the third vertebrate genome duplication, is different: the ancestral set of ParaHox gene clusters became dispersed across different genomic regions as a result of successive gene losses after the TSGD (Siegel et al., 2007).

2.3 Evolution After the TSGD

It has been suggested that the TSGD somehow permitted the remarkable diversification seen in extant teleost fishes. It will be interesting to explore the relevance of this genome doubling for morphological diversification by identifying and studying fish lineages that diverged from the stem lineage immediately after this event. The most interesting lineage in this regard is the Osteoglossomorpha (e.g., arowana, arapaima), the earliest post-TSGD lineage (Hoegg et al., 2004; Azuma et al., 2008). The time that elapsed between the TSGD and the split of these fishes from the stem lineage is thought to be less than 10 million years, a short time compared with the subsequent history of the post-TSGD fish lineages (approximately 300 million years). This estimate was based on evolutionary rates of Hox genes (Crow et al., 2006).


The Osteoglossomorpha (<200 species) and the Elopomorpha (e.g., eels, tenpounders; <750 species), the sister group to the Euteleostei, are rather low in species diversity compared with all other extant teleost fishes (<26,000 species) (Figure 1). The Osteoglossomorpha and Elopomorpha seem to have a rather limited range of genome sizes and chromosome numbers compared with the pre-TSGD fish lineages (except those in Polypteriformes and Acipenseriformes) and with other teleosts (except those whose genomes were polyploidized lineage-specifically after the TSGD, e.g., goldfish and salmon) (Figure 1). As Ohno noted, many teleost fishes have karyotypes with 48 chromosomes in the diploid genome (Ohno, 1970). Attempts to reconstruct the karyotype of the hypothetical teleost ancestor produced karyotypes similar to this widely shared format (Jaillon et al., 2007; Kasahara et al., 2007). The constancy of karyotype organization across diverse lineages of actinopterygian fishes (with some exceptions) may be the reason why no pioneering hypothesis of the TSGD arose solely from classical cytogenetic studies.

In fact, the oldest teleost crown groups are believed, on the basis of palaeontological data, to have diversified around 150 Mya (Patterson, 1993; Benton, 1997; Figure 2). Palaeontological data therefore imply an interval of about 150 to 170 million years between the TSGD and the onset of the major phenotypic diversification of teleost fishes. In addition, other, now extinct, lineages of fish initially diversified but subsequently lost that diversity again. These observations have been used as evidence against a direct causal relationship between the TSGD and the huge diversification of fishes. However, as has often been noted, palaeontological dates are always minimum age estimates for lineages, and molecular dating often yields much older divergence estimates. 
Recent molecular dating based on Hox genes (Crow et al., 2006) and mtDNA (Azuma et al., 2008) has also suggested that the Osteoglossomorpha and Elopomorpha diverged immediately after the TSGD, which calls into question the palaeontological inference of a long interval between the TSGD and phenotypic diversification. To date, analyses of the genomic consequences of the TSGD have concentrated on large-scale sequence data from model systems. A molecular phylogenetic analysis including 10 fish model species showed that the most ancient divergence among those species is the split between the third earliest-branching lineage, the Otocephala (which includes zebrafish), and a group containing the other model fishes (pufferfishes, medaka, stickleback). These two lineages diverged approximately 290 to 270 Mya (Steinke et al., 2006b). This estimate is roughly consistent with that obtained from whole mitochondrial genomes (310 to 270 Mya; Azuma et al., 2008). It should be noted that comparisons among model fishes alone cannot cover the entire diversity of teleost fishes that arose after the TSGD. The future inclusion of the Osteoglossomorpha and the Elopomorpha would provide important additional information, as some of the species in these groups are model species. The inclusion of these species would also aid the identification of previously overlooked secondary, lineage-specific modifications of genomic features (Taylor et al., 2003; Steinke et al., 2006a; Loh et al., 2008; Postlethwait, 2007; Semon and Wolfe, 2007b,c).
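The interval of about 150 to 170 million years between the TSGD and the crown-group diversification, mentioned above, follows from simple subtraction. The sketch below assumes TSGD dates of roughly 300 to 320 Mya, taken from the estimates quoted earlier in this chapter; the choice of these two bounds is an assumption of this illustration, not a statement from the text.

```python
def lag_before_diversification(duplication_mya, diversification_mya):
    """Years (in millions) between a duplication and a later diversification."""
    return duplication_mya - diversification_mya


# Fossil-based date for the oldest teleost crown groups, as given in the text.
CROWN_DIVERSIFICATION_MYA = 150
# Assumed bounds for the TSGD, chosen for illustration only.
ASSUMED_TSGD_DATES_MYA = (300, 320)

lags = [
    lag_before_diversification(tsgd, CROWN_DIVERSIFICATION_MYA)
    for tsgd in ASSUMED_TSGD_DATES_MYA
]
print(lags)
```

Because fossil dates are minimum ages, an older true date for the crown groups would shorten this lag accordingly.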

Figure 2 Patterns and timing of the diversification of fishes. Diversity (x-axis) and timing (y-axis) of the diversification of the major actinopterygian fish lineages are shown based on fossil evidence. Acanthomorpha, which has an extremely large number of species, is shown in gray in the background. The time axis runs from 435 MY (Silurian) to 0 MY (Quaternary). MY, million years; Q, Quaternary. (Adapted from Patterson, 1993.)

2.4 Discussion

Reliable divergence time estimates for the different teleost lineages are key to testing for a causal relationship between the TSGD and the phenotypic diversification and speciation of teleost fishes. If a considerable interval elapsed after the TSGD before many species-rich lineages originated, this would tend to weaken the argument for a causal link between the genomic event and large-scale phenotypic events. Divergence times obtained from phylogenetic analyses of whole mitochondrial genomes await further confirmation with nuclear protein-coding genes. Since the early-branching lineages after the TSGD (Osteoglossomorpha and Elopomorpha) show relatively low species diversity, there may already be a case against a link between the TSGD and massive species diversification. Indeed, many biological strategies can, in concert with ecological opportunities, generate diverse phenotypes without apparent genomic events.

3 1R/2R GENOME DUPLICATION

3.1 Background

Ohno (1970) was the first to propose, on the basis of genome sizes and karyotypes, that two rounds of whole genome duplication occurred during the evolution of chordates (Figure 3). This hypothesis was later supported by the identification of multiple gene copies in model vertebrates (all of them jawed vertebrates, such as human, mouse, and chicken) that were orthologous to a single invertebrate gene (Lundin, 1993; Holland et al., 1994; Sidow, 1996). More recently, large-scale sequencing of particular genomic regions provided evidence that similar arrays of genes are found within a single genome (mostly in mammals), suggesting that this redundancy derives from a duplication event that probably involved the entire genome [Holland et al., 1994; Kasahara et al., 1996; Katsanis et al., 1996; Lundin, 1993; Pebusque et al., 1998; Sidow, 1996; reviewed by Kasahara (2007)]. Genomewide sequence analyses have since confirmed the hypothesis of two rounds of whole genome duplication, termed 1R and 2R (McLysaght et al., 2002; Dehal and Boore, 2005; Hufton et al., 2008; Putnam et al., 2008). At around the same time as the two WGD events, the jawless vertebrates, the Cyclostomata, diverged from the vertebrate stem lineage leading to the Gnathostomata (jawed vertebrates). 
The Cyclostomata is the only extant lineage of the Agnatha, the jawless vertebrates, and now consists of the Myxiniformes (hagfishes) and the Petromyzoniformes (lampreys) (Kuraku et al., 2009b). It was long thought that lampreys are more closely related to jawed vertebrates than to hagfishes (Janvier, 1996; see also Janvier, 2007), but most recent molecular phylogenetic studies have supported the monophyly of cyclostomes [summarized by Kuraku and Kuratani (2006)]. The origin of the agnathans has been debated vigorously in relation to the phylogenetic timing of the 1R and 2R WGDs (Escriva et al., 2002; Putnam et al., 2008). A recent study involving large-scale phylogenetic analyses, intensive taxon sampling, and cDNA sequencing suggested that the cyclostomes diverged after the two rounds of WGD; this hypothesis was termed the pan-vertebrate tetraploidization (PV4) hypothesis (Kuraku, 2008). The name indicates that all vertebrates are characterized by post-2R genomes and that subsequent genomic events, such as the TSGD, led to further modifications of the genomes of some vertebrate lineages, such as the teleosts.

3.2 Evolution Before the 2R

The Vertebrata are a morphologically derived monophyletic group within the phylum Chordata, which also includes the Cephalochordata (amphioxus) and the Urochordata (tunicates) (Brusca and Brusca, 1990; Figure 3). Traditionally, cephalochordates were thought to be the sister group of vertebrates (e.g., Wada and Satoh, 1994), but “phylogenomic” analyses have provided an updated phylogenetic hypothesis that identifies urochordates as the sister group of vertebrates and places cephalochordates at the base of the chordates (Abascal et al., 2005; Bourlat et al., 2006; Delsuc et al., 2006; Dunn et al., 2008; Putnam et al., 2008) (Figure 3). In terms of numbers of species, the chordate subphyla do not show pronounced diversity (Figure 3).

Figure 3 Phylogenetic and genomic properties of key lineages representing pre- and post-WGD conditions for the 1R/2R whole genome duplications. Phylogenetic relationships are based on Kikugawa et al. (2004), Kuraku and Kuratani (2006), and Putnam et al. (2008). Divergence times are based on related literature (Kuraku and Kuratani, 2006). Groups with a pre-WGD state are shown in white boxes, those with a post-WGD state in black boxes. It has not been shown clearly whether Myxiniformes and Petromyzoniformes diverged before or after the 1R/2R WGD. Information on C-values and chromosome numbers was retrieved from the Animal Genome Size Database. Note that multiple entries for the same species in this genome size database are included in the graphs.

As mentioned above, no consensus has been reached on the timing of the divergence between cyclostomes and gnathostomes relative to the 1R and 2R WGDs (Putnam et al., 2008; Kuraku et al., 2009a). In many cases, molecular phylogenetic trees containing cyclostome genes do not provide enough resolution, or sufficiently strong statistical support, for the relationships among the multiple paralogs found in the genomes of jawed vertebrates (for more details, see Kuraku, 2008). A recently proposed view places the divergence of cyclostomes after the 2R genome duplication (Kuraku et al., 2009a), whereas other studies suggest a divergence before the 2R (Furlong et al., 2007) or after the 1R but before the 2R duplication (Escriva et al., 2002; Putnam et al., 2008).

3.3 Evolution After the 2R

Apart from cyclostomes, the earliest lineage that clearly diverged after the 2R WGD is the Chondrichthyes (cartilaginous fishes: sharks, rays, skates, and chimaeras). Recently, the genome of the ghost shark Callorhinchus milii (elephant shark or elephantfish) has been sequenced at 1.4-fold coverage and deposited in the NCBI Genome Survey Sequences (GSS) Database (Venkatesh et al., 2007; see also Venkatesh et al., 2005). Its genome contains convincing evidence that the lineage to which it belongs experienced two rounds of genome duplication (Venkatesh et al., 2007). Other studies of this species provide further evidence that this lineage has highly conserved features in some genomic regions (Venkatesh et al., 2006; Yu et al., 2008). This conservation suggests that the common ancestor of chondrichthyans and osteichthyans had a genome shaped by two consecutive rounds of genome duplication.

3.4 Discussion

The tree topology expected from two rounds of duplication (i.e., from one to two, and then from two to four; [[A, B], [C, D]]) has proved difficult to reconstruct for many of the gene families surveyed (reviewed by Kasahara, 2007; Kuraku, 2008). However, it is now widely accepted that whole genome duplication is the most parsimonious explanation for the large-scale redundancy within a genome. Thus, most inconsistencies with this hypothesis (which generally are not statistically significant) are thought to be caused by the loss of phylogenetic signal over long evolutionary times (Panopoulou and Poustka, 2005; Kuraku, 2008). The recently proposed “post-2R cyclostome” hypothesis is based partly on the identification of a larger number of paralogs of RAR and Dlx genes in cyclostomes (Kuraku et al., 2009a) than had previously been reported (Neidert et al., 2001; Escriva et al., 2006). In the case of the TSGD, different lineages have experienced different modes of evolution after the duplication (e.g., in gene repertoire, evolutionary rate, and gene function). One misleading argument against the TSGD originated from a lack of understanding of these post-WGD variations (Robinson-Rechavi et al., 2001a,b; see also Taylor et al., 2001b). Knowledge of such variation should be taken into account in studies of the tempo and mode of the 1R and 2R WGDs, which are still disputed.
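To see how specific the expected [[A, B], [C, D]] pattern is, one can enumerate every rooted binary tree for four paralogs and count how many have the symmetric (two-plus-two) shape. The sketch below is a purely combinatorial illustration and not the method used in the studies cited; the function names are hypothetical, and trees are represented as nested frozensets so that the order of branches does not matter.

```python
from itertools import combinations


def rooted_trees(leaves):
    """Yield every rooted binary tree over the leaves as nested frozensets."""
    leaves = tuple(leaves)
    if len(leaves) == 1:
        yield leaves[0]
        return
    first, rest = leaves[0], leaves[1:]
    # Keep the first leaf on the left so each unordered split appears once.
    for k in range(len(rest)):
        for left_extra in combinations(rest, k):
            left = (first,) + left_extra
            right = tuple(x for x in rest if x not in left_extra)
            for left_tree in rooted_trees(left):
                for right_tree in rooted_trees(right):
                    yield frozenset([left_tree, right_tree])


def is_balanced(tree):
    """True for a four-leaf tree whose root splits the leaves two and two."""
    return all(isinstance(child, frozenset) for child in tree)


trees = list(rooted_trees("ABCD"))
balanced = [t for t in trees if is_balanced(t)]
print(len(trees), len(balanced))
```

Of the 15 possible rooted topologies, only 3 are symmetric; the other 12 are ladder-like. A symmetric tree is therefore informative when it is well supported, but with weak phylogenetic signal an asymmetric tree is the more likely outcome by chance alone.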

4 CONCLUDING REMARKS

Whether or not it is a coincidence, lineages that diverged immediately before or after WGDs usually contain no model species whose genome sequences have been analyzed in depth. Species in such groups tend not to be readily available, to be difficult to handle in experiments, or to be inconvenient to maintain in the laboratory compared with model organisms, whose phylogenetic positions are usually distant from WGD events. In this chapter we have provided an overview of karyotypes and genome sizes for the major osteichthyan and chordate lineages (Figures 1 and 3) and shown that these values can vary drastically even within lineages that have experienced an identical history of whole genome duplications. Since many of these lineages are hundreds of millions of years old, this observation is not all that surprising, as their genomes have undergone lineage-specific evolution. However, it does seem somewhat surprising that these lineages, which have experienced such varied evolutionary histories, do not differ more markedly in genomic features such as genome size and/or chromosome number. Why this is the case is still poorly understood.


We now have a robustly supported phylogenetic tree for the major vertebrate lineages. Using existing and emerging genomic sequence resources, the evolutionary history of their redundant genomes, ideally including those of key nonmodel animals, should be characterized in greater depth.

Acknowledgments

We thank Masaki Miya for his kind help in preparing Figure 1, and the German Science Foundation and the Wissenschaftskolleg Berlin for support of A.M.

REFERENCES

Abascal F, Zardoya R, Posada D. 2005. ProtTest: selection of best-fit models of protein evolution. Bioinformatics 21:2104–2105. Amores A, Force A, Yan YL, Joly L, Amemiya C, Fritz A, Ho RK, Langeland J, Prince V, Wang YL, et al. 1998. Zebrafish hox clusters and vertebrate genome evolution. Science 282:1711–1714. Aparicio S, Chapman J, Stupka E, Putnam N, Chia JM, Dehal P, Christoffels A, Rash S, Hoon S, Smit A, et al. 2002. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297:1301–1310. Azuma Y, Kumazawa Y, Miya M, Mabuchi K, Nishida M. 2008. Mitogenomic evaluation of the historical biogeography of cichlids toward reliable dating of teleostean divergences. BMC Evol Biol 8:215. Benton MJ. 1997. Vertebrate Palaeontology. New York: Chapman & Hall. Bourlat SJ, Juliusdottir T, Lowe CJ, Freeman R, Aronowicz J, Kirschner M, Lander ES, Thorndyke M, Nakano H, Kohn AB, et al. 2006. Deuterostome phylogeny reveals monophyletic chordates and the new phylum Xenoturbellida. Nature 444:85–88. Brusca RC, Brusca GJ. 1990. Invertebrates. Sunderland, MA: Sinauer Associates. Chiu CH, Dewar K, Wagner GP, Takahashi K, Ruddle F, Ledje C, Bartsch P, Scemama JL, Stellwag E, Fried C, et al. 2004. Bichir HoxA cluster sequence reveals surprising trends in ray-finned fish genomic evolution. Genome Res 14:11–17. Crow KD, Stadler PF, Lynch VJ, Amemiya C, Wagner GP. 2006. The “fish-specific” Hox cluster duplication is coincident with the origin of teleosts. Mol Biol Evol 23:121–136. Dehal P, Boore JL. 2005. 
Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biol 3:e314. Delsuc F, Brinkmann H, Chourrout D and Philippe H. 2006. Tunicates and not cephalochordates are the closest living relatives of vertebrates. Nature 439:965–968. Dunn CW, Hejnol A, Matus DQ, Pang K, Browne WE, Smith SA, Seaver E, Rouse GW, Obst M, Edgecombe GD, et al. 2008. Broad phylogenomic sampling improves resolution of the animal tree of life. Nature 452:745–749. Escriva H, Manzon L, Youson J, Laudet V. 2002. Analysis of lamprey and hagfish genes reveals a complex history of gene duplications during early vertebrate evolution. Mol Biol Evol 19:1440–1450. Escriva H, Bertrand S, Germain P, Robinson-Rechavi M, Umbhauer M, Cartry J, Duffraisse M, Holland L, Gronemeyer H, Laudet V. 2006. Neofunctionalization in vertebrates: the example of retinoic acid receptors. PLoS Genet 2:e102. Furlong RF, Younger R, Kasahara M, Reinhardt R, Thorndyke M, Holland PW. 2007. A degenerate ParaHox gene cluster in a degenerate vertebrate. Mol Biol Evol 24:2681–2686.


Gregory TR. 2005. The Evolution of the Genome. Burlington, MA: Elsevier Academic.
Gregory TR, Nicol JA, Tamm H, Kullman B, Kullman K, Leitch IJ, Murray BG, Kapraun DF, Greilhuber J, Bennett MD. 2007. Eukaryotic genome size databases. Nucleic Acids Res 35:D332–D338.
Hoegg S, Meyer A. 2007. Phylogenomic analyses of KCNA gene clusters in vertebrates: Why do gene clusters stay intact? BMC Evol Biol 7:139.
Hoegg S, Brinkmann H, Taylor JS, Meyer A. 2004. Phylogenetic timing of the fish-specific genome duplication correlates with the diversification of teleost fish. J Mol Evol 59:190–203.
Holland PW, Garcia-Fernàndez J, Williams NA, Sidow A. 1994. Gene duplications and the origins of vertebrate development. Dev Suppl:125–133.
Hufton AL, Groth D, Vingron M, Lehrach H, Poustka AJ, Panopoulou G. 2008. Early vertebrate whole genome duplications were predated by a period of intense genome rearrangement. Genome Res 18:1582.
Hurley IA, Mueller RL, Dunn KA, Schmidt EJ, Friedman M, Ho RK, Prince VE, Yang Z, Thomas MG, Coates MI. 2007. A new time-scale for ray-finned fish evolution. Proc Biol Sci 274:489–498.
Inoue JG, Miya M, Tsukamoto K, Nishida M. 2003. Basal actinopterygian relationships: a mitogenomic perspective on the phylogeny of the “ancient fish.” Mol Phylogenet Evol 26:110–120.
Jaillon O, Aury JM, Brunet F, Petit JL, Stange-Thomann N, Mauceli E, Bouneau L, Fischer C, Ozouf-Costaz C, Bernot A, et al. 2004. Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature 431:946–957.
Jaillon O, Aury JM, Noel B, Policriti A, Clepet C, Casagrande A, Choisne N, Aubourg S, Vitulo N, Jubin C, et al. 2007. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449:463–467.
Janvier P. 1996. The dawn of the vertebrates: characters versus common ascent in the rise of current vertebrate phylogenies. Paleontology 39:259–287.
Janvier P. 2007. Evolutionary biology: born-again hagfishes. Nature 446:622–623.
Kasahara M. 2007. The 2R hypothesis: an update. Curr Opin Immunol 19:547–552.
Kasahara M, Hayashi M, Tanaka K, Inoko H, Sugaya K, Ikemura T, Ishibashi T. 1996. Chromosomal localization of the proteasome Z subunit gene reveals an ancient chromosomal duplication involving the major histocompatibility complex. Proc Natl Acad Sci USA 93:9096–9101.
Kasahara M, Naruse K, Sasaki S, Nakatani Y, Qu W, Ahsan B, Yamada T, Nagayasu Y, Doi K, Kasai Y, et al. 2007. The medaka draft genome and insights into vertebrate genome evolution. Nature 447:714–719.
Katsanis N, Fitzgibbon J, Fisher EM. 1996. Paralogy mapping: identification of a region in the human MHC triplicated onto human chromosomes 1 and 9 allows the prediction and isolation of novel PBX and NOTCH loci. Genomics 35:101–108.
Kikugawa K, Katoh K, Kuraku S, Sakurai H, Ishida O, Iwabe N, Miyata T. 2004. Basal jawed vertebrate phylogeny inferred from multiple nuclear DNA-coded genes. BMC Biol 2:3.
Kuraku S. 2008. Insights into cyclostome phylogenomics: pre-2R or post-2R? Zool Sci 25:960–968.
Kuraku S, Kuratani S. 2006. Time scale for cyclostome evolution inferred with a phylogenetic diagnosis of hagfish and lamprey cDNA sequences. Zool Sci 23:1053–1064.
Kuraku S, Meyer A. 2009. The evolution and maintenance of Hox gene clusters in vertebrates and the teleost-specific genome duplication. Int J Dev Biol 53:765–773.
Kuraku S, Meyer A, Kuratani S. 2009a. Timing of whole genome duplications relative to the origin of the vertebrates: did cyclostomes diverge before, or after? Mol Biol Evol 26:47–59.


Kuraku S, Ota S, Kuratani S. 2009b. Jawless fishes (Cyclostomata). In Hedges SB, Kumar S (eds.), Timetree of Life. New York: Oxford University Press, pp. 315–319.
Loh YH, Brenner S, Venkatesh B. 2008. Investigation of loss and gain of introns in the compact genomes of pufferfishes (Fugu and Tetraodon). Mol Biol Evol 25:526–535.
Lundin LG. 1993. Evolution of the vertebrate genome as reflected in paralogous chromosomal regions in man and the house mouse. Genomics 16:1–19.
McLysaght A, Hokamp K, Wolfe KH. 2002. Extensive genomic duplication during early chordate evolution. Nat Genet 31:200–204.
Meyer A, Van de Peer Y. 2003. “Natural selection merely modified while redundancy created”—Susumu Ohno’s idea of the evolutionary importance of gene and genome duplications. J Struct Funct Genom 3:vii–ix.
Meyer A, Van de Peer Y. 2005. From 2R to 3R: evidence for a fish-specific genome duplication (FSGD). Bioessays 27:937–945.
Meyer A, Zardoya R. 2003. Recent advances in the (molecular) phylogeny of vertebrates. Annu Rev Ecol Evol Syst 34:311–338.
Mulley JF, Chiu CH, Holland PW. 2006. Breakup of a homeobox cluster after genome duplication in teleosts. Proc Natl Acad Sci USA 103:10369–10372.
Neidert AH, Virupannavar V, Hooker GW, Langeland JA. 2001. Lamprey Dlx genes and early vertebrate evolution. Proc Natl Acad Sci USA 98:1665–1670.
Nelson GJ. 1969. Gill arches and the phylogeny of fishes, with notes on the classification of vertebrates. Bull Am Mus Nat Hist 141:475–552.
Ohno S. 1970. Evolution by Gene Duplication. New York: Springer-Verlag.
Panopoulou G, Poustka AJ. 2005. Timing and mechanism of ancient vertebrate genome duplications: the adventure of a hypothesis. Trends Genet 21:559–567.
Patterson C. 1993. An overview of the early fossil record of acanthomorphs. Bull Mar Sci 52:29–59.
Pebusque MJ, Coulier F, Birnbaum D, Pontarotti P. 1998. Ancient large-scale genome duplications: phylogenetic and linkage analyses shed light on chordate genome evolution. Mol Biol Evol 15:1145–1159.
Peng Z, Ludwig A, Wang D, Diogo R, Wei Q, He S. 2007. Age and biogeography of major clades in sturgeons and paddlefishes (Pisces: Acipenseriformes). Mol Phylogenet Evol 42:854–862.
Postlethwait JH. 2007. The zebrafish genome in context: ohnologs gone missing. J Exp Zool B 308:563–577.
Putnam NH, Butts T, Ferrier DE, Furlong RF, Hellsten U, Kawashima T, Robinson-Rechavi M, Shoguchi E, Terry A, Yu JK, et al. 2008. The amphioxus genome and the evolution of the chordate karyotype. Nature 453:1064–1071.
Robinson-Rechavi M, Marchand O, Escriva H, Bardet PL, Zelus D, Hughes S, Laudet V. 2001a. Euteleost fish genomes are characterized by expansion of gene families. Genome Res 11:781–788.
Robinson-Rechavi M, Marchand O, Escriva H, Laudet V. 2001b. An ancestral whole-genome duplication may not have been responsible for the abundance of duplicated fish genes. Curr Biol 11:R458–R459.
Semon M, Wolfe KH. 2007a. Consequences of genome duplication. Curr Opin Genet Dev 17:505–512.
Semon M, Wolfe KH. 2007b. Rearrangement rate following the whole-genome duplication in teleosts. Mol Biol Evol 24:860–867.
Semon M, Wolfe KH. 2007c. Reciprocal gene loss between Tetraodon and zebrafish after whole genome duplication in their ancestor. Trends Genet 23:108–112.


Semon M, Wolfe KH. 2008. Preferential subfunctionalization of slow-evolving genes after allopolyploidization in Xenopus laevis. Proc Natl Acad Sci USA 105:8333–8338.
Sidow A. 1996. Gen(om)e duplications in the evolution of early vertebrates. Curr Opin Genet Dev 6:715–722.
Siegel N, Hoegg S, Salzburger W, Braasch I, Meyer A. 2007. Comparative genomics of ParaHox clusters of teleost fishes: gene cluster breakup and the retention of gene sets following whole genome duplications. BMC Genom 8:312.
Steinke D, Salzburger W, Braasch I, Meyer A. 2006a. Many genes in fish have species-specific asymmetric rates of molecular evolution. BMC Genom 7:20.
Steinke D, Salzburger W, Meyer A. 2006b. Novel relationships among ten fish model species revealed based on a phylogenomic analysis using ESTs. J Mol Evol 62:772–784.
Taylor JS, Van de Peer Y, Braasch I, Meyer A. 2001a. Comparative genomics provides evidence for an ancient genome duplication event in fish. Philos Trans R Soc Lond B 356:1661–1679.
Taylor JS, Van de Peer Y, Meyer A. 2001b. Revisiting recent challenges to the ancient fish-specific genome duplication hypothesis. Curr Biol 11:R1005–R1008.
Taylor JS, Braasch I, Frickey T, Meyer A, Van de Peer Y. 2003. Genome duplication, a trait shared by 22000 species of ray-finned fish. Genome Res 13:382–390.
Vandepoele K, De Vos W, Taylor JS, Meyer A, Van de Peer Y. 2004. Major events in the genome evolution of vertebrates: paranome age and size differ considerably between ray-finned fishes and land vertebrates. Proc Natl Acad Sci USA 101:1638–1643.
Venkatesh B, Tay A, Dandona N, Patil JG, Brenner S. 2005. A compact cartilaginous fish model genome. Curr Biol 15:R82–R83.
Venkatesh B, Kirkness EF, Loh YH, Halpern AL, Lee AP, Johnson J, Dandona N, Viswanathan LD, Tay A, Venter JC, et al. 2006. Ancient noncoding elements conserved in the human genome. Science 314:1892.
Venkatesh B, Kirkness EF, Loh YH, Halpern AL, Lee AP, Johnson J, Dandona N, Viswanathan LD, Tay A, Venter JC, et al. 2007. Survey sequencing and comparative analysis of the elephant shark (Callorhinchus milii) genome. PLoS Biol 5:e101.
Wada H, Satoh N. 1994. Details of the evolutionary history from invertebrates to vertebrates, as deduced from the sequences of 18S rDNA. Proc Natl Acad Sci USA 91:1801–1804.
Wittbrodt J, Meyer A, Schartl M. 1998. More genes in fish? Bioessays 20:511–515.
Yu WP, Rajasegaran V, Yew K, Loh WL, Tay BH, Amemiya CT, Brenner S, Venkatesh B. 2008. Elephant shark sequence reveals unique insights into the evolutionary history of vertebrate genes: a comparative analysis of the protocadherin cluster. Proc Natl Acad Sci USA 105:3819–3824.


INDEX

ABCE model, 282 Abiotic stress response, 281 Acanthomorpha, 303 ACAT1/ACAT2 genes, 233 Accession numbers, 177 Acetylation, 281 Acipenseriformes, 301–302 Acorus americanus, 273 Acs1/Acs2 genes, 239 Acs13 gene, 7 Actinopterygii, 301–303 Activity-reducing mutations, 111 Adaptation, 16, 96, 121, 123–124, 174, 212, 223 Adaptive conflict resolution model, 36 Adaptive evolution, 88, 125, 174, 258 Adaptive gene duplications, 64–66 Adaptive mutations, 110, 115, 117, 121 Adaptive response dosage considerations, 65–66 environmental interactions and, 66–68, 70 stress-induced, 64 Adaptive storytelling, 78 Additive fitness function, 60 AEF-box, 149 Aerobic growth, 239 A-function lineage, 283 AGAMOUS, 284–285 Age distribution plots, 275 Agnatha, 305 Alfalfa, 272 Alignment, 164, 171 All-α proteins, 153–154, 156 All-β proteins, 148, 156–157 Alleles, 12, 60, 68, 121, 126, 235, 280 Allelic divergence, 126 Allopolyploidization, 48, 278 Allopolyploids, 24, 32–33, 271, 277–279 Allotetraploids, 34, 281, 287 α-Synuclein, 62 Alternative splicing, 3, 26–27, 41, 47, 281 Alzheimer’s disease, 41 Amborella, 273 Amia calva, 302 Amiiformes, 301

Amino acid changes in, 86, 88, 113, 116, 118, 121, 123, 125, 167 chemically complex, 211 diversity, 85–86 duplicated genes, 23 expression differentiation, 88 functions of, 8, 10, 66, 79, 81, 208 patterns of, 164 positive charge, 150 sequences, 85–86, 136, 147, 229 substitutions, 33, 37, 115–117, 119 synthesis, 209 Amniote divergence, 261 Amniote genes, 256–257 Amniote gene tree, 259–260 Amphioxus, 306 Amylases, 63–64 Anaerobic growth, 239 Analysis evolution, 181 ANC1 gene, 92 Ancestors, in gene trees, 187 Ancestral functionality, 36 Ancestral genes, 37, 58, 80, 147, 233, 254 Ancestral sequence inference, 168 Ancient duplication events, 215 Ancient genes, 185 Ancient genome duplications, 31 Aneuploidy, 32 Angiosperm plants, 31, 270–272, 282–286 Animal genomes, 255, 300. See also specific types of animals Ankyrins, 140 Annotations, 176, 185, 256, 260 Anole lizard, 265 Anolis carolinensis, 265 Anopheles gambiae, 64 Antennapedia, 5 Antibiotic(s) functions of, 65 resistance, 106, 115 Antibodies, 120 Antigens, 85 Antirrhinum, 283–285

Evolution After Gene Duplication, Edited by Katharina Dittmar and David Liberles Copyright © 2010 Wiley-Blackwell


Antisense oligonucleotides, 243 AP3 genes, 284 APETAL genes, 283–284 Aphids, 64 Apoptosis, 169 Apparent gene tree reconciliation problem, 191 Apparent polytomy, 191 Apparent species tree reconciliation problem, 191 Aquaporin, 149 Aquilegia, 286 Arabidopsis ancestor, 34, 276 arenosa, 280 asymmetric divergence, 35 chromatin remodeling, 281 complex duplicates, 47 dosage-balanced genes, 63 double synteny, 94 expression divergence, 45–46 MADS-box genes, 37, 281–286 root development, 89–90 segmental duplications, 95–96 suecica, 280 thaliana, 25–26, 38, 42, 67, 89, 107, 279, 282–283 retained duplicates, 42, 44 whole-genome duplication, 39, 48, 276 Arapaima, 302 Archaea, 254 Arc-repressor, 116 Are1/Are2 genes, 233 Arg24, 4 Arginine synthetic pathway, 3 Arowana, 302 Asp27, 4 Aspartate, 143 Aspartyl protease family, 142–143 Asteraceae, 272 Asymmetric protein evolution, 34 Asymmetrical divergence, 34–35 ATH1 GeneChip, 89 ATP, 81, 208 Auditory placodes, 245–246 Autopolyploids, 32, 271, 279 Autoregulators, 26, 246 Autoregulatory element (ARE), 7 Autosomal chromosomes, 133 Autosomal genes, 7 Avocado, 273 Bacillus subtilis, 150–151 Backup circuits, 239, 246 Backup systems, redundant, 60 Bacteria bacterial genomes, 181 bacterial proteins, 16 characterized, 6, 123, 254

Bacteriorhodopsin, 152 Baker’s yeast, 64, 67. See also Yeast Barley, 272 Base mismatches, 135 Bayesian law, 168 Bayesian methods, 186 Bayesian models, 200 Beads on a string, 137, 140 BE hairpin, 147 Beneficial duplications, 63–64, 67 Beneficial mutations, 110, 216, 277 Best-fit decay constants, 44 Beta/alpha class of proteins, 143, 146, 155–157 β-barrel architecture, 154–155 folds, 145–146 B-function lineage, 284 β-galactosidase, 211 β-Propeller structures, 143–145 β-propellors, 142–145, 156 β-sheets, 116, 143, 157 Biases codon, 39–40 functional, 42–45 mutation, 27 Bichir, 301 Bidirectional responsive backup circuits, 244 Bifunctional genes, 80 Binary trees, 186, 189–191, 193, 196, 198 Binding, generally dinucleotides, 146 enzymes, 6–7, 141 interactions, 1 lipids, 43 proteins, 148 specificity, 5–6 transcription, 2–3, 10–11, 83, 90–92 transcription factor, 90–92 Biochemistry, 16, 221 Biodiversity, 16 Bioinformatics, 117 Biological networks, 97, 218 Biological organization levels, 16 Biological species concept, 16 Biological systems, 23, 225. See also Biological networks Biosynthesis, 211 Biotic stress-response genes, 43 Birds gene expansion methods, 255 genomic analysis, 254 phylogenetic approach to evolutionary dynamics of gene duplication chicken-specific gene duplication, 257–265 computational approach, 255–257

methodology, 255–257 overview, 253–255 poikilothermic, 263 Birth-death process, 200–201 Bivalent molecules, 140 BLAST databases (NCBI) best-reciprocal BLAST hits (BRH), 176 BLASTP, 247 contents of, 175–176, 178 reciprocal hits, 255, 257 searches, 181, 255 Bonobos, pseudogenes, 64 Bowfin fish, 301–302 Brain studies, 35–36 Brassicaceae, 272, 278, 281, 283, 286 Brewer’s yeast, 81–82, 91, 94, 97 Bruguiera gymnorrhiza, 27 Caenorhabditis elegans characterized, 80, 231 duplicative processes, 32, 34, 98 redundant duplicates, 233 Calcineurin-dependent pathway, 246 Cambrian explosion, 17 Cancer, 62–63 Candida albicans, 35 Carica, 272 Caspases, 164, 169–171 Catalysis, 1, 3, 81, 111 Catalytic promiscuity, 118 Catarrhini, gene divergence, 36 CAULIFLOWER (CAL) genes, 283 cDNA, 279, 306 CDR region, 85 CED-3, 169–170 Cell cycle, 43, 244 Cell death, 43. See also Apoptosis Centroids, 134 Centromeres, 134 Cephalochordata, 306 C-function lineage, 284–285 Chain genes, 81 Character mapping, 177, 179 Charcot-Marie-Tooth disease type 1A (CMT1A), 62–63 Chicken genome, 179. See also Chicken-specific gene duplication Chicken olfactory receptor 7 (COR7), 257 Chicken-specific gene duplication dynamics of, 257–261 hemoglobin, 262–263 identification strategies, 264–265 keratin, 263–264 olfactory receptors, 263 serpins, ovalbumin-related, 262 toll-like receptors (TLRs), 260, 262 whole-genome sequencing, 254


Chimeric genes, 88, 306 Chloroplasts, 42 Chondrichthyes, 301, 306 Chromatin(s) functions of, 8 remodeling, 281 structure, 38 Chromosome(s) analysis, 300 chromosomal duplications, 270 homologous, 121 Chromosome 21, 9 Chordates, 13–14, 16, 180, 304 ClC chloride, 149 Ciliates, genome duplications, 31 Circular permutations, 140 Circular reasoning, 93 cis-regulatory noncoding molecules (CRMs), 174 Clade B serpins, 262 Cladogenesis, 17 Clarkia, 270 Coding sequences, 1, 13–14, 23, 33, 35, 80–81, 90, 174 Codons, 86, 88 Coelacanths, 301 Coherence, 136 Coincidentally dispensable genes, 232 Coinducers, gene sharing, 81 Collagen, 140 Common ancestor, 24, 178, 256, 261. See also Last common ancestor (LCA); Least common ancestor; Most recent common ancestor (MRCA) Comparative analyses benefits of, 254–255 genomic analysis, 300 genomic data analysis, 14 Comparative genomics, 174, 178, 181–182, 201 Compensatory mutations, 106–107, 115 Complementary mutations, 83 Computational biology, 185 Computer-assisted storytelling, 78 Computer simulation programs, 115, 117, 122, 125–126 Computer software programs detection of functional divergence, 165, 169–171 GeneTree, 200 gene tree reconciliation, 191–192 NOTUNG, 191–192 phylogenetic tree reconciliation, 200 Softparsmap, 191–192 Concave functions, 69–70 Conditional probability, 165–166, 200 Conformations, 11, 112 Conservation, evolutionary, 123, 163, 181 Constitutive splicing, 3 Contraction, lineage-specific, 16


Copy number polymorphisms (CNPs), 66 Copy number variations (CNVs), 66–67 COR genes, 263 Coregulation, 236–240 Correlation coefficients, 90 Correlation studies, 39 Cotton, 272, 280 Covalent attachment, 2 Covalent bonds, 140 Cretaceous period, 31 Cretaceous-Tertiary (K-T) boundary, 273 Creutzfeldt-Jakob disease, 117 Cross-hybridization, 134 Crossing over, 95, 107, 115, 134. See also Crossover mechanisms Cross-linking, 140 Crossover mechanisms, 134–136, 265 Cross-regulatory inhibitions, 243 Crosstalk, 6 Crystallins, 79–80 C-termini, 105, 108, 141, 153 Culex sp., 64 Cutoff value, functional divergence, 168–169 Cyanovirin, 138, 146 Cyclic permutation, 154–155 Cyclins, 234 Cyclostome genes, 301, 305–307 Cyprinids, 300 Cysteine mutagenesis, 152 Cysteine-rich domains (CRDs), 117 Cytochrome P450 genes, 64 Cytochromes, 64, 147 Cytokines, 80, 234, 243 Cytoplasm, 152 Cytoskeletons, 40, 140 Darwinian selection, 79, 84–85. See also Natural selection Data mining, 176 Databases Animal Genome Size Database, 305 DAVID database, 177 Ensembl, 175–176, 178, 180–181, 255–257 InParanoid, 175, 177–179, 181 localization of gains and losses, 177–178 MultiParanoid, 175, 177–179, 181 orthology/paralogy identification, 175–176 OrthoMCL, 175, 177–179 PubMed, 240, 248 RoundUp, 175–178 synthetic lethal, 248 Daughter genes, 80, 82 DAVID database, 177 Dayhoff model, 165–166 Deep coalescence, 185–186 Degenerative mutations, 32, 58, 84 Dehydrogenases, 154

Deleterious duplications, 62–63 Deleterious mutations, 59, 68, 83, 110–111, 115, 117, 216, 218, 220, 277 Deletions in birds, 253, 263 duplicate retention, 38–40 in plants, 281 protein products, 140, 153, 155, 157–158 redundancy, 231–232, 239, 246–248 significance of, 88, 106, 117 De novo origination, 174 evolution, 2 Descendants, in gene trees, 187 Deuterostomes, 180 Developmental biology, 240 Developmental genes, 42 Dietary factors, 63–64 Differential optimization, 2 Differential selection, 253 Dimeric interactions, 146 Dimerization, 111, 153 Dimerization-competent genes, 147 Dimerization-incompetent genes, 147 Dimers, 140–141, 146–148 Diminishing returns function, 69–70 Dinucleotide-binding proteins, 146 Dinucleotides, 299 Diploid genomes, 134 Diploidization, 18, 33, 107, 276 Direct regulatory interaction, 245 Dispensability, 107 Disulfides bonds, 140 bridges, 117, 123 DIVERGE computer program, 165, 169–171 Diversification, mutation-driven, 3. See also Functional divergence of duplicated genes Dlx genes, 243, 245–246, 307 DMA1/DMA2 genes, 92 DNA base sequence, 78–79 B-form, 2 double-stranded, 135, 137 father/mother exchange, 134–135 functions of, 215, 229 in haploid cells, 173 histones and, 146 homologous sequences, 115 junk, 124, 135 -level mutations, 226 metabolism, 43 methyltransferase, 281 mismatch repair mechanisms, 135 noncoding, 279 repair, 42 replication, 42, 45, 106, 135, 208

single-stranded, 135 strand breakage, 135, 137 transcription, 243 whole mitochondria (mtDNA), 301, 304 DNA polymerases, 32, 106, 143 Domains entangled, see Entangled domains swapping, 146–148 tandem duplicates, 138–140 DOM-fold, 148 Donor chromosome, 96 Dosage balance, see Dosage balance compensation, 12–14, 201 dependence, 277 -dependent effects, 278 -dependent linear response, 241–242 effects, 108–109, 111, 215–216, 225, 253 Dosage balance effects theory, 46–47 hypothesis, 40–42, 63, 111 Double helix, 124 Down-regulation, 13, 242, 286 Downstream reactions, 2 Drosophila characterized, 67, 254–255 duplicated genes, 45 melanogaster, 5, 25, 34, 123, 155, 176 Dup-loss, generally problem, 197 reconciled trees, 189, 191 Duplicate-and-destroy method, 158 Duplicate preservation, 36 Duplicate retention mechanisms buffering, 38–40 complexity factors, 38, 46–47 dosage balance, 40–42, 47–48 increases in, 40 expression divergence, 45–46 functional bias, 42–43, 45 gene duplication vs. alternative splicing, 47 neofunctionalization, 33–36 overview of, 32–33, 216–217 subfunctionalization, 36–38 Duplication, generally acceptance of, 299 costs, impact of, 190, 200, 212 deletions and domain swaps, 154 and divergence paradigm, 229–230 functional divergence, 23–28 -loss costs, 190–191, 197, 200 modes of, 32 nodes (D-nodes), 188–190 problem, 197 small-scale, see Small-scale duplications (SSDs)


Duplication-degeneration-complementation (DDC) model, 36, 47, 83–85, 88, 97, 110 Duplication, swapping, and deletion (DSD) events, 153–154 Duplicative retrotransposition, 32 Early Cretaceous period, 46 Ebx homeodomain, 5 Ecoparalogs, 246 Eels, 303 E-function lineage, 285–286 Elopomorpha, 303–304 Embedded gene trees, 188–189, 191 EmrE protein, 150–151 Endoplasmic reticulum, 92 Energy costs, 209–210, 212, 215 Ensembl database, 175–176, 178, 180–181, 255–257 Entropy, 141 Environmental conditions adaptation to, 106 change in, 78 impact of, 67, 224, 236 significance of, 211 stressful, 65–66. See also Environmental stress Environmental stress impact of, 26–27, 123, 239 response, 244 Enzyme(s) active centers, evolution after duplication binding interface and interaction partner changes, 6–7 DNA-binding specificity changes, 3–6 examples of, 8–10 regulation of duplicate copy, instantaneous changes, 7–8 regulatory elements that control gene expression, change of, 7 active sites, 141 adaptive responses, 70 binding sites, 141 catalysis process, 12 detoxifying, 64–65 duplicate gene birth, 10 functional promiscuity of, 118–120 hydrolytic, 45 metabolic, 212 paraoxonase (PON1), 118–119 repair, 16 triosephosphate isomerase (TIM), 146 Epigenetic(s) complementation (EC), 37–38, 47–48 events, 215 in polyploids, 280–281 silencing, 37–38 Episode-based gene duplications characterized, 192–193, 202


Episode-based gene duplications (Continued) episode clustering problem, 194 minimum episode problem, 194–195 multiple gene duplication problem II, 195–196 solution space, 193–194 Episode clustering problem, 193–195 Epistasis characterized, 60–61 epistatic effects, 40 epistatic synergism, 237, 247 Equilibrium theory, 16 Escape from adaptive conflict (EAC), 36, 47, 112, 122, 277 Escherichia coli, 26, 149–151, 208, 212, 223, 243 Eschscholzia californica, 273 Esterases, 64 EUG1 gene, 92 Eukaryotes duplicate retention, 31, 34, 41–42, 47 evolutionary dynamics, 121, 123–124, 254 functional divergence, 26, 31, 34, 41–42 mapping gene gains and losses, 174, 180 proteins, tandem gene duplications, 136–137, 152 significance of gene duplication, 84, 93, 208, 270 Euteleostei, 303 Eutheria, 179, 181 Evo-devo, 299 Evolution adaptive, 125 asymmetric, 36 preduplication, 112 Evolutionary biology, 77–78, 88, 96, 99, 192, 196 Evolutionary divergence, 234 Evolutionary dynamics, 215 Evolutionary genomics, 95 Evolutionary mechanisms, interplay between, 13. See also Neofunctionalization; Pseudogenization; Subfunctionalization; Subneofunctionalization Evolutionary process, 77 Evolutionary selection, 230 Evolutionary theorists, 77 Evolutionary transitions, 108 Exons, 26, 88, 154, 173, 299 Expansion, lineage-specific, 16 Experimental biologists, 65 Explanation trees, 188–189 Exponential distribution, 15, 201 Expressed sequence tags (ESTs), 272–273 Expression differentiation duplicated genes in Arabidopsis root development, 89–90 transcription factor binding, 90–92 Expression divergence, 25–26, 36 Eye color, 26

FEN1 gene, 248 Ferns, 270 Fish, genome duplications, 31, 45. See also specific types of fish Fish-specific genome duplication (FSGD), 301 Fission, 105–106, 174 Fitness, generally cost, 208, 234, 247 density, 107 effects, 106, 108, 209, 220–223 functions, 60–61, 68–69 landscape, 118–121, 123 Fixation, 64, 67–68, 108 Fixed gene duplications characterized, 66 tandem, 141 FKS1/FKS2 genes, 238–239, 244, 246 Flatworms, 173 Flowering plants, 281–282 Fly-by-wire system, 60 Folds, knotted, 153–154 Forbidden mutations, 59 Formalism, 235 Founder effect, 17 Four-way Holliday junction, 134 Fourier analysis, 156 Fragmentation, 46 Framework modeling systems, 14 Frequency distribution, 66, 94 FRUITFUL (FUL) genes, 283 Fugu, 254, 257, 301 Full-genome duplication and sequencing, 174–175 Functional constraint, 166, 168 Functional differentiation, 79–81, 88 Functional divergence detection using statistical methods overview of, 163–164 two-state model for, 164–165, 168 Type I, 164–169 Type II, 164–165 of duplicated genes expression divergence, 25–26 protein-protein interactions (PPI), 23–24 of splice variants, 26–27 implications of, 79, 92 Functional diversification, 286 Functional overlap. See Redundancy, genetic Functional promiscuity, 118 Functional redundancy, 57–58, 236, 243 Functional specialization, 239 Functionless genes, 79 Fungi genomes, 65, 216, 255, 300 Fused genes, spliced variant duplication, 27 Fusions, 105–107, 133, 136, 140, 174

Fast-evolving genes, 34, 37 Feedback regulation, 235–236

Galactokinases, 81, 84 Galactose-1-phosphate, 81

GAL genes, 81–82, 84, 88 Gallus gallus, 255–256, 260. See also Chicken Gallus lineage, 179 Gametes, 32, 107 Gamma distribution, 166, 201 Gars, 301–302 GDP-mannose dehydrogenase (GPDMDh), 154 Gene(s), generally amplifications, 65 clusters, homogenous, 4 conversion, 39–40, 67, 135, 254, 258, 278 copy number, 60–61, 63–64, 68, 96 diversification rates, 17 dosage adaptive responses, 65–68, 70 balance, 66 beneficial duplications, 63–64, 67 copy number polymorphisms (CNPs), 66 copy number variations (CNVs), 66–67 deleterious duplications, 62–63 duplications and, 57 genetic dominance, 68 as unifying theory for prediction of duplication fates, 69–70 duplication, see Duplication cost of, 207–208 importance of, 1 evolutionary benefits of, 207 overview of, 187–189 families, reanalysis of investigations, 178–181 gains and losses, mapping, 175 loss, 186, 192 mapping, 175, 177–178, 274 redundancy, see Redundancy regulatory network, 218–219 sharing, 79, 112 trees, see Gene trees Gene expression costs energy, 209–210 evolutionary cost signatures, 212 lac operon studies, 211–212 material, 210–211 dynamics, 224 network functions, 217–219 patterns, 174, 279 profile, 10 significance of, 2 Gene ID Conversion, 177 Gene ontology (GO) categories, 42–44 Gene-sharing model, 80–83 GeneTree, 200 Gene trees characterized, 110, 121, 185–186, 201–202, 274–277 definitions, 187 gene duplication (GD) model, 187–189


inferred, 188–189 notations, 187 reconciliation of comparability, 189 components of, 186–189 gene duplication, 189–190 model-based approaches, 200–201 reconciliation cost, 190–192, 197 supertrees, 186–187, 196–200 Genealogical history dynamics, 17 Genetic diseases, 41, 278 Genetic diversity, 123, 125 Genetic drift, 32, 57, 59, 67, 77, 98, 207–208, 215 Genetic events, 215 Genetic interaction patterns, 40 Genetic redundancy, 60–61. See also Redundancy Genetic robustness model, 14 Genome/genomic, generally characterized, 2, 13 complexity, 173 duplication, 26, 192–193, 224. See also Duplication evolution, 82, 96, 99, 175 rearrangements, 38, 46 sequences, 1, 31, 98, 176, 182, 210, 305 shock, 279 signatures, 212 -wide sequence analysis, 305 Genomic DNA, 32, 208 Genotypes, 58, 60, 68, 123 Gerbera genes, 286 Germ cells, diploid, 136 Germline mutations, 32, 134 GH hairpin, 147 Gks-3 gene, 234 Glires lineage, 178 Global sequence alignment, 175 Globin fold, 147–148 Globular proteins, 116, 153–155 Glucose fermentation, 40 influx, 239–240 GLUD2 gene, 35 Glutamate dehydrogenase, 35 functions of, 149 Glutathione-S-transferases (GST), 64 Glycerol-3-phosphate dehydrogenase, 153 Glycerol transport family, 149 Glycine, 272 Glycolytic genes, 40 Glycoprotein genes, 87 Glycosylation, 3 Gnathostomes, 306 Goldfish, 303 Gossypium genes, 272, 281 G-protein coupled receptors (GPCRs), 152–153, 155


Grasses, 272 Greek-key motif, 147, 156 Growth hormone (GH), 253 Gsh1/Gsh2 genes, 243 GSL genes, 231 Gymnosperms, 275, 282 Hagfishes, 305 Hair color, 26 Hairy2 genes, 97 Haploid genome, 48, 173 Heat-shock proteins, 66 Heavy metals, 65 Heme-binding ancestors, 147 Hemoglobin, 260–261 Heterocomplexes, 41 Heterodimers, 24, 141, 146, 277–278 Heterogeneity, 41, 80, 166 Heterosis effects, 48–49 Heterozygous deleterious mutations, 68 Heterozygous duplications, 62 Heterozygous individuals, 12 Hexose, 246 High-affinity hexose transport genes (HXT6/HXT7), 64 High-throughput analysis, 24, 28, 248 Histidine, 8, 143 Histone(s) acetylation, 281 characterized, 40, 138 fold, 146–147 Hitchhiking, 66 HIV genome, 123, 142 HN-S gene, 243 Holliday junction, 135 Holostei, 302 Homeobox genes, 3 Homeothermy, 263 Homeotic genes, evolution of, 6 Hominoid-specific gene duplication, 23 Homo sapiens, 177, 278 Homodimers, 24, 140–141, 243 Homoelogs, 18, 277–280, 287 Homogenization, 39, 278 Homologs, 6, 108, 150–152, 164, 179, 234, 255, 286 Homology, 95 Homozygote mutation, 241 Hordeum genes, 272 Host-parasite interactions, 45 HOX gene/proteins, 4–7, 93, 300, 304 Hsp90, 123 5-HT, 152 Htx1/Htx2 genes, 246 “Hub” genes, 97 Human genome, 15, 27, 176 Human Phylome, 175 Hxt1/Hxt2 genes, 239–240

Hybridization, 24, 32, 135, 278–279 Hydra, nematocysts, 117 Hydraulic systems, 60 Hydrogen bonds, 41 Hydrolysis, 63 Hydrophobic interactions, 10 Hypothesis testing, 78 IAD (innovation, amplification, divergence) model, 36, 47 Icams, 140 ICEtype caspases, 170–171 idp2/Idh genes, 239 Ile-tRNA synthase, 3 Illicium floridanum, 284 Imbalance, stoichiometric, 13 Immune functions, influential factors, 14 Immune response, 123, 170 Immune system, 93 Immunoglobulins, 85, 87, 93, 135 Inbreeding, 17 Indels, 106, 117 Inferred gene trees, 188–189, 202 Initial gene duplication, 9 Innovation IAD (innovation, amplification, divergence) model, 36, 47 morphological, 286 phenotypic models, 96–97 small-scale, 57 InParanoid database, 175, 177–179, 181 Insecticide resistance, 64 Inseparable domains, 140, 142–143 Insertions, 88, 106, 117, 140, 157–158, 281 In silico evolution, 219 Interaction interfaces, 10 Interactomes, duplicated genes, 23–24 Interleukin-1 receptor (IL-1R), 80 Intermolecular interaction, 2 Internal gene duplication, 17 International Chicken Genome Sequencing Consortium (ICGSC), 179, 263 Interresidue interactions, 2 Intramolecular interaction, 2 Intrinsically dispensible genes, 232 Introns, 8, 135, 137, 147–148, 173, 264 Invertebrates, 26 Isopropyl-β-D-thiogalactosidase (IPTG), 211 Isosymes, redundant duplicates, 237 Isozymes, redundant duplicates, 229, 233, 239 Janus proteins, 116 Jelly roll, 156 Joint distribution, 167 Joint probability, 165, 167 Jukes-Cantor model, modified, 177 Junction migration, 134, 136

Ka/Ks asymmetry hypothesis, 33 KARI family, 153–154 Karyotypes, 303–304, 307 Kelch repeat, 143 Keratin, 254, 260–261 Ketol-acid reductoisomerase (KARI), 153–154 Kinases, 7, 45, 221, 234 Kinetics, 3 Kluyveromyces lactis, 81–82, 84 Knockout phenotypes, 230, 236, 240 lac operon, 93, 211–212 LacZ/gal system, 243 Lampreys, 305 Large-scale duplications, 31–32, 274 Large-scale effects, 16–18 Large-scale expression studies, 66, 174 Large-scale mutations, 105–106 Last common ancestor (LCA), 275–276 Lattice models, 14, 36, 115, 126, 158 Leaf mapping, 189–191 Least common ancestor, 187, 190–191 Least-squares approach, 177 Leaves, in gene trees, 187 Leghemoglobin, 147 Legumes, 273 Leptin, 11 Leu-tRNA synthase, 3 Life history, 18 Likelihood function, 167. See also Maximum likelihood Lineage, generally amniote, 257–258 gene duplications within, 254–255 multiple, 173 sorting, 185–186 -specific changes, 16 duplications, 273, 286 evolution, 307 Linear fitness function, 69–70 Lipid binding, 43 Liriodendron tulipifera, 273 Literature review Evolution by Gene Duplication (Ohno), 1, 31 Molecular Biology of the Cell (Alberts), 233 Local search problem, 197, 199–200 Long interspersed nuclear elements (LINEs), 254 Look-ahead effect, 123 Loop-out events, 134 Loss(es) defined, 189 dynamics, 45 mapping, 173–185 rate of decay, 201 Loss-of-function genes, 79–80


implications of, 80 mutations, 59, 68, 83, 88, 216, 244, 277 Low affinity interactions, 2 Lungfishes, 93, 301 Lysine synthetic pathway, 3 Lysosomes, 62 Macaque genome, 176 Macroevolutionary events, 185, 202 MADS genes, 37, 111 Magnolia, 283 Maize, 272, 285 Major histocompatibility complex (MHC) genes, 85–86, 93, 254, 259–260 Major Intrinsic Protein (MIP) family, 149 Mammalian Phenotype Ontology, 180 Mammals/mammalian adaptive gene duplications, 65 extant, 196 genomes, 15, 180, 254 Mangrove. See Bruguiera gymnorhyza MANTiS database, 175–181 MAP kinases, 97, 99 Mapping character, 177–179 genome duplication in plants, 275–276 phylogenetic, 181 scenarios, 194–195 Markov chain model, Type I functional divergence predicting critical residues for, 168 testing, 165–167 Markov chain Monte Carlo (MCMC), 201 Markov clustering algorithm, 175 Material budget, 208 Material costs, 210–212 Maximum likelihood estimation, 168 implications of, 186, 258 phylogenetic tree (PHYML), 256 Maximum parsimony tree, 99 Medaka, 300, 304 Medicago, 272 Meiosis, 32–33, 107, 134 Melanocytes, 26 Membrane proteins cyclic permutations, 154–155 dual-topology generality of fusions, 151–153 implications of, 150–151 symmetry in, 149–150 MENT (mature erythrocyte nuclear termination), 262 Messenger RNA (mRNA), 209–211 Metabolic network, 219 Metabolism, significance of, 43 Metal-ion coordination, 140 Metazoan phylogeny, 180, 182 Methanomicrobia, 147


Methylation, 37, 280–281 Microarray analysis, 92, 279 Microbes, fast-replicating, 125 Microbiologists, 65 Microphthalmia-associated (Mitf) transcription factor, 26 Microsatellites, 299 Midkine genes, 243 Minimum episode problem, 193 Misaligned crossover, 136 Missing data, 186 Mitochondria, 42, 45 Mitosis, 32, 107 Model-based gene family analysis, 14 Molecular archaeology studies, 49 Molecular biology, 3, 79, 95 Molecular chaperones, 123 Molecular diversity, 254 Molecular evolution, 163 Molecular evolutionary biology, 78 Molecular genetics, 201 Molecular mechanisms, 18 Molecular phylogenetic studies, 305 Molecular sequences, 17 Molybdenum cofactor binding protein (1jroB), 148 Monodelphis domestica, 180 Monooxygenases, 64 Morning glory, 285 Morphogenesis, 43, 45 Morpholino oligonucleotides (MO), 245 Morphological complexity, 46 Morphological evolution, 174 Mosses, 275 Most recent common ancestor (MRCA), 179 Motifs, 38, 143, 147, 156 Mouse genome, 7, 67 Move-up rule, 195–196 MRF4 gene, 241 Mrk1 gene, 234 mRNA, 243, 246, 254 Multicellular organisms, 84 Multidrug resistance, 243 Multifunctional genes, 80 Multigene families characterized, 79, 85–86, 89, 98 dynamics, 264 evolution, 255 MultiParanoid database, 175, 177–179, 181 Multiple-birth models, 106 Multiple gene duplication problem, 193, 195–196, 201 Multiple segmental duplications, 94 Murine studies, gene duplication vs. alternative splicing, 47. See also Mouse genome Mus musculus, 176–177 Mutation(s) activity-reducing, 37

cost, 190, 232 degenerative, 36 dynamics, 2–3 fitness effects of, 106, 116n implications of, 98, 157 measure, 190–191 nonfunctionalizing, 36 opportunities after duplicate gene birth, 10–11 during redundancy (MDR) model, 78–80 small-scale, 105 Myf-5 genes, 230, 241–242 Myoblasts, 241 MyoD genes, 230, 241–242 Myofibers, 241 Myogenesis, 241 Myogenin, 241 Myoglobin, 138, 148 Myxiniformes, 305 Myzostomes, 173 Natural distribution, 16 Natural selection, 36, 58–59, 77–78, 85–88, 98, 207–211 NCBI databases BLAST, 175–176, 178, 181, 247, 255, 257 Genome Survey Sequences (GSS), 306 Nearest-neighbor interchange (NNI) operation, 198–200 Negative selection, 13–14, 66 Neighborhood, supertree heuristics, 197–198 Nematode worms, 80 Neo-Darwinists, 77–78 Neofunctionalization in birds, 254 defined, 10, 58 dosage compensation, 13 increases as form of, 12–13 duplicate divergence, 36 duplicate retention, 33–36 evolutionary dynamics, 108–112, 116, 121–122, 125–126 genetic redundancy, 13 high rate of, 16 MADS-box gene family, 285 mapping gains and losses, 174, 201 networks and, 216–217, 219, 224–225 plant genomes, 277, 287 protein folds, 14 redundant duplicates, 233 retention profile, 14 schematics, 12 significance of, 8, 11–12 timing of, 12 Networks duplicate genes in, 97–98

duplicate retention, 225 duplicate robustness, 225 dynamics and evolution in, 218–220 fitness effects of duplications, 220–223 gene duplicates in, 218, 223–225 gene duplication model, 225–226 gene functions in, 217–219 interaction set, 219 neutral, 113–114, 116–119 signaling, 217–221, 225 topology, 218 Neural crest genes, 180 Neuropathies, 62–63 Neurotransmitter(s) release, 27 uptake proteins, 149 Neutral drift, 108, 124 Neutral fixation, 68 Neutrality, 116 Neutral mutations, 110, 115, 218 Neutral network characterized, 113–114, 116 promiscuous enzymes, 118–119 protein, 125 prototype sequence, 117 New gene function hypothesis characterized, 77, 207 duplication-degeneration-complementation (DDC) model, 83–85 gene-sharing model, 80–83 mutation during redundancy (MDR), 78–80 Nicotiana, 278 NK clusters, 4 Node reconciled trees, 189, 191 Noise, in gene expression, 236, 242 Nonbinary species trees, 186 Noncoding sequences, 18, 80, 88 Nonfunctionalization, 58, 121, 216, 224, 233 Nonhomologous end joining, 135 Nonsynonymous mutations, 121 Nonsynonymous (Ka)/synonymous (Ks) substitution ratio, 67–68 Nonsynonymous substitutions, 67–68, 85–87, 91 NOTUNG, 191–192 N-terminus, 6, 141, 150, 155 Nuclear protein-coding gene, 301, 304 Nucleic acid sequences, 124 Nucleotide(s) diversity, 208 functions of, generally, 299 polymorphisms, 66 sequences, 107 substitutions, 3, 5–6, 85, 94 Null, generally models, 67 mutations, 230 phenotype, 242


Nuphar genes, 273, 283 Nutrient limitations, 66 Obligatory interactions, 10 Ohno’s dilemma, 112 Olfactory system placodes, 245–246 receptors, 254–255, 257, 260–261 Oligonucleotides, 243, 245 Oncogene duplication, 62 Open reading frame, 136 1R/2R/3R genome duplication background, 304–306 characteristics of, 43, 45–46 evolution, 2R after, 306 before, 306 Operons, 150–151. See also lac operon Opossum genome, 176 OR5U1/OR5BF1 genes, 255, 263 Organismal diversification, 265 Organismal diversity, 254 Organophosphates, 64 Orthologous interactomes, 24 Orthologs in birds, 253, 255–256, 260 duplicate retention, 35 evolutionary dynamics, 108, 253, 255–256, 260 functional divergence, 25 mapping gains and losses, 175–176, 179, 181 in plants, 283–285 OrthoMCL database, 175, 177–179 Oryza, 272, 278 Osteichthyans, 307 Osteoglossomorpha, 302–304 Other species concept, 16 Otocephala, 304 Outgroup expression patterns, 37 Oxidoreductases, 153 Oxygen binding, 43 Paddlefish, 301 PaleoAP3 genes, 284 Paleopolyploidy, 18, 42, 278 Pancreas, 63 Panther database, 176 Pan-vertebrate tetraploidization (PV4) hypothesis, 306 ParaHox gene family, 4, 302 Paralogs in birds, 253, 255–256, 258–260, 264–265 duplicate retention, 34, 39, 48 evolutionary dynamics, 107–108, 110, 121 functional divergence, 24, 27, 163 genetic redundancy, 229, 236 mapping gains and losses, 175–176 in plants, 272, 274–275, 283–285


Paralogs (Continued) significance of, 82, 87, 90 tandem gene duplication, 149 in vertebrates, 301 Paramecium, 40, 63 Paraoxonase (PON1), 118–120 Parasites, 254 Parent(s) gene duplication, 17 in gene tree, 187 Parkinson’s disease, 41, 62 Parsimony, 186, 189, 201 Partitioning, 36–37, 233, 270 Patchwork evolution, 1, 3 Pathogenic gene duplications, 66 Pathogens, 66, 106 Pax1/Pax9 mutations, 242–243 PD11 genes, 92 Penalized likelihood model, 258 Peptide chains, 113 Persea, 273, 283, 285–286 Pesticides, 64–65 Petromyzoniformes, 305 Petunia, 285 Pfam domain, 278 DUF606 family, 151 Phenotypes/phenotypic, generally complexity, 173 dosage effects, 60, 68 evolution dynamics, 96–97, 112 functional divergence, 34, 38, 40 genetic redundancy, 230, 232–233, 236, 240 mutations, 123–126 population genetics, 2, 18 transitions, 120–124, 126, 174 variation, 16 in vertebrates, 299–300 6-Phosphogluconate dehydrogenase (PGDH), 153–154 Phosphoproteins, 27 Phosphorylation, 7, 217 Phosphatases, 221, 234 Photosynthesis, 280 Phylogenetic, generally analysis, 81, 94, 98 reconstruction, 186, 192 trees, 167, 170, 175, 254, 307 Phylogenomics, 255, 265 Phylogeny characterized, 93, 201 reconstruction, 271 robust, 181 Physcomitrella, 275 Phytosphingosine, 248 Pines, 275 PISTILLATA genes, 284

Pisum sativum, 283 Planted tree, 198 Plant genomes, see specific types of plants adaptive gene duplications, 65 characterized, 255, 269, 300 duplications, 31, 277–281 overview of, 269–270 whole-genome duplication, 270–277 Plant polygalacturonases, 45 Platypus genome, 263 Pleiotropy characterized, 233 constraints, 12, 36 cytokines, 243 genes, 230 PMP22 gene, 62–63 Poaceae, 272 Point mutations, 106–107, 126, 263, 287 Poisson model, Type I functional divergence predicting critical residues for, 168–169 testing, 167–168 Poisson process, 258 Pollution, impact of, 66 Polymorphic gene duplications, 57, 62, 66 Polynomial time algorithms, 186, 191, 194 Polypeptides, 106 Polyploidization, 9, 18, 38, 42, 60, 92–97, 279–280. See also Polyploidy Polyploidy ancient, 274 genetic and genome approaches to, 272–273 in plant genome, 270–272 significance of, 17–18, 32–33, 92–97, 107, 287 Polypteriformes, 301 Polytomies, 185, 188, 191–192, 202 Poplar. See Populus trichocarpa Poppy, 273 Population, generally dynamics, 1, 17, 42, 45, 215 genetics analysis, 18, 67, 70, 123, 201, 216, 224 dynamics, 16, 122 models, 218, 226 networks and, 225–226 studies, 67 theory, 32 genomics, 18 size, significance of, 121, 137, 213, 216 -size lineage, 11, 16 Populus genome, 27, 272, 278 Positive mutations, 121 Positive selection, 11–12, 14, 64, 66–68, 79, 85–88, 201 Posterior analysis, functional divergence, 168 Posterior distributions, 200 Potato, 272 Power-law distribution, 10, 257

Ppz1/Pp2 genes, 234 Pregnancy-associated glycoproteins (PAGs), 86–87 Primate studies, 11, 35–36, 64, 97 Principal components (PC1/PC2), 89–91 Principle of parsimony, 186 Prion protein, 116 Priority species, 176 Probabilistic gene evolution models, 200 Prokaryotes, 34, 65, 84–85, 121, 137, 174, 180, 208 Promiscuous functions, 118–119 Promoters, 7, 26, 81, 83, 88, 108 Proteases, 142–143 Protein(s) ancestral, 106, 149–150 bifunctional, 81–82 biochemistry, 1 biosynthesis, 45 chicken, 180 complex-forming, 39, 41 coding genes, 92, 107, 109, 181, 254, 265 complexes, balance-sensitive duplicated, 42 degradation, 45 development of, 105–106 disulfide isomerases, 92 domains, 105, 111 evolution, 37, 112, 126, 217 function diversification, 111 fold/folding, 2, 14, 16, 106, 115–117 function, role of, 14, 16 functional divergence of, 121–122 interaction divergence, 36 implications of, 208, 225–226, 232 networks, 24, 247 protein-protein, 6, 23–24, 233–234, 238 self-interacting, 24 kinases, 42 latent functions, 121 neofunctionalization, 35–36 new, via gene duplication, 105, 107–108 phenotypes, evolutionary transitions between, 117–118 phosphatases, 42 point mutations and, 106–107 redundant duplicates and, 236 sequence, 33–35, 38, 40, 112–113, 116, 118–121, 135, 174 splicing, 3 stress-induced, 65–66 structures multidomain, 174 promiscuity of, 112, 116–117 stability of, 112–115 symmetries, intrinsic all-α proteins, 156 all-β proteins, 156 β/α proteins, 155–156


Fourier analysis of structural symmetry, 156–157 synthesis, 115, 211 thermodynamic stability, 112–116 underwrapped, 41–42 Proteinases, 86 Proteomes, 27, 150 Proteasomes, 42, 148 Protists, 65 Pseudodimers, 140–142, 147 Pseudogenes, 79, 108, 216, 253, 265, 269 Pseudogenization, 11, 34, 37, 108–109, 111–112, 121, 257 Pseudosymmetry, 149, 151 PubMed database, 240, 248 Pufferfish, 27, 254, 278, 300, 304 Qualitative subfunctionalization, 37 Quantitative subfunctionalization (EC), 37, 47 Quasispecies, 123 Radiation/speciation, 17 Ranunculaceae, 284 RAR genes, 307 Rate-smoothing techniques, 275 Rattus norvegicus, 176–177 Rays, 306 Rearrangements, 105, 111, 136, 146, 154, 158, 192 Reciprocal smallest distance (RSD) algorithm, 175 Recombination, 88, 106, 115, 121, 134–135, 137 Reconciled trees, 186–189 Redundancy complete, 69–70 duplicates, 39 functional, 57–58 genetic, see Redundancy, genetic genomic, 108, 272, 300 implications of, 59–60, 225 intragenome, 300 MADS-box gene family, 285 plant genomes, 272 regulation of coregulation, 236, 238–240 cross-regulatory interactions, 234–236, 238, 241 design of, 236, 245–246 developmental regulators, 229, 240–243 recurring patterns of, 246 Redundancy, genetic conditional coregulation, 236–240 conservation of redundant duplicates, 230, 234–239 defined, 229–230, 248 developmental regulators, 229, 240–243 dispensability of duplicates increased, influential factors, 247 overview of, 230–231


Redundancy, genetic (Continued) significance of, 232 evolution of redundant duplicates, 232–234 inferring interactions epistatic synergism, 237, 247 literature curation, 237, 247–248 synthetic lethal interactions, 248 metabolic fluxes, maintenance of, 239–240 regulation design of, 245–246 recurring patterns of, 243–245 responsive backup circuits, functions of, 239, 246 significance of, 3, 11, 13, 25, 60–61 Regulation/regulatory, generally dependencies, 234–235 mutations, 174 neofunctionalization, 35 networks, 111, 219–220, 225 proteins, 108 subfunctionalization, 37 Relaxation embedding, 191 selection, 68 Replication, 125, 135, 218 Reproductive isolation, 16–18 Reptiles/Reptilia, 256, 263–264 Restricted subtrees, 187 Retention duplicate, 225. See also Duplicate retention mechanisms in polyploidy, 277–279 profile, expectations for, 13–15 3N-Retinoic acid (RARE), 7 Retrocopy, 7 Retrograde evolution, 1–2 Retropositions, 299 Retrotransposition, 7–8, 107, 254, 264–265 Retrotransposon, 108, 133 Reverse transcriptase, 107 Reverse transcription, 254 RFPL1.2.3, 35–36 Rgt2 gene, 240 Rho, 248 Rhodopsin, 152 Ribosomal genes, 27, 40, 91–92 Ribosomal proteins, 90, 95 Ribosomes, 211 Ribozyme structures, evolution of, 112, 124–125 Rice genome, 32, 275 Rim11, 234 RNA, see mRNA; rRNA; siRNA characterized, 107, 113, 124–125, 209–210 binding, 43 editing, 3 noncoding genes, 254 nonprotein coding, 265 polymerases, 92, 143, 211

RNaseA gene superfamily, 254 Robust systems, 13–14, 106–107, 112, 125, 221, 225 Roh1-dependent pathways, 246 Root edge, 198 Rooted trees, 187, 190–191, 193, 196, 198 Rop protein, 116–117 Rossmann fold, 146 RoundUp database, 175–178 Rpl32 gene, 27 rRNA, 281 Saccharomyces cerevisiae duplicate retention, 34–35, 39, 43, 47 duplication myths, 81–82, 84, 88, 91, 94–95 material costs, 208–211 networks, 223–224, 243 dosage balance and, 63 evolutionary dynamics, 256 redundant duplicates, 230–231, 233, 237 Saliva, 63–64 Salivary amylase I, 13 Salmon, 303 Salmonella enterica typhimurium, 115, 149 Salmonids, 300 SAPit program, 148 SAP program, 156 Saruma henryi, 273 SAS5 genes, 92 Scavenger receptor cysteine-rich (SRCR) domain, 254 Schizosaccharomyces pombe, 94–95 Sclerotome development, 242 SCOP classification, 106, 140 Secondary functions, 36 Segmental duplication phenotypic innovation models, 96–97 polyploidization obsession, 92–96 Segregating mutations, 2 Selection, generally coefficients, 211–212 lineage-specific, 16 negative, 13–14, 66 positive, 11–12, 14, 64, 66–68, 79, 85–88, 201 pressure, 117, 121–123, 126, 135, 218, 222 relaxed, 68 Selective fixation, 68 Selective pressure, 13 Semionotiformes, 301 Senecio cambrensis, 279 SEPALLATA genes, 285–286 Sequence, generally analysis, 163 divergence, 274 evolution, 165, 263–264

Serine, 143 Serpinb10, 262 Serpins, ovalbumin-related, 260–261 SERT, 152 Sex chromosome, 264 Sexual reproduction, 106 Sharks, 306 Short repeats, 133–134 Siblings gene duplication, 17 in gene trees, 187 polyploidy and, 33 Side chains, 141 Signaling cascade, 221 network, 217–221, 225 Signal transduction systems, 6–8, 41–43, 229 Silenced/silencing genes, 37–38, 280, 287 Silent-site divergence, 254 Single-base-pair mutations, 2–3 Single-copy gene, 58, 61 Single-domain proteins, 105 Single-gene duplication, 68 Single-nucleotide polymorphisms (SNPs), 38 substitutions, 105 Site loss evolution, 2 Skates, 306 Skeletal muscle development, 230, 241 Skin color, 26 Slippage, 134 Slow-evolving genes, 37 Small interfering RNA (siRNA), 281 Small-scale duplications (SSDs) asymmetric protein evolution, 34 characterized, 11, 24, 32, 70, 107, 201, 216 expression divergence, 45–46, 48 retained, 14, 38–39, 42–43 WGD duplicates compared with, 33, 44 Smurf-1/Smurf-2 genes, 243 Snf3 gene, 240 Solanum genes, 272, 283 Solution space, 197 Somatic duplications, 134 Soybeans, 272 Spatial expression, 174, 181 SPE7 gene, 120 Specialization, 88 Speciation, 14, 16, 43, 57, 110, 188, 255–257, 270, 275 Species-specific changes, 16 Species tree, 185–187, 189, 190, 193–196, 200–201, 276 Specificity, influential factors, 3, 5–6, 10–11, 16 Spectrin, 140 Spermatogenesis, 7


Spindles, 134 Splice variant divergence, 1, 26–27 Splicing, significance of, 3. See also Alternative splicing; Constitutive splicing Spurious interactions, 146 Starch hydrolysis, 63 Stickleback, 304 Stochastic applications, 200–201 STOP codon, 135–136 Stress conditions, 40, 43 Structural mutations, 174, 181 Sturgeons, 301 Subfunctionalization bird genomes, 253–254 defined, 10 dosage compensation, 13–14 dosage effects, 58–60 duplicate retention, 36–38 evolution dynamics, 96–97, 109–112, 118, 121, 125–126 historical, 7 implications of, 83–84, 96–97 MADS-box gene family, 285 mapping gains and losses, 174 networks, 216–217, 219, 224–225 partial, 235 phylogenetic trees, 201 plant genomes, 270, 277, 280, 287 population genetics, 11 protein folds, 14 redundant duplicates, 233 retention profile, 14 schematics, 12 splice variants, 26 Subneofunctionalization, 47, 111 Substitution profile, expectations for, 13–14 Subtrees characterized, 166–167, 187, 189 pruning and regrafting (SPR) operation, 198–200 transfer operation, 197–198 o-Succinylbenzoate synthase, 3–4 Sugars, 65–66 Superoxide dismutase gene (Sodcp), 27 Supertrees characterized, 195, 202 reconciliation of characteristics of, 186–187, 196 hill-climbing heuristics, 197–200 problems with, 197 Swapped dimers, 146 Swiss-Prot, 178 Synapsins, 27 Synergistic epistasis, 60–61 Synonymous mutations, 121 Synonymous substitutions, 67–68, 85–87, 91 Synteny, 93–94, 300


Synthetases, 65–66 Systems biology functions of, 83 higher-level organization, 1–2 interacting redundant duplicates, 235–236 Taeniopygia guttata, 265 TAF14 genes, 92 Takifugu rubripes, spliced variants, 27 Tandem gene duplication duplicated proteins beads on a string, 137, 140 beta/alpha class, 143–146 characterized, 137, 140 inseparable domains, 142–143 pseudodimeric domains, 140–142 entangled domains cyclic permutation, 154–155 domain-swapped dimers, 146–148 knotted folds, 153–154 membrane proteins, 149–153 genetic mechanisms, 133–137 implications of, 11, 25, 93, 299 intrinsic protein symmetries, 155–157 partial deletion, 157–158 structural options for, 138–139 Tandem repeats, 140 Taxonomic bias, 174 T-cell receptors, 93 Teleost fish species, 26–27, 34, 43, 45, 300. See also Teleost-specific genome duplication (TSGD) Teleost-specific genome duplication (TSGD) background, 300–301 evolution after, 302–304 before, 302 Temporal expression, 174, 181 Tenpounders, 303 Tetrahymena, 125 Tetramers, 146 Tetraodon, 34, 278 Tetrapods, 93 TFG3 gene, 92 Theria, 181 Thermodynamic stability, 112–116 Thermoregulation, 254 TIM barrel, 146 Time-dependent retention, 13, 15 Time-dependent substitution, 13 Tissue specialization, 37, 41 Titin, 140 TLR genes, 261–262 TM6 genes, 284

Toll-like receptors (TLRs), 260–262 Tomato, 272 Toxins, 65 Tragopogon, 278, 280, 287 Transcription cost of, 211 factor (TF) affinity of, 11 bHLH (basic helix-loop-helix), 111 binding, 2–3, 10–11, 83, 90–92 DNA-binding specificity, 3 duplicate genes, 25, 37 expression differentiation, 88 GAL genes, 81 microphthalmia-associated (Mitf), 26 significance of, 41–43, 208, 223–224, 243, 281–282 tissue-specific, 83 impact of, 125, 238, 240, 302 networks, 236 regulators, 245 Transcriptome, 273, 279 Transfer RNA (tRNA), 3, 42, 45 Transient interactions, 10 Translational efficiency, 40, 107, 125 Translocation, 38, 95, 150, 152, 270 Transporters, 65 Transposition, 11, 278 Transposon, 95, 279 Tree(s), see Gene trees bisection and reconnection (TBR) operation, 198–200 intervals, 187, 193 of life, 174, 185, 196 search graph, 197, 199 Triose phosphate isomerase dimer, 141 Trisomy, 9 Triticum, 278–279 True polytomy, 191 True species tree, 177 Tumor suppressor genes, 62 Tunicates, 173, 306 2R hypothesis, 93 Ubiquitin, 120, 234 UDP-glucose dehydrogenase, 154 Ultrabithorax, 5 Unequal crossing-over, 107 Unicellular life cycle, 94 Unicellular organisms, 84 Unidirectional responsive backup circuits, 244 Unrooted trees, 191–192 Unswapped dimers, 146 Up-regulation, 238, 241, 243–244 Urochordata, 306 Utp14 gene, 7–8

Vav genes, 231 Vertebrate(s) characterized, 93, 174, 243, 255, 270 duplications, 31, 35, 44–45, 93 evolution, 300 1R/2R genome duplication, 304–305 segmental duplications, 93 Viruses, 123, 142–143 Vitis genes, 272 WD40 motif, 143 WDREB2 gene, 281 Weibull distribution, 14–15, 201 Welwitschia, 275 Wheat, 281 Whole-genome duplication (WGD) ancient, 34, 274–277 compared with SSD, 33, 44 complex, 46 dosage balance and, 60, 63, 111 duplicate retention, 32, 36–37, 47 environmental interactions, 68 expression divergence, 45–46 functional divergence, 24 identification problem, 42 networks, 216 phylogenetic trees, 201 population genetics, 14, 17–18 protein development and, 107 recognition of, 299–300 redundant duplicates, 233 retention profiles, 14 SSD compared with, 47–48 Whole-genome sequencing, 254 Wild-type genotypes, 123 Wnt pathway regulation, 234

Xenopus spp. laevis, 34, 37, 79, 96–97, 208, 300 tropicalis, 96–97, 256 X-linked genes, 7–8 X-ray crystallographic studies, 142 Yarrowia lipolytica, 238 YDR518W gene, 92 Yeast, see Saccharomyces cerevisiae 1,3-β-glucan synthase, 238, 244 duplicate cost of, 208 dispensability of, 232 divergence, 36–38 neofunctionalization, 34–35 redundant, 233 retention, 217 whole-genome, 40, 45 epistatic synergism, 247 expression divergence, 25 gene deletions, 39 gene expression costs duplication, 208 energy costs, 209–210 material costs, 210–211 genetic network, 97 genome duplications, 31, 174 glucose and, 239 maximum parsimony tree, 99 transcription factors, 41 YRR1 gene, 243 Zea, 272 Zebra finch, 265 Zebrafish, 34, 243, 245, 304 Zygotes, 107
