Proc. Natl. Acad. Sci. USA Vol. 94, pp. 7691–7697, July 1997 Colloquium Paper
This paper serves as an introduction to the following papers which were presented at a colloquium entitled ‘‘Genetics and the Origin of Species,’’ organized by Francisco J. Ayala and Walter M. Fitch, held January 30–February 1, 1997, at the National Academy of Sciences Beckman Center in Irvine, CA.
Genetics and the origin of species: An introduction
FRANCISCO J. AYALA* AND WALTER M. FITCH
Department of Ecology and Evolutionary Biology, University of California, Irvine, CA 92697-2525
Theodosius Dobzhansky (1900–1975) was a key author of the Synthetic Theory of Evolution, also known as the Modern Synthesis of Evolutionary Theory, which embodies a complex array of biological knowledge centered around Darwin’s theory of evolution by natural selection couched in genetic terms. The epithet ‘‘synthetic’’ primarily alludes to the artful combination of Darwin’s natural selection with Mendelian genetics, but also to the incorporation of relevant knowledge from biological disciplines. In the 1920s and 1930s several theorists had developed mathematical accounts of natural selection as a genetic process. Dobzhansky’s Genetics and the Origin of Species, published in 1937 (1), refashioned their formulations in language that biologists could understand, dressed the equations with natural history and experimental population genetics, and extended the synthesis to speciation and other cardinal problems omitted by the mathematicians. The current Synthetic Theory has grown around that original synthesis. It is not just one single hypothesis (or theory) with its corroborating evidence, but a multidisciplinary body of knowledge bearing on biological evolution, an amalgam of well-established theories and working hypotheses, together with the observations and experiments that support accepted hypotheses (and falsify rejected ones), which jointly seek to explain the evolutionary process and its outcomes. These hypotheses, observations, and experiments often originate in disciplines such as genetics, embryology, zoology, botany, paleontology, and molecular biology. Currently, the ‘‘synthetic’’ epithet is often omitted and the compilation of relevant knowledge is simply known as the Theory of Evolution. This is still expanding, just like one of those ‘‘holding’’ business corporations that have grown around an original enterprise, but continue incorporating new profitable enterprises and discarding unprofitable ones.
Darwin to Dobzhansky
Darwin summarized the theory of evolution by natural selection in the Origin of Species (2) as follows: ‘‘As many more individuals are produced than can possibly survive, there must in every case be a struggle for existence, either one individual with another of the same species, or with the individuals of distinct species, or with the physical conditions of life. . . . Can it, then, be thought improbable, seeing that variations useful to man have undoubtedly occurred, that other variations, useful in some way to each being in the great and complex battle of life, should sometimes occur in the course of thousands of generations? If such do occur, can we doubt (remembering that many more individuals are born than can possibly survive) that individuals having any advantage, however slight, over others, would have the best chance of surviving and of procreating their kind? On the other hand, we may feel sure that any variation in the least degree injurious would be rigidly destroyed. This preservation of favorable variations and the rejection of injurious variations, I call Natural Selection.’’ Darwin’s argument is that natural selection emerges as a necessary conclusion from two premises: (i) the assumption that hereditary variations useful to organisms occur, and (ii) the observation that more individuals are produced than can possibly survive.
The most serious difficulty facing Darwin’s evolutionary theory was the lack of an adequate theory of inheritance that would account for the preservation through the generations of the variations on which natural selection was supposed to act. Theories then current of ‘‘blending inheritance’’ proposed that offspring merely struck an average between the characteristics of their parents. As Darwin became aware, blending inheritance could not account for the conservation of variations, because differences among variant offspring would be halved each generation, rapidly reducing the original variation to the average of the preexisting characteristics.
The missing link in Darwin’s argument was provided by Mendelian genetics. About the time the Origin of Species was published, the Augustinian monk Gregor Mendel was performing a long series of experiments with peas in the garden of his monastery in Brünn, Austria-Hungary (now Brno, Czech Republic). Mendel’s paper, published in 1866, formulated the fundamental principles of a theory of heredity that accounts for biological inheritance through particulate factors (now called ‘‘genes’’) inherited one from each parent, which do not mix or blend but segregate in the formation of the sex cells, or gametes (3). Mendel’s discoveries, however, remained unknown to Darwin and, indeed, did not become generally known until 1900, when they were simultaneously rediscovered by several scientists.
In the meantime, Darwinism in the latter part of the 19th century faced an alternative evolutionary theory known as neo-Lamarckism. This hypothesis shared with Lamarck’s original theory the importance of use and disuse in the development and obliteration of organs, and it added the notion that the environment acts directly on organic structures, which explained their adaptation to the ways of life and environments of each organism. Adherents of this theory rejected natural selection as an explanation for adaptation to the environment. The rediscovery in 1900 of Mendel’s theory of heredity led to an emphasis on the role of heredity in evolution. In the Netherlands, Hugo de Vries (4) proposed a new theory of evolution known as mutationism, which essentially did away
Abbreviation: MHC, major histocompatibility complex. *To whom reprint requests should be addressed at: Department of Ecology and Evolutionary Biology, University of California, 321 Steinhaus Hall, Irvine, CA 92697-2525. e-mail: FJAYALA@UCI.EDU.
© 1997 by The National Academy of Sciences 0027-8424/97/947691-7$2.00/0 PNAS is available online at http://www.pnas.org.
Colloquium Paper: Ayala and Fitch
with natural selection as a major evolutionary process. According to de Vries (joined by other geneticists such as William Bateson in England), there are two kinds of variation in organisms. One is the ‘‘ordinary’’ variation observed among individuals of a species, which is of no lasting consequence in evolution because, according to de Vries, it could not ‘‘lead to a transgression of the species border even under conditions of the most stringent and continued selection.’’ The other consists of the changes brought about by mutations, spontaneous alterations of genes that yield large modifications of the organism and give rise to new species. According to de Vries, a new species originates suddenly, produced by the existing one without any visible preparation and without transition. Mutationism was opposed by many naturalists, and in particular by the so-called biometricians, led by Briton Karl Pearson, who defended Darwinian natural selection as the major cause of evolution through the cumulative effects of small, continuous, individual variations (which the biometricians assumed passed from one generation to the next without being subject to Mendel’s laws of inheritance). The controversy between mutationists (also referred to at the time as Mendelians) and biometricians approached a resolution in the 1920s and 1930s through the theoretical work of several geneticists (5). These geneticists used mathematical arguments to show, first, that continuous variation (in such characteristics as size, number of eggs laid, and the like) could be explained by Mendel’s laws; and second, that natural selection acting cumulatively on small variations could yield major evolutionary changes in form and function. Distinguished members of this group of theoretical geneticists were R. A. Fisher and J. B. S. Haldane in Britain and Sewall Wright in the United States (6–8). 
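The first of these results can be illustrated with a minimal sketch, which is not the derivation used by Fisher, Haldane, or Wright. It assumes, for illustration only, a trait determined additively by `n_loci` unlinked Mendelian loci of equal effect, random mating, a trait-increasing allele at frequency `p` at every locus, and no environmental influence; the function name and its parameters are hypothetical. Under those assumptions the number of trait-increasing alleles carried by an individual follows a binomial distribution, so a single locus gives three discrete classes, whereas ten loci already give 21 finely graded classes that resemble continuous variation.

```python
from math import comb


def phenotype_distribution(n_loci, p=0.5):
    """Return the probability of carrying k trait-increasing alleles, for k = 0..2*n_loci.

    Assumes a diploid individual, independent loci, random mating, and the same
    allele frequency p at every locus (all illustrative simplifications).
    """
    copies = 2 * n_loci  # two allele copies per locus in a diploid
    return [comb(copies, k) * p**k * (1 - p) ** (copies - k) for k in range(copies + 1)]


one_locus = phenotype_distribution(1)   # three classes, as in a single Mendelian cross
ten_loci = phenotype_distribution(10)   # 21 classes, densest at the intermediate value

for label, dist in (("1 locus", one_locus), ("10 loci", ten_loci)):
    print(label, "classes:", len(dist), "most common class:", dist.index(max(dist)))
```

With one locus the classes appear in the familiar 1:2:1 proportions; adding loci multiplies the number of classes and concentrates individuals around an intermediate value, which is how discrete factors can underlie apparently continuous traits.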
Their work contributed to the downfall of mutationism and, most importantly, provided a theoretical framework for the integration of genetics into Darwin’s theory of natural selection. Yet their work had a limited impact on contemporary biologists because it was formulated in a mathematical language that most biologists could not understand; because it was almost exclusively theoretical, with little empirical corroboration; and because it was limited in scope, largely omitting many issues, such as speciation, that were of great importance to evolutionists. Dobzhansky’s Genetics and the Origin of Species advanced a reasonably comprehensive account of the evolutionary process in genetic terms, laced with experimental evidence supporting the theoretical arguments. It had an enormous impact on naturalists and experimental biologists, who rapidly embraced the new understanding of the evolutionary process as one of genetic change in populations. Interest in evolutionary studies was greatly stimulated, and contributions to the theory soon began to follow, extending the synthesis of genetics and natural selection to a variety of biological fields. The main writers who, together with Dobzhansky, may be considered the architects of the synthetic theory were the zoologists Ernst Mayr (9) and Julian Huxley (10), the paleontologist George G. Simpson (11), and the botanist George Ledyard Stebbins (12). [The National Academy of Sciences held in January 1994 a colloquium (13) to commemorate the 50th anniversary of the publication of Simpson’s seminal book, Tempo and Mode in Evolution (11).] These researchers contributed to a burst of evolutionary studies in the traditional biological disciplines and in some emerging ones—notably population genetics and, later, evolutionary ecology. By 1950 acceptance of Darwin’s theory of evolution by natural selection was universal among biologists, and the synthetic theory had become widely adopted. 
The line of thought of Genetics and the Origin of Species is surprisingly modern—in part, no doubt, because it established the pattern that successive evolutionary investigations and treatises largely would follow. Dobzhansky writes in the preface: ‘‘The problem of evolution may be approached in two
different ways. First, the sequence of the evolutionary events as they have actually taken place in the past history of various organisms may be traced. Second, the mechanisms that bring about evolutionary changes may be studied. . . . The present book is dedicated to a discussion of the mechanisms of species formation in terms of the known facts and theories of genetics.’’ The book starts with a consideration of organic diversity and discontinuity. Successively, it deals with mutation as the origin of hereditary variation, the role of chromosomal rearrangements, variation in natural populations, natural selection, the origin of species by polyploidy, the origin of species through gradual development of reproductive isolation, physiological and genetic differences between species, and the concept of species as natural units. The book’s organization was largely preserved in the second (1941) and third (1951) editions, and in Genetics of the Evolutionary Process (14), published in 1970, a book that Dobzhansky thought of as the fourth edition of the earlier one, but that had changed too much for publication under the same title. Dobzhansky sought to extend the evolutionary synthesis to mankind in numerous articles and several books, most notably Mankind Evolving (15), published in 1962, a book that many judge to be as important as Genetics and the Origin of Species. Dobzhansky was a leading experimentalist and prolific writer, who published several books and nearly 600 papers dealing with leading questions in population and evolutionary genetics, as well as with philosophical problems and humanistic issues. The experimental organisms of most of his research were Drosophila fruitflies.
A Man for All Seasons
Theodosius Dobzhansky was born on January 25, 1900, in Nemirov, a small town 200 km southeast of Kiev in the Ukraine.
He was the only child of Sophia Voinarsky and Grigory Dobrzhansky (precise transliteration of the Russian family name includes the letter ‘‘r’’), a teacher of high school mathematics. In 1910 the family moved to the outskirts of Kiev, where Dobzhansky lived through the tumultuous years of World War I and the Bolshevik revolution. In those times the family was often beset by various privations, including hunger. In his unpublished autobiographical Reminiscences for the ‘‘Oral History Project’’ of Columbia University, Dobzhansky states that his decision to become a biologist was made about 1912. Through his early high school years, Dobzhansky became an avid butterfly collector. A school teacher gave him access to a microscope that Dobzhansky used particularly during the long winter months. In the winter of 1915–1916 he met Victor Luchnik, a 25-year-old college drop-out, who was a dedicated entomologist specializing in Coccinellidae beetles. Luchnik convinced Dobzhansky that butterfly collecting would not lead anywhere and that he should become a specialist. Dobzhansky chose to work with ladybird beetles, which would be the subject of his first scientific publication in 1918. (Reference to Dobzhansky’s publications can be found in the extensive bibliography published by the National Academy of Sciences, ref. 16.) Dobzhansky graduated in biology from the University of Kiev in 1921. Before his graduation, he was hired as an instructor in zoology at the Polytechnic Institute in Kiev. He taught there until 1924, when he became an assistant to Yuri Filipchenko, head of the new department of genetics at the University of Leningrad. Filipchenko was familiar with Thomas Hunt Morgan’s work in the United States and had started a Drosophila laboratory, where Dobzhansky was encouraged to investigate the pleiotropic effects of genes. 
In 1927, Dobzhansky obtained a fellowship from the International Education Board (Rockefeller Foundation) and arrived in New York on December 27 to work with Thomas Hunt Morgan at Columbia University. In the summer of 1928 he
followed Morgan to the California Institute of Technology, where Dobzhansky was appointed assistant professor of genetics in 1929, and professor of genetics in 1936. In 1940 he returned to New York as professor of zoology at Columbia University, where he remained until 1962, when he became professor at the Rockefeller Institute (renamed Rockefeller University in 1965) also in New York City. On July 1, 1970, Dobzhansky became professor emeritus at Rockefeller University; in September 1971, he moved to the Department of Genetics at the University of California, Davis, where he was adjunct professor until his death in 1975. On August 8, 1924, Dobzhansky married Natalia (Natasha) Sivertzev, a geneticist in her own right, who was at the time working with the famous Russian biologist I. I. Schmalhausen in Kiev. Natasha was Dobzhansky’s faithful companion and occasional scientific collaborator until her death from coronary thrombosis on February 22, 1969. The Dobzhanskys had only one child, Sophie, married until her recent death to Michael D. Coe, professor of anthropology at Yale University. In a routine medical check-up on June 1, 1968, it was discovered that Dobzhansky suffered from chronic lymphatic leukemia, the least malignant form of leukemia. He was given a prognosis of ‘‘a few months to a few years’’ of life expectancy. Over the following 7 years, the progress of the leukemia was unexpectedly slow and, surprising to his physicians, it had little if any noticeable effect on his energy and work habits. However, the disease took a conspicuous turn for the worse in the summer of 1975. In mid-November Dobzhansky started to receive chemotherapy, but continued living at home and working at the laboratory. He was convinced that the end of his life was near and dreaded that he might become unable to work and to care for himself. This never came to pass. He died of heart failure on the morning of December 18, 1975.
The previous day, he had been working in the laboratory. Dobzhansky was an excellent teacher and distinguished educator of scientists. Throughout his academic career he had more than 30 graduate students and an even greater number of postdoctoral and visiting associates, many of them from foreign countries. Some of the most distinguished geneticists and evolutionists in the United States and abroad are his former students. Dobzhansky spent long periods of time in foreign academic institutions, and was largely responsible for the establishment or development of genetics and evolutionary biology in various countries, notably Brazil, Chile, and Egypt. Dobzhansky gave generously of his time to other scientists, particularly to young ones and to students. But he resented time spent in committee activities, which he shunned as often as he reasonably could. Throughout his academic career, he avoided administrative posts, alleging, perhaps correctly, that he had neither temperament nor ability for management. Most certainly, he preferred to dedicate his working time to research and writing rather than to administration. Dobzhansky was a world traveler and an accomplished linguist able to speak fluently six languages and to read several more. He was a good naturalist and never lacked time for a hike in the California Sierras, the New England forests, or the Amazon jungles. He loved horseback riding but practiced no other sports. Dobzhansky’s interests included the visual arts, music, history, Russian literature, cultural anthropology, philosophy, religion, and, of course, science. His artistic preferences were unsystematic and definitely traditional. His favorite composer was Beethoven followed by Bach and other baroques; he loved Italian operas, but had little appreciation for most twentieth century music and a definite distaste for atonalism. (Of electronic and computer-composed music, he said that it is fit only for computers to listen to it.) 
In art, Dobzhansky admired the Italian Renaissance painters as well as the Dutch and Spanish masters of the seventeenth century; he appreciated the French Impressionists but detested cubism and all subsequent styles and schools of modern art.
Dobzhansky’s obvious personality traits were magnanimity and expansiveness. He recognized and generously praised the achievements of other scientists; he admired the intellect of his colleagues, even when admiration was alloyed with disagreement. He made many long-lasting friendships, usually started by professional interaction. Many of Dobzhansky’s friends were scientists younger than himself, who either had worked in his laboratory as students, postdoctorals, or visitors, or had met him during his travels. He was conspicuously affectionate and loyal toward his friends; he expected affection and loyalty in return. Dobzhansky’s exuberant personality was manifest not only in his friendships but also in his antipathies, which he was seldom able, or willing, to hide. Dobzhansky was a religious man, although he apparently rejected fundamental beliefs of traditional religion, such as the existence of a personal God and of life beyond physical death. His religiosity was grounded on the conviction that there is meaning in the universe. He saw that meaning in the fact that evolution has produced the stupendous diversity of the living world and has progressed from primitive forms of life to mankind. Dobzhansky held that, in man, biological evolution has transcended itself into the realm of self-awareness and culture. He believed that somehow mankind would eventually evolve into higher levels of harmony and creativity. He was a metaphysical optimist. Dobzhansky’s prodigious scientific productivity was made possible by incredible energy and very disciplined work habits. His enormous success as the creator of new ideas and as a synthesizer was, at least in part, based on his broad knowledge, phenomenal memory, and an incisive mind able to see the relevance that a new discovery or a new theory might have with respect to other theories or problems. 
His success as an experimentalist depended on a wise blending of field and laboratory research; whenever possible he combined both in the study of a problem, often using laboratory studies to ascertain or to confirm the causal processes involved in the phenomena discovered in nature. He obtained the collaboration of mathematicians to design theoretical models for experimental testing and to analyze statistically his empirical observations. He was no inventor or gadgeteer, but he had an uncanny ability to exploit the possibilities of any suitable experimental apparatus or experimental method. Dobzhansky received many honors and awards. He was president of several professional organizations, including the Genetics Society of America (1941), the American Society of Naturalists (1950), the Society for the Study of Evolution (1951), the American Society of Zoologists (1963), the American Teilhard de Chardin Association (1969), and the Behavior Genetics Association (1973). He was a member of the National Academy of Sciences, the American Academy of Arts and Sciences, the American Philosophical Society, and of many foreign academies, such as the Royal Society of London. He received more than 20 honorary degrees from universities in the United States and abroad. He received the Daniel Giraud Elliot Medal (1946) and the Kimber Genetics Award (1958) from the National Academy of Sciences and numerous other medals, including the National Medal of Science, which he received in January 1964 from President Lyndon Baines Johnson (16, 17). The 16 papers that follow were presented at a colloquium sponsored by the National Academy of Sciences to celebrate the 60th anniversary of the publication of Dobzhansky’s Genetics and the Origin of Species. These papers are organized into four successive sections: Genetic Variation and Its Origins, Adaptation and Natural Selection, Population Differentiation and Speciation, and Patterns of Evolution. 
Genetic Variation and Its Origins
In 1937, when Dobzhansky published Genetics and the Origin of Species (1), the structure of DNA had not yet been discovered, nor
were there any grounds to anticipate the tremendous impact that molecular biology would have on evolutionary research. We now know how genes are organized and function, and we can ask primeval questions such as what the original organisms were like or how ur-genes were organized. Walter Gilbert advanced in 1987 ‘‘the exon theory of genes’’ (18; see also 19) contending that introns have been around since the progenote, the earliest genetic organism, as spacers between the early, simple genes, and were thereafter used to assemble the complex genes that would later evolve as coalitions of the primitive ones. This hypothesis has been challenged with the alternative proposal that introns came about late in evolution and had nothing to do with the arrangement and rearrangement of gene pieces. Walter Gilbert, S. J. de Souza, and M. Long in ‘‘Origin of Genes’’ (20) review the two theories, as well as an intermediate position proposing that introns arose at the beginning of multicellularity and played a major role during the Cambrian explosion in creating new genes by exon shuffling. The authors argue that if exon shuffling originated with the progenote, exons should consist of an integer number of codons and should be correlated with compact regions of polypeptides. The evidence that they now present, they say, strongly supports the case. The ultimate source of genetic variation was thought to be, at the time of the publication of Genetics and the Origin of Species, gene mutation. Dobzhansky was soon to realize that chromosomal mutations could also play important roles in the evolution sweepstakes. The significance of the transposable elements, first discovered by Barbara McClintock in the 1940s, would become apparent only several decades later. Transposable elements, say Margaret G. Kidwell and Damon Lisch (21), are ubiquitous in many kinds of organisms and account for 10–15% of the Drosophila’s genome and more than 50% of maize’s. 
Transposable elements provide, indeed, genetic variation on a scale and variety that could hardly have been imagined even a few years ago. Kidwell and Lisch point out the manifold effects of transposable elements. In the genotype, they are involved in many gene mutations, are ubiquitous, and incessantly shift their numbers and locations. Transposable elements modify phenotypes as well, subtly in some cases, causing drastic alterations of development and organization in others. From an evolutionary perspective, transposable elements may be seen as parasites of genomes, but as with other parasites, organisms have often become coadapted with them and have even learned to subvert them for their own benefit. The word ‘‘virus’’ does not appear in the index of any of the three editions of Genetics and the Origin of Species. By 1970, when Genetics of the Evolutionary Process (14) was published, viruses had become a favored organism of molecular genetics, and the term ‘‘viruses’’ is represented by six entries in the index, mostly referring to bacteriophages, but there is also a discussion of the myxomatosis virus, introduced in 1950 in Australia to control a rabbit population explosion. Two decades later, the accumulation of virus gene sequences combined with the development of new phylogenetic methodologies has brought viruses into the mainstream of molecular evolution. Important insights that have been gained concern evolutionary processes but also epidemiology, public health, and geographic patterns of human migrations. Walter M. Fitch and colleagues (22) investigate the HA1 domain of the hemagglutinin gene from human influenza A viruses isolated throughout the world from 1984 to 1996. The gene is evolving at a rate of 5.7 × 10⁻³ substitutions per site per year, about one million times faster than cellular genes.
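The scale of this rate can be made concrete with a little arithmetic. The sketch below uses only the figures quoted above (the rate, the 1984–1996 sampling window, and the millionfold ratio); the variable names are illustrative, and the cellular rate is simply the value implied by that ratio, not an independent estimate.

```python
# Rate quoted in the text for the HA1 domain of influenza A hemagglutinin.
rate_per_site_per_year = 5.7e-3

# Length of the sampling window quoted in the text (1984 to 1996).
years_sampled = 1996 - 1984

# Expected substitutions accumulated per nucleotide site over that window.
expected_substitutions_per_site = rate_per_site_per_year * years_sampled

# Cellular-gene rate implied by "about one million times faster" (an implied figure only).
implied_cellular_rate = rate_per_site_per_year / 1e6

print(f"Expected substitutions per site over {years_sampled} years: "
      f"{expected_substitutions_per_site:.4f}")
print(f"Implied cellular rate per site per year: {implied_cellular_rate:.1e}")
```

Over the 12 years sampled, the quoted rate corresponds to roughly 0.07 substitutions per site, that is, about 7 changes per 100 nucleotide sites along a single lineage.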
In several positions of hemagglutinin a majority of the nucleotide substitutions are nonsynonymous—i.e., result in amino acid replacements—which strongly supports positive Darwinian selection rather than neutral evolution. The authors aver that
gene sequence phylogenies may manifest which isolates are most likely to cause future epidemics and might therefore be used for vaccine production. Dobzhansky’s interest in human genetic diversity was motivated by science but also by his enduring concern with the human predicament. He saw that the pervasiveness of genetic diversity was the foundation of human individuality but provided no grounds for any sort of discrimination. Equality—as in equality in law and equality of opportunity—‘‘pertains to the rights and the sacredness of life of every human being’’ (ref. 23, p. 4). In Mankind Evolving (15, p. 18) he wrote that ‘‘Human evolution has two components, the biological or organic, and the cultural or superorganic. These components are neither mutually exclusive nor independent, but interrelated and interdependent. Human evolution cannot be understood as a purely biological process, nor can it be adequately described as a history of culture. It is the interaction of biology and culture. There exists a feedback between biological and cultural processes.’’ For more than three decades, L. L. Cavalli-Sforza has sought to elucidate the geographic origins and dispersal patterns of human populations by investigating gene frequency distributions. Genetic information has accumulated exponentially, encompassing protein-encoding genes, nuclear and mitochondrial, as well as microsatellite and other DNA sequences. ‘‘Genes, Peoples, and Languages’’ (24) emphasizes the African origin of modern humans, whence the other continents were colonized starting ≈100,000 years ago, first West Asia, then East Asia and Oceania, both probably through the coastal route of South Asia, and later Europe and America, both from East Asia and the latter certainly from the north, via the Bering land passage created in the ice ages.
Cavalli-Sforza sees that the genetic conclusions are confirmed by trees of linguistic families, although these are temporally shallow.
Adaptation and Natural Selection
Starting in the late 1960s gel electrophoresis of soluble enzymes uncovered stores of genetic variation, much greater than had been suspected, in all sorts of animal and plant populations, as well as bacteria and other microorganisms. Whether this variation is adaptively important or just neutral noise became a matter of debate. The 1980s ushered in populational DNA sequencing. Much additional variation was discovered in the form of nucleotide differences between haplotypes. We now know that any two haplotypes of any gene differ on the average by several nucleotide substitutions, although most do not yield amino acid differences in the encoded protein. The neutral-selection controversy rages on. Richard R. Hudson and collaborators (25) investigate the Sod gene (coding for the Cu,Zn superoxide dismutase) in Drosophila melanogaster, where an unusual polymorphism prevails. At the protein level two alleles, Fast and Slow, are discerned, with Slow absent in some populations but reaching frequencies ≈5–15% in many others. It turns out that all Slow alleles have identical DNA sequences (with trivial exceptions) even when they originate from different world continents. The Fast alleles fall into two categories: roughly half of them are identical, whether they come from Europe, Asia, or the Americas; the other half are heterogeneous, most of them distinguished from each other by several nucleotide differences. Adding to the puzzle is that the Fast alleles that are identical to each other are also identical to the Slow alleles except for the one nucleotide substitution that accounts for their different amino acid composition. Hudson and collaborators conclude that within the last few thousand years a previously rare allele has rapidly risen in frequency to the present levels.
The process was driven by fairly strong natural selection.
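How quickly selection can raise a rare allele to such frequencies can be gauged with a rough sketch. It is not the analysis of Hudson and collaborators; it assumes a simple deterministic model of genic selection in a very large population, and the starting frequency, target frequency, and selection coefficient `s` used below are illustrative values, not estimates from the Sod data. The function name is hypothetical.

```python
def generations_to_reach(p0, target, s):
    """Count generations for an allele with relative fitness 1 + s to rise from p0 to target."""
    if not (0 < p0 < target < 1) or s <= 0:
        raise ValueError("need 0 < p0 < target < 1 and s > 0")
    p, generations = p0, 0
    while p < target:
        # Deterministic genic selection: weight the favored allele by 1 + s,
        # then divide by mean fitness so frequencies again sum to one.
        p = p * (1 + s) / (1 + p * s)
        generations += 1
    return generations


# Illustrative values only: rise from 0.1% to 10% under a 1% and a 10% advantage.
gens_weak = generations_to_reach(0.001, 0.10, 0.01)
gens_strong = generations_to_reach(0.001, 0.10, 0.10)
print("s = 0.01:", gens_weak, "generations; s = 0.10:", gens_strong, "generations")
```

In this simplified model a tenfold stronger advantage shortens the rise roughly tenfold, which is consistent with reading a rapid, recent increase in frequency as a sign of fairly strong selection.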
Polymorphisms shared between species were investigated long and hard by Dobzhansky, mostly chromosomal rearrangements present in two closely related species, Drosophila pseudoobscura and Drosophila persimilis. DNA sequencing has uncovered numerous trans-specific polymorphisms, notably in the genes of the major histocompatibility complex (MHC) of mammals, where some shared alleles have persisted for millions of years. In plants of the family Solanaceae, alleles that are self-incompatible in fertilization have persisted across species barriers for 70 million years. Drosophila species also share DNA sequence polymorphisms that are several million years old. Andrew G. Clark (26) develops mathematical models seeking to elucidate the causes of trans-specific shared polymorphisms. The shared self-incompatibility polymorphisms of plants and MHC alleles of humans and other primates are maintained by strong natural selection, because the protein products accrue a fitness advantage to the bearer of those alleles if they are different. The Drosophila polymorphisms, however, are recent enough that they might have persisted by neutral drift. Three decades ago, Zuckerkandl and Pauling (27) conjectured that morphological evolution is largely caused by changes in the expression of genes, rather than in the amino acid sequence of the encoded polypeptides. Natalia A. Tamarina, Michael Z. Ludwig, and Rollin C. Richmond (28) explore the issue in two homologous genes in two species, D. melanogaster (Est-6) and D. pseudoobscura (Est-5B). The coding regions of these two genes share 80% of their nucleotide and amino acid sequences. The regions flanking the genes are, in contrast, so different that it is difficult to align their sequences to ascertain homology. Tamarina and colleagues (28) make recourse to the magician’s bag of tricks available to Drosophila geneticists. They pick up regulatory DNA segments from D.
pseudoobscura and introduce them in the appropriate locations of D. melanogaster. The expression of the gene in the D. melanogaster transgenic flies becomes substantially altered. The expression of the two genes in normal flies follows similar patterns, yet the gene-regulating apparatus has become different in the two species.

It was not until the 1970s that demography was integrated into the theory of the dynamics of natural selection. Population genetics theory had until then treated all individuals in a population as effectively equivalent, without taking into account longevity, age-dependent fecundity, and other life history parameters. The beginnings of an integration of the theories of population ecology and population genetics appeared in the 1970s, although this integration never engaged much attention from theorists or experimentalists, perhaps because of the many complexities involved. Dobzhansky and some of his students and collaborators made important experimental contributions to the problems (see refs. 29–35). Wyatt W. Anderson and Takao K. Watanabe (36) analyze life history schedules of births and deaths to investigate the outcome of laboratory population experiments involving several chromosomal arrangements of D. pseudoobscura in various combinations. As it happens, all possible genetic outcomes occur: stable polymorphic equilibrium, unstable polymorphic equilibrium, and fixation for one of the alternatives. The authors conclude that, in these populations, both viability and fertility are important fitness components. Age-dependent female fecundity plays a particularly significant role in the outcome.

Population Differentiation and Speciation

The concept of species is fundamental in evolutionary theory. The modern understanding of this concept can be traced to 1935 when Dobzhansky introduced what is now known as the ‘‘biological species concept’’ (37). Dobzhansky defined species
as ‘‘that stage in the evolutionary process at which the once actually or potentially interbreeding array of forms becomes segregated in two or more separate arrays which are physiologically incapable of interbreeding’’ (37; also ref. 1, p. 312). Dobzhansky saw that the species is not only a category of classification, but in sexual organisms also a natural unit defined by the ability to interbreed or its absence. He called attention to the determining role played by reproductive ‘‘isolating mechanisms,’’ a term that he created.

The biological species concept has recently been challenged on the grounds that it unduly neglects phylogeny. John C. Avise and Kurt Wollenberg (38) examine this criticism by bringing to bear recent gene coalescence theory with an analysis of multiple, gender-defined pathways in genealogical pedigrees. They conclude that the supposed sharp distinction between the biological species concept and the phylogenetic constructs favored by the critics is illusory. ‘‘Historical descent and reproductive ties,’’ they write, ‘‘are related aspects of phylogeny, and jointly illuminate biotic discontinuity.’’

Among the reproductive isolating mechanisms identified by Dobzhansky (1) there was one, later called ‘‘gametic isolation,’’ occurring when the ‘‘spermatozoa fail to reach the eggs, or to penetrate into the eggs; in higher plants, the pollen tube growth may be arrested if foreign pollen is placed on the stigma of the flower.’’ Therese Markow (39) notes that the investigation of gametic isolation as an evolutionary mechanism has been unduly neglected. In the genus Drosophila alone, a huge diversification exists in the size and pattern of gametes and other internal reproductive traits affecting fertilization. For fertilization to occur, ‘‘sperm must successfully enter the female and be transported to the storage organs. . .
[and] must stay alive with adequate motility until they are utilized by the female.’’ Markow examines how these steps fail in different cases and draws a richly patterned quilt that one can see will likely be much extended as other organisms are investigated. The evolutionary possibilities by which these variegations may come about are virtually infinite. Coniferous forests and oak woodlands along the North American Pacific Coast are inhabited by Ensatina terrestrial salamanders. Several species were thought to occur in California, but detailed morphological and coloration analysis led to the conclusion in 1949 that various forms were parts of only one polytypic species arranged in the form of a ring around the Central Valley of California (40). Dobzhansky (41) saw that virtually all stages in a speciation process could be identified along the ring, with complete reproductive isolation between the terminal populations meeting in the southern part of the valley. In Dobzhansky’s view speciation was thwarted by ongoing gene flow via the intermediate populations around the ring. Wake and colleagues demonstrated in the late 1980s (42–44) that gene flow could not hold the complex together: an analysis of protein variation in 19 populations along the ring disclosed great genetic differentiation among populations. David B. Wake (45) reviews mitochondrial DNA and other variation. The Ensatina population array is old, consisting of a number of geographically and genetically distinct components that have reached or approximate full species level. The evolutionary history elucidated is extremely complex, with repeated interludes of geographic separation and genetic interactions upon renewed contact. Peter R. Grant and Rosemary Grant (46) see that Dobzhansky’s Genetics and the Origin of Species is an appropriate starting point for investigating the speciation process and the underlying genetic changes. 
But in one respect, they note, Dobzhansky’s book is disappointing because it says nothing about the genetics of birds, which are their consuming research interest. Birds are made to serve a good purpose for illustrating geographical patterns of morphological variation within species, adaptation to newly colonized habitats, rapid radiation in
archipelagos, and interspecies competition. The evolution of reproductive isolation is considered, but ‘‘the genetics of speciation are the genetics of other organisms, mainly Drosophila.’’ Peter and Rosemary Grant note differences between speciation in birds and speciation in Drosophila. It is significant that in birds the behavioral barriers that prevent mating evolve first, whereas post-mating isolation typically evolves much later, perhaps after gene exchange has all but ceased. Premating isolation in birds may arise from nongenetic causes, often from factors such as song, which in many groups of birds is culturally inherited through an imprinting-like process. Of the factors involved in pre-mating isolation, such as plumage, morphology, and behavior, some are under single-gene control, but most are polygenically determined.

Patterns of Evolution

The universal tree of life consists of three domains, or ‘‘empires’’: bacteria, archaea, and eukarya. The three multicellular kingdoms, animals, plants, and fungi, are just 3 of the 10–12 extant major branches of the eukaryote domain. Molecular evolutionary investigations in the last decade have elucidated the large genetic diversity encompassed by the set of all eukaryotes and, hence, the reduced proportion represented by the multicellular kingdoms. The existence and great genetic heterogeneity of the archaea have also been discovered by molecular evolutionists in the last few years, as have most of their species and higher taxonomic groups. The reconstruction of the universal tree and the assessment of the genetic diversity of each branch are buttressed by the hypothesis of the molecular clock of evolution, which has many other applications in evolutionary studies. How good is the molecular clock?
It has been known for some time that the time variance of molecular evolutionary events is larger than would be expected if the molecular clock were a stochastic clock, like the radioactive decay of isotopes. Francisco Ayala in ‘‘Vagaries of the Molecular Clock’’ (47) reviews two clocks, the genes Gpdh and Sod, investigated in his laboratory. Gpdh evolves in Drosophila very slowly, at a rate of 1.1 × 10⁻¹⁰ amino acid replacements per site per year. But the rate is much faster, ≈4.5 × 10⁻¹⁰, in mammals, between Dipteran families, between animal phyla, or between plants, animals, and fungi. On the other hand, Sod evolves very fast in Drosophila, ≈16 × 10⁻¹⁰, which is also the rate in mammals and between Dipteran families; but the rate becomes much slower, 5.3 × 10⁻¹⁰, between animal phyla, and still slower, 3.3 × 10⁻¹⁰, between plants, animals, and fungi. If one were to assume that Gpdh and Sod are good clocks and project the Drosophila rate to estimate the time of divergence of the three multicellular kingdoms, Gpdh would yield an estimate of 3,990 million years and Sod an estimate of 224 million years, both very much off the commonly accepted divergence time of ≈1,100 million years. It is unlikely that many molecular clocks are as erratic as Gpdh or Sod, but molecular clocks should be applied with caution, particularly when remote extrapolations are made.

The hypothesis of the molecular clock was originally predicated on the assumption that the evolutionary replacement of one amino acid for another, or one nucleotide for another, is most often of no adaptive consequence. If this assumption held, molecular evolution would be governed by a time-dependent stochastic process. The assumption of adaptive inconsequence seems safest in the case of synonymous nucleotide substitutions, which do not change the amino acids encoded by a gene.
Jeffrey Powell and Etsuko Moriyama (48) explore a vexing problem, namely that organisms do not use alternative synonymous codons with the frequencies expected if synonymous substitutions were inconsequential. The deviations from random expectations are large
in Drosophila genes and they often persist through long periods of evolution. Powell and Moriyama (48) exclude differential mutation rates as the cause of the codon bias. Rather, they conclude that natural selection is the cause. The determining factor is the relative abundance of the tRNAs that execute the translation of genes into proteins: genes that are expressed at high rates favor codons that match those tRNAs that are more abundant. The genes in the nucleus of plants often occur as ‘‘families’’—i.e., a gene encoding a particular polypeptide may exist in several copies of more or less remote evolutionary origin. Michael Clegg, Michael P. Cummings, and Mary L. Durbin investigate ‘‘The Evolution of Plant Nuclear Genes’’ (49) by focusing on three gene families, rbcS, Chs, and Adh. Additional copies are recruited at different rates in these families: new Chs and rbcS genes are recruited 20 times faster than Adh genes. The multiplication of gene copies and their divergence is particularly notable for Chs genes in the evolution of flowering plants. The evolution of Adh in monocot plants is not consistent with the molecular clock hypothesis even for synonymous nucleotide substitutions. Clegg and colleagues conclude that natural selection plays a significant role in driving the evolutionary divergence of duplicated genes. They add that new alleles often arise by intragene recombination (49). Multigene families occur in animals as well as in plants. Notable in humans and other mammals are genes associated with the immune system, such as the MHC genes and immunoglobulin (Ig) genes. Some multigene families, in animals as in plants, arise by concerted evolution, a process that generates new genes by interlocus recombination or gene conversion. Masatoshi Nei, Xun Gu and Tatyana Sitnikova (50) raise the question whether concerted evolution may account for the MHC and Ig families, as some authors have suggested. 
They note that member genes of these families are often more similar to homologous genes from different species than they are to other member genes within the same species. This would not be expected if concerted evolution were the main originating process of gene multiplication within a family. Phylogenetic analyses of several MHC and Ig multigene families display patterns inconsistent with the concerted evolution hypothesis. The evidence favors the conclusion that the creation of new genes by gene duplication has repeatedly occurred in the evolutionary history of organisms. Some duplicated genes persist in the diversified descendant species for a long time; others effectively disappear, either because they are deleted or have become nonfunctional by deleterious mutations.

We are grateful to the National Academy of Sciences for the generous grant that financed the colloquium and to Kenneth Fulton and Edward Patte, and the staff of the Arnold and Mabel Beckman Center for their skill and generous assistance during the colloquium and its preparation. Special gratitude is owed to Denise Chilcote, who was responsible for the colloquium’s logistics at all stages, and for her gracious and dedicated performance. Most of all, we are grateful to the speakers and their co-authors for their wonderful contribution to the colloquium and in the papers that follow. We have borrowed extensively from ref. 16 in the preparation of Dobzhansky’s biographical statement.

1. Dobzhansky, Th. (1937) Genetics and the Origin of Species (Columbia Univ. Press, New York); 2nd Ed., 1941; 3rd Ed., 1951.
2. Darwin, C. (1859) On the Origin of Species by Means of Natural Selection (Murray, London).
3. Mendel, G. (1866) Verh. Naturforsch. Vereines Abhandlungen Brünn 4, 3–47.
4. de Vries, H. (1900) Rev. Gen. Bot. 12, 257–271.
5. Provine, W. B. (1971) The Origins of Theoretical Population Genetics (Univ. of Chicago Press, Chicago).
6. Fisher, R. A. (1930) The Genetical Theory of Natural Selection (Clarendon, Oxford).
7. Haldane, J. B. S. (1932) The Causes of Evolution (Harper, New York).
8. Wright, S. (1931) Genetics 16, 97–159.
9. Mayr, E. (1942) Systematics and the Origin of Species (Columbia Univ. Press, New York).
10. Huxley, J. S. (1942) Evolution: The Modern Synthesis (Harper, New York).
11. Simpson, G. G. (1944) Tempo and Mode in Evolution (Columbia Univ. Press, New York).
12. Stebbins, G. L. (1950) Variation and Evolution in Plants (Columbia Univ. Press, New York).
13. Fitch, W. M. & Ayala, F. J., eds. (1995) Tempo and Mode in Evolution (National Academy Press, Washington, DC).
14. Dobzhansky, Th. (1970) Genetics of the Evolutionary Process (Columbia Univ. Press, New York).
15. Dobzhansky, Th. (1962) Mankind Evolving (Yale Univ. Press, New Haven, CT).
16. Ayala, F. J. (1985) Biogr. Mem. Natl. Acad. Sci. U.S.A. 55, 163–213.
17. Ayala, F. J. (1990) in Dictionary of Scientific Biography, ed. Gillespie, C. C. (Scribner’s, New York), Vol. 17, Suppl. II, pp. 233–242.
18. Gilbert, W. (1987) Cold Spring Harbor Symp. Quant. Biol. 52, 901–905.
19. Doolittle, W. F. (1978) Nature (London) 272, 581–582.
20. Gilbert, W., de Souza, S. J. & Long, M. (1997) Proc. Natl. Acad. Sci. USA 94, 7698–7703.
21. Kidwell, M. G. & Lisch, D. (1997) Proc. Natl. Acad. Sci. USA 94, 7704–7711.
22. Fitch, W. M., Bush, R. M., Bender, C. A. & Cox, N. J. (1997) Proc. Natl. Acad. Sci. USA 94, 7712–7718.
23. Dobzhansky, Th. (1973) Genetic Diversity and Human Equality (Basic Books, New York).
24. Cavalli-Sforza, L. L. (1997) Proc. Natl. Acad. Sci. USA 94, 7719–7724.
25. Hudson, R. R., Sáez, A. G. & Ayala, F. J. (1997) Proc. Natl. Acad. Sci. USA 94, 7725–7729.
26. Clark, A. G. (1997) Proc. Natl. Acad. Sci. USA 94, 7730–7734.
27. Zuckerkandl, E. & Pauling, L. (1965) in Evolving Genes and Proteins, eds. Bryson, V. & Vogel, H. J. (Academic, New York), pp. 97–166.
28. Tamarina, N. A., Ludwig, M. Z. & Richmond, R. C. (1997) Proc. Natl. Acad. Sci. USA 94, 7735–7741.
29. Beardmore, J. A., Dobzhansky, Th. & Pavlovsky, O. A. (1960) Heredity 14, 19–33.
30. Dobzhansky, Th., Lewontin, R. C. & Pavlovsky, O. (1964) Heredity 22, 169–186.
31. Ayala, F. J. (1970) in Essays in Evolution and Genetics in Honor of Theodosius Dobzhansky, eds. Hecht, M. K. & Steere, W. C. (Appleton-Century-Crofts, New York), pp. 121–158.
32. Ayala, F. J. (1969) Can. J. Genet. Cytol. 11, 439–456.
33. Mueller, L. D. & Ayala, F. J. (1981) Genetics 97, 667–677.
34. Mueller, L. D. & Ayala, F. J. (1981) Proc. Natl. Acad. Sci. USA 78, 1303–1305.
35. Mueller, L. D., Guo, P. & Ayala, F. J. (1991) Science 253, 433–435.
36. Anderson, W. W. & Watanabe, T. K. (1997) Proc. Natl. Acad. Sci. USA 94, 7742–7747.
37. Dobzhansky, Th. (1935) Philos. Sci. 2, 344–355.
38. Avise, J. C. & Wollenberg, K. (1997) Proc. Natl. Acad. Sci. USA 94, 7748–7755.
39. Markow, T. A. (1997) Proc. Natl. Acad. Sci. USA 94, 7756–7760.
40. Stebbins, R. C. (1949) Univ. Calif. Publ. Zool. 48, 377–526.
41. Dobzhansky, Th. (1958) in A Century of Darwin, ed. Barnett, S. A. (Harvard Univ. Press, Cambridge, MA), pp. 19–55.
42. Wake, D. B. & Yanev, K. P. (1986) Evolution 40, 702–715.
43. Wake, D. B., Yanev, K. P. & Brown, C. W. (1986) Evolution 40, 866–868.
44. Wake, D. B., Yanev, K. P. & Frelow, M. M. (1989) in Speciation and its Consequences, eds. Otte, D. & Endler, J. A. (Sinauer, Sunderland, MA), pp. 134–157.
45. Wake, D. B. (1997) Proc. Natl. Acad. Sci. USA 94, 7761–7767.
46. Grant, P. R. & Grant, B. R. (1997) Proc. Natl. Acad. Sci. USA 94, 7768–7775.
47. Ayala, F. J. (1997) Proc. Natl. Acad. Sci. USA 94, 7776–7783.
48. Powell, J. R. & Moriyama, E. N. (1997) Proc. Natl. Acad. Sci. USA 94, 7784–7790.
49. Clegg, M. T., Cummings, M. P. & Durbin, M. L. (1997) Proc. Natl. Acad. Sci. USA 94, 7791–7798.
50. Nei, M., Gu, X. & Sitnikova, T. (1997) Proc. Natl. Acad. Sci. USA 94, 7799–7806.
Proc. Natl. Acad. Sci. USA Vol. 94, pp. 7698–7703, July 1997 Colloquium Paper
This paper was presented at a colloquium entitled ‘‘Genetics and the Origin of Species,’’ organized by Francisco J. Ayala (Co-chair) and Walter M. Fitch (Co-chair), held January 30–February 1, 1997, at the National Academy of Sciences Beckman Center in Irvine, CA.
Origin of Genes (intron/exon/module/evolution)
WALTER GILBERT*, SANDRO J. DE SOUZA, AND MANYUAN LONG
Department of Molecular and Cellular Biology, Biological Laboratories, Harvard University, 16 Divinity Avenue, Cambridge, MA 02138
*To whom reprint requests should be addressed.
© 1997 by The National Academy of Sciences 0027-8424/97/947698-6$2.00/0. PNAS is available online at http://www.pnas.org.

ABSTRACT We discuss two tests of the hypothesis that the first genes were assembled from exons. The hypothesis of exon shuffling in the progenote predicts that intron phases will be correlated so that exons will be an integer number of codons and predicts that the exons will be correlated with compact regions of polypeptide chain. These predictions have been tested on ancient conserved proteins (proteins without introns in prokaryotes but with introns in eukaryotes) and hold with high statistical significance. We conclude that introns are correlated with compact features of proteins 15-, 22-, or 30-amino acid residues long, as was predicted by ‘‘The Exon Theory of Genes.’’

The role of introns and exons in the history of genes has been the subject of debate between two extreme positions. One side holds that introns were used to assemble the first genes, an ‘‘introns-early’’ view (1, 2), and the other side maintains that introns were added during evolution to break up previously continuous genes, an ‘‘introns-late’’ view (3, 4). This discussion has a significant impact on our conceptions about the way genes were constructed in the first cells. Unfortunately, the two sides make opposing judgments about each piece of evidence, and no decisive evidence has yet been agreed upon.

For example, in the context of phylogenies, that bacteria have no introns whereas vertebrates have many introns is interpreted differently by the two sides. One view is that introns were there originally and were simply lost; the alternative view is that they were gained. In homologous genes, one often finds introns in similar but not identical positions between genes separated by great evolutionary distances. The early-intronists say that these positions represent the same original intron, possibly moved slightly in position (intron drift or sliding). The late-intronists say that it is obvious that introns could not have existed in such closely neighboring positions in a single original gene and that, because introns could not have moved, these near coincidences must be evidence of insertion.

There have been efforts to correlate introns with the three-dimensional structure of proteins. The introns-late view denies that there are any such correlations and asserts that introns behave as though they were inserted randomly into the structure of genes (5). Alternatively, the early-intron position generally affirms such a connection but, up to now, has not been able to muster any strong statistical evidence. Recently, however, we have defined such a correlation in a way that yields strong statistical support (6).

There are three possible scenarios for the evolutionary history of introns. One is that there were introns at the very beginning of evolution and that during evolution they were lost or, possibly, mostly lost and some added. This complex of ideas is ‘‘The Exon Theory of Genes’’ (2). The extreme alternative view is that introns were added very late in evolution, even in the last few million years, and thus have nothing to do with the rearrangement of pieces of genes. There is no exon shuffling in this picture. A third, intermediate view, popular in its own right, is that the introns arose at the initiation of multicellularity. In this picture, the Cambrian explosion used introns to create exon shuffling and a profusion of new genes.

The idea of exon shuffling is that introns act as hot spots for genetic recombination, which is a property that introns would have solely because of their length. Introns affect the rate of homologous recombination between exons in a way that scales with length, but, more importantly, they affect nonhomologous recombination as the square of their length. Consider a new gene made by a new combination of regions of earlier genes by an unequal crossing-over event, a rare event at the DNA level, that matches small, similar sequences between two DNAs. To make a new protein that contains the first part of one protein with the second part of another requires such a rare, and in-frame, event. However, if the regions that encode parts of the protein are separated by 1,000–10,000-base-long introns along the DNA, a process of unequal crossing-over occurring anywhere within that intron between the exons will create a new combination of exons. There is a combinatorial number of ways to find the matching of short sequences to initiate the unequal crossing-over, and thus the recombination process will go a million to a hundred million times faster in the presence of an intron. This is a great enhancement of the rate of creation of new genes.

The Exon Theory of Genes (2) is a specific statement of the idea that the first genes were made of small pieces. The crucial elements of that theory are that the very first genes and exons represented small polypeptide chains ≈15–20 amino acids long, that the basic method used by evolution to make new genes was to shuffle the exons, and that a major trend of evolution was then to lose introns and to fuse small exons together to make complicated exons. (The first enzymes probably were aggregates of such short gene products, but these ur-exons were soon tied together by an intron/exon system so that the proteins would have a covalently connected backbone.) The dominant evolutionary processes are thus to be recombination within introns, the sliding and drift of introns to change amino acid sequence around their borders, and the loss of introns, which can change the gene structure but does not affect protein structure. The strength of this concept is its argument that one searches sequence space not by amino acids and point mutations but by larger elements. We might compare a protein to a sentence. It is easier to understand the sentence as made up of words rather than simply as a string of letters.
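The length-squared argument for nonhomologous recombination can be checked with a line of arithmetic. The sketch below is only an illustration: it assumes, as the text states, that the number of ways to initiate an unequal crossover grows as the square of intron length, it takes the constant of proportionality as 1, and the function name is hypothetical.

```python
def relative_recombination_sites(intron_length_bp: int) -> int:
    """Relative count of nonhomologous crossover pairings for one intron.

    Assumption: any position in one copy of the intron can pair with any
    position in the other copy, so the count grows as length squared.
    """
    return intron_length_bp ** 2


for length in (1_000, 10_000):
    factor = relative_recombination_sites(length)
    print(f"{length:>6}-base intron: about {factor:.0e}-fold more pairings")
```

For introns of 1,000 and 10,000 bases this gives factors of 10^6 and 10^8, which is the "million to a hundred million" range quoted for the speed-up.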
How are introns lost? The most direct way is retroposition. A spliced RNA transcript of a gene with an intron/exon structure is copied back into cDNA by a reverse transcriptase, and that DNA is inserted into the chromosome within an intron of a previously existing gene. Splicing can now make that element serve as a complex exon in a new gene. (This process makes pseudogenes, if the reinsertion does not fall under a promoter.) A clear example of this process was worked out in the jingwei gene system in Drosophila (7).

The argument that the first exons were 15–20 amino acids long does not have direct support in today’s exon distribution, which peaks at lengths of 35–40 amino acid residues (8). In terms of exon fusion, we expect that there have been, on average, two or three acts of fusion in going from the original 15–20-amino acid long exons to the pieces that are being shuffled today. That protein evolution begins with 15–20-residue polypeptides, essentially small ORFs, whose products are just long enough to have some shape in solution (or as an aggregate) provides an answer to the classic problem of how long proteins evolved. Although it is impossible to find one of 20²⁰⁰ sequences by a random process (there is not enough carbon in the universe), all short fragments 15–20 residues long can be found in a few moles of material.

Although we have described the Exon Theory of Genes as involving DNA-based introns and exons, the theory flows naturally out of an RNA world view (9) that (i) pictures RNA genetic material creating (by splicing) RNA enzymes to do all of the biochemistry, (ii) then introduces activated amino acids, one by one, to build up oligopeptides to support ribozyme function, and, finally, (iii) uses 20 amino acids, short exons, and mRNA splicing to create protein enzymes.
This RNA world picture is supported by the ribosome’s RNA-based peptide-bond catalysis, by the spliceosome’s RNA enzyme-based splicing mechanism, and by the essential RNA involvement in DNA synthesis and the biosynthesis of the DNA precursors (10).

How can one devise any proofs or disproofs of these attitudes about the origin of genes? The polar views make different predictions, which can be tested (8, 11). The theory that introns are present today because there was exon shuffling in the original genes makes certain predictions about intron position and phase, whereas theories that the introns were added to DNA sequence by a random process make different predictions. One example of such an introns-added theory is the hypothesis of a transposable element that bears splicing signals on its ends. If such an element were to insert into a gene, its RNA transcript would be spliced out, and the gene product would be unaffected. An element of this kind could spread through the genome as selfish DNA and put introns everywhere.

Intron Phase Predictions

The first set of predictions involves intron phase, the position of the intron within a codon. Even though there is no signature on the message after an intron has been spliced out, the intron position along the DNA can be referenced to the ultimate protein sequence. An intron can lie either between the codons, phase 0, after the first base, phase 1, or after the second base, phase 2. This is an evolutionarily conserved property if the intron remains present in the gene. If the introns had been inserted into the DNA, there would be no ‘‘phase’’ preference at the point of insertion. That insertion, as a DNA process, could take note of DNA sequence but not protein sequence. If, on the other hand, the exons had been shuffled and exchanged, the simplest model would have all introns in the same phase so that every combination between exons would work.
Thus, introns-early predicts phase bias, and introns-late predicts (in its simplest form) equal numbers in each phase.
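The definition of intron phase given above reduces to a remainder modulo 3. The following is a minimal sketch, assuming intron positions are recorded as the number of coding bases that precede the intron in the spliced message; the function names and the example positions are hypothetical and are not taken from the authors' database.

```python
from collections import Counter
from typing import Dict, Iterable


def intron_phase(coding_bases_before_intron: int) -> int:
    """Return 0, 1, or 2: how many bases of a codon precede the intron."""
    if coding_bases_before_intron < 0:
        raise ValueError("position must be non-negative")
    return coding_bases_before_intron % 3


def phase_fractions(positions: Iterable[int]) -> Dict[int, float]:
    """Fraction of introns found in each phase."""
    counts = Counter(intron_phase(p) for p in positions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no intron positions given")
    return {phase: counts.get(phase, 0) / total for phase in (0, 1, 2)}


# Hypothetical intron positions, in coding bases from the start codon.
example_positions = [30, 31, 65, 99]
print([intron_phase(p) for p in example_positions])
print(phase_fractions(example_positions))
```

Under the simplest introns-late model the three fractions would each be close to one third; a marked departure from that is what the text calls phase bias.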
A second property is phase correlation. Consider an exon bounded by introns. If the two introns had been inserted into a continuous gene, there would be no necessary relation between the phases of the intron that lies before and the intron that lies after the exon. The two events of insertion, and hence the phases, should be uncorrelated. On the other hand, if the exon had been inserted into a previously existing intron, then the phases of the intron on either side should be the same so that the reading frame will continue across it. That is, exon shuffling suggests that exons should be multiples of three bases. Intron addition makes no such commitment.

How might one test these predictions? We constructed a database of exons by going to GenBank, identifying all genes with introns, purging that set to remove related genes, and getting a set of quasi-independent genes: 1,636 genes with 9,192 internal exons from GenBank 84 (8). We then looked at a special subset of those eukaryotic genes: those that have homologous sequences in the prokaryotes. In our database, there were 296 such genes with 1,496 introns (8). These genes have the following essential property: They are prokaryotic genes that are colinear with a region of a eukaryotic gene. The prokaryotic gene has no introns; the region of eukaryotic gene has introns. Any introns-late model requires that all of these introns be inserted because there cannot have been exon shuffling for these sequences. Although one might argue in general, for eukaryotic sequences, that they could have been made by exon shuffling, these particular parts of eukaryotic sequence cannot be so made because they are orthologous and thus derive from the cenancestor.

In an introns-late picture, all the introns in these homologous regions must be derived. They must have been inserted, and so they should show no phase bias and no phase correlations.
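The phase-correlation prediction and the kind of test it invites can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' procedure: the helper names are hypothetical, the test is an ordinary Pearson chi-square with one degree of freedom, and the counts are the single-exon observed and expected values reported in Table 1 (ref. 8).

```python
import math
from typing import Sequence, Tuple


def downstream_phase(upstream_phase: int, exon_length_bp: int) -> int:
    """Phase of the intron after an exon, given the phase of the intron before it."""
    return (upstream_phase + exon_length_bp) % 3


def is_symmetric(upstream_phase: int, exon_length_bp: int) -> bool:
    """A symmetric exon begins and ends in the same phase (length divisible by 3)."""
    return downstream_phase(upstream_phase, exon_length_bp) == upstream_phase


def chi_square_1df(observed: Sequence[float],
                   expected: Sequence[float]) -> Tuple[float, float]:
    """Pearson chi-square statistic and its P value for one degree of freedom."""
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# A 90-base exon keeps the phase; a 91-base exon shifts it.
print(is_symmetric(1, 90), is_symmetric(1, 91))

# Single-exon counts as reported in Table 1: (symmetric, asymmetric).
chi2, p = chi_square_1df(observed=(562, 725), expected=(515, 772))
print(f"chi2 = {chi2:.2f}, P = {p:.4f}")
```

With the rounded counts from the table, the statistic comes out near 7.15 with P of about 0.0075, close to the 7.1 and 0.008 reported there; the small difference presumably reflects rounding of the expected values.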
Table 1. Intron correlations in ancient conserved regions (observed/expected)

Sets of exons   Symmetric        Asymmetric   χ²    P
1               562/515 (9%)     725/772      7.1   0.008
2               439/400 (10%)    530/569      6.5   0.011
3               348/309 (13%)    400/439      8.4   0.004
4               267/238 (12%)    312/341      6.0   0.014

The symmetric exons or exon sets begin and end in the same phase. The asymmetric sets begin and end in different phases. The expectations for the single exons were calculated from the observed intron phases. The expectations for the sets of exons were calculated from the prior observed frequencies of the subset exons. The percentage difference for the symmetric exons or exon sets is calculated as (observed − expected)/expected.

According to the introns-early model, there should be such correlations because some or all of these introns originated in the progenote where these genes were assembled by exon shuffling. In fact, this subset of introns in ancient, conserved regions does show a phase bias: 55% are in phase 0, 24% are in phase 1, and 21% are in phase 2. (The alternative model predicts 33%, 33%, and 33%.) Still more interesting, from a biological viewpoint, there is an excess of symmetric exons and of symmetric pairs, triples, and quadruples of exons. Table 1 shows these data (8). All of these sets show an excess of multiples of three, significant at about the 1% level. This is the first strong argument for the existence of ancient introns. The excess of symmetric exons in these ancient conserved regions is predicted in a simple way by the idea that the introns were used to assemble the first gene, but it is not predicted by an insertional model without special biochemical pleading that forces this result to happen. (One might, in principle, argue that introns inserted into special sequences on the DNA, like AG|GT, and that these sequences might show a bias relative to amino acid sequence to generate a phase bias. To this one might then add the ad hoc assumption that the splicing mechanism sees both ends of the exon and likes it to be a
multiple of three.) A simpler interpretation is that there was exon shuffling in the formation of the first genes.

Introns Correlate with Protein Structure

If genes had been put together from small shuffled pieces, proteins should be made up of repeated small elements, possibly elements of folding, although the evolutionary argument only requires elements that can be subject to natural selection. Evolution requires only function; it does not require biochemical structure. The prediction of the Exon Theory of Genes is that there should be such elements that evolution has selected, which we call modules, and that they should be coextensive with exons. By module, we mean a region of polypeptide chain that can be circumscribed in space by a maximum diameter. If one traces the Cα positions along the backbone of the polypeptide chain in space and requires that all of the pairwise distances be less than some maximum value, then this region of the chain must fold back and forth in space. Putting a maximum length over the Cα distances, roughly speaking, means that, as that region of chain travels through space, it can be circumscribed by a sphere of that diameter. How might one define the boundaries of such modules? The module notion was first introduced by Mitiko Go in the early 1980s and was used to predict the existence of novel introns. She used it (12) to suggest that there should be a novel intron in globin, and that intron was later discovered in the leghemoglobin of plants. We used that same idea to predict the existence of positions in which one might find introns in triosephosphate isomerase (13, 14). One difficulty with this notion, and a challenge that the introns-late supporters have made, is that this concept of compactness does not provide a sharp view of where the boundaries are to be. The “spheres” overlap, and one does not have a clean definition of where one module stops and the next begins.
We have converted this difficulty into a virtue (6) by suggesting that one should take the overlap regions between the spheres that surround the folded portions and, rather than asking for a single intron to be added at a precise position, define these overlaps as “boundary regions” within which introns might lie. Such boundary regions are designed such that, if one put an intron into each of those regions along the gene, the gene product would be dissected into modules less than the specified diameter. This notion is well defined, and one can now write a computer program to define those regions. Fig. 1 shows this definition of the boundary regions for globin. By constructing a distance plot, a Go plot, of all pairwise distances between Cα positions along the protein and marking all distances >28 Å in black, one can easily see that the five large triangles along the diagonal identify the longest segments of polypeptide chain that lie in 28-Å modules and that the four small overlap triangles define the boundary regions. The gene thus is divided into two portions, one that corresponds to modules and another that corresponds to boundary regions. This yields a very simple statistical test. The boundary regions are approximately one-third of a gene: Do introns lie preferentially in these regions, or are their positions random? Again, we considered ancient conserved regions, choosing ones that correspond to three-dimensional structures homologous to bacterial genes without introns and to eukaryotic sequences with introns, to ask: Do those intron positions in the eukaryotic homologs tend to lie in the boundary regions or do they not? The two theories predict quite opposite effects. On the introns-late model, these are all derived introns: they were added to the preexisting gene, and their positions should be random. The introns-early model predicts that these positions should fall in the boundary regions.
FIG. 1. Go plot for horse hemoglobin. The black spots represent pairs of amino acids whose α-carbons are separated by 28 Å or more. The five large triangles correspond to modules. Boundary regions (BR) are defined by the overlap of these triangles.
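The module and boundary-region definitions can be illustrated with a short computational sketch. The Python fragment below is an illustration only and is not the program used for the analysis: the function names (go_plot, compact_reach, maximal_modules, overlap) and the toy coordinates are assumptions introduced here. The text does not specify how particular modules are chosen along the chain, so the sketch stops at listing the maximal compact segments and computing the overlap of two given modules.

```python
import math

def go_plot(coords, diameter):
    """Boolean Go plot: True ("black") where two C-alpha atoms are
    `diameter` or more apart."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) >= diameter for j in range(n)]
            for i in range(n)]

def compact_reach(coords, diameter):
    """For each start residue i, the last residue j such that every pair of
    C-alpha atoms within i..j is closer than `diameter`."""
    n = len(coords)
    reach = []
    j = 0
    for i in range(n):
        j = max(j, i)
        # A sub-segment of a compact segment is compact, so j never moves back.
        while j + 1 < n and all(
            math.dist(coords[k], coords[j + 1]) < diameter
            for k in range(i, j + 1)
        ):
            j += 1
        reach.append(j)
    return reach

def maximal_modules(coords, diameter):
    """Compact segments (start, end) that are not contained in a longer one."""
    reach = compact_reach(coords, diameter)
    return [(i, reach[i]) for i in range(len(reach))
            if i == 0 or reach[i] > reach[i - 1]]

def overlap(first, second):
    """Residues shared by two modules: a candidate boundary region, or None."""
    start, end = max(first[0], second[0]), min(first[1], second[1])
    return (start, end) if start <= end else None

# Toy chain (hypothetical): 12 C-alpha atoms on a straight line, 3.8 Å apart.
coords = [(3.8 * i, 0.0, 0.0) for i in range(12)]
modules = maximal_modules(coords, diameter=10.0)
print(modules[:3])                       # first few maximal compact segments
print(overlap(modules[0], modules[1]))   # residues shared by the first two
```

With real Cα coordinates and a 28-Å diameter, the segments returned by maximal_modules would play the role of the triangles along the diagonal of the Go plot, and overlaps between the modules chosen along the chain would play the role of boundary regions.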
Using 28 Å to define the modules, a size that was used before to define modules for triosephosphate isomerase or globin, we examined a set of 32 ancient proteins and a corresponding set of 570 intron positions. The random expectation was to find 182.5 introns in the boundary regions, but we found 214. That is a 17% excess, not a big number, but there are so many positions that the χ² is 8 and the P value is less than 0.005. One might wonder if there could be some other reason, rather than ancient introns, that introns might lie within the boundary regions. One possibility might be the existence of some special sequences in the boundary regions, or some sequence biases, that could serve as targets for insertion. We have examined the sequences in the boundary regions and do not find any particular sequence or compositional bias at the amino acid level or the DNA level. Occasionally, people conjecture that introns might have targeted sequences like AGG or AGGT, “proto-splicing” sequences, but there is no excess of those sequences in the boundary regions. Craik and his coworkers once suggested that introns might lie on the surface of proteins (15); thus, one might think that the boundary regions perhaps are on the surface of the proteins and that is why introns are in those regions. However, in this set of proteins, neither the boundary regions nor the introns are biased toward the surface (6). So far, we have not been able to identify any bias-dependent model that would put introns into the boundary regions. The hypothesis we are testing, the Exon Theory of Genes, says that intron positions should lie within these boundary regions. Even though some introns may have been added in the course of evolution, even though some introns may have been lost, even though some introns may have moved, and even though the protein structure may have altered since it was put together, one can still see an excess there.
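The arithmetic behind the boundary-region test (214 observed versus 182.5 expected among 570 positions) can be checked with a few lines of Python. The sketch below assumes that the quoted statistic is a Pearson χ² with one degree of freedom on the two-class split (inside versus outside the boundary regions); that assumption is ours, as is the function name chi_square_two_class. The same calculation applied to the single-exon row of Table 1 gives a value close to the χ² listed there.

```python
import math

def chi_square_two_class(observed_in, expected_in, total):
    """Pearson chi-square with one degree of freedom for items split into two
    classes (e.g., inside vs. outside the boundary regions)."""
    observed_out = total - observed_in
    expected_out = total - expected_in
    chi2 = ((observed_in - expected_in) ** 2 / expected_in
            + (observed_out - expected_out) ** 2 / expected_out)
    # Upper-tail probability of a chi-square variable with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Boundary-region test: 214 observed vs. 182.5 expected among 570 positions.
chi2_br, p_br = chi_square_two_class(214, 182.5, 570)
print(f"boundary regions: chi2 = {chi2_br:.2f}, P = {p_br:.4f}")

# Single exons in Table 1: 562 symmetric observed vs. 515 expected,
# out of 562 + 725 exons in total.
chi2_sym, p_sym = chi_square_two_class(562, 515, 562 + 725)
print(f"symmetric exons:  chi2 = {chi2_sym:.2f}, P = {p_sym:.4f}")
```

Under this assumption, a 17% excess reaches significance only because the number of positions is large, which is the point made in the text.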
A further argument that the excess of intron positions in the boundary regions is due to intron antiquity is found in the examination of an “ancient” subset of the intron positions. We examined those introns that have the same, or similar, positions in three of the four groups: plants, vertebrates, invertebrates, and fungi. Of the 20 introns in this subset, 13 lie in the boundary regions whereas only 6.5 are expected. That is a 100% excess, as opposed to the 17% excess overall. Thus, in a group that is selected to be ancient, we found a higher bias. (That bias was significantly different from the 17%; the χ² for the difference between 100% and 17% was 6.5, a P value ≈ 0.01.) This finding is further support for the idea that the
underlying signal is due to ancient introns. If the pattern is simply one of biased insertion, then any subset should simply have a value ranging around the 17% excess.

The 28 Å size was purely arbitrary, a particular one that we had used historically. It worked, and we had chosen that size before we knew that this analysis would work, but there was no profound reason for that size. Because we have a computer program that can take any diameter and decompose the protein into modules corresponding to that diameter, we can ask: Is there some optimal decomposition? Fig. 2 shows the results of varying the module diameters from 6 Å, which is one amino acid apart along the chain, out to 50 Å and plotting the χ² values for the significance of the excess of introns within the boundary regions. Fig. 2 shows three peaks of significance: one peak corresponding to an ≈21-Å diameter, one of an ≈28-Å diameter, and one of an ≈33-Å diameter. The peaks rise to probabilities ≈ 0.001. This is a strong statistical argument that there are three differently sized structural elements in these proteins that are correlated with intron positions. (One might worry that the curve shows a statistical calculation repeated a thousand times; if the phenomenon had been purely random, at least one of the points should have yielded a P value of 0.001. If one examines the underlying distribution of the excess of intron positions, one sees that it is robust: Smooth peaks appear in the excess of the observed intron positions over the expectations.) Thus, we conclude that intron positions are correlated with modules of three different diameters: 21 Å, 28 Å, and 33 Å.

Can we understand these modules in a more informative way? We can ask about the average length of the polypeptide chain contained within each of these modules, which is equivalent to asking for the average length of the hypothetical exons predicted by the computer program. Fig.
3 shows that the 21-Å modules have an average length of 15 amino acid residues; the 28-Å modules have an average length of 22 residues; and the 33-Å modules have an average length of 30 residues (with a considerable spread). We have given a very strong statistical argument, with P values ≈ 0.001, that introns define elements of protein structure with sizes of 15, 22, and 30 residues. This feature is exactly what the Exon Theory of Genes suggested back in 1987.

FIG. 2. χ² distribution for the matching of intron positions to the boundary regions of 32 ancient proteins as a function of module diameter. The 570 intron positions were drawn from version 90 of GenBank. There are three major peaks of significance around module diameters of 21, 28, and 33 Å.

FIG. 3. Lengths of predicted modules for the peaks of significance around 21, 28, and 33 Å. The three peaks correspond to distributions centered around 15, 22, and 30 amino acid residues in length.

Recently, we went back to the database. Since the calculation was first done, there are 90 more introns, 662 in total, so we can redo the calculation to see if it is better or worse. Most of the novel intron positions have come in through the Caenorhabditis elegans project, so they represent great evolutionary distances from many that were in the database before. With the new data, the peaks improve in statistical significance. Fig. 4 shows that the peak at 21 Å rises to a χ² ≈ 19, and both it and the peak at 28 Å rise to a P value less than 0.0001.

Currently, we are analyzing the shapes that make up these peaks. The peak at 21 Å, for the set of 32 proteins, arises from a set of 822 modules. Because we know the structures of the proteins, we know the three-dimensional structures of each of these modules, and we can search for signs of exon shuffling. The hypothesis that we are testing not only says that there should be correlations of ancient introns with these modules but also that there should be a pattern of reuse of these elements. What we expect to find is that some 21-Å regions, some 28-Å regions, and some 33-Å regions will have been used over and over again. Once we have a classification of shapes that are reused, we will ask for further evidence that those shapes correspond to shuffled exons. Such evidence would be that those modules that have been reused are ones correlated with introns or that those modules that have been reused show sequence similarities that would suggest a divergent evolution. At this time, we know these patterns only very crudely.
The most common module is an α-helix followed by a turn and a strand; ≈8–10% of all shapes at 21 Å are of that form. Then there are strand–turn–helix shapes, helix–turn–helix, and strand–turn–strand shapes repeating in the 21 Å peak. The other peaks contain more complicated shapes.
FIG. 4. The same analysis shown in Fig. 2 (dashed line) was repeated using a database of intron positions based on GenBank version 96 (662 intron positions, continuous line). The peaks around 21, 28, and 33 Å now reach χ² values around 19, 15, and 13, respectively.

DISCUSSION

We have reviewed here two strong, statistical arguments that there were ancient introns used to shuffle exons in the first genes. Both arguments detect a signal of the presence of ancient introns in today’s intron spectrum, over a background that could be due to new introns, to moved introns, or to mutation and change of the protein structures. Both arguments, intron phase correlations and intron correlation to modules, were applied to ancient conserved regions of gene sequence. These are regions of sequence conserved between prokaryotes and eukaryotes; thus, these genes, on any theory, came into existence early in evolution, possibly in the progenote, certainly in the cenancestor, the last common ancestor. These regions are colinear between the prokaryotic forms and their eukaryotic homologs. It is for these ancient conserved regions especially that the two theories make the most divergent predictions. All forms of introns-late theories assert that these genes came into existence before there were spliceosomal introns. Hence, all of the introns in their eukaryotic counterparts had to be inserted during the course of evolution; they must be derived characters because the prokaryotic form, on those theories, was created as a continuous whole. No exon shuffling can have intervened for these eukaryotic counterparts because they are colinear to the prokaryotic forms. Conversely, all introns-early theories predict that these proteins actually were assembled from exons in the progenote or later by exon shuffling. During the evolution of the prokaryotes, these theories predict that all of the introns were lost. Only in the eukaryotic forms did (some of) these introns survive. Thus, for these introns, one theory says all were added, and thus should obey random statistics, and the other theory predicts that the current introns will show correlations due to their ancient origin. The databases of gene sequences have so increased in size that one can show that these traces of ancient introns have sharp statistical significance. As the databases continue to increase in the future, these tests will become even more convincing.

Issues of Selection and Adaptation
Are the introns under selection? In general, we argue that they are not. The hypothesis that the role of introns was to speed up evolution by increasing the recombination between exons is not based on the idea that they therefore were selected for that use. Such an idea would be a wrong teleological view, i.e., that they are present because they aid future selection. Rather, our view is that they are present because the easy path in the past that led to the creation of a gene used them and that they have not yet been removed by selective pressure. Although the introns are not under any selective pressure in general, where there has been pressure on DNA size there would have been loss of introns, such as in prokaryotes, Arabidopsis, or other small genome organisms. Drosophila, for example, recently has been shown to have a high deletion frequency for unneeded DNA (16) associated with genome slimming, which suggests that many current introns in Drosophila may be adaptive and be maintained by such features as enhancers or gene expression timing. Many introns in Drosophila are very small, which may reflect the deletion pressure for loss of sequence that still does not go to completion because of the difficulty of removing the intron exactly. Gerald Fink (17) suggested that, in S. cerevisiae, a special mechanism (in that case a runaway reverse transcriptase) led to the loss of introns as a result of bombarding the genome with cDNA copies of spliced messengers. Could natural selection on added introns create the observed correlation between introns and protein features? Such models fail because, for these ancient conserved genes, they involve selection for a future purpose. One such model, for example, hypothesizes that, as introns are being added to these ancient (continuous) genes, a well formed exon is shuffled off for use in some other gene and hence selected for.
In reality, selection could fix that novel exon in the new gene in the population, but that selection would fail to fix the correct ancestral (donor) form of the gene in the population. (If the organisms had sex, then the donor form is unlinked and hence not fixed. If the organism had only one linkage group, so that the donor form would be fixed by piggybacking, so too would all of the wrongly inserted introns everywhere in the genome.)
CONCLUSION

We have examined a large set of introns in ancient conserved regions. All of these introns should have been derived, late features if the first genes had been continuous. We found, however, that these introns show patterns of correlation to the gene sequence and to the protein structure of the gene products that are consistent with the predictions of The Exon Theory of Genes.

We thank the National Institutes of Health for support (Grant GM 37997). S.J.d.S. was supported by Fundacao de Amparo a Pesquisa do Estado de Sao Paulo and the PEW–Latin American Fellows Program.

1. Doolittle, W. F. (1978) Nature (London) 272, 581–582.
2. Gilbert, W. (1987) Cold Spring Harbor Symp. Quant. Biol. 52, 901–905.
3. Palmer, J. D. & Logsdon, J. M. J. (1991) Curr. Opin. Genet. Dev. 1, 470–477.
4. Cavalier-Smith, C. C. F. (1978) J. Cell Sci.
5. Stoltzfus, A., Spencer, D. F., Zuker, M., Logsdon, J. M. J. & Doolittle, W. F. (1994) Science 265, 202–207.
6. de Souza, S. J., Long, M., Schoenbach, L., Roy, S. W. & Gilbert, W. (1996) Proc. Natl. Acad. Sci. USA 93, 14632–14636.
7. Long, M. & Langley, C. (1993) Science 260, 91–95.
8. Long, M., Rosenberg, C. & Gilbert, W. (1995) Proc. Natl. Acad. Sci. USA 92, 12495–12499.
9. Gilbert, W. (1986) Nature (London) 319, 618.
10. Gesteland, R. F. & Atkins, J. F., eds. (1993) The RNA World (Cold Spring Harbor Lab. Press, Plainview, NY).
11. Long, M., de Souza, S. J. & Gilbert, W. (1995) Curr. Opin. Genet. Dev. 5, 774–778.
12. Go, M. (1981) Nature (London) 291, 90–93.
13. Straus, D. & Gilbert, W. (1985) Mol. Cell. Biol. 5, 3497–3506.
14. Gilbert, W., Marchionni, M. & McKnight, G. (1986) Cell 46, 151–154.
15. Craik, C. S., Sprang, S., Fletterick, R. & Rutter, W. J. (1982) Nature (London) 299, 180–182.
16. Petrov, D. A., Lozovskaya, E. R. & Hartl, D. L. (1996) Nature (London) 384, 346–349.
17. Fink, G. R. (1987) Cell 49, 5–6.
Proc. Natl. Acad. Sci. USA Vol. 94, pp. 7704–7711, July 1997 Colloquium Paper
This paper was presented at a colloquium entitled ‘‘Genetics and the Origin of Species’’ organized by Francisco J. Ayala (Co-chair) and Walter M. Fitch (Co-chair), held January 30 to February 1, 1997, at the National Academy of Sciences Beckman Center in Irvine, CA.
Transposable elements as sources of variation in animals and plants MARGARET G. KIDWELL*
AND
DAMON LISCH
Department of Ecology and Evolutionary Biology and The Center for Insect Science, University of Arizona, Tucson, AZ 85721
ABSTRACT A tremendous wealth of data is accumulating on the variety and distribution of transposable elements (TEs) in natural populations. There is little doubt that TEs provide new genetic variation on a scale, and with a degree of sophistication, previously unimagined. There are many examples of mutations and other types of genetic variation associated with the activity of mobile elements. Mutant phenotypes range from subtle changes in tissue specificity to dramatic alterations in the development and organization of tissues and organs. Such changes can occur because of insertions in coding regions, but the more sophisticated TE-mediated changes are more often the result of insertions into 5′ flanking regions and introns. Here, TE-induced variation is viewed from three evolutionary perspectives that are not mutually exclusive. First, variation resulting from the intrinsic parasitic nature of TE activity is examined. Second, we describe possible coadaptations between elements and their hosts that appear to have evolved because of selection to reduce the deleterious effects of new insertions on host fitness. Finally, some possible cases are explored in which the capacity of TEs to generate variation has been exploited by their hosts. The number of well documented cases in which element sequences appear to confer useful traits on the host, although small, is growing rapidly.

The book whose publication we are celebrating in this colloquium indicates that Theodosius Dobzhansky had a very special interest in gene mutation and its causes. Dobzhansky recognized mutation as the “raw material” on which natural selection acts and as the first of three steps necessary for evolution to take place. However, the discovery of transposable elements (TEs) in the 1940s by Barbara McClintock occurred a decade later, and it was a further 30 years before the significance of her findings started to be fully appreciated. Sixty years ago, Dobzhansky was well aware of the mutagenic properties of ionizing radiation discovered in 1927 by H. J. Muller but acknowledged that much less than 1% of spontaneous mutations were attributable to this cause. He distinguished between spontaneous and induced mutations: “The former are those which arise in strains not consciously exposed to known or suspected mutation-producing agents.” He also pointed out that “since the name spontaneous constitutes only a thinlyveiled [sic] admission of the ignorance of the phenomenon to which it is applied, the quest for the causes of mutation has always occupied the attention of geneticists.” Although at that time no clues to its nature were yet available, Dobzhansky realized that a major piece of the mutation puzzle was still missing. We believe he would have been intrigued with the discoveries of TEs in natural populations that have taken place during the last 20 years and that he would have been an active participant in the continuing debate about their role in evolution.

Abbreviations: TE, transposable elements; MITE, miniature inverted-repeat TE.
*To whom reprint requests should be addressed. e-mail: [email protected].
© 1997 by The National Academy of Sciences 0027-8424/97/947704-8$2.00/0 PNAS is available online at http://www.pnas.org.

Distribution and Classification

TEs are discrete segments of DNA that are distinguished by their ability to move and replicate within genomes. Since their discovery by Barbara McClintock ≈50 years ago (1), TEs have been found to be ubiquitous in most living organisms. They comprise a major component of the middle repetitive DNA of genomes of animals and plants. They are present in copy numbers ranging from just a few elements to tens, or hundreds, of thousands per genome. In the latter case, they can represent a major fraction of the genome, especially in some plants. For example, TEs recently have been estimated to make up >50% of the maize genome (2). In Drosophila, ≈10–15% of the genome is estimated to be made up of TEs, most of which are found in distinct regions of centric heterochromatin (3). TEs are classified in families according to their sequence similarity. Two major classes are distinguished by their differing modes of transposition (4). Class I elements are retroelements that use reverse transcriptase to transpose by means of an RNA intermediate. They include long terminal repeat retrotransposons and long and short interspersed elements (LINEs and SINEs, respectively). Long terminal repeat retrotransposons are closely related to other retroelements of major interest, such as retroviruses (5). The gypsy element in Drosophila is an example of a rare type of retrotransposon that can sometimes also behave as a retrovirus (6). Class II elements transpose directly from DNA to DNA and include transposons such as the Activator-Dissociation (Ac-Ds) family in maize, the Tam element in Antirrhinum, the P element in Drosophila, and the Tc1 element in the worm, Caenorhabditis elegans. Recently, a category of TEs has been discovered (7) whose transposition mechanism is not yet known. These miniature inverted-repeat TEs (MITEs) have some properties of both class I and II elements. They are short (100–400 bp in length), and none so far has been found to have any coding potential. They are present in high copy number (3,000–10,000) per genome and have target site preference for TAA or TA in plants. MITEs such as the Tourist element in maize and the Stowaway element in Sorghum (7) are found frequently in the 5′ and 3′ noncoding regions of genes and are frequently associated with the regulatory regions of genes of diverse flowering plants. TEs with similar properties also have been described in Xenopus (8), humans (9, 10), and the yellow fever mosquito, Aedes aegypti (11). Most, but not all, TE families are made up of both autonomous and nonautonomous elements. Whereas autonomous
elements code for their own transposition, nonautonomous elements lack this ability and usually depend on autonomous elements from the same, or a different, family to provide a reverse transcriptase or transposase in trans. This paper aims first to provide a brief, general description of the types of genetic variation caused by TEs in animals and plants and then to examine this variation within an evolutionary framework: (i) direct selection on TEs at the level of the DNA sequence (parasitic DNA); (ii) coevolution of TEs and their animal and plant hosts to avoid or mitigate the deleterious effects of insertion; and (iii) positive selection on elements that have evolved to provide some positive benefit to their hosts in addition to simply minimizing the harm they do.

Types of TE-Induced Genetic Variation

Like new mutations produced by any mutator mechanism, the majority of new TE-induced mutations are expected to be deleterious to their hosts. Those mutations that survive over long periods of evolutionary time are expected to be a small subsample of newly induced mutations. The property that distinguishes TE-induced mutations from those produced by other mutational mechanisms is their remarkable diversity and the degree to which their induction is regulated by both the host and the TE itself. The genetic variability resulting from TEs ranges from changes in the size and arrangement of whole genomes to changes in single nucleotides. It may produce major effects on phenotypic traits or small silent changes detectable only at the DNA sequence level. It is important to note that TEs produce their mutagenic effects not simply on initial insertion into host DNA. TEs may also produce mutations when they excise, leaving either no identifying sequence or only small “footprints” of their previous presence.
In addition, some TE-induced mutations that may be of evolutionary significance to their hosts, such as mutations in regulatory sequences (12), may take long periods of time to evolve new functions or these new functions may have been acquired a long time ago. Consequently, they may have lost their original identification as TEs. For these reasons, the reliance solely on the distribution of TE sequences in the genomes of contemporary species of animals and plants to deduce the long term evolutionary importance of TEs may produce a biased result that may not adequately reflect TE-associated events that occurred long in the past. Deleterious effects of TEs can result not only from mutations caused by the insertion or excision of these elements at a single chromosomal site but also from genomic-level disruptive effects associated with TE transposition. For example, massive chromosome breakage in larval cells resulting from excision and transposition of genomic P elements has been implicated as the cause of temperature-dependent pupal lethality and sterility in hybrid dysgenesis in Drosophila melanogaster (13, 14). A brief description of the types of genetic variability caused by TE activity follows, based largely on the types of host DNA involved. Some of the mutations described were generated in the laboratory and have been subjected to artificial selection under unnatural and noncompetitive conditions. Although these are generally not the class of mutations that are of interest from an evolutionary perspective, we include them here to provide some indication of the potentially wide spectrum of phenotypic changes associated with TE activity. Insertions of TEs into Exons of Host Genes. On average, TEs that insert within the exons of genes are most likely to result in null mutations because of the sensitivity of these regions to frame shift mutations and the lack of tolerance of highly conserved regions to most mutations of any kind. 
However, those mutations that are not simply inviable can provide interesting and sometimes spectacular phenotypic variability. In Drosophila, a series of null alleles at the X-linked, white locus allowed the first identification of the P element in D. melanogaster as the causal agent of P–M hybrid dysgenesis
(15). The insertion of both the P element and the copia element into exon sequences interrupted the coding sequences and the production of the red eye pigment by the wild-type gene. The result is a bleached white eye phenotype that reflects the lack of pigmentation. Such a null mutation can be maintained in the laboratory but is unlikely to survive in natural populations. A good classic example is the insertion of an element of the Ac-Ds family into wx-m9, an allele of the waxy locus in maize first discovered by McClintock (16). The mutation is caused by the insertion of Ds (Dissociator) into the 10th exon of the waxy locus. This was the first element to be cloned from maize, and it is of continuing interest because it is spliced, resulting in partial revertant activity (17, 18). In this case, the effect of the insertion is attenuated by the loss through splicing of the TE after transcription. Insertions into Regulatory Regions of Genes. An excellent example of this type is the insertion of gypsy into the 5′ upstream region of the yellow gene in Drosophila, which causes a loss of expression of the yellow gene in specific tissues (19). The loss of expression in some tissues and not others in this case is the result of the interaction of the element, tissue-specific enhancers upstream of the element, and specific host factors. In Antirrhinum, a Tam3 element was observed to insert into a region 5′ of the niv gene, which is involved in the synthesis of anthocyanin pigments. The initial insertion was observed to down-regulate expression of the gene. However, a series of rearrangements mediated by this element resulted in a change in the level and tissue specificity of expression of niv (Fig. 1). The net effect is a novel distribution of anthocyanin pigment in the flower tube (20).
This series of mutations exemplifies the potential for TE-mediated “rewiring” of regulatory networks, in this case by bringing new regulatory sequences in proximity to exonic sequences via an inversion, followed by an imprecise excision event. TE activity can result in even more complex rearrangements that can have effects on gene regulation. In maize, the insertion of a Mu (Mutator) element into the TATA box of Adh1 changes the tissue specificity of RNA expression (21). Of interest, excision of this element caused a complex series of duplications and inversions whose net effect was to cause additional changes in tissue specificity (22). Kloeckener-Gruissem and Freeling (22) suggest that this kind of “promoter scrambling” may represent a more general process by which transposons produce variants of a type not produced by other mechanisms. Furthermore, in this case, like that of the Tam element in the niv gene of Antirrhinum, the TE footprint left behind after the element has generated the new mutation is small enough to be invisible to TE probes. Insertions in Introns. TEs that insert into introns generally have a greater chance to survive because these insertions are less visible to natural selection. Many of them are probably successfully spliced out during mRNA processing and have no obvious effect on the function of the gene. Even though they are spliced out, however, introns are sometimes the site of regulatory sequences. In these cases, TE insertions into introns can affect gene regulation in surprising ways. For instance, the insertion of Mu elements into an intron of the Knotted locus in maize induces ectopic expression of the gene, suggesting that the intron carries sequences normally required to repress expression of the gene in certain tissues (23). Similarly, in Antirrhinum, complementary floral homeotic phenotypes result from opposite orientations of a Tam3 transposon in an intron of the ple gene (24). Insertions in Heterochromatin.
Middle repetitive DNA sequences, including TEs, are an important component of β-heterochromatin in Drosophila, and retrotransposons constitute a considerable fraction of this DNA (25). Recent work by Pimpinelli and coworkers (3) has revealed that TE clustering into discrete regions of heterochromatin is a general property of elements in Drosophila. The cause of this distribution pattern is an open question. In some cases, a heterochromatic location probably reduces the probability of elimination of inserted sequences from the genome by ectopic recombination, the mechanism believed to be largely responsible for controlling TE copy number in chromosome arms (26). However, some centric heterochromatic regions have been described as a graveyard for dead elements, rather than a safe haven for active elements, because the majority found there appear to be inactive and highly diverged sequences. For example, LINE-like I elements cloned from the Charolles-reactive strain of D. melanogaster contain no active euchromatic I factors, only defective copies that are embedded in clusters of defective copies of other retroelements (27). In contrast, elements such as mdg1 have been found to have a nested arrangement within other retrotransposons located in euchromatic chromosome regions (28, 29). These sites appear to exhibit properties of intercalary heterochromatin (25) and may be responsible for the properties of ectopic pairing, susceptibility to breakage, and late replication that are characteristic of this type of chromatin.
Mediation of Recombination. TE-mediated increases in the rate of recombination have consequences beyond genetic variation at individual loci: this recombination activity can also result in more general changes in both fine and gross structural characteristics of chromosomes. For instance, analysis of mutations in the mei-41 and mus302 genes required for normal postreplication repair in D. melanogaster (30) revealed a striking stimulation of site-specific gene conversion and recombination mediated by P element transposition. As a consequence of the selection against its negative effects, ectopic recombination is postulated to be the mechanism chiefly responsible for the removal of certain subsets of TEs from genomes and a means for controlling copy number (26, 31). It is worth noting, however, that not all rearrangements caused by ectopic recombination are necessarily selected against. From an evolutionary perspective, rare surviving chromosomal rearrangements could be of significance. For example, Lyttle and Haymer (32) demonstrated the presence of the hobo element at the breakpoints of several endemic, but not cosmopolitan, inversions in D. melanogaster natural populations from Hawaii. This result is consistent with the recent introduction of hobo elements to D. melanogaster by horizontal transfer (33) and the subsequent production of these inversions as a consequence of the activity of these elements.
Effects on Quantitative Variability. The P element in Drosophila provides one of the most compelling demonstrations of TE-induced genetic variability in quantitative genetic traits. A series of experiments (34, 35) has shown that quantitative variability for bristle number is induced by P–M hybrid dysgenesis and is demonstrable by directional artificial selection. Furthermore, in well-controlled experiments (36), the increase in new additive genetic variation in abdominal bristle number was 30 times greater than that expected from spontaneous mutation. In another Drosophila study (37), an excess of P element mutations having large effects on metabolic characters was observed relative to those expected. Significant among-line heterogeneity indicated that the mutational target site for enzyme activity is large and that most of the mutations must be regulatory. It was concluded that the large pleiotropic effects observed had important consequences for metabolic characters.
FIG. 1. Rearrangements associated with TE activity near a gene result in altered tissue specificity. (A) In the original isolate, a Tam3 element had inserted 64 bp upstream of the start of transcription of the niv gene of Antirrhinum. The result of this insertion was a reduced level of expression. (B) A derivative of the initial insertion allele carries an inversion flanked by two copies of the transposon. This allele confers an additional reduction in the level of expression. (C) Excision of the element closest to the niv gene left a short (26 bp) ‘‘footprint.’’ This rearrangement resulted in an increase in the level of niv gene expression as well as a novel pattern of expression, presumably due to the juxtaposition of a novel sequence with the niv gene TATA box and coding sequences [adapted from Lister et al. (20)].
Evolutionary Considerations
The idea that TEs are primarily parasitic is not inconsistent with a role for these elements in the evolution of their hosts. Indeed, as documented below and elsewhere (12, 38–40), there is a growing body of evidence for coadaptation by both elements and their hosts to the long term presence in the genome of these parasitic sequences. In some cases, it appears that this coadaptation may even have led to the use of TE sequences for essential and beneficial host functions. In this section, we explore all three perspectives.
TEs as Genomic Parasites
The intrinsically parasitic nature of active TEs (41–43) accounts for their undisputed ability to invade new species, increase in copy number, and survive over long periods of evolutionary time. The replicative advantage of TEs (44) is responsible for this ability, which is facilitated by their generally compact structure and inclusion of the coding capacity for transposition within their sequences. Natural selection acting on TEs at the level of the DNA sequence is responsible for maintaining their essentially parasitic properties. For example, P elements can rapidly invade a naive population of D. melanogaster, despite extremely strong negative selection at the host level in the form of high frequencies of temperature-dependent gonadal sterility (45). A proclivity for horizontal transfer is consistent with the role of TEs as genomic parasites. The life cycle of TEs in any single phylogenetic lineage can apparently last for many thousands or millions of years and can be considered as a succession of three phases: dynamic replication, inactivation, and degradation (46, 47). The transposition of both major classes of elements is error-prone and produces nonautonomous elements that often repress the transposition rate of active elements. 
Over long periods of evolutionary time, there is a tendency for a family of elements to degrade in coding capacity, but horizontal transfer provides the opportunity for active TEs to move to another host lineage and begin the cycle over again (46, 48, 49). There is evidence that TEs do transfer horizontally more frequently than nonmobile genes (46). Class II elements such as mariner and P elements provide good examples of TE horizontal transfer (50–52), but a major puzzle remains regarding the mechanism by which horizontal transfer is achieved (49).
Coadaptations to Mitigate Reduced Host Fitness
Like viruses, TEs are dependent on their host organisms for survival, but unlike viruses, most TEs do not have a phase in their life cycle in which they can survive independent of their hosts. Therefore, coevolution and coadaptation of TEs with host genomes is expected to play a particularly important role in the long term survival of these element families. Given that the majority of new insertions tend to be deleterious to hosts, it is in the interests of both parties to mitigate or remove such deleterious effects. TEs can insert in many locations other than exons. In these noncoding sequences, they are likely to increase their probability of survival because they are less visible to natural selection. Examples of some of the ways that coadaptations by both mobile elements and their hosts appear to have evolved are described below and are summarized in Table 1.
Insertion Bias for Noncoding Regions. A dramatic demonstration of a preference for insertion into potentially regulatory regions of genes, rather than exons, is provided by P elements in Drosophila (53). Only a small minority of P elements was observed to have inserted in coding sequences, and these elements were in the 5′ portion. Thus, there is a strong bias in favor of insertion in the 5′ end of the gene and especially for the 5′ untranslated region. 
Because these insertions all caused an obvious mutant phenotype (they failed to complement a deficiency), this result is actually an understatement of the true number of insertions into or near regions of genes often involved in regulation. This preference is supported by the remarkable observation that, in another experiment, >65% of the 500 independent P element enhancer trap insertions were expressed in a spatially and temporally restricted fashion (54). Another good example of preferential insertion of elements outside of coding regions is provided in yeast (55). Of over 100 new insertions of Ty1 observed in chromosome III, nearly all were inserted into or near either tRNA genes or preexisting long terminal repeats; only 3% were found in ORFs. Distribution patterns favoring the 5′ noncoding sequences of genes were observed for MITE elements in plants (7) and animals (e.g., see ref. 11). However, it is not clear whether this pattern represents an insertion preference or whether it results from strong earlier selection against insertions into other regions. Similarly, in view of the large number of group II and group III introns present in the chloroplast genome of E. gracilis, the complete absence of introns in rRNA and tRNA genes is striking (56). One possibility is that secondary structure features of rRNA and intron RNA or tRNA and intron RNA (if they were present in the same pre-mRNA molecule) would interact in such a way as to prevent one or both from functioning.
Preference for Insertion into Preexisting Elements. In some cases, it is clear that unrestricted transposition would be absolutely disastrous for the host. As mentioned above, a remarkable number of retroelements, probably representing well over 50% of the genome, are found between genes in maize (2). The five most abundant families make up more than 25% of the genome, and one family alone, Opie, makes up 10–15% of the maize genome. Despite this ubiquity, few homologies to these elements were found among maize genes in the database; these elements appeared to have a pronounced preference for regions outside of genes. Indeed, fully half of the elements examined were found nested within other elements.
Splicing from Pre-mRNA Transcripts. It has been suggested that splicing of TEs from exonic sequences has evolved as a means by which TEs can minimize their deleterious host impact. We refer back to the waxy mutation described earlier (18). In that case, splicing of the Ds element inserted into the waxy gene results in partial reversion of the mutation caused by the original insertion. Presumably, even a partial amelioration of the mutant phenotype can provide some selective advantage to those TEs capable of providing it. Additional examples are found in Drosophila, C. elegans, and other plants (57–59). In addition, Marillonnet and Wessler have observed tissue-specific splicing of an element (S. Wessler, personal communication), suggesting that TEs could potentially play a role in the evolution of tissue-specific regulation of some genes. These examples may illustrate an evolutionary spectrum, from purely parasitic behavior to functional significance for the host. The original capacity to be spliced may have arisen as a way to minimize the impact of insertions into coding sequences, but on the road from poorly spliced variants that simply ameliorate mutant phenotypes to fully effective and selectively invisible splicing may have come the opportunity to develop new classes of regulation, such as tissue specificity.
Tissue Specificity of TE Activity. A good example of a likely adaptation of a TE to its host is the restriction of transposition of P and I elements to the germ line (14). It is of mutual benefit for an element to transpose in those tissues that will ensure transmission to the next host generation, but to curtail activity in somatic tissues, where transposition is likely to result in loss of host fitness without providing any benefit to the transposon. Repression of P element transposition in somatic cells occurs at the level of RNA processing (60). The 2–3 intron is spliced only in the germ cells, resulting in the absence of transposase in somatic cells. Splicing of this intron is prevented in the somatic cells by an 87-kDa protein that binds to a site in exon 2 located 12–31 bases from the 5′ splice site (61). An existing host splicing mechanism that has been highly conserved during evolution apparently has been coopted for this purpose (61, 62).
Table 1. Possible coevolved mechanisms to mitigate reduction in host fitness
Mechanism | Examples
Insertion bias for noncoding regions | Preferential insertion in regulatory regions (1); nested retrotransposons in maize (2); clustered I elements (27); mdg1 in Drosophila (28, 29)
Pre-mRNA splicing | Splicing from the maize wx gene (18); various genes in Drosophila (57), C. elegans (59), and plants (58)
Tissue specificity of transposition | Repression of P element transposition in somatic cells at the level of RNA processing (60)
TE copy number regulation | Methylation of maize Ac, Spm, and Mu by host factors (63, 64); type I and type II P element-encoded repressors (13, 68, 70–72); Ac dosage effects (16); Spm repressor action (73–75)
Regulators of mutant phenotype expression | Transposase-dependent expression of mutant phenotype caused by insertion of Spm in maize (73); masking of mutant phenotypes by alleles of host suppressor genes in Drosophila (40)
Host Regulation of TE Copy Number. Good examples of host regulation of copy number are found in maize and
Drosophila. Unknown host-encoded factors specifically methylate Ac, Spm, and Mutator elements in maize (63, 64). With Mu (64), the example is particularly striking because it represents the global methylation of dozens of previously active elements simultaneously in a single generation. The methylation is not simply related to structural features of insertion sites of Mu elements because they become specifically methylated even when inside of genes; modification is rarely detected in the flanking sequences within the genes (65) . In D. melanogaster females, expression of the gypsy element envelope gene is strongly repressed by one copy of the nonpermissive allele of flamenco [reviewed by Bucheton (6)]. A less dramatic reduction in the accumulation of other transcripts and retrotranscripts also is observed. These effects correlate well with the inhibition of gypsy transposition in the progeny of these females and are therefore likely to be responsible for this phenomenon. The effects of flamenco on gypsy expression apparently are restricted to the somatic follicle cells that surround the maternal germline. Self-Regulation of TE Copy Number. It has been shown theoretically that self-regulation of TEs cannot evolve if it is assumed that deleterious effects on host fitness are caused by increased copy number alone or are not caused by dominant lethals (66). However, if the deleterious effects are immediate and occur as a direct consequence of transposition itself, then there may be a selective advantage to elements with reduced transposition rates that still allow them to spread in the genome but at a reduced cost to their host [reviewed by Brookfield (67)]. The activity of the P elements in D. melanogaster is regulated by element-encoded repressor products. These repressors fall into two discrete categories, type I and type II. 
Type I repressors are responsible for a cellular condition known as P cytotype, which depends on a 66-kDa, P element-encoded, repressor of transposition and excision (68). The genomic position of repressor elements determines the maternal vs. zygotic inheritance of P cytotype (69). Type II repressors usually have large internal deletions, are sensitive to genomic location, but show no maternal inheritance (13, 70–72). In plants, the Ac element shows dosage effects; an increase in number of elements results in a decreased number of transpositions of the element (16). This could be interpreted as a response of the plant to increases in the level of Ac transposase or as an autoregulatory mechanism. Similarly, Spm:tnpA can protect Spm from methylation but may also act as a repressor of Spm (73, 74). Additionally, some deleted Spm elements can repress full length Spm elements in trans (75).
Regulators of Expression of Mutant Phenotypes. Very early in the investigation of mutable alleles in maize, it was discovered that the expression of some alleles depended on the presence or absence of a second factor (76). In these cases, that factor was the source of the transposase. However, as is clear from the example of some gypsy insertions whose mutant phenotype is only manifested in the presence of both Su(Hw) and a second host-encoded factor, mod(mdg4) (77), TE-induced mutant alleles can also be ameliorated by other factors. The mutant phenotypes associated with many retrotransposon insertions are masked by alleles of host suppressor genes that act as trans-regulators of retrotransposon expression (40). The argument made is that such suppressor action may allow insertion mutations to partially, or completely, escape the action of purifying selection and allow them to persist or even increase in frequency in natural populations. There is evidence for the presence of host genes with suppressor function in natural populations of Drosophila (78).
TE-Induced Characters Having Benefit to the Host
There has been considerable debate whether, in addition to deleterious effects on fitness, TE-induced variability has any significance for host organisms over evolutionary time (40). The generally unpredictable nature of TE movements, coupled with the paucity of fixed insertion sites for TEs in species such as Drosophila (26), has led some to reject the possibility of TEs having any significant evolutionary importance, other than as molecular parasites. However, there is a rapidly growing list of possible examples of TEs having evolved highly sophisticated functions, as shown by the examples briefly described below and summarized in Table 2.
Table 2. Examples of TEs having functions that benefit their hosts
New function | Examples
Insertions with regulatory functions | More than 20 examples of insertions into regulatory regions of genes (12, 38, 39)
“Molecular domestication” | P element tandem repeats in D. obscura group may provide a new host gene function (89)
Source of new introns | Introns and twintrons in Euglena gracilis plastids (24)
Replacement of normal host functions | Repair of damaged chromosome ends by HET-A and TART in Drosophila (92, 93)
A role in host cell repair mechanisms | Endogenous retroelements associated with repair of DSBs in yeast (94, 95)
Mediation of concerted evolution | P element-mediated changes in subtelomeric repeat numbers in D. melanogaster (96)
Possible functions of heterochromatic TE clusters | Developmentally programmed changes in DNA content; expression of heterochromatically embedded loci; genomic housekeeping functions (25)
DSB, double-strand chromosome break.
Insertions with Host Gene Regulatory Functions. It has been speculated for some time that changes in cis-regulatory regions of duplicated genes may be more important for the evolution and divergence of functional and morphological characters than mutations in coding sequences (see, e.g., refs. 79 and 80). However, only recently has evidence started to accumulate to support this hypothesis. In Drosophila, for instance, the three homeotic genes paired (prd), gooseberry (gsb), and gooseberry neuro (gsbn) have evolved from a single ancestral gene, following gene duplication. They now have distinct developmental functions during embryogenesis. The three corresponding proteins PRD, GSB, and GSBN are transcription factors. Li and Noll (81) demonstrated that the three proteins are interchangeable with respect to their regulatory functions and that their distinct developmental functions are a consequence of changes in the regulatory sequences rather than in the proteins themselves. Because they lend themselves so well to changes in the architecture of promoter regions, it is likely that TE mutations have been involved in this kind of regulatory evolution (82). The potential importance of TEs as modifiers of the expression of normal plant genes has been highlighted by recent findings. Long terminal repeat retrotransposons and MITEs have
been found to be associated with the genes of many plants where some of these TEs contribute regulatory sequences (7). Furthermore, the MITE elements recently discovered in Aedes aegypti (11) also are associated closely with genes. In domesticated rice, Oryza sativa, a computer-based search revealed 32 common sequences belonging to nine putative mobile element families (83). Four of these families had been previously described, but five families were first discovered through this computer search, and four of these five had characteristics of MITEs.
New Patterns of Tissue-Specific Expression. TEs can contribute to the functional diversification of genes by supplying cis-regulatory domains altering expression patterns. Earlier, we described the insertion of the gypsy element into a 5′ upstream region of the yellow gene in Drosophila causing a loss of expression of this gene in specific tissues (84). The tissue-specific alterations in expression (a kind of mutation that is more subtle than simply knocking out a gene) are due to the presence of a specific sequence of DNA that is bound by Su(Hw), which is thought to be a transcription factor. More interestingly, the Su(Hw) binding sequence seems to act as a general ‘‘buffer’’ that helps to define structural domains in the chromatin (77). Thus, gypsy may serve to introduce domains of regulation into given regions of the chromosome. This may have arisen initially as a means by which the TE could buffer itself from its chromosomal environment, but this kind of domain alteration could certainly also result in interesting variations in gene regulation. In addition to simply buffering chromosomal regions, many TEs are specifically expressed only in particular tissues at particular times. Based on recent findings, it appears that tissue specificity is a general feature of all retrotransposons in Drosophila. 
The expression patterns of 15 different families of long terminal repeat-containing retrotransposons were examined by Ding and Lipshitz (85) during normal development in different wild-type strains of D. melanogaster. Each family exhibited a pattern typical of spatial and temporal expression during embryogenesis, suggesting that each TE harbors cis-regulatory factors that interact specifically with host transcription factors. These mobile cis-regulatory factors could potentially act to modify the expression of any number of host genes.
Other Types of Insertions with Regulatory Functions. Some of the examples given thus far are anecdotal; they represent laboratory observations as to the kinds of changes that TEs can introduce into the host genome, rather than changes that have actually contributed to the evolution of the host. However, Britten (12, 38, 39) has used stringent criteria for the identification of strong cases of the involvement of TEs in the actual evolution of gene regulation. He maintains that a long term perspective is necessary in identifying and understanding mutations important for gene regulation. The number of cases he has identified is small, but growing. In addition to the plant MITE examples discussed above, he includes cases involving Alu-containing, T cell-specific enhancers in the human CD8a gene (86), the association of a retrovirus-related element with androgen regulation of the sex-limited protein (Slp) gene in mouse (87), and inverted repeats in the CyIIIa actin gene of sea urchin (88).
Tandem Repeats of P-Related Sequences in Drosophila. A number of tandem P element repeats in three closely related species of the obscura group provides a very interesting example of several unrelated TE sequences evolving together that may provide a type of host gene function, which Miller et al. have termed ‘‘molecular domestication’’ (89). In this case, the P elements have lost all of their terminal repeats and thus can no longer transpose. 
Remarkably, each cluster unit consists of a cis-regulating section composed of insertion sequences derived from unrelated TEs, followed by the first three exons that, in mobile P elements, code for a 66-kDa protein that represses P element transposition. In contrast to this normal repressor function, these stationary P element repeats are hypothesized to have evolved the function of transcription factors (90).
A Source of New Introns. Some retroelements are apparently fully adapted to their niche within exonic sequences. For example, the 143-kb Euglena gracilis plastid genome contains 155 group II and group III introns (56), nearly 10 times the number in any other known plastid DNA. The original introns were likely mobile, retrotransposable genetic elements that invaded the genome from another organism, relying in part on internally encoded enzyme activities for mobility. The group III introns appear to be streamlined versions of group II introns, sharing a common evolutionary ancestor with a group II intron. Among the E. gracilis introns are a number of introns-within-introns (twintrons), suggesting that these elements themselves have been targets of intron insertions. In one particularly interesting example (91), a group III intron is formed from domains of two individual group II introns. The authors suggest the possibility that ‘‘the introduction of one catalytic RNA into a functional domain of another catalytic RNA, through a process similar to twintron formation, can result in new combinations of sequences and structural domains that might lead to new RNA catalyzed reactions significant for RNA evolution.’’ Telomeres in Drosophila. An unusually finely tuned system between the host genome and mobile elements has evolved in Drosophila to take over a basic cellular function. Several retroelements, such as HET-A and TART, carry out the function of replacing damaged chromosome ends that is performed by telomerase in other insects (92, 93). The insertion frequency of the TEs involved has become adapted to match the average rate of telomere loss to maintain constant chromosome size. This is the best example to date of a TE providing a vital function to its host. As a Repair Mechanism of DNA Double-Strand Breaks. 
Although SINEs, LINEs, and pseudogenes are abundant in eukaryotic genomes, indicating that reverse transcriptase-mediated phenomena are important in genome evolution, the mechanisms responsible for their spread are largely unknown. The results of two recent experiments with the yeast Saccharomyces cerevisiae (94, 95) have linked reverse transcriptase-mediated events with double-strand chromosome breaks in the absence of normal repair. This suggests a possible role for endogenous retroelements in the repair of double-strand chromosome breaks under certain circumstances. Note that, in this case, as in others described here, coadaptation may have grown out of apparently parasitic element behavior; double-strand chromosome breaks may simply represent an especially good target for efficient TE insertion, and in turn, these insertions may sometimes be the most efficient repair pathway available to the host. The net effect, rapid insertional repair of breaks, is expected to benefit both host and TE.
Mediation of the Concerted Evolution of Repetitive Gene Families. Evidence for the ability of TEs to directly influence the constitution of repetitive DNA was provided by experiments using genetically marked P elements located in a subtelomeric repeat of D. melanogaster (96). After P element mobilization, the number of repeats frequently was observed to be altered, with decreases being more common than increases, due to unequal gene conversion events. Therefore, TEs may play an important role in the evolution of heterochromatin.
Changes in Genome Size. As described above, TEs may represent a variable and sometimes surprisingly high proportion of genomes, particularly in plants. By means of variation in sheer bulk, it is possible that TEs affect variability in life history traits and related characteristics because of the correlation between genome size, cell size, and various aspects of plant life form, such as growth rate and developmental time (97).
Other Possible Functions. 
The idea that heterochromatic clusters of nomadic elements are merely graveyards of dead transposons appears to be giving way to the idea that these regions may also be involved in a number of important cellular processes (25). These include developmentally programmed changes in DNA content, expression of heterochromatically embedded gene
loci, and housekeeping functions such as chromosome pairing, sister chromatid adhesion, and centromere function. The TE content of these regions may be important for these processes in ways that are not yet understood.
Discussion
One of the most compelling questions that arise when considering the new data on the preponderance of TE-derived sequences in some plant genomes is how enormous numbers of TE copies can accumulate in a single genome. For example, how is it that a single TE can make up 10% of the maize genome? Obviously, the recombinogenic properties of TEs that are hypothesized to maintain a relatively low, constant copy number in other organisms are not relevant in these cases. It may be that, in the case of low copy elements transposing at relatively low frequencies, recombination is able to purge some elements from euchromatin, but the maize example suggests that there may be a vast number of elements interspersed between genes in many locations. Recombination, then, may not be a particularly effective mechanism for purging the vast majority of repetitive sequences, many of which are clearly not located in heterochromatin. It can be argued that, wherever there is a concentration of TE sequences, heterochromatin-like structural features begin to evolve to down-regulate their expression. In turn, this would also tend to reduce the frequency of removal by ectopic recombination. These considerations motivate us to postulate that there are at least two types of elements that occupy two very different niches in the ecology of the genome: first, a type that preferentially inserts into regions distant from host gene sequences, such as heterochromatin or the regions between genes [e.g., the many retrotransposons found inserted between the genes on the third chromosome in maize (2)]; and second, a type that lives more dangerously by being more prone to insert into, or near, single copy sequences. 
We suggest that the first type escapes the ‘‘trap’’ of inactivation (via methylation or heterochromatinization) in regions outside of single copy host genes through the use of various buffer sequences; it has become specifically adapted to (or even makes up much of) these regions. As a strategy to minimize their potentially devastating effects on their hosts, these elements target regions in which recombination is minimal and where essential genes are scarce. The second type travels light and has evolved to take advantage of relatively accessible chromosomal architecture, a high concentration of transcription factors, host enhancer sequences, and horizontal transfer to maximize replication advantage. This type, represented by elements like Mu (which target single copy sequences) and P elements (at least 65% of insertions are located near enhancers) trades the disadvantage of an increased risk of negative selection for the advantages of occupying regions which are enriched for factors promoting efficient transcription and replication. This second type is postulated to be the one most likely to be discovered by geneticists (it is more likely to cause mutations) and also the one most likely to be lost through recombination (by targeting actively transcribed regions of the genome in which recombination is more frequent). We suggest that, when these elements insert in heterochromatin, they become inactive because they are not well adapted to that environment. We therefore need to consider the possibility that there may be more than one strategy to being a transposon and that each strategy, although successful from an evolutionary perspective, has a very different dynamic. Each type would be expected to affect host evolution in a different way. Type 1 would affect the overall architecture of the host chromosome, rather than the specific expression characteristics of individual genes. 
In contrast, type 2 would participate more directly in changes in gene regulation, such as is observed at the Adh1 locus in maize.

A second area of considerable interest from an evolutionary perspective is the stress-induced mutability that is characteristic of some TEs (98). A gradualist argument leveled against the idea that regulatory changes resulting from TE-induced mutations may be important in evolution is that such ‘‘macromutations,’’ like Goldschmidt’s hopeful monsters, would be unlikely to arise at the precise time when a new ecological niche became available (40). However, there is increasing evidence that TE-induced mutation rates are far from constant. High frequencies of mutations are expected to appear in waves, such as those resulting from hybrid dysgenesis that accompany element invasions of new populations or species. TE-induced mutations have been recorded to occur in transpositional bursts (99) whose cause is not well understood but is likely related to inbreeding and other forms of genomic or environmental stress, possibly akin to the genomic stress referred to by McClintock (100). For example, it appears that plant retroelements are normally quiescent but can be activated by stress (98), such as cell culturing (101) or microbial infection (102). We suggest that the proximal, or adaptive, function in these cases is to increase element copy number during periods of stress to ensure a high probability of transmission by those host variants that happen to survive. With respect to the evolution of the host, however, the preadaptive, or exaptive, function is to provide variation during periods of stress. In this case, as in the other cases outlined above, the transposon does not have to ‘‘know’’ that it is contributing to the evolution of its host, nor has it evolved to do so, but out of its elemental parasitic behavior arises the potential for both dramatic and subtle changes in the genome of its host.

Conclusions

We are only just beginning to glimpse the complexity of possible interactions in the coevolution of TEs and their hosts. A full understanding of the population and evolutionary dynamics of these interactions, and the consequences to hosts, must await the results of further research.
However, some tentative conclusions can be made on the basis of current information. The primary parasitic nature of these sequences during their invasion of host populations is beyond dispute, but we believe that this does not by any means represent the whole story. A number of features of both TEs and their hosts can be interpreted as coadaptations to mitigate or abolish the reduction of fitness due to unbridled transposition. Furthermore, the number of well-documented cases in which TE sequences have been coopted successfully by the host to provide a useful function is small but is growing rapidly. We suggest that the process by which elements and their hosts coevolve mutually beneficial strategies may lend itself to the production of genetic variation that would not otherwise have arisen. Although the role of TEs in evolution may not turn out to be precisely what McClintock had in mind when she first described controlling elements in maize, the importance of their role in the evolution of gene regulation and other host functions may yet surprise us. To paraphrase Dobzhansky’s famous phrase, there is good reason to believe that ‘‘Nothing about mobile elements makes sense except in the light of evolution.’’

We thank Dr. Zhijian Tu for comments on the manuscript. This work was supported by National Science Foundation Grant DEB9119349 to M.K. D.L. was supported by National Institutes of Health Training Program in Insect Science 1T32 AI07475.
1. McClintock, B. (1948) Carnegie Inst. Wash. Yearbook 47, 155–169.
2. SanMiguel, P., Tikhonov, A., Jin, Y. K., Motchoulskaia, N., Zakharov, D., Melake-Berhan, A., Springer, P. S., Edwards, K. J., Lee, M., Avramova, Z. & Bennetzen, J. L. (1996) Science 274, 765–768.
3. Pimpinelli, S., Berloco, M., Fanti, L., Dimitri, P., Bonaccorsi, S., Marchetti, E., Caizzi, R., Caggesse, C. & Gatti, M. (1995) Proc. Natl. Acad. Sci. USA 92, 3804–3808.
4. Finnegan, D. J. (1992) Curr. Opin. Genet. Dev. 2, 861–867.
5. McClure, M. A. (1993) in Reverse Transcriptase (Cold Spring Harbor Lab. Press, Plainview, NY), pp. 425–443.
6. Bucheton, A. (1995) Trends Genet. 11, 349–353.
7. Wessler, S., Bureau, T. E. & White, S. E. (1995) Curr. Opin. Genet. Dev. 5, 814–821.
8. Unsal, K. & Morgan, G. T. (1995) J. Mol. Biol. 248, 812–823.
9. Morgan, J. T. (1995) J. Mol. Biol. 254, 1–5.
10. Smit, A. F. A. & Riggs, A. D. (1996) Proc. Natl. Acad. Sci. USA 93, 1443–1448.
11. Tu, J. (1997) Proc. Natl. Acad. Sci. USA, in press.
12. Britten, R. J. (1996) Proc. Natl. Acad. Sci. USA 93, 9374–9377.
13. Engels, W. R. (1996) in Transposable Elements, eds. Saedler, H. & Gierl, A. (Springer, Berlin), pp. 103–123.
14. Bregliano, J. C. & Kidwell, M. G. (1983) in Mobile Genetic Elements, ed. Shapiro, J. A. (Academic, New York), pp. 363–410.
15. Rubin, G. M., Kidwell, M. G. & Bingham, P. M. (1982) Cell 29, 987–994.
16. McClintock, B. (1951) Cold Spring Harbor Symp. Quant. Biol. 16, 13–47.
17. Wessler, S., Baran, G., Varagona, M. & Dellaporta, S. (1986) EMBO J. 5, 2427–2432.
18. Wessler, S., Baran, G. & Varagona, M. (1987) Science 237, 916–918.
19. Corces, V. G. & Geyer, P. K. (1991) Trends Genet. 7, 86–90.
20. Lister, C., Jackson, D. & Martin, C. (1993) Plant Cell 5, 1541–1553.
21. Kloeckener-Gruissem, B., Vogel, J. M. & Freeling, M. (1992) EMBO J. 11, 157–166.
22. Kloeckener-Gruissem, B. & Freeling, M. (1995) Proc. Natl. Acad. Sci. USA 92, 1836–1840.
23. Greene, B., Walko, R. & Hake, S. (1994) Genetics 138, 1275–1285.
24. Bradley, D., Carpenter, R., Sommer, H., Hartley, N. & Coen, E. (1993) Cell 72, 85–95.
25. Arkhipova, I. R., Lyubomirskaya, N. V. & Ilyin, Y. V. (1995) Drosophila Retrotransposons (Landes, Austin, TX).
26. Charlesworth, B. & Langley, C. H. (1989) Annu. Rev. Genet. 23, 251–287.
27. Vaury, C., Bucheton, A. & Pelisson, A. (1989) Chromosoma 98, 215–224.
28. Tchurikov, N. A., Zelentsova, E. S. & Georgiev, G. P. (1980) Nucleic Acids Res. 8, 1243–1258.
29. Tchurikov, N. A., Ilyin, Y. V., Skryabin, K. G., Anan’ev, E. V., Bayev, A. A., Krayev, A. S., Zelentsova, E. S., Kulguskin, V. V., Lyubomirskaya, N. V. & Georgiev, G. P. (1981) Cold Spring Harbor Symp. Quant. Biol. 45, 655–665.
30. Banga, S. S., Velazquez, A. & Boyd, J. B. (1991) Mutat. Res. 255, 79–88.
31. Charlesworth, B., Sniegowski, P. & Stephan, W. (1994) Nature (London) 371, 215–220.
32. Lyttle, T. W. & Haymer, D. S. (1993) in Transposable Elements and Evolution, ed. McDonald, J. F. (Kluwer, Dordrecht, The Netherlands).
33. Simmons, G. M. (1992) Mol. Biol. Evol. 9, 1050–1060.
34. Mackay, T. F. C. (1987) Genet. Res. 49, 225–233.
35. Mackay, T. F., Lyman, R. F. & Jackson, M. S. (1992) Genetics 130, 315–332.
36. Torkamanzehi, A., Moran, C. & Nicholas, F. W. (1992) Genetics 131, 73–78.
37. Clark, A. G., Wang, L. & Hulleberg, T. (1995) Genetics 139, 337–348.
38. Britten, R. J. (1996) Mol. Phylogenet. Evol. 5, 13–17.
39. Britten, R. J. (1997) Gene, in press.
40. McDonald, J. F. (1995) Trends Ecol. Evol. 10, 123–126.
41. Doolittle, W. F. & Sapienza, C. (1980) Nature (London) 284, 601–603.
42. Orgel, L. E. & Crick, F. H. C. (1980) Nature (London) 284.
43. Hickey, D. A. (1982) Genetics 101, 519–531.
44. Plasterk, R. A. (1995) in Mobile Genetic Elements, ed. Sherratt, D. J. (IRL, Oxford), pp. 18–37.
45. Kiyasu, P. K. & Kidwell, M. G. (1984) Genet. Res. 44, 251–259.
46. Kidwell, M. G. (1993) Annu. Rev. Genet. 27, 235–256.
47. Miller, W. J., Kruckenhauser, L. & Pinsker, W. (1996) in Transgenic Organisms: Biological and Social Implications, eds. Tomiuk, J., Woehrmann, K. & Sentker, A. (Birkhaeuser, Basel), pp. 21–35.
48. Hurst, G. D. D., Hurst, L. D. & Majerus, M. E. N. (1992) Nature (London) 356, 659–660.
49. Kidwell, M. G. (1994) J. Hered. 85, 339–346.
50. Robertson, H. M. (1993) Nature (London) 362, 241–245.
51. Lohe, A. R., Moriyama, E. N., Lidholm, D. A. & Hartl, D. L. (1995) Mol. Biol. Evol. 12, 62–72.
52. Clark, J. B., Maddison, W. P. & Kidwell, M. G. (1994) Mol. Biol. Evol. 11, 40–50.
53. Spradling, A. C., Stern, D. M., Kiss, I., Roote, J., Laverty, T. & Rubin, G. M. (1995) Proc. Natl. Acad. Sci. USA 92, 10824–10830.
54. Bellen, H., O’Kane, C. J., Wilson, C., Grossniklaus, U., Pearson, R. K. & Gehring, W. J. (1989) Genes Dev. 3, 1288–1300.
55. Ji, H., Moore, D. P., Blomberg, M. A., Braiterman, L. T., Voytas, D. F., Natsoulis, G. & Boeke, J. D. (1993) Cell 73, 1007–1018.
56. Hallick, R. B., Hong, L., Drager, R. G., Favreau, M. R., Montfort, A., Orsat, B., Spielmann, A. & Stutz, E. (1993) Nucleic Acids Res. 21, 3537–3544.
57. Fridell, R. A., Pret, A. M. & Searles, L. L. (1990) Genes Dev. 4, 559–566.
58. Purugganan, M. & Wessler, S. (1992) Genetica 86, 295–303.
59. Rushforth, A. M. & Anderson, P. (1996) Mol. Cell. Biol. 16, 422–429.
60. Laski, F. A., Rio, D. C. & Rubin, G. M. (1986) Cell 44, 7–19.
61. Tseng, J. C., Zollman, S., Chain, A. C. & Laski, F. A. (1991) Mech. Dev. 35, 65–72.
62. Bingham, P. M., Chou, T. B., I., M. & Zachar, Z. (1988) Trends Genet. 4, 134–138.
63. Chomet, P. S., Wessler, S. & Dellaporta, S. L. (1987) EMBO J. 6, 295–302.
64. Chandler, V. & Walbot, V. (1986) Proc. Natl. Acad. Sci. USA 83, 1767–1771.
65. Bennetzen, J. L. (1996) in Transposable Elements, eds. Saedler, H. & Gierl, A. (Springer, Berlin), pp. 195–229.
66. Charlesworth, B. & Langley, C. H. (1986) Genetics 112, 359–383.
67. Brookfield, J. F. Y. (1995) in Mobile Genetic Elements, ed. Sherratt, D. J. (IRL, Oxford), pp. 130–153.
68. Robertson, H. M. & Engels, W. R. (1989) Genetics 123, 815–824.
69. Misra, S. & Rio, D. C. (1990) Cell 62, 269–284.
70. Black, D. M., Jackson, M. S., Kidwell, M. G. & Dover, G. A. (1987) EMBO J. 6, 4125–4135.
71. Jackson, M. S., Black, D. M. & Dover, G. A. (1988) Genetics 120, 1003–1013.
72. Rasmusson, K. E., Raymond, J. D. & Simmons, M. J. (1993) Genetics 133, 605–622.
73. Fedoroff, N., Schlappi, M. & Raina, R. (1995) Bioessays 17, 291–297.
74. Schlappi, M., Raina, R. & Fedoroff, N. (1994) Cell 77, 427–437.
75. Cuypers, H., Dash, S., Peterson, P. A., Saedler, H. & Gierl, A. (1988) EMBO J. 7, 2953–2960.
76. McClintock, B. (1958) Carnegie Inst. Wash. Yearbook 57, 415–429.
77. Gdula, D. A., Gerasimova, T. I. & Corces, V. G. (1996) Proc. Natl. Acad. Sci. USA 93, 9378–9383.
78. Csink, A. & McDonald, J. F. (1990) Genetics 126, 375–385.
79. King, M. C. & Wilson, A. C. (1975) Science 188, 107–116.
80. Britten, R. J. & Davidson, E. H. (1971) Q. Rev. Biol. 46, 111–138.
81. Li, X. & Noll, M. (1994) Nature (London) 367, 83–87.
82. Fincham, J. R. S. & Sastry, G. R. K. (1974) Annu. Rev. Genet. 8, 15–50.
83. Bureau, T. E., Ronald, P. C. & Wessler, S. R. (1996) Proc. Natl. Acad. Sci. USA 93, 8524–8529.
84. Pelisson, A., Song, S. U., Prud’homme, N., Smith, P. A., Bucheton, A. & Corces, V. G. (1994) EMBO J. 13, 4401–4411.
85. Ding, D. & Lipshitz, H. D. (1994) Genet. Res. 64, 167–181.
86. Hambor, J. E., Mennone, J., Coon, M. E., Hanke, J. H. & Kavathas, P. (1993) Mol. Cell. Biol. 13, 7056–7070.
87. Stavenhagen, J. B. & Robins, D. M. (1988) Cell 55, 247–254.
88. Anderson, R., Britten, R. J. & Davidson, E. H. (1994) Dev. Biol. 163, 11–18.
89. Miller, W. J., McDonald, J. F. & Pinsker, W. (1997) Genetica, in press.
90. Miller, W. J., Paricio, N., Hagemann, S., Martinez-Sebastian, M. J., Pinsker, W. & de Frutos, R. (1995) Gene 156, 167–174.
91. Hong, L. & Hallick, R. B. (1994) Genes Dev. 8, 1589–1599.
92. Biessmann, H., Valgeirsdottir, K., Lofsky, A., Chin, C., Ginther, B., Levis, R. W. & Pardue, M. L. (1992) Mol. Cell. Biol. 12, 3910–3918.
93. Pardue, M. L., Danilevskaya, O. N., Lowenhaupt, K., Slot, F. & Traverse, K. L. (1996) Trends Genet. 12, 48–52.
94. Moore, J. K. & Haber, J. E. (1996) Nature (London) 383, 644–646.
95. Teng, S. C., Kim, B. & Gabriel, A. (1996) Nature (London) 383, 641–644.
96. Thompson-Stewart, D., Karpen, G. H. & Spradling, A. C. (1994) Proc. Natl. Acad. Sci. USA 91, 9042–9046.
97. Smyth, D. R. (1993) in Control of Plant Gene Expression, ed. Verma, D. P. S. (CRC, Boca Raton, FL).
98. Wessler, S. R. (1996) Curr. Biol. 6, 959–961.
99. Gerasimova, T. I., Matjunima, L. V., Mizrokhi, L. J. & Georgiev, G. P. (1985) EMBO J. 4, 3773–3779.
100. McClintock, B. (1984) Science 226, 792–801.
101. Pouteau, S., Huttner, E., Grandbastien, M. A. & Caboche, M. (1991) EMBO J. 10, 1911–1918.
102. Pouteau, S., Grandbastien, M. A. & Boccara, M. (1994) Plant J. 5, 535–542.
Proc. Natl. Acad. Sci. USA Vol. 94, pp. 7712–7718, July 1997 Colloquium Paper
This paper was presented at a colloquium entitled ‘‘Genetics and the Origin of Species,’’ organized by Francisco J. Ayala (Co-chair) and Walter M. Fitch (Co-chair), held January 30–February 1, 1997, at the National Academy of Sciences Beckman Center in Irvine, CA.
Long term trends in the evolution of H(3) HA1 human influenza type A
WALTER M. FITCH*†, ROBIN M. BUSH*, CATHERINE A. BENDER‡, AND NANCY J. COX‡
*Department of Ecology and Evolutionary Biology, University of California, Irvine, CA 92692; and ‡Influenza Branch, Centers for Disease Control and Prevention, Atlanta, GA 30333
ABSTRACT We have studied the HA1 domain of 254 human influenza A(H3N2) virus genes for clues that might help identify characteristics of hemagglutinins (HAs) of circulating strains that are predictive of that strain’s epidemic potential. Our preliminary findings include the following. (i) The most parsimonious tree found requires 1,260 substitutions of which 712 are silent and 548 are replacement substitutions. (ii) The HA1 portion of the HA gene is evolving at a rate of 5.7 nucleotide substitutions/year or 5.7 × 10⁻³ substitutions/site per year. (iii) The replacement substitutions are distributed randomly across the three positions of the codon when allowance is made for the number of ways each codon can change the encoded amino acid. (iv) The replacement substitutions are not distributed randomly over the branches of the tree, there being 2.2 times more changes per tip branch than for non-tip branches. This result is independent of how the virus was amplified (egg grown or kidney cell grown) prior to sequencing or if sequencing was carried out directly on the original clinical specimen by PCR. (v) These excess changes on the tip branches are probably the result of a bias in the choice of strains to sequence and the detection of deleterious mutations that had not yet been removed by negative selection. (vi) There are six hypervariable codons accumulating replacement substitutions at an average rate that is 7.2 times that of the other varied codons. (vii) The number of variable codons in the trunk branches (the winners of the competitive race against the immune system) is 47 ± 5, significantly fewer than in the twigs (90 ± 7), which in turn is significantly fewer variable codons than in tip branches (175 ± 8). (viii) A minimum of one of every 12 branches has nodes at opposite ends representing viruses that reside on different continents. This is, however, no more than would be expected if one were to randomly reassign the continent of origin of the isolates. (ix) Of 99 codons with at least four mutations, 31 have ratios of non-silent to silent changes with probabilities less than 0.05 of occurring by chance, and 14 of those have probabilities <0.005. These observations strongly support positive Darwinian selection. We suggest that the small number of variable positions along the successful trunk lineage, together with knowledge of the codons that have shown positive selection, may provide clues that permit an improved prediction of which strains will cause epidemics and therefore should be used for vaccine production.

Human influenza is an annual cause of morbidity and mortality worldwide which has a cumulative impact that is greater than the effects of the pandemics that occur every 20–30 years (1). The principal way to reduce this health problem is by vaccination. However, human influenza genes rapidly incorporate new mutations, mutations that cause changes in the hemagglutinin (HA) molecule, the major molecule to which the immune system makes its humoral response. The effect of these mutations is to change the HA molecule so that, at least temporarily, it is no longer recognized by these antibodies, thereby permitting the virus to multiply. For this reason, old vaccines lose their efficacy and new ones must be made with viruses having the altered HAs. That in turn raises the question of which of today’s strains should be used to make that vaccine; that is, which of today’s strains is most likely to be the progenitor of next year’s epidemic strains. Vaccines are now selected on the basis of knowing which of the current strains are least reactive to current antibodies (on the assumption that these strains have the best opportunity to spread) and which strains seem currently to be spreading the most effectively. Nevertheless, it would be useful to develop better predictive methods of deciding that question. It is the purpose of this paper to begin exploring how a knowledge of the evolutionary history of the HA gene might contribute to improved prediction of epidemic strains and of the strain of choice for the next vaccine. We shall explore the issues of (i) the rate of evolution, (ii) mutations occurring during virus propagation in the laboratory, (iii) the intercontinental spread of the influenza virus, (iv) the mutabilities of the different coding sites along the gene, and (v) sites subjected to strong selection.

DATA, DEFINITIONS AND METHODS
Data. This study utilizes 254 nucleotide sequences for the HA1 gene of HA obtained from human influenza A(H3N2) viruses isolated from 1984 to 1996. This time period was chosen because the previous period from 1968 to 1983 was sparsely sampled. Isolates A/Oita/83 and A/Caen/1/84 were chosen to root the tree. Isolates A/Texas/12835/83 and A/Texas/12764/83 were included because their HAs were located on the tree among the 1984 and later isolates and it was felt that one should use as dense a tree as possible. We know the geographical location from which all the viruses were obtained and the month of isolation of 206 of these isolates. Of the 254 isolates, 160 were from the four years 1993–1996; the other 94 were from the preceding 9 years. These sequences were all 329 codons (987 nucleotides) long with no gaps required for homologous alignment. Viruses were isolated from the original clinical sample either in embryonated hens’ eggs or in Madin–Darby canine kidney, Spafas, chicken kidney, or monkey kidney cell culture. The substrate for virus propagation was eggs for 126 strains, kidney cells for 95 strains, and, at least partly, unknown for 30 strains.
© 1997 by The National Academy of Sciences 0027-8424/97/947712-7$2.00/0 PNAS is available online at http://www.pnas.org.
Abbreviation: HA, hemagglutinin. †To whom reprint requests should be addressed.
For three clinical isolates, the sequence was obtained directly without isolation in eggs or cell culture. Viruses with passages in both hens’ eggs and kidney cells were assigned to the egg category.

Definitions. There are various ways of using sequences for making trees and it will be necessary to keep in mind the alternatives. Where we use the unambiguous nucleotide sequences, we shall call the inferred changes (nucleotide) substitutions. Where we use the amino acids, we shall call the inferred changes (amino acid) replacements. If we back translate the amino acid sequences into ambiguous codons, we get substitutions, but only those required to change the amino acid. These changes are called replacement substitutions, i.e., those nucleotide substitutions that cause amino acid replacements. The difference between the substitution set and replacement-substitution set is the set of silent (or synonymous) substitutions. We will distinguish substitutions, replacements, replacement substitutions, and silent substitutions as defined here. Where more than one category might be relevant to the statement or the meaning is clear, changes or mutations may be used.

The computer program numbers all the nodes of the tree, the first 254 numbers being given to the tip nodes that represent the original sequences and the next 252 numbers going to interior nodes of degree 3, nodes with three branches attached. Every branch is also labeled, and that label is the same as the label on the node to which it descends. Thus we can unambiguously speak of tip branches as well as tip nodes using the same number. We define sister nodes restrictively to be those pairs of tip sequences that, topologically, are each other’s most closely related sequence. The trunk of the tree is defined as the set of interior nodes leading from the root down to that tip that is farthest removed from the root. In the present case, that tip is A/Wuzhou/1/96. The trunk tip is treated as a tip rather than as a trunk branch. All other branches between the trunk and tip branches are called twigs.

The mutations are assigned not only to the branches, but to the positions of the sequence as well. This enables one to count the number of codons that have changed zero, one, two, ... times. These numbers can be used to determine if the mutations are being distributed randomly among the positions of the sequence. This implies a (or perhaps more than one) Poisson distribution. It is often found (and it is here) that there are a minimum of three categories of variability. One may observe a set of unvaried positions that in fact turns out to be the sum of two types of positions. The first type is the set of invariable positions, those positions that are so vital to the necessary functioning of a protein that any change in that position causes the organism (virus) to die out. The other type is the group of positions that are variable but (by chance) unvaried nevertheless. Thus the unvaried are the invariable plus the unvaried-variable. The second category are the variable positions, positions that might but need not have varied in the sample. The third category are the hypervariable positions, positions that are changing significantly faster than the variable positions. There are other ways of fitting variability to the observations, such as the γ distribution, but these three categories will serve our purposes here.

Methods. HA sequences used in this analysis were generated at the Centers for Disease Control and Prevention over a 10-year period as part of ongoing routine genetic analyses of HA genes of variant and typical influenza field strains. Influenza A(H3N2) viruses chosen for this analysis were from the collection of the Centers for Disease Control and Prevention. The abbreviation, country of origin, date of collection, and passage history of these viruses are available from the authors.
Viruses were propagated at a low multiplicity of infection in embryonated eggs or kidney tissue culture. Sequence analysis for three viruses was obtained directly from the original
clinical isolate by PCR. Viruses sequenced before 1993 were purified by centrifugation on a discontinuous sucrose gradient or pelleted by centrifugation for 1 hr at 35,000 rpm in a SW 50.1 rotor (Beckman Instruments) at 4°C. The methods for virus purification and subsequent isolation of RNA have been described (2). Virus isolated or obtained after 1993 required no purification before isolation of RNA for sequence analysis. Genomic RNA was extracted by phenol/chloroform from purified or pelleted virus (3) or from 100 µl of allantoic fluid or tissue culture media with the Qiagen RNeasy Total RNA Purification Kit (Qiagen, Chatsworth, CA). A number of the isolates were sequenced directly from the RNA as described (4). Four internal primers complementary to the viral mRNA sense strand were used to sequence the HA1 domain of the HA genes. Primer sequences are as follows: R1073 (5′ d-CCTGCGATTGCGCCGATT), R792 (5′ d-CAGTATGTCTCCCGGTTT), R570 (5′ d-TGGCATAGTCACGTTCAG), and R362 (5′ d-TAAGGGTAACAGTTGCTG). The majority of the isolates were sequenced from reverse transcription–PCR amplified single-stranded or double-stranded DNA (5, 6). Complementary DNA synthesis and PCR amplification of the HA1 domain of the HA genes were carried out using forward primer 7 (5′-CTATCATTGCTTTGAGC-3′) and reverse primer 1184 (5′-ATGGCTGCTTGAGTGCTT-3′). The PCR-derived single-stranded DNA was used as a template for the Sequenase (United States Biochemical) sequencing kit, and the double-stranded DNA was sequenced by dye terminator cycle sequencing chemistry using a model 373A DNA Sequencing System (Perkin–Elmer). Primers used for the latter two sequencing methods are the same as described for direct RNA sequencing.
FIG. 1. Overall structure of the most parsimonious trees. The thick line running from the lower left (p = root) to the upper right (open square) is called the trunk and represents the successful H3N2 lineage. The vertical lines indicate the range of isolates from the flu years (October 1 to September 30).
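The trunk, twig, and tip classification that underlies Fig. 1 can be illustrated on a toy tree. The following is a minimal sketch, assuming a tree stored as a child-to-parent map with per-branch substitution counts; the function name `classify_branches`, the node names, and the branch lengths are hypothetical and are not taken from the authors' program. It follows the Definitions section: a branch carries the label of the node to which it descends, the trunk leads to the tip farthest from the root, and the trunk tip itself is counted as a tip.

```python
from collections import defaultdict


def classify_branches(parent, length):
    """Label every branch of a rooted tree as 'trunk', 'twig', or 'tip'.

    parent maps each node to its parent (the root maps to None).
    length maps each non-root node to the number of substitutions on the
    branch that descends to it.
    """
    children = defaultdict(list)
    for node, par in parent.items():
        if par is not None:
            children[par].append(node)
    tips = {node for node in parent if node not in children}

    def distance_from_root(node):
        total = 0
        while parent[node] is not None:
            total += length[node]
            node = parent[node]
        return total

    # The trunk leads to the tip that is farthest removed from the root.
    trunk_tip = max(sorted(tips), key=distance_from_root)

    # Interior nodes on the path from the trunk tip back toward the root.
    trunk_nodes = set()
    node = parent[trunk_tip]
    while node is not None and parent[node] is not None:
        trunk_nodes.add(node)
        node = parent[node]

    labels = {}
    for node, par in parent.items():
        if par is None:
            continue  # the root has no branch above it
        if node in tips:
            labels[node] = "tip"  # includes the trunk tip itself
        elif node in trunk_nodes:
            labels[node] = "trunk"
        else:
            labels[node] = "twig"
    return trunk_tip, labels


# Hypothetical toy tree (not from the paper's data).
parent = {"R": None, "A": "R", "t1": "R", "B": "A", "D": "A",
          "t2": "D", "t6": "D", "t3": "B", "C": "B", "t4": "C", "t5": "C"}
length = {"A": 2, "t1": 1, "B": 3, "D": 1, "t2": 1, "t6": 2,
          "t3": 2, "C": 1, "t4": 1, "t5": 4}

trunk_tip, labels = classify_branches(parent, length)
print(trunk_tip, labels)
```

Distance here is measured in substitutions; if "farthest removed" were instead measured in number of branches or in time, only `distance_from_root` would need to change.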
FIG. 2. Rate of evolution of human influenza HA1. The y axis shows the number of replacement substitutions between the root and a tip sequence. The x axis shows the time of isolation of the virus to the month where known (206 sequences), or to the month of June if the month was not known (48 sequences). Each of the 254 sequences is represented in the graph but, if there were more than one isolate for the same month and year, their distances were averaged. A least squares fit to the data gives a slope of 3.20 replacement substitutions/year. The two tubes show an apparent increase in the rate of replacement substitutions about 1992. However, we cannot rule out the possibility that this is a consequence of a more intensive sampling of the population in the last four years.
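The per-site and per-codon rates quoted in the paper follow from the fitted slopes by simple division, and the slope itself is an ordinary least-squares estimate. The sketch below reproduces that arithmetic from figures stated in the text (987 nucleotides, 329 codons, slopes of 5.67 and 3.20 per year, and an average trunk-to-tip distance of 8.07 substitutions). The helper `least_squares_slope` and the four data points passed to it are hypothetical illustrations, not the authors' data or code.

```python
def least_squares_slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Values stated in the paper.
N_SITES = 987              # nucleotides in the HA1 sequences
N_CODONS = 329             # codons in the HA1 sequences
SLOPE_ALL = 5.67           # substitutions/year (all substitutions)
SLOPE_REPLACEMENT = 3.20   # replacement substitutions/year (Fig. 2)
MEAN_TRUNK_TO_TIP = 8.07   # average substitutions from trunk to tip

rate_per_site = SLOPE_ALL / N_SITES             # substitutions/site per year
rate_per_codon = SLOPE_REPLACEMENT / N_CODONS   # replacement substitutions/codon per year
mean_tip_age = MEAN_TRUNK_TO_TIP / SLOPE_ALL    # years since leaving the trunk

print(f"{rate_per_site:.2e} substitutions/site per year")
print(f"{rate_per_codon:.2e} replacement substitutions/codon per year")
print(f"{mean_tip_age:.2f} years average tip age")

# Hypothetical, exactly linear toy data to show the slope calculation.
toy_years = [1984, 1988, 1992, 1996]
toy_distances = [0.0, 12.8, 25.6, 38.4]
toy_slope = least_squares_slope(toy_years, toy_distances)
```

With real data the points scatter around the line, so the slope would carry a standard error that this sketch does not estimate.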
The Tree. The most parsimonious tree found, 1,260 substitutions in length, is shown in Fig. 1. Of the 1,260 substitutions, 548 were replacement substitutions while 712 were silent substitutions. There are 116 branches with no substitutions on them. In Fig. 1, all the branches that were of zero length in all examined most-parsimonious trees have been collapsed to
produce ancestral nodes that give rise, not to 2 immediate descendants, but 3–10 immediate descendants. If one were to resolve these nodes to produce all possible, strictly bifurcating, most parsimonious trees, there would be in excess of 1050 different trees. The tree presented, through its multichotomous nodes, is an accurate rendering of that information common to all of those many trees. The root is the lower leftmost node of the tree while the trunk tip is the upper rightmost tip of the tree. The trunk comprises the upper, thick branches of the tree. Thirty sequences are ancestral to other sequences. To the right of the tree are vertical bars that indicate the range of the isolates by year. A peculiarity of those ranges is that there is essentially no overlap between isolates of 1988 and 1989, as if the 1989 form was so fit that all of the 1988 lineages were eliminated. By contrast, 1990 off-trunk strains were surviving to give rise to descendants isolated in every year from 1990 through 1994. Rate of Evolution. The distance of the tips from the root may be plotted against the month of their isolation where known, at June otherwise, and the result is shown in Fig. 2. The slope is 5.67 substitutionsyyear or 5.7 3 1023 substitutionsy nucleotide per year. This is consistent with previous estimates
Table 1. The distribution of replacement substitutions over the codon positions
Table 2. The distribution of replacement substitutions over the branch types
Sequences have been deposited in GenBank under the accession numbers AF008656 to AF008909. Most parsimonious trees were obtained from the 254 nucleotide sequences using test version 4.0d52 of the program PAUP provided by D. L. Swofford (7) using the tree bisection– reconnection option while holding 200 trees. Where there was more than one way of assigning substitutions to the branches, we used the ACCTRAN option which accepts changes as soon as possible. When amino acid sequences were used, we assumed that the nucleotide tree was the correct topology. When we wished to determine replacement substitutions, we backtranslated the amino acids into ambiguous codons using the ANCESTOR program (8). Poisson fits to the substitution frequencies were by the method of Fitch and Markowitz (9).
RESULTS
Codon position 1 2 3 Sum
Observed
Expected
x2
217 254 77 548
224.4 242.3 81.4 548
0.244 0.565 0.238 1.047
The table shows the number of replacement substitutions in each codon position and the number expected if distributed randomly over the gene, assuming the distribution of codons found in these data and that all substitutions have equal probability. The probability of a worse fit occurring by chance is '0.6; df 5 2. Thus we cannot reject the assumptions as untrue.
Branch
No.
Observed
Expected
x2
ryb
Trunk Twig Tip Sum
70 182 254 506
50 118 379 547
75.7 296.7 274.6 547
8.7 31.5 49.7 79.9
0.71 0.65 1.49
The table shows the number of trunk, twig, and tip branches in the second column and, in the third column, the number of changes observed on them. The fourth column shows the number expected if the 547 replacement substitutions were distributed randomly among the branches. The x2 is for the difference between expected and observed. The probability of a worse fit occurring by chance is ,10217; df 5 2. The last column (ryb) shows replacement substitutionsybranch.
Colloquium Paper: Fitch et al.
Proc. Natl. Acad. Sci. USA 94 (1997)
Table 3. The distribution of tip replacement substitutions according to method of DNA amplification
Table 5.
Poisson fit of trunk replacement substitutions Distribution
Number
Isolation method
No. of isolates
Observed
Expected
x2
Embryonated eggs Cell culture PCR Other Sum
126 95 3 30 254
183 144 7 45 379
188.0 141.7 4.5 44.8 379.0
0.133 0.036 1.423 0.001 1.593
The second column shows how the 254 isolates were propagated. The category embyonated eggs includes all isolates that had all or part of their passage history in eggs. Cell culture includes only those isolates that were known to have been passaged only in cells. PCR indicates those isolates sequenced directly from the original clinical isolate. Other includes those isolates that did not fit the above three categories. The third column shows how many tip replacement substitutions occurred in each kind of tip, while the fourth column shows what would be expected if no method peculiarly increased the number of changes observed. The probability of a worse fit occurring by chance is 0.65; df 5 3.
of 5.7 3 1023 (10) and 6.7 3 1023 (11). The overall rate for replacement substitutions is 9.7 3 1023 codonsyyear. But if the faster rate in the recent time period reflects discovering more of the substitutions that occur, then a better estimate of the rate of HA1 evolution is 16 3 1023 replacement substitutionsy codon per year. By estimating the average age of the tip isolates measured from the time when they branched off the trunk, we can get an estimate of how long the losers survive. The average number of substitutions from trunk to tip is only 8.07. Given the rate of evolution as 5.67 substitutionsyyear, the average age of the tips is only 1.42 years. The longest lived branch is 4.8 years old. Replacement Substitutions by Codon Position. Table 1 shows the distribution of replacement substitutions by codon position. The number of each of the 61 nonterminating codons for all 254 sequences was determined. Because we know how many ways each codon can change in the first, second, and third positions so as to change the encoded amino acids, we can immediately estimate the expected number of times a replacement substitution would occur in each of the three codon positions if the 548 replacement substitutions were distributed randomly over the codons (12). One can readily see that they are distributed randomly. While one expects more changes in the second position than in the first position, as we get here, the more common result with other genes is that more replacement substitutions occur in the first than in the second position. This is usually attributed to substitutions being more conservative if they occur in the first position. Thus more radical changes may be permitted in HA. Replacement Substitutions by Branch Type. The number of replacement substitutions for each of the three types of branches, trunk, twig, and tip, are shown in Table 2. Also shown are the expected values if the 547 changes were distributed randomly over the branches. 
The table shows that the distribution is radically different from that expected for a random distribution. The tips have a greater than expected number of replacement substitutions, while the trunk and twigs have too few. Indeed the tips have more than 2.2 times as many changes per branch as do the other branches. This amounts to about 0.8 extra changes on each tip. Table 4.
No. of changes/codon          0      1      2      3      4
No. of codons observed      299     19      5      5      1
No. of codons expected     16.7   17.2    8.8    3.0    1.0
No. of invariable codons  282.3
χ²                            -    0.2    1.7    1.3    0.0
The top row shows the number of changes per codon. The second row shows the number of codons with that number of changes. The third row shows the number of changes expected in a best fit to the model. The model asserts that all the variable codons are equally variable but that there is also a class of invariable codons. Because of the invariables, the fit is to those codons that have changes in them. That fit induces an expected number of variable but unvaried codons (16.7 in this case) which, when subtracted from the number of observed unvaried positions (299) yields the number of invariable codons (282.3). χ² = 3.15; df = 2; P = 0.21. The number of trunk variable codons = 46.7 ± 5.4 out of 329.
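The fit described in this caption can be sketched numerically. The sketch below assumes a maximum-likelihood fit of a zero-truncated Poisson to the varied codons. The text does not state the fitting procedure, so that choice, and the name `fit_truncated_poisson`, are assumptions made for illustration. Under that assumption the estimates land close to the tabulated 16.7 variable-but-unvaried codons and 46.7 variable codons; the ± 5.4 uncertainty is not reproduced here.

```python
from math import exp

# Trunk data from the table: number of codons observed with k >= 1 changes.
OBSERVED = {1: 19, 2: 5, 3: 5, 4: 1}
UNVARIED_OBSERVED = 299  # codons with zero changes on the trunk


def fit_truncated_poisson(observed):
    """Return (lam, n_variable) for a zero-truncated Poisson fit.

    Assumption: maximum likelihood, where lam solves
    lam / (1 - exp(-lam)) = mean number of changes among varied codons.
    """
    n_varied = sum(observed.values())
    mean = sum(k * count for k, count in observed.items()) / n_varied
    lo, hi = 1e-9, 50.0
    for _ in range(200):  # bisection; the left-hand side increases with lam
        mid = (lo + hi) / 2
        if mid / (1 - exp(-mid)) < mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    # Varied codons are the fraction (1 - e^-lam) of all variable codons.
    n_variable = n_varied / (1 - exp(-lam))
    return lam, n_variable


lam, n_variable = fit_truncated_poisson(OBSERVED)
expected_unvaried = n_variable * exp(-lam)          # variable but unvaried
invariable = UNVARIED_OBSERVED - expected_unvaried  # remaining zero class
print(f"variable codons: {n_variable:.1f}, invariable codons: {invariable:.1f}")
```

The same function could be applied to the twig or tip counts once those distributions are tabulated in the same form.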
We asked whether these additional changes are specific to the host substrate used for culture. Accordingly, Table 3 shows the distribution of those changes according to the host in which the virus was propagated and the expected distribution if there were no differences among the amplification methods with respect to their adding extra changes into the tip branches. It can be seen that there is no difference in the number of changes among the tip branches as a function of the host. Replacement Substitutions by Codon. One may ask whether the replacement substitutions have a random (Poisson) distribution over the codons of the HA gene. Table 4 shows how the changes are actually distributed. The first row shows the number of changes per codon. Values below show the number of codons in which that many changes occurred. There are six hypervariable codons (138, 145, 156, 186, 193, and 226), all of which have been observed to have changed during growth in eggs (5, 13–15), and which have evolved more than seven times faster than the other varied codons. Their removal does not allow one to get a good fit to the remaining data, even if we introduce an invariable class. We therefore divided the changes according to whether they occurred on the trunk, twig, or tip branches. Table 5 shows the fit to the trunk data. It is a good fit and implies that there are only 46.7 ± 5.4 variable codons. This is a very small number given that there are 329 codons altogether. A fit to the twig data is even better but only if the hypervariable positions are removed from the data. The result is that, for the twigs, there are 90.0 ± 7.2 variable codons. A similar calculation for the tip branches does not produce a good fit although with pooling of categories one can get the fit increased to 0.05 probability. Nevertheless, since the difficulty of fitting is in the right end of the distribution, the estimate of the number of invariable positions is not greatly affected. 
The result is that we estimate that the tip branches have 175.6 ± 8.4 variable positions. The overall result is that the number of variable positions in the three classes of branches are all significantly different from each other. One can also estimate the number of positions that are variable in the tip branches by noting that 87 positions have changed on them for strains grown in eggs and 96 for strains grown in kidney cells. Of these positions, 51 have changed in both egg- and cell-grown strains. If we assume that both growth conditions have the same variable set of codons and every
Distribution of replacement substitutions over the codons
Changes    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22
Codons   174  63  21  27   9   8   6   5   3   4   2   1   -   -   -   -   -   -   1   1   1   -   3
The bottom row shows the number of codons that have changed as many times as the number above it.
Colloquium Paper: Fitch et al.
Table 6. The intercontinental movement of human influenza virus
Weight for Asia       From Asia             To Asia
Outward   Inward      Observed   Expected   Observed   Expected   P
0.999     1.001       60         63.8       8          12.3       0.18
1.000     1.000       41.5       47.3       18.5       20.1       0.36
1.001     0.999       26         32.4       28         28.3       0.25
position is equally likely to have changed regardless of the growth conditions, then the estimate of the number of variable positions required to see those observations is 87 × 96/51 = 163.8. This is not significantly different from the 175.6 ± 8.4 observed above. Intercontinental Spread. By analyzing a sequence of a single letter for each isolate, where that letter represents the continent from which the isolate was obtained, and running PAUP on it, we can obtain the fewest possible intercontinental movements. That number, 41.5 changes of continent, implies that 1 of every 12 branches connects strains on two different continents. To obtain the expected number of changes, we scrambled the letters with respect to their assignment to the tips of the tree 1,000 times and found the average change for intercontinental movement given this particular tree and this particular number of representatives from each continent. The result (Table 6) was that there was no difference from random. We also tried a weight matrix for the intercontinental changes, such that the weight from Asia was 0.999 while that to Asia was 1.001, which biases the result toward movement from Asia. The reverse weighting biases the result toward Asia. The results, also given in Table 6, show that, regardless of the weighting scheme, the intercontinental spread of human influenza is what would be expected by random movements. It also shows that the possible range of equally parsimonious movements out of Asia could range from 26 to 60. That there should be so many alternative ways of assigning change is a result of the failure of the maximum parsimony assumption that change should be rare. These results do not contradict the prevailing view that novel pandemic and epidemic strains often emerge in Asia (see Discussion). Positive Selection. 
In examining the nature of the changes in the six hypervariable positions, we were struck by a preponderance of replacement substitutions over silent changes and decided to look for all cases where there was more than might be expected. Table 7 presents a list of the 31 positions that had probabilities for their distribution of non-silent/silent changes of less than 0.05. Note that the probability that a change is non-silent is p = 548/1260 = 0.435 and that it is silent is q = 712/1260 = 0.565. Thus the probability of seeing codon 226 with 22 non-silent changes and 3 silent changes is 25!p²²q³/(22!3!) = 5 × 10⁻⁶. Because one cannot get a probability of less than 0.05 with a sample size of three changes, we examined only codons with at least four changes, of which there were 99. Thus 31 occurrences with a probability of less than 0.05 in a sample of only 99 is greatly unexpected (P < 10⁻³³). Moreover, the probability that the 31 changes are divided 25 and 6 into the two halves of the distribution is only 7 × 10⁻⁴. Thus there is a preponderance of the excess number of improbable changes in the category of more non-silent than silent changes. This significant excess of non-silent changes means that there is positive Darwinian evolution occurring and probably at the positions listed in Table 7. The six codons with the most significant excess are the six hypervariable positions.
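The binomial calculation quoted above for codon 226 can be checked with a few lines of Python. This is a minimal sketch using the counts given in the text (548 non-silent and 712 silent changes); the function name `split_probability` is an illustrative choice. It computes the point probability of one particular non-silent/silent split, not a cumulative tail.

```python
from math import comb

NON_SILENT_TOTAL = 548
SILENT_TOTAL = 712

# Probability that any single change is non-silent (p) or silent (q).
p = NON_SILENT_TOTAL / (NON_SILENT_TOTAL + SILENT_TOTAL)
q = 1 - p


def split_probability(n_nonsilent, n_silent):
    """Binomial point probability of a given non-silent/silent split."""
    n = n_nonsilent + n_silent
    return comb(n, n_nonsilent) * p**n_nonsilent * q**n_silent


# Codon 226: 22 non-silent and 3 silent changes.
print(f"p = {p:.3f}, q = {q:.3f}")
print(f"P(22/3) = {split_probability(22, 3):.1e}")

# With only three changes, even the most extreme split stays above 0.05,
# which is why only codons with at least four changes were examined.
print(f"P(3/0) = {split_probability(3, 0):.3f}, P(4/0) = {split_probability(4, 0):.3f}")
```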
DISCUSSION Rate of Evolution. The data in the plot in Fig. 2 show a clear break at 1992 with a rate of evolution that is 2.3 times greater
Table 7. Probabilities of changes for different silent and non-silent distributions
Dist.    Prob.       Pos’n
4/0      0.036       (3)
5/0      0.016       (2)
6/0      0.007       (2)
6/1      0.027       (2)
7/0      0.003       194, 276
7/2      0.034       (1)
8/0      0.0013      137
8/2      0.018       (1)
9/0      0.00056     190
9/1      0.003       196
10/0     0.00024     133
10/1     0.001       121
11/1     0.0007      135
18/0     3 × 10⁻⁷    145
19/3     4 × 10⁻⁵    186
20/1     7 × 10⁻⁷    193
22/0     1 × 10⁻⁸    138
22/2     1 × 10⁻⁶    156
22/3     5 × 10⁻⁶    226
Total                (25)

Dist.    Prob.       Pos’n
0/6      0.036       (2)
0/7      0.018       (2)
0/8      0.01        (1)
0/9      0.006       (1)
Total                (6)
The Dist. column lists the distribution (non-silent/silent) of the changes at any position for which the probability of that distribution, or an even less probable distribution, is <0.05. There are 31 of them in a total sample of 99 positions examined. Those positions were all that had at least four changes because it takes at least four changes to have a probability of <0.05. The Prob. column shows the probability of the distribution on the left. The Pos’n column shows which codon(s) has (have) the distribution shown on the left if that probability is less than 0.005, otherwise the number of such codons is shown in parentheses. Note that for six changes, the distribution 5/1 (probability = 0.044) is omitted from the table because the less probable 6/0 distribution (probability = 0.007), added to it, would give a total probability of 5/1 or better = 0.051, a value above the cutoff.
after 1992 than it was before. We think this is statistically significant but probably largely artifactual. It has been known for some time that there is a bias in parsimony reconstructions that arises from the fact that any given nucleotide position can be observed to change at most once along any given branch (16). The result is that the more densely the branches of a tree are sampled, the more of these hidden changes that are uncovered. The break in the curve occurs at 1992, precisely the year that separates the region of the tree where fewer than 10 isolates per year were sampled from the time where 40 isolates per year were sampled. To see if this sampling could be the reason for this apparent increase in rate, we repeated the tree several times using only 11 randomly selected isolates from each of the years 1993–1996. The result was that the break largely disappeared although the appearance of an increase was attenuated more or less according to the number of isolates sampled from the group on the far upper right of Fig. 1. We conclude that there is little evidence to support an increase in the rate of HA evolution after 1992. But if the intensity of sampling accounts for the apparent change in rate, then the rate in that portion of the tree, 16 × 10⁻³ replacement substitutions/codon per year, is the better estimate of the real rate. Extra Replacement Substitutions on the Tip Branches. There are several possible explanations for there being extra replacement substitutions on the tips. The first explanation is as an artifact of the parsimony procedure. It may be, for a given tree, that there is more than one equally parsimonious way of placing the mutations upon the tree and these alternatives may cause more or less of the changes to occur on the tips. PAUP has a choice of two procedures, ACCTRAN and DELTRAN, that bias the placement away from or toward the tips, respectively. We used only the ACCTRAN procedure, so that the bias is against
the accumulation in tips. Thus the parsimony procedure cannot be the source of extra changes. The bias is not large because using DELTRAN produced only 10 more mutations in the tips. The second explanation arises from the observation that mutations appear to arise during passage of the virus in embryonated eggs as seen by changes in egg-grown sequences not seen in cell-grown sequences (5, 13–15). There are two different mechanisms that would account for this based upon the plausible assumption that there are some changes that are selected for by growth in eggs. One mechanism is that mutations occur in the RNA in the embryonated egg that the egg then selects. The other mechanism is that the egg selects pre-existing variants that were there in the patient but in too small amounts to be detected in the absence of the selective ability of the egg. We have no direct test that would distinguish between these two mechanisms. We can, however, ask if we see more replacement substitutions in the egg-grown isolates than in those from other methods of amplification. For the null hypothesis we use the expectation that there is no difference among the amplification methods and we get the result in Table 3. There is no significant difference among the methods of amplification. Although none of these groups are significantly different from expectation, the egg-grown isolates are the only group that has fewer than the expected numbers of changes on them, and the group with the greatest proportional excess is, of all groups, the PCR-amplified isolates. If, as has been suggested, the egg-grown sequences should have an extra change by reason of growth in eggs, we should surely have seen it. The tips are getting 0.8 extra mutations above the baseline of only 0.66 changes/non-tip branch. Thus the 212 excess mutations, if largely from the 126 egg-grown isolates, would have shown up readily. 
It cannot be that some special process in egg growth is responsible for the extra changes in the tip branches. In the one case where there are two sequences from the same isolate, the sequence grown in cells has two changes, whereas the sequence grown in eggs has only one relative to their common ancestral sequence. In any event, no explanation that requires one to distinguish among the methods of growth will account for the tip excess. We are aware that these results apparently contradict considerable evidence demonstrating that when specimens from the same patient are grown both in eggs and in cells, the HA genes of egg-grown isolates commonly have additional changes when compared with the HA genes of the corresponding cell-grown isolates (5, 13–15). Those data show an approximate average of one replacement in the egg-grown material that is not present in the cell-grown material, suggesting that tip branches to egg-grown isolates should, on average, be farther removed from their closest ancestor than are cell-grown isolates. Our isolates are, on average, 1.5 replacement substitutions different from their closest ancestor. This value does not differ between egg-grown isolates and cell-grown isolates. Perhaps there are experimental conditions that differ between these two sets of data. We do not know the source of these differences but the question appears important enough to warrant further study to resolve this conflict. A third explanation is bias in the choice of strains to sequence. All viruses that appear to be new variants as determined by the standard serological analyses performed at Centers for Disease Control and Prevention are chosen for HA sequence analysis. A few representative strains are also chosen, but the emphasis is on sequencing variants. This should make these sequences, on average, have a greater number of differences than sequences chosen at random. This bias should apply specifically to the tip branches and hence differentiate them alone. 
We do not know the extent to which this bias accounts for the observation but we know of no reason why it could not be the complete reason.
A fourth explanation is the presence of deleterious mutations in the population. If most changes are deleterious, most changes will eventually be removed by selection but they may well be present in our samples. They, like those that might occur during egg passage, will not end up in the trunks and twigs determined years later (17). The result of that process would be longer tips. Like the third explanation, we do not know how much of the effect seen is due to this mechanism, but it too might account for the whole effect. Nor can we partition the excess between the two explanations. Replacement Substitutions by Codon. The analysis showed that the trunk has only a small number of variable codons (47, of which 17 have not changed). This is only 18% of the codons available. If the trunk represents the victorious lineage in the race to outrun the immune system, then one area in which to focus future study would be the 30 sites that have changed in the trunk, for it is among these that successful changes in HA must have occurred. The tip branches have 175 variable codons. Variable codons, for data from widely divergent species, mean sites that have successfully replaced amino acids since some common ancestor. These sites would only rarely have unselected deleterious changes among their differences. But at the population level, that is not true. The tip branches do have extra mutations on them and, whether they were induced by growth in eggs or kidney cells, or chosen by the investigators, or are simply the sampling of population variants, those extra mutations may all be deleterious. While one normally thinks of variable codons being those that have at least one alternative amino acid that is not deleterious, in this case that is not necessarily so because we could be seeing alternatives before selection has filtered them out. Thus there is nothing wrong with there being extra changes on the tip branches nor that there are many more variable codons for them. 
It is reasonable to expect more variable codons if we can include deleterious mutations that selection will remove. Intercontinental Spread. It has been shown that the viruses causing pandemics, and even the year-to-year epidemics, emerge from Asia (18, 19). The fact that we observed that the intercontinental spread is random should not be thought of as arguing against that belief. There is no necessity that the relatively rare new successful variants come from a source that sends out an abnormally smaller or greater number of strains than a chance distribution would predict. Positive Selection. We have observed 25 codons that have significantly more non-silent changes than silent changes. Because silent changes might be expected to reflect the neutral rate of evolution, an excess can be construed as indicating positive Darwinian evolution. This is usually done by counting a pool of codons to get a statistically large enough sample to get significance. In this case we are looking at single codons but getting the large enough sample size by using those codons with at least four changes. By focusing on those codons one at a time, we are thus able to define individual positions as each having positive selection occurring in them. In this case, 14 positions have less than 0.005 probability of having the excess of non-silent changes as shown in Table 7. These 14 positions have accumulated 194 mutations or 35.4% of all the replacement substitutions observed, suggesting that positive Darwinian selection accounts for a large portion of all the changes observed. In addition, the rate of change is 10% faster on the trunk than on the twigs (see Table 2). All but three of these positions (137, 138, and 196) have changed along the trunk and thus we have another reason to regard these positions as important in the changes made by the virus to evade the immune system. Position 138 is interesting because it has had 22 non-silent changes, none of which occurred on the trunk. 
This is perhaps because favorable mutations may still be selected for in any lineage that has not yet died out. Future Studies. It is possible, with the identification of only a few positions that have changed on the trunk, and with the
identification of positions that are under positive Darwinian selection, that we now have most of the residues on which the virus depends for immune avoidance. The next step might be to see which of these position changes cause differences in the HA inhibition assays and to quantify them. This might then lead to knowing, simply from sequence changes, which isolates are most likely to cause future epidemics. We gratefully acknowledge the technical expertise contributed by Huang Jing and Donna Sasso and sequences contributed by Dr. Setsuto Nakajima.
1. Cox, N. J. & Bender, C. A. (1995) Semin. Virol. 6, 359–370.
2. Cox, N. J., Bai, Z. S. & Kendal, A. P. (1983) Bull. WHO 61, 143–152.
3. Cox, N. J., Kitame, F., Kendal, A. P., Maassab, H. F. & Naeve, C. (1986) Virology 167, 554–567.
4. Rocha, E., Cox, N. J., Black, R. A., Harmon, M. W., Harrison, C. J. & Kendal, A. P. (1991) J. Virol. 65, 2340–2350.
5. Rocha, E. P., Xu, X., Hall, H. E., Allen, J. R., Regnery, H. L. & Cox, N. (1993) J. Gen. Virol. 74, 2513–2518.
6. Xu, X., Rocha, E. P., Regnery, H. L., Kendal, A. P. & Cox, N. J. (1993) Virus Res. 28, 37–55.
7. Nakajima, J., Nakajima, K. & Kendal, A. P. (1983) Virology 131, 116–127.
8. Fitch, W. M. (1971) Syst. Zool. 20, 406–416.
9. Fitch, W. M. & Markowitz, E. (1970) Biochem. Genet. 4, 579–593.
10. Hayashida, H., Toh, H., Kikuno, R. & Miyata, T. (1985) Mol. Biol. Evol. 2, 289–303.
11. Fitch, W. M., Leiter, J. M. E., Li, X. & Palese, P. (1991) Proc. Natl. Acad. Sci. USA 88, 4270–4274.
12. Fitch, W. M. (1973) J. Mol. Evol. 2, 123–136.
13. Hardy, C. T., Young, S. A., Webster, R. G., Naeve, C. J. & Owens, R. J. (1995) Virology 211, 302–326.
14. Gubareva, L. V., Wood, J. M., Meyer, W. J., Katz, J. M., Robertson, J. J., Major, D. & Webster, R. G. (1994) Virology 199, 89–97.
15. Meyer, W. J., Wood, J. M., Major, D., Robertson, J. S., Webster, R. G. & Katz, J. M. (1993) Virology 196, 130–137.
16. Fitch, W. M. & Bruschi, M. (1987) Mol. Biol. Evol. 4, 381–394.
17. Golding, G. B., Aquadro, C. F. & Langley, C. H. (1986) Proc. Natl. Acad. Sci. USA 83, 427–431.
18. Webster, R. G., Bean, W. H., Gorman, O. T., Chambers, T. & Kawaoka, Y. (1992) Microbiol. Rev. 56, 152–179.
19. Cox, N. J., Brammer, T. L. & Regnery, H. L. (1994) Eur. J. Epidemiol. 10, 467–470.
Proc. Natl. Acad. Sci. USA Vol. 94, pp. 7719–7724, July 1997 Colloquium Paper
This paper was presented at a colloquium entitled ‘‘Genetics and the Origin of Species,’’ organized by Francisco J. Ayala (Co-chair) and Walter M. Fitch (Co-chair), held January 30–February 1, 1997, at the National Academy of Sciences Beckman Center in Irvine, CA.
Genes, peoples, and languages L. LUCA CAVALLI-SFORZA Department of Genetics, School of Medicine, Stanford University, Stanford, CA 94305-5120
with DNA markers, is to add an external group (an ‘‘outgroup’’), practically chimpanzees. Table 1 shows a matrix of genetic distances among continents based on six times as many markers (2). The type of genetic distances used — of which there exist a great many — is usually of little importance. But for a tree representation to be acceptable, the evolutionary hypothesis used for drawing the tree must be correct. The simplest hypothesis is that the evolutionary rate is the same across all branches of the tree, and the evolution is independent in all branches [i.e., there are no (important) genetic exchanges among them or similar conditions creating correlations among branches after their origin]. This can be tested on the matrix, since on the basis of this simple hypothesis the distances should be the same, apart from statistical error, in each column (3). There is one important exception to the rule in Table 1, namely that in the first column of the matrix Europe shows a shorter distance from Africa than do all the other continents. The difference is statistically significant and is consistently found with all markers, ranging from ‘‘classical’’ ones based on gene products [blood groups and protein polymorphisms (1)] to DNA markers such as restriction polymorphisms (4) and microsatellites (5). For incompletely understood reasons, discussed later, mtDNA trees of non-African populations are not as informative as desired. This exception to good ‘‘treeness’’ of the data (3) is most probably responsible for the difference of results using two classes of methods for fitting trees. One of them, unweighted pair–group method with arithmetic mean, is made popular by its practical convenience and by the similarity of its results with those of the statistically most satisfactory method, maximum likelihood, on the assumption of constant evolutionary rates. The tree is shown in Fig. 1a near that obtained with another method most popular these days, neighbor joining (Fig. 1b). 
The most important difference is in the position of Europe, which with neighbor joining branches out first after the splitting of Africans and non-Africans and with maximum likelihood is the last but one. What we know of the occupation of different continents (1) shows that West Asia was first settled around 100,000 years ago, although perhaps not permanently. Oceania was occupied first from Africa, more or less at the same time as East Asia (both probably having been settled by the coastal route of South Asia), and then from East Asia both Europe and America were settled, the latter certainly from the north, via the Bering Strait (then a wide land passage). The dates are approximately known, and the genetic distances corresponding to the splits in the unweighted pair–group method with arithmetic mean tree (or approximately the averages of appropriate columns and other entries in Table 2; see also ref. 1) are in reasonable agreement with them. This is indicated by the approximate constancy of the ratios D/T (genetic distance/ time of first settlement) in Table 2. There is a marked uncertainty in the time of occupation of the Americas, and genetic data suggest the earlier dates are correct. But if very
ABSTRACT The genetic history of a group of populations is usually analyzed by reconstructing a tree of their origins. Reliability of the reconstruction depends on the validity of the hypothesis that genetic differentiation of the populations is mostly due to population fissions followed by independent evolution. If necessary, adjustment for major population admixtures can be made. Dating the fissions requires comparisons with paleoanthropological and paleontological dates, which are few and uncertain. A method of absolute genetic dating recently introduced uses mutation rates as molecular clocks; it was applied to human evolution using microsatellites, which have a sufficiently high mutation rate. Results are comparable with those of other methods and agree with a recent expansion of modern humans from Africa. An alternative method of analysis, useful when there is adequate geographic coverage of regions, is the geographic study of frequencies of alleles or haplotypes. As in the case of trees, it is necessary to summarize data from many loci for conclusions to be acceptable. Results must be independent from the loci used. Multivariate analyses like principal components or multidimensional scaling reveal a number of hidden patterns and evaluate their relative importance. Most patterns found in the analysis of human living populations are likely to be consequences of demographic expansions, determined by technological developments affecting food availability, transportation, or military power. During such expansions, both genes and languages are spread to potentially vast areas. In principle, this tends to create a correlation between the respective evolutionary trees. The correlation is usually positive and often remarkably high. It can be decreased or hidden by phenomena of language replacement and also of gene replacement, usually partial, due to gene flow. 
The first book of population genetics I read was Genetics and the Origin of Species, by Theodosius Dobzhansky, and it was basic for my understanding of the subject. I later had the chance of knowing Dobzhansky personally and sharing results of my early, relevant research with him. He greatly encouraged me to continue this line of work, and I am happy to share in this opportunity to honor his fundamental contributions. The first tree of evolution based on gene frequencies of living humans was published 34 years ago. It was based on genetic distances among 15 populations, 3 per continent, calculated from 5 blood group systems, with a total of 20 alleles (1). The number of genes used was admittedly small, but it was practically impossible to get more information at that time. The only major correction of that early tree that became necessary later was to change its root. This was not too surprising, since locating the root is notoriously the most difficult problem. The standard solution today, usually possible
© 1997 by The National Academy of Sciences 0027-8424/97/947719-6$2.00/0 PNAS is available online at http://www.pnas.org.
Colloquium Paper: Cavalli-Sforza
Table 1. Genetic distances among major continents or continental areas, based on 120 classical polymorphisms
             Africa   Oceania   East Asia   Europe
Oceania       24.7
East Asia     20.6     10.0
Europe        16.6     13.5       9.7
America       22.6     14.6       8.9         9.5
Information for this table was adapted from refs. 1 and 2.
small groups of people were responsible for the initial settlement, as suggested also by other considerations, genetic drift may have been especially strong and the time of settlement, calculated from genetic distances, will be in excess. One reasonable hypothesis is that the genetic distance between Europe and Africa is shorter than that between Africa and the other continents in Table 1 because both Africans and Asians contributed to the settlement of Europe, which began about 40,000 years ago. It seems very reasonable to assume that both continents nearest to Europe contributed to its settlement, even if perhaps at different times and maybe repeatedly. It is reassuring that the analysis of other markers also consistently gives the same results in this case. Moreover, a specific evolutionary model tested, i.e., that Europe is formed by contributions from Asia and Africa, fits the distance matrix perfectly (6). In this simplified model, the migrations postulated to have populated Europe are estimated to have occurred at an early date (30,000 years ago), but it is impossible to distinguish, on the basis of these data, this model from that of several migrations at different times. The overall contributions from Asia and Africa were estimated to be around two-thirds and one-third, respectively. Simulations have shown
FIG. 1. (a) Tree derived from data in Table 1 by unweighted pair–group method with arithmetic mean. (b) Tree from same data by neighbor joining. Note the difference in the location of the branch leading to Europe.
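To illustrate how a tree such as that of Fig. 1a follows from the distance matrix, the following is a minimal sketch of the unweighted pair-group method with arithmetic mean applied to the Table 1 values. The dictionary layout and the function name `upgma` are illustrative choices; this is not the software used to draw the published figure, and it reports only the order and level of the merges, not branch lengths.

```python
from itertools import combinations

# Genetic distances from Table 1, keyed by pair of continental areas.
TABLE1 = {
    ("Africa", "Oceania"): 24.7, ("Africa", "East Asia"): 20.6,
    ("Africa", "Europe"): 16.6, ("Africa", "America"): 22.6,
    ("Oceania", "East Asia"): 10.0, ("Oceania", "Europe"): 13.5,
    ("Oceania", "America"): 14.6, ("East Asia", "Europe"): 9.7,
    ("East Asia", "America"): 8.9, ("Europe", "America"): 9.5,
}


def pair_key(a, b):
    """Order-independent key for a pair of clusters."""
    return frozenset((a, b))


def upgma(distances):
    """Average-linkage clustering.

    Returns a list of merges as (members_a, members_b, distance),
    in the order in which the clusters are joined.
    """
    # Start with one singleton cluster per population.
    clusters = {frozenset([name]) for pair in distances for name in pair}
    d = {pair_key(frozenset([a]), frozenset([b])): v
         for (a, b), v in distances.items()}
    merges = []
    while len(clusters) > 1:
        # Join the two clusters separated by the smallest distance.
        a, b = min(combinations(clusters, 2), key=lambda p: d[pair_key(*p)])
        merged = a | b
        merges.append((sorted(a), sorted(b), d[pair_key(a, b)]))
        clusters -= {a, b}
        for c in clusters:
            # Mean over all member pairs = size-weighted mean of old distances.
            d[pair_key(merged, c)] = (
                len(a) * d[pair_key(a, c)] + len(b) * d[pair_key(b, c)]
            ) / len(merged)
        clusters.add(merged)
    return merges


for left, right, dist in upgma(TABLE1):
    print(f"{' + '.join(left)}  |  {' + '.join(right)}  at {dist:.1f}")
```

With these distances Africa is the last group to join, and Europe joins the East Asia and America cluster before Oceania does, which corresponds to the position of Europe described for this class of methods.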
Table 2. Genetic distances averaged from Table 1 corresponding to the nodes of the tree of Fig. 1a and therefore to the settlement of continents from Africa, and the probable time of occurrence of these settlements on the basis of archeological data
Settlement of                   D       T        D/T
West Asia from Africa           21.1    100      0.21
Oceania from Southeast Asia     12.9    55       0.23
Europe from Asia                9.6     40       0.24
America from Asia               8.9     15–40    0.59–0.22
The ratio D/T is expected to be constant if evolutionary rates are constant. D, genetic distance; T, time in thousands of years past.
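The D/T column can be recomputed from the D and T columns as a quick consistency check. The sketch below is only an arithmetic illustration; the list layout and the function name `d_over_t` are assumptions made for the example, and a date range is represented as a (low, high) pair.

```python
# (settlement, D, T range in thousands of years) taken from Table 2.
TABLE2 = [
    ("West Asia from Africa", 21.1, (100, 100)),
    ("Oceania from Southeast Asia", 12.9, (55, 55)),
    ("Europe from Asia", 9.6, (40, 40)),
    ("America from Asia", 8.9, (15, 40)),  # uncertain date of settlement
]


def d_over_t(d, t_range):
    """Return D/T at the earliest-dated and latest-dated ends of the range."""
    low, high = t_range
    return d / low, d / high


for name, d, t_range in TABLE2:
    at_low, at_high = d_over_t(d, t_range)
    if t_range[0] == t_range[1]:
        print(f"{name}: D/T = {at_low:.2f}")
    else:
        print(f"{name}: D/T = {at_low:.2f} to {at_high:.2f}")
```

For America, only the ratio computed with the older date (40 thousand years) falls in line with the other three rows.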
(7) that this hypothesis explains quite well the discrepancy between trees obtained by maximum likelihood and neighbor joining.
Genetic Dating of Population Separations
All molecular dating methods used thus far depend on the use of dates from paleontology, and the above results are no exception. These dates are unfortunately subject to modification as new results accumulate. Moreover, the statistical error affecting the dates calculated on the basis of available genetic results is high. One of the first dates for the first branching in the evolution of modern humans, the separation of Africans and non-Africans, was estimated from mtDNA at 190,000 years with a large error interval, not well ascertained statistically (8). The result was heavily criticized (e.g., ref. 9). This ball park estimate, however, was confirmed by an independent, more detailed assessment (143,000 ± 18,000) based on the full sequencing of the mtDNA of three individuals (10). It should be noted that estimates such as those obtained with mtDNA, based on the first time of occurrence of mutations, are in excess by an unknown amount with respect to the time of division of populations, e.g., of separation from the mother colony of a party of migrants (11). The difference is difficult to estimate, especially in the absence of knowledge of the migration pattern at the time of early colonization of continents. An alternative method does not depend on external reference times. Provided that the mutation rate of genes is known, it is possible to estimate the time of separation of two populations given the total genetic difference accumulated between them. This is especially easy for microsatellites, because the square of the average difference in number of repeats between two populations is equal to twice the mutation rate times the time of separation of the populations (with generation as a time unit). If the populations are at equilibrium for drift, the result is independent of drift (12). 
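The relation just stated, squared repeat difference = 2 × mutation rate × time in generations, can be turned into a small calculator. This is a sketch under the simple model described in the text; the defaults use the figures the paper quotes for dinucleotide microsatellites (a mutation rate of 1/2,000 and a generation time of about 25 years), while the function name `separation_years` and the input value 4.0 are hypothetical and chosen only for illustration.

```python
MUTATION_RATE = 1 / 2000   # per generation, as quoted in the text
GENERATION_YEARS = 25      # approximate generation time, as quoted in the text


def separation_years(squared_repeat_difference,
                     mutation_rate=MUTATION_RATE,
                     generation_years=GENERATION_YEARS):
    """Time since two populations separated, in years.

    Solves squared_repeat_difference = 2 * mutation_rate * generations
    for the number of generations, then converts to years.
    """
    generations = squared_repeat_difference / (2 * mutation_rate)
    return generations * generation_years


# Hypothetical input: a squared average repeat difference of 4.0.
print(f"{separation_years(4.0):,.0f} years")
```

The estimate scales inversely with the assumed mutation rate, so any error in that rate carries over proportionally to the date.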
The squaring of the difference of the number of repeats is easily understood, considering that the model used assumes a random walk at a constant rate, with an equally probable increase or decrease of one repeat at every mutation. In a random walk, the average displacement is proportional to the square root of time. The mutation rate for dinucleotide microsatellites in vivo has been estimated at 1/2,000 per generation (12), and therefore with a generation time of around 25 years there is one mutation in each branch every 50,000 years. Higher mutation rates might be even more satisfactory for generating accurate estimates. The microsatellite mutation rate method might need correction if mutation rates are sensitive to environmental conditions, if some mutations were responsible for the increase of more than one repeat at a time, if the mutation process were not symmetric, and in other ways. Research is ongoing to test the effect of these conditions. The method could be employed also for single nucleotide polymorphisms, but only if their mutation rate was much higher than is ordinarily the case. It might be sufficiently high
in the case of some fingerprints with an extremely high number of alleles. The first estimate gave a separation time of the first migrants out of Africa of 146,000 years ago, very close to the date obtained with the mtDNA full sequence. This was based on results with 30 microsatellites (5). More recent results (L. Jin, unpublished work) with 100 microsatellites gave an earlier date. The accuracy of mutation rate estimates and the full understanding of the mutation process will be essential for completely satisfactory accuracy of the dates obtained by this method. More work will be necessary to validate these results, but the ‘‘absolute’’ nature of the dating method is a basic advantage. It is reassuring that the dates of settlement of the various continents thus obtained tend to agree with predictions based on archaeological observations (12).

Geographic Versus Historic Analysis

Tree analysis is an attempt at reconstructing history of population movements and separations. Its success depends on the choice of populations and markers. In principle, populations of approximately similar size are better suited to analysis. It is essential that the number of markers be large and that results be independent of the markers used. Even under best conditions, however, tree analysis cannot go very far in understanding the genetic factors behind the evolutionary processes. A geographic approach is an important alternative. When applied to a single gene or allele, it favors the study of the places of origin of mutations, the possibility of their repeated occurrence, and the nature of the selective factors involved in their spread, if any. But drift and migration can be best traced by the joint study of many genes, and the shape of trees is mostly directed by these two evolutionary factors. A method that proved especially useful is a geographic study by principal components (PCs) or related techniques (1, 13).
It partitions the total variation into independent, additive components, ordered by their relative importance in determining the total variation observed. As for trees, many genes are necessary, and observations must be spread as regularly as possible over the area being analyzed; as for trees, the best check of the validity of the conclusions is their independence from the markers employed: that is, their reproducibility with different sets of markers. Applications to the various continents have detected many different hidden patterns, each of which seems to have a precise historical or prehistorical explanation. Thus, in Europe, the most important hidden pattern (the first PC) has an extremely high correlation with the history of the spread of agriculture from the Middle East in the period 10,000–6,000 years B.P. (Fig. 2). Other lesser hidden patterns include: a migration to the north, probably across the northern Urals, of a population speaking a Uralic family language currently still spoken in Europe by Lapps, Finns, Hungarians, and some other populations; a migration from the region below the Urals and above the Caucasus to most of Europe, which was hypothesized by two different archaeologists to have carried Indo-European languages to Europe; the Greek expansion of the first millennium B.C. and earlier; the Basque speaking region in the western part of the Pyrenees. In general, this analysis has detected in almost every major region a variety of demic expansions, almost always due to some important technological development favoring the generation of new or more food, or improving transportation, or political power (14). It is of particular interest that, whereas all autosomal variation is in agreement with the spread of people from the Middle East toward Europe (and also in other directions), an analysis of the mtDNA variation has shown an essentially flat genetic surface, with a minor ripple in the Basque region (15). 
By contrast, two Y chromosome alleles showing great variation
in Europe have a geographic distribution in excellent agreement with the autosomal data (16). These observations have two possible, noncompeting explanations. It is already clear from other data that the Y chromosome variation shows geographic clustering much higher than mtDNA and probably higher than autosomes (17–19), so that the geographic distribution of Y chromosome variants is more highly focussed. This indicates that males are genetically less mobile than females, probably because at marriage they migrate a shorter distance on average than females. There are anthropological observations that marriage is mostly patrilocal or virilocal, also among hunter/gatherers; in addition, there is female ‘‘hypergamy,’’ i.e., females can marry into higher social classes, usually those of conquerors, where they enjoy a higher fertility. Another explanation is that, for reasons mostly not understood, variation among non-African populations for mtDNA is much lower than for African populations. Heteroplasmy of mtDNA might perhaps be high enough that mutants show a conspicuous segregation lag, so that all populations that expanded from Africa have not had the time to segregate most of the new mutants originated after their migration from Africa. Moreover, Europe has a genetic variation in general about three times less than that of other continents (1). All of these reasons make mtDNA variation in Europe especially small and practically undetectable in the conditions in which it was tested by Richards et al. (15). They may also contribute to the poor discrimination among all or most non-African populations observed in mtDNA trees (20).

Genetic and Linguistic Evolution

A tree of 42 world populations was reconstructed on the basis of some 110 genetic polymorphisms and compared with the incomplete, but nevertheless remarkable, knowledge of the similarities between the languages spoken by the corresponding aboriginal populations (21).
The linguistic classification used was largely derived from work by Greenberg and published by Ruhlen (22). Sixteen linguistic families were mapped. The correspondence between the genetic tree and the linguistic tree was remarkable, even if five disagreements were noted. The correlation thus found was statistically significant at a very high probability level with two independent methods (23, 24). Unfortunately, only the lowest branches of the linguistic tree are known. Many linguists do not accept similarities established between more divergent languages and the trees based on them. Even some of the lower branches and taxa established in ref. 22 are not accepted by some linguists, e.g., Greenberg’s three major American families. Differences in methodology account largely for these discrepancies. As discussed by Greenberg (25), distant linguistic relationships need special approaches. Fig. 3 shows the comparison of the genetic and the linguistic trees (21). We observed the following regularities: (i) There are fewer families in the linguistic tree than there are populations in the genetic tree and, therefore, there is on average more than one genetic population per linguistic family. It is usually true that the genetic similarity between populations belonging to the same linguistic family is high, as expressed by their having a common node in the genetic tree, with a low position in the tree hierarchy. This rule is violated only in a few cases in Fig. 3, and we will discuss especially three of them: Lapps, Ethiopians, and Tibetans. Lapps speak a Uralic family language but associate genetically with Indo-European-speaking populations. Ethiopians are genetically African and linguistically Afro-Asiatic, a language family spoken predominantly by Caucasoids. Tibetans are genetically northern Chinese, but linguistically they associate with the southern Chinese, who belong to another genetic node. It is easy to understand the origin of these exceptions.
Lapps probably migrated to northern Europe from a region east of
FIG. 3. Coherence between a genetic tree derived from 42 populations with 120 classical polymorphisms (Left) and what is known of the linguistic tree (Right), including two recently reconstructed superfamilies (shown at the extreme right). (From ref. 23.)
the Urals and spoke local languages, related to those of the Samoyeds. In contact with northern Europeans in northern Scandinavia they hybridized extensively with them. Having now more than 50% European genomes, on average, they associated with other Europeans in the genetic tree, but maintained their original language(s). The Ethiopian genotype is more than 50% African. It is difficult to say if they originated in Arabia and are therefore Caucasoids who, like Lapps, had substantial gene flow after they migrated to East Africa, or if they originated in Africa and had substantial gene flow from Arabia, but not enough to pass the 50% mark. We are not helped by knowledge of the origin of Afro-Asiatic languages, which are by far the most common ones spoken in Ethiopia but are also spoken in North Africa, Arabia, and the Middle East. It is known from historical records that Tibetans migrated from northern China to Tibet. Genetically they are associated with the northern Chinese (not shown in the tree of Fig. 3), Koreans, and Japanese (shown in the tree), but northern Chinese are genetically distinct from southern Chinese. Almost all Chinese today speak Sino-Tibetan languages, which were imposed on all of China at the time of its unification, beginning 2,200 years ago.

(ii) Some linguists have shown that a few of the families given in the tree associate in superfamilies, three of which are indicated in Fig. 3, on the right side. Two of them, Nostratic and Eurasiatic, are rather similar, having about one-half of the families forming them in common; their existence has been inferred by different authors who have used very different methods, and it seems reasonable to assume that the two superfamilies will eventually merge into a larger one. Another linguist has added to Nostratic the recently formed Amerind family (22). It is truly remarkable that the union of Nostratic plus Eurasiatic plus Amerind includes practically the whole major cluster of the genetic tree, which collects together Caucasoids, Northern Mongoloids, and Amerinds. Another superfamily present in Ruhlen’s classification, Austric, also joins populations that are very similar genetically.

At this point one may want to consider why these results, although superficially astonishing, are not unexpected. There are some common evolutionary factors to both linguistic and genetic evolution that are responsible for the observed congruence, and there are also good reasons for possible exceptions.

FIG. 2. Hidden patterns in the geography of Europe shown by the first five principal components, explaining respectively 28%, 22%, 11%, 7%, and 5% of the total genetic variation for 95 classical polymorphisms (1, 13, 14). The first component is almost superimposable to the archaeological dates of the spread of farming from the Middle East between 10,000 and 6,000 years ago. The second principal component parallels a probable spread of Uralic people and/or languages to the northeast of Europe. The third is very similar to the spread of pastoral nomads (and their successors) who domesticated the horse in the steppe towards the end of the farming expansion, and are believed by some archaeologists and linguists to have spread most Indo-European languages to Europe. The fourth is strongly reminiscent of Greek colonization in the first millennium B.C. The fifth corresponds to the progressive retreat of the boundary of the Basque language. Basques have retained, in addition to their language, believed to be descended from an original language spoken in Europe, some of their original genetic characteristics. (From ref. 1, with permission of Princeton University Press, modified.)
In the spread of modern humans, many groups underwent splits, the two moieties settling in different areas. When these were sufficiently remote that isolation of the splinters was complete, i.e., there was no later migratory exchange, the moieties underwent inevitably genetic differentiation; they also underwent inevitable linguistic differentiation. There was thus a parallelism established between the history of the two phenomena, and very probably both differentiations tended to increase with the time of separation, although at different rates and with different regularities. Even if the separation of two or more populations was not complete but there remained enough migration between them to reduce differentiation (genetic and/or linguistic), some divergence both at the genetic and at the linguistic level would certainly occur. We have evidence that both genetic distance and also linguistic distance are highly correlated with geographic distance. An increase of the latter decreases crossmigration and increases the rate of both genetic and linguistic differentiation. We thus expect both genetic and linguistic processes of differentiation to mirror the same basic historical sequence of events or to follow common geography. But inevitably there are reasons the parallelism cannot be perfect. Exceptions could arise in two different ways: language or gene replacement. (i) Language Replacement. Languages can be replaced entirely (or almost). There has not been a systematic study of this important historical phenomenon. Renfrew (26) has hypothesized three possible mechanisms, which can be reduced to two, pooling the second and third of those he proposed. I will use different names from Renfrew’s, which seem to me to be easier to understand: (a) Demic expansions, in which an area is occupied by a population increasing demographically at a relatively fast rate. 
This mechanism was called by Renfrew ‘‘demographic–subsistence models.’’ The area may have been initially uninhabited, as in the earliest migrations to New Guinea, Australia,
and the Americas, but in most other circumstances there were earlier settlers who usually spoke a different language. When the later settlers came in large numbers, the earlier ones were sometimes completely suppressed. Tasmania and the Caribbean are such cases. The suppression of Australian aboriginals was only partial, and so was that of American natives, although 95% of the original population of the Americas was killed through disease and war (27). In some other cases early settlers were able to survive without losing their language. Examples are Basques, Lapps, Eskimos, Khoisans. Here the expansions were connected with the development of early agriculture (i.e., for Europe, West Asia, and North Africa, from the Middle East and for Central and South Africa with the Bantu expansion). These prehistoric expansions tied to agriculture were probably more peaceful, and the newcomers usually outnumbered local aborigines, who were hunter/gatherers (some of them still are). Especially if later settlers were in large numbers, they brought their own genes and languages. But in almost all of these cases there was some degree of intermarriage between earlier and later settlers, and in the areas where some kind of this outbreeding occurred, there was likely to develop a discrepancy between language and average genotype. (b) Subjection of a tribe or nation, by conquest or by economic and social control. This includes Renfrew’s ‘‘elite dominance’’ and ‘‘power collapse’’ (26). In conquest by people with superior military power or skills there is usually no complete destruction of the subdued nations, but simply their submission and exploitation. After the development of agriculture, the earlier occupants are usually very numerous and retain a high majority after the invasion; genetically, there is then little change, except that the new masters reserve for themselves the positions of power and thus form the new aristocracy.
A new strong genetic stratification of social classes is thus generated. The overall gene pool change may be modest, but will depend on two factors: the relative demographic contributions of aborigines and newcomers to the population, and their relative growth rates, which may be unequal. The newcomers, especially if they are few and powerful, are likely to retain for themselves the best resources and have higher growth rates. Hypergamy and sex-differential migration, as discussed above, will complicate the final picture. The new masters are likely to impose their language and thus generate a local discrepancy between the genetic and linguistic pictures. This, however, does not always take place. Even the most extensive demic expansions or conquests were not always effective in totally eradicating all of the languages spoken locally. In general, some relic languages survive in some peripheral areas of their original distribution after expansions of people speaking other languages. There are examples in refuge areas that survived the spread of Indo-European languages to Europe (Basque), northern Pakistan (Burushaski), India (Dravidian languages), and the Caucasus (Caucasian languages); interestingly, they may all belong to a family (Eurasian, different from Eurasiatic) that was spread more than 20,000 years ago to the whole area of Europe, Asia, and America. It has been suggested that this superfamily spread to all of Eurasia at the time of the first occupation of Europe, 40,000 years ago (27). (ii) Gene Replacement. This can be determined by continued gene flow from neighbors. We have seen examples, e.g., Lapps, Ethiopians. There is one major difference between the two mechanisms: language replacement is mostly an all-or-none phenomenon, at least for a large part of the vocabulary and phonology, and almost without exception for structural rules. Gene replacement instead can be completely gradual.
A classic example of gene replacement is provided by Black Americans (not represented in the tree of Fig. 3, which includes only aboriginal people), who are well known to have, on average, a lighter skin color than Black Africans, their ancestors. This is especially true in the northern States. Genetic analysis shows that African Americans have on
average 30% of their gene pool from European (White American) genes (28). This partial replacement took place over about 300 years of contact, and it is calculated that, if it was constant in time, there must have been about 3% of mixed unions per generation. Laws assured that the child of mixed parentage would be considered Black. Only individuals with a very low proportion of Black ancestry (or of skin color) would be able to ‘‘pass’’ as White. With gene flow continuing at that same rate, only about 30% of the original gene constitution would remain on average after 1,000 years since the beginning, and about 9% after 2,000 years (1).

Gene and language replacement can to some extent blur the congruence expected between the two types of evolution, but not completely. The accumulation of further genetic and linguistic data will facilitate the study of the relationship between the two evolutions, making it easier to use the genetic tree for predicting the history of linguistic evolution. Charles Darwin had precisely anticipated this development in The Origin of Species, published in 1859. But the opposite can also happen, and we look forward to linguistic data for ideas about still undetected genetic relationships. Above all we need an increase in genetic data, which modern molecular techniques such as microsatellite analysis and chip hybridization make possible and unusually powerful. The generation of a world collection of stored DNAs for distribution to scientists is the aim of the Human Genome Diversity Project, the feasibility of which is currently being investigated by the National Research Council and by the National Science Foundation.
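The gene-replacement figures quoted above for African Americans (about 30% replacement over 300 years, about 3% per generation, and about 30% and 9% of the original gene pool remaining after 1,000 and 2,000 years) can be checked with a short Python sketch. It assumes a constant one-way gene flow per generation and a generation time of 25 years; the generation time and the function names are assumptions made here for illustration and are not stated in this passage.

```python
def fraction_remaining(rate_per_generation: float, generations: float) -> float:
    """Fraction of the original gene pool left after constant one-way gene flow."""
    return (1.0 - rate_per_generation) ** generations


def rate_from_replacement(replaced_fraction: float, generations: float) -> float:
    """Constant per-generation gene flow that yields the given total replacement."""
    return 1.0 - (1.0 - replaced_fraction) ** (1.0 / generations)


if __name__ == "__main__":
    generation_years = 25.0                      # assumed generation time
    gens_300 = 300 / generation_years            # about 12 generations of contact
    print("implied rate per generation:", rate_from_replacement(0.30, gens_300))
    for years in (1000, 2000):
        gens = years / generation_years
        print(years, "years: original fraction remaining =",
              fraction_remaining(0.03, gens))
```

Under these assumptions the implied rate is close to 3% per generation, and the remaining fractions come out near the 30% and 9% given in the text.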
1. Cavalli-Sforza, L. L., Menozzi, P. & Piazza, A. (1994) The History and Geography of Human Genes (Princeton Univ. Press, Princeton, NJ).
2. Cavalli-Sforza, L. L. (1996) Genes Peuples et Langues (Odile Jacob, Paris).
3. Cavalli-Sforza, L. L. & Piazza, A. (1975) Theor. Popul. Biol. 8, 127–165.
4. Mountain, J. L., Lin, A. A., Bowcock, A. M. & Cavalli-Sforza, L. L. (1992) Philos. Trans. R. Soc. London B 337, 159–165.
5. Bowcock, A. M., Ruiz-Linares, A., Tomfohrde, J., Minch, E., Kidd, J. R. & Cavalli-Sforza, L. L. (1994) Nature (London) 368, 455–457.
6. Bowcock, A. M., Kidd, J. R., Mountain, J. L., Hebert, J., Carotenuto, L., Kidd, K. K. & Cavalli-Sforza, L. L. (1991) Proc. Natl. Acad. Sci. USA 88, 839–843.
7. Ruiz-Linares, A., Minch, E., Meyer, D. & Cavalli-Sforza, L. L. (1993) The Origin and Past of Modern Humans as Viewed from DNA (World Sci., Singapore), pp. 123–148.
8. Cann, R. L., Stoneking, M. & Wilson, A. C. (1987) Nature (London) 325, 31–36.
9. Templeton, A. R. (1993) Am. Anthropol. 95, 51–72.
10. Horai, S., Hayasaka, K., Kondo, R., Tsugane, K. & Takahata, N. (1995) Proc. Natl. Acad. Sci. USA 92, 532–536.
11. Nei, M. (1987) Molecular Evolutionary Genetics (Columbia Univ. Press, New York).
12. Goldstein, D. B., Ruiz-Linares, A., Cavalli-Sforza, L. L. & Feldman, M. W. (1995) Proc. Natl. Acad. Sci. USA 92, 6723–6727.
13. Menozzi, P., Piazza, A. & Cavalli-Sforza, L. L. (1978) Science 201, 786–792.
14. Cavalli-Sforza, L. L., Menozzi, P. & Piazza, A. (1993) Science 259, 639–646.
15. Richards, M., Corte-Real, H., Forster, P., Macaulay, V., Wilkinson-Herbots, H., Demaine, A., Papiha, S., Hedges, R., Bandelt, H. J. & Sykes, B. (1996) Am. J. Hum. Genet. 59, 185–203.
16. Semino, O., Passarino, G., Brega, A., Fellous, M. & Santachiara-Benerecetti, A. S. (1996) Am. J. Hum. Genet. 59, 964–968.
17. Seielstad, M. T., Hebert, J. M., Lin, A. A., Underhill, P. A., Ibrahim, M., Vollrath, D. & Cavalli-Sforza, L. L. (1994) Hum. Mol. Genet. 3, 2159–2161.
18. Underhill, P. A., Jin, L., Zeman, R., Oefner, P. J. & Cavalli-Sforza, L. L. (1996) Proc. Natl. Acad. Sci. USA 93, 196–200.
19. Ruiz-Linares, A., Nayar, K., Goldstein, D. B., Hebert, J. M., Seielstad, M. T., Underhill, P. A., Feldman, M. W. & Cavalli-Sforza, L. L. (1996) Ann. Hum. Genet. 60, 401–408.
20. Mountain, J. L., Hebert, J. M., Bhattacharyya, S., Underhill, P. A., Ottolenghi, C., Gadgil, M. & Cavalli-Sforza, L. L. (1995) Am. J. Hum. Genet. 56, 979–992.
21. Cavalli-Sforza, L. L., Menozzi, P., Piazza, A. & Mountain, J. L. (1988) Proc. Natl. Acad. Sci. USA 85, 6002–6006.
22. Ruhlen, M. (1991) A Guide to the Languages of the World (Stanford Univ. Press, Stanford, CA).
23. Cavalli-Sforza, L. L., Minch, E. & Mountain, J. L. (1992) Proc. Natl. Acad. Sci. USA 89, 5620–5624.
24. Penny, D., Watson, E. E. & Steel, M. A. (1993) Syst. Biol. 42, 382–384.
25. Greenberg, J. H. (1987) Language in the Americas (Stanford Univ. Press, Stanford, CA).
26. Renfrew, C. (1987) Archaeology and Language (Cambridge Univ. Press, Cambridge, U.K.).
27. Diamond, J. (1997) Guns, Germs and Steel (Norton, New York).
28. Cavalli-Sforza, L. L. & Bodmer, W. (1971) The Genetics of Human Populations (Freeman, San Francisco).
Proc. Natl. Acad. Sci. USA Vol. 94, pp. 7725–7729, July 1997 Colloquium Paper
This paper was presented at a colloquium entitled ‘‘Genetics and the Origin of Species,’’ organized by Francisco J. Ayala (Co-chair) and Walter M. Fitch (Co-chair), held January 30–February 1, 1997, at the National Academy of Sciences Beckman Center in Irvine, CA.
DNA variation at the Sod locus of Drosophila melanogaster: An unfolding story of natural selection
RICHARD R. HUDSON, ALBERTO G. SÁEZ, AND FRANCISCO J. AYALA
Department of Ecology and Evolutionary Biology, University of California, Irvine, CA 92697
ABSTRACT Patterns of variation at the Sod locus of Drosophila melanogaster suggest that the protein polymorphism at this locus has very recently arisen. In addition, it appears that a previously rare DNA variant has been recently and rapidly driven to intermediate frequency. From the size of the region (>20 kb) that has been swept along with this rare variant, and patterns of linkage disequilibrium in the region, it is inferred that strength of selection was large (s > 0.01) and that the sweep occurred more than 25,000 generations ago. In addition, there are striking similarities to patterns of variation observed at the Est6 and Est-P loci, which are located approximately 1,000 kb from Sod.

In the late 1960s, protein electrophoresis of soluble enzymes emerged as an important tool of population geneticists. It promised to elucidate the amount and the nature of genetic variation within species and, hence, clarify important aspects of evolution. Electrophoretic surveys of hundreds of populations were carried out. Large amounts of variation were documented. Such studies in some cases were very informative about geographic structure and in some cases revealed unsuspected species or subspecies. But the evolutionary nature of the variation was unclear from the start and remains so. John Gillespie (1) expresses this viewpoint well: ‘‘Naturally occurring variation is an enigma. . . . despite the ease of measurement, we remain essentially ignorant of the forces that maintain variation.’’ The debate has continued for 30 years about whether most protein polymorphisms are the result of mutation and genetic drift of selectively equivalent alleles (the neutralist position) or whether most protein polymorphisms are maintained by some form of balancing selection (the selectionist position). This neutralist–selectionist debate has stimulated a great deal of empirical work (surveys of variation in populations, as well as experimental work on enzyme properties and population ‘‘cage’’ experiments). A large amount of population genetic theory was also developed with the goal of helping to resolve this debate. Yet, it seems safe to say that this controversy was not clearly resolved by the theory and empirical efforts applied to electrophoretic variation. [The empirical and theoretical efforts to understand molecular variation in populations are well reviewed by Kimura (2) and Gillespie (1), who hold very different views on the causes of molecular evolution.]

Approximately 15 years ago, the first studies of variation at the DNA sequence level began to appear. New hope arose that genetic studies at this ultimate level of genetic resolution would help resolve the neutralist–selectionist controversy. For example, new theoretical analysis showed that the existence of a balanced polymorphism (if maintained for long evolutionary periods of time) results in a characteristic peak of variation in the region of DNA surrounding the site where the balancing selection acts (3). Fig. 1 shows the expected levels of variation in a region surrounding a balanced polymorphism and the observed levels of variation (4) at the Adh locus of Drosophila melanogaster. It seemed that a resolution of the controversy was just around the corner. One needed only to survey DNA variation at a large number of loci known to exhibit electrophoretic polymorphisms. Each of those loci at which balancing selection was acting would be expected to show a large peak of variation at linked sites. Hence, the effects of balancing selection would be clearly manifest and the controversy would be over, one way or the other. Alas, surveying populations for DNA sequence variation is labor intensive and expensive, and progress has been slow. With the advent of PCR methodology the pace has certainly accelerated, but a clear picture has not yet emerged.

To advance this data collection effort, we (and our collaborators) have undertaken an in-depth study of variation at the DNA level at the Sod locus in D. melanogaster and related species. The Sod locus was chosen for a number of reasons, including the fact that two electrophoretically distinguishable alleles, designated Fast and Slow, are commonly found segregating at high frequencies in populations around the world and the fact that a number of experiments suggest that Sod variation may be subject to some form of natural selection (5–9). The electrophoretic polymorphism at this locus appeared to be a good candidate for an old balanced polymorphism, for which the signature of a peak of linked variation would be visible. It is also known that the common Slow and Fast alleles differ by a single amino acid (10). We now have DNA sequences of many copies of this gene and regions surrounding the gene. In the following pages, we will summarize the patterns of variation that have been observed at this locus in D. melanogaster. We will also describe our continuing efforts to understand the evolutionary history and significance of the variation at this locus. In particular, we will address the question: Is the Sod polymorphism an old balanced polymorphism?
The Initial Survey

An initial survey of DNA sequence variation at the Sod locus was carried out several years ago (11). A region 1,410 bp long was sequenced in 41 lines of D. melanogaster from localities in California and Barcelona, and included 19 sequences coding for the Slow form of the enzyme and 22 coding for the Fast allele. This sampling of approximately equal numbers of Slow and Fast alleles was done so that the level of variation within alleles could be effectively compared with the level of divergence between alleles. (The frequency of the Slow allele in the populations sampled is roughly 10%.) A total of 63 nucleotide site polymorphisms and 6 insertion/deletion polymorphisms were observed in the sample. Overall, the amount of variation is typical of what has been observed at other loci in D. melanogaster. In addition, Tajima’s test (12) and the test of
© 1997 by The National Academy of Sciences 0027-8424/97/947725-5$2.00/0 PNAS is available online at http://www.pnas.org.
FIG. 1. Observed and predicted levels of divergence between Fast (F) and Slow (S) alleles in a sliding window across the Adh region of D. melanogaster. (The window is 100 silent sites wide.) The site coding for the F/S difference is at position 1500. Note the large peak of variation centered on the F/S site. For details see Kreitman and Hudson (4).
Hudson et al. (13) revealed no significant departures from equilibrium neutral models. There were, however, several surprising and interesting aspects of the sequence variation that suggested the recent action of natural selection. First, among the 22 Fast alleles sequenced, it was found that 9 were identical in sequence for the entire 1,410-bp region examined. This sequence was designated the Fast A haplotype. Second, all 19 Slow alleles were identical to each other and differed from the Fast A haplotype by a single nucleotide, the nucleotide that accounts for the amino acid difference between the Fast and Slow forms of the enzyme. This site will be referred to as the Fast/Slow site. Thus, the Slow/Fast polymorphism is clearly not an old balanced polymorphism; in fact, the Slow allele is obviously very recently derived from the Fast A haplotype. Summarizing, we find that about half of a random sample of sequences at this locus would consist of sequences that are nearly identical to each other, and the other half of the sample would be much more heterogeneous, differing from each other at roughly 20 sites of 1,410. This pattern of variation was demonstrated to be highly incompatible with an equilibrium neutral model (11). Our working hypothesis is that a rare variant (perhaps a new mutation) has recently and rapidly increased in frequency to around 50%. As it increased in frequency, the haplotype in which it was embedded was pulled up in frequency at the same time. Although selection on the Fast/Slow site might have driven the Slow allele to its present frequency, such selection by itself cannot account for the observed high frequency of the Fast A haplotype. Thus, selection on some other site would appear to be involved. It should be noted that the putative polymorphic site upon which selection acts is not necessarily in the region sequenced, but must be tightly linked to it.
Such a selective event, whereby a rare variant is driven to intermediate frequency, could potentially affect a large region of DNA. (We will refer to such an event as a partial selective sweep.) Calculations of Kaplan et al. (14) suggest that a selection coefficient equal to 0.01 can sweep away variation at sites up to 10,000 bp from the site of selection (assuming rates of recombination that are typically observed in D. melanogaster). Fig. 2 shows the patterns of variation to be expected before, immediately after, and some period after, such a partial selective sweep. Immediately after the rise in frequency of the previously rare variant, all chromosomes bearing the selected variant will be identical across essentially the whole region, which was swept along with the selected variant. As time
progresses, the ‘‘selected haplotype’’ (the haplotype in which the selected variant arose) will slowly be broken up by recombination events. Eventually, after much time has passed and further mutations accumulate, if the variation at the selected site is maintained by balancing polymorphism, a peak of linked variation should emerge. This pattern would emerge at a point in time much later than that shown in Fig. 2. From the Sod data of Hudson et al. (11), we know that the region partially swept of variation is bigger than 1,410 bp. In addition, it appears that some recombination has occurred between the sequences since the putative partial selective sweep.

Sequence Variation at Linked Regions

To further investigate this putative selective history, additional lines were sequenced at the Sod locus and at three tightly linked regions. We were particularly interested in assessing the size of the region that had been swept along with the selected site and the amount of recombination and mutation that has occurred since the partial selective sweep. With this additional information, inferences can be made about the strength of selection and the time since the partial sweep occurred. Details of this survey will appear elsewhere, but we will summarize the preliminary results here. In this study, 15 lines of D. melanogaster from El Rio vineyard (Lockeford, San Joaquin County, California) and the Canton S strain of D. melanogaster were sequenced at the Sod locus and three neighboring regions. [The Sod locus of six of these lines, designated here 112, 565F, 581F, 255S, 510S, and 438S, was also sequenced in the earlier study (11).] The three neighboring regions, denoted 2021, 6Kbr3r, and 1819, are located approximately 12.7 kb upstream of Sod, 3.7 kb downstream of Sod, and 19.2 kb downstream of Sod, respectively. Fig. 3 shows the locations and sizes of each of these regions. The polymorphic sites in this sample are indicated in Fig. 4.
It is important to note that the Sod locus shows a pattern of variation similar to that observed in the earlier study. The four Slow allele sequences are not, however, identical in this sample, but consist of two sequences identical to the Slow alleles found earlier and two other sequences that each differ from the other Slow allele sequences at a single site. Of 12 Fast alleles, 5 are the Fast A sequence, 2 more differ by a single site from Fast A, 2 others differ by 3 sites from Fast A, and the 3 others differ considerably from Fast A. Two Fast alleles in
FIG. 2. Representation of the pattern of variation in a population of chromosomes just before (A), just after (B), and some time after (C) a partial selective sweep in which a mutation is rapidly driven to a frequency of 50% by natural selection. The site where selection acts is designated by a circle. The parts of chromosomes highlighted by a thick line indicate segments that are descended from the chromosome on which the original mutant arose.
particular are quite similar to each other and differ from Fast A at roughly 30 sites. Thus, in this new sample, roughly 75% of the Fast alleles are Fast A or a very similar haplotype. Though the sample sizes are very small, this new sample suggests that the frequency of the homogeneous class of haplotypes may be somewhat larger than that suggested by the earlier samples. What patterns of variation are observed at the linked regions? Both the 6Kbr3r region, which is 3.7 kb downstream from Sod, and the 1819 region, which is roughly 19.2 kb downstream from Sod, show patterns of variation that are very similar to the pattern observed at Sod. That is, most of the sequences are very similar to each other, forming a very homogeneous subset,
whereas the other sequences (between 2 and 5 of 16 lines) are relatively diverged from the homogeneous subset, and show some divergence among themselves. Thus, the region showing the Sod pattern of variation extends at least 20 kb downstream from Sod. As indicated earlier, if this pattern of variation is due to a partial selective sweep, quite strong selection (selection coefficient on the order of 0.01) is required. A second important feature of the data from these downstream regions is the pattern of recombination. We note that some of the lines, which form part of the homogeneous subset at the Sod locus, are lines that constitute part of the heterogeneous subset at the 6Kbr3r region. For example, the 510S line, which codes for the Slow Sod allele, forms part of the homogeneous class of haplotypes in the Sod locus. But this line is quite diverged from the homogeneous class of haplotypes in the 6Kbr3r region. Conversely, line 498F is quite diverged from the Fast A haplotype in the Sod locus, but is a member of the very homogeneous subset (differing at a single site from the most common haplotype) in the 6Kbr3r region. These patterns suggest that considerable recombination has occurred between the 6Kbr3r region and the Sod locus in the time since the selective event. The lack of complete linkage disequilibrium between Sod and 6Kbr3r suggests that the selective sweep is not extremely recent. To be somewhat more precise, since linkage disequilibrium decays approximately as exp(−rt), where r is the recombination rate per generation and t is the time measured in generations, we can estimate t. Because linkage disequilibria between sites in the Sod locus and sites in the 6Kbr3r region are substantially decayed, rt is unlikely to be less than one. The recombination rate in D. melanogaster females is estimated to be about 2 × 10⁻⁵/kb per generation (15) (except in regions near centromeres and telomeres).
Taking into account that recombination does not occur in males, we estimate that the recombination rate between Sod and 6Kbr3r, which are approximately 4 kb apart, is approximately 4 × 10⁻⁵. This suggests that the time since the selective sweep is roughly 25,000 generations (= 1.0/(4 × 10⁻⁵)) or longer. (Twenty-five thousand generations correspond to 5,000 years, assuming an average of 5 generations per year.) The time since the putative selective event can also be inferred from the amount of variation that has accumulated within the relatively homogeneous subsets. We will assume provisionally that the sampled lines that constitute the homogeneous subset are related by a star genealogy (i.e., all lineages of the sampled regions remain distinct back to a time near the time of the selective event). This is a reasonable assumption if the effective population size is large and the selective event recent. The low observed frequency of most variants in the homogeneous set is also consistent with this assumption. In the Sod region the 3 lines, 581F, 498F, and 968F, constitute a heterogeneous and diverged subset, and the other 13 lines can be considered to constitute the homogeneous subset. The number of polymorphic sites among this homogeneous subset is nine. Similarly, in the 6Kbr3r region there are 10 sequences in the homogeneous subset, and there are 2 sites polymorphic, and finally in the 1819 region there are 13 or 14 lines in the homogeneous subset, depending on whether one counts line 521F as being a member of the subset or not. If the homogeneous subset includes 521F, then there are 12 polymorphic sites in the subset in this sequenced region. The sequence in the Sod region is 1,408 bp long, of which 439 bp are protein coding. Since in protein coding sequences about 25% of changes are synonymous, the sequenced Sod region is approximately equivalent to 1,079 (= 1,408 − 0.75 × 439) bp of noncoding sequence.
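The two age estimates in this section reduce to a few lines of arithmetic, sketched below in Python. Only numbers quoted in the text are used: the recombination-based estimate uses the figures above, and the mutation-based estimate uses the region lengths (764 and 937 bp), the mutation rate, and the polymorphism counts given in the passage that follows. The function and constant names are illustrative and not from the original paper, and the star genealogy is assumed throughout.

```python
# Back-of-the-envelope age estimates for the partial sweep.
# All names here are illustrative; the numbers are those quoted in the text.

FEMALE_RECOMB_PER_KB = 2e-5   # per kb per generation in females (ref. 15)
DISTANCE_KB = 4.0             # approximate distance between Sod and 6Kbr3r
GENERATIONS_PER_YEAR = 5      # assumption stated in the text


def recombination_age(rt_threshold=1.0):
    """Generations needed for r * t to reach rt_threshold."""
    # No recombination in males, so the sex-averaged rate is half the female rate.
    r = 0.5 * FEMALE_RECOMB_PER_KB * DISTANCE_KB
    return rt_threshold / r


def noncoding_equivalent(total_bp, coding_bp, synonymous_fraction=0.25):
    """Length of sequence that behaves as neutral (noncoding plus silent) sites."""
    return total_bp - (1.0 - synonymous_fraction) * coding_bp


def mutation_age(n_polymorphisms, mu_per_site_year=16e-9):
    """Years since the sweep under a star genealogy: count / (mu * site-lineages)."""
    # (lines in the homogeneous subset, effective neutral length in bp) per region
    regions = [
        (13, round(noncoding_equivalent(1408, 439))),  # Sod
        (10, 764),                                     # 6Kbr3r
        (14, 937),                                     # 1819, counting line 521F
    ]
    site_lineages = sum(n_lines * length for n_lines, length in regions)
    return n_polymorphisms / (mu_per_site_year * site_lineages)


if __name__ == "__main__":
    generations = recombination_age()
    print("recombination-based:", generations, "generations,",
          generations / GENERATIONS_PER_YEAR, "years")
    print("mutation-based, all 23 polymorphisms:", round(mutation_age(23)), "years")
    print("mutation-based, 17 after excluding conversions:", round(mutation_age(17)), "years")
```

The recombination-based figure is a lower bound only in the sense used in the text (rt of at least one); changing `rt_threshold` shows how sensitive the bound is to that choice.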
FIG. 3. Schematic representation of the P1 clone, 112, which was derived from a Canton S strain. The boxes indicate the position of the sequenced regions. The length of these sequences and the distance between their borders are shown. The exons of the Sod coding region are shown as solid areas and indicated with the labels 1 and 2.
Assuming that the other sequenced regions are noncoding and denoting the neutral mutation rate at noncoding and silent sites by μ [assumed to be 16 × 10⁻⁹ per site per year (16, 17)], we find that the expected number of polymorphic sites in the homogeneous subset is μt(13 × 1,079 + 10 × 764 + 14 × 937) = μt × 34,785, where t is the time back to the selection event (in years). If we set μt × 34,785 equal to 23, the observed number of polymorphisms in the homogeneous subset, and solve for t, we find t ≈ 41,000 years. Many of these polymorphisms within the homogeneous subset could be the result of conversion from haplotypes in the heterogeneous subset. In particular, the polymorphisms at sites 124, 142, 394, and 1143 in the Sod region, and sites 140 and 690 in the 1819 region could have resulted from conversion from haplotypes observed in the heterogeneous subset. This leaves only 17 mutations, which leads to an estimate of t of 31,000 years. Other polymorphisms in the homogeneous subset could have resulted from conversion or recombination and, in addition, the neutral mutation rate that we have used is based on substitution rates at silent sites in coding regions and may underestimate the neutral
mutation rate in noncoding regions. Hence, our estimate may be biased upward, but is consistent with our conclusion from the pattern of recombination, which is that the selective event is probably older than 5,000 years. The pattern of variation in the 2021 region, which is 12.7 kb upstream from the Sod locus, is completely different. There is no homogeneous subset of any appreciable size. There is a high level of polymorphism and no hint of the partial selective sweep evident in the other regions. The 2021 region is apparently outside the region of the partial sweep, indicating that the upstream boundary is somewhere between the Sod locus and the 2021 region. It would be desirable to confirm that regions further upstream continue to exhibit the ‘‘normal’’ pattern of variation, and to narrow
FIG. 4. Polymorphic sites found in the lines sampled from El Rio, California and the Canton S strain (designated 112). For each sequenced region the line designations appear on the left and the positions of polymorphic sites are indicated at the top. The lines are in a different order in each of the four regions to put similar haplotypes together in each region.
the position of this upstream boundary by examining additional regions between Sod and 2021.

Discussion and Conclusions

The Slow/Fast polymorphism is clearly not an old balanced polymorphism. On the other hand, the data suggest that natural selection has acted recently and strongly on variation in the neighborhood of Sod. The data appear compatible with a model in which a rare variant has recently risen rapidly in frequency, and is perhaps now subject to some form of balancing selection. The data at this point do not allow us to put an upper bound on the size of the region that has been partially swept of variation. The 2021 region, which is 12.7 kb upstream from Sod, appears to be outside the swept region. Hence, one boundary of the swept region appears to be between the Sod locus and the 2021 region. Because the 1819 region, roughly 20 kb downstream of Sod, appears to be in the swept region, we conclude that the swept region is greater than 20 kb in length. This in turn suggests that surprisingly strong selection acted on the selected site (selection coefficient on the order of 0.01 or higher). The pattern of linkage disequilibrium is like Fig. 2C, from which we infer that the time since the sweep is 25,000 generations or more. These results force one to consider the possibility that balanced polymorphisms may typically be short-lived, arising and being maintained for a time too short to result in the strong peak of linked variation, such as that observed at Adh. The two best documented cases of long-maintained balanced polymorphisms are in the major histocompatibility complex in mammals (18, 19) and the S-locus (mating incompatibility determining locus) of some plants (20). These two cases involve large numbers of maintained alleles and remarkably old lineages (which presumably result in a peak of variation at linked sites). These two examples may be the very rare exception.
The most celebrated case of a balanced polymorphism is the sickle cell variant at the β-globin locus of humans. This is a case where strong balancing selection is well documented, and it is also well documented that the currently segregating sickle cell variants are recently arisen (and have arisen independently in different populations). Perhaps the sickle cell case, and the Sod case, are illustrative of the most common situation for protein polymorphisms, not the ancient stable polymorphisms that result in peaks of variation at linked sites. These short-lived polymorphisms might be compatible with models that incorporate temporal and spatial variability in selection coefficients (1). On the other hand, it is important to consider alternatives to the partial selective sweep hypothesis. This is particularly so given some similarities between the pattern of variation seen at Sod and the results of surveys at Est6 (21, 22) and Est-P (23), which are located approximately 1 centimorgan or 1,000 kb from Sod (24). The pattern of variation in the Sod locus and in the 1819 region in our recent study is remarkable for having two similar haplotypes that are highly diverged from the rest of the sample. (Note in Fig. 4 lines 968F and 498F in the Sod locus and lines 968F and 112 in the 1819 region.) This pattern of variation seems surprising and may not be expected under the sweep hypothesis. A very similar pattern was also observed in a recent survey at the Est-P locus (23). In that survey, which used a subset of the same lines used in our study, line 357F is highly diverged from the rest of the sample. Est-P and Sod are too far apart to be affected by the same selective sweep with normal rates of recombination. The presence of a small number of highly diverged haplotypes suggests the possibility of a distinct isolated subpopulation that has recently merged with another population.
This could have been a geographically isolated population, but another possibility is that the diverged haplotype is or was part of an inversion. In(3L)P is a common and widespread inversion that contains Est-P and Sod
(25), but no third chromosome inversions were found to be segregating in the El Rio population (26). It is not known if the diverged haplotypes could represent sequences that have ‘‘escaped’’ from an inversion, as has been previously suggested (22) for sequences at the Est6 locus, which is very closely linked to Est-P. Parenthetically, we note that two highly divergent lines were found in a survey of variation in the Adh region of D. melanogaster (4). The possibility of an inversion was investigated, and no direct evidence for such an inversion was found. It is also worth noting that an earlier study of DNA sequence variation at Est6 found patterns that ‘‘suggest that allozyme 8 has both arisen and proliferated relatively recently’’ (21). Thus, at Est6 as well as at Sod, it appears that certain variants have recently been driven to high frequency. We conclude that the patterns of DNA sequence variation at Est6 and Est-P have intriguing similarities to those at Sod. These may reflect similar independent selection events, but the possibility of some event or process affecting this large segment of chromosome 3 must also be entertained.

We thank Kevin Bailey, Evgeniy S. Balakirev, and Eladio Barrio for contributions to the early stages of this project; Shiliang Qin and John W. Jacobs for use of a laboratory of the Hitachi Chemical Research Center, Inc. at the University of California, Irvine; Evgeniy S. Balakirev, Jordi Bascompte, and Francisco Rodriguez-Trelles for discussions; and Nelsson Becerra for technical assistance. This work was supported by National Institutes of Health Grant GM-42397 to F.J.A. and by a postdoctoral fellowship to A.G.S. from the Spanish Ministry of Education and Science.

1. Gillespie, J. H. (1991) The Causes of Molecular Evolution (Oxford Univ. Press, New York).
2. Kimura, M. (1983) The Neutral Theory of Molecular Evolution (Cambridge Univ. Press, Cambridge, U.K.).
3. Hudson, R. R. & Kaplan, N. L. (1988) Genetics 120, 831–840.
4. Kreitman, M. & Hudson, R. R. (1991) Genetics 127, 565–582.
5. Lee, Y. M., Misra, H. P. & Ayala, F. J. (1981) Proc. Natl. Acad. Sci. USA 78, 7052–7055.
6. Singh, R. S., Hickey, D. A. & David, J. (1982) Genetics 101, 235–256.
7. Peng, T., Moya, A. & Ayala, F. J. (1986) Proc. Natl. Acad. Sci. USA 83, 684–687.
8. Peng, T. X., Moya, A. & Ayala, F. J. (1991) Genetics 128, 381–391.
9. Tyler, R. H., Brar, H., Singh, M., Latorre, A., Graves, J. L., Mueller, L. D., Rose, M. R. & Ayala, F. J. (1993) Genetica 91, 143–149.
10. Lee, Y. M. & Ayala, F. J. (1985) FEBS Lett. 179, 115–119.
11. Hudson, R. R., Bailey, K., Skarecky, D., Kwiatowski, J. & Ayala, F. J. (1994) Genetics 136, 1329–1340.
12. Tajima, F. (1989) Genetics 123, 585–595.
13. Hudson, R. R., Kreitman, M. & Aguadé, M. (1987) Genetics 116, 153–159.
14. Kaplan, N. L., Hudson, R. R. & Langley, C. H. (1989) Genetics 123, 887–899.
15. Chovnick, A., Gelbart, W. & McCarron, M. (1977) Cell 11, 1–10.
16. Sharp, P. M. & Li, W.-H. (1989) J. Mol. Evol. 28, 398–402.
17. Rowan, R. G. & Hunt, J. A. (1991) Mol. Biol. Evol. 8, 49–70.
18. Klein, J. (1986) Natural History of the Major Histocompatibility Complex (Wiley, New York).
19. Takahata, N. (1993) in Mechanisms of Molecular Evolution, eds. Takahata, N. & Clark, A. G. (Japan Sci. Soc., Tokyo), pp. 1–21.
20. Clark, A. G. (1993) in Mechanisms of Molecular Evolution, eds. Takahata, N. & Clark, A. G. (Japan Sci. Soc., Tokyo), pp. 79–108.
21. Cooke, P. H. & Oakeshott, J. G. (1989) Proc. Natl. Acad. Sci. USA 86, 1426–1430.
22. Odgers, W. A., Healy, M. J. & Oakeshott, J. G. (1995) Genetics 141, 215–222.
23. Balakirev, E. S. & Ayala, F. J. (1996) Genetics 144, 1511–1518.
24. Hartl, D. L. & Lozovskaya, E. R. (1995) The Drosophila Genome Map: A Practical Guide (Landes, Austin, TX).
25. Voelker, R. A., Cockerham, C. C., Johnson, F. M., Schaffer, H. E., Mukai, T. & Mettler, L. E. (1978) Genetics 88, 515–527.
26. Smit-McBride, Z., Moya, A. & Ayala, F. J. (1988) Genetics 120, 1043–1051.
Proc. Natl. Acad. Sci. USA Vol. 94, pp. 7730–7734, July 1997 Colloquium Paper
This paper was presented at a colloquium entitled ‘‘Genetics and the Origin of Species,’’ organized by Francisco J. Ayala (Co-chair) and Walter M. Fitch (Co-chair), held January 30–February 1, 1997, at the National Academy of Sciences Beckman Center in Irvine, CA.
Neutral behavior of shared polymorphism
ANDREW G. CLARK
Institute of Molecular Evolutionary Genetics, Department of Biology, 208 Mueller Laboratory, Pennsylvania State University, University Park, PA 16802
One of the more striking patterns of naturally occurring genetic variation documented by Dobzhansky (1) is the polymorphism of third-chromosome inversions in Drosophila pseudoobscura. The broad geographic distribution and temporal stability of this polymorphism led Dobzhansky and others to conclude that the polymorphism is stably maintained by natural selection. Laboratory cage experiments suggested that the inversions differed significantly in fitness (2). Analysis of restriction site variation on the third-chromosome inversions revealed that the molecular phylogeny is concordant with the previously inferred inversion phylogeny (based on overlaps in the inversions), and that the persimilis allele clusters within the range of pseudoobscura alleles (3). The inversion polymorphism was estimated to be about 2 million years old, which is about the age of the pseudoobscura-miranda split (3). Given the widespread nature of the pseudoobscura-inversion polymorphisms, and their evidently ancient origin, it becomes an intriguing question why D. persimilis, which had a common ancestor with D. pseudoobscura only 2 million years ago, does not share the polymorphism for at least some of the pseudoobscura third-chromosome inversions. In fact, the two species have only the standard arrangement in common. It appears that the answer lies in the demographics of the process of speciation itself, and an important conclusion of this analysis is that searches for shared polymorphism in species young enough to expect some sharing even of strictly neutral genes may shed considerable light on these demographic processes. One of the most striking examples of shared polymorphism (or ‘‘trans-species’’ polymorphism) can be found among alleles of the class I and class II major histocompatibility complex (MHC) genes. 
Allele sharing was first noticed in constructing gene trees of MHC class I alleles and finding that the human and chimp alleles are often more closely related than they are to other alleles (4, 5). Statistical significance of the shared polymorphism was verified by showing that neighbor-joining trees yield clusters of alleles that are significantly divergent from other clusters of alleles by bootstrap tests, and each cluster bears alleles from both species (6). Such shared polymorphism can be found in gorilla as well (7). Even more striking is the degree of allele sharing in the DRB genes, in which not only alleles but also remnants of haplotype structure seem to be shared (8, 9). In all these cases, the exceptionally long-lived polymorphism is thought to have been maintained by natural selection favoring diversity, particularly in the peptide binding region of the MHC molecules. The argument has been made that such polymorphisms are not consistent with the neutral theory, given the time back to the common ancestor of humans and other primates (10). Shared polymorphism between humans and chimpanzees is striking because it implies that the polymorphisms have been maintained in both species since the time of common ancestry, or about 4 million years. Assuming a generation time of 20
ABSTRACT Several cases have been described in the literature where genetic polymorphism appears to be shared between a pair of species. Here we examine the distribution of times to random loss of shared polymorphism in the context of the neutral Wright–Fisher model. Order statistics are used to obtain the distribution of times to loss of a shared polymorphism based on Kimura’s solution to the diffusion approximation of the Wright–Fisher model. In a single species, the expected absorption time for a neutral allele having an initial allele frequency of 1⁄2 is 2.77 N generations. If two species initially share a polymorphism, that shared polymorphism is lost as soon as either of two species undergoes fixation. The loss of a shared polymorphism thus occurs sooner than loss of polymorphism in a single species and has an expected time of 1.7 N generations. Molecular sequences of genes with shared polymorphism may be characterized by the count of the number of sites that segregate in both species for the same nucleotides (or amino acids). The distribution of the expected numbers of these shared polymorphic sites also is obtained. Shared polymorphism appears to be more likely at genetic loci that have an unusually large number of segregating alleles, and the neutral coalescent proves to be very useful in determining the probability of shared allelic lineages expected by chance. These results are related to examples of shared polymorphism in the literature. Shared polymorphism may be formally defined as follows: suppose species A has two alleles at a locus, A1 and A2, and species B also has two alleles at the homologous locus, labeled B1 and B2. Shared polymorphism occurs if alleles A1 and B1 cluster together and are significantly divergent from alleles A2 and B2, which also cluster together. The biological conclusions to be drawn from shared polymorphism depend on the chance that neutral alleles can exhibit this property. 
Formally, shared polymorphism may arise either when there was a polymorphism in the population ancestral to the two species examined today, and that polymorphism has been maintained through the two distinct species’ lineages, or by more recent parallel generation of similar alleles. If the identity of alleles is well described, as is the case for DNA sequences, it may be extremely unlikely that multiple parallel mutations had occurred. In such cases, polymorphism maintained in both lineages since the time of the common ancestor is the most plausible explanation. Cases of shared polymorphism generally involve genes whose function suggests a mechanism whereby strong natural selection acts to maintain a highly diverse set of alleles. If a gene of unknown function exhibits interspecific shared polymorphism, we would like to know whether it is appropriate to argue that the gene is likely to be undergoing a similar pattern of diversity-enhancing selection.
© 1997 by The National Academy of Sciences 0027-8424/97/947730-5$2.00/0 PNAS is available online at http://www.pnas.org.
Abbreviation: MHC, major histocompatibility complex.
years and a long-term effective population size of 10,000, this represents 20 N generations. The significance of shared polymorphism between humans and our closest relatives is suspected when one compares this figure of 20 N generations to the expected fixation time for a neutral polymorphism (4 N generations). But it is necessary to determine the distribution of time to loss of shared polymorphism, as there may be a long tail with substantial probability density for much greater durations of sharing. Shared polymorphism in the plant kingdom is even more striking, especially in self-incompatibility genes that have generated appallingly diverse alleles. In the Solanaceae, shared polymorphism of S-alleles may be as old as 70 million years and is accompanied by within-species divergence of allelic lineages of over 50% at the amino acid level (11). Population genetic models of self-incompatibility show that the coalescence time of alleles varies inversely with the rate of origination of novel functional alleles, and that for reasonable estimates of the rate of origination of new alleles, such extremely old polymorphisms are not unlikely (12). Interdigitation of alleles from different species continues as more species are added to the list of sequenced S-alleles (13). Closely related species might be expected to show higher levels of shared polymorphism, and in some cases this is borne out. In the species clade of Drosophila melanogaster, simulans, sechellia, and mauritiana, only simulans and mauritiana appear to exhibit substantial levels of shared polymorphism as revealed from gene trees of yp2, per, and zeste (14, 15). Sequence divergence does not suggest that mauritiana is younger than sechellia, yet the level of shared polymorphism is much greater between mauritiana and simulans than it is between sechellia and simulans.
This observation suggests that the historical population size of sechellia has been much smaller than that of mauritiana, either through a small founding population or a long-term small population size.

Expected Persistence of Neutral Shared Polymorphism

When a highly diverse species splits into two species, then depending on the largely unknown demographic aspects of the splitting process, the two new species may bear a sizable proportion of the polymorphism present in the original species. Assuming the reproductive barrier is complete, there follows a period of time during which this shared polymorphism is lost. Relatively little theoretical attention has been paid to the dynamics of this loss, but it appears that some interesting issues arise in this analysis. Considering first a classical approach, imagine two alleles, A and a, segregating in two independent populations, each having N diploid individuals and an initial allele frequency p for the A allele. The distribution of fixation time for a neutral allele was obtained by Kimura (16), and it is an ungainly expression involving the hypergeometric function:
for the probability density of time to loss of shared polymorphism. Fig. 1 shows the probability density for time to loss of shared polymorphism in this context. The important point is that, while it is true that the mean time to loss of shared polymorphism is rather short (1.7 N generations), the density has a long tail to the right, so that with finite probability shared neutral polymorphism can last much longer. In particular, 5% of the time, shared polymorphism is retained until 3.8 N generations, and 1% of the time it is maintained until 5.3 N generations.

Loss of Shared Polymorphic Sites Over Time

The above discussion relates to the case in which there are two clearly distinct allelic lineages. Often molecular data will not be quite so clear, because each allelic lineage will have suffered many mutations since the time of common ancestry. Molecular data allow one to ask how many of the segregating sites in the two species are shared. The two species will undergo a period of random genetic drift in which shared polymorphic sites go to fixation in one or the other species. If the sites undergo random fixation slowly enough that intragenic recombination is likely between rounds of fixation, then it may be acceptable to consider the case of adjacent sites being lost independently of flanking sites. This results in a process of decay in the number of shared polymorphic sites that follows an approximately geometric distribution, and a simulation of the process is shown in Fig. 2.

Number of Shared Polymorphic Sites Expected at Steady State

The process of loss of shared polymorphism does not continue until there are zero sites, because by chance one expects to have some shared sites, particularly if both species are highly polymorphic. Consider first the chance of observing shared polymorphic sites if those sites are randomly and independently scattered along the sequence.
Suppose one samples two sequences of length S from each of two species, and there are s1 sites that differ between the pair of alleles in species 1 and s2 sites that differ in the pair of alleles in species 2. If the polymorphic sites are uniformly and independently distributed, then the probability that k sites are polymorphic in both species (i.e., that there are k shared polymorphic sites) is given by the hypergeometric density:
f ~ 1,t ! 5 p 1 S i~ 2i 1 1 ! pq(2 1) i F ~ 1 2 i, i 1 2, 2, p ! e2@i~i11!y4N!t.
[1]
If the same Wright–Fisher model is applied to two independent populations, the shared polymorphism is lost as soon as absorption occurs in either species. If f(p,t) is the probability density function of absorption time in one population [and F(p,t) is the corresponding cumulative distribution function], then g(p,t) 5 n[(1 2 F(t)]n21f(t) is the density for the first absorption in a collection of n identically behaved populations. For our case, n 5 2 and we get:
$$g(p,t) = 2\,[1 - F(p,t)]\,f(p,t) \qquad [2]$$
FIG. 1. Probability density for time to loss of shared polymorphism based on numerical integration of Kimura's (16) density for absorption time. Dots are simulated values for 1,000 replicate populations with 2N = 50, with initial allele frequency p = 0.5.
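The simulated points in Fig. 1 can be reproduced in outline with a short forward simulation. The sketch below is a minimal illustration and not the author's code: it assumes two independent Wright–Fisher populations of 2N = 50 gene copies, both started at p = 0.5, with binomial resampling each generation, and it records the generation at which the allele is first fixed or lost in either population. The function names and the random seed are hypothetical choices made for this example.

```python
import random

def time_to_loss(two_n, p0, rng):
    """Generations until the allele is fixed or lost in EITHER of two
    independent Wright-Fisher populations (assumed model, see lead-in)."""
    counts = [round(p0 * two_n), round(p0 * two_n)]
    t = 0
    # Shared polymorphism persists only while both populations segregate.
    while all(0 < c < two_n for c in counts):
        # Binomial resampling of two_n gene copies in each population.
        counts = [sum(rng.random() < c / two_n for _ in range(two_n))
                  for c in counts]
        t += 1
    return t

def simulate(replicates=1000, two_n=50, p0=0.5, seed=1):
    """Replicate times to loss of shared polymorphism."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    return [time_to_loss(two_n, p0, rng) for _ in range(replicates)]

times = simulate()
# Express the mean in units of N generations (N = 25 when 2N = 50).
mean_in_units_of_n = sum(times) / len(times) / 25
```

Dividing by N = 25 puts the simulated mean on the same scale as the 1.7N figure quoted in the text; the upper tail of `times` can be inspected in the same units.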
Proc. Natl. Acad. Sci. USA 94 (1997)
Colloquium Paper: Clark
FIG. 2. Loss of shared polymorphic sites in two species initially segregating 1,000 shared polymorphic sites. One thousand replicate populations with 2N = 50 were simulated forward in time. All shared sites were assumed to be unlinked.
$$\Pr(k \text{ shared sites}) = \frac{\binom{s_1}{k}\binom{S - s_1}{s_2 - k}}{\binom{S}{s_2}}. \qquad [3]$$
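Eq. 3 is straightforward to evaluate numerically. The following sketch computes the hypergeometric probabilities with Python's standard library; the values chosen for S, s1, and s2 are hypothetical and are not taken from the data analyzed in this paper.

```python
from math import comb

def prob_shared_sites(k, S, s1, s2):
    """Eq. 3: probability that exactly k sites are polymorphic in both
    species, given s1 and s2 polymorphic sites among S sites."""
    if k < 0 or k > min(s1, s2):
        return 0.0
    # comb() returns 0 when s2 - k exceeds S - s1, so no extra guard is needed.
    return comb(s1, k) * comb(S - s1, s2 - k) / comb(S, s2)

# Hypothetical example values (assumptions for illustration only).
S, s1, s2 = 1000, 30, 20
dist = [prob_shared_sites(k, S, s1, s2) for k in range(min(s1, s2) + 1)]
expected_shared = sum(k * pk for k, pk in enumerate(dist))
```

Under these assumptions the expected number of shared sites equals s1·s2/S, the mean of the hypergeometric distribution, which gives a quick check on the computed distribution.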
One-third of the time a site that is segregating in two species will be segregating for the same pair of nucleotides (assume they share one nucleotide through common ancestry). When one examines multiple alleles, the expected number of sites that have shared polymorphism is a more complex expression, and it is easiest to obtain this null distribution by resampling the data by computer. Fig. 3 illustrates this for some Drosophila and human–chimp MHC data. In both cases, the observed number of shared polymorphic sites exceeds all values in the null distribution obtained by computer resampling, just as had been observed in the case of S-alleles by Ioerger et al. (11).

The Neutral Coalescent and Shared Polymorphism

Fig. 4 illustrates one way to conceptualize the problem of shared polymorphism in the context of the coalescent. If two species each coalesce to a common ancestral allele more recently than the time at which they share a common ancestor, then there is no shared polymorphism (Fig. 4A). On the other hand, if they do not have this recent coalescence event, then they share polymorphism (Fig. 4B). The chance that two alleles had two distinct ancestors the previous generation is $1 - \frac{1}{2N}$ for a diploid population. The chance that three alleles had three distinct ancestors is $\left(1 - \frac{1}{2N}\right)\left(1 - \frac{2}{2N}\right)$, and the chance that n alleles had n ancestors is
$$\prod_{i=1}^{n-1}\left(1 - \frac{i}{2N}\right)$$
(17). This gives rise to the expression that the probability that the first coalescence occurred t + 1 generations in the past is
$$\Pr(\text{first coalescence at generation } t+1) = \frac{\binom{n}{2}}{2N}\, e^{-\frac{\binom{n}{2}}{2N}\,t}. \qquad [4]$$

Now consider a sample drawn from two distinct populations each with n alleles. Initially let there be n shared allelic lineages in the samples. The probability that they both have n ancestral alleles the previous generation is $\left(1 - \binom{n}{2}/2N\right)^{2}$, because the
FIG. 3. Observed and expected numbers of shared polymorphic sites under the assumption of a random distribution of polymorphic sites in each species. (A) Data from Drosophila simulans and D. mauritiana for the yp2, zeste, and per genes (15). (B) Number of shared polymorphic sites in a sample of 17 human and seven chimpanzee MHC class I A alleles.
process of coalescence and sampling are independent in the two species. When there is a coalescence in either species, then n 2 1 shared lineages are left. At this point the process starts anew with n 2 1 shared lineages, until one or the other population has another coalescence. Eventually there will be two shared lineages, and the next coalescence results in loss of the shared polymorphism. From this it is not difficult to show:
$$\Pr(n \text{ lineages for } t \text{ gen},\ n-1 \text{ at gen } t+1) = \frac{\binom{n}{2}}{2N}\, e^{-\frac{\binom{n}{2}}{2N}\,(2t+1)}. \qquad [5]$$

This equation gives the recursion for the time to loss of shared polymorphism. Fig. 5 plots a solid line for this time to loss of shared polymorphism and also plots simulated points, showing an excellent agreement. With this theory, we can now make statements about the probability of observing a particular level of shared polymorphism given data on two species, provided we know their time of divergence.

Discussion

The problem of shared polymorphism and the distribution of numbers of shared polymorphic sites is closely related to the problem of persistence time of polymorphism in a single species. Related problems have been studied in the past, and
FIG. 4. (A) Coalescence of all variation to a single common ancestral allele at a more recent time than the common ancestor of the two species. (B) Shared polymorphism arising from lack of complete coalescence of allelic lineages in the timespan since speciation.
some insights can be drawn from those. The reverse of this problem was studied by Nei and Li (18), who determined the probability of identical monomorphism in a pair of species that descended from a common ancestor. They found, given reasonable estimates of population size and mutation rates, that monomorphism for the same allele in humans and chimpanzees is not unlikely under neutrality. Griffiths and Li (19) determined properties of fixation in pairs of lineages under more general initial conditions, that is, rather than conditioning the distribution of absorption time on initial allele frequency, as Kimura (16) did, they considered the starting condition to be the steady-state infinite alleles frequency spectrum. They went on to ask about the number of common alleles shared between two populations that descended from a common ancestor. Computer simulations were done forward in time, scoring the allele distribution at times separated by 2t
generations, a process formally equivalent to simulating two populations each for t generations from a common starting point. Although they did not record the frequency of shared polymorphism, their results were consistent with the current results in demonstrating that the persistence time of alleles can have a long tail. For example, when θ = 0.1, even after 40N generations, the probability that two species share an allele is 15% (19).

Perhaps even more important will be an analysis of the distribution of shared polymorphism under different models of selection. With symmetric overdominant selection, the gene tree has the same branching topology as a neutral gene, but with potentially much deeper times of coalescence (20). Computer simulation of allelic genealogies under overdominant selection showed that with mutation rates, effective sizes, and selection coefficients that seem plausible for MHC loci, the coalescence times may be on the order of tens of millions of years (10). Under conditions where within-species coalescence times are so long, it is reasonable to expect the duration of shared polymorphism between species to be greatly lengthened also. S-alleles and MHC clearly show shared polymorphism due to strong selection to maintain the variation. The basis for the selection in both cases is that the protein products of these genes confer a fitness advantage on the bearer of those alleles if they are heterozygous or otherwise more diverse. This implies that the selection acts on coding sequences and should affect replacement sites more than silent sites. If one constructs a 2 × 2 table of shared vs. nonshared polymorphisms at silent vs. replacement sites, a significant χ² may be consistent with this mode of selection. In both cases, this test is significant (21). In the Drosophila melanogaster group species, it appears that simulans and mauritiana are recently enough diverged to maintain many shared polymorphisms by common ancestry of neutral variation.
Such recent common ancestry allows an exciting opportunity to examine many genes and characterize the distribution of shared polymorphism across those genes. Genes having stronger purifying selection will lose the shared polymorphism faster than neutral genes, and genes having any sort of selection that maintains diversity will have an excess of shared polymorphism.

The future of human genetics will see an explosion of interest in inferences about ancestral history that can be drawn from extant genetic variation. Coupling these studies to analysis of chimpanzee and other primate polymorphism is likely to be extremely informative. In the first place, shared polymorphism is an excellent filter for searching for genes under strong selection to maintain polymorphism. The divergence time of humans and chimpanzees is estimated to be about 20N generations ago, long enough that very few neutral shared polymorphisms will be left. Polymorphisms that are shared will be the targets of study to find what functional aspect of those genes results in such longevity of within-species polymorphism. The consideration of polymorphism in populations of ancestral humans is essential to the use of extant human genetic variation to test hypotheses about human origins. Shared polymorphism may provide an unusual opportunity to test ideas about the demographic changes that occurred in the early history of emerging species.

This work was supported by National Science Foundation Grants DEB 9419631 and DEB 9527592.
FIG. 5. Probability density for time to loss of shared polymorphism based on time of first coalescence for a pair of species. The solid line is obtained from Eq. 5, and the dots are from simulating the process of drift forward in time in two finite populations each of size 2 N 5 50. One thousand replicate population pairs were followed.
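The stepwise process described in the text (n shared lineages, then n − 1, and so on until the last two shared lineages coalesce in one species) can be sketched in a few lines. This is an illustrative reading of the model and not the author's program: it assumes the exact per-generation probability given in the text, that k lineages remain distinct in both species with probability (1 − C(k,2)/2N)², rather than the exponential form of Eq. 5, and it assumes at most one coalescence per generation. The function names and parameter values are hypothetical.

```python
import random
from math import comb

def step_prob(k, two_n):
    """Per-generation chance that k shared lineages drop to k - 1, i.e.
    that a coalescence occurs in either of two independent populations."""
    no_coalescence_one_species = 1 - comb(k, 2) / two_n
    return 1 - no_coalescence_one_species ** 2

def expected_time_to_loss(n, two_n):
    """Sum of the mean geometric waiting times for k = n, n - 1, ..., 2."""
    return sum(1 / step_prob(k, two_n) for k in range(2, n + 1))

def simulate_time_to_loss(n, two_n, rng):
    """One replicate: generations until the last two shared lineages
    coalesce in one species and the shared polymorphism is lost."""
    t = 0
    for k in range(n, 1, -1):
        p = step_prob(k, two_n)
        while True:
            t += 1
            if rng.random() < p:
                break
    return t

rng = random.Random(1)  # hypothetical seed
replicates = [simulate_time_to_loss(5, 50, rng) for _ in range(1000)]
simulated_mean = sum(replicates) / len(replicates)
analytic_mean = expected_time_to_loss(5, 50)
```

Comparing `simulated_mean` with `analytic_mean` is a rough analogue of the comparison between simulated points and the theoretical curve in Fig. 5, under the assumptions stated above.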
1. Dobzhansky, Th. (1937) Genetics and the Origin of Species (Columbia Univ. Press, New York).
2. Anderson, W. W., Oshima, C., Watanabe, T., Dobzhansky, Th. & Pavlovsky, O. (1968) Genetics 58, 423–434.
3. Aquadro, C. F., Weaver, A. L., Schaeffer, S. W. & Anderson, W. W. (1991) Proc. Natl. Acad. Sci. USA 88, 305–309.
4. Lawlor, D. A., Ward, F. E., Ennis, P. D., Jackson, A. P. & Parham, P. (1988) Nature (London) 335, 268–271.
5. Mayer, W. E., Jonker, M., Klein, D., Ivanyi, P., van Seventer, G. & Klein, J. (1988) EMBO J. 7, 2765–2774.
6. Nei, M. & Rzhetsky, A. (1991) in Evolution of MHC Genes, eds. Klein, J. & Klein, D. (Springer, Heidelberg), pp. 13–27.
7. Vendetti, C. P., Lawlor, D. A., Sharma, P. & Chorney, M. J. (1996) Hum. Immunol. 49, 71–84.
8. Gongora, R., Figueroa, F. & Klein, J. (1996) Hum. Immunol. 51, 23–31.
9. Satta, Y., Mayer, W. E. & Klein, J. (1996) Hum. Immunol. 51, 1–12.
10. Takahata, N. & Nei, M. (1990) Genetics 124, 967–978.
11. Ioerger, T. R., Clark, A. G. & Kao, T.-h. (1990) Proc. Natl. Acad. Sci. USA 87, 9732–9735.
12. Clark, A. G. & Kao, T.-h. (1994) in Genetic Control of Self-Incompatibility and Reproductive Development in Flowering Plants, eds. Williams, E. G., Clarke, A. E. & Knox, R. B. (Kluwer, Dordrecht), pp. 220–242.
13. Richman, A. D., Uyenoyama, M. K. & Kohn, J. R. (1996) Science 273, 1212–1216.
14. Hey, J. & Kliman, R. M. (1993) Mol. Biol. Evol. 10, 804–822.
15. Kliman, R. M. & Hey, J. (1993) Genetics 133, 375–387.
16. Kimura, M. (1955) Proc. Natl. Acad. Sci. USA 41, 144–150.
17. Tajima, F. (1983) Genetics 105, 437–460.
18. Nei, M. & Li, W.-H. (1975) Genet. Res. 26, 31–43.
19. Griffiths, R. C. & Li, W.-H. (1983) Theor. Pop. Biol. 23, 19–33.
20. Takahata, N. (1990) Proc. Natl. Acad. Sci. USA 87, 2419–2423.
21. Clark, A. G. & Kao, T.-h. (1991) Proc. Natl. Acad. Sci. USA 88, 9823–9827.
Proc. Natl. Acad. Sci. USA Vol. 94, pp. 7735–7741, July 1997 Colloquium Paper
This paper was presented at a colloquium entitled ‘‘Genetics and the Origin of Species,’’ organized by Francisco J. Ayala (Co-chair) and Walter M. Fitch (Co-chair), held January 30–February 1, 1997, at the National Academy of Sciences Beckman Center in Irvine, CA.
Divergent and conserved features in the spatial expression of the Drosophila pseudoobscura esterase-5B gene and the esterase-6 gene of Drosophila melanogaster NATALIA A. TAMARINA*, MICHAEL Z. LUDWIG†,
AND
ROLLIN C. RICHMOND‡§
*Division of Vascular Surgery, Department of Surgery, Northwestern University Medical School, 251 East Chicago Avenue, Suite 626, Chicago, IL 60611; †Department of Ecology and Evolution, University of Chicago, 1101 East 57 Street, Chicago, IL 60637; and ‡Department of Ecology and Evolution, State University of New York at Stony Brook, Stony Brook, NY 11794
The upstream region of the Est-6 gene is capable of directing the expression of a lacZ reporter gene in mature adult flies (older than 1 day posteclosion) in tissues such as antennae, maxillary palps, salivary glands, trachea and air sacs of head and thorax of males and females, as well as organs of the male reproductive tract (the ejaculatory bulb and anterior ejaculatory duct). lacZ expression in these tissues appears at 12 hr posteclosion. In young flies less than 12 hr posteclosion, we observed lacZ expression only on the prefrons. This expression pattern usually disappeared as the mature pattern described above developed (4). These data and other results derived from analyses of deletion mutants lead to the conclusion that the upstream region of Est-6 is composed of several independently acting, tissue-specific regulatory regions (4). We report here an analysis of the functional organization of the upstream regulatory region of the Est-5B gene of D. pseudoobscura and add to our previous results on the regulation of the Est-6 gene of D. melanogaster. This analysis allows us to evaluate the degree and kinds of functional differences that arose in the upstream regions of the Est-5B and Est-6 genes during evolution. We placed the upstream DNA of the Est-5B and Est-6 genes 5′ of the reporter gene in a P element-based, promoterless vector and monitored the patterns of expression of the resulting hybrid genes after transformation into D. melanogaster. This approach allowed us to compare the action of the tissue-specific, cis-regulatory elements present in the Est-5B and Est-6 genes in a common genetic background. This experimental design also eliminates the possible influence of species-specific trans-acting factors on the expression patterns of transformed genes. Our data show that the upstream regions of the Est-6 (1,150 bp) and Est-5B (5,100 bp and 1,224 bp) genes confer different expression patterns on the reporter gene.
The common sites of expression for these genes are the third segment of the antenna, maxillary palps, and salivary glands of the adults. The results of in vitro deletion mutagenesis show that the two genes probably have a different organization of regulatory elements that control expression in conserved tissue locations. These results suggest that the conservation of the expression pattern of genes that evolved from a common ancestor is not necessarily accompanied by an evolutionary preservation of the corresponding cis-regulatory elements.
ABSTRACT The regulatory regions of homologous genes encoding esterase 6 (Est-6) of Drosophila melanogaster and esterase 5B (Est-5B) of Drosophila pseudoobscura show very little similarity. We have undertaken a comparative study of the pattern of expression directed by the Est-5B and Est-6 5′-flanking DNA to attempt to reveal conserved elements regulating tissue-specific expression in adults. Esterase regulatory sequences were linked to a lacZ reporter gene and transformed into D. melanogaster embryos. Est-5B 5′ upstream elements give rise to a β-galactosidase expression pattern that coincides with the wild-type expression of Est-5B in D. pseudoobscura. The expression patterns of the Est-5B/lacZ construct are different from those of a fusion gene containing the upstream region of Est-6. Common sites of expression for both kinds of constructs are the third segment of antenna, the maxillary palps, and salivary glands. In vitro deletion mutagenesis has shown that the two genes have a different organization of regulatory elements controlling expression in both the third segment of antenna and maxillary palps. The results suggest that the conservation of the expression pattern in genes that evolved from a common ancestor may not be accompanied by preservation of the corresponding cis-regulatory elements.

The esterase 5B (Est-5B) gene of Drosophila pseudoobscura and the esterase 6 (Est-6) gene of Drosophila melanogaster derive from a common ancestor and likely are orthologous (1). This hypothesis is based on the observations that Est-5B and Est-6 produce homologous products although with differing tissue specificity (2, 3). Both loci code for two major transcripts in adults. The coding regions of Est-5B and Est-6 are about 80% similar at both the nucleotide and protein levels (3). In contrast, the 5′ and 3′ flanking regions of these genes have very little similarity.
Only about 170–200 bp of the upstream sequences of Est-5B and Est-6 can be aligned to determine similarity, and this alignment requires numerous substitutions and deletions/insertions, indicating that these regions are quite different. Sequences more than 200 bp 5′ of the start site of these genes show only occasional similarities. These similarities are typically no longer than 12 bp and occur at different distances from the AUG codon and sometimes in different orientations (1). It is apparent that since the divergence of Est-5B and Est-6, DNA upstream of the coding regions of these two genes has changed to a significantly greater degree than the coding regions of the two genes.
§To whom reprint requests should be addressed. e-mail: Rrichmond@sunysb.edu.
© 1997 by The National Academy of Sciences 0027-8424/97/947735-7$2.00/0 PNAS is available online at http://www.pnas.org.

MATERIALS AND METHODS

P Element-Mediated DNA Transformation. P element-mediated germ-line transformation was carried out using
Colloquium Paper: Tamarina et al.
methods described elsewhere (4). Approximately 200 transformed stocks were checked by Southern blotting to estimate the number of insertions per genome. Southern blotting with digoxigenin-labeled probes was performed as described by Maniatis et al. (5) and according to the instructions provided by the manufacturer of the Genius Kit (Boehringer Mannheim). The frequency of double and triple insertions was less than 10% of the total. All transformed lines analyzed contained only one insertion per stock, and the length of the inserted Est-5B upstream region was not altered as determined by Southern blotting analyses (data not shown). Histochemical Staining. The procedures used for histochemical staining have been described (4). We made the following modifications to our previously published procedures: the fixing and staining solutions contained 10 mM MgCl2; 1% phosphoric acid was used to adjust the pH of working solutions instead of citric acid; samples were stained for 20 hr; in the majority of experiments, 3- to 8-day-old adults were assayed, although for some transformed stocks, adults younger than 2 days old were used; all histochemical results were treated as qualitative data.
Constructs. A 472-bp HindIII–AflIII fragment from the D. pseudoobscura Est-5B locus (1), containing the Est-5 5′ flanking region from 0 to −472 relative to the AUG codon, was subcloned into pGEM3 (Fig. 1). Subsequently the HindIII site of this plasmid was converted into an EcoRI site, and the AflIII site, after treatment with mung bean nuclease, into a BamHI site. This construct is termed 5B-472. Deletion constructs were obtained from 5B-472 by using convenient restriction sites (3). The 10 restriction fragments of the 472-bp Est-5B promoter region were inserted between the EcoRI and BamHI sites of the P element-based vector CaSpeR-AUG-β-galactosidase (6). Construct 5B-1224 (Fig. 1) was prepared by inserting a genomic fragment BglII(−1224)–SalI(−58) between the EcoRI and SalI sites of plasmid 5B-472 in pGEM3. This construction required the conversion of the BglII site in the fragment into an EcoRI site. The resulting insert was subcloned into pCaSpeR using EcoRI and BamHI sites. Construct 5B-5100 was prepared by subcloning a genomic fragment, EcoRI(−5100)–SalI(−58), between the EcoRI and SalI sites of plasmid 5B-472 in pGEM3 in which the BamHI site
FIG. 1. Sites of expression of Est-5B constructs when transformed into D. melanogaster. (A) Schematic organization of the carboxylesterase loci of D. pseudoobscura showing the extent of three of the constructs used in transformation experiments. Shaded boxes show the extent of the transcribed regions of the gene. All Est-5B genes have a short intron shown as a break in the shaded box in the 3′ half of the gene. (B) (Left) Diagram of the constructs used to transform D. melanogaster lines. (Right) Total number of lines transformed with each construct and the number of lines showing expression of each construct in the adult tissues studied.
had been converted into a KpnI site. The resulting insert was subcloned into pCaSpeR using EcoRI and KpnI sites. All subcloning procedures were performed according to the procedures of Maniatis et al. (5).
RESULTS

We studied the location of elements regulating the expression of the Est-6 and Est-5B genes of D. melanogaster and D. pseudoobscura, respectively, by preparing constructs of the 5′ flanking sequences of these genes and inserting them 5′ to the lacZ gene in the CaSpeR-AUG-β-galactosidase transformation vector (6). Figs. 1 and 2 provide diagrams of the genomic regions of the two esterase genes and show the number and extent of the transformation constructs we studied. DNA fragments of 5,100 and 1,224 bp from the Est-5B gene (Fig. 1) or 1,150 bp from the Est-6 gene (see Figs. 2 and 4) that cover the regions 5′ of the AUG codon of both genes were used as the bases for constructing transformation vectors. These DNA fragments presumably contain most of the 5′ regulatory sequences of both genes (4). The constructs prepared were used to generate stable transformed lines of D. melanogaster by the method of P element-mediated, germ-line transformation. The expression of the lacZ reporter gene in transformed lines was studied by in situ analysis of Escherichia coli β-galactosidase activity in tissues from adult flies. The number of transformed lines expressing β-galactosidase in the tissues analyzed is summarized in Figs. 1B and 2B and interpreted in Fig. 4.

β-Galactosidase Expression Controlled by the Est-5B Constructs. Transformed lines carrying the Est-5B/lacZ gene with 5,100 bp or 1,224 bp of Est-5B 5′ DNA (5B-5100 and 5B-1224, Fig. 1) expressed β-galactosidase activity in antennae, maxillary palps (Fig. 3 B and E), fat bodies, and salivary glands in adult flies of both sexes and also in nurse cells of females and in testes and vasa deferentia of males (data not shown). In half of the transformed lines, expression in male accessory glands was observed (Fig. 1B). The major difference between the two larger Est-5B/lacZ constructs was the presence in 5B-1224 transformants of additional sites of expression that differed among independently generated transformed lines with insertions in different chromosomes and genomic locations (as judged by genetic and Southern blot analyses). These additional sites of expression included the tracheae, the ejaculatory bulb (Fig. 1), and many other specific locations in different tissues (data not shown). Such variation also was characteristic of all constructs shorter than 5B-1224. We attribute this variation in the sites of reporter-gene expression in shorter constructs to the influence of different adjacent loci at the different sites of insertion (7). These data suggest that in the region from −1,224 to −5,100 bp there are sequences that obviate the position effect, thus eliminating variation in the tissue specificity of expression.

Deletion analysis (Fig. 1B) suggests that regulatory elements essential for the spatial pattern of expression of Est-5B/lacZ genes in antennae, maxillary palps, salivary glands, fat bodies, testes, vasa deferentia, and male accessory glands are located within 1,224 bp of Est-5B upstream DNA. Positive regulatory elements controlling the expression of Est-5B in nurse cells, male accessory glands, salivary glands, and fat bodies probably are situated in the region between nucleotides −1,224 and −472. The region between nucleotides −403 and −136 contains elements controlling the expression in antennae and maxillary palps. Positive elements controlling β-galactosidase expression in testes and vasa deferentia presumably are located in the region between nucleotides 136 and −58. Some of the Est-5B deletion constructs directed the expression of β-galactosidase in the thoracic muscles of flies of both sexes (Fig. 1B). This phenotype was not seen in flies transformed with larger constructs. Our data suggest that the region between −177 bp and −237 bp contains elements that reduce the expression of β-galactosidase in thoracic muscles, while the positive control elements are located in the region between nucleotides −136 and −58.
Expression of β-Galactosidase in Young Flies Carrying Est-5B Constructs. Expression of the Est-5B/lacZ genes in young flies was studied in lines transformed with constructs 5B-472, 5B-237, 5B-177, and 5B-136. In recently emerged flies (<1 day old), the reporter gene is expressed in the hypoderm of head, thorax, and abdomen (data not shown). This expression pattern disappears after 1 day of age. A comparison of the
FIG. 2. Sites of expression of Est-6 constructs when transformed into D. melanogaster. (A) Schematic diagram of the genomic organization of carboxylesterase loci of D. melanogaster. Shaded boxes show the extent of the transcribed regions of the gene. Est-6 genes have a short intron shown as a break in the shaded box in the 3′ half of the gene. (B) (Left) Diagram of the constructs used to transform D. melanogaster lines. (Right) Total number of lines transformed with each construct and the number of lines showing expression of each construct in the adult tissues studied.
FIG. 3. β-Galactosidase activity in antennae and in maxillary palps of D. melanogaster transformed with Est-6 and Est-5B deletion constructs. See Figs. 1 and 2 for a description of the nomenclature and extent of each deletion. (A) 6-1132, antenna. (B) 5B-5100, antenna. (C) 5B-136, antenna. (D) 6-1132, maxillary palp. (E) 5B-5100, maxillary palp. (F) 5B-136, maxillary palp.
pattern of expression in young flies in the four constructs studied suggests that construct 5B-136 contains elements that are responsible for the expression in hypoderm of young flies. These results are reminiscent of changes in the expression of loci involved in Drosophila olfaction that were studied by Riesgo-Escovar et al. (8).

β-Galactosidase Expression Controlled by Est-6 Constructs. Flies transformed with construct 6-1132 containing 1,150 bp of Est-6 upstream sequences (Fig. 2) expressed β-galactosidase in the antennae (Fig. 3A), maxillary palps (Fig. 3D), tracheae, and air sacs of head and thorax, in salivary glands of flies of both sexes, and in male ejaculatory bulb and anterior ejaculatory duct. These results confirm our previous studies (4). Fig. 2B shows the deletion constructs used to localize elements regulating Est-6 gene expression. Fig. 4 summarizes the results of experiments reported here and in our previous work (4). We have localized four independently acting Est-6 regulatory regions that direct expression of lacZ in (i) the ejaculatory duct, (ii) the adult salivary glands, (iii) the antennae, the maxillary palps, and respiratory system, and (iv) the ejaculatory bulb (Figs. 2 and 4).

Control of Conservative Modes of Expression of the Est-6/lacZ and Est-5B/lacZ Genes. The similarity in the patterns of expression of Est-6/lacZ and Est-5B/lacZ in antennae, maxillary palps, and salivary glands provides an opportunity to
compare the mechanisms that regulate these conservative modes of expression for each gene. We describe in detail below the localization of the putative regulatory sequences that control the expression of the two esterase genes in similar tissue locations.

Antennae. The pattern of β-galactosidase expression in the antenna of Est-6/lacZ transformants is shown in Fig. 3A. The positive regulatory element that presumably controls the development of this pattern in antennae was previously localized to the region between nucleotides −328 and −233 upstream of the Est-6 transcription start site (4). Here we localize this element more precisely. Deletion 6ID-2 (Fig. 2B) eliminates reporter-gene expression in the antennae, thus allowing us to locate the positive regulatory elements controlling expression in this organ to the region −328 bp to −266 bp upstream from the transcription start site. Fig. 3B shows the basic staining pattern in antenna that is characteristic of the larger Est-5B/lacZ constructs, 5B-5100, 5B-1224, and 5B-472. In these transformed lines, β-galactosidase is expressed in much of the third antennal segment. Particularly intense staining is seen in certain regions such as the sacculus (pit organ, Fig. 3B) of the third segment. This basic staining pattern resembles the pattern observed in lines transformed with the Est-6 gene constructs (Fig. 3A), although it is more extensive. Reporter-gene expression in the second antennal segment, as was routinely observed for the Est-6 gene
FIG. 4. Diagram summarizing the location of the putative regulatory elements in the 5′ regions of the Est-5B and Est-6 genes of D. pseudoobscura and D. melanogaster. The positions of both positive (+) and negative (−) regulatory elements are shown.
constructs (Fig. 3A), was detected only with constructs 5B-237, 5ID-2, and some lines of 5ID-4. Deletion analysis of the Est-5B promoter region suggests that there are two or more elements that control the basic pattern of expression in antennae. The basic pattern (Fig. 3B) was eliminated completely by deleting the sequences between nucleotides −402 and −136 (5ID-5 and 5B-136, Fig. 1B). Thus, the region between −402 and −136 likely contains all the necessary regulatory elements for the control of reporter-gene expression in the antennae in Est-5B. Deletion construct 5B-177 produces flies with the basic pattern of antennal expression. This result suggests that there is a positive regulatory element between nucleotides −136 and −177. However, if nucleotides between −136 and −177 are removed as in deletion 5ID-2, flies transformed with this construct still show the basic pattern of expression in 90% of the transformed lines. This result suggests that the region between −402 and −177 likely contains another element that is capable of producing the basic pattern of antennal expression for the Est-5B gene. Yet another pattern of expression in antennae (Fig. 3C, not included in Fig. 1B) is apparent when elements controlling the basic expression pattern (Fig. 3B) are removed in deletion constructs 5ID-5 and 5B-136. Transformed lines carrying these constructs exhibit staining of the third antennal segment in a pattern reminiscent of the distribution of major nerve branches in the antennae (9). The resolution of the staining patterns exemplified in Fig. 3C did not permit us to determine which cell types were actually involved. A similar pattern of antennal expression was found in some transformed lines carrying the short Est-6/lacZ constructs, 6-201 and 6-92 (data not shown).

Maxillary Palps. The expression of β-galactosidase activity in the maxillary palps of flies transformed with Est-6/lacZ or
Est-5B/lacZ is shown in Fig. 3 D and E, respectively. The reporter gene appears to be expressed primarily in cells near or on the surface of the palps. Obvious differences in the pattern of expression are apparent between the lines transformed with the Est-6 and Est-5B constructs. Transformation with the Est-6/lacZ constructs results in light staining in the proximal half of each maxilla (Fig. 3D). Transformation with the Est-5B/lacZ constructs results in intense staining of most of the surface of the maxilla (Fig. 3E). A positive element that presumably controls the expression of the Est-6/lacZ gene in the maxillary palps previously was located in the region −328 to −233 (4). Deletion 6ID-2 (Fig. 2B) of the Est-6 upstream region eliminates β-galactosidase expression in the maxillary palps. Thus the positive regulatory elements controlling β-galactosidase expression in this organ most probably lie within the region −328 to −266. This is the same general region controlling antennal expression in D. melanogaster (Fig. 4). In flies transformed with deletion constructs of the Est-5B/lacZ gene, the reporter gene was expressed coordinately in both the maxillary palps and antennae (Figs. 1 and 4). This suggests that expression in these two organs may be controlled by the same regulatory elements located at nucleotides −403 to −181 and −181 to 136. However, expression in maxillary palps was observed in only half of the lines transformed with deletion construct 5ID-1 (Fig. 1B), suggesting that region −136 to −58 contains sequences specifically contributing to the maxillary expression. An alternative pattern of maxillary staining (Fig. 3F, not included in Fig. 1B) similar to that observed in antennae (Fig. 3C) was seen in some lines transformed with Est-5B/lacZ constructs 5B-136 and 5ID-1. β-Galactosidase expression in
the palps of these lines is observed along the passage of the main nerve branch (9). Lines transformed with Est-6 constructs did not show this pattern of maxillary expression.
Variation in Staining Patterns in Antennae and Maxillary Palps. The appearance of staining patterns in antennae and maxillary palps in flies transformed with Est-6 or Est-5B constructs did vary slightly in individual flies from the same stock. Some flies from the same line produced less intense staining than others. The details of the staining pattern also were variable in flies of the same stock and occasionally in two antennae from the same fly. In our interpretations of these data, we have attempted to use the most common or average pattern.
Salivary Glands. We have shown previously (4) that the positive regulatory element(s) responsible for the expression of esterase 6 enzyme in the salivary glands of D. melanogaster is located in the region between nucleotide positions −402 and −328 of the Est-6 upstream sequence. We identified possible negative elements in the regions −328 to −72 and −1,132 to −511 (4). Here we studied two additional Est-6 constructs 6–351 and ID-2 (Fig. 2B). Flies transformed with these constructs do not express β-galactosidase in the salivary glands. Thus we conclude that the putative positive regulatory element most likely spans the nucleotide position −351 because the disruption of the promoter sequence at this site apparently leads to inactivation of the element. The 5B-5100 and 5B-1224 constructs of the Est-5B gene of D. pseudoobscura direct the expression of β-galactosidase in the salivary glands in 100% of the independently transformed lines (Fig. 1). Constructs containing 472 or fewer bp of the most proximal, 5′ Est-5B DNA direct the expression of the reporter gene in only 7% to 18% of the independently transformed lines. This sporadic pattern of expression could be due to a position effect. It follows that the positive element of the Est-5B gene that controls the expression of the reporter gene in the salivary glands probably is located between nucleotides −472 and −1,224 upstream from the initiation codon (Fig. 4).
DISCUSSION
Interspecific gene transfer was used to compare the action of cis-regulatory sequences of homologous esterase genes in a common genetic background. This approach follows a strategy suggested by Cavener and Dickinson (10, 11) and allows us to study variation in the expression of two related genes and to identify conserved, cis-acting regulatory elements. The Est-5B/lacZ constructs directed expression of β-galactosidase in accessory glands, parts of testes and vasa deferentia in males, nurse cells in females, third segment of antennae, maxillary palps, salivary glands, and fat bodies in flies of both sexes. This expression pattern reflects features of the normal spatial expression of Est-5B in D. pseudoobscura. Our data are consistent with previous results showing expression of Est-5B of D. pseudoobscura in testes, accessory glands, ovaries, and antenna, and the lack of expression in ejaculatory duct and ejaculatory bulb (1, 3, 4). However, the tissue-specificity of expression of Est-5B/lacZ constructs in transformed flies should be confirmed by studies of Est-5B wild-type expression in D. pseudoobscura by methods such as in situ hybridization of mRNA because we cannot rule out the possibility that some D. melanogaster tissues lacked required, trans-acting factors for Est-5B expression. For example, we did not see any expression of our Est-5B/lacZ constructs in eyes, although expression in this organ was reported by Brady et al. (3). Those authors transformed D. melanogaster with an Est-5B gene that contained 472 bp of upstream sequences, the coding region, its intron, and 1,200 bp of the 3′ flanking sequences. It is possible that regulatory elements specifying expression in eyes are located in the 3′ flanking, noncoding region of the Est-5B gene. Healy et al. (12) report that there are elements 3′ of the Est-6
structural gene in D. melanogaster that modify the expression of this locus in transformed lines.
Our results show that many changes have occurred in the information content of the 5′ regulatory regions of the Est-6 and Est-5B genes during evolution. Both Est-5B and Est-6 contain numerous cis-regulatory elements within 5′-flanking regions of comparable length (1,150 bp for Est-6 and 1,224 bp for Est-5B) (12). There are pronounced differences in the expression of the two lacZ constructs containing these regions. The sites of β-galactosidase expression specific for the Est-6/lacZ construct were the trachea and air sacs of head and thorax, the prefrons in young flies, the ejaculatory bulb, and the anterior ejaculatory duct (4). We have not detected Est-6/lacZ expression in accessory glands, testes, vasa deferentia, nurse cells, or fat bodies. Both genes, however, directed expression of the reporter gene in several common tissue locations (salivary glands, maxillary palps, and the third segment of antenna). The data of Healy and her colleagues (12) suggest that another conserved site of expression is the hemolymph of both larvae and adults. Our experimental procedures did not allow us to detect expression of our constructs in hemolymph. All of these observations lead to the conclusion that selection may have acted to retain these common patterns of expression even though the underlying sequence has changed dramatically.
We have localized regions capable of directing the expression of the lacZ reporter gene in two of the three sites common to both D. melanogaster and D. pseudoobscura, the maxillary palps and the third segment of antenna. The putative Est-6 regulatory region is located between −265 and −328 (Fig. 4). The same function is apparently performed by two separate regions located between −136 and −181 and between −181 and −403 for the Est-5B gene (Fig. 4).
We compared the sequences of the three regions that are capable of directing reporter gene expression in the maxillary palps and the third segment of antenna to determine if any sequence similarities could be detected. The larger region in D. pseudoobscura (−403 to −181) has a single 9-bp sequence, CAAATATTT (−373 to −365), that is found in the corresponding Est-6 region at −278 to −270. This element is not found elsewhere in either the Est-5B or the Est-6 5′ region. An identical element is located in approximately the same position 5′ of the Est-6 loci in D. simulans (−241 to −249) and D. mauritiana (−277 to −269). A second smaller element, AAATCT, is found at position −175 to −170 in D. pseudoobscura and at two positions in D. melanogaster, −508 to −503 and −298 to −293. The latter site overlaps the region we have identified as containing positive regulatory elements for the maxillary palps and the antennae. This element also is found in the same relative positions in both D. simulans and D. mauritiana. The conservation of these elements and their positions relative to the start site of the Est-6 gene in the four species for which data are available suggests that they may be responsible for the coordinated expression of esterase 6 in the antennae and maxillary palps. The function of these regulatory elements may be a result of the action of several different trans-acting transcription factors that are capable of specifying expression in both antennae and maxillary palps. Indeed, the spatial patterns of the expression produced by the melanogaster and pseudoobscura genes in these organs (see Fig. 3) are similar, but not identical. This suggests that these patterns might be composed of expression in different or overlapping cell populations. Thus evolution may have acted to conserve a spatial pattern of expression using very different regulatory mechanisms. Similar patterns of evolution have been observed in the regulation of the eve locus in Drosophila (13).
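The kind of comparison described above can be illustrated with a short sketch. The text does not say how the sequence comparison was actually carried out, so this is only one simple way to do it: list the k-mers shared by two 5′ flanking sequences, then report where an exact motif occurs in coordinates relative to the start site. The function names, the two toy sequences, and the convention that the last base of the flanking sequence is position −1 are all assumptions made for illustration; the toy sequences are not the real Est-6 or Est-5B sequences.

```python
def find_motif(flank, motif):
    """Locate exact copies of `motif` in a 5' flanking sequence.

    Coordinates are relative to the start site, with the last base of
    `flank` counted as -1 (an assumed convention for this sketch).
    Returns a list of (first, last) coordinate pairs.
    """
    flank, motif = flank.upper(), motif.upper()
    hits = []
    i = flank.find(motif)
    while i != -1:
        first = i - len(flank)
        hits.append((first, first + len(motif) - 1))
        i = flank.find(motif, i + 1)  # step by one so overlapping copies are found
    return hits


def shared_kmers(seq_a, seq_b, k):
    """Return the set of length-k substrings present in both sequences."""
    def kmers(seq):
        seq = seq.upper()
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return kmers(seq_a) & kmers(seq_b)


# Hypothetical toy sequences, NOT the published flanking regions.
toy_a = "GGCAAATATTTCCAAATCTGG"
toy_b = "TTCAAATATTTGACG"

common = shared_kmers(toy_a, toy_b, 9)
for motif in sorted(common):
    print(motif, find_motif(toy_a, motif), find_motif(toy_b, motif))
```

Exact matching is the simplest choice here. Detecting the imperfect consensus sequences mentioned in the Discussion would need a mismatch-tolerant search instead.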
It is also possible that these esterase regulatory elements may consist of multiple short consensus sequences (putative transcription factor binding sites) with imperfect similarities.
Enhancers with this kind of structure have been described for many genes (14). Healy et al. (12) have identified putative 5′ regulatory elements for Est-6 and related loci in several species that appear to have some of these characteristics. In addition to the antennae and maxillary palp sites identified in this work, Healy et al. (12) have located putative ejaculatory bulb and hemolymph sites in D. pseudoobscura. Our procedures did not allow us to detect expression in the hemolymph, but occasional lines did show expression in the ejaculatory bulb (Fig. 1), suggesting that the element identified by Healy and colleagues may function when other regulatory regions are not present.
The elements essential for the expression of Est-6 in D. melanogaster in both antennae and maxillae lie within 400 bp immediately 5′ of the Est-6 coding sequence (Fig. 4). This region of the genome has remarkably low levels of polymorphism and divergence within several species of the melanogaster subgroup (15). Brady et al. (3) found that the 174 bp immediately 5′ of the start site in Est-5B was the most highly conserved 5′ region between D. melanogaster and D. pseudoobscura. Karotam et al. (15) argue that strong directional selection or functional constraint would be necessary to retain such a low level of sequence variation.
The conservation of esterase expression in the antennae and the maxillary palps suggests that esterase 6 may have a role in olfaction. Vogt and Riddiford (16) identified esterases in the antennae of silk moths, Antheraea polyphemus, that were capable of degrading the female sex pheromone of this species. Carlson and his colleagues (8, 17) have demonstrated that the antennae and maxillary palps of D. melanogaster both are olfactory organs and that a single gene affects the response of these organs to a variety of odorants.
The effect of the Est-6 null allele on several mating behaviors (2) and our previous demonstration (18) that esterase 6 is capable of hydrolyzing a Drosophila pheromone suggest that an investigation of the effect of Est-6 alleles on the olfactory responsiveness of antennae and maxillary palps would be fruitful.
Developmental geneticists have proposed that major evolutionary changes in morphology can be the result of a few changes in homeotic genes that can affect the development of body segments (19). Population genetic analysis of the eve gene, which controls segmentation in arthropods and vertebrates (13), supports a neo-Darwinian model that postulates that major changes in morphology are the result of numerous genetic changes both in structural and regulatory genes. Our analyses of the regulation of the Est-6 locus in Drosophila
support the neo-Darwinian model for the evolution of gene regulation. While the expression of the Est-6 homologue in D. pseudoobscura is markedly different from that seen in the D. melanogaster group, a number of small changes in elements regulating the expression of these loci can readily account for these differences.
We thank Susan Brandon for technical assistance, T. Kozlova and Bruce Cochrane for many useful discussions, and Marion Healy for sharing unpublished data. This work was supported by a National Science Foundation grant to R.C.R.
1. Brady, J. P. & Richmond, R. C. (1992) J. Mol. Evol. 34, 506–521.
2. Richmond, R. C., Nielsen, K. M., Brady, J. P. & Snella, E. M. (1990) in Ecological and Evolutionary Genetics, eds. Baker, J. S. F., Starmer, T. W. & MacIntyre, R. J. (Plenum, New York), pp. 273–292.
3. Brady, J. P., Richmond, R. C. & Oakeshott, J. G. (1990) Mol. Biol. Evol. 7, 525–537.
4. Ludwig, M. Z., Tamarina, N. A. & Richmond, R. C. (1993) Proc. Natl. Acad. Sci. USA 90, 6233–6237.
5. Maniatis, T., Frisch, D. F. & Sambrook, J. (1982) Molecular Cloning: A Laboratory Manual (Cold Spring Harbor Lab. Press, Plainview, NY).
6. Thummel, C. S., Boulet, A. M. & Lipshitz, H. D. (1988) Gene 74, 445–456.
7. Wilson, C., Bellen, M. J. & Gehring, W. J. (1990) Annu. Rev. Cell Biol. 6, 679–714.
8. Riesgo-Escovar, J., Woodward, C., Gaines, P. & Carlson, J. (1992) J. Neurobiol. 8, 947–964.
9. Demerc, M., ed. (1965) Biology of Drosophila (Hafner, New York).
10. Cavener, D. R. (1992) BioEssays 14, 237–244.
11. Dickinson, W. J. (1988) BioEssays 8, 204–208.
12. Healy, M. J., Dumancic, M. M., Cao, A. & Oakeshott, J. G. (1996) Mol. Biol. Evol. 13, 784–797.
13. Kreitman, M. & Ludwig, M. (1996) Semin. Cell Dev. Biol. 7, 583–592.
14. Stanojevic, D. T., Hoey, T. & Levine, M. (1989) Nature (London) 341, 331–335.
15. Karotam, J., Delves, A. C. & Oakeshott, J. G. (1993) Genetica 88, 11–28.
16. Vogt, R. G. & Riddiford, L. M. (1981) Nature (London) 293, 161–163.
17. Ayer, R. K. & Carlson, J. (1992) J. Neurobiol. 23, 965–982.
18. Mane, S. D., Tompkins, L. & Richmond, R. C. (1983) Science 222, 419–421.
19. Palopoli, M. F. & Patel, N. H. (1996) Curr. Opin. Genet. Dev. 6, 502–508.
Proc. Natl. Acad. Sci. USA Vol. 94, pp. 7742–7747, July 1997 Colloquium Paper
This paper was presented at a colloquium entitled ‘‘Genetics and the Origin of Species,’’ organized by Francisco J. Ayala (Co-chair) and Walter M. Fitch (Co-chair), held January 30–February 1, 1997, at the National Academy of Sciences Beckman Center in Irvine, CA.
A demographic approach to selection
(demographic genetics/natural selection/Drosophila pseudoobscura)
WYATT W. ANDERSON*† AND TAKAO K. WATANABE‡
*Genetics Department, University of Georgia, Athens, GA 30602-7223; and ‡Department of Applied Biology, Kyoto Institute of Technology, Kyoto 606, Japan
Abbreviations: AR, Arrowhead; CH, Chiricahua; PP, Pikes Peak; ST, Standard.
†To whom reprint requests should be addressed.
© 1997 by The National Academy of Sciences 0027-8424/97/947742-6$2.00/0 PNAS is available online at http://www.pnas.org.
ABSTRACT The concepts of demography provide a means of combining the ecological approach to population growth with the genetical approach to natural selection. We have utilized the demographic theory of natural selection developed by Norton and Charlesworth to analyze life history schedules of births and deaths for populations of genotypes in Drosophila pseudoobscura. Our populations illustrate a stable genetic equilibrium, an unstable genetic equilibrium, and a case of no equilibrium. We have estimated population growth rates and Darwinian fitnesses of the genotypes and have explored the role of population growth in determining natural selection. The age-specific rates of births and deaths provide insights into components of selection. Both viability and fertility are important components in our populations.
Genetics and the Origin of Species (1) became the cornerstone of the modern synthesis of evolutionary biology because it showed how genetics fits into the processes of evolution, filling the major gap in Darwin’s theory. Dobzhansky was strictly Darwinian in his approach to evolution, and he summarized the available research on genetic changes under natural selection early in the first edition of his book. Although there had been many studies of natural selection by the middle of the 1930s, there had been relatively few studies showing how selection had changed the genetic composition of populations. Dobzhansky’s own pioneering studies on selection in natural and experimental populations of Drosophila were some years away. Each succeeding edition (the second in 1941 and third in 1951) of Genetics and the Origin of Species, as well as the successor book of 1970, Genetics of the Evolutionary Process (2), had an expanded coverage of natural selection as more studies of genetic changes under selection were published.
One theme that emerges in these successive editions is an increasing attention to the concept of Darwinian fitness, of the ways in which genotypic differences in survival and reproductive success bring about selective changes in gene frequencies. Late in his life, Dobzhansky summarized his views on Darwinian fitness in an essay, ‘‘On Some Fundamental Concepts of Darwinian Biology’’ (3), that formed the basis of his discussion of this topic in Genetics of the Evolutionary Process. In particular, he turned to a demographic analysis to relate the Darwinian fitness of genotypes to ecological fitness parameters at the population level, such as the intrinsic rate of increase, r. He proposed that r measured the adaptedness of populations, by which he meant their state of adaptation for population growth. A series of experimental studies by Dobzhansky and his colleagues (4–7) and Ayala and colleagues (8–10) explored this demographic approach to natural selection. One set of studies compared r values measured in populations that were genetically monomorphic or polymorphic for chromosomal inversions. Dobzhansky (see refs. 2 and 11 for summaries) had shown earlier that these inversions were under powerful selection in both nature and the laboratory. Later studies by Mueller and Ayala (12, 13) have explored the use of population growth rates as fitness measures and clarified the mechanisms by which population density affects natural selection (14, 15).
The 1970s brought not only Genetics of the Evolutionary Process but also the beginnings of a new stage in the demographic analysis of natural selection. Anderson and King (16) used life history tables of longevity and fecundity to represent the fitness of genotypes in computer simulations of selection. Cavalli-Sforza and Bodmer (17) employed what they termed ‘‘genetic demography’’ to analyze the fitnesses of human genotypes. Brian Charlesworth, beginning in 1970 (18), combined an analytical study of models with computer simulations to develop a comprehensive theory of selection in age-structured populations. His demographic theory brings together the approaches to population processes that had been, for the most part, treated separately by population ecologists and population geneticists. Its strength lies in joining the ecology and genetics of populations in the way envisioned in the previous decade by Dobzhansky. Charlesworth (19) included in his studies the evolution of life histories, continuing the development of a central topic in evolutionary ecology. Life history strategies (20) and the evolution of senescence (21, 22) are two related areas that have attracted considerable attention.
Charlesworth (18) rediscovered the basic demographic theory of natural selection that had been formulated nearly a half-century earlier by the British mathematician H. T. J. Norton (23). It is interesting to note in passing that Norton had been one of the first people to calculate gene frequency changes under simple models of selection (24). Charlesworth (see ref. 19) went on to greatly expand and extend this demographic theory. The basic Norton–Charlesworth theory establishes the conditions for genetic equilibrium at a single locus when the viability and fertility of genotypes are described by life history tables of longevities and fecundities at all ages, and it shows how the genotypic fitnesses at equilibrium are defined. For the genotype with alleles $A_i$ and $A_j$, the longevities $l_{ij}(x)$ and fecundities $m_{ij}(x)$ at age x define an $r_{ij}$ that is the intrinsic rate of increase of a population whose members all have this genotype’s life history schedule of births and deaths. This $r_{ij}$ is calculated as the root of the demographic equation $\sum_x e^{-r_{ij}x}\, l_{ij}(x)\, m_{ij}(x) = 1$. For a single locus with two alleles, there will be a stable genetic equilibrium if and only if $r_{11} < r_{12} > r_{22}$. The equilibrium gene frequency, however, is determined by the demographic functions $W_{ij} = \sum_x e^{-rx}\, l_{ij}(x)\, m_{ij}(x)$, where r is the intrinsic rate of increase of the entire Mendelian population at genetic equilibrium. These $W_{ij}$ values serve as Darwinian fitnesses of the genotypes in determining equilibrium gene frequencies, as in the discrete-generation model: $p_1 = (W_{12} - W_{22})/(2W_{12} - W_{11} - W_{22})$. For a stable genetic polymorphism, r is calculated as the root of the equation $(W_{12} - 1)^2 = (W_{11} - 1)(W_{22} - 1)$.
There have been only a few experimental determinations of genotypic life histories to which this basic demographic theory of selection has been applied. Charlesworth (19) cites only the study of genotypes in Tribolium castaneum by Moffa and Costantino (25). We have obtained life history schedules of longevity and fecundity for three sets of genotypes in D. pseudoobscura, and we present below an analysis of selection on them.
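As a rough illustration of the first step of this theory, the sketch below solves the demographic equation for a genotype's intrinsic rate of increase and then applies the condition on r11, r12, and r22. It uses bisection, which is an assumption of convenience (the text does not say which root-finding method was used for the genotypic rates). The function names, the search bracket, and the three toy schedules are hypothetical and are not data from this paper. The "unstable" label for the case where r12 is the lowest of the three follows from the text's statement that no equilibrium exists when r12 is intermediate.

```python
import math


def euler_lotka_sum(r, ages, l, m):
    """Return the sum over ages x of exp(-r*x) * l(x) * m(x)."""
    return sum(math.exp(-r * x) * lx * mx for x, lx, mx in zip(ages, l, m))


def intrinsic_rate(ages, l, m, lo=-1.0, hi=2.0, tol=1e-12):
    """Solve sum_x exp(-r*x) l(x) m(x) = 1 for r by bisection.

    For positive ages the sum falls monotonically as r rises, so there is
    at most one root. The bracket [lo, hi] is an assumed search range,
    not a value taken from the paper.
    """
    if euler_lotka_sum(lo, ages, l, m) < 1.0 or euler_lotka_sum(hi, ages, l, m) > 1.0:
        raise ValueError("root is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if euler_lotka_sum(mid, ages, l, m) > 1.0:
            lo = mid  # sum still too large, so the root lies at a larger r
        else:
            hi = mid
    return 0.5 * (lo + hi)


def equilibrium_type(r11, r12, r22):
    """Classify the outcome from the three genotypic rates of increase."""
    if r12 > r11 and r12 > r22:
        return "stable"
    if r12 < r11 and r12 < r22:
        return "unstable"
    return "none"  # r12 intermediate: no genetic equilibrium


# Hypothetical three-age schedules, NOT the published life tables.
ages = [1, 2, 3]
m = [2.0, 2.0, 1.0]          # fecundity at each age (shared by all genotypes)
l11 = [0.5, 0.4, 0.3]        # survival to each age, genotype A1A1
l12 = [0.6, 0.5, 0.4]        # heterozygote survives best in this toy case
l22 = [0.45, 0.35, 0.25]

r11, r12, r22 = (intrinsic_rate(ages, l, m) for l in (l11, l12, l22))
print(equilibrium_type(r11, r12, r22))
```

In this toy case the heterozygote has the highest survival at every age, so its rate of increase is the largest and the classification comes out as a stable equilibrium.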
MATERIALS AND METHODS
Chromosomal variants of D. pseudoobscura were chosen as a realistic and practical genetic system for demographic analysis. Natural populations of this species contain an array of inversions on the third chromosome. These inversions segregate as units, just as if they were alleles at a single locus, because crossing over is effectively suppressed within the inverted regions in heterozygotes for the inversions (2). Indeed, crossing over is nearly eliminated over the entire third chromosome in many combinations. The inversions contain linkage blocks of genes much like those thought on theoretical grounds to be the true units of selection (26). Frequencies of these inversions are regulated by selection in nature, and some of these changes can be reproduced in laboratory populations (2, 11). Selection is often intense, as we should expect for genetic units comprising a 10th of the genome. Differences in selection are consequently easier to measure, and the values obtained are more accurate. Four inversions were used in these experiments: Arrowhead (AR), Chiricahua (CH), Pikes Peak (PP), and Standard (ST).
AR/AR, AR/CH, and CH/CH Under Nearly Optimal Conditions. The first set of data comes from a study by Nickerson and Druger (27). These authors extracted seven AR and seven CH chromosomes from a population cage begun with chromosomes collected at Pinon Flats, Mount San Jacinto, CA. These strains were intercrossed, both within each inversion type and between the two, to yield the AR/AR, AR/CH, and CH/CH flies used in the experiment. For each chromosomal genotype, or karyotype, 10 replicate groups of 5 females and 8 males were placed in glass vials, each of which contained a spoon of blackened, yeasted food. The spoons were replaced with fresh ones every 24 hr, at which time dead females were counted and removed. The eggs on each spoon were counted, and fecundity was recorded as eggs/female/day.
Longevity of preadult life stages was measured as egg-to-adult viability. For each karyotype, 20 groups of 50 eggs were placed in half-pint culture bottles and the number of adults emerging in each bottle was recorded. Longevity of adults was measured on groups of 25 females and 25 males in half-pint culture bottles; at 2-day intervals the flies were transferred to new bottles and the number of dead females was recorded. Twenty replicate bottles were studied for each karyotype. All tests were conducted at 25°C, under nearly optimal conditions. The experiment was continued for 58 days, until all fecundities dropped to zero. Nickerson and Druger did not record the average time spent in preadult life stages, but, fortunately, the development times of karyotypes from cage populations started with the same AR and CH chromosomes from Pinon Flats have been studied by others (6, 7).
AR/AR, AR/PP, and PP/PP Under Nearly Optimal Conditions. Ten strains of AR and 10 of PP from collections at Black Forest, 10 miles north of Colorado Springs, CO, were utilized. Marvin Wasserman isolated these chromosomes in 1970, and we conducted our experiment shortly thereafter, in 1971–1972. Crosses were made among all strains of each chromosomal type and between strains of the two types, so that no fly, whether a homokaryotype or a heterokaryotype, was homozygous for any one ancestral inversion. Longevity and fecundity were measured, as in ref. 7, simultaneously on groups of three females and three males in small, glass creamers, each containing approximately 5 ml of blackened food. A glass chimney plugged with cotton was taped to each creamer, and a drop of yeast solution was added before use. Twenty replicates were set up for each karyotype, and all cultures were kept at 25°C. Every 24 hr each group of flies was transferred to a fresh creamer, the number of dead flies of each sex was recorded, and all eggs in each creamer were counted. Measurements were discontinued after 35 days of adult life, when their effect on the parameters of growth and selection was small. Samples of 50 eggs were cultured in half-pint bottles to estimate development time and preadult viability; 16 replicate cultures were studied for each karyotype.
AR/AR, AR/ST, and ST/ST Under Harsh Conditions. Ten strains of AR and 10 of ST from collections at Mather, CA, in 1959 were employed for measurements of the life history functions l(x) and m(x) under conditions such as those a species might encounter in a harsh environment where population growth was severely restricted. The experiment was conducted exactly as for AR and PP, with two exceptions: five, rather than three, pairs of parents were placed in each creamer, and no yeast was added to the food medium. Thus, the flies suffered greater crowding and they were underfed, if not starved. The daily transfers to new containers did not permit much growth of the yeast and other microorganisms that were transferred on the flies or in their guts. The experiment was continued for 30 days of adult life.
ANALYSIS AND RESULTS
Analysis. The first step is the calculation of $r_{ij}$ for each karyotype, because the existence of an equilibrium depends on the relative sizes of these quantities. The $r_{ij}$ values are calculated from the life history schedules by numerical solution of the equation $\sum_x e^{-r_{ij}x}\, l_{ij}(x)\, m_{ij}(x) = 1$. If either a stable or unstable equilibrium is indicated, then the equilibrium equation for age-specific selection can be used to calculate the population growth rate r. This equation does not apply when no genetic equilibrium is indicated, that is, when $r_{12}$ is intermediate between $r_{11}$ and $r_{22}$. In this case the population will ultimately grow at the highest of the genotypic $r_{ij}$ values, and substituting this $r_{ij}$ into the formulas defining the $W_{ij}$ values provides a useful first approximation to the Darwinian fitnesses (19). The fitnesses are calculated as $W_{ij} = \sum_x e^{-r(D_{ij}+x)}\, l_{ij}(x)\, m_{ij}(x)$, for x varying from 1 to G. Here, x = age of adults in days, calculated from eclosion; G = maximal age of organisms in the experiment; and, for genotype $A_iA_j$, $D_{ij}$ = mean length of preadult life, $l_{ij}(x)$ = probability of survival from zygote to age x, and $m_{ij}(x)$ = fecundity as female eggs/female/day. The equilibrium equation is $Z = W_{12}^2 - W_{11}W_{22} - 2W_{12} + W_{11} + W_{22} = 0$, which is the expanded form of $(W_{12} - 1)^2 = (W_{11} - 1)(W_{22} - 1)$. Beginning with an initial estimate of the population growth rate, $r_0$, an improved estimate $r_1$ is obtained by Newton–Raphson iteration as $r_1 = r_0 - Z/(dZ/dr)$, with Z and its derivative evaluated at $r_0$. The formula is applied repeatedly to give successively improved estimates, and convergence is rapid. The Darwinian fitnesses are then computed by substituting the final estimate of r into the formulas for the $W_{ij}$ values. The net reproduction, or expected lifetime fecundity of a female zygote, is calculated as $R_{ij} = \sum_x l_{ij}(x)\, m_{ij}(x)$, for x varying from 1 to G. Finally, the equilibrium gene frequency, if it exists, is calculated as $p_1 = (W_{12} - W_{22})/(2W_{12} - W_{11} - W_{22})$.
AR/AR, AR/CH, and CH/CH.
Nickerson and Druger very kindly made their original data available to us, and it is
summarized in Table 1. Longevity is given as the probability that a female zygote will live to adult age x in days; $l_{ij}(0)$ is the preadult viability (of both sexes). The values of $l_{ij}(x)$ were obtained by multiplying the probability that adults survive from eclosion to age x by the preadult viability. Fecundity is given as female eggs/female/day. Fecundity was recorded every day, but longevity was measured every 2 days. Because the longevities changed slowly, we interpolated linearly to obtain values for days between counts. To make the tables manageable, data are given at 2-day intervals initially, and at longer intervals thereafter, but data for each day of life were used in the calculations. Five separate sets of measurements (6, 7) failed to disclose any significant differences among the karyotypes in development time. We have chosen an average value of 14.0 days from these data. The reproductive function V(x) = l(x)m(x) is given for each karyotype in Fig. 1. The area under each curve is the net reproduction, $R_{ij}$, or expected lifetime fecundity of a female at birth. The V(x) curves are triangular functions like those described for Drosophila and other insects (5).
AR/AR, AR/PP, and PP/PP. The fecundities and longevities of the karyotypes are given in Table 2. They are abbreviated as for the previous set of data, but again, all calculations were done with the daily values of longevity and fecundity. Longevities of only females are given, but those of males were nearly the same. Differences in development times were small and statistically nonsignificant. Preadult viabilities were moderately low; we confirmed these values in an independent experiment. The reproductive functions V(x) are shown in Fig. 2. Because the fecundities fluctuated somewhat, three-point moving averages are plotted to emphasize the main trends in the data. Again, the V(x) curves are roughly triangular.
The conditions of this experiment were nearly optimal and the lifetime fecundities were high; for example, R was 115 female eggs for the AR/AR females.
AR/AR, AR/ST, and ST/ST. Differences among the karyotypes in development time were small and statistically nonsignificant and, as for AR and PP, male and female longevities were alike. The longevities and fecundities are summarized, in the same manner as for the other sets of data, in Table 3.

Table 1. Life history data for karyotypes with AR and CH inversions

          AR/AR*           AR/CH†           CH/CH‡
  x     l(x)   m(x)      l(x)   m(x)      l(x)   m(x)
  0     .84     0.0      .83     0.0      .81     0.0
  2     .81     0.0      .83     0.0      .79     0.0
  4     .80     3.8      .83     4.4      .76     4.3
  6     .78    10.1      .82    10.7      .74     9.1
  8     .76    14.7      .82    13.3      .72    12.7
 10     .75    17.7      .81    16.2      .70    14.7
 12     .74    18.0      .80    19.4      .68    14.1
 14     .72    13.8      .78    20.2      .66    12.4
 16     .70    12.1      .76    17.6      .63    10.5
 18     .67    10.1      .75    16.7      .60     8.7
 20     .64     9.8      .74    14.2      .58     7.3
 22     .62     7.9      .72    11.5      .56     5.7
 24     .59     6.0      .70     9.6      .53     4.7
 26     .56     7.2      .68     7.0      .50     4.1
 28     .53     6.7      .67     5.5      .47     3.2
 30     .50     5.7      .62     4.4      .45     3.0
 35     .44     2.0      .55     2.3      .39     1.6
 45     .28     0.9      .38     0.7      .23     0.5
 55     .10     0.0      .25     0.8      .09     0.0

x, age of adult in days; l(x), probability of living to age x; m(x), fecundity as female eggs/female/day. *Length of preadult life (D) = 14.0 days. †D = 14.0 days. ‡D = 14.0 days.

FIG. 1. The reproductive function V(x) = l(x)m(x) for karyotypes bearing AR and CH chromosomes from Pinon Flats, CA. Experimental conditions were nearly optimal.

The reproductive functions V(x) are shown in Fig. 3; again, three-point moving averages are plotted to reduce erratic fluctuations. The V(x) functions under these harsh conditions differ noticeably from the triangular curves found for the two sets of karyotypes under nearly optimal conditions.
The results of this demographic analysis for the three sets of data are reported in Table 4. The $W_{ij}$ values as calculated have an average value of 1. To follow convention in reporting fitnesses, we have divided by the heterokaryotype fitness to give relative fitnesses. A stable genetic equilibrium is indicated only for the set of karyotypes bearing combinations of AR and CH chromosomes. The stable age distributions expected for these three karyotypes are nearly the same, and differences between the karyotypes will be small enough that they would disappear under the usual environmental variation.
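The calculation described in the Analysis paragraph can be sketched as follows: compute the fitnesses W with the preadult delay D, find the population growth rate r as the root of Z(r) = 0 by Newton–Raphson, then report the equilibrium frequency and the fitnesses relative to the heterokaryotype. This is a minimal sketch rather than the authors' program. The bisection safeguard (used whenever a Newton step would leave the bracket), the bracket [0, 0.4], the function names, and the toy schedules are assumptions added here for illustration; none of the numbers are taken from Tables 1–4.

```python
import math


def fitness(r, D, ages, l, m):
    """Return W = sum_x exp(-r*(D + x)) l(x) m(x) and its derivative dW/dr."""
    W = dW = 0.0
    for x, lx, mx in zip(ages, l, m):
        term = math.exp(-r * (D + x)) * lx * mx
        W += term
        dW -= (D + x) * term
    return W, dW


def z_and_slope(r, sched):
    """Z = W12^2 - W11*W22 - 2*W12 + W11 + W22, and dZ/dr, at growth rate r."""
    (W11, d11), (W12, d12), (W22, d22) = (
        fitness(r, *sched[k]) for k in ("11", "12", "22"))
    Z = W12 ** 2 - W11 * W22 - 2.0 * W12 + W11 + W22
    dZ = 2.0 * W12 * d12 - d11 * W22 - W11 * d22 - 2.0 * d12 + d11 + d22
    return Z, dZ


def solve_growth_rate(sched, lo, hi, tol=1e-12, max_iter=100):
    """Newton-Raphson on Z(r) = 0, with bisection as a fallback inside [lo, hi]."""
    z_lo = z_and_slope(lo, sched)[0]
    z_hi = z_and_slope(hi, sched)[0]
    if z_lo * z_hi > 0.0:
        raise ValueError("Z does not change sign on [lo, hi]")
    r = 0.5 * (lo + hi)
    for _ in range(max_iter):
        Z, dZ = z_and_slope(r, sched)
        if Z == 0.0:
            return r
        if Z * z_lo > 0.0:      # same sign as at the lower end: move lower end up
            lo = r
        else:
            hi = r
        r_next = r - Z / dZ if dZ != 0.0 else None   # Newton-Raphson step
        if r_next is None or not (lo <= r_next <= hi):
            r_next = 0.5 * (lo + hi)                 # fallback: bisect the bracket
        if abs(r_next - r) < tol:
            return r_next
        r = r_next
    return r


def equilibrium_summary(sched, r):
    """Return p1 and the fitnesses relative to the heterokaryotype."""
    W11, W12, W22 = (fitness(r, *sched[k])[0] for k in ("11", "12", "22"))
    p1 = (W12 - W22) / (2.0 * W12 - W11 - W22)
    return p1, (W11 / W12, 1.0, W22 / W12)


# Hypothetical schedules (D, ages, l, m) per genotype, NOT the published data.
ages = [1, 2, 3]
m = [2.0, 2.0, 1.0]
toy = {
    "11": (1.0, ages, [0.5, 0.4, 0.3], m),
    "12": (1.0, ages, [0.6, 0.5, 0.4], m),
    "22": (1.0, ages, [0.45, 0.35, 0.25], m),
}

r = solve_growth_rate(toy, 0.0, 0.4)
p1, relative = equilibrium_summary(toy, r)
print(round(r, 4), round(p1, 4), tuple(round(w, 4) for w in relative))
```

The derivative is computed analytically from the same terms as the fitnesses, so each Newton step needs only one pass over the life table. With real data the sums would run over every day of adult life, as described in the text.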
Table 2. Life history data for karyotypes with AR and PP inversions

            AR/AR*            AR/PP†            PP/PP‡
   x     l(x)    m(x)      l(x)    m(x)      l(x)    m(x)
   0     .60      0.0      .56      0.0      .57      0.0
   2     .60      0.0      .56      0.0      .57      0.0
   4     .60     23.6      .56     21.1      .57     22.6
   6     .60     20.7      .56     21.3      .57     20.9
   8     .60     20.2      .56     21.4      .56     22.7
  10     .60     23.2      .55     20.5      .56     23.6
  12     .60     25.8      .53     23.3      .56     22.3
  14     .60     23.3      .53     17.7      .56     20.3
  16     .60     25.1      .53     21.3      .56     20.2
  18     .59     20.1      .53     16.1      .56     15.3
  20     .59     16.4      .51     15.3      .56     14.8
  22     .58     15.9      .51     16.7      .56     15.0
  24     .53     16.6      .50     17.1      .55     16.2
  26     .50     16.3      .47     13.2      .53     14.3
  28     .46     17.0      .44     12.7      .50     16.1
  30     .44     11.1      .38     11.4      .50     12.4
  35     .36     10.0      .36     12.0      .44     10.3

x, age of adult in days; l(x), probability of living to age x; m(x), fecundity as female eggs/female/day.
*Length of preadult life (D) = 13.66 days. †D = 13.63 days. ‡D = 13.61 days.

FIG. 2. The reproductive functions V(x) = l(x) m(x) for karyotypes bearing AR and PP chromosomes from Black Forest, CO. Experimental conditions were nearly optimal.

FIG. 3. The reproductive functions V(x) = l(x) m(x) for karyotypes bearing AR and ST chromosomes from Mather, CA. Experimental conditions were harsh.

DISCUSSION

Somewhat to our surprise, the three sets of data illustrate all of the possible outcomes of selection on two alleles: stable genetic equilibrium, unstable genetic equilibrium, and fixation
of one allele. The stable genetic equilibrium is a particularly interesting case, because the AR and CH chromosomes at Pinon Flats are part of a balanced polymorphism with annual cycles in frequency of three major inversions (28). In laboratory populations, AR and CH chromosomes from this locality usually reach a stable genetic equilibrium with about 75% AR and 25% CH (29). These chromosomes were extracted from a population cage that was maintained and regularly sampled for 10½ years and subsequently was continued for a total of nearly 14 years (30, 31). Generation time is approximately a month in such population cages. To our knowledge, no other population cage of D. pseudoobscura has been followed as long as this one. Inversion frequencies in this population changed rather rapidly at first and then appeared to be stabilizing at about 36% CH at the end of a year. After year 1, the population was sampled 19 times until its termination at month 166. The frequency of CH fluctuated fairly broadly during this time, ranging between 12% and 33%, with an average value of 24%. This average accords with earlier experiments (29). CH was at a frequency
of 21% when the population was sampled to extract AR and CH chromosomes. Table 4 shows that the intrinsic rate of increase for an equilibrium population is greater than it would be in a population monomorphic for either AR or CH. These results are consistent with earlier experiments (4, 6, 7) showing that experimental populations polymorphic for AR and CH chromosomes from Pinon Flats produced greater biomass, had higher r values, and were better competitors with another Drosophila species than were populations monomorphic for AR or CH. The l(x) and m(x) schedules given in Table 1 predict a stable genetic equilibrium because the intrinsic rates of increase for the genotypic life history schedules satisfy the condition r11 < r12 > r22. The Norton–Charlesworth theory indicates a heterozygote advantage in fitness and an equilibrium frequency of 35% for CH. Frequency changes in the inversions calculated from the genotypic r values fit the observed frequencies in the population cage reasonably well during its first year. The Darwinian fitnesses estimated by demographic theory predict rather well the genetic equilibrium the population approaches during its first year, but they do not explain the lower frequencies of CH in later generations. The fluctuations in inversion frequency indicate that selection was not constant but probably changed in response to external environmental conditions and perhaps to the population’s genetic constitution. The long continuation of the polymorphism for AR and CH in the population cage bears out the prediction of balancing selection based on a fitness advantage of the AR/CH heterozygote.

Table 3. Life history data for karyotypes with AR and ST inversions

            AR/AR*            AR/ST†            ST/ST‡
   x     l(x)    m(x)      l(x)    m(x)      l(x)    m(x)
   0     .46      0.0      .64      0.0      .77      0.0
   2     .46      0.0      .64      0.0      .77      0.0
   4     .46      0.3      .64      0.0      .77      0.0
   6     .46      0.5      .64      0.2      .77      0.1
   8     .44      1.1      .64      0.6      .77      0.2
  10     .44      1.1      .64      0.6      .77      0.2
  12     .44      1.2      .64      0.6      .77      0.4
  14     .43      1.3      .64      1.2      .77      0.7
  16     .43      0.5      .64      0.7      .77      0.3
  18     .42      0.3      .64      0.5      .77      0.2
  20     .41      0.8      .63      0.8      .77      0.5
  22     .40      1.0      .63      0.6      .76      0.4
  24     .39      0.7      .62      1.0      .75      0.7
  26     .38      0.7      .62      1.0      .75      0.9
  28     .38      0.6      .60      0.6      .75      0.6
  30     .35      0.4      .58      0.4      .74      0.4

x, age of adult in days; l(x), probability of living to age x; m(x), fecundity as female eggs/female/day.
*Length of preadult life (D) = 13.10 days. †D = 13.22 days. ‡D = 13.47 days.

Table 4. Demographic analysis of selection on three sets of D. pseudoobscura karyotypes

                                              Equilibrium
Karyotype     rij      Rij     Wij      Type        Frequency AR     r
AR/AR        .2105     105     .86      Stable      .655             .213
AR/CH        .2172     138    1.00
CH/CH        .2031      78     .73
AR/AR        .2350     165    1.05      Unstable    .157             .233
AR/PP        .2326     141    1.00
PP/PP        .2330     159    1.01
AR/AR        .0774     4.1    1.04      None        1                .0774
AR/ST        .0759     4.8    1.00
ST/ST        .0656     4.2     .73

rij, instantaneous rate of increase (per day) of a population, all of whose members have the l(x) and m(x) schedules of karyotype AiAj; Rij, mean lifetime fecundity (as female eggs) of an AiAj female at birth; Wij, relative Darwinian fitness of karyotype AiAj; r, instantaneous rate of increase (per day) of an equilibrium population.

Decomposing natural selection into its component parts has been one of the most important topics in evolutionary biology during the past few decades (32–34). Two of the most important selection components are viability, or survival, and fertility, which in Drosophila is largely female fecundity and male mating success. A great strength of the demographic theory of selection is that life history tables of l(x) and m(x) represent these selection components in a natural way, and also show their variation with age. Viability or survival between two ages is just the ratio of the longevities at the two. Female fertility is represented by fecundity, but including the male component of fertility is more difficult. Demographic theory is normally formulated in terms of female fecundity alone, because population growth is seldom limited by male fertility. The differential mating success of male genotypes often will not affect population growth, but it will cause changes in gene frequency that reflect natural selection. The demographic model for selection assumes that fertility is alike in the two sexes. If l(x) and m(x) schedules differ in the sexes, the r obtained by averaging the r values separately estimated from male and female l(x) and m(x) schedules will describe the population’s growth rate to a good approximation (19). Looking down the columns for l(x) and m(x) of the three karyotypes in Table 1, we see a substantial advantage of the AR/CH heterokaryotype in both l(x) and m(x), and the reproductive functions plotted in Fig. 1 show these significant genotypic differences in terms of fecundity adjusted for survival.
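The quantities discussed in the preceding paragraph can be sketched in a few lines of Python. The example below computes the reproductive function V(x) = l(x) m(x), a three-point moving average of it, and survival between two ages as a ratio of l(x) values, using the Table 1 entries for adult ages 0 to 30 days. The function names are hypothetical, and leaving the two end points unsmoothed is an assumption made for illustration; how the end points were handled in the published figures is not stated.

```python
def reproductive_function(l, m):
    """V(x) = l(x) * m(x) at each tabulated age."""
    return [lx * mx for lx, mx in zip(l, m)]


def three_point_moving_average(values):
    """Centered three-point mean; end points are kept as they are (assumption)."""
    if len(values) < 3:
        return list(values)
    middle = [
        (values[i - 1] + values[i] + values[i + 1]) / 3.0
        for i in range(1, len(values) - 1)
    ]
    return [values[0]] + middle + [values[-1]]


def survival_between(ages, l, start, end):
    """Probability of surviving from age `start` to age `end`: l(end) / l(start)."""
    return l[ages.index(end)] / l[ages.index(start)]


# Adult ages (days) and l(x) columns copied from Table 1, ages 0-30 only.
ages = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]
l_table1 = {
    "AR/AR": [.84, .81, .80, .78, .76, .75, .74, .72, .70, .67, .64, .62, .59, .56, .53, .50],
    "AR/CH": [.83, .83, .83, .82, .82, .81, .80, .78, .76, .75, .74, .72, .70, .68, .67, .62],
    "CH/CH": [.81, .79, .76, .74, .72, .70, .68, .66, .63, .60, .58, .56, .53, .50, .47, .45],
}
# m(x) column for AR/CH copied from Table 1, ages 0-30 only.
m_ar_ch = [0.0, 0.0, 4.4, 10.7, 13.3, 16.2, 19.4, 20.2, 17.6, 16.7, 14.2, 11.5, 9.6, 7.0, 5.5, 4.4]

v_ar_ch = reproductive_function(l_table1["AR/CH"], m_ar_ch)
v_ar_ch_smooth = three_point_moving_average(v_ar_ch)

# Adult survival from emergence (day 0) to day 30 for each karyotype.
adult_survival = {k: survival_between(ages, l, 0, 30) for k, l in l_table1.items()}
```

On these tabulated values, survival from adult day 0 to day 30 is highest for AR/CH, in line with the heterokaryotype advantage in adult viability noted in the text.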
It is interesting that Druger and Nickerson (30, 31) sampled eggs from the population cage on several occasions and prepared salivary chromosomes from larvae grown under optimal conditions to determine the frequencies of the karyotypes. They also determined the karyotypes of adult flies emerging from cups of food in the population cage. In neither case was there a significant departure from Hardy–Weinberg expectations, and they concluded that the three karyotypes did not differ in preadult viability. The preadult viabilities from their later (27) measurements of l(x) and m(x), given as l(0) in Table 1, do not differ much and show no heterozygote advantage, confirming their earlier (30, 31) conclusions. Although the preadult viabilities of the karyotypes did not differ, the greater l(x) values for AR/CH indicate a substantial heterozygote advantage in viability during the adult life stage. The fecundity advantage of the heterokaryotype is not apparent in early adult life, but becomes pronounced by day 12.

It is important to remember that both the demographic equation used to calculate the karyotypic r values and the formula for the Darwinian fitnesses of the karyotypes have the value of V(x) multiplied by what Fisher (35) termed the ‘‘discount factor,’’ e^(-rx). When r > 0, this discount factor gives greater weight to fecundity contributed by early age classes. When r = 0, the discount factor becomes one and all ages contribute equally to fitness. Thus, for r = 0, the Darwinian fitness Wij equals the expected lifetime reproduction Rij. It is this connection between Darwinian fitness and the ecological parameters of population growth that is the special value of the demographic theory of selection.

Charlesworth and Giesel (36) used the demographic theory of selection to show how cycles in population growth rate could drive cycles in gene frequency at a polymorphic locus.
Cycles can be caused by environmental factors that affect the longevities and fecundities of all genotypes and ages in the same way, leading to no change in the relative sizes of l(x) and m(x), but rather to an altered population growth rate and in turn to an altered pattern of selection. The cycles of inversion frequency in D. pseudoobscura populations on Mt. San Jacinto (28), where the AR and CH strains used in the experiments were collected, may well be a case in point (19, 36). The fairly broad fluctuations in inversion frequency in the population cage may also be the result of environmental factors that affect population growth, such as temperature, quality of the nutrient medium, mold infection, and mite infestation. Although this scenario is attractive, fluctuations in inversion frequency could be intrinsic properties of these populations, without requiring that environmental changes drive them.

The unstable genetic equilibrium predicted for the AR and PP inversions is not surprising in light of the l(x) and m(x) schedules in Table 2 and the V(x) curves in Fig. 2. The reproductive functions for the karyotypes are only roughly triangular. They rise very steeply, and after V(x) for AR/PP begins to fall, the curve for PP/PP continues upward and levels off before dropping. The V(x) for AR/AR drops a bit after an initial rise and then shows a substantial rise again before dropping off. The heterokaryotype AR/PP is at a clear disadvantage to the homokaryotypes AR/AR and PP/PP. This heterozygote disadvantage is reflected in the unstable equilibrium at 15.7% AR (Table 4), which would be expected to lead to a decline in frequency of one or the other of the inversions and its eventual loss. The D. pseudoobscura population in Black Forest is polymorphic for AR and PP, and these inversions show seasonal cycles (37). There are no data available on the frequencies of these AR and PP chromosomes in experimental populations, but AR and PP from other localities have not reached balanced polymorphisms in population cages (38). The selective differences between the karyotypes are much smaller for AR and PP than for the karyotypes carrying AR and CH. The demographic analysis suggests that the frequencies of AR and PP would change slowly in experimental populations.
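The three outcomes discussed here (stable equilibrium, unstable equilibrium, no equilibrium) can be illustrated with the classical one-locus, two-allele model with constant relative fitnesses, in which an internal equilibrium exists only when the heterozygote is fitter or less fit than both homozygotes. The Python sketch below applies that textbook model to the rounded Wij values of Table 4. It is a simplification offered for illustration, not the calculation behind Table 4, and the function name is hypothetical.

```python
def classify_two_allele_selection(w11, w12, w22):
    """Return (outcome, equilibrium frequency of allele 1 or None).

    Classical constant-fitness model: the internal equilibrium is
    p = (w12 - w22) / (2*w12 - w11 - w22), stable under heterozygote
    advantage and unstable under heterozygote disadvantage.
    """
    if w12 > w11 and w12 > w22:
        outcome = "stable"
    elif w12 < w11 and w12 < w22:
        outcome = "unstable"
    else:
        return ("none", None)  # directional selection, no internal equilibrium
    p_hat = (w12 - w22) / (2.0 * w12 - w11 - w22)
    return (outcome, p_hat)


# Relative fitnesses (Wij) copied from Table 4; allele 1 is AR in every set.
table4_fitnesses = {
    "AR and CH": (0.86, 1.00, 0.73),
    "AR and PP": (1.05, 1.00, 1.01),
    "AR and ST": (1.04, 1.00, 0.73),
}
outcomes = {
    name: classify_two_allele_selection(*w) for name, w in table4_fitnesses.items()
}
```

With the rounded fitnesses this sketch gives an AR frequency near 0.66 for the AR and CH set and near 0.17 for the AR and PP set. These are close to, but not identical with, the tabulated .655 and .157, as would be expected when two-decimal fitnesses are fed into a simpler model.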
AR and ST chromosomes are normally polymorphic in the natural population at Mather, where these chromosomes were collected. They reach equilibria with slightly higher frequencies of ST than AR in experimental populations and with a pronounced heterozygote advantage in fitness (39). The conditions of our experiment were unusually harsh, as shown by the reduction in net reproduction R by a factor of about 25 (see Table 4). The r values were reduced approximately 3-fold by comparison with the other sets of data. The curves for V(x) were not triangular but showed three pronounced peaks. AR/AR showed a strong advantage early in adult life, followed by a peak for AR/ST and then one for ST/ST near the end of the experiment. AR/ST was intermediate in the karyotypic rates of increase and Darwinian fitnesses, leading to the prediction that there would be no equilibrium but, rather, a steady increase in the frequency of AR. The net reproduction R for the heterokaryotype was greater than that for either homokaryotype, but the growth rate gave additional weight to the early peak of reproduction by AR/AR, giving it a fitness advantage over the other karyotypes.

We have used life history schedules of longevity and fecundity as the basic data for a demographic analysis of selection in three Drosophila populations. The fitness estimates we have obtained, in terms of population growth rates and relative Darwinian fitnesses, make sense in light of what we know about the population genetics of the inversion polymorphism. Two sets of measurements were made under nearly optimal laboratory conditions. From the outset of this study, we wondered whether measurements made under uncrowded conditions would reflect the selection that occurs in crowded population cages, or in nature. Population density is theoretically an important factor in determining fitness (40, 41), but our results indicate that density effects probably do not dominate the selection on the inversions. The D.
pseudoobscura inversions are not unique in this respect. Mueller and Ayala (12) found that for Drosophila melanogaster inbred lines, population growth rates at low densities were good indicators of known fitness differences among the lines, whereas population growth rates at high densities were not.

One of the strong points of estimating fitnesses from life history tables is the focus it gives to female fecundity and, in particular, to the age at which offspring are produced. Unfortunately, our life history tables do not include male mating success, and this omission is a weakness of our approach. We can imagine experiments in which a life history table of longevity and male reproductive success would be generated and then utilized with the female life history table to generate fitness estimates. Such experiments would be difficult to carry out. We believe that the analysis we have presented, although not complete, is a step in the right direction toward a demographic approach to fitness and to natural selection.

We thank Drs. Richard Nickerson and Marvin Druger for generously providing us their l(x) and m(x) data on AR/AR, AR/CH, and CH/CH karyotypes, and Dr. Margaret Anderson for help with the calculations. We thank Drs. Daniel Promislow, Mohamed Noor, and Laurence Mueller for their comments on the manuscript, and Dr. Brian Charlesworth for comments on an earlier version.

1. Dobzhansky, T. (1937) Genetics and the Origin of Species (Columbia Univ. Press, New York).
2. Dobzhansky, T. (1970) Genetics of the Evolutionary Process (Columbia Univ. Press, New York).
3. Dobzhansky, T. (1968) Evol. Biol. 2, 1–34.
4. Beardmore, J. A., Dobzhansky, T. & Pavlovsky, O. A. (1960) Heredity 14, 19–33.
5. Birch, L. C., Dobzhansky, T., Elliot, P. O. & Lewontin, R. C. (1963) Evolution 17, 72–83.
6. Dobzhansky, T., Lewontin, R. C. & Pavlovsky, O. (1964) Heredity 22, 169–186.
7. Ohba, S. (1967) Heredity 22, 169–186.
8. Ayala, F. J. (1970) in Essays in Evolution and Genetics in Honor of Theodosius Dobzhansky, eds. Hecht, M. K. & Steere, W. C. (Appleton-Century-Crofts, New York), pp. 121–158.
9. Ayala, F. J. (1969) Can. J. Genet. Cytol. 11, 439–456.
10. Mourao, C. A., Ayala, F. J. & Anderson, W. W. (1972) Genetica 43, 552–574.
11. Anderson, W. W. (1989) Genome 31, 239–245.
12. Mueller, L. D. & Ayala, F. J. (1981) Genetics 97, 667–677.
13. Mueller, L. D. (1988) Proc. Natl. Acad. Sci. USA 85, 4383–4386.
14. Mueller, L. D. & Ayala, F. J. (1981) Proc. Natl. Acad. Sci. USA 78, 1303–1305.
15. Mueller, L. D., Guo, P. & Ayala, F. J. (1991) Science 253, 433–435.
16. Anderson, W. W. & King, C. E. (1970) Proc. Natl. Acad. Sci. USA 66, 780–786.
17. Cavalli-Sforza, L. L. & Bodmer, W. F. (1971) The Genetics of Human Populations (Freeman, San Francisco).
18. Charlesworth, B. (1970) Theor. Pop. Biol. 1, 352–370.
19. Charlesworth, B. (1994) Evolution in Age-Structured Populations (Cambridge Univ. Press, Cambridge, U.K.), 2nd Ed.
20. Stearns, S. C. (1992) Evolution of Life Histories (Oxford Univ. Press, Oxford).
21. Rose, M. R. (1991) Evolutionary Biology of Aging (Oxford Univ. Press, Oxford).
22. Promislow, D. E. L., Tatar, M., Khazaeli, A. A. & Curtsinger, J. W. (1996) Genetics 143, 839–848.
23. Norton, H. T. J. (1928) Proc. Lond. Math. Soc. 28, 1–45.
24. Norton, J. T. H. (1915) in Mimicry in Butterflies, ed. Punnett, R. C. (Cambridge Univ. Press, Cambridge, U.K.), pp. 154–155.
25. Moffa, A. M. & Costantino, R. F. (1977) Genetics 87, 785–805.
26. Franklin, I. & Lewontin, R. C. (1970) Genetics 65, 707–734.
27. Nickerson, R. P. & Druger, M. (1973) Evolution 26, 322–325.
28. Dobzhansky, T. (1943) Genetics 28, 162–186.
29. Wright, S. & Dobzhansky, T. (1946) Genetics 31, 125–156.
30. Druger, M. (1966) Heredity 21, 317–321.
31. Druger, M. & Nickerson, R. P. (1972) Evolution 26, 322–325.
32. Prout, T. (1969) Genetics 63, 949–967.
33. Prout, T. (1971) Genetics 68, 127–149.
34. Christiansen, F. B., Frydenberg, O. & Simonsen, V. (1973) Heredity 73, 291–304.
35. Fisher, R. A. (1958) The Genetical Theory of Natural Selection (Dover, New York), 2nd Ed.
36. Charlesworth, B. & Giesel, J. T. (1972) Am. Nat. 106, 388–401.
37. Crumpacker, D. W. & Williams, J. S. (1974) Evolution 28, 57–66.
38. Pavlovsky, O. & Dobzhansky, T. (1966) Genetics 53, 843–854.
39. Dobzhansky, T. (1948) Genetics 33, 588–602.
40. Anderson, W. W. (1971) Am. Nat. 105, 489–498.
41. Prout, T. (1980) Evol. Biol. 13, 1–68.
Proc. Natl. Acad. Sci. USA Vol. 94, pp. 7748–7755, July 1997 Colloquium Paper
This paper was presented at a colloquium entitled ‘‘Genetics and the Origin of Species,’’ organized by Francisco J. Ayala (Co-chair) and Walter M. Fitch (Co-chair), held January 30–February 1, 1997, at the National Academy of Sciences Beckman Center in Irvine, CA.
Phylogenetics and the origin of species
(allelic genealogies/gene trees/lineages/mitochondrial DNA/phylogeography)
JOHN C. AVISE AND KURT WOLLENBERG Department of Genetics, University of Georgia, Athens, GA 30602
ABSTRACT A recent criticism that the biological species concept (BSC) unduly neglects phylogeny is examined under a novel modification of coalescent theory that considers multiple, sex-defined genealogical pathways through sexual organismal pedigrees. A competing phylogenetic species concept (PSC) also is evaluated from this vantage. Two analytical approaches are employed to capture the composite phylogenetic information contained within the braided assemblages of hereditary pathways of a pedigree: (i) consensus phylogenetic trees across allelic transmission routes and (ii) composite phenograms from quantitative values of organismal coancestry. Outcomes from both approaches demonstrate that the supposed sharp distinction between biological and phylogenetic species concepts is illusory. Historical descent and reproductive ties are related aspects of phylogeny and jointly illuminate biotic discontinuity.

. . . genetics has so profound a bearing on the problem of the mechanisms of evolution that any evolution theory which disregards the established genetic principles is faulty at its source.
Theodosius Dobzhansky, 1937

Abbreviations: BSC, biological species concept; PSC, phylogenetic species concept.

© 1997 by The National Academy of Sciences 0027-8424/97/947748-8$2.00/0 PNAS is available online at http://www.pnas.org.

Dobzhansky (1) began Genetics and the Origin of Species with ‘‘an observational fact more or less familiar to everyone . . . the discontinuity of the organic variation.’’ After addressing the sources of genetic variability in sexually reproducing populations and the evolutionary processes configuring this variation, Dobzhansky then culminated his tome with three concluding chapters extending earlier sentiments by Lamarck, Darwin, A. R. Wallace, and others who had identified an important role for reproductive isolation in the origin and maintenance of biotic discontinuity in the living world. Genetics and the Origin of Species provides one of the seminal treatments of what would become known as the ‘‘biological species concept’’ (2), and it remains today among the most eloquent of expositions on the evolutionary significance of speciation as the juncture ‘‘at which the once actually or potentially interbreeding array of forms becomes segregated in two or more separate arrays which are physiologically incapable of interbreeding’’ (1). In Dobzhansky’s judgment, ‘‘biological classification is simultaneously a man-made system of pigeonholes devised for the pragmatic purpose of recording observations in a convenient manner and an acknowledgment of the fact of organic discontinuity’’ (1).

Throughout this century, the biological species concept (BSC) unquestionably has provided the primary philosophical framework orienting thought and research on speciation (3). Thus, a recent development in the field of systematics could hardly be of deeper import to evolutionary biology: the rise in popularity (4) of a competing view that depreciates or, in the extreme, disavows entirely (5, 6) any relevance of reproductive isolation for species concepts. Various formulations of a ‘‘cladistic’’ or ‘‘phylogenetic species concept’’ (PSC) have been advanced (7–16), but all agree that species concepts and definitions should emphasize criteria of phylogenetic relationship (descent) and not reproductive relationship (interbreeding) (14). For example, a phylogenetic species as defined by Cracraft (10) constitutes ‘‘the smallest diagnosable cluster of individual organisms within which there is a parental pattern of ancestry and descent,’’ with diagnosis based strictly on one or more synapomorphic (shared–derived) characters that identify a monophyletic aggregate of individuals (17).

A widespread perception of overt conflict between the BSC and the PSC is underscored by numerous published statements such as the following: ‘‘as a working concept, the biological species concept is worse than merely unhelpful and nonoperational—it can be misleading’’ (18); ‘‘a focus on the processes involved in breeding systems and barriers is unnecessary for . . . species recognition’’ (15); ‘‘reproductive isolation should not be a part of species concepts’’ (5); ‘‘a concept consistent with the PSC should replace ‘biological species’ concepts’’ (6); ‘‘evolutionary biologists should abandon the BSC’’ (17); and ‘‘the BSC and all other essentialist definitions should be scrapped once and for all’’ (19).

In light of these developments, this colloquium, which is timed to celebrate the 60th anniversary of the publication of Dobzhansky’s classic, is an appropriate forum to reflect once again upon the BSC, one of the book’s orienting foundations. The following questions will be addressed: (i) Is an appropriately constructed phylogenetic concept of species truly revisionary and antithetical to the BSC? (In our opinion, no.) (ii) Does the PSC as typically formulated provide an adequate philosophical or operational framework for achieving the stated goal of clarifying biotic relationships according to phylogenetic descent? (No, because it has failed to take into adequate account established population genetic principles.) (iii) Can desirable elements of the BSC and a properly reformulated PSC be reconciled in ways that will contribute to the scientific understanding of biotic discontinuity? (Yes.)

Here we introduce a heuristic approach that is inherently phylogenetic yet applies at the microevolutionary level of recent biological species and their constituent populations. The approach focuses on the individual and collective genealogical transmission pathways available to nuclear alleles in sexual organismal pedigrees. Each allele is defined here as a length of DNA that has been free of recombination within it during the ecological or evolutionary time under consideration, and whose branching pathways of descent therefore describe a nested, nonreticulate transmission history (‘‘allelic pathway’’ or ‘‘allelic genealogy’’) entirely suitable for phylogenetic or cladistic examination (20, 21).

Genealogical Pathways and Organismal Phylogenies.

Phylogeny at the level of populations and species. Ever since the publication of Genetics and the Origin of Species, geography and demography have played key roles in most biological speciation scenarios (22). During the evolutionary sequence of events by which an extended reproductive community of organisms (a field for gene recombination) becomes sundered, a curtailment of population genetic exchange by environmental separation typically is envisioned as a necessary prerequisite for the eventual evolution of the intrinsic (genetic) reproductive isolating barriers (RIBs) that are the hallmark of the BSC (Fig. 1). The initial genomic sundering may involve sister populations distributed across broad areas (species D and E in Fig. 1), small founding populations on the periphery of a species’ range (species A) (23), or, in some cases (24), local syntopic populations separated by microhabitat (species B). In each case, population genomic differentiation facilitated by environmental impediments to interbreeding initiates or eventually may lead to an elaboration of intrinsic reproductive barriers. Biological speciation also can take place suddenly in small populations via reproductive sundering processes such as polyploidization, chromosomal rearrangements, or changes in the mating system (20). Species A and B in Fig. 1 could be interpreted as examples. Each such geographic–demographic model yields logical predictions about the coarse-focus phylogeny for particular extant populations or biological species (25). For example, from a traditional perspective, taxa D and E (Fig. 1) are sister biological species that comprise a clade.
On the other hand, the widely distributed species C that recently spawned a peripheral isolate A, or a syntopic species B, is paraphyletic with respect to each of these latter forms. As emphasized by Patton and Smith (26), most mechanisms of speciation currently advocated by evolutionary biologists ‘‘will result in paraphyletic taxa as long as reproductive isolation forms the basis for species definition.’’ Such statements pertain to the historical subdivisions of gene pools at the levels of species or well-demarcated populations. In reality, intermediate situations also exist in which biotic subdivisions display incomplete phylogenetic separation because of a semipermeability in the extrinsic or intrinsic barriers to genetic exchange.
FIG. 1. (a) Phylogeny for five biological species (A–E) and two geographically separated populations (C1 and C2) of C. Branch widths are proportional to the populations’ or species’ sizes and also indicate a geographic orientation. Thus, A is a peripheral isolate from C1, and B arose within the range of C2. The sundering agents are intrinsic RIBs (black areas), extrinsic barriers to gene flow (gray areas), or both in temporal order of appearance (gray then black). (b) Simplified ‘‘stick’’ representation of the phylogeny in a.
FIG. 2. Same phylogeny as in Fig. 1 but here depicting organismal pedigrees through 21 discrete generations leading to the present. The two lines tracing from each male (■) or female (○) in any generation identify the parents of that individual. They also describe the geographic dispersal of offspring (which is assumed to be distance-limited) and the mating events.

FIG. 3. Identical phylogeny and pedigree to Fig. 2, but here in which the four allelic transmission pathways that are mutually exclusive in every generation have been highlighted by arrows. (Upper Left) Matrilineal pathway reflecting the ‘‘F → F → F → F . . . ’’ transmission route (e.g., of mtDNA). (Upper Right) Patrilineal pathway reflecting the ‘‘M → M → M → M . . . ’’ transmission route (e.g., of the Y chromosome). (Lower Left) Generation-to-generation pathway through alternating genders ‘‘M → F → M → F. . . . ’’ (Lower Right) The reciprocal of the latter, ‘‘F → M → F → M. . . . ’’ Heavy arrows mark transmission routes through this pedigree that extend to the current generation; light arrows mark the same respective transmission routes that terminated before reaching the extant generation.

Phylogeny at the level of alleles. In principle, any representation of phylogeny for separated populations or species might be examined under finer focus by reference to organismal pedigrees (Fig. 2). Ineluctably, pedigrees define extended pathways of genetic transmission that constitute rivulets in ‘‘the stream of heredity (that) makes phylogeny’’ (27). Consider, for example, the matrilineal pathway of transmission (F → F → F → F . . . , where F signifies female) for mitochondrial (mt) DNA (Fig. 3 Upper Left). All extant females in taxon E trace genealogically through female ancestors to a shared progenitress at t − 5, those in D coalesce at t − 9, and those in the D + E assemblage stem to a common ancestor at t − 12. The great-great . . . -great matrilineal grandmother of all extant individuals in the pedigree existed at t − 20. With respect to the matrilines in the A–C complex (which coalesce at t − 11), C1 is paraphyletic to A, and C2 is paraphyletic to B. All such statements reflect the realities of allelic-level ancestry through heredity, as to be distinguished from any estimates of ancestry in empirical appraisals based on molecular or any other data.

Similarly, other gender-described classes of genealogical pathways can be envisioned. In any pedigree for sexually reproducing organisms, only four such transmission routes are mutually exclusive in every generation: the matrilineal pathway already mentioned; the patrilineal analogue (M → M → M → M . . . , where M signifies male; the route, for example, of the mammalian Y chromosome); and the generation-to-generation alternating reciprocal pathways ‘‘M → F → M → F . . . ’’ and ‘‘F → M → F → M . . . . ’’ As traced through the organismal pedigree under consideration (Fig. 3), a comparison of these ‘‘independent’’ pathways illustrates two fundamental points. First, the coalescent trees for the allelic pathways can differ from one another in both depth and pattern. For example, extant individuals in species E share a common ancestor in the patrilineal phylogeny at t − 19 (Fig. 3 Upper Right), whereas they do so in the ‘‘M → F → M → F . . . ’’ tree at t − 3 (Fig. 3 Lower Left). Second, such transmission routes can differ from the species- or population-level phylogeny for reasons of lineage sorting, even in the total absence of introgression. With respect to patrilineal genealogy (Fig. 3 Upper Right), for example, the sister species D and E do not display a relationship of reciprocal monophyly (28), nor do they comprise a sister clade to the A–C group.

The total number of allelic transmission pathways as defined in this manner by gender is 2^(G+1), where G (≥1) is the number of generations monitored (29). In the current case, this count is 2^22 = 4,194,304. Each such multigeneration pathway describes a unique, gender-defined route of potential allelic transmission through the organismal pedigree. These multitudinous genealogical pathways also differ from the four ‘‘independent’’ pathways pictured in Fig. 3, although the degrees of overlap vary widely. For example, the ‘‘F → F → F . . . → F → M’’ pathway is identical to the matrilineal pathway (Fig. 3 Upper Left) except for the most recent generation, where the transmission was to sons. This particular example would describe the history of mtDNA in extant males. The reciprocal pathway (M → M → M . . . → M → F) describes the history of family surnames as displayed by unmarried females in many human societies: a patrilineal legacy to living daughters (30). All other 4,194,298 genealogical pathways (e.g., F → F → M → M . . . → F → F → M → M, or F → M → F → F . . . → F → M → F → F) have been equally available to any piece of autosomal DNA trickling through the organismal pedigree under the rules of Mendelian inheritance. This statement holds regardless of the sex ratio in the population, because every individual has a mother and a father. This gender-defined conception of allelic pathways differs somewhat from the usual definition of a ‘‘gene tree’’ for nuclear loci, but is introduced here to emphasize analogies to the conventionally understood pathways for mtDNA and the Y chromosome (as well as to clarify and stimulate thought about the potential numbers and patterns of allelic transmission routes).
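The pathway count above can be checked with a short Python sketch. It rests on an assumption made only for illustration: a gender-defined pathway is written as a string of G + 1 labels, each F or M, ordered from the oldest ancestor to the extant individual. The function names are hypothetical.

```python
from itertools import product


def count_pathways(generations):
    """Number of gender-defined pathways for G generations monitored: 2**(G + 1)."""
    if generations < 1:
        raise ValueError("G must be at least 1")
    return 2 ** (generations + 1)


def enumerate_pathways(generations):
    """Every pathway as a string of G + 1 gender labels, oldest ancestor first."""
    return ["".join(p) for p in product("FM", repeat=generations + 1)]


def is_one_of_four_exclusive(pathway):
    """True for the strictly matrilineal, patrilineal, or alternating routes."""
    same = all(a == b for a, b in zip(pathway, pathway[1:]))
    alternating = all(a != b for a, b in zip(pathway, pathway[1:]))
    return same or alternating


total = count_pathways(21)             # the case considered in the text
small = enumerate_pathways(2)          # small enough to list in full
exclusive = [p for p in small if is_one_of_four_exclusive(p)]
```

For a small G the full list can be printed and inspected. For G = 21 only the count is practical, and subtracting the four mutually exclusive routes and the two single-switch examples mentioned in the text leaves 4,194,298.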
In actuality, extant alleles at any real autosomal gene collectively will have traversed many different genealogical pathways, such that the particular gender-based transmission routes normally remain unspecifiable by locus (31, 32). This allelic-level heuristic conception of microphylogeny emphasizes that different pieces of DNA can have different genealogical histories both within and between closely related forms. This situation is an inevitable outcome of the quasi-independent transmission histories of alleles within and among loci through the organismal pedigrees of sexual reproducers.

Coalescent theory. A theory of gene coalescence as a function of population demography (33–41) and biological speciation mode (28, 42–46) has developed in response to the novel interpretive challenges provided by molecular genealogical data, particularly that from nonrecombining animal mtDNA (47, 48). As a phylogenetically oriented subdiscipline of population genetics, coalescent theory formally addresses lineage sorting and branching processes through organismal pedigrees, such that demographic and life-history factors, including population size, dispersal, and mating pattern, assume importance. Only cursory background will be noted here. Imagine a large, idealized population with nonoverlapping generations and a constant number of Nm males and Nf females who contribute to a gametic pool from which the next generation of diploid zygotes is randomly derived. For particular allelic pathways of the sort pictured in Fig. 3, the expected mean time to common ancestry for pairs of alleles or haplotypes, measured in generations, is Nm or Nf, and the mean coalescence time for large numbers of sampled haplotypes is approximately 2Nm or 2Nf (49). These expectations apply to an mtDNA gene tree, a Y-linked gene tree, or to any hypothetical nuclear allelic pathway described by a specified gender-based transmission route when Nm = Nf.
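The two expectations quoted above (N generations for a pair of haplotypes, about 2N for a large sample) can be illustrated with a minimal sketch. It assumes the standard large-population approximation in which k lineages wait on average N / C(k, 2) generations before the next coalescence; that derivation is a textbook result and is not given in this paper. The function names and the simulation settings are hypothetical.

```python
import random


def expected_tmrca(sample_size: int, n_sex: int) -> float:
    """Mean generations for `sample_size` haplotypes to reach one ancestor.

    Sums the waiting times N / C(k, 2) while k = sample_size, ..., 2 lineages
    remain, which simplifies to 2N(1 - 1/n).
    """
    return sum(n_sex / (k * (k - 1) / 2) for k in range(2, sample_size + 1))


def simulate_pairwise(n_sex: int, trials: int = 20_000, seed: int = 1) -> float:
    """Average simulated coalescence time for pairs of haplotypes."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        generations = 1
        # Each generation back, two lineages share a parent with chance 1/N.
        while rng.randrange(n_sex) != 0:
            generations += 1
        total += generations
    return total / trials


print(expected_tmrca(2, 100))     # pairwise expectation: N
print(expected_tmrca(1000, 100))  # large sample: close to 2N
print(simulate_pairwise(50))      # simulated pairwise mean, near N = 50
```

The sum makes the factor of two visible: most of the total depth of a large genealogy is spent waiting for the last two lineages to coalesce.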
For unequal sex ratio, the average time to common ancestry would have to be adjusted to reflect the relative numbers of generations an allele spends in each gender in the particular allelic transmission pathway specified. In the real world, a collective genealogy for multiple alleles at a true autosomal locus will not be delineated so clearly, because in each generation the alleles could have been
transmitted along any of four possible routes (F → F, M → M, F → M, or M → F). Expected coalescent times at a real autosomal locus are 4-fold greater (50) than those for the idealized allelic pathways described above. Natural populations depart from this idealized model, but expectations are approximated by substituting effective population sizes of males (Ne(m)) and females (Ne(f)) into the equations. Also, at real loci the theory applies strictly to neutral alleles. For example, coalescent times under balancing selection are extended because allelic extinction is inhibited, whereas coalescent times under selective sweeps are truncated. Recent extensions of coalescent theory have examined the consequences of nonequilibrium population demographies (51–55) and varied selection regimes (56–61) on gene genealogies. The lineage-sorting processes underlying coalescent theory also apply across the sundering events that generate phylogenetic nodes and branches at the levels of populations or species (28, 42–45). Consider again the sister species D and E in Fig. 3. At the present time (t = 0), E is paraphyletic to D in terms of patrilineal genealogy (Fig. 3 Upper Right), but a reasonable expectation under lineage sorting is that this genealogical relationship will soon convert to one of reciprocal monophyly [‘‘exclusivity’’ in other parlance (62–64)] as one or the other of the two ancient patrilines in E goes extinct. In general, a neutral allelic tree is highly likely to have evolved to a status of exclusivity in two sister populations or species separated for more than about 4Ne generations, where Ne refers to the size of the isolates (28, 65). A discordance also exists between the patrilineal genealogy (Fig. 3 Upper Right) and the deepest node (distinguishing A–C from D–E) in the species phylogeny.
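The fourfold difference between autosomal and single-pathway coalescence times noted above can be reproduced with a small worked example. It assumes two textbook results that this paper cites but does not state: the effective size for separate sexes, 4·Nm·Nf / (Nm + Nf), and a mean pairwise coalescence time of 2Ne generations at a neutral diploid autosomal locus. The function names are hypothetical.

```python
def autosomal_ne(n_males: float, n_females: float) -> float:
    """Effective size with separate sexes: 4*Nm*Nf / (Nm + Nf)."""
    return 4.0 * n_males * n_females / (n_males + n_females)


def pairwise_pathway(n_sex: float) -> float:
    """Mean pairwise coalescence (generations) along a single-sex pathway."""
    return n_sex


def pairwise_autosomal(n_males: float, n_females: float) -> float:
    """Mean pairwise coalescence (generations) at a neutral autosomal locus."""
    return 2.0 * autosomal_ne(n_males, n_females)


# Equal sex ratio, 500 breeders of each sex (illustrative numbers only).
ratio = pairwise_autosomal(500, 500) / pairwise_pathway(500)
print(ratio)  # 4.0
```

With an unequal sex ratio the ratio departs from four, which is one way to see why the text restricts the simple expectations to the case Nm = Nf.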
In the future, depending on which of the two deep patrilines currently within E first goes extinct, this discordance will either (i) disappear or (ii) become cemented in place. Outcome (ii) illustrates how a true allelic-tree/species-tree discordance also may characterize species that stem from closely spaced population nodes (relative to effective population size) at distant times in the evolutionary past. The probabilities of such discordance as functions of Ne and internodal numbers of generations are presented by Nei (49). In summary, coalescent theory as applied across the phylogenetic nodes in populational or species trees has shown that the phylogenetic status of a given pair of biological species with respect to allelic genealogy is itself evolutionarily dynamic, with a usual time course subsequent to biological speciation being poly- or paraphyly → reciprocal monophyly. Furthermore, the demographic and geographic modes of speciation have major impact on the developing phylogenetic status of allelic genealogies in related biotas (28). Thus, at microevolutionary scales, concepts of phylogeny cannot be divorced from those of population genetics and demography. As sundering agents at the level of populations and species, extrinsic and intrinsic barriers to interbreeding are keynote evolutionary agents motivating genealogical partitions at the level of allelic lineages.

Points of Conflict Between the PSC and the BSC. Phylogenetic complaints against the BSC. Proponents of the PSC have leveled several criticisms against the BSC. One widespread grievance is that appeals to reproductive isolation in species recognition necessitate unjustifiable and untestable judgments about the future; namely, whether contemporary barriers to interbreeding are permanent or temporary. However, analogous prospective judgments are no less inherent in species concepts based on criteria of perceived phylogenetic separation (66).
Thus, whether or not this criticism of the BSC is fully justified, it cannot be used to argue in favor of the PSC. We will briefly address the three other perceived flaws of the BSC that were emphasized in a recent review (6).
Colloquium Paper: Avise and Wollenberg
(i) Reproductive compatibility among populations is a shared primitive (rather than derived) feature, so it provides no criterion for identifying monophyletic units or clades. In other words, ‘‘a serious potential problem of the BSC is the occurrence of paraphyletic, or nonhistorical, groups’’ (6). We appreciate the premise of this sentiment, but fail to see why it is so anathematical to admit paraphyly in species concepts (apart from the fact that this practice violates an operational ethic of cladistic analysis). As illustrated above, many modes of biological speciation initially entail paraphyly both at the levels of population trees and allelic trees. The philosophical stance that we favor acknowledges the reality of paraphyly for many biological species but then capitalizes upon such lineage information in particular instances to recover the historical population demographies that accompanied biological speciation. By tapping the genealogical archives of extant organisms, the bygone demographies of ancestral populations and speciation events may be inferred (67). In this important sense, paraphyly can hardly be equated with ‘‘nonhistory.’’ (ii) A focus on reproductive compatibilities and patterns of interbreeding can cause a ‘‘misrepresentation of the significance of hybridization among differentiated taxa’’ (6). This criticism of the BSC stems from the fact that widely varying levels of hybridization and introgression have been employed by taxonomists as justifications for naming biological species. However, such problems are implementational more than epistemological. The Linnaean binomial system of nomenclature lends itself poorly to the summary of situations with intermediate levels of introgression. 
Under the genealogical perspective promoted in this paper, the mosaic phylogenetic histories of allelic pathways within an organismal pedigree (including introgressed lineages) are of greater empirical content and conceptual import than are the necessarily simplified taxonomic summaries. (iii) ‘‘A long-recognized drawback of the BSC is its difficulty in ranking allopatric populations . . . . Because reproductive isolation is an epiphenomenon (or emergent property) of divergence, it is not easily related to descriptions of how characters vary geographically’’ (6). This criticism validly notes an inherent implementational difficulty for species-level taxonomy under the BSC. However, even in this restricted context, alternative concepts such as the PSC may fare no better. If individual synapomorphies are to be used to define species as under the conventional PSC, then with sufficient empirical effort nearly every local population will be distinguishable from nearly every other by some character, and the challenge remains as to how to rank the differences. Such ranking might be accomplished more appropriately by considerations of multicharacter genealogies (see below), but these too are emergent properties of population-level divergence through extended pedigrees. Furthermore, because demographic factors such as gene flow and effective population size exert overriding influence on phylogeographic patterns, strict application of the monophyly criterion for species definition will strongly bias toward formal taxonomic recognition of smaller as opposed to larger populations (the latter more often will be paraphyletic with respect to close relatives). Thus, the conventional PSC itself can fail to capture important aspects of organismal history across geography. Diagnostic complaints against the PSC. 
The overriding difficulty with the traditionally formulated PSC concerns the diagnosability criteria by which clades in sexually reproducing organisms are to be recognized at microevolutionary scales. In the light of coalescent theory, any approach that promulgates clade diagnosis on the basis of synapomorphs at only one or a few genes makes little sense. Consider again Fig. 3. Many phylogenetic species, each comprising a ‘‘diagnosable cluster of individual organisms within which there is a parental pattern of ancestry and descent’’ (10), could be identified depending on which allelic genealogy was scrutinized. Many more such
‘‘clades’’ could be identified by various other of the 4.2 million gender-defined transmission pathways in this same pedigree (Figs. 4 and 5). Such diagnostic possibilities are not merely hypothetical. With the resolving power already available in molecular assays of rapidly evolving genes such as mtDNA or autosomal microsatellites, many local populations, family units, and even individual organisms can readily be distinguished by recently derived mutations (20, 68). What conceptual or utilitarian rationale exists for defining each such diagnosable biological unit as a distinct species? The ‘‘clades’’ identified by the synapomorphies in different allelic trees almost inevitably group sexually interbreeding individuals into overlapping arrays, such that the phylogenetic units recognized by different pieces of DNA are neither mutually exclusive nor nested. For example, all extant C1 males form a clade in the allelic tree displayed in the lower right of Fig. 3, whereas they are variously allied to A or C2 and B in the patrilineal genealogy (Upper Right). Only after reproductive ties have been severed for times considerably longer than effective population sizes do deeper topologies in multiple allelic genealogies tend to come into congruence with one another, and with the coarse-focus topologies of the population-level phylogenies that they comprise. From this perspective, speciation under phylogenetic precepts should be viewed more properly as the evolutionary process by which patterns of predominant nonconcordance among shallow allelic genealogies are converted to patterns of predominant concordance in deeper allelic trees.
FIG. 4. Examples of six (a–f) additional gender-defined allelic transmission pathways (analogous to those in Fig. 3) through the organismal pedigree (Fig. 2). Each genealogy terminated in extant females and, hence, is a ‘‘female-tip’’ gene tree. Lineages that did not coalesce at t = −21 were assumed to do so at t = −22. At the bottom of the figure is a consensus tree for 20 such randomly chosen female-tip genealogies. Numbers on branches indicate the percentage of allelic trees in which that ‘‘clade’’ was found.
FIG. 5. Examples of six (a–f) additional ‘‘male-tip’’ allelic pathways through the organismal pedigree of Fig. 2 and a consensus tree for 20 such randomly chosen male-tip genealogies (see legend to Fig. 4).
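The branch percentages reported on the consensus trees of Figs. 4 and 5 amount to counting how often each ‘‘clade’’ recurs across allelic genealogies. The sketch below illustrates that bookkeeping on invented toy data; it is not the procedure or the data used for the figures, and the function names and tip labels are hypothetical. Each gene tree is represented simply as the set of clades it contains, with a clade stored as a frozenset of tip labels.

```python
from collections import Counter


def clade_support(trees):
    """Percentage of trees in which each clade (a frozenset of tips) appears."""
    counts = Counter()
    for clades in trees:
        counts.update(set(clades))  # count a clade at most once per tree
    return {clade: 100.0 * n / len(trees) for clade, n in counts.items()}


def majority_clades(trees):
    """Clades found in more than half of the trees."""
    return {c for c, pct in clade_support(trees).items() if pct > 50.0}


# Hypothetical toy gene trees over tips A-E; three agree, one conflicts.
DE, ABC = frozenset("DE"), frozenset("ABC")
CD, ABE = frozenset("CD"), frozenset("ABE")
gene_trees = [{DE, ABC}, {DE, ABC}, {DE, ABC}, {CD, ABE}]

support = clade_support(gene_trees)
print(support[DE])                                  # 75.0
print(majority_clades(gene_trees) == {DE, ABC})     # True
```

As in the four genealogies of Fig. 3, the single conflicting tree is outvoted, and the clades that survive the majority rule are the ones shared by most transmission pathways.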
In summary, no species concept that results in an overly simplified caricature of organismal phylogeny can hope to capture the rich and varied fabrics of genealogical histories in the multiple pieces of DNA that make up those ‘‘composite’’ phylogenies. As phrased by Maddison (21), ‘‘the species phylogeny is more like a statistical distribution, being composed of various trees (the gene trees), each of which may indicate different relationships.’’ The challenge then becomes to describe these statistical distributions and to properly interpret the demographic and evolutionary processes that have shaped them.

Reconciling a Multilocus PSC with the BSC. Most early definitions of phylogenetic species failed to accommodate the realities of microevolutionary genealogy in the pedigrees of sexually reproducing organisms. Avise and Ball (69) therefore concluded that if a broader framework of the PSC was to contribute to a significant advance in systematic practice, a shift from issues of diagnosability to considerations of multilocus magnitudes and patterns of genealogical differentiation would be required. Recent verbal reformulations of PSC-like concepts according to principles of multilocus genealogical concordance (63, 64, 69, 70) have begun to heed this call, although much remains to be developed in a formal multilocus coalescent theory of speciation. To develop conceptual connections between the phylogenetic histories of particular genes and the composite histories of populations and species, we now introduce a modification of traditional coalescent theory that jointly examines multiple allelic genealogies through an organismal pedigree. Two general approaches are employed that, in effect, acknowledge and summarize the genealogical heterogeneity across the multitudinous transmission pathways available to nuclear (and cytoplasmic) alleles. These approaches bear some analogy to the philosophies of ‘‘separate’’ (71) versus ‘‘combined’’ (72) phylogenetic treatments of multiple, potentially conflicting data sets (see ref. 73).

Qualitative concordance. The first of these heuristic approaches compares topologies across multiple allelic genealogies (e.g., through use of consensus trees). For example, three of the four allelic genealogies in Fig. 3 portray a status of reciprocal monophyly (exclusivity) for D and E and also suggest that these species form a clade distinct from the A–C assemblage. The fourth allelic pathway (Upper Right) contradicts these patterns, but would be overruled in a majority consensus representation. Extending this approach, we have examined many additional genealogical transmission pathways through the organismal pedigree in Fig. 2. Such gender-defined allelic pathways terminate either in extant females (‘‘female-tip’’ trees) or males (‘‘male-tip’’ trees). Six random examples from among the millions of pathways definable for each gender-tipped class are shown in Figs. 4 and 5, together with consensus-tree summaries of additional such representative allelic genealogies. Notice both the heterogeneity of detailed branching pattern across the different allelic genealogies and the fact that the consensus trees nonetheless properly capture the major features in the overall organismal pedigree and phylogeny (compare with Fig. 2).

Quantitative coancestry. The second heuristic approach considers composite indices of genetic relatedness between organisms and groups individuals into genealogical assemblages accordingly. Here, the computer program SAS (74) was applied to the known pedigree in Fig. 2 to calculate true coefficients of coancestry between all pairs of extant individuals.
A coancestry (or kinship) coefficient is defined as the chance that an allele randomly drawn from one individual is identical by descent (autozygous) within the pedigree to an allele drawn from another individual (75), and is equivalent in value to the inbreeding coefficient for these individuals’ hypothesized offspring. Such probabilities are positive functions of the number of genealogical pathways connecting a pair of individuals to all ancestors in the pedigree and inverse functions of the lengths of those pathways. The coancestry matrix for extant individuals in Fig. 2 then was clustered by UPGMA (76), with results presented in Fig. 6. Although the composite genealogical ‘‘tree’’ in Fig. 6 is artificially branched and mostly dichotomous, it again captures the major features of the organismal pedigree (Fig. 2) upon which (ultimately) it was based. Thus, D and E are sister groups separated by a relatively deep node; C1 appears paraphyletic to A; C2 appears paraphyletic to B; and the A–C assemblage joins the D–E group at the oldest node in the phenogram. Readers may object that these qualitative consensus and quantitative coancestry approaches at attempted recovery of a known pedigree, being based as they are on the allelic-tree properties of that pedigree, involve circularities of reasoning. We agree. In a genealogical sense, a composite organismal pedigree cannot be fundamentally distinct from a statistical compilation of the multitudinous transmission pathways within it. A BSC–PSC reconciliation. Both of these multilocus genealogical approaches demonstrate that reproductive barriers (the hallmark of the BSC) are important, even within a strictly phylogenetic species framework, because they generate through time increased genealogical depth and concordance across allelic pathways. This point has seemed rather obvious to us, yet it has not been appreciated fully by most calls in the literature for a replacement of the BSC by the PSC. 
It has been the intent of this paper to illustrate, using a novel but simple conceptual construct based on considerations of nonanastomotic allelic pathways, how reproductive and phylogenetic aspects of biological differentiation are related intimately.
FIG. 6. Phenogram based on a cluster analysis of the matrix of coancestry coefficients for the 39 extant individuals in the pedigree of Fig. 2. Note the close resemblance of this representation to that of the original pedigree.
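The two steps behind a phenogram such as Fig. 6 (pairwise coancestry from a known pedigree, then UPGMA clustering) can be sketched in a few lines. This is not the SAS analysis used for the figure; it substitutes the standard recursive definition of the kinship coefficient and a plain average-linkage loop, and it runs on a hypothetical eight-member pedigree rather than the 39 extant individuals of Fig. 2. All names are illustrative, and founders are assumed to be unrelated and non-inbred.

```python
from functools import lru_cache
from itertools import combinations


def make_kinship(pedigree):
    """Return a function giving the coancestry coefficient of two individuals.

    `pedigree` maps each individual to (mother, father), or to None for a
    founder, and must list parents before their offspring.
    """
    birth_order = {ind: k for k, ind in enumerate(pedigree)}

    @lru_cache(maxsize=None)
    def kinship(a, b):
        if a == b:
            parents = pedigree[a]
            # Two draws from one individual hit the same copy half the time;
            # otherwise they trace back to the two parents.
            return 0.5 if parents is None else 0.5 * (1.0 + kinship(*parents))
        if birth_order[a] < birth_order[b]:
            a, b = b, a  # always expand the later-born individual
        parents = pedigree[a]
        if parents is None:
            return 0.0  # distinct founders are assumed unrelated
        mother, father = parents
        return 0.5 * (kinship(mother, b) + kinship(father, b))

    return kinship


def upgma(labels, similarity):
    """Average-linkage clustering on a similarity; returns nested tuples."""
    clusters = {label: [label] for label in labels}  # subtree -> its tips

    def mean_similarity(x, y):
        pairs = [(i, j) for i in clusters[x] for j in clusters[y]]
        return sum(similarity(i, j) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two clusters whose members are, on average, most related.
        x, y = max(combinations(clusters, 2), key=lambda p: mean_similarity(*p))
        clusters[(x, y)] = clusters.pop(x) + clusters.pop(y)
    return next(iter(clusters))


# Hypothetical pedigree: two unrelated sibships, parents listed first.
pedigree = {
    "P1": None, "P2": None, "P3": None, "P4": None,
    "S1": ("P1", "P2"), "S2": ("P1", "P2"),
    "T1": ("P3", "P4"), "T2": ("P3", "P4"),
}
kinship = make_kinship(pedigree)
print(kinship("S1", "S2"))  # 0.25 for full sibs
print(upgma(["S1", "S2", "T1", "T2"], kinship))
```

Because the clustering works on similarities, the pair with the highest mean coancestry is joined first; full sibs group together before the two unrelated sibships are united at the root.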
Thus, reproductive barriers tend to demarcate and distinguish deep biotic discontinuities from those that are ‘‘trivial’’ in the sense of being recent and/or idiosyncratic to small fractions of the genome. It is no mere coincidence, for example, that biological species D and E as defined by reproductive criteria (Figs. 1–3) constitute recognizable assemblages of individuals genealogically (Figs. 4–6). Conversely, genealogical considerations can be seen as important under the philosophical framework of the BSC because they force explicit attention on historical and demographic aspects of speciation. For example, the genealogical paraphyly of C to A (Figs. 4–6), and the high coefficients of coancestry between these two arrays of individuals (Fig. 6), jointly imply a recent and perhaps bottlenecked separation of A from C, as indeed was the case (Figs. 1–3). As applied to taxonomy, Avise and Ball (69) suggested that the BSC be retained as a philosophical orientation for species recognition (notwithstanding the operational difficulties acknowledged above), whereas ‘‘significant’’ phylogeographic partitions, as registered by relatively deep and concordant genealogical separations in multiple lineage pathways, provide a justifiable philosophical and empirical basis for the recognition of additional, historically important ‘‘subspecific’’ units. For at least two reasons, such suggestions probably will have minor impact on either the procedures or outcomes of traditional systematics (although this ‘‘boon’’ alone should not be interpreted as a primary justification for genealogical concordance concepts). First, systematists at their best always have sought concordant evidence from multiple characters before making firm taxonomic judgments (77). Second, although the recovery of nuclear gene genealogies has become technically feasible in recent years (e.g., refs. 78 and 79), such molecular appraisals remain laborious and challenging.
Thus, in practice, conventional classes of information from multiple character state distributions (molecular or otherwise) no doubt will continue to provide the surrogate phylogenetic information to be included in appraisals of genealogical concordance and the biotic discontinuities thereby registered. Concluding Thoughts. Biological speciation lies at a pivotal boundary where the partially braided collection of allelic pathways of interbreeding individuals bifurcates into two such collections (32). Hennig (80) characterized this boundary as the dividing line between the realms of ‘‘tokogenetic’’ associations (genetic relationships between individuals) and ‘‘phylogenetic’’ associations (genetic relationships between species), or the border between reticulate and divergent relationships. This boundary also demarcates the areas of inquiry traditionally associated with two of the major disciplines within
evolutionary biology: phylogenetic biology (macroevolution) and population genetics (microevolution). The PSC has roots in the field of systematics but, as applied at microevolutionary levels, has ignored principles of Mendelian and population genetics (at its peril). Conversely, the BSC has roots in population genetics but now might profit from an infusion of appropriate phylogenetic considerations to illuminate previously underemphasized elements of genealogical history, both of alleles and of populations over microevolutionary scales. Two approaches can be taken to accommodate the distinct world views of phylogenetic biology and population genetics. The first is to claim that phylogenetic concepts are devoid of jurisdiction and meaning at intraspecific levels (80). However, as evidenced by considerations of allelic genealogies, this perception is incorrect. Thus, a more fruitful endeavor is to attempt affirmative rapprochements between phylogenetic biology and population genetics by drawing conceptual and empirical connections between these disciplines (47). Considerations of multilocus allelic coalescent processes, perhaps as suggested in this paper, provide an interesting avenue for the further exploration of such possibilities. Biological taxonomy and classification can be viable disciplines without any recognition of evolution, just as rocks and minerals can be named and classified into groups. Indeed, Darwin’s (81) elucidation of evolutionary processes had virtually no impact on the day-to-day practice of naming and grouping species. Today, many phylogenetic biologists argue that the recognition of pattern in phylogeny should not be confused with nor unduly influenced by whatever hypothesized processes (e.g., reproductive isolation) might have contributed to the historical configurations (9, 15, 16, 82). 
The ‘‘pattern cladists’’ make valid cautionary points about objectivity in scientific explanation, and in practice ‘‘species’’ in the natural world can be identified and pigeonholed under appropriate phylogenetic procedures without consideration of evolutionary-shaping processes. However, to cleanse from species concepts all references to reproductive isolation would be to leave an unduly sterile epistemological foundation for the origin and maintenance of the biotic discontinuities so evident to Dobzhansky 60 years ago. If concepts resembling the BSC had not existed throughout this century, in the light of modern coalescent theory and associated multilocus genealogical concordance principles, they surely now would demand invention. O’Hara (66) has likened the challenge of phylogenetic summary in biology to that of cartographic representation in geography. Phylogenies and maps alike are simplifications of reality, generalized representations with events selectively deleted according to the level and nature of detail required. An
interstate road map of the United States may be helpful in driving cross-country but is of no use in navigating the Freedom Trail in Boston, for which a fine-grained local map provides the appropriate resolution. Similarly, phylogenetic summaries capture varying degrees of generalization about the streams and watersheds of heredity that make phylogeny, and a given depiction should be matched to the problem at hand. It has been the thesis of this paper that the ‘‘species problem’’ cannot be properly addressed from a phylogenetic perspective without reference to the fine-focus details of pedigrees and of lineage sorting processes at microevolutionary scales, and that an incorporation of such perspectives can resolve many of the apparent conflicts previously emphasized between the PSC and the BSC. To paraphrase and adapt the quotation from Dobzhansky (1) that opened this paper: population genetics has so profound a bearing on the problem of the mechanisms of speciation that any speciation theory that disregards established population genetic principles is faulty at its source.

We thank the National Science Foundation for continued support of the Avise laboratory.
1. Dobzhansky, T. (1937) Genetics and the Origin of Species (Columbia Univ. Press, New York).
2. Mayr, E. (1940) Am. Nat. 74, 249–278.
3. Otte, D. & Endler, J. A., eds. (1989) Speciation and Its Consequences (Sinauer, Sunderland, MA).
4. Martin, G. (1996) Nature (London) 380, 666–667.
5. McKitrick, M. C. & Zink, R. M. (1988) Condor 90, 1–14.
6. Zink, R. M. & McKitrick, M. C. (1995) Auk 112, 701–719.
7. Rosen, D. E. (1979) Bull. Am. Mus. Nat. Hist. 162, 267–376.
8. Eldredge, N. & Cracraft, J. (1980) Phylogenetic Patterns and the Evolutionary Process (Columbia Univ. Press, New York).
9. Nelson, K. C. & Platnick, N. I. (1981) Systematics and Biogeography (Columbia Univ. Press, New York).
10. Cracraft, J. (1983) Curr. Ornithol. 1, 159–187.
11. Cracraft, J. (1987) Biol. Philos. 2, 329–346.
12. Donoghue, M. J. (1985) Bryologist 88, 172–181.
13. Mishler, B. D. & Brandon, R. N. (1987) Biol. Philos. 2, 397–414.
14. de Queiroz, K. & Donoghue, M. J. (1988) Cladistics 4, 317–338.
15. Wheeler, Q. D. & Nixon, K. C. (1990) Cladistics 6, 77–81.
16. Nixon, K. C. & Wheeler, Q. D. (1990) Cladistics 6, 211–223.
17. Cracraft, J. (1989) in Speciation and Its Consequences, eds. Otte, D. & Endler, J. A. (Sinauer, Sunderland, MA), pp. 28–59.
18. Frost, D. R. & Hillis, D. M. (1990) Herpetologica 46, 87–104.
19. Mallet, J. (1995) Trends Ecol. Evol. 10, 490–491.
20. Avise, J. C. (1994) Molecular Markers, Natural History and Evolution (Chapman & Hall, New York).
21. Maddison, W. (1995) in Experimental and Molecular Approaches to Plant Biosystematics, eds. Hoch, P. C. & Stephenson, A. G. (Missouri Botanical Garden, St. Louis), pp. 273–287.
22. Mayr, E. (1963) Animal Species and Evolution (Harvard Univ. Press, Cambridge, MA).
23. Giddings, L. V., Kaneshiro, K. Y. & Anderson, W. W. (1989) Genetics, Speciation and the Founder Principle (Oxford Univ. Press, New York).
24. Bush, G. L. (1975) Annu. Rev. Ecol. Syst. 6, 339–364.
25. Harrison, R. G. (1991) Annu. Rev. Ecol. Syst. 22, 281–308.
26. Patton, J. L. & Smith, M. F. (1989) in Speciation and Its Consequences, eds. Otte, D. & Endler, J. A. (Sinauer, Sunderland, MA), pp. 284–304.
27. Simpson, G. G. (1945) Bull. Am. Mus. Nat. Hist. 85, 1–350.
28. Neigel, J. E. & Avise, J. C. (1986) in Evolutionary Processes and Theory, eds. Nevo, E. & Karlin, S. (Academic, New York), pp. 515–534.
29. Avise, J. C. (1995) Conserv. Biol. 9, 686–690.
30. Avise, J. C. (1989) Nat. Hist. 3, 24–27.
31. Doyle, J. J. (1995) Syst. Botany 20, 574–588.
32. Brower, A. V. Z., DeSalle, R. & Vogler, A. (1996) Annu. Rev. Ecol. Syst. 27, 423–450.
33. Griffiths, R. C. (1980) Theor. Popul. Biol. 17, 37–50.
34. Kingman, J. F. C. (1982) Stochastic Processes Appl. 13, 235–248.
35. Tajima, F. (1983) Genetics 105, 437–460.
36. Avise, J. C., Neigel, J. & Arnold, J. (1984) J. Mol. Evol. 20, 99–105.
37. Avise, J. C., Ball, R. M., Jr., & Arnold, J. (1988) Mol. Biol. Evol. 5, 331–334.
38. Tavaré, S. (1984) Theor. Popul. Biol. 26, 119–164.
39. Takahata, N. & Nei, M. (1985) Genetics 110, 325–344.
40. Hudson, R. R. (1990) Oxford Surv. Evol. Biol. 7, 1–44.
41. Donnelly, P. & Tavaré, S. (1995) Annu. Rev. Genet. 29, 401–421.
42. Pamilo, P. & Nei, M. (1988) Mol. Biol. Evol. 5, 568–583.
43. Takahata, N. (1989) Genetics 122, 957–966.
44. Wu, C.-I. (1991) Genetics 127, 429–435.
45. Hudson, R. R. (1983) Evolution 37, 203–217.
46. Hey, J. (1994) in Molecular Ecology and Evolution: Approaches and Applications, eds. Schierwater, B., Streit, B., Wagner, G. P. & DeSalle, R. (Birkhaeuser, Basel), pp. 435–449.
47. Avise, J. C., Arnold, J., Ball, R. M., Jr., Bermingham, E., Lamb, T., Neigel, J. E., Reeb, C. A. & Saunders, N. C. (1987) Annu. Rev. Ecol. Syst. 18, 489–522.
48. Avise, J. C. (1989) Evolution 43, 1192–1208.
49. Nei, M. (1987) Molecular Evolutionary Genetics (Columbia Univ. Press, New York).
50. Birky, C. W., Jr., Maruyama, T. & Fuerst, P. (1983) Genetics 103, 513–527.
51. Slatkin, M. & Hudson, R. R. (1991) Genetics 129, 555–562.
52. Takahata, N. (1991) Genetics 129, 585–595.
53. Rogers, A. R. & Harpending, H. (1992) Mol. Biol. Evol. 9, 552–569.
54. Marjoram, P. & Donnelly, P. (1994) Genetics 136, 673–683.
55. Rogers, A. R. (1995) Evolution 49, 608–615.
56. Kaplan, N. L., Darden, T. & Hudson, R. R. (1988) Genetics 120, 819–829.
57. Kaplan, N. L., Hudson, R. R. & Langley, C. H. (1989) Genetics 123, 887–899.
58. Hudson, R. R. & Kaplan, N. L. (1995) Genetics 141, 1605–1617.
59. Takahata, N. (1990) Proc. Natl. Acad. Sci. USA 87, 2419–2423.
60. Takahata, N. (1993) Mol. Biol. Evol. 10, 2–22.
61. Takahata, N. & Nei, M. (1990) Genetics 124, 967–978.
62. de Queiroz, K. & Donoghue, M. J. (1990) Cladistics 6, 61–75.
63. Graybeal, A. (1995) Syst. Biol. 44, 237–250.
64. Baum, D. A. & Shaw, K. L. (1995) in Experimental and Molecular Approaches to Plant Biosystematics, eds. Hoch, P. C. & Stephenson, A. G. (Missouri Botanical Garden, St. Louis), pp. 289–303.
65. Moore, W. S. (1995) Evolution 49, 718–726.
66. O’Hara, R. J. (1993) Syst. Biol. 42, 231–246.
67. Ayala, F. J. & Escalante, A. A. (1996) Mol. Phylogenet. Evol. 5, 188–201.
68. Avise, J. C. & Hamrick, J. L., eds. (1996) Conservation Genetics: Case Histories from Nature (Chapman & Hall, New York).
69. Avise, J. C. & Ball, R. M., Jr. (1990) Oxford Surv. Evol. Biol. 7, 45–67.
70. Mallet, J. (1995) Trends Ecol. Evol. 10, 294–299.
71. Miyamoto, M. M. & Fitch, W. M. (1995) Syst. Biol. 44, 64–76.
72. Kluge, A. G. (1989) Syst. Zool. 38, 7–25.
73. Hillis, D. M., Mable, B. K. & Moritz, C. (1996) in Molecular Systematics, eds. Hillis, D. M., Moritz, C. & Mable, B. K. (Sinauer, Sunderland, MA), pp. 515–543.
74. SAS Institute (1988) SAS/STAT User’s Guide (SAS Institute, Cary, NC), Release 6.09.
75. Hartl, D. L. & Clark, A. G. (1989) Principles of Population Genetics (Sinauer, Sunderland, MA), 2nd Ed.
76. Sneath, P. H. A. & Sokal, R. R. (1973) Numerical Taxonomy (Freeman, San Francisco).
77. Wilson, E. O. & Brown, W. L., Jr. (1953) Syst. Zool. 2, 97–111.
78. Palumbi, S. R. & Baker, C. S. (1994) Mol. Biol. Evol. 11, 426–435.
79. Slade, R. W., Moritz, C. & Heideman, A. (1994) Mol. Biol. Evol. 11, 341–356.
80. Hennig, W. (1966) Phylogenetic Systematics (Univ. Illinois Press, Urbana).
81. Darwin, C. (1859) On the Origin of Species (John Murray, London).
82. Brady, R. H. (1985) Cladistics 1, 113–126.
Proc. Natl. Acad. Sci. USA Vol. 94, pp. 7756–7760, July 1997 Colloquium Paper
This paper was presented at a colloquium entitled ‘‘Genetics and the Origin of Species,’’ organized by Francisco J. Ayala (Co-chair) and Walter M. Fitch (Co-chair), held January 30–February 1, 1997, at the National Academy of Sciences Beckman Center in Irvine, CA.
Assortative fertilization in Drosophila
THERESE ANN MARKOW*
Department of Zoology, Arizona State University, Tempe, AZ 85287-1501
© 1997 by The National Academy of Sciences 0027-8424/97/947756-5$2.00/0 PNAS is available online at http://www.pnas.org.
ABSTRACT The concept of gametic isolation has its origins in the 1937 edition of T. Dobzhansky’s Genetics and the Origin of Species. Involving either positive assortative fertilization (as opposed to self-incompatibility) or negative assortative fertilization, it occurs after mating but prior to fertilization. Gametic isolation is generally subsumed under either prezygotic or postmating isolation and thus has not been the subject of extensive investigation. Examples of assortative fertilization in Drosophila are reviewed and compared with those of other organisms. Potential mechanisms leading to assortative fertilization are discussed, as are their evolutionary implications.
In the 1937 edition of Genetics and the Origin of Species, Dobzhansky (1) described four subtypes of ‘‘physiological’’ isolating mechanisms that exist when parental forms occur together. Three of these were given formal names: ‘‘sexual isolation,’’ ‘‘mechanical isolation,’’ and ‘‘inviability of the hybrids.’’ The unnamed mechanism was described as when the ‘‘spermatozoa fail to reach the eggs, or to penetrate into the eggs; in higher plants, the pollen tube growth may be arrested if foreign pollen is placed on the stigma of the flower.’’ These events were attributed to ‘‘chains of reaction that bring about the actual union of gametes, or fertilization proper.’’ Examples of this fourth mechanism provided in the 1937 edition were limited, confined to marine invertebrates and incompatibility between plant species, and were accompanied by speculations as to the viability requirements of sperm of various vertebrates. In the 1951 edition, not only had this mechanism been given a formal name (2), ‘‘gametic or gametophytic isolation,’’ with the definition ‘‘spermatozoa, or pollen tubes, of one species are not attracted to the eggs or ovules, or are poorly viable in the sexual ducts of another species,’’ but also examples were included from several species of Drosophila in which the viability of stored sperm was notably decreased in interspecific crosses (3–5).
Despite the growing list of examples of potential gametic isolation, it remains inadequately understood. One reason for this has been the focus of interest on mechanisms that act either earlier or later, as exemplified by the two common terminologies employed in studies of reproductive isolation: premating vs. postmating and prezygotic vs. postzygotic. These are primarily concerned with sexual isolation vs. hybrid inviability or sterility:
1. Prezygotic  ------------------------------------------|------------------------->  Postzygotic
               (Sexual isolation)    (Gametic isolation)    (Inviability, Sterility)
2. Premating   -------------------|------------------------------------------------>  Postmating
Gametic isolation, because it is included by default in one of these forms of isolation (6), tends not to be singled out for study. This is not surprising, given the ease of quantifying either sexual isolation or hybrid sterility as compared with tests for gametic interactions. Gametic isolation, as a form of reproductive isolation, is equivalent to positive assortative fertilization. This, however, is only half of the story because, as with assortative mating, nonrandom union of gametes may vary between homogamy (positive assortative fertilization) and heterogamy (negative assortative fertilization), with implications for population differentiation similar to those of positive and negative assortative mating. The difference is that assortative fertilization occurs at a later point in the sequence between mating and offspring production.
This presentation will concern itself with factors influencing assortative fertilization, once copulation has occurred, but prior to zygote formation, during the ‘‘chains of reaction that bring about the actual union of gametes.’’ Our interest, however, is not in all barriers to fertilization, but only in those that underlie assortative fertilization. Fertilization barriers may occur for a variety of reasons: sterility of the male or female, lack of oviposition sites, age of the male or female, interference from the ejaculate of another male, and genotypes of the male and female. Assortative fertilization is a subtype of fertilization barrier that depends upon characters of the female and male such that gametes from like or unlike parents have a greater or lesser than random chance of uniting. In the genus Drosophila, there is an incredible amount of variation in gametes and other internal reproductive characters with the potential to influence assortative fertilization. Evidence for assortative fertilization will be presented, including some recent observations from my own laboratory. Then, the assumptions regarding the mechanisms underlying assortative fertilization will be described. Finally, the potential impact of assortative fertilization on population differentiation and speciation will be discussed.
Interspecific Assortative Fertilization in Drosophila
There are a number of examples of postmating, prezygotic interactions between Drosophila species that result in positive assortative fertilization. In Drosophila, all published reports are of single matings by females. Examples include a failure of heterospecific sperm to enter the female storage organs, a reduction in motility of stored sperm, or both. Patterson and his associates (3) describe crosses between five different species of the virilis group in which the relative amounts and motilities of stored sperm were compared in heterospecific vs. homospecific inseminations. Motility and quantity of stored sperm were less in heterospecific crosses, suggesting an incompatibility with the female reproductive tract that caused their death. A similar observation was reported for sperm of Drosophila athabasca males in the reproductive tracts of Drosophila affinis females (4). Fuyama (7) was able to obtain
a small number of inseminations of Drosophila pulchrella females by Drosophila suzukii males. While the numbers of sperm stored in the females appeared to be lower than in either conspecific cross, presumably due to the shorter duration of interspecific copulations, the viability of these sperm was elegantly demonstrated by stimulating their use with injected accessory gland products of conspecific males (7). In many Drosophila species, primarily of the quinaria and repleta groups, homospecific mating normally is followed by the formation of a large opaque mass in the vagina, the insemination reaction, that typically lasts for 7 to 9 hr (8). Its disappearance is coincident with the onset of oviposition and subsequent remating by the female (9). In heterospecific matings this mass usually lasts considerably longer, and judging from the observations summarized by Patterson and Stone (8), and the phylogenetic relationships (10) of the species they examined, the size and duration of the mass appear to be related to the degree of divergence between the species involved. Patterson and co-workers suggested that this insemination reaction functions as an isolating mechanism, because in some crosses, they observed dead sperm within the mass and, in extreme cases, heterospecifically mated females never remated. These examples all involve single heterospecific matings. In two other groups of insects, flour beetles (11, 12) and crickets (13–16), positive assortative fertilization was observed only under conditions of double matings, matings involving both a homo- and a heterospecific male. Heterospecific matings, both in flour beetles and in crickets, result in the production of offspring, although in all cases except crosses involving one type of Tribolium female, T. freemani, the numbers of offspring are significantly lower than in conspecific matings.
When females are mated twice, to a heterospecific and to a homospecific male, a significant majority of the offspring are sired by males of their own species, regardless of mating order. In both cases, positive assortative fertilization is the outcome of interejaculate competition dependent upon female genotype. Because in most of these combinations there were reduced numbers or viability of interspecific sperm after transfer, the observed effect is likely to have resulted merely from numerical swamping by viable conspecific sperm. This is not likely to be the case in T. freemani females, where sperm of both types of males appear equally viable.
Intraspecific Assortative Fertilization in Drosophila
Assortative mating, so pronounced between true species, is often examined in detail among distinct populations of the same species to identify antecedents of speciation. The same approach can be employed in treating assortative fertilization. Assortative fertilization within a species can potentially occur when a female has mated to only one male or when she has multiple mates, as seen for interspecific matings. In Drosophila, I could find no examples of positive assortative fertilization involving either single or multiple homospecific matings. This does not mean that they do not exist. There are, however, sperm utilization patterns that clearly appear to exemplify negative assortative fertilization. Widely known in plants, the possibility of self-incompatibility in animals has received little attention. There is some evidence, such as increased recurrent spontaneous abortion among couples sharing HLA haplotypes (17), that similarities with respect to major histocompatibility complex variation may influence fertilization or implantation success, but this is poorly defined.
Investigators working with cactophilic Drosophila have long been aware that, compared with Drosophila melanogaster, it is very difficult to create isofemale lines of these species. Markow
(18), in a study of inbreeding, reported the existence of a self-sterility phenomenon in Drosophila mojavensis, revealed when sib-mated lines failed to reproduce. Females mated to their brothers were full of motile sperm, and mature ovarian oocytes, but did not oviposit. The viability of these sperm was demonstrated when they were rescued from the ventral receptacles by the sperm-free ejaculates of unrelated males. The fact that these sperm could be ‘‘rescued’’ suggests that the responsible mechanism is a function of nonsperm ejaculate components and their interaction with females. Further analysis has been precluded by the absence of genetically marked chromosome balancers in this species. A similar phenomenon appears to exist in at least one other species of desert Drosophila. Drosophila nigrospiracula is a cactophilic species in which females will remate up to four times in a given morning. As with D. mojavensis, unless a female is confined with more than one of her brothers (R. L. Mangan, personal communication), attempts to inbreed this species result in most sib-mated lines failing within a few generations (T.A.M., S. Bertram, and S. Murphy, unpublished results). Dissection of adults reveals that parents are fully capable of mating and sperm transfer: in most nonreproductive pairs, females contain large numbers of motile sperm in their ventral receptacles. For some reason, however, these sperm are not being used. Experiments were conducted to assess whether the males or females were, for some unapparent reason, sterile, and if not, whether, as with D. mojavensis, degree of relatedness acts to prevent sperm utilization. In one set of experiments, oviposition was compared between females mated once to a brother following one generation of sib mating, and females mated to a random male. There was no difference in copulation duration for females mated to sibs compared with random males. In both replications, two things are clear (Table 1). 
First, fewer females laid any eggs when mated to their brothers than did females mated to random males. Second, those females that did not oviposit did not fail to do so for lack of motile sperm. A high proportion of females did not utilize the sperm they carried, at least from a single mating. In the second experiment, females were mated three times in the same morning. Matings were either all to the same male, ‘‘random’’ or ‘‘sib’’ (a brother from a one-generation sib-mated line), or twice consecutively to a sib and then to a random male from the population (Table 1). In both replications, a mating to an unrelated male was associated with increased oviposition, suggesting that degree of relatedness is an important factor in determining whether females of this species give up their oocytes. Females mating three times produced more eggs than females mating once, but there was no difference in number of eggs between ovipositing females that mated only with related males (number of eggs per female = 96.1 ± 4.9) vs. unrelated males (number of eggs per female = 94.2 ± 7.3). About 95% of these eggs hatch, similar to what is reported in other Drosophila species. It is clear that in both of these cactophilic species, fertilization is negatively assortative. These observations are reminiscent of self-incompatibility in plants, raising the question of whether self-incompatibility mechanisms exist in animals. They may, and simply have gone undetected. A remaining question is whether there can be positive assortative fertilization under conditions of a single intraspecific mating. If assortative fertilization within a species is contributing to the evolution of gametic isolation, we expect to also find examples in which it is positive.
In species where females remate, an interaction between overlapping ejaculates may contribute another dimension to fertilization barriers. When ejaculates of more than one male overlap, a suite of other interactions is possible. These come into play following the failure of first males to prevent remating before their sperm is used up by the female. For most Drosophila species, however, these mechanisms are not completely efficient, and in Drosophila species in which it has been measured, females typically carry sperm from more than one male (19–23). Sperm use by multiply mated females, however, is rarely a random process, and other, intermale, interactions may play themselves out inside the female, serving as barriers to fertilization. In most cases, the majority of sperm recovered is from the last male to mate. Equal mixing, however, is more likely in species that transfer fewer sperm when matings are close together (24).

Table 1. Reproductive failure and sperm presence in inbred and random pair matings of D. nigrospiracula

Exp.   Male      Replication   No. of females   Females not ovipositing   Females with sperm
1×a    Brother   1             45               34/45 (75%)               31/34 (91%)
1×a    Random    1             39               12/39 (31%)               7/12 (58%)
1×b    Brother   2             31               24/31 (77%)               19/24 (79%)
1×b    Random    2             25               9/25 (36%)                5/9 (55%)
3×a    3×        1             19               13/19 (68%)               11/12 (92%)
3×a    2×:1×     1             21               4/21 (19%)                4/4 (100%)
3×b    3×        2             12               8/12 (67%)                8/8 (100%)
3×b    2×:1×     2             17               2/17 (12%)                2/2 (100%)

Females in experiments 1×a and 1×b were mated once, to either a male chosen at random from the mass mating culture or to their brother after one generation of sib-mating. Females in experiments 3×a and 3×b were mated three times, either all three times to the same male, a brother, or twice to their brother and then to a male chosen at random from the mass culture. All matings were observed to ensure normal copulations were achieved. Mated females were separated from males and allowed to lay eggs in yeasted vials, changed daily, for 1 week. Females not ovipositing after 1 week were dissected and examined for the presence of motile sperm.
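The contrast reported in Table 1 can be checked with a short calculation. The following is a minimal sketch and not the analysis used in the study: it applies a two-sided Fisher's exact test, a test chosen here only for illustration, to the counts of non-ovipositing females in each experiment. The names `TABLE_1`, `fisher_exact_two_sided`, and `compare` are hypothetical.

```python
from math import comb

# Counts transcribed from Table 1: (females not ovipositing, females tested).
# "brother" = mated only to a brother; "random" = the treatment that includes
# a random male (a single random mating in 1x, two sib matings then one
# random mating in 3x).
TABLE_1 = {
    "1xa": {"brother": (34, 45), "random": (12, 39)},
    "1xb": {"brother": (24, 31), "random": (9, 25)},
    "3xa": {"brother": (13, 19), "random": (4, 21)},
    "3xb": {"brother": (8, 12), "random": (2, 17)},
}


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x "failures" in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


def compare(experiment):
    """Return (fraction failing with brother, with random male, p-value)."""
    failed_b, n_b = TABLE_1[experiment]["brother"]
    failed_r, n_r = TABLE_1[experiment]["random"]
    p = fisher_exact_two_sided(failed_b, n_b - failed_b,
                               failed_r, n_r - failed_r)
    return failed_b / n_b, failed_r / n_r, p


for name in TABLE_1:
    frac_b, frac_r, p = compare(name)
    print(f"{name}: brother {frac_b:.2f}, random {frac_r:.2f}, p = {p:.4f}")
```

Each experiment is treated as an independent 2×2 table here; pooling replications or modelling them jointly would be a different design choice.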
Nonrandom recovery of sperm following multiple mating, if it depends only upon the properties of the multiple competing ejaculates, and even in some cases where it is dependent upon female genotype (25, 26), differs from assortative fertilization as defined above, although some mechanisms may be the same. Recently, an effect similar to the negative assortative fertilization observed in D. mojavensis and in D. nigrospiracula has been inferred by Olsson et al. (27) in a sand lizard, Lacerta agilis, except that the observed assortative fertilization was detected in multiply rather than singly mated females. Degree of relatedness was inferred from the degree of sharing of DNA fingerprint bands. There was a reduction in the proportion of offspring, in multiply fathered clutches, sired by males that were genetically similar in DNA fingerprints to the female. This situation differs from the typically measured outcomes of sperm competition in which the genotype of the female is unimportant. In that study, the authors explain that L. agilis females ‘‘select’’ sperm to use and ‘‘prefer’’ sperm from unrelated males. It is not necessary to use the terms ‘‘select’’ and ‘‘prefer’’ here any more than it is in the case of self-incompatibility in plants. The difference here is that the mechanism is not understood at the same level as it is in plants. In conclusion, there is evidence that assortative fertilization exists within species in Drosophila, and in a lizard, but that it is associated with heterogamy rather than homogamy. The existence of negative assortative fertilization, however, suggests the existence of the requisite mechanisms for positive assortative fertilization as well. Additional and different kinds of studies must be undertaken to define these mechanisms.
Mechanisms
In organisms such as Drosophila where fertilization is internal and females store sperm, postmating barriers to fertilization can exist at a variety of levels.
These constitute the ‘‘chains of reaction that bring about the actual union of gametes . . . .’’
Sperm must successfully enter the female and be transported to the storage organs, the spermathecae or ventral receptacle. They must stay alive with adequate motility until they are utilized by the female, and later be recovered from the storage organs, activated, and enter an oocyte as it passes through the reproductive tract. Entrance of a single sperm to the egg, in the case of Drosophila through the micropyle, must be normal and trigger the formation of a normal zygote. A failure at any of these steps prevents fertilization. To promote assortative fertilization, there must be some degree of heritable specificity in the male and female components of the above process; thus, fertilization barriers that involve abnormal or subviable sperm or ejaculates are not those of interest. The relevant mechanisms are those that are associated with naturally occurring, normal variation in the population, such that the male effectively signals the female to keep his sperm alive and to utilize them in fertilization. For the sake of simplicity, the mechanisms can be envisioned as having three levels:
Ejaculate variability
↓
Female variability in detection/response
↓
Differential fertilization
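The end point of this scheme, differential fertilization, can be illustrated with a toy calculation. The sketch below is purely hypothetical and is not a method from the studies cited: it assumes that the outcome of the male signal and the female response can be summarized as a fertilization success value for each female-by-male combination, and it defines an illustrative index (the name `fertilization_index` and all numbers are assumptions) that is positive under homogamy and negative under heterogamy.

```python
from typing import Dict, Tuple

# Hypothetical summary of the three levels: for each (female type, male type)
# pair, the fraction of inseminated females whose stored sperm are used.
Success = Dict[Tuple[str, str], float]


def fertilization_index(success: Success) -> float:
    """Illustrative index in [-1, 1].

    Positive values mean like-by-like combinations fertilize more often
    (homogamy); negative values mean unlike combinations do (heterogamy).
    """
    like = [p for (female, male), p in success.items() if female == male]
    unlike = [p for (female, male), p in success.items() if female != male]
    if not like or not unlike:
        raise ValueError("need both like and unlike combinations")
    mean_like = sum(like) / len(like)
    mean_unlike = sum(unlike) / len(unlike)
    total = mean_like + mean_unlike
    if total == 0:
        raise ValueError("no fertilization in any combination")
    return (mean_like - mean_unlike) / total


# Made-up values for two populations, A and B.
homogamy = {("A", "A"): 0.9, ("A", "B"): 0.3, ("B", "A"): 0.3, ("B", "B"): 0.9}
heterogamy = {("A", "A"): 0.25, ("A", "B"): 0.75, ("B", "A"): 0.75, ("B", "B"): 0.25}

print(fertilization_index(homogamy))    # positive: positive assortative fertilization
print(fertilization_index(heterogamy))  # negative: negative assortative fertilization
```

Averaging the like and unlike cells treats the two populations symmetrically; asymmetrical isolation would call for reporting each direction of cross separately.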
This general scheme will be true for cases in which females contain the ejaculate of one or of multiple males, and the potential signals and receptors may be multiple and complex. Both the sperm and nonsperm components of the ejaculate are known to be extremely variable in Drosophila. Sperm in Drosophila species are more variable in length than in all other animal taxa combined. They range from 0.32 mm in Drosophila persimilis (28) to 58.29 mm in Drosophila bifurca (29); in the latter case sperm are about 12 times the length of the male himself. Within species, however, there is usually little variation. Pitnick and Markow (30) examined six strains of Drosophila hydei, finding a range of 23.02 mm to 25.91 mm. Snook (57) found that the long-sperm morph of North American Drosophila subobscura is 0.448 mm compared with 0.327 mm in a European strain. Species also differ with respect to how much of the sperm tail enters the egg (31); such differences could be important in postfertilization isolation. Female sperm storage organs show striking interspecific variability as well (32); species also differ in which of the two organs, the spermathecae or the ventral receptacle, is used to store sperm. This enormous variability raises the question as to whether assortative fertilization may be mediated, in part, by a mismatch between sperm morphology and that of female storage organs. For example, in the four species of the nannoptera group, sperm differ considerably in length, as do sites of storage in the females (33). Females of Drosophila pachea and Drosophila wassermani store sperm only in the spermathecae, while Drosophila nannoptera females use only the ventral receptacle. Crosses were made among these species and the fate of sperm was examined (34). In crosses between D. pachea females and D. wassermani males, dead sperm were found in the spermathecae of 14/23 females. In D. pachea females crossed to D.
nannoptera males, dead sperm were found in the spermathecae of 3/15 females. In this cross sperm storage location was controlled by the female, rather than the male. The lumen of the D. pachea ventral receptacle is half the diameter of the D. nannoptera receptacle (35), and while such a morphological difference may dictate storage site, sperm viability or its reduction may still result from biochemical interactions between the ejaculate and the female reproductive tract. Several approaches might be useful in assessing the role of sperm length differences in assortative fertilization. One would be to transplant testes between species (36), to create the necessary combinations of sperm and female reproductive tracts. Another is to utilize natural differences in sperm length,
in those cases in which they exist, between closely related species or divergent populations of the same species. The existence of qualitative differences in sperm themselves has not been directly shown. Several lines of evidence, however, strongly suggest expressed genetic differences between sperm, at least at the interspecific levels. For example, it is difficult to imagine, in the case of T. freemani, where homospecific sperm are favored regardless of mating order, how positive assortative fertilization could occur without heritable, qualitative differences on the sperm themselves. Furthermore, Thomas and Singh (37) have shown significant differences in testes proteins within and between related Drosophila species, and while these were not shown to be a property of the sperm themselves, they are certainly consistent with the prediction of qualitative differences. The existence of qualitative differences on the surface of the sperm that would interact differently with the female reproductive tract or with the ejaculates of other males seems to be a prerequisite in certain cases of assortative fertilization. On the other hand, extensive variation has been documented in the chemistry and function of the nonsperm component of the ejaculate, specifically the accessory gland proteins or Acps (38). There are approximately 80 of these proteins (M. Wolfner, personal communication), all transferred to females at the time of mating. Only 9 have been characterized in detail with respect to chemistry and function, but it is clear from studies to date that this group of substances serves to (i) alter female behavior, stimulating oviposition and delaying female remating, and (ii) facilitate the storage of sperm in the female.
It is clear that within species there is considerable sequence polymorphism at the loci encoding these proteins (39) and that the proteins show considerable sequence divergence between species (40) as well as species specificity (7, 41) of their functions. In D. melanogaster, variation in four accessory gland proteins examined appears to be associated with displacement abilities in ejaculate competition experiments, although these experiments were not designed to detect any interaction with female genotype (42). Once the sperm is inside the female, there are a variety of reactions that must occur for successful fertilization. Some of these reactions occur within the female reproductive tract itself, but others involve the action of male-derived substances in other sites in the female. Evidence of the reactions inside the female reproductive tract is both direct and indirect. For example, Acp36DE has been shown to localize at the entrance to the sperm storage organs, in effect corralling the sperm into storage (43). How this protein specifically identifies the appropriate site in the female tract is unknown, but it must rely on biochemical properties expressed highly locally in the female tract. If this identification is species specific, it could easily explain why in some interspecific crosses, few sperm are seen in storage (3–5). Excellent examples of reactions that occur outside the female tract come from studies of three proteins, the sex peptide (44), esterase 6 (45), and Acp26Aa (38). These small proteins are transferred rapidly to the female hemolymph, even before copulation has terminated, where they produce, in the female, the same response: increased oviposition and decreased receptivity to remating. Their modes of transport from the reproductive tract are unknown. In the case of the sex peptide, both effects on female behavior stem from the same unknown molecular target (46). 
The increase in oogenesis is mediated by the resultant increase in juvenile hormone synthesis in response to the sex peptide. The exact target of Acp26Aa also is unknown, but it exerts its effect by means of the thoracic ganglion (38). Acp26Aa is a polymorphic protein, raising the question of the efficiency of different morphs in altering female behavior. An important direction for future research is the identification of the female targets of these
proteins and the detection of variation in the targets that could provide mechanisms for differential fertilization. Species-specific morphological and biochemical features of the female tract are easily inferred from the larger size and longer duration of the insemination reaction mass, as well as the presence of dead sperm, in certain interspecific crosses. In house flies, female accessory gland secretions have been demonstrated to activate the sperm acrosome, enabling it to penetrate the micropyle (47, 48). The actual substance involved has not been identified, nor is there any evidence as to species specificity in the activation process.
Origins and Implications for Population Differentiation and Reproductive Isolation
What are the origins of the preceding examples of assortative fertilization? The examples described above are very different and thus are likely to result from different mechanisms. All depend, however, on the specificity, or its breakdown, of the chain of events leading to fertilization. One possibility is the accumulation of different mutations in geographically isolated populations, similar to the model proposed by Orr (49) to explain the asymmetries in postzygotic isolation. Most of the examples of postmating but prezygotic isolation are asymmetrical. On the other hand, interactions between the sexes taking place within the female’s reproductive tract have important fitness consequences for females, and also for males if their sperm is not used immediately. Thus males are expected to evolve seminal fluid components that are more effective in inducing oviposition and postponing female remating, while females are expected to simultaneously adapt to negate toxic effects of ejaculates (50) and to retain control over their oviposition (51, 52). These conflicting pressures are proposed as the driving force in the rapid coevolution of the signaling that occurs within the reproductive tract, potentially acting as an ‘‘engine’’ of speciation (53).
Because this coevolution is expected to evolve differently in different populations, there is a potential for a major mismatch between opposite sexes of separate populations that can manifest itself as assortative fertilization (54). Another potential selective force is pathogen resistance. The insemination reaction has been likened to an immune response in which females react to foreign material they receive at mating. Self-incompatibility in some plants has been shown to be a function of pistil S-proteins, RNases, and their evolutionary origin has been suggested to be the recruitment of RNases originally involved in protecting the pistil from infection (55). The incidence of sexually transmitted extracellular pathogens in Drosophila is unexplored, but the conditions inside the mated female reproductive tract that are conducive to the maintenance of sperm viability, namely the nutrient-rich environment, should also provide a suitable habitat for pathogen growth. Thus it would not be surprising if female reproductive tracts were immunoreactive. If self-incompatibility, or negative assortative fertilization, exists within Drosophila species, its origin could also be associated with inbreeding avoidance. The two species, D. mojavensis and D. nigrospiracula, in which it appears to exist are species in which resource availability and environmental extremes conspire to create situations in which sib mating could be a common occurrence (18, 56). In both of these species, males deliver comparatively few sperm on a given mating and females remate very frequently, additional mating system features that would serve to minimize inbreeding. The evolutionary implications of assortative fertilization resemble those for assortative mating. Gametic isolation, the extreme form of positive assortative fertilization, should prevent gene flow between populations. 
Whether gametic isolation can be the primary isolating mechanism, especially if other factors promote crossing, or whether it typically appears in
some specific order relative to other isolating mechanisms (i.e., premating or postzygotic) has not been addressed. Negative assortative fertilization should promote outcrossing. Whether or not true self-incompatibility exists in Drosophila or other animals remains to be established, as does the nature of the interplay between negative assortative fertilization and other isolating mechanisms that act before or after fertilization.
24. 25. 26. 27.
Discussions with M. Wolfner and R. Richmond and with my former students S. Pitnick and R. Snook have been especially stimulating during the preparation of this manuscript. Bruce Wallace provided valuable editorial comments on an earlier draft. I also acknowledge the assistance of M. St. Louis, S. Murphy, S. Cleland, and S. Bertram and the support of National Science Foundation Grants INT 94-02161 and DEB 95-10645.
Proc. Natl. Acad. Sci. USA Vol. 94, pp. 7761–7767, July 1997 Colloquium Paper
This paper was presented at a colloquium entitled ‘‘Genetics and the Origin of Species,’’ organized by Francisco J. Ayala (Co-chair) and Walter M. Fitch (Co-chair), held January 30–February 1, 1997, at the National Academy of Sciences Beckman Center in Irvine, CA.
Incipient species formation in salamanders of the Ensatina complex
DAVID B. WAKE*
Museum of Vertebrate Zoology and Department of Integrative Biology, University of California, Berkeley, CA 94720-3160
*To whom reprint requests should be addressed at: Museum of Vertebrate Zoology, 3101 Valley Life Sciences Building, University of California, Berkeley, CA 94720-3160. e-mail: wakelab@uclink4.berkeley.edu.
© 1997 by The National Academy of Sciences 0027-8424/97/947761-7$2.00/0 PNAS is available online at http://www.pnas.org.

ABSTRACT The Ensatina eschscholtzii complex of plethodontid salamanders, a well-known ‘‘ring species,’’ is thought to illustrate stages in the speciation process. Early research, based on morphology and coloration, has been extended by the incorporation of studies of protein variation and mitochondrial DNA sequences. The new data show that the complex includes a number of geographically and genetically distinct components that are at or near the species level. The complex is old and apparently has undergone instances of range contraction, isolation, differentiation, and then expansion and secondary contact. While the hypothesis that speciation is retarded by gene flow around the ring is not supported by molecular data, the general biogeographical hypothesis is supported. There is evidence of a north to south range expansion along two axes, with secondary contact and completion of the ring in southern California. Current research targets regions once thought to show primary intergradation, but which molecular markers reveal to be zones of secondary contact. Here emphasis is on the subspecies E. e. xanthoptica, which is involved in four distinct secondary contacts in central California. There is evidence of renewed genetic interactions upon recontact, with greater genetic differentiation within xanthoptica than between it and some of the interacting populations. The complex presents a full array of intermediate conditions between well-marked species and geographically variable populations. Geographically differentiated segments represent a diversity of depths of time of isolation and admixture, reflecting the complicated geomorphological history of California. Ensatina illustrates the continuing difficulty in making taxonomic assignments in complexes studied during species formation.

The famous books by Dobzhansky (1) and Mayr (2) initiated a long period of general agreement on species concepts and speciation, but in recent years controversy has again developed. Once ignited (3), the debate raged for years, and only now do I sense a developing consensus (4, 5). New methods and techniques have changed the criteria by which species concepts are made manifest in taxonomies. My focus here is a celebrated ring species, the plethodontid salamander Ensatina, once touted by Dobzhansky (6) as an example of incipient, but incomplete, speciation.

Ensatina are fully terrestrial salamanders distributed in coniferous forests and oak woodland along the Pacific Coast from southern British Columbia to northern Baja California, extending inland to the western slopes of the Cascades, the Sierra Nevada, and the Peninsular Ranges. At one time four species were recognized, but at the height of popularity of the Evolutionary Synthesis, a detailed analysis of coloration and morphology led Stebbins (7) to the conclusion that they were parts of a polytypic species arranged in the form of a ring around the Central Valley of California. Stebbins recognized seven subspecies of Ensatina eschscholtzii (Fig. 1).

What makes this study so interesting is a historical biogeographic hypothesis and its implications: the species originated in present-day northwestern California and southwestern Oregon and spread southward. Along the coast the species developed a Mullerian mimicry relationship with newts (the model) and evolved a uniform reddish brown dorsal coloration and a light pink to orange ventral coloration. In the inland mountains the species evolved a cryptic, spotted, or blotched color pattern. As the two arms of the expanding distribution moved southward, they came into sympatry in the southern Peninsular Ranges. In Dobzhansky’s view, while the ring showed terminal overlap and demonstrated nearly all stages in a speciation process (primary intergradation with adaptive divergence, secondary contact with hybridization, and finally sympatry), speciation was thwarted by on-going gene flow via intermediates around the ring (6). The demonstration (8–10) of some hybridization in the southern California zone of sympatry added credence to this interpretation.

A survey of protein variation in 19 populations throughout the complex disclosed great differentiation and showed that gene flow cannot be holding this far-flung complex together (11, 12). The analysis revealed values of Fst > 0.7, thus refuting the hypothesis of continuous gene flow. While these data do not affect the biogeographic hypothesis (7), they raise the possibility of a group of closely related species whose borders remain to be identified.

The hypothesis of a northern origin and a southern spread along two fronts was based on the presence in the north of high levels of variation in color pattern in the subspecies E. e. picta, and to a lesser degree in the surrounding form, E. e. oregonensis. Increasing genetic divergence from north to south was inferred from the progressive divergence in morphology (basically, color pattern) between coastal and interior forms south of the area of continuous distribution at the north end of the Central Valley (7). Free interbreeding was thought to occur in the north, a region of morphological intergradation. To the south hybridization occurs where E. e. xanthoptica from the coast has established populations in the foothills of the central Sierra Nevada, where it meets E. e. platensis (Fig. 1). Sympatry with little hybridization occurs between the southernmost forms, the coastal E. e. eschscholtzii and the inland E. e. klauberi. Subsequent research has shown that xanthoptica and platensis hybridize wherever they meet, but in very narrow hybrid zones (on the order of several home range diameters in
width, or a few hundred meters). While klauberi and eschscholtzii hybridize, they do so less frequently and in even narrower hybrid zones (10, 13). At the southernmost area of contact, the two forms are sympatric with no evidence of past or present hybridization (13, 14).

FIG. 1. The Ensatina complex, showing distribution of taxa recognized by Stebbins (7), but with borders based on molecular markers rather than morphological traits.

I have tested Stebbins’ biogeographic hypothesis. Polymorphism and heterozygosity, estimated from allozymes, are extraordinarily high in northwestern California, among the highest recorded for any vertebrate, whereas more southern populations have less variation (the least occurs in the postulated colonists, the Sierran xanthoptica) (11, 12). The total number of presumptive alleles in the northern populations is also high (e.g., in one population, 59 alleles for 28 allozymes, n = 10), as expected for old, large populations relative to newer, smaller ones. Genetic distance generally increases between paired comparative populations on either side of the valley from north to south, also as expected (12).

A phylogenetic analysis of sequence variation in the mitochondrial gene cytochrome b also shows substantial variation within Ensatina (15). The greatest variation occurs in the north. Within the subspecies oregonensis, picta, and intergrades are several distinct, distantly related haplotypes. There are two monophyletic clades in the complex with respect to this gene. The first includes xanthoptica and eschscholtzii as sister groups; these are the southern subspecies of the coastal arm. The second clade includes klauberi, E. e. croceater, and southern populations of platensis; these are the southernmost parts of the inland arm. These data support Stebbins’ biogeographic scenario.

The protein and DNA studies were not conducted at a sufficiently fine scale to determine whether or not species formation has already occurred. Questions arose concerning taxonomy; for example, some considered klauberi to be a separate species (ref. 16; but see ref. 17). A second allozymic survey of 49 populations from picta through oregonensis to the blotched forms along the inland arm disclosed a complicated pattern of isolation by distance in the south, relative genetic uniformity in one large northern area, and two distributional and genetic gaps (17). Periods of separation and differentiation were hypothesized to have been followed by secondary contacts, with resumption of gene flow. While evidence of past separation persists in molecular markers, allozymes and mitochondrial haplotypes show transitions in different areas and morphological uniformity prevails across old borders. No taxonomic changes were proposed, pending completion of other studies.

One critic has focused attention not on the contact zones but on the areas of relative uniformity, and argued that many, perhaps 11 or more, species constitute the Ensatina complex (18). The controversy, in part, involves what occurs upon the recontact of previously separated units (D.B.W. and C.J. Schneider, unpublished data). As Dobzhansky (ref. 20, p. 205) identified the problem: ‘‘how much gene exchange between diverging populations is possible without arresting and reversing the divergence?’’ Here I present new information bearing on this question. My conclusion is that incipient species formation is occurring in the nearly continuous ‘‘ring,’’ but that species borders remain unclear.

MATERIALS AND METHODS
This paper summarizes previously unpublished data regarding interactions of the taxa oregonensis, xanthoptica, and eschscholtzii in central coastal California, mainly from populations ranging along the Pacific Coast from northern Mendocino County to central Monterey County and in the hills east of San Francisco Bay. Although this region encompasses large zones of intergradation (based on morphological studies, ref. 7), for purposes of clarity populations are assigned to taxa. Results are derived from three separate kinds of data: morphological, allozymic, and mitochondrial sequences. Morphological data follow earlier analyses (7, 10), but include a much larger data set. A complex-wide study of proteins (19 populations, 5 of which are relevant to this study, using 26 allozymic loci) laid the foundation for subsequent work (12). A first stage examined 25 loci in 20 populations (n per population = 8–22; mean, 13.6) from regions east (East Bay) and north (North Bay) of San Francisco Bay; a second studied 27 loci in 20 East and South Bay populations (n = 2–20; mean, 8.6), and a third used 22 of the most relevant loci in 34 populations (n = 2–19; mean, 7.0) from the North and South Bay. These will be reported as first, second, and third studies in this paper. It is not possible to directly combine these studies, which were done at different times and used some different buffers, in part because of the large number of alleles detected. This complex data set will be published elsewhere, and only the main results are presented here. Nei (21) genetic distances (D) are reported. Sequences of the cytochrome b gene (664–775 bp) constitute the third kind of data. This is a growing data set (presently including data for over 80 populations), representing an expansion of the initial study (12), and research is actively in progress. Results are based on preliminary analyses of the data.
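As an illustration of how a genetic distance of this kind is obtained from allele frequencies, the following is a minimal sketch. It assumes that D is Nei's standard distance, D = -ln I, where I is the normalized identity of allele frequencies summed over loci; whether ref. 21 specifies this form or a sample-size-corrected variant is not stated here. The function name, the data layout, and the example frequencies are hypothetical and are not taken from the data set.

```python
import math


def nei_standard_distance(pop_x, pop_y):
    """Nei's standard genetic distance between two populations.

    pop_x, pop_y: lists with one dict per locus (same locus order in both),
    each dict mapping an allele label to its frequency in that population.
    Assumed form: D = -ln( Jxy / sqrt(Jx * Jy) ).
    """
    if len(pop_x) != len(pop_y):
        raise ValueError("both populations must be scored for the same loci")
    jx = jy = jxy = 0.0
    for freq_x, freq_y in zip(pop_x, pop_y):
        alleles = set(freq_x) | set(freq_y)
        jx += sum(freq_x.get(a, 0.0) ** 2 for a in alleles)
        jy += sum(freq_y.get(a, 0.0) ** 2 for a in alleles)
        jxy += sum(freq_x.get(a, 0.0) * freq_y.get(a, 0.0) for a in alleles)
    if jxy == 0.0:
        return math.inf  # no alleles shared at any locus
    return -math.log(jxy / math.sqrt(jx * jy))


# Hypothetical two-locus example: identical at locus 1, fixed for
# different alleles at locus 2, so I = 0.5 and D = ln 2 (about 0.693).
x = [{"a": 1.0}, {"b": 1.0}]
y = [{"a": 1.0}, {"c": 1.0}]
print(round(nei_standard_distance(x, y), 3))
```

Because the same loci enter the numerator and both denominator terms, summing over loci gives the same ratio as averaging over loci.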
RESULTS
Populations identified as xanthoptica, unblotched salamanders with large amounts of orange pigmentation (especially ventrally) and a bright yellow upper iris, occur in the North, South and East Bay regions and in the west-central Sierra Nevada. This taxon occupies a key position in the ring complex. A zone of morphological intergradation between xanthoptica and eschscholtzii extends from Atascadero northward in the Coast Range to the Monterey Bay region (7). Morphological intergradation of xanthoptica with oregonensis occurs from near Monterey Bay north to the vicinity of Ft. Ross (7). In the Sierra Nevada xanthoptica hybridizes with platensis (14). While I acknowledge the validity of the analysis of coloration (7), molecular markers provide little evidence of the intergradation described above. General results are summarized in Fig. 2. Although the distribution of xanthoptica is interrupted by major present-day barriers, the taxon maintains some integrity as a unit, especially with respect to coloration and the monophyly of DNA sequences. Minimal D is 0.08 between North Bay and East Bay localities, and 0.05 between East Bay and South Bay localities. However, between South Bay and North Bay localities there is relatively great and varying divergence (D = 0.15–0.47). The genetic connection between the North Bay and South Bay appears to be via the East Bay; San Francisco Bay and associated Carquinez Straits (north) and Santa Clara Valley (south), which currently interrupt the range, are apparently recent barriers. There are some relatively high D values (to
FIG. 2. Distribution of taxa of Ensatina in the San Francisco Bay region, showing D (21) based on allozyme data between selected neighboring populations. Bold face type, D between taxa; normal type, D within taxa. The mean and range of D between North Bay and South Bay oregonensis is shown.
0.19) between the East Bay and the South Bay (populations likely to be even more divergent have not been included in the same study as yet). There is variation within each of these three areas. D within the North Bay reaches 0.15 (n, number of populations compared = 5), within the East Bay, 0.09 (n = 4), and within the South Bay, 0.31 (n = 6 in each of two studies using different populations). In the eastern part of the South Bay distances are below 0.15, but some western populations are highly divergent from everything studied (these also are the populations with the greatest divergence to North Bay xanthoptica). Several populations contain both xanthoptica and oregonensis alleles; these introgressed populations were not classified. There is a finger-like projection of xanthoptica into oregonensis in the North Bay, and this small range is divided by inhospitable (now agricultural and urban) lowlands to the west of Santa Rosa. To the west, north, and east, populations are genetically oregonensis. D values between the two taxa exceed 0.3. Based on allozymes, populations identified by coloration (7) as xanthoptica are correctly assigned, but populations identified as morphological intergrades are assigned to oregonensis (with exceptions discussed below). On the southern San Francisco Peninsula in the Santa Cruz Mountains oregonensis and xanthoptica meet with a genetic gap of D = 0.16–0.32. Further south, the genetic distance between xanthoptica and eschscholtzii across the Pajaro River is somewhat less (D = 0.15–0.2). There is little evidence as yet for gene flow between nearby populations of oregonensis and xanthoptica in this region, although two local populations appear to be admixed. There remain small local geographic gaps in our sampling. 
However, as we have shortened the geographic distance between xanthoptica and eschscholtzii in the vicinity of Monterey Bay, D has dropped from 0.32 (12) to 0.15, and there remains a zone about 30 km in width which is largely unsampled (habitat along the Pajaro River has been disrupted by agricultural activities and urbanization). These data suggest that D will drop further as additional populations are discovered in the intervening area. We sampled only a small portion of the distribution of oregonensis (it ranges to southern Canada), but uncovered surprisingly great local differentiation. The first study included 18 populations extending from northern Mendocino down to southern Marin counties. D ranged as high as 0.26, and 31% of population comparisons exceeded D = 0.15 (the approximate level at which species borders typically occur in the closely related genus Plethodon; ref. 22). Detailed analysis of this variation is beyond the scope of the present paper, but I observe that variation is great and no areas of high uniformity or of potential species borders were uncovered; furthermore, borders determined from haplotypes do not coincide with those determined from allozymes (D.B.W. and C.J. Schneider, unpublished data). The highest values of D within oregonensis involved comparisons across the range, between populations along the Pacific Coast and those relatively far inland. For no nearest neighbor comparison is D = 0, and many are in the range D = 0.02–0.07. The third study included 12 populations (a few repeats from the earlier study but mainly different) of oregonensis extending from the Russian River area through the Coast Range to southern Marin County, with a few populations in eastern Sonoma County. Even in this relatively small region genetic diversification is great, with D reaching a high of 0.23 (across the breadth of the range) and 36% of the comparisons exceeding D = 0.15. Near neighbors always have the lowest values, but rarely less than D = 0.04. 
Genetic distances across the Russian River range from 0.08 to 0.15, suggesting that it has restricted gene flow to some extent. Populations of oregonensis occur in the South Bay, mainly on the northern part of the San Francisco Peninsula, but extending southeast to near Loma Prieta. Within this small peninsular area diversification is great. A maximum D = 0.16 is
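The percentages of comparisons exceeding D = 0.15 quoted above are summaries over all pairs of populations. A minimal sketch of that tally follows; it assumes every pair of populations within a study was compared (which would give 18 × 17 / 2 = 153 comparisons for 18 populations), and the function name and the small distance matrix are hypothetical, not values from the study.

```python
from itertools import combinations


def fraction_exceeding(dist, threshold=0.15):
    """Share of unique population pairs whose distance exceeds the threshold.

    dist: square, symmetric list of lists of pairwise D values
    (the diagonal is ignored).
    """
    pairs = list(combinations(range(len(dist)), 2))
    if not pairs:
        raise ValueError("need at least two populations")
    over = sum(1 for i, j in pairs if dist[i][j] > threshold)
    return over / len(pairs)


# Hypothetical matrix for four populations (six unique pairs).
d = [
    [0.00, 0.04, 0.20, 0.26],
    [0.04, 0.00, 0.10, 0.16],
    [0.20, 0.10, 0.00, 0.05],
    [0.26, 0.16, 0.05, 0.00],
]
print(fraction_exceeding(d))  # three of six pairs exceed 0.15, so 0.5
```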
present in study three (n = 4), with only one comparison D < 0.1. In study two (n = 3) the highest value is D = 0.08. The mean D between oregonensis north and south of the Golden Gate is 0.16 (range, 0.08–0.27; 15 populations). There are three comparisons in the range of 0.08–0.09, showing that the Golden Gate has not been a major distributional barrier. There is a genetic gap between oregonensis and xanthoptica in the North Bay. D ranges from 0.28 to more than 0.5, but in the areas where populations of the two approach most closely D = 0.3–0.4. There are five to eight potentially useful loci for constructing hybrid indices (14), but none are fixed and there is so much variation, especially in oregonensis, that indices would only be useful locally. Some populations appear to be introgressed or admixed (see below). There is no evidence of hybridization per se (i.e., no clear F1 hybrids or backcrosses). In two areas near Santa Rosa there is evidence of gene flow between oregonensis and xanthoptica (Fig. 3), in the form of admixture. This is at the extreme northwestern limit of the range of xanthoptica, in the hills immediately north of Santa Rosa and on the west side of the valley that separates these hills from the main Coast Range near Forestville. In the first area three populations were sampled from nearly continuous habitat near Mark West Creek. One of these populations (no. 28, n = 19) is similar to xanthoptica in coloration, and another (no. 31, n = 10) is similar to oregonensis. These populations are separated by less than 10 km, but D = 0.34. Both are highly variable (no. 28 has 36 alleles; no. 31 has 34 alleles at 22 loci), but only no. 28 shows signs of limited gene flow from the other taxon (alleles characteristic of oregonensis are present at low frequency for four loci). A third population (no. 24, n = 5), 5 km south of population no. 
31, displays coloration somewhat intermediate between oregonensis and xanthoptica, but genetic distances are high to both neighboring populations (0.22 to no.
FIG. 3. The xanthoptica–oregonensis contact zone north of San Francisco Bay in the Santa Rosa–Russian River area. Populations 22 and 24 are intermediate in nature. D values between selected populations are indicated. Shading in upper part of figure indicates wooded land.
28; 0.30 to no. 31). There are 32 alleles in the relatively small sample, but no evidence of F1 hybrids. However, the sample is fixed for an otherwise rare allele for malate dehydrogenase (Mdh; EC 1.1.1.37) (found at a frequency of 0.06 in population 31; absent in population 28), fixed for an allele for Acon 1 (EC 4.2.1.3) that is relatively common in population 31 and absent in no. 28, and fixed for an allele for proline dipeptidase (Pep-d; EC 3.4.13.9) which is in high frequency in population 28 (0.91) but absent in population 31. Acon 2 has an allele found only in population 24 and an admixed population across the valley to the west. Population 24 lacks an allele for glutamic-oxaloacetic transaminase (Got; EC 2.6.1.1) that is fixed in population 31 but absent in no. 28, and it has two of the three alleles that appear in population 28. Evidently gene flow as well as some sorting of variants has occurred. This suggests that there is no intrinsic barrier (e.g., specific mate recognition systems, or postmating isolating mechanisms) to genetic exchange (there is no evidence of such barriers anywhere in the complex). The region of admixture is narrow, in relation to the range of the taxa, but probably not with respect to the relatively narrow home ranges known to be characteristic of this complex (23, 24). Some additional populations in this area are introgressed as well and these are not assigned to any taxon. The second area is even narrower (Fig. 4). Across the Russian River at the northwestern limit of the range of xanthoptica there is a genetic gap D = 0.3 in less than 1 km. As much as 0.15 occurs within oregonensis just to the west of the
FIG. 4. Expansion of Fig. 3, showing the Russian River contact zone. Populations are sorted by taxon, but population 22 is intermediate in most respects. pop, Population number in study three; n, sample size; H, mean heterozygosity (direct count); P, proportion of loci polymorphic; A, number of alleles in 22 allozymic loci; X, fraction of xanthoptica marker alleles present in populations assigned to oregonensis; O, fraction of oregonensis marker alleles in populations assigned to xanthoptica.
contact zone, but the intertaxon distance is substantially greater and implies secondary contact of well differentiated groups. There is also evident change in color pattern on either side of the Russian River; on the east and south salamanders have extensive orange pigmentation and a bright yellow dorsal iris, whereas on the west and north orange pigmentation is greatly reduced, especially ventrally, and the upper iris is much paler. Two relatively large samples separated by less than 5 km have a D = 0.36. The oregonensis population (no. 25, n = 16) contains 41 alleles, and the xanthoptica population (no. 23, n = 11) contains 31 alleles, but only one locus (different in each population) is potentially introgressed from the other taxon in either population. The population of xanthoptica (no. 30) closest to oregonensis (no. 25) is small (n = 5) and has oregonensis alleles at only one locus. One population (no. 22, n = 6) has high genetic distance to all neighboring populations, even those less than 10 km distant (D = 0.11 to one xanthoptica; 0.22 to two oregonensis), including the other admixed population (D = 0.31 to no. 24). This sample is of mixed origin, but its heterozygosity (mean direct count 0.12) is about the same as populations 25 and 23 and there are no clear hybrid genotypes. Specimens have the coloration of xanthoptica, but alleles characteristic of oregonensis are present in all six potential marker loci and it has a high total number of alleles for a small sample (population 36), further indicating its composite nature. In the South Bay the genetic gap between xanthoptica and oregonensis is generally less than in the North Bay (Fig. 5). In the second study, D ranges from 0.16–0.32 (mean, 0.23) between South Bay oregonensis and all South Bay and East Bay xanthoptica, with the largest and smallest values both representing South Bay comparisons. The mean is identical for South Bay and East Bay comparisons. 
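The per-population summaries used in this comparison (total number of alleles, and mean heterozygosity by direct count) can be tallied from individual genotypes roughly as follows. This is a sketch under stated assumptions: each individual is scored as a pair of alleles at every locus, a locus is counted as polymorphic if any second allele is observed (other criteria exist and the one used in the study is not stated), and the function name and genotypes are hypothetical.

```python
def summarize_population(genotypes):
    """Summarize allozyme variation in one population sample.

    genotypes: list of loci; each locus is a list of (allele, allele)
    tuples, one tuple per individual.
    Returns mean heterozygosity by direct count (H), proportion of loci
    polymorphic (P, assumed criterion: more than one allele observed),
    and the total number of alleles over all loci (A).
    """
    het_sum = 0.0
    polymorphic = 0
    allele_total = 0
    for locus in genotypes:
        # Direct count: fraction of individuals scored as heterozygotes.
        het_sum += sum(1 for a, b in locus if a != b) / len(locus)
        alleles = {allele for pair in locus for allele in pair}
        allele_total += len(alleles)
        if len(alleles) > 1:
            polymorphic += 1
    n_loci = len(genotypes)
    return {"H": het_sum / n_loci, "P": polymorphic / n_loci, "A": allele_total}


# Hypothetical sample: four individuals scored at two loci.
sample = [
    [("a", "a"), ("a", "b"), ("b", "b"), ("a", "b")],  # two alleles, 2 of 4 heterozygous
    [("c", "c"), ("c", "c"), ("c", "c"), ("c", "c")],  # monomorphic
]
print(summarize_population(sample))  # {'H': 0.25, 'P': 0.5, 'A': 3}
```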
Two populations (one small sample from the east slopes of the Santa Cruz Mountains and the other from the Pacific coastal zone of the midpeninsula) appear to be admixed, although it is more difficult to detect possibly diagnostic loci than in the North Bay. The
FIG. 5. Modern barriers to dispersal in the San Francisco Bay area for taxa discussed in this paper. Genetic distances between selected populations indicated on lines connecting them. Bold D values are between taxa.
coastal population (n = 7) is genetically equidistant between the two taxa (D = 0.11–0.29, mean 0.20 to xanthoptica; 0.21–0.24 to oregonensis). In the third study, D between South Bay oregonensis and xanthoptica ranges from 0.22 to 0.35 (mean, 0.28). One population appears to be admixed (D = 0.17–0.34 to South Bay xanthoptica; D = 0.17–0.22 to South Bay oregonensis, some only ≈5 km distant), but it is a small sample (n = 3) and distances are high because of sampling effects. Distances in the third study would be expected to be greater than in the second, because only more variable, potentially diagnostic loci were selected for study. The values in Fig. 5 reflect the likely upward bias. Distances between xanthoptica and eschscholtzii are slightly less than between xanthoptica and oregonensis: D = 0.21–0.37 between East Bay xanthoptica and eschscholtzii, and 0.14–0.32 between South Bay xanthoptica and eschscholtzii. In the third study D = 0.15–0.39, mean 0.23 for South Bay xanthoptica and eschscholtzii; D = 0.24–0.38, mean 0.29 for eschscholtzii to oregonensis. Populations of xanthoptica and eschscholtzii that are closest geographically have the lowest values. More than 80 populations have been sampled for sequence variation in the cytochrome b gene (ref. 15 and unpublished data). Corrected sequence divergence between the three taxa considered here is 0.05–0.07 for xanthoptica to eschscholtzii, in excess of 0.09 for eschscholtzii to oregonensis, and in excess of 0.11 for xanthoptica to oregonensis. There is substantial variation within all taxa, but especially oregonensis (which is paraphyletic with respect to this gene). A phylogenetic analysis of sequence data indicates that xanthoptica and eschscholtzii are sister taxa and form a monophyletic group (15), but their closest relative is unclear and recent analysis of a much larger sample has failed to find a closest relative. The base of the cytochrome b gene tree for the Ensatina complex is unstable. 
The contact zones detected with allozymes described herein are also detectable with mitochondrial DNA; a detailed study by D. Parks in this laboratory is in progress.
DISCUSSION
While the main features of the historical biogeographic hypothesis (7) for the Ensatina complex are generally supported by recent work, we now can see that the original scenario was too simple. Differentiation is greater than originally envisioned, and there is evidence throughout the ring of subdivision, differentiation, and several recontacts. Nevertheless, the complex displays features of a ring-like series of interacting units. In the north, boundaries between apparently old units have been obscured by recurrent gene flow and different character sets do not coincide geographically; as one moves south the units become more distinct and data sets are more coordinated. In regions of secondary contact in the Sierra Nevada and southern California, clear hybridization occurs; although hybrids and backcrosses are healthy and fertile, there is apparently selection against them (14). Elsewhere in the ring there is no unambiguous evidence of hybridization, by which I mean sympatry and production of offspring from mating of unlike forms, but there is evidence of genetic admixture and introgression between geographically adjacent units, usually with more genetic differentiation within units than between them along their borders. The intergradation zones based on morphology (7) are far too broad when compared with data derived from molecular markers.

Do the molecular markers identify species borders? If one treats Ensatina eschscholtzii as a simple species, it is more differentiated genetically than most species of vertebrates (25). While several suggestions have been made for taxonomic reclassification (16, 18, 26), all proposed solutions are problematic (D.B.W. and C. J. Schneider, unpublished data). Morphological, DNA, and allozyme criteria exist for making taxonomic decisions, but in this complex they frequently do not coincide geographically (D.B.W. and C. J. Schneider, unpublished data).

One of the most distinctive taxa in the complex is xanthoptica. A case might be made for recognizing it as a separate species. However, as shown here, there are leaky borders with neighboring taxa and it remains unclear if the taxa are merging or continuing to diverge. Furthermore, there is a broad overlap in comparison of within- and between-taxon genetic distances, so that genetic distances are much greater within xanthoptica than they are between taxa in the zones of secondary contact. Accordingly, xanthoptica lacks integrity as an historical unit.

The other taxa treated here offer contrasts with xanthoptica. For example, eschscholtzii is much less differentiated genetically, suggesting that its southward spread has been recent. On the other hand, oregonensis (including picta) is more deeply differentiated and may represent an ancient, persistent ancestral stock of the complex as a whole. However, we have found no places in northern California where borders identified by one data set are matched by those found with other data sets, so past differentiates have apparently merged as a result of on-going genetic interactions across geography. Even if one recognized eschscholtzii and xanthoptica as separate taxa on phylogenetic grounds (e.g., mtDNA sequences), one would be left with a plesiomorphic oregonensis–picta agglomeration. Proposals that this agglomeration be separated into several species (18) are unsatisfactory (D.B.W. and C. J. Schneider, unpublished data), and if one started down the path of naming as species all identifiable pieces of the phylogenetic nexus there would be far more species than anyone has proposed to date (e.g., in unpublished and incomplete research we have identified many haplotype clades). Accordingly, I recommend maintaining the current taxonomy while research continues.

Sequence data suggest that eschscholtzii and xanthoptica are sister taxa (ref. 
15 and unpublished data). I propose that a common ancestor of these two was isolated to the south of the main range of what became present-day oregonensis. Secondary contacts among these taxa are a consequence of major geomorphological reorganizations of coastal California associated with the complicated tectonic history of the region. One possible reconstruction is inspired by the historical biogeographic hypothesis for the plethodontid salamander genus Batrachoseps (ref. 27, see also ref. 28) and assumption of a general (but as yet uncalibrated) molecular clock. For various periods during the Tertiary, precursor drainages of the present-day Central Valley entered the Pacific Ocean in the vicinity of present-day Monterey Bay, where the largest marine canyon (of Grand Canyon scale) on the Pacific Coast of North America is found (29). I suggest that the proposed common ancestor of the xanthoptica–eschscholtzii clade may have been isolated south of this region on the order of 5 million years ago, and that differentiation proceeded during this period of isolation. Precursors to xanthoptica and eschscholtzii may have been isolated on either side of the San Andreas Fault (Fig. 6). Land in this area has been extremely unstable over a long period of time, and has been moving at a rate of ≈35 mm/year over the last 4–5 million years (30). I postulate that land connections were made and broken repeatedly, and that movement of primordial xanthoptica into the present-day South Bay region occurred relatively early, based on the high degree of genetic differentiation that has taken place. Subsequently xanthoptica moved into the East Bay and North Bay, as well as across the Central Valley. Very recently the Central Valley has established a new drainage to the ocean, at the Golden Gate, as a result of the Inner Coast Range becoming continuous.
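As a rough check on the scale of this land movement, the following sketch converts the quoted rate of about 35 mm/year into cumulative displacement over 4 and 5 million years. It assumes, purely for illustration, that the rate was constant over the whole interval; the function name `fault_offset_km` is hypothetical and does not come from the cited work.

```python
def fault_offset_km(rate_mm_per_year, duration_years):
    """Cumulative displacement in km, assuming a constant slip rate (an illustrative assumption)."""
    millimetres = rate_mm_per_year * duration_years
    return millimetres / 1_000_000  # 1 km = 1,000,000 mm


if __name__ == "__main__":
    # Rate and time span as quoted in the text; constancy of the rate is assumed.
    for years in (4_000_000, 5_000_000):
        print(years, "years:", fault_offset_km(35, years), "km")
```

Under that assumption the displacement is on the order of 140 to 175 km, which is large relative to the distances separating the regions discussed here.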
The northward expansion of xanthoptica brought it into secondary contact with oregonensis, in the South Bay and independently in the North Bay. The expansion of xanthoptica into the foothills of the Sierra Nevada led to contacts with both northern and southern platensis (14, 17). That all of these contacts are recent is suggested by the low minimal D
FIG. 6. Hypothetical distribution of the Ensatina complex ≈5 million years before present. Based on reconstruction of California paleogeography by Yanev (27). Approximate location of precursors to genetically defined units within the Ensatina complex are indicated. oreg-picta, oregonensis and picta; plat 1, northern platensis (15, 17); plat 2-croc-klau, southern platensis plus croceater plus klauberi (15, 17); xanth, xanthoptica; esch, eschscholtzii; SF, approximate position of present-day San Francisco; SD, approximate position of present-day San Diego. The approximate positions of the San Andreas Fault and Monterey Canyon (the latter at the outflow of the Pajaro and Salinas Rivers) are indicated.
values (0.05–0.08 between geographic areas within xanthoptica). Secondary contact between xanthoptica and eschscholtzii probably occurred from the north, for apparently xanthoptica had more dispersal access from the Santa Cruz Mountains than did eschscholtzii, which was isolated to the south by a flat, sandy (and thus relatively inhospitable) area east of Monterey Bay, as well as two major rivers (Pajaro, Salinas). The region of the Pajaro River is a major biogeographic border (27), as it marks the southern boundary of many amphibians: Ambystoma macrodactylum, Aneides flavipunctatus, Batrachoseps attenuatus, Dicamptodon ensatus, Taricha granulosa and E. e. xanthoptica. It is the northern boundary of Batrachoseps pacificus and E. e. eschscholtzii. Against the hypothesis laid out above is the fact that in both the North Bay and the South Bay, xanthoptica is relatively differentiated genetically, more so than would be predicted by the lowest genetic distances measured between North and South Bay to East Bay populations. Perhaps the initial recontact between xanthoptica and oregonensis is old; the lowest genetic distances between regions might reflect relatively recent genetic exchange between particular populations. In related taxa in eastern North America, many nearly cryptic species have been recognized (19, 22, 31). I suggest that there are historical reasons for the differences in pattern in eastern and western North America. In eastern North America there may have been far greater effects of Pleistocene glaciation than in the west, and this may have led to more local range restriction as well as extinction. This may have sharpened borders between groups of populations and heightened the genetic cohesion of units. In contrast, in California glaciation effects were more limited, although they have been postulated
to have played a role in contributing to the differentiation of some taxa and to have sharpened boundaries in the Sierra Nevada (17). Instead, in California there has been a history of extensive geomorphological evolution coinciding with the history of the Ensatina complex. The time and space dimensions of the diversification are interconnected. The history of this complex has probably featured substantial isolation, differentiation, and multiple recontacts (Fig. 7). In effect, there are rings within rings in this complex, resulting from many levels of history being manifest in a single complicated pattern of variation, expressed somewhat differently at the three levels investigated to date—DNA sequences, allozymes, and color pattern. While the complex appears to be in a state of incipient species formation, which makes taxonomy problematic, it provides an instructive evolutionary example.
I thank M. Frelow, T. Jackman, D. Nguyen, C. Schneider, and K. P. Yanev for their extensive laboratory assistance, and C. Brown and many other individuals who have helped me in field work. Illustrations are by K. Klitz. I have benefited from discussions, comments on the manuscript, or both, with R. Bello, C. Brown, M. Garcia Paris, C. Haddad, R. Highton, T. Jackman, S. Kuchta, M. Mahoney, D. Parks, C. Schneider, R. Stebbins, M. Wake, K. Yanev, and K. Zamudio. The manuscript was improved by comments from two anonymous reviewers. This work was supported by grants from the National Science Foundation and the Gompertz Professorship.
FIG. 7. Historical biogeographic interpretation for the Ensatina complex. Five zones of secondary interaction are shown. 1, Interaction of klauberi and eschscholtzii. 2, Complex interaction between northern and southern platensis and of these interactors with xanthoptica in the central Sierra Nevada. 3, Interaction of oregonensis and northern platensis in the Lassen Peak area. 4, North Bay interaction of oregonensis and xanthoptica. 5, South Bay interaction of oregonensis and xanthoptica and of xanthoptica and eschscholtzii.
1. Dobzhansky, T. (1937) Genetics and the Origin of Species (Columbia Univ. Press, New York).
2. Mayr, E. (1942) Systematics and the Origin of Species (Columbia Univ. Press, New York).
3. Ghiselin, M. (1974) Syst. Zool. 23, 536–544.
4. Avise, J. C. & Wollenberg, K. (1997) Proc. Natl. Acad. Sci. USA 94, 7748–7755.
5. deQueiroz, K. (1997) in Endless Forms: Species and Speciation, eds. Berlocher, S. & Howard, D. (Oxford Univ. Press, Oxford), in press.
6. Dobzhansky, T. (1958) in A Century of Darwin, ed. Barnett, S. A. (Harvard Univ. Press, Cambridge, MA), pp. 19–55.
7. Stebbins, R. C. (1949) Univ. Calif. Publ. Zool. 48, 377–526.
8. Stebbins, R. C. (1957) Evolution 11, 265–270.
9. Brown, C. W. & Stebbins, R. C. (1965) Evolution 18, 706–707.
10. Brown, C. W. (1974) Univ. Calif. Publ. Zool. 98, 1–64.
11. Larson, A., Wake, D. B. & Yanev, K. P. (1984) Genetics 106, 293–308.
12. Wake, D. B. & Yanev, K. P. (1986) Evolution 40, 702–715.
13. Wake, D. B., Yanev, K. P. & Brown, C. W. (1986) Evolution 40, 866–868.
14. Wake, D. B., Yanev, K. P. & Frelow, M. M. (1989) in Speciation and its Consequences, eds. Otte, D. & Endler, J. A. (Sinauer, Sunderland, MA), pp. 134–157.
15. Moritz, C., Schneider, C. J. & Wake, D. B. (1992) Syst. Biol. 41, 273–291.
16. Frost, D. & Hillis, D. (1990) Herpetologica 46, 87–104.
17. Jackman, T. & Wake, D. B. (1994) Evolution 48, 876–897.
18. Highton, R. (1997) Herpetologica, in press.
19. Tilley, S. C. & Mahoney, M. J. (1996) Herpetol. Monogr. 10, 1–42.
20. Dobzhansky, T. (1951) Genetics and the Origin of Species (Columbia Univ. Press, New York), 3rd Ed.
21. Nei, M. (1972) Am. Nat. 106, 283–292.
22. Highton, R. (1995) Annu. Rev. Ecol. Syst. 26, 579–600.
23. Stebbins, R. C. (1954) Univ. Calif. Publ. Zool. 54, 47–124.
24. Staub, N. L., Brown, C. W. & Wake, D. B. (1995) J. Herpetol. 29, 593–599.
25. Avise, J. (1994) Molecular Markers, Natural History and Evolution (Chapman & Hall, New York).
26. Graybeal, A. (1995) Syst. Biol. 44, 237–250.
27. Yanev, K. P. (1980) in The California Islands: Proceedings of a Multidisciplinary Symposium, ed. Power, D. M. (Santa Barbara Museum of Natural History, Santa Barbara, CA).
28. Tan, A.-M. & Wake, D. B. (1995) Mol. Phyl. Evol. 4, 383–394.
29. Martin, B. D. & Emery, K. O. (1967) Am. Assoc. Pet. Geol. Bull. 51, 2281–2304.
30. Powell, R. E. & Weldon II, R. J. (1992) Annu. Rev. Earth Planet. Sci. 20, 431–468.
31. Highton, R. (1989) Ill. Biol. Monogr. 57, 1–78.
Proc. Natl. Acad. Sci. USA Vol. 94, pp. 7768–7775, July 1997 Colloquium Paper
This paper was presented at a colloquium entitled ‘‘Genetics and the Origin of Species,’’ organized by Francisco J. Ayala (Co-chair) and Walter M. Fitch (Co-chair), held January 30–February 1, 1997, at the National Academy of Sciences Beckman Center in Irvine, CA.
Genetics and the origin of bird species (reproductive isolation/imprinting/introgressive hybridization/ecological adaptation)
PETER R. GRANT
AND
B. ROSEMARY GRANT
Department of Ecology and Evolutionary Biology, Princeton University, Princeton, N.J. 08544-1003
ABSTRACT External (environmental) factors affecting the speciation of birds are better known than the internal (genetic) factors. The opposite is true for several groups of invertebrates, Drosophila being the outstanding example. Ideas about the genetics of speciation in general trace back to Dobzhansky, who worked with Drosophila. These ideas are an insufficient guide for reconstructing speciation in birds for two main reasons. First, speciation in birds proceeds with the evolution of behavioral barriers to interbreeding; postmating isolation usually evolves much later, perhaps after gene exchange has all but ceased. As a consequence of the slow evolution of postmating isolating factors the scope for reinforcement of premating isolation is small, whereas the opportunity for introgressive hybridization to influence the evolution of diverging species is large. Second, premating isolation may arise from nongenetic, cultural causes; isolation may be affected partly by song, a trait that is culturally inherited through an imprinting-like process in many, but not all, groups of birds. Thus the genetic basis to the origin of bird species is to be sought in the inheritance of adult traits that are subject to natural and sexual selection. Some of the factors involved in premating isolation (plumage, morphology, and behavior) are under single-gene control; most are under polygenic control. The genetic basis of the origin of postmating isolating factors affecting the early development of embryos (viability) and reproductive physiology (sterility) is almost completely unknown. Bird speciation is facilitated by small population size, involves few genetic changes, and occurs relatively rapidly.
© 1997 by The National Academy of Sciences 0027-8424/97/947768-8$2.00/0 PNAS is available online at http://www.pnas.org.
A major task for evolutionary biologists is to explain the origin of biodiversity. Within this broad field lies the central Darwinian problem of explaining how species are formed. When a solution is reached it will be an amalgam of information from genetics, ecology, and other disciplines; the amalgam will vary from one group of organisms to another, and genetics will be at the core. As stated most succinctly by Dobzhansky (1), ‘‘Evolution is a change in the genetic composition of populations. The study of mechanisms of evolution falls within the province of population genetics.’’ In this article we consider just one group of organisms, birds, and ask ‘‘How are new species produced, and what are the genetic changes involved?’’
Dobzhansky’s 1937 book is a logical starting point for seeking an answer to these questions because it is at the root of current ideas about the genetic basis of speciation. However, in one respect it is disappointing: it has nothing to say about the genetics of birds. In it and the revised version in 1951, birds are used to illustrate some phenotypic and ecological patterns: geographical patterns of morphological variation within species, and changes occurring when a species enters a new area; rapid radiation in archipelagos; competition; and the evolution of reproductive isolation. The ecology of bird speciation is well known, but the genetics of speciation are the genetics of other organisms, mainly Drosophila. In other respects the book is fascinating and rewarding because many of the issues he grappled with are still unresolved. What is a species? How are species formed? Can species form entirely sympatrically? What are the genetic mechanisms involved in the speciation process? What are the respective roles of selection and drift? What is selected, how, and why? The answers he gave to these questions have left an indelible imprint on the way we think about species and speciation. Here is a brief overview of those aspects most relevant to avian speciation.
Dobzhansky defines speciation as a process of evolutionary divergence of two or more populations. It starts with a single population and finishes with the coexistence of reproductively isolated populations: two species from one. Reproductive isolation means little or no gene exchange; it does not necessarily mean no gene exchange. It involves intrinsic, genetically conditioned, properties. ‘‘The first vestige of the isolation develops probably always in allopatric populations. Inviability of F1 hybrids, and low average adaptedness of the F2 and of backcross products, are byproducts of the genetic differentiation of allopatric populations. . . The hybrid inviability and breakdown provide, then, the stimulus for natural selection to build up other reproductive isolating mechanisms. Reproductive isolation diminishes the frequency of the appearance of hybrids, prevents the reproductive wastage, permits the populations of the incipient species gradually to invade each other’s territories, and finally to become partly and wholly sympatric’’ (1).
In other words, selection enhances or reinforces (2) the differences in traits that are responsible for reproductive (behavioral) isolation when partly differentiated and previously separated populations meet, helping to complete the process of speciation. These traits constitute reproductive isolating mechanisms (1). Thus the genetics of speciation is (a) the genetics of differentiation and (b) the genetics of reproductive isolation.
Dobzhansky emphasized two other features that are relevant to our quest. The first is that speciation might be completed entirely in allopatry, but we would not know it if the populations remain geographically separated (but see ref. 3). The second is that speciation, the development of reproductive isolation, is a process that requires a long time.
These ideas may apply without qualification to some cases of bird speciation. Nevertheless we will argue that bird speciation often, perhaps usually, takes a different course and involves some different factors. Information on sexual imprinting, hybridization, and the fitness of hybrids leads us to suggest that premating isolation arises before postmating isolation; the capacity to learn through an imprinting mechanism keeps bird species apart before postmating isolating factors begin to evolve. Some of the factors involved in premating isolation are
under single-gene or polygenic control (plumage and morphology), others are culturally inherited (song). The slow evolution of postmating isolation implies that the scope for reinforcement of premating isolating mechanisms is minimal. Involvement of culturally inherited traits may be partly responsible for the relatively rapid rate of speciation in birds.
Speciation in Birds
Speciation has been most thoroughly investigated, and for many years, in Darwin’s finches (Geospizinae; refs. 4–7). We therefore begin by describing a model that was devised specifically for these birds on the Galápagos Islands (8). We examine the evidence for various aspects of the model, focusing on genetic factors where possible, and consider alternatives. Then we ask what needs to be added to the model to make it a comprehensive statement of speciation in birds in general.
A Model of Allopatric Speciation
Fig. 1 portrays three stages in the cycle of events leading to the division of one species into two. The choice of islands to illustrate these stages is arbitrary. In step 1, the archipelago is colonized from continental South or Central America. A breeding population becomes established, and its size increases. In step 2 some individuals disperse to another island and establish a new breeding population. Some evolutionary change takes place in the new environment through selection and drift. Step 2 may be repeated several times, giving rise to several differentiated populations of the same species. Step 3 is the contact, through dispersal, of members of two populations possessing different mate signaling and recognition systems. This is the secondary sympatric phase of the cycle, and there are two types of outcomes. In one, members of the two populations do not interbreed, or if they do their offspring are inviable or infertile; the process of speciation in this case has been completed in allopatry. Alternatively the populations are only partly reproductively isolated, interbreeding occurs, and some of the hybrids survive to breed. Reinforcement of the differences between the species then may occur if the hybrids have relatively low fitness.
FIG. 1. Allopatric speciation of Darwin’s finches in the Galápagos archipelago (8). After an initial colonization of the archipelago (step 1), dispersal and the colonization of new islands (step 2) gives rise to allopatric populations, which diverge through selection and drift. The process is completed with the establishment of sympatry (step 3). The choice of islands to illustrate the process is arbitrary. [Reproduced with permission from ref. 7 (Copyright 1996, The Royal Society)].
Step 1 probably occurred once, or at most a few times, given the large distance separating the islands from the continent, which was greater at the time of initial colonization than at present (9). An argument from major histocompatibility complex variation suggests that there must have been a minimum of 30 individuals in the colonizing population (10). Stages 2 and 3 were repeated several times, giving rise to several species over a period of time estimated to be less than 3 million years (11). The ecological conditions would have varied from one cycle to another, but the essential features were repeated. The varying conditions include the length of the period of the allopatric phase (stage 2) before secondary contact, population sizes and hence the scope for drift, and the difference in the island environments and hence the scope for directional selection. Another important factor was the creation of new islands by volcanic activity (7, 9) and the recent periodic lowering of sea level (12). Over the last 3 million years there has been a net increase in the number of islands despite some disappearing through submergence, paralleling the increase in number of species (7). Thirteen species are recognized on the basis of morphological and biological criteria (4, 6), with as many as 10 occurring on a single island. A 14th species inhabits Cocos Island.
Evidence from Field Studies of Darwin’s Finches
We observe closely related species in sympatry and infer how they evolved from a common ancestor. Therefore we first consider how species are reproductively isolated, and then work back to their allopatric origin.
Reproductive Isolation
Species can be recognized by their morphological characteristics and songs (13, 14). With rare exceptions sympatric species pair and breed conspecifically, and as a result are reproductively isolated from each other.
They choose mates on the basis of song, sung by males only, and morphological appearance, in which beak size and shape and body size play a part but plumage does not. Imprinting on adult features early in life appears to guide the choice of mates (7, 15, 16). The role of morphology in mate choice has been demonstrated experimentally with tests that show that several pairs of sympatric species of ground finches (Geospiza) discriminate between conspecific and heterospecific visual cues (17). Separately, experiments have shown that males can discriminate between conspecific and heterospecific auditory cues (18). Females were not tested in these acoustic experiments, but it would be surprising if they were not capable of making the same discriminations. The evolution of reproductive isolation in Darwin’s finches is therefore the evolution of differences in song and in morphology. Reproductive isolation is not complete; species hybridize, rarely, and are capable of producing fertile hybrids that backcross to the parental species (12, 19, 20). The rare interbreeding of species and the mating pattern of the hybrids provide further evidence of the importance of song in mate choice. Hybridization occurs sometimes as a result of miscopying of song by a male; a female pairs with a heterospecific male that sings the same song as that sung by her misimprinted father (16). On Daphne Major island hybrid females bred with males that sang the same species song as their fathers (20). All G. fortis × G. scandens F1 hybrid females whose fathers sang a G. fortis song paired with G. fortis males, whereas all those whose fathers sang a G. scandens song paired with G. scandens males. Offspring of the two hybrid groups (the backcrosses) paired within their own song groups as well. The same consistency was shown by the G. fortis × G. fuliginosa F1 hybrid females and all their daughters, which backcrossed to G. fortis. Thus mating of females was strictly along the lines of paternal song.
The independent role of morphology in mate choice is revealed by the rare instances where the usual association between song and morphology is disrupted. Four F1 hybrids (three females and a male) were produced by a G. fortis female that paired with a G. scandens male that sang a typical G. fortis song. The hybrids showed morphological evidence of nonrandom mating; their mates (all G. fortis) were more G. scandens-like in size and bill shape than were the potential G. fortis mates with which they did not breed. Also the one G. fortis × G. scandens F1 hybrid male that paired with a G. fortis female, before switching to a G. scandens mate, had the most G. fortis-like beak proportions of all hybrids, male and female. Like its father, it sang a G. scandens song. Although rare, these examples give insights into the cues used in the choice of mates under natural conditions where variables (song and morphology) are normally confounded.
Inheritance of Traits that Isolate the Species
Isolation involves two sets of signals, song and morphology, and behavioral responses to them. One of the signals, song, is a culturally inherited trait in Darwin’s finches (15). Males (only) sing a single song that is sung unchanged throughout life and acquired by an imprinting-like learning process, usually from their fathers, but in a minority of cases from other conspecific males (15, 21). Beak and body size traits, on the other hand, are quantitative traits that are under polygenic control. Heritabilities of six traits in three species studied in most detail lie generally in the range of 0.5 to 0.75, and the genetic correlations among the traits are similarly high (13, 22, 23). The possible genetic basis of the behavioral responses to song and morphological signals in the selection of a mate, and variation in the responses, are unknown, as they are for all birds (24).
Crossfostering experiments are needed to dissociate the effects of parentally inherited genes from the effects of parentally influenced learning (imprinting) on mate choice. Where these have been done with other species they have shown that preferences for mates are not inherited in an inflexible way. Rather, young birds imprint on parental phenotype (25, 26), and visual and auditory stimuli they receive in early life influence their choice of mates much later in life. Thus sexual imprinting is likely to obscure any expression of genetically based variation in mate preferences that might exist. Even after effects of early experience have been experimentally manipulated by crossfostering it is still not known whether zebra finch males possess a biased (inherent) preference for conspecific females (27) or not (28). An inherent basis to female preferences is similarly not known. Within each species there is little indication of assortative mating on the basis of beak size (20, 29) or song (13, 15, 30, 31), and little indication that mating is influenced by the measured morphological features of the parents (20). However, the interbreeding of species and the breeding of hybrids reveal evidence of imprinting. The evidence for song has been given above. With regard to morphology, there are two lines of evidence. First, G. fortis females mated with G. scandens that were morphologically similar to their mothers. Second, the group of G. fortis × G. scandens F1 hybrids that paired with G. fortis mates showed evidence of morphologically nonrandom mating; their mates were more G. scandens-like in size and bill shape than were the potential G. fortis mates with which they did not breed. Therefore their choice of mates may have been influenced by the appearance of their G. scandens father.
To summarize, the particular song a male sings, and the behavioral responses of females to song and morphological signals, are not genetically inherited in a fixed manner but are determined by learning early in life. Genes that underlie the capacity to receive, use and transmit information are the evolving properties. These genes may evolve during speciation, but this seems unlikely to us. Rather, features of the imprinting process of closely related species of Darwin’s finches (e.g., onset, duration, and cessation) are probably very similar, if not identical. Their mate recognition systems may be identical, through shared inheritance, yet usually they pair conspecifically solely as a result of sexual imprinting on their parents or similar models. Polygenic variation does underlie variation in morphological signals, however. Thus the genetics of speciation of Darwin’s finches through the evolution of reproductive isolation are very different from the genetics of Drosophila as described by Dobzhansky and elaborated by many others (32–34).
Evolution of Prezygotic Isolation: Reinforcement
Do the differences in courtship signals and responses develop entirely in allopatry, or do they arise in allopatry and continue to increase in sympatry reinforced by selection in accordance with Dobzhansky’s reasoning on the minimization of interbreeding? We answer this question by considering morphology. The reinforcement hypothesis in its original form requires that a certain degree of genetic incompatibility has evolved in allopatry, becoming manifest in sympatry (see refs. 1, 35, and 36). This requirement is not met, and therefore the hypothesis is rejected. The evidence is as follows. Interbreeding of allopatric birds (under controlled conditions in captivity) has not been performed, but field observations of natural hybridization have been made on the islands of Daphne Major (14, 37, 38) and Genovesa (13). These show that all six species of Darwin’s ground finches (genus Geospiza) hybridize (rarely) with at least one other congeneric species. In addition some intergeneric crosses are known among the tree finches and warbler finch, and breeding hybrids have been produced (6, 21). On Daphne Major Geospiza fortis (medium ground finch) hybridizes with G. scandens (cactus finch), another resident species, and G. fuliginosa (small ground finch), an uncommon immigrant.
Contrary to expectation from the reinforcement hypothesis, hybrids formed by Geospiza fortis breeding with G. scandens and G. fuliginosa are both viable and fertile to a degree similar to that of the contemporary offspring of conspecific matings; so are the first two generations of backcrosses (12, 19). Backcrossing negates the hypothesis of speciation occurring entirely in allopatry. The speciation process therefore continues in sympatry. But the lack of a detectable genetic fitness loss associated with hybridization shows that reinforcement of song and morphology differences in sympatry does not occur, at least not for reasons of genetic incompatibility argued by Dobzhansky. Indeed postzygotic isolation has not evolved in allopatry, either partly or wholly, to any detectable extent. To be more specific, and with reference to the summary of Dobzhansky (1) above, inviability of some F1 hybrids, and low average adaptedness of the F2 and of backcross products, are not byproducts of the genetic differentiation of allopatric populations before secondary contact. They do not cause reproductive isolation; they follow as a consequence. A broader statement of the reinforcement hypothesis allows for reinforcement to occur when the hybrids are at a disadvantage for ecological reasons, regardless of whether there are fitness-reducing genetic incompatibilities or not. For example there are environmental circumstances under which one class of F1 hybrids (G. fortis × G. fuliginosa) do not survive well (39). Because the relative fitness of hybrids is environment-dependent it is possible that the conditions for reinforcement (low fitness) could arise episodically solely from environmental causes. However, reinforcement then would occur only if the birds that hybridized were a nonrandom, and hence selectable, subset of the population with respect to a genetically inherited trait, such as bill size. There is no evidence for this (20).
Evolution of Premating Isolating Factors in Allopatry
Despite the evidence against reinforcement, the basis of premating isolation does not evolve entirely in allopatry. If it did we would expect members of well differentiated allopatric populations or closely related species to display a high degree of discrimination against heterotypic individuals in a reproductive context if they were brought into contact. Experiments show that this does not happen; there is instead a strong potential for ‘‘reproductive confusion’’ between immigrants from a partly differentiated population and their resident relatives on an island to which they have dispersed. This potential has been revealed by discrimination tests conducted experimentally with stuffed museum specimens (models) of potential mates. The experiments used a conspecific specimen and a specimen from an allopatric population of a morphologically similar species, simulating the immigration of a closely related species (40). Using a heterospecific model rather than a well differentiated conspecific model should have biased the experimental result toward a clear discrimination, but, in fact, the observed results were the opposite: a statistical lack of discrimination. The results cannot be used as a measure of the degree to which interbreeding on secondary contact would occur. Nonetheless they stand in contrast to the rarity of observed hybridization. This implies that further divergence in courtship signals and responses occurs after the establishment of sympatry; though not, as we have seen, by the Dobzhansky process of reinforcement. A parallel set of experiments gave a similar result with song. Like beak size or shape differences, song differences have the potential of effecting a degree of isolation between the interactants.
Experiments with playback of tape-recorded song demonstrate that resident males are capable of discriminating between songs produced by members of their own population and members of a related population on a near island. Residents’ songs elicit stronger responses than do those of the potential immigrants (18). In tests of several species the discrimination was often weak, implying that song difference, by itself, would not be sufficient to prevent interbreeding.

Allopatric Divergence

The second stages in each cycle are likely to have been followed by some back-migration (dispersal) from derived to original populations, and repeated forward-migration, in view of the generally short distances separating the islands (Fig. 1). Exchange of breeding individuals between two populations tends to homogenize their gene pools. If evolutionary divergence in allopatry has been minor before the interchange, the members of the two populations are likely to treat each other as potential mates and interbreeding will ensue. How, then, do they diverge to the point of developing premating isolation? The answer is a combination of natural selection and cultural drift. Song differences between populations of the same species, and between species, arise initially in allopatry through a random sampling of song types in the newly founded population (41), followed by change as a result of small copying errors in transmission from father to son (15, 16, 20), as well as random extinction of song types; in other words by a process of cultural change analogous to genetic drift (15). Song frequencies also could change as a result of a chance association between song and a naturally selected morphological trait in males (31); evidence for this is not strong (15). The divergence of songs in the new population away from those in the progenitor population would only be prevented if these processes were balanced by repeated immigration and subsequent breeding: song flow.
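The cultural drift of song described above (random sampling of song types at founding, copying errors in transmission from father to son, and random extinction of song types) can be illustrated with a toy simulation. The sketch below is not the authors' model: the function name, the constant population size, the treatment of every copying error as a wholly new song type, and all parameter values are assumptions made only for illustration.

```python
import random
from collections import Counter


def simulate_song_drift(source_songs, n_founders, n_generations,
                        copy_error_rate, seed=0):
    """Toy model of cultural drift in song types (hypothetical parameters).

    source_songs    : song types present in the progenitor population
    n_founders      : number of singing males, held constant (an assumption)
    n_generations   : number of father-to-son transmission steps
    copy_error_rate : chance that a son's copy becomes a novel song type
    """
    rng = random.Random(seed)
    # Founding event: a random sample of song types from the source population.
    population = [rng.choice(source_songs) for _ in range(n_founders)]
    next_new_type = 0
    for _ in range(n_generations):
        new_population = []
        for _ in range(n_founders):
            # Each son copies a randomly chosen father, so rare song types
            # can be lost by chance (random extinction).
            song = rng.choice(population)
            if rng.random() < copy_error_rate:
                # Simplification: a copying error yields a brand-new type.
                song = f"new{next_new_type}"
                next_new_type += 1
            new_population.append(song)
        population = new_population
    # Frequencies of song types in the final generation.
    return Counter(population)
```

Calling, for example, `simulate_song_drift(["A", "B", "C"], 20, 50, 0.01)` returns the song-type counts after 50 generations. Repeating the call with different seeds gives different surviving song types, which is the sense in which isolated populations drift apart in the absence of song flow.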
Beak and body size differences arise under the different and strong directional selection pressures affecting the way in which finches exploit the environment, principally the food
supply. Islands are known to differ in the food supply available to ground finches, mainly seeds (42–44). Natural selection on beak and body size traits has been measured in a population of Geospiza fortis on the island of Daphne Major at two times of change in the seed supply (45–47). Because the morphological traits are highly heritable (22, 23), natural selection in one generation led to an evolutionary response in the next (48). The magnitude of the response was such as to probably dwarf any genetic changes occurring over the short term by drift and immigration. Unfortunately, we have no measures of these latter two changes. Selective changes in the population of G. scandens on Daphne (37, 38) and G. conirostris on the island of Genovesa (13) also have been quantified, and although evolutionary responses were not determined they probably occurred because these species also display highly heritable morphological variation (13, 23).

Sufficient Allopatric Divergence for the Establishment of Sympatry

Differences in beak size between closely related species tend to be greater in sympatry than in allopatry (4, 6). Of the two possible reasons for this pattern, reinforcement and ecological divergence, reinforcement has been ruled out. The pattern is one of ecological, rather than reproductive, character displacement. Thus sympatry is established providing that (a) ecological conditions in the form of available food supply are favorable for the establishment of a new species, and (b) sufficient divergence in beak sizes and diets has occurred in allopatry. Further divergence occurs in sympatry (step 3 of the cycle) as a result of the morphologically most similar individuals of each species suffering the greater effects from competition for food. Supporting evidence for this interpretation includes a positive correlation between beak size differences and dietary differences that is predicted from the distribution of seed sizes on each island (42, 44, 49).
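The link drawn above between high heritability and an evolutionary response to selection in the next generation is commonly summarized by the univariate breeder's equation of quantitative genetics. The text does not state this equation, and the cited finch studies deal with several correlated traits, so the single-trait form below is only a simplified sketch. The symbols R, S, h^2, and the subscripted means are notation introduced here, not taken from the paper.

```latex
% Univariate breeder's equation (standard result; a single-trait simplification)
% R   : evolutionary response, the change in the trait mean across one generation
% S   : selection differential, the shift in the parental mean caused by selection
% h^2 : narrow-sense heritability of the trait
\[
  R = h^{2} S, \qquad
  S = \bar{z}_{\mathrm{survivors}} - \bar{z}_{\mathrm{before}}, \qquad
  R = \bar{z}_{\mathrm{offspring}} - \bar{z}_{\mathrm{before}}
\]
% When h^2 is close to 1, R is nearly as large as S.
```

Under this simplification, a highly heritable trait passes most of a within-generation shift in the mean on to the offspring generation, which is why a measured selection event on beak or body size is expected to produce a detectable response.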
Although speciation is ecologically driven, reproductive (mating) interactions at secondary contact are far from irrelevant. Beak sizes and shapes have both mate (species) recognition and food handling functions. The outcome of immigration of individuals from a partly differentiated population could hang precariously between interbreeding and the reproductive absorption of immigrants into the population of residents, on the one hand, and establishment, divergence, and coexistence despite some degree of interbreeding, on the other hand. Because death and emigration are two other possible fates of the immigrants, establishment of a new population and completion of step 3 of the speciation cycle is a rare event.

When does a Species become Two?

In the first decade of the study on Daphne Major none of the F1 hybrids survived long enough to breed; therefore no gene exchange took place between the species, which were, in that sense, completely reproductively isolated. From 1983 onwards the hybrids backcrossed to the parental species and neither they nor their offspring experienced any apparent loss of fitness. The species were not reproductively isolated and were (and still are) moving slowly on a trajectory toward panmixia. The movement is slow because song constrains the mating of members of the backcross generations (12); none of the backcrosses have hybridized (20). Should these be considered three species or one? Given the variety of opinions about how species should be defined and recognized (50), there is no clear answer to that question now, any more than there was when Huxley (51) wrote ‘‘we must not expect too much of the term species. In the first place, we must not expect a hard-and-fast definition, for since most evolution is a gradual process, borderline cases must occur. And in the second place, we must not expect a single or a simple basis for
definition, since species arise in many different ways.’’ In our view it is preferable to continue to treat the finches on Daphne as three species, expecting that environmental conditions will change back to those disfavoring the hybrids (14, 39). Elsewhere in the archipelago the three species are morphologically distinctive (52).

Avian Speciation in General

The model in Fig. 1 was designed to capture the essential features of allopatric speciation, although devised specifically for the insular speciation of Darwin’s finches. We have emphasized divergence in mate recognition and feeding traits in allopatry, song learning and imprinting, hybridization, and the absence of genetic incompatibilities. All of these features are displayed by birds elsewhere. For example, both imprinting (53) and hybridization (19, 54) are widespread in birds. Nonetheless the extent to which hybridization, imprinting, and the other features are generally involved in bird speciation is difficult to gauge because the requisite data for most bird species are lacking. The details of speciation must vary to some extent, because not all species learn song, not all species imprint on parental phenotype because not all exhibit parental care, and not all hybridize. Speciation of birds on islands elsewhere appears to follow the major paths of the model (7), with one conspicuous exception: a far greater diversity of plumage colors and patterns has evolved on other islands (7, 55, 56), presumably under sexual selection. The same applies to birds on continents. For a full understanding of avian speciation, on continents as well as on islands, we need to understand the genetic basis of plumage variation and associated behavior in courtship and mate choice, as sexual selection probably has been a major force in bird speciation (57, 58). Furthermore, to understand how plumage characters and behavior change during speciation we need to take account of demography and geography.
Genetic Changes in Plumage and Courtship Behavior

Plumage features constitute a major component of courtship signals (59). In theory, reproductive isolation between related species could be achieved by a single mutational change affecting a plumage trait that is used to identify mates as a result of the sort of imprinting process outlined for the Darwin’s finches. For example, in the Bananaquit (Coereba flaveola) on the Caribbean island of Grenada a plumage polymorphism is governed by a single, diallelic, autosomal locus: melanic plumage is completely dominant to yellow plumage (60) (in other species melanism is a recessive trait; refs. 61 and 62). If the polymorphism arose through mutation in an isolated population, and the mutant allele became fixed, there would be a potential for reproductive isolation at secondary contact. This scheme is too simple for most cases because polygenic inheritance of plumage traits is more common than single-gene inheritance (62). Polygenic inheritance is revealed by hybridization studies that show hybrids and backcrosses to possess traits intermediate between those of their parents (63–66). In these cases speciation may have involved the accumulation of frequency differences in the alleles, including novel alleles, at many loci. Yet even where speciation has involved divergent evolution of several plumage traits the changes may have been brought about by a small number of mutations of sufficient magnitude to produce appreciable effects from the beginning, with subsequent modification by genes of relatively minor effect (63). For example, a minimum of 14 loci influence the crest (color and extent) in F1 hybrids between two species of pheasants, Chrysolophus picta (Golden) and C. amherstiae (Amherst). Nevertheless alleles at a few loci have large phenotypic effects on crest morphology and display dominance
and recessiveness, while others have smaller and apparently additive effects (63). Evidence of epistasis from hybridization studies is more scarce. One plumage trait is consistently present in the F1 pheasant hybrids yet lacking in both parental species: on each side a conspicuous yellow spot extends across the breast, and the two spots nearly meet in the midline (63). The controlling factors may act epistatically, if not in an overdominant manner. Males transmit signals in courtship through behavioral displays. In the early stages of speciation different (plumage) signals may be transmitted by the same displays. Later the displays themselves diverge. Hybridization studies show that when the parental species do differ the displays of F1 hybrids may (a) be intermediate, (b) comprise new combinations of displays found in one or both parental species, or (c) occur in neither parental species but perhaps occur in a third species (66–70). Intermediate behaviors in the F1 hybrids indicate polygenic inheritance, and different combinations of components of the parental species repertoires in the members of the F2 generation (64) indicate a lack of linkage. As expected if behavioral features are inherited in a Mendelian fashion a broad range of variation in behaviors is shown in the F2 generation, with the all-or-none performance of some displays (64) possibly indicating dominance. Rates of bowing of some hybrid doves (70) were found to be very close to those of one of the species, suggesting dominance, and in other types of hybrids were beyond the range of both parental species, suggesting overdominance or epistasis. Almost nothing is known from hybridization studies about the inheritance of courtship behavior of females, or of their responsiveness to particular male signals.
This is particularly unfortunate in view of the importance given in theoretical models of sexual selection to genetic variation in female preferences for male traits (e.g., refs. 24, 71, and 72). Female cardueline finch hybrids solicit copulations in the same way as the parental species (67). Because the parental species in this study were not each other’s closest phylogenetic relative, the results show that copulatory behavior is evolutionarily conservative in this group and does not diverge during speciation. The possible role of inherited factors in the divergence of female precopulatory behavior during speciation needs to be investigated.

Genetic Changes During Speciation

Electrophoretically detected genetic differences between closely related species provide an indirect measure of the genetic changes taking place during or shortly after speciation. They are indirect because they are unlikely to have any bearing on plumage and courtship behavior, or ecologically important morphological traits. Genetic distances between bird species are unusually small in comparison with other vertebrates (73–76). This has given rise to the idea that few genetic changes are involved in speciation, and that their phenotypic effects are minor (77, 78): ‘‘Perhaps speciation is often a rather superficial phenomenon, involving chiefly inherited changes in behavior and plumage’’ (77). Consistent with this ‘‘few genes’’ view of avian speciation, a small number of changes in plumage-governing genes apparently are involved in speciation. Closely related species of birds are also chromosomally similar. Based on a comparison of 177 possible congeneric pairs of species, mainly passerines, Shields (79) concludes that chromosome change has not been involved in the promotion of reproductive isolation in most cases. Other authors have reached the same conclusion (77, 80).
Nonetheless chromosomal evolution may play some role in speciation as it has been five times as fast in the more rapidly speciating passerines as in the nonpasserines (79). Against a background of chromosomal similarity among congeneric finch species in four families, a few pairs of congeneric species stand out by differing in
diploid number of chromosomes and several pericentric inversions (81, 82). A minor complication with the ‘‘few genes’’ view arises from hybridization. Natural hybridization in birds is widespread though usually not common (19, 54), and in several cases it occurs without a marked loss of fitness (19). Genetic distances between bird species may be unusually short because such species occasionally exchange genes. Introgressive hybridization is, on the one hand, permitted by the small number of gene differences between closely related species and, on the other hand, contributes to the smallness of the genetic differences. The important point is that closely related species are almost completely reproductively isolated behaviorally, despite having the potential of producing viable and fertile hybrid offspring as a result of their genetic similarity. They remain isolated as a result of different signal and response systems in courtship.

The Prevalence and Potential Importance of Hybridization

If postmating isolation evolves in allopatry, hybridization at the sympatric stage will be lacking, or at most extremely rare and of little consequence. This was the view expressed by Mayr (83) when he wrote that few situations are known in birds where hybridization occurs, few hybrids are known, the majority are likely to be sterile, and if hybridization occurs and genes introgress the process will be self-reinforcing, with selection eliminating disharmonious mixed gene combinations. Even where hybridization is locally relatively common, as between black grouse (Tetrao tetrix) and capercaillie (T. urogallus) in Europe, it appears not to have led to introgression (83). If, on the other hand, postmating isolation evolves slowly after sympatry is established there will be greater opportunity for hybridization to not only occur but, through introgression, to influence the evolution of the populations that exchange genes.
Modern evidence suggests this applies to birds. In the last 25 years several field studies have documented by observation the occurrence of hybridization of bird species and shown that hybrids are viable and fertile to a large degree (19, 54). For example, at least half of 29 sympatric species pairs of North American birds hybridize, and at least a quarter backcross frequently (84). Eight types of intrageneric and intergeneric hybridization have been documented among sympatric hummingbird species in North America (85), and 13 types of intergeneric hybridization have been recorded of a possible 28 in sympatric birds of paradise (86). Introgressive hybridization is underestimated by observation because it is not easy to detect (87). For example, detailed studies show that the F1 hybrids produced by black grouse and capercaillie do, in fact, backcross to capercaillie, possibly to black grouse as well, but, in accordance with Haldane’s rule, it is only the males that do (88). With regard to speciation, the more appropriate hybridization is between the sister taxa capercaillie (Tetrao urogallus) and the black-billed capercaillie (T. urogalloides). Where the ranges overlap in eastern Siberia not only do they hybridize, the hybrids (Kirpicev’s capercaillie) are fertile and show morphological evidence of hybrid vigor (88). Molecular data have revealed evidence of previously unsuspected or underappreciated introgression (89–91). The occurrence of hybridization in both socially monogamous and polygynous species raises the possibility that interbreeding occurs cryptically, even if rarely, outside the pair-bond of the former and away from the main mating arenas of the latter. The prevalence of intergeneric hybridization in some groups of birds (92, 93) with little or no obvious fitness loss argues for long retention of the ability to hybridize successfully (see also ref. 
94), as does the hybridization of nonsister taxa within genera (95) without the hybrids experiencing a disadvantage (96). Introgressive hybridization has the potential of leading to further evolutionary change as a result of enhancing genetic variances, in some cases lowering genetic covariances (23),
introducing new alleles (97), and creating new combinations of alleles, some of which might be favored by natural selection (1, 98) or sexual selection (99). Svärdson (100) believed that introgression in coregonid fishes has replaced mutation as the major source of evolutionary novelty. Introgression and mutation are not independent; introgressive hybridization may elevate mutation rates (101). A particularly clear example of introgressive hybridization has been described by Chapin (102, 103). Exaggerated plumage traits (crest and tail length) have entered populations of two species of paradise flycatchers (genus Terpsiphone) as a result of hybridization with a third species, in both east and west Africa. These traits are likely to have been favored by sexual selection. To judge from the inheritance of crests in quail (65) and extreme tail development in pheasants (63), just a few introgressed alleles could have effected the plumage transformations. The relevance to speciation lies in the fact that regions of introgression are peripheral areas, which could become isolated from the main range of the species through a change in climate and habitat: they are potential sites of speciation.

Geography and Demography of Speciation on Continents

In the early stages of allopatric speciation on continents, two populations are derived from one either by colonization of a second area through dispersal or by the splitting of a continuous range into two (vicariance). The dispersal mode is not fundamentally different from step 2 in Fig. 1. This is illustrated by ring species, which appear to offer the closest continental parallel to the dispersalist scheme in Fig. 1. According to the classical explanation of Dobzhansky (1), through dispersal a chain of partly differentiated populations becomes converted to a ring (see also refs. 54 and 83).
At the point of ring closure or overlap where two populations establish secondary contact they do not interbreed, or do so extremely rarely; e.g., herring gull and lesser black-backed gull (104). A cross-fostering experiment with these gulls showed that, as in Darwin’s finches, misimprinted birds are capable of producing viable hybrids, i.e., once the premating isolating mechanism is broken (104, 105). The vicariance mode is different. Here the problem is to understand how a mutation arising in a large population might increase from an initially extremely low frequency to fixation. Directional natural or sexual selection would have to be persistent and strong, over a range of environmental conditions, to accomplish this. Speciation on continents might be quite different from speciation on islands if vicariance is a relatively common mode and dispersal into peripheral isolates is relatively rare, as is often claimed (106–108). However, the distinction between the two is not easy to make, principally because current distributions may be much broader as a result of postglacial range changes than the ones in which the main evolutionary events took place (108). It seems to us likely that some combination of vicariance and dispersal events might be more frequent than either is alone. The ring species may be a case in point (83). Conditions for fixation to occur are much less stringent in a subdivided population, least stringent of all when the subdivisions occupy small habitat islands in a fragmented landscape: in effect an archipelago of insular populations within a continent. For this reason refuges at high (98, 109) and at low (110, 111) latitudes often have been invoked to account for the evolution of distinctive traits in closely related species. Hall and Moreau (112) have argued that “for continental [African] birds the period of most active speciation is a time of unfavourable climate, such as will break its natural habitat into small blocks or ‘islands’. .
.” Quaternary fluctuations in climate have produced these conditions repeatedly (113, 114); humans are doing the same now (102). Each time the ‘‘islands’’ were probably reconstituted differently as a result of semi-
independent (individualistic) responses by components of the biota to climatic change (115). If continental speciation is promoted by temporarily insular conditions the demography and genetics of founding populations may be of importance. It frequently has been suggested that on small islands or in fragmented habitats major genetic changes can take place in the early history of the founding of a new population by a few individuals, or during the bottlenecks caused by subsequent population crashes (116–119). The changes are believed to involve a few major genes that comprise a strongly epistatic polygenic system. Certainly changes can be rapid under these demographic circumstances. Selection is likely to be effective during the period of rapid growth from low population numbers, particularly for low-frequency alleles (120). High speciation rates (77), a greater genetic gap between similar species than between the most different conspecific populations (76), and the fact that populations are large yet genetic differences are small (74, 79), have all been interpreted as evidence for founder effects arising in subdivided populations and contributing to bird speciation. The theory of founder speciation was developed when speciation was viewed as the evolution of postmating isolation largely, if not entirely, in geographical isolation (116). Derivatives of it are useful to account for some speciation phenomena in Drosophila (118–121) and similar organisms, but for birds it needs to be reappraised in light of a refocus on the origin of premating isolation (7). The idea that closely related species of birds differ in coadapted syndromes of mating behavior with an extensive genetic basis (122) is not supported by modern studies of hybridization and imprinting. The theory of founder effects does not explain how novel features like plumage traits arise. 
Founder effects may have contributed to bird speciation in the more limited way of altering the frequencies of alleles already in existence at the time of founding of a new population and initiating the evolution of premating isolating mechanisms.

Completion of Speciation

The process of speciation is completed with the cessation of genetic exchange. In broad outline the last stages of speciation are known, but the genetic basis of the physiological details of postmating isolation is not. As species diverge they accumulate different mutations that contribute to lowered viability and fertility of hybrids, possibly also to prezygotic incompatibilities in the female reproductive tract. For example, hybrids of Agapornis parrots develop excessive fat, lipomas, defeathering (males), and gout (uric acid crystals in joints), and females are sterile (69). In accordance with Haldane’s rule, genetic problems first arise in the heterogametic females (123, 124). Premating isolation increases as these forms of postzygotic isolation develop. The sexual display activity and vocalizations of hybrids become reduced (e.g., see refs. 64 and 69) or disrupted at points in a courtship sequence corresponding to differences between the parental species (69, 125). The disruptions are caused by conflicts between incompatible units of behavior, missing parts of the repertoire of one or both parental species, or unusual behaviors not expressed by either parental species. Sexual behavior breaks down altogether and is not exhibited in hybrids produced in captivity between very disparate parental species, and cannot be induced by a large dose of sex hormones (67). Such species do not interbreed in nature. Genetic incompatibilities that conform to Haldane’s rule do not always arise in the simple speciation process of one species splitting into two. Sometimes they arise between two species so formed only after each of them has split further into yet more species. The frequency of this is unknown.
General Trends: Six Rules of Avian Speciation

As a means of summarizing the preceding discussion and survey of the literature we suggest there are six rules of speciation in birds:
1. Speciation is initiated in allopatry.
2. The sympatric phase of the speciation process is established after an allopatric period of ecological divergence.
3. Allopatric evolution of premating isolating mechanisms precedes the evolution of postmating mechanisms in allopatry or sympatry.
4. Premating mechanisms are governed mainly by additive effects of polygenes, postmating mechanisms are due mainly to nonadditive genetic effects (dominance and epistasis).
5. Premating mechanisms include effects of the cultural process of sexual imprinting.
6. Postzygotic incompatibilities arise first in females. This is Haldane’s rule applied to birds in which females are the heterogametic sex.
All rules have exceptions, otherwise they would be laws, and it is doubtful if there are any speciation laws. Birds provide overwhelming support for Haldane’s rule (126), with a few exceptions affecting fertility (e.g., see refs. 32 and 65) but apparently none affecting viability in captive birds (ref. 32; but see ref. 127). Rule 5 would seem to have the greatest number of exceptions, particularly among those species lacking paternal care (128). We include it on the strength of a literature survey that concluded that sexual imprinting ‘‘seems to have been found wherever it has been looked for, and should be considered the rule rather than exceptional for the development of mate preferences in birds’’ (53). The rules should apply to other nonavian taxa, to varying degrees. We conclude with a caveat. Almost 10,000 bird species are recognized under the biological species concept (129). Interpretations of speciation have been applied to perhaps 500 of them.
The genetic basis of variation in premating isolating traits believed to be involved in speciation is known (incompletely) for less than 100 species, and the genetic basis of postmating isolation is virtually unknown for all of them. The knowledge base from which to generalize about the genetics of bird speciation is precariously thin. Recognition of this should be a stimulus for future research. The ease with which closely related species can be induced to hybridize in captivity suggests that a program of experimental hybridization has much to teach us about the genetics of bird speciation.

We are grateful to the A. von Humboldt Foundation and Peter Berthold of the Max–Planck–Institut at Radolfzell for support while this paper was written, and to T. D. Price, D. S. Woodruff, and an anonymous referee for helpful comments. Research on Darwin’s finches has been supported by the National Science and Engineering Research Council (Canada) and the National Science Foundation (USA).

1. Dobzhansky, T. (1937) Genetics and the Origin of Species (Columbia Univ. Press, New York), revised 1951.
2. Blair, W. F. (1955) Evolution 9, 469–480.
3. Wittmann, U., Heidrich, P., Wink, M. & Gwinner, E. (1995) J. Zool. Syst. Evol. Res. 33, 116–122.
4. Lack, D. (1947) Darwin’s Finches (Cambridge Univ. Press, Cambridge).
5. Bowman, R. I. (1961) Univ. Calif. Publ. Zool. 58, 1–302.
6. Grant, P. R. (1986) Ecology and Evolution of Darwin’s Finches (Princeton Univ. Press, Princeton, N. J.).
7. Grant, P. R. & Grant, B. R. (1996) Philos. Trans. R. Soc. London Ser. B 351, 765–772.
8. Grant, P. R. (1981) Am. Sci. 69, 653–663.
9. Christie, D. M., Duncan, R. A., McBirney, A. R., Richards, M. A., White, W. M., Harpp, K. S. & Fox, C. G. (1992) Nature (London) 355, 246–248.
10. Vincek, V., O’Huigin, C., Satta, Y., Takahata, N., Boag, P. T., Grant, P. R., Grant, B. R. & Klein, J. (1996) Proc. R. Soc. London Ser. B 265, 111–118.
11. Grant, P. R. (1994) Evol. Ecol. 8, 598–617.
12. Grant, B. R. & Grant, P. R. (1997) in Endless Forms: Species and Speciation, eds. Howard, D. J. & Berlocher, S. H. (Oxford Univ. Press, Oxford), in press.
13. Grant, B. R. & Grant, P. R. (1989) Evolutionary Dynamics of a Natural Population: The Large Cactus Finch of the Galápagos (Univ. Chicago Press, Chicago).
14. Grant, P. R. (1993) Philos. Trans. R. Soc. London Ser. B 340, 127–139.
15. Grant, B. R. & Grant, P. R. (1996) Evolution 50, 2471–2487.
16. Grant, P. R. & Grant, B. R. (1997) Am. Nat. 149, 1–28.
17. Ratcliffe, L. M. & Grant, P. R. (1983) Anim. Behav. 31, 1139–1153.
18. Ratcliffe, L. M. & Grant, P. R. (1985) Anim. Behav. 33, 290–307.
19. Grant, P. R. & Grant, B. R. (1992) Science 256, 193–197.
20. Grant, P. R. & Grant, B. R. (1997) Biol. J. Linn. Soc. 60, 317–343.
21. Bowman, R. I. (1983) in Patterns of Evolution in Galápagos Organisms, eds. Bowman, R. I., Berson, M. & Leviton, A. E. (American Association for the Advancement of Science, San Francisco), pp. 237–537.
22. Boag, P. T. (1983) Evolution 37, 877–894.
23. Grant, P. R. & Grant, B. R. (1994) Evolution 48, 297–316.
24. Bakker, T. C. M. & Pomiankowski, A. (1995) J. Evol. Biol. 8, 129–171.
25. Immelmann, K. (1975) Annu. Rev. Ecol. Syst. 6, 15–37.
26. Cooke, F., Finney, G. H. & Rockwell, R. F. (1976) Behav. Genet. 6, 127–140.
27. Bischoff, H.-J. & Clayton, N. (1991) Behavior 118, 144–155.
28. Kruijt, J. P. & Meeuwissen, G. B. (1993) Neth. J. Zool. 43, 68–79.
29. Price, T. D. (1984) Evolution 38, 327–341.
30. Millington, S. & Price, T. (1985) Auk 102, 342–346.
31. Gibbs, H. L. (1990) Anim. Behav. 39, 253–263.
32. Coyne, J. A. (1992) Nature (London) 355, 511–515.
33. Coyne, J. A., Crittenden, A. P. & Mah, K. (1994) Science 265, 1461–1464.
34. Wu, C.-I. & Palopoli, M. (1994) Annu. Rev. Genet. 28, 283–308.
35. Howard, D. J. (1993) in Hybrid Zones and the Evolutionary Process, ed. Harrison, R. G. (Oxford Univ. Press, Oxford), pp. 46–69.
36. Liou, L. W. & Price, T. D. (1994) Evolution 48, 1451–1459.
37. Grant, P. R. & Price, T. D. (1981) Am. Zool. 21, 795–811.
38. Boag, P. T. & Grant, P. R. (1984) Biol. J. Linn. Soc. 22, 243–287.
39. Grant, B. R. & Grant, P. R. (1993) Proc. R. Soc. London Ser. B 251, 111–117.
40. Ratcliffe, L. M. & Grant, P. R. (1983) Anim. Behav. 31, 1154–1165.
41. Grant, P. R. & Grant, B. R. (1995) Evolution 49, 229–240.
42. Abbott, I. J., Abbott, L. K. & Grant, P. R. (1977) Ecol. Monogr. 47, 151–184.
43. Smith, J. N. M., Grant, P. R., Grant, B. R., Abbott, I. & Abbott, L. K. (1978) Ecology 59, 1137–1150.
44. Schluter, D. & Grant, P. R. (1984) Am. Nat. 123, 175–196.
45. Boag, P. T. & Grant, P. R. (1981) Science 214, 82–85.
46. Price, T. D., Grant, P. R., Gibbs, H. L. & Boag, P. T. (1984) Nature (London) 309, 787–789.
47. Gibbs, H. L. & Grant, P. R. (1987) Nature (London) 327, 511–513.
48. Grant, P. R. & Grant, B. R. (1995) Evolution 49, 241–251.
49. Schluter, D., Price, T. D. & Grant, P. R. (1985) Science 227, 1056–1059.
50. Endler, J. A. (1989) in Speciation and its Consequences, eds. Otte, D. & Endler, J. A. (Sinauer, Sunderland, MA), pp. 625–648.
51. Huxley, J. (1942) Evolution: The Modern Synthesis (Allen & Unwin, London).
52. Grant, P. R., Abbott, I., Schluter, D., Curry, R. L. & Abbott, L. K. (1985) Biol. J. Linn. Soc. 25, 1–39.
53. Ten Cate, C., Vos, D. R. & Mann, N. (1993) Neth. J. Zool. 43, 34–45.
54. Panov, E. (1989) Natural Hybridisation and Ethological Isolation in Birds (Nauka, Moscow).
55. Amadon, D. (1950) Bull. Am. Mus. Nat. Hist. 95, 151–262.
56. Mayr, E. (1972) Biol. J. Linn. Soc. 1, 1–17.
57. West-Eberhard, M. J. (1983) Quart. Rev. Biol. 58, 155–183.
58. Barrowclough, T. G., Harvey, P. H. & Nee, S. (1995) Proc. R. Soc. London Ser. B 259, 211–215.
59. Baker, M. C. & Baker, A. E. M. (1990) Evolution 44, 332–338.
60. Wunderle, J. (1981) Evolution 35, 333–344.
61. Berthold, P. (1997) Naturwissenschaften 83, 568–570.
62. Buckley, P. A. (1987) in Avian Genetics: A Population and Ecological Approach, eds. Cooke, F. & Buckley, P. A. (Academic, London), pp. 1–44.
63. Danforth, C. H. (1950) Evolution 4, 301–315.
64. Sharpe, R. S. & Johnsgard, P. A. (1966) Behavior 27, 259–272.
65. Johnsgard, P. A. (1973) Grouse and Quails of North America (Univ. Nebraska Press, Lincoln).
66. Scherer, S. & Hilsberg, T. (1982) J. f. Orn. 123, 357–380.
67. Hinde, R. A. (1956) Behavior 9, 202–213.
68. Lorenz, K. (1958) Sci. Amer. 199, 67–78.
69. Buckley, P. A. (1969) Z. Tierpsychol. 26, 737–747.
70. Davies, S. J. J. F. (1970) Behavior 36, 187–214.
71. Price, T. D., Schluter, D. & Heckman, N. E. (1993) Biol. J. Linn. Soc. 48, 187–211.
72. Schluter, D. & Price, T. (1993) Proc. R. Soc. London Ser. B 253, 117–122.
73. 74. 75. 76. 77. 78. 79. 80. 81. 82. 83. 84. 85. 86. 87. 88. 89. 90. 91. 92. 93. 94. 95. 96. 97. 98. 99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111. 112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125. 126. 127. 128. 129.
7775
Avise, J. C. (1983) in Perspectives in Ornithology, eds. Brush, A. H. & Clark, G. A., Jr. (Cambridge Univ. Press, Cambridge), pp. 262–270. Barrowclough, G. F. (1983) in Perspectives in Ornithology, eds. Brush, A. H. & Clark, G. A., Jr. (Cambridge Univ. Press, Cambridge), pp. 223–261. Johnson, N. K. & Zink, R. M. (1983) Auk 100, 871–884. Corbin, K. W. (1987) in Avian Genetics: A Population and Ecological Approach, eds. Cooke, F. & Buckley, P. A. (Academic, London), pp. 321–353. Prager, E. M. & Wilson, A. C. (1980) in Proc. XVII International Ornithological Congress, Berlin (Verlag der Deutsche Ornithologen-Gesellschaft), pp. 1209–1214. Snell, R. R. (1991) Auk 108, 319–328. Shields, G. F. (1987) in Avian Genetics: A Population and Ecological Approach, eds. Cooke, F. & Buckley P. A. (Academic, London), pp. 79–104. Sandnes, G. C. (1954) Evolution 8, 359–364. Christidis, L. (1986) Can. J. Genet. Cytol. 28, 762–769. Christidis, L. (1990) in Chromosomes Today, eds. Fredga, K., Kihlman, B. A. & Bennett, M. D. (Unwin Hyman, London), Vol. 10, pp. 279–294. Mayr, E. (1963) Animal Species and Evolution (Belknap, Harvard, MA). Anderson, R. V. V. (1977) Am. Nat. 111, 939–949. Johnsgard, P. A. (1983) The Hummingbirds of North America (Smithsonian Inst. Press, Washington). Christidis, L. & Schodde, R. (1993) Bull. Brit. Orn. Club 113, 169–172. Nagle, J. J. & Mettler, L. E. (1969) Evolution 23, 519–524. Klaus, S., Andreev, A. V., Bergmann, H.-H., Mu ¨ller, F., Porkert, J. & Wiesner, J. (1989) Die Auerhu ¨hner: Tetrao urogallus und T. urogalloides (A. Ziemsen Verlag, Wittenberg Lutherstadt). Avise, J. C. & Ball, R. M. (1990) Oxf. Surv. Evol. Biol. 7, 45–67. Gill, F. B., Mostrom, A. M. & Mack, A. L. (1993) Evolution 47, 195–212. Degnan, S. (1993) Mol. Ecol. 2, 219–225. Meise, W. (1975) Abh. Verh. Naturwiss. Ver. Hamburg 18y19, 187–254. Banks, R. C. & Johnson, N. K. (1961) Condor 63, 3–28. Prager, E. M. & Wilson, A. C. (1975) Proc. Natl. Acad. Sci. USA 72, 200–204. Freeman, S. 
& Zink, R. M. (1995) Syst. Biol. 44, 409–420. Corbin, K. W., Sibley, C. G. & Ferguson, A. (1979) Evolution 33, 624–633. Anderson, E. & Stebbins, G. L., Jr. (1954) Evolution 8, 378–388. Miller, A. H. (1956) Evolution 10, 262–277. Parsons, T. J., Olson, S. L. & Braun, M. J. (1993) Science 260, 1643–1646. Sva¨rdson, G. (1970) in Biology of Coregonid Fishes, eds. Lindsey, C. C. & Woods, C. S. (Univ. Manitoba Press, Winnipeg), pp. 33–59. Woodruff, D. S. (1989) Biol. J. Linn. Soc. 36, 281–294. Chapin, J. P. (1948) Evolution 2, 111–126. Chapin, J. P. (1963) Ibis 105, 198–202. Harris, M. P. (1970) Ibis 112, 488–498. Harris, M. P., Morley, C. & Green, G. H. (1978) Bird Study 25, 161–166. Cracraft, J. (1982) Amer. Zool. 22, 411–424. Cracraft, J. & Prum, R. O. (1988) Evolution 42, 603–620. Chesser, R. T. & Zink, R. M. (1994) Evolution 48, 490–497. Zink, R. M. & Dittmann D. L. (1993) Evolution 47, 717–729. Haffer, J. (1982) in Biological Diversification in the Tropics, ed. Prance, G. T. (Columbia Univ. Press, New York), pp. 6–24. Brumfield, R. T. & Capparella, A. P. (1996) Evolution 50, 1607–1624. Hall, B. P. & Moreau, R. E. (1970) An Atlas of Speciation in African Passerine Birds (British Museum, Natural History, London). deMenocal, P. (1995) Science 270, 53–59. Keast, A. (1961) Bull. Mus. Comp. Zool. Harvard 123, 303–495. Jablonski, D. & Sepkoski, J. J., Jr. (1996) Ecology 77, 1367–1378. Mayr, E. (1954) in Evolution as a Process, eds. Huxley, J., Hardy, A. C. & Ford, E. B. (Allen & Unwin, London), pp. 157–180. Mayr, E. (1992) Oxford Surv. Evol. Biol. 8, 1–34. Carson, H. (1975) Am. Nat. 109, 83–92. Templeton, A. R. (1980) Genetics 91, 1011–1038. Slatkin, M. (1996) Am. Nat. 147, 493–505. Hollocher, H. (1996) Philos. Trans. R. Soc. London Ser. B 351, 735–743. Carson, H. L. (1978) in Ecological Genetics: The Interface, ed. Brussard, P. F. (Springer, New York), pp. 93–107. Tegelstro ¨m, H. & Gelter, H. P. (1990) Evolution 44, 2012–2021. Gelter, H. P., Tegelstro ¨m, H. 
& Gustafsson, L. (1992) Ibis 134, 62–68. Dilger, W. C. (1962) Sci. Am. 206, 88–98. Coyne, J. A. & Orr, H. A. (1989) in Speciation and Its Consequences, eds. Otte, D. & Endler, J. A. (Sinauer, Sunderland, MA), pp. 180–207. Wu, C.-I. & Davis, A. W. (1993) Am. Nat. 145, 187–212. Schutz, V. F. (1965) Z. Tierpsychol. 22, 50–103. Sibley, C. G. & Ahlquist, J. E. (1990) Phylogeny and Classification of Birds: A Study in Molecular Evolution (Yale Univ. Press, New Haven).
Proc. Natl. Acad. Sci. USA Vol. 94, pp. 7776–7783, July 1997 Colloquium Paper
This paper was presented at a colloquium entitled ‘‘Genetics and the Origin of Species,’’ organized by Francisco J. Ayala (Co-chair) and Walter M. Fitch (Co-chair), held January 30–February 1, 1997, at the National Academy of Sciences Beckman Center in Irvine, CA.
Vagaries of the molecular clock (molecular evolution/rates of evolution/glycerol-3-phosphate dehydrogenase/superoxide dismutase)
FRANCISCO J. AYALA Department of Ecology and Evolutionary Biology, University of California, Irvine, CA 92697-2525
ABSTRACT The hypothesis of the molecular evolutionary clock asserts that informational macromolecules (i.e., proteins and nucleic acids) evolve at rates that are constant through time and for different lineages. The clock hypothesis has been extremely powerful for determining evolutionary events of the remote past for which the fossil and other evidence is lacking or insufficient. I review the evolution of two genes, Gpdh and Sod. In fruit flies, the encoded glycerol-3-phosphate dehydrogenase (GPDH) protein evolves at a rate of 1.1 × 10⁻¹⁰ amino acid replacements per site per year when Drosophila species are compared that diverged within the last 55 million years (My), but a much faster rate of ≈4.5 × 10⁻¹⁰ replacements per site per year when comparisons are made between mammals (≈70 My) or Dipteran families (≈100 My), animal phyla (≈650 My), or multicellular kingdoms (≈1,100 My). The rate of superoxide dismutase (SOD) evolution is very fast between Drosophila species (16.2 × 10⁻¹⁰ replacements per site per year) and remains the same between mammals (17.2) or Dipteran families (15.9), but it becomes much slower between animal phyla (5.3) and still slower between the three kingdoms (3.3). If we assume a molecular clock and use the Drosophila rate for estimating the divergence of remote organisms, GPDH yields estimates of 2,500 My for the divergence between the animal phyla (which occurred ≈650 My ago) and 3,990 My for the divergence of the kingdoms (which occurred ≈1,100 My ago). At the other extreme, SOD yields divergence times of 211 My and 224 My for the animal phyla and the kingdoms, respectively. It remains unsettled how often proteins evolve in such erratic fashion as GPDH and SOD.
The hypothesis of the molecular clock of evolution was put forward in the 1960s (1–3) and has prompted much research and yielded important results ever since. The neutrality theory of molecular evolution (4–7) provided a mathematical formulation of brilliant simplicity, which made the clock hypothesis amenable to precise empirical testing. Tests have shown that molecular clocks are more erratic than allowed by the neutrality theory, but the issue of how dependable the clocks are is far from settled. In this paper, I review the evolution of two genes that exhibit peculiar patterns. Gpdh behaves erratically in fruit flies over the last 100 million years (My). It evolves extremely slowly in Drosophila, faster in the related genus Chymomyza, and faster still in the medfly Ceratitis; the rates differ by a factor of 4 or more. The second gene, Sod, evolves fairly regularly in fruit flies, although much faster than Gpdh. When comparisons are made between organisms that diverged over very broad time ranges, extending to 1,000 My ago, the behavior of the two genes becomes reversed: Gpdh evolves fairly regularly, whereas the rate of Sod evolution becomes increasingly slower as comparisons are made between mammal orders, between animal phyla, and between multicellular kingdoms.
Evolution of Glycerol-3-Phosphate Dehydrogenase (GPDH) in Diptera
The nicotinamide–adenine dinucleotide (NAD)-dependent cytoplasmic GPDH (EC 1.1.1.8) plays a crucial role in insect flight metabolism because of its keystone position in the glycerophosphate cycle, which provides energy for flight in the thoracic muscles of Drosophila (8). In Drosophila melanogaster the Gpdh gene is located on chromosome 2 (9) and consists of eight coding exons (10, 11). It produces three isozymes by differential splicing of the last three exons (12). The enzyme is known to be evolutionarily conserved (10), displaying very low heterozygosity within or variation among Drosophila species (13). The GPDH polypeptide can be divided into two main domains, the NAD-binding domain and the catalytic domain. The NAD-binding domain (which in the rabbit is encompassed by the first 118 amino acids) is more highly conserved than the catalytic domain (10). Here I will analyze in 9 fruit fly species a Gpdh gene region comprising most of the coding sequence of exons 3–6 (768 bp of 831 bp), corresponding to 45 codons of the NAD-binding domain plus the whole catalytic domain. The methods of DNA preparation, amplification, cloning, sequencing, and sequence analysis are described in refs. 14 and 15, where the sources of the fruit fly stocks also are given. The present analysis focuses on 9 species, although a total of 27 fruit fly species have been analyzed (14, 15). The taxonomy of the nine species is displayed in Table 1. The medfly, Ceratitis capitata, belongs to the family Tephritidae. The family Drosophilidae is represented by three genera; and the Drosophila genus is, in turn, represented by five subgenera. The phylogeny of the species (represented by their subgenus or genus names) is shown in Fig. 1. This phylogeny is statistically superior to any alternative configuration (14). In any case, the only issues that arise concern (i) whether Chymomyza is the sister group to Drosophila, with Scaptodrosophila being their outgroup (as in Fig. 1) or whether Scaptodrosophila is the sister group to Drosophila, with Chymomyza being their outgroup; and (ii) the branching order of the Drosophila subgenera (14). Neither of these two issues is of substantial import for the present purposes. The amino acid distances between the fruit fly species are given in Table 2 (above the diagonal). Table 3 gives averages (x̄) of these distances, as well as the rate, expressed in units of 10⁻¹⁰ amino acid replacements per site per year. It is apparent in Tables 2 and 3 that the rate of GPDH evolution is not constant. The apparent rate of amino acid replacement is 1.1 when species of different subgenera are compared (and also when these species are compared with Scaptodrosophila), but
Abbreviations: GPDH, glycerol-3-phosphate dehydrogenase; My, million years; PAM, accepted point mutations; SOD, superoxide dismutase.
© 1997 by The National Academy of Sciences 0027-8424/97/947776-8$2.00/0 PNAS is available online at http://www.pnas.org.
Table 1. Taxonomy of nine fruit fly species
Family          Genus               Subgenus          Species
Tephritidae     Ceratitis           -                 capitata
Drosophilidae   Scaptodrosophila*   -                 lebanonensis
Drosophilidae   Chymomyza           -                 amoena
Drosophilidae   Chymomyza           -                 procnemis
Drosophilidae   Drosophila          Sophophora        melanogaster
Drosophilidae   Drosophila          Zaprionus†        tuberculatus
Drosophilidae   Drosophila          Drosophila        virilis
Drosophilidae   Drosophila          Hirtodrosophila   pictiventris
Drosophilidae   Drosophila          Dorsilopha        busckii
*Classified as a subgenus of Drosophila in Wheeler (16); raised to genus category in the revision of Grimaldi (17). †Classified as a separate genus in Wheeler (16), but more closely related to other Drosophila subgenera than the subgenus Sophophora (see Fig. 1).
2.7 when the Drosophila (or Scaptodrosophila) species are compared with Chymomyza, and 4.7 when the medfly Ceratitis is compared with any other species. These rate differences are displayed in Fig. 2 (Left). The different rates of evolution displayed in Table 3 (and Fig. 2 Left) conceal even more disparate actual rates, which become manifest when we take into account that the rates in Table 3 apply to largely overlapping lineages. Consider in Fig. 1 the evolution from node 3 to a Chymomyza species and a Drosophila species (say, D. melanogaster, subgenus Sophophora). The average rate of amino acid evolution is 2.7 × 10⁻¹⁰ replacements per site per year (Table 3, line 2) over the 120 My of evolution separating these two species (60 My from node 3 to the Chymomyza species and 60 My from node 3 to the Drosophila species). But during 100 of the 120 My (55 My from node 4 to the Drosophila species and 45 My from node 5 to the Chymomyza species), the rate of evolution is 1.1 × 10⁻¹⁰ (Table 3, line 1). Thus, the increase in rate of evolution could have occurred only over the 5 My between nodes 3 and 4 plus the 15 My between nodes 3 and 5. But the rate of evolution between Scaptodrosophila and Drosophila is 1.1, the same as between Drosophila species; the 5 My between nodes 3 and 4 are included in this comparison. It follows that the increased rate of evolution noted between Chymomyza and Drosophila could have occurred only during the 15 My elapsed between nodes 3 and 5 (thick line) in Fig. 1. To account for the rate of 2.7 observed between Chymomyza and Drosophila, the rate of evolution of the Chymomyza lineage during those 15 My must have been 14.2 × 10⁻¹⁰ per site per year, or 13 times greater than the 1.1 rate prevailing in the evolution of all other Drosophilid lineages. (The maximum parsimony estimate is that 5.3 amino acid replacements occurred between nodes 3 and 5; at the Drosophila rate, only 0.38 replacement is expected.)
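The back-calculation in the preceding paragraph can be sketched as a time-weighted average. The snippet below is only an illustration: it assumes that the 2.7 average applies to all 120 My and that the background rate of 1.1 holds everywhere except the 15 My between nodes 3 and 5, and it uses the rounded values quoted in the text. The function and variable names are hypothetical.

```python
# Back-of-envelope check of the rate required between nodes 3 and 5.
# Rates are in units of 1e-10 replacements per site per year; times in My.
# All inputs are the rounded figures quoted in the text.

AVERAGE_RATE = 2.7      # Drosophila-Chymomyza average (Table 3, line 2)
BACKGROUND_RATE = 1.1   # rate in the other Drosophilid lineages (Table 3, line 1)
TOTAL_MY = 120.0        # 60 My along each lineage descending from node 3
ACCELERATED_MY = 15.0   # node 3 to node 5


def accelerated_rate(average, background, total_my, accelerated_my):
    """Rate on the accelerated segment implied by a time-weighted average."""
    background_my = total_my - accelerated_my
    return (average * total_my - background * background_my) / accelerated_my


rate = accelerated_rate(AVERAGE_RATE, BACKGROUND_RATE, TOTAL_MY, ACCELERATED_MY)
fold = rate / BACKGROUND_RATE
print(f"required rate: about {rate:.1f} x 1e-10 per site per year ({fold:.1f}-fold)")
```

With these rounded inputs the sketch gives about 13.9, close to the 14.2 stated in the text; the small gap presumably reflects rounding of the tabulated values.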
Similarly, the rate of GPDH divergence between the medfly Ceratitis and Drosophila is about 4.3 times greater (Table 3, line 3, and Fig. 2 Left) than between Drosophila species. But the rate of evolution could have been accelerated for only a fraction of the 200 My elapsed, namely, between nodes 1 and 2 and between node 1 and Ceratitis; the average rate of GPDH evolution during those intervals must have been more than 6 times faster than between Drosophila species. The evolution of GPDH in dipterans is not clocklike at all. Fig. 3 Left displays the nucleotide distances between the Gpdh sequences for the same set of species as in Fig. 2 Left. It appears that, at the nucleotide level, Gpdh is evolving with the regularity expected from a molecular clock. The three rates shown in Fig. 3 Left are 14.2, 16.9, and 13.8 × 10⁻¹⁰ nucleotide substitutions per site per year, fairly similar to one another and surely within the sampling variation that one would expect from a stochastic molecular clock. (This observation, by the way, indicates that the discrepancies detected at the amino acid level are not caused by errors in the branching sequence of the taxa or in the assumed times of divergence.)
FIG. 1. Phylogeny of the nine species listed in Table 1 (14). s.g., subgenus. The thicker branch between nodes 3 and 5 indicates an inferred acceleration in the evolution of GPDH. The time scale is based on data from refs. 14 and 18–20.
The apparent regularity of nucleotide evolution seen in Fig. 3 is, however, made up of two components, only one of which is clocklike. Fig. 4 Left shows a plot of Ka, the rate of nonsynonymous substitutions (i.e., those that result in amino acid replacements) against Ks, the rate of synonymous substitutions. The two rates are closely correlated for the Drosophila species (open circles), but not for the comparisons between Drosophila and Chymomyza (gray circles) or between Drosophila and Ceratitis (black circles). The synonymous rate, Ks, is about 10 times faster than the nonsynonymous rate, Ka, in Drosophila, but only about 5 times faster for the comparisons with Chymomyza and only 3.2 times faster for the comparisons with Ceratitis. Given that the rate of evolution is several times faster for synonymous than for nonsynonymous sites, most of the nucleotide substitutions occur in synonymous sites, which thus overwhelmingly dominate the overall rates of nucleotide evolution, so that these seem fairly constant (K2, Fig. 3), even though the nonsynonymous rates are not homogeneous.
Evolution of SOD in Diptera
The superoxide dismutases (EC 1.15.1.1) are abundant enzymes in aerobic organisms, with highly specific superoxide dismutation activity that protects the cell against the harmfulness of free oxygen radicals (24). These enzymes have active centers that contain either iron or manganese, or both copper and zinc (24). The Cu,Zn superoxide dismutase (SOD) is a well studied protein, found in eukaryotes but also in some bacteria (25). The amino acid sequence is known in many organisms: plants, animals, fungi, and bacteria (19, 20, 26, 27). The Cu,Zn SOD of D. melanogaster is a dimer molecule consisting of two identical polypeptide subunits associated with two Cu²⁺ and two Zn²⁺ per molecule. Each subunit has a molecular weight of 15,750 and consists of 151 amino acids, the same as in other fruit fly species.
The Drosophila Sod gene consists of two exons, separated by an intron 300–700 bp in length, located between codons 22 and 23. In the three other fruit fly genera studied in this paper (namely, Chymomyza, Scaptodrosophila, and Ceratitis) there is an additional short intron (<100 bp) between codons 95 and 96 (18, 28, 29). We have sequenced the Sod gene in 27 fruit fly species, the same set (with two inconsequential differences) sequenced for Gpdh (15, 18, 28, 29). In the present paper, I will analyze the results for the 9 species listed in Table 1 (the data for 18 additional Drosophila species will also be used in some cases). Table 2 (below the diagonal) gives the PAM distance (amino acid replacements) between the SODs of the 9 fruit fly species. Table 4 gives averages (x̄) of these differences, as well as the rate of amino acid evolution, expressed in units of 10⁻¹⁰ replacements per site per year. In contrast to the discrepancies
Table 2. Amino acid replacements between nine fruit fly species for GPDH (above diagonal) and superoxide dismutase (SOD) (below diagonal)
Species                               1      2      3      4      5      6      7      8      9
1. Drosophila melanogaster            -      1.2    0.8    0.0    1.2    3.8    3.4    1.7    9.2
2. Drosophila virilis                 16.6   -      2.1    1.2    1.7    3.4    2.9    1.7    9.2
3. Drosophila busckii                 15.8   10.0   -      0.8    1.2    3.4    2.9    2.1    8.7
4. Drosophila pictiventris            21.3   17.2   14.2   -      1.2    3.8    3.4    1.7    9.2
5. Zaprionus tuberculatus             21.2   13.0   11.1   17.8   -      2.5    2.1    2.1    8.7
6. Chymomyza amoena                   21.7   20.8   21.8   24.3   23.1   -      0.8    3.8    8.7
7. Chymomyza procnemis                21.4   20.8   21.9   26.8   21.0   6.2    -      2.9    8.3
8. Scaptodrosophila lebanonensis      20.3   20.5   20.6   23.3   25.7   23.0   22.9   -      9.2
9. Ceratitis capitata                 30.0   30.9   28.3   29.7   26.5   28.1   31.9   26.5   -
The numbers given are accepted point mutations (PAM) values (21) per hundred sites. The number of sites compared is 256 for GPDH and 151 for SOD.
obtaining in GPDH evolution, it is apparent that SOD in fruit flies evolves fairly uniformly, at a rate of ≈16–18 × 10⁻¹⁰ replacements per site per year. The three rates of amino acid replacement (obtained as for GPDH by comparisons, respectively, between Drosophila subgenera, between these and Chymomyza, and between Ceratitis and all others) are 16.2, 17.8, and 15.9, similar enough to be acceptable as sample variations from the same stochastic clock. The contrast between the erratic behavior of GPDH and the regularity of SOD is apparent in Fig. 2. The species compared are the same in Left and Right (which, again, excludes the hypothesis that the discrepancies in GPDH rates of evolution are due to erroneous estimates of divergence times). The Sod rate of nucleotide evolution (K2) in the fruit flies is displayed in Fig. 3 Right. The average rates for the three sets of comparisons are given in Table 4 and are also displayed in the figure. The Sod K2 rate is about 40% greater between the Drosophilid species than between them and Ceratitis. This difference is statistically significant if we use the standard errors of the mean differences, but it is difficult to say whether this may be attributed to stochastic variation or represents rather an acceleration in the Drosophilid lineages. Fig. 4 Right shows the correlation between synonymous and nonsynonymous nucleotide substitutions in Sod. Although there is considerable dispersion, we do not see the discontinuity manifest for Gpdh (Fig. 4 Left).
Global Rates of GPDH and SOD Evolution: Reversed Patterns
Table 5 displays the taxonomy of nine species encompassing three mammalian orders (human, mouse, and rabbit), three metazoan phyla (arthropods, chordates, and nematodes), and three multicellular kingdoms (animals, plants, and fungi). Table 6 gives the amino acid differences between species for GPDH (above the diagonal) and SOD (below the diagonal).
Table 7 gives the average number of amino acid replacements (x̄) and the rate of amino acid evolution between three levels of evolutionary divergence. For GPDH, the evolutionary rates for the three levels are 5.3, 4.2, and 4.0 × 10⁻¹⁰ amino acid replacements per site per year. These rates are not grossly different from one another, nor from the rate of 4.7 observed above when comparing the two fruit fly families Drosophilidae and Tephritidae. These are rates comparable to those of some intracellular enzymes such as triosephosphate isomerase or lactate dehydrogenase H (5.3 × 10⁻¹⁰) and somewhat slower than the extensively studied cytochrome c (6.7 × 10⁻¹⁰; ref. 30). Over the time scale of 70 to 1,100 My covered by these comparisons, the evolution of GPDH behaves in a clocklike fashion. Yet, as noted above, the rate of GPDH evolution observed in Drosophila is less than one fourth as fast, lower even than the rate of histones H2A and H2B (1.7 × 10⁻¹⁰; ref. 30). These observations are graphically displayed in Fig. 5, where the slow Drosophila rate is set against the much faster long-term rates observed between the animal phyla and between the three kingdoms. The SOD rates of amino acid evolution for the three levels of evolutionary divergence are given in Table 7, where they can readily be compared with the GPDH rates. The discrepancy between the two enzymes is baffling: the three rates are fairly homogeneous for GPDH, but quite heterogeneous for SOD. Between kingdoms the SOD rate (3.3 × 10⁻¹⁰) is smaller than for GPDH, whereas between mammals the SOD rate is much higher (17.2 × 10⁻¹⁰, the same rate as myoglobin or the digestive enzyme trypsinogen; ref. 30) (see also Fig. 6). The contrasting patterns of SOD and GPDH evolution cannot be attributed to distinctive characteristics of particular taxa, since the same set of species are compared for each enzyme (with the inconsequential exception that the plant species is Cuphea lanceolata for GPDH, but a different angiosperm, Ipomoea batatas, for SOD). For the same reason, the rate fluctuations cannot be attributed to telescoping consequences of the different time intervals.
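The clock-based date estimates quoted in the abstract can be approximately recovered from the rates discussed here. The sketch below assumes the simplest possible clock: the apparent age of a split is the observed long-term rate times the accepted age, divided by the Drosophila rate. The dictionary layout and function name are illustrative, and the inputs are the rounded rates from the text, so the results match the published estimates only approximately.

```python
# Apparent divergence times obtained by (wrongly) assuming that the
# Drosophila rate applies to remote comparisons.
# Rates: units of 1e-10 replacements per site per year. Ages: My.

DROSOPHILA_RATE = {"GPDH": 1.1, "SOD": 16.2}
ACCEPTED_AGE_MY = {"animal phyla": 650, "kingdoms": 1100}
OBSERVED_RATE = {
    "GPDH": {"animal phyla": 4.2, "kingdoms": 4.0},
    "SOD": {"animal phyla": 5.3, "kingdoms": 3.3},
}


def apparent_age(enzyme, level):
    """Age (My) implied by the observed divergence under the Drosophila rate."""
    observed = OBSERVED_RATE[enzyme][level]
    return observed * ACCEPTED_AGE_MY[level] / DROSOPHILA_RATE[enzyme]


for enzyme in ("GPDH", "SOD"):
    for level in ("animal phyla", "kingdoms"):
        print(f"{enzyme:4s} {level:12s} {apparent_age(enzyme, level):7.0f} My")
```

GPDH overestimates the dates by a factor of about four, whereas SOD underestimates them by a factor of three to five, which is the reversal described in the text.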
The SOD rate is much faster for recently diverged taxa than for those that diverged long ago (16.2 × 10⁻¹⁰ for Drosophila versus 3.3 × 10⁻¹⁰ between kingdoms), but for GPDH it is exactly the other way around (1.1 × 10⁻¹⁰ for Drosophila versus 4.0 × 10⁻¹⁰ between kingdoms). Moreover, the mammalian rate is similar to the Drosophila rate in SOD, but similar to the long-term rate in
Table 3. Gpdh evolution in fruit flies
                                          Amino acid replacements    Nucleotide substitutions
Comparison                    My          x̄            Rate          x̄             Rate
1. Drosophila subgenera       55 ± 10     1.2 ± 0.0    1.1           15.8 ± 0.2    14.2
2. Drosophila–Chymomyza       60 ± 10     3.0 ± 0.1    2.7           20.3 ± 0.2    16.9
3. Drosophilidae–Ceratitis    100 ± 20    9.4 ± 0.1    4.7           27.7 ± 0.2    13.8
The Drosophila species compared include the 5 listed in Table 1 plus 18 additional species (see refs. 14 and 15). The genus Scaptodrosophila is included in the comparisons made in row 1. Replacements are corrected according to ref. 21; nucleotide substitutions are estimated according to ref. 22. The ± values are crude estimates of error for My, but are standard deviations for replacements and substitutions. x̄ values are per 100 residues for differences between species; the rates are lineage values and are expressed in units of 10⁻¹⁰ per site per year.
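The relation between the x̄ and Rate columns can be sketched as follows, assuming that a lineage rate is the pairwise distance per site divided by twice the divergence time (two lineages have evolved independently since the split). The function name is hypothetical and the inputs are the rounded table entries.

```python
def lineage_rate(distance_per_100_sites, divergence_my):
    """Per-lineage rate in units of 1e-10 changes per site per year."""
    per_site = distance_per_100_sites / 100.0
    # Two lineages have accumulated changes since the common ancestor.
    years_of_evolution = 2.0 * divergence_my * 1e6
    return per_site / years_of_evolution / 1e-10


# (label, divergence in My, amino acid distance, nucleotide distance)
rows = [
    ("Drosophila subgenera", 55, 1.2, 15.8),
    ("Drosophila-Chymomyza", 60, 3.0, 20.3),
    ("Drosophilidae-Ceratitis", 100, 9.4, 27.7),
]
for label, my, amino_acid, nucleotide in rows:
    print(f"{label:24s} {lineage_rate(amino_acid, my):5.1f} "
          f"{lineage_rate(nucleotide, my):5.1f}")
```

This reproduces most of the tabulated rates to within rounding. The amino acid rate for row 2 comes out near 2.5 rather than 2.7, presumably because the published rates were computed from unrounded distances; that explanation is an assumption, not something stated in the table.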
FIG. 2. Rate of amino acid replacement in fruit flies for the enzymes GPDH (Left) and SOD (Right). Open circles represent comparisons between species of Drosophila with each other (or with Scaptodrosophila); gray circles, between species of Chymomyza and Drosophila; solid circles, between Ceratitis and all other species; bars represent standard errors. It is apparent that a single straight line cannot provide a reasonable fit for all GPDH points; the rates shown on the right are amino acid replacements in units of 10⁻¹⁰ per site per year. The SOD rates are 16.2 for the Drosophila subgenera, 17.8 for the comparisons with Chymomyza, and 15.9 for the comparisons with Ceratitis; the two Chymomyza points are slightly displaced right and upwards for clarity. The species of Drosophila include 18 in addition to those listed in Table 1.
GPDH. Finally, the generation-time hypothesis (31–33) cannot account for the erratic patterns of GPDH and SOD. This hypothesis assumes that rates of evolution will be increased in organisms with shorter generation times. But the set of species analyzed is, as noted, the same for GPDH and for SOD.
Fixing the Clocks
The analyses of Gpdh and Sod evolution have been made after correcting the observed differences between species for multiple hits. We have used other algorithms, in addition to those mentioned above (PAM, K2, Ka, Ks) for correcting for overlapping and back-substitutions. These various algorithms yield the same results, except for trivial numerical differences (14, 15, 18, 28, 29). One possible way to account for the apparent erratic behavior of the enzymes might be the covarion (concomitantly variable codons) hypothesis (34), which asserts that there is a limited number of amino acid sites that can be replaced at any time in any given lineage. This number of invariable sites remains constant through time and lineages, but the composition of the set of invariable sites changes through time and between lineages. The application of this
assumption to a particular protein requires that one determines (i) the size of the covarion set—i.e., the number of sites at which replacements can occur at any given time in a given lineage; (ii) the total number of sites that are invariable—i.e., the number of sites, if any, at which amino acid replacements can never occur in any lineage; (iii) the number of different amino acids that can occur at each variable site, which may range from all 20 amino acids to only two (for example, at a site that must have a negative charge, only aspartate and glutamate can occur); (iv) the persistence of the covarion set—i.e., the rate at which one site in the set becomes replaced by another site; and (v) the rate of amino acid replacement. Fitch and Ayala (19) have analyzed 67 SOD sequences of very diverse organisms from all three multicellular kingdoms to estimate the parameter values corresponding to the five variables just mentioned. The parameter values that maximize the fit between the observed and expected number of amino acid replacements are as follows: (i) the number of covarions is 28; (ii) the number of codons that are permanently invariable across animals, plants, and fungi is 44; (iii) the average number of amino acids that can occur at a variable site is 2.5; (iv) the
FIG. 3. Rate of nucleotide substitution (K2) in fruit flies for the genes Gpdh (Left) and Sod (Right). Symbols and other conventions are as in Fig. 2. The K2 units are nucleotide substitutions × 10⁻¹⁰ per site per year, estimated by Kimura’s (22) two-parameter method; the rates on the right are in the same units, although their intercepts have been drawn at 50 My for clarity. The three rates are fairly similar, so that all points could have been subsumed within the same regression line, except perhaps for the Sod Ceratitis comparisons.
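For readers unfamiliar with the two-parameter correction named in the caption, the standard Kimura formula is d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the observed proportions of transitions and transversions. The sketch below applies it to a pair of aligned sequences; it is a generic illustration with a hypothetical function name, not the software used for the analyses in this paper.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError if the sequences are saturated
    # (1 - 2p - q <= 0 or 1 - 2q <= 0), where the distance is undefined.
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))  # one transition in ten sites
```

Dividing such a distance by twice the divergence time gives a per-lineage rate in the units used in the figure.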
FIG. 4. Rate of nonsynonymous (Ka) versus synonymous (Ks) substitutions in Gpdh (Left) and Sod (Right), estimated according to Li (23). Symbols and other conventions are as in Figs. 2 and 3, except that the Ka scale is in units of substitutions × 10⁻⁹ per site per year. In the case of Gpdh the correlation between nonsynonymous (replacement) and synonymous substitutions for comparisons with either Chymomyza or Ceratitis is not homogeneous with the correlation for the comparisons between Drosophila species; for Sod the correlation is more nearly homogeneous, although the dispersion is high.
persistence of the covarion set is 0.01 (that is, there is a 0.01 probability that one site of the covarion set will change whenever one amino acid replacement has occurred); and (v) the rate of amino acid replacement for the whole polypeptide is 4 × 10⁻¹⁰ per site per year. In addition, it is determined that the total number of codons is 162 (151 codons are found in most organisms, but 162 sites are necessary to account for the occurrence of insertions and deletions in various species). Table 8 gives the number of amino acid differences observed between increasingly remote taxa and the expected values obtained by means of computer simulations using the parameter values just given. The fit is reasonably good, although systematic deviations occur. For example, the observed values are consistently higher than the expected values for comparisons 2–4 (divergences 60–100 My ago), whereas they are consistently lower for comparisons 5–8 (divergences 125–400 My ago), which means that the covarion model is not sufficient to fully account for the decrease in rate of evolution observed as divergence time increases. The largest discrepancy occurs for the comparison angiosperm–gymnosperm, with an observed value of 29 ± 7 versus an expected value of 42 ± 5. That the SOD clock is far from perfect, no matter what assumptions are made, is also detected in Table 8 by noting, for example, that the average number of replacements between amphibians and mammals is 49, larger than between fish and tetrapods, which certainly diverged earlier. But these variations could be expected from sampling errors and because of the well founded observation that the rate variation of the clock is greater than expected from a Poisson distribution (32, 35–37).
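A toy simulation conveys why a covarion constraint makes observed differences level off as replacements accumulate. The sketch below uses the parameter values listed above (162 codons, 44 permanently invariable, 28 covarions, persistence 0.01), but everything else is an assumption made for illustration: each variable site is given 2 or 3 admissible residues at random (mean 2.5), both lineages start from the same covarion set, and replacements are counted per lineage rather than scaled to time. It is not the simulation procedure of ref. 19.

```python
import random

TOTAL_SITES = 162       # codons, including positions needed for indels
INVARIABLE_SITES = 44   # never replaced in any lineage
COVARION_SIZE = 28      # sites free to change at any one time
PERSISTENCE = 0.01      # chance the covarion set turns over per replacement


def simulate_differences(replacements_per_lineage, seed=0):
    """Observed differences between two lineages after independent evolution."""
    rng = random.Random(seed)
    n_variable = TOTAL_SITES - INVARIABLE_SITES
    # Assumption: each variable site tolerates 2 or 3 residues (mean 2.5).
    allowed = [rng.choice((2, 3)) for _ in range(n_variable)]
    start_set = rng.sample(range(n_variable), COVARION_SIZE)
    lineages = []
    for _lineage in range(2):
        states = [0] * n_variable          # 0 = ancestral residue
        covarions = list(start_set)
        for _step in range(replacements_per_lineage):
            site = rng.choice(covarions)
            # Pick a residue different from the current one.
            new_state = rng.randrange(allowed[site] - 1)
            if new_state >= states[site]:
                new_state += 1
            states[site] = new_state
            if rng.random() < PERSISTENCE:
                # Swap one covarion for a currently constrained site.
                outside = [s for s in range(n_variable) if s not in covarions]
                covarions[rng.randrange(COVARION_SIZE)] = rng.choice(outside)
        lineages.append(states)
    return sum(a != b for a, b in zip(lineages[0], lineages[1]))


for n in (5, 20, 80, 320):
    print(n, simulate_differences(n))
```

Because only 28 sites can change at any moment and each admits few alternatives, repeated and reversed replacements at the same sites keep the observed differences well below the number of replacements that actually occurred.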
One may, nevertheless, endorse the conclusion reached by Fitch and Ayala (19) that, even though the evolution of SOD appears at first quite erratic, when the constraints under which the clock operates have not been taken into account, it may actually be evolving at a fairly constant rate as postulated by the molecular clock hypothesis.
The assumption that there are covarions that constrain the rate of evolution is not helpful to account for the erratic evolution of GPDH. The covarion model can account for the apparent slowdown of evolutionary rates as time increases, but not for the rate acceleration witnessed in the case of GPDH. We can, nevertheless, gain some insight into the processes that account for the scarcity of replacements observed between Drosophila (including Scaptodrosophila) species by examining in detail the actual amino acid replacements and locating them along the branches of the phylogeny. There is evidence of homoplasy; i.e., it is apparent that parallel as well as reversed replacements have occurred in the polymorphic sites (14, 38). Among the 256 amino acid sites assayed in a set of 15 Drosophila species, we have observed that 2 or more replacements have occurred in only 5 of the sites (14). Parallel and reversed replacements are detectable in 4 of these 5 polymorphic sites. For example, at site 193, the replacement glutamic acid → aspartic acid occurred at the root of the Drosophila genus (after its divergence from Chymomyza), followed by two parallel reversals aspartic acid → glutamic acid at the root of the Drosophila subgenus and at the root of the willistoni group (within the Sophophora subgenus). Wells (38) has exposed three additional homoplasic sites in a GPDH segment not included in our study. Parallel and reversed replacements result in ‘‘observed’’ numbers of replacements between species much smaller than the numbers that have actually taken place. The biological inference drawn from these empirical observations is that there are strict functional constraints in the GPDH of these species so that (i) very few sites accept replacement, and (ii) only two, or very few different amino acids are accepted at those few sites that can change at all.
That such constraints exist is not surprising, given the keystone role of GPDH in providing the energy needed for flying in the thoracic muscles (9). The disproportionately large number of differences observed in the comparisons with Chymomyza or

Table 4. Sod evolution in fruit flies
                                         Amino acid replacements    Nucleotide substitutions
Comparison                   My          x̄             Rate         x̄             Rate
1. Drosophila subgenera      55 ± 10     17.8 ± 0.3    16.2         31.0 ± 0.4    28.1
2. Drosophila–Chymomyza      60 ± 10     21.4 ± 0.2    17.8         34.2 ± 0.4    28.5
3. Drosophilidae–Ceratitis   100 ± 20    31.8 ± 0.6    15.9         40.0 ± 0.4    20.0
Species compared and other conventions are the same as in Table 3.
differ from the rates that obtain during periods of slow evolution (2). Kimura’s neutrality theory of molecular evolution (4–7, 39) provided a mathematical formulation for the hypothesis of the molecular clock. New alleles arise in a species by mutation. If alternative alleles are neutral with respect to natural selection (i.e., do not modify the Darwinian fitness of their carriers), their frequencies will change only by accidental sampling errors from generation to generation, that is, by genetic drift. Rates of allelic substitution will thus be stochastically constant—they will occur with a constant probability for a given protein. That probability can be shown to be the mutation rate for neutral alleles (6). The neutrality theory of molecular evolution accepts that, for any gene, a large proportion of all possible mutants are harmful to their carriers; these mutants are eliminated or kept at very low frequency by natural selection. The theory assumes, however, that many functional mutants can occur at each locus that are adaptively equivalent to one another. These mutants are not subject to selection relative to one another because they do not affect the fitness of their carriers (nor do they modify their morphological, physiological, or behavioral properties). According to the neutrality theory, evolution at the molecular level consists for the most part of the gradual, random replacement of one allele by another that is functionally equivalent to the first. The theory assumes that favorable mutations occur, but are sufficiently rare that they have little effect on the overall evolutionary rate of nucleotide and amino acid substitutions. According to the neutrality theory, it is reasonable to assume that the neutral mutation rate of a protein will remain constant over evolutionary time. 
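The statement that the substitution probability equals the neutral mutation rate follows from a short, standard argument. The sketch below assumes a diploid population of constant size N and a neutral mutation rate μ per gene per generation; both symbols are introduced here for illustration and are not used elsewhere in the text. Each generation 2Nμ new neutral mutants arise, and each has a probability 1/(2N) of eventually drifting to fixation, so the rate of substitution k is:

```latex
k \;=\; \underbrace{2N\mu}_{\text{new neutral mutants per generation}}
  \;\times\;
  \underbrace{\frac{1}{2N}}_{\text{fixation probability of each mutant}}
  \;=\; \mu
```

Population size cancels, which is why the expected rate depends only on the neutral mutation rate and can be treated as constant for a given protein.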
This persistence of the neutral mutation rate would be the case whenever the protein function and, hence, its structural constraints would not be altered through time, even as the organisms in which the protein functions evolve and diverge. Any such protein would function as a molecular clock: the number of amino acid replacements would be expected to be directly proportional to the time since their divergence from a common ancestor. A propitious state of affairs is that there are many proteins that differ in their functional constraints. Thus, there are many molecular clocks that tick at different rates but are all timing the same evolutionary events. Some molecular clocks, such as fibrinopeptides, tick at a very rapid rate and are useful to investigate recently diverged organisms; other proteins, such as in the extreme some histones, evolve much more slowly and are appropriate for investigating ancient evolutionary events. One advantage of molecular clocks over the radioactive clocks used to time the age of rocks is this, namely that there are tens of thousands of proteins in any organism, all of which can potentially be used for ascertaining any particular event of interest. The number of radioactive isotopes is very small by comparison. The molecular clock postulated by the neutrality theory is not a metronomic clock, like timepieces in ordinary life that measure time exactly. The neutrality theory predicts, instead,
Table 5. Taxonomy of the species studied for long-term evolutionary rates
Kingdom    Phylum*        Order         Species
Plants     Angiosperms                  Cuphea lanceolata†
                                        Ipomoea batatas†
Fungi      Yeasts                       Schizosaccharomyces pombe
                                        Saccharomyces cerevisiae
Animals    Nematodes                    Caenorhabditis elegans
           Arthropods     Dipterans     Several (Table 1)
           Chordates      Lagomorphs    Rabbit
                          Rodents       Mouse
                          Primates      Human
*The taxonomy of plants and fungi uses the category ‘‘division’’ rather than phylum. Flowering plants belong to the division Anthophyta; the yeasts, to the division Ascomycota.
†Cuphea lanceolata is the species sequenced for Gpdh; for Sod, the species sequenced is Ipomoea batatas.
Ceratitis might reflect that constraints are more strict in the Drosophila and Scaptodrosophila lineages than in Chymomyza or in Ceratitis, so that many more replacements occur in the latter two lineages. It may also be the case that the constraints are equally intense in all lineages, but the optimal amino acid states changed early in the evolution of the Chymomyza and Ceratitis lineages, so that natural selection caused the rapid replacement of several amino acids. Be that as it may—either relaxation of the austerity of the constraints or selection-driven replacements—the outcomes are changes in net rates of evolution and erratic fluctuations of the clock.

Discussion

The hypothesis of the molecular clock of evolution emerged from the early observation that the number of amino acid differences in a given protein appeared to be proportional to the time elapsed since the divergence of the organisms compared (1–3). This proportionality was accounted for with the hypothesis that many amino acid (and nucleotide) substitutions may be of little or no functional consequence, and that most substitutions that occur in evolution will be of this kind rather than involve amino acid replacements strongly constrained by natural selection (2). This insightful hypothesis was encased with other conjectures that would become dominant themes in molecular evolution studies in the ensuing decades: (i) the rate of amino acid replacement of a particular protein may be directly proportional to the number of sites that can change without radical alteration of function; (ii) morphological evolution may be largely due to changes in gene regulation and not be reflected in the rate of evolution of polypeptide chains, and (iii) given that functionally significant amino acid replacements are rare relative to the number of inconsequential replacements, the rates of amino acid replacement during periods of rapid morphological evolution may not substantially
Table 6. Amino acid replacements between nine species of animals, plants, and yeasts for GPDH (above diagonal) and SOD (below diagonal)
Species                1       2       3       4       5       6       7       8       9
1. D. melanogaster     —       9.2    56.2    51.3    58.8    52.6    78.2    89.2    92.3
2. Ceratitis          30.0     —      52.9    49.7    55.4    51.8    71.9    84.4    92.3
3. Human              61.0    64.8     —       4.3     8.8    55.4    80.4    84.4    89.1
4. Mouse              64.8    64.8    24.3     —       9.2    56.2    79.3    84.4    90.4
5. Rabbit             72.0    69.8    23.3    24.5     —      62.3    82.7    90.5    95.5
6. Nematode           79.9    75.3    72.1    64.0    71.0     —      81.0    93.6   103.7
7. Plant              56.2    61.8    78.7    76.4    74.2    77.5     —      83.8    91.6
8. Sch. pombe         79.9    68.9    79.9    84.6    83.4    93.6    99.2     —      91.5
9. Sac. cerevisiae    82.2    84.7    77.6    77.6    81.0    74.2    85.8    62.3     —
The numbers given are PAM values (21) per hundred sites. The taxonomy of the nine species is given in Tables 1 and 5. The plant species is Cuphea lanceolata for GPDH but Ipomoea batatas for SOD. The amino acid alignments are given in ref. 20.
Table 7. Long-term evolutionary rates of amino acid replacements in GPDH and SOD
                                           GPDH                    SOD
Comparison                   My            x̄             Rate      x̄             Rate
1. Mammalian orders          70 ± 10       7.4 ± 1.6     5.3       24.0 ± 0.4    17.2
2. Animal phyla              650 ± 100     54.7 ± 0.3    4.2       68.3 ± 0.9    5.3
3. Multicellular kingdoms    1100 ± 200    87.0 ± 0.8    4.0       72.5 ± 1.2    3.3
The species compared are listed in Tables 1 and 5. Other conventions are the same as in Table 3.
that molecular evolution is a ‘‘stochastic clock,’’ like radioactive decay. The probability of change is constant, although some variation occurs. Over fairly long periods, a stochastic clock is nevertheless quite accurate, and the joint results of several protein or DNA sequences could, in any case, provide fairly accurate time estimates. The clock predicted by the neutrality theory behaves as a Poisson process, so that the ratio, R, of the variance to the mean (σ²/μ) is expected to be 1, which can readily be empirically tested. The results of many such tests have shown that (i) R is almost universally greater than 1; and (ii) this increase is statistically significant in nearly half of the proteins tested (37). Consequently, several modifications of the neutral theory have been proposed seeking to account for the excess variance of the molecular clock. It has been proposed, for example, that most protein evolution involves slightly deleterious replacements rather than strictly neutral ones; or that the effectiveness of the error-correcting polymerases varies among organisms, so that mutation rates change (6, 22, 37, 39, 40). Either one of these hypotheses could account for the difference between the levels of protein polymorphism observed within species and those predicted by the neutrality theory. Another supplementary hypothesis invokes a generation-time effect. Protein evolution has been extensively investigated in primates and rodents with the common observation that the number of replacements is greater in the rodents (33, 41). In plants, the overall rate at the rbcL locus is more than 5 times greater in annual grasses than in palms, which have much longer generations (32).
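The Poisson test of the clock described above can be illustrated with a minimal sketch. It assumes the simplest possible setup, in which the number of substitutions in one protein has been counted along several lineages of equal age, and it ignores the lineage weighting and significance testing used in published analyses. The function name and the example counts are hypothetical and are not data from this paper.

```python
from statistics import mean, variance


def dispersion_index(substitution_counts):
    """Return R = sample variance / mean of substitution counts per lineage.

    Under a Poisson clock R is expected to be close to 1; values well
    above 1 indicate more rate variation than the Poisson model allows.
    """
    m = mean(substitution_counts)
    if m == 0:
        raise ValueError("mean number of substitutions is zero")
    return variance(substitution_counts) / m


# Hypothetical substitution counts along four lineages of equal age.
counts = [12, 30, 18, 40]
print(f"R = {dispersion_index(counts):.2f}")
```

A value of R computed this way is only descriptive; deciding whether it is significantly greater than 1 requires a statistical test that this sketch does not include.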
These rate differences could be accounted for, according to the generation-effect hypothesis, by assuming that the time-rate of evolution depends on the number of germ-line replications per year, which is several times greater for the short-generation rodents and grasses than for the long-generation primates and palms. The rationale of the assumption
FIG. 5. Global rates of amino acid replacement for GPDH. The points at the lower left are for comparisons between fruit flies (open circles) or between mammals (filled circle). The rates on the right are for replacements × 10⁻¹⁰ per site per year and correspond to the comparisons between Drosophila species (1.1), between species from different animal phyla (4.0), and between species from different kingdoms (4.2). All three rates have been calculated as linear regressions over time.
is that the larger the number of replication cycles, the greater the number of mutational errors that will occur. From a theoretical, as well as operational, perspective, these and other supplementary hypotheses have the discomforting consequence that they involve additional empirical parameters, often not easy to estimate. It is of great epistemological significance that the original proposal of the neutrality theory is (i) highly predictive and, therefore, (ii) eminently testable. These two properties, really two sides of the same coin, become diluted in the modified versions of the theory. Nevertheless, it is commonly assumed that molecular evolution is sufficiently regular over time and across lineages, so that a molecular clock can be assumed for testing phylogenetic hypotheses, or estimating the time of remote evolutionary events. The combined consideration of GPDH and SOD evolution in the same set of species is, however, disquieting. The covarion hypothesis becomes helpful to account for the reduction in rate of evolution that obtains in some proteins when the species compared become increasingly remote, as in the case of SOD. But the covarion model cannot be extended to GPDH, whose rate of evolution increases as the organisms compared become more remote. Similarly, the hypothesis of the generation-time effect cannot account for the divergent patterns of evolution of both GPDH and SOD, since the same set of species is compared in both cases, and thus identical generation times have been involved at all times in the evolution of these species. Similarly, the postulate of slightly deleterious mutations or other subsidiary hypotheses may be adjusted to account for the evolution of one or the other protein, GPDH and SOD, but not for both, without stretching ad hoc their elasticity to make the molecular clock hypothesis universally applicable to any possible empirical state of affairs and, therefore, without any predictive power and untestable (42, 43). 
I have noted above the constraints that occur in the evolution of GPDH in Drosophila, which considerably restrict the number
FIG. 6. Global rates of amino acid replacement for SOD. Symbols and other conventions as for Fig. 5. In contrast with GPDH, the fastest rate of evolution (16.2) is for comparisons between Drosophila species, and the lowest rate (3.3) is for comparisons between kingdoms. The intermediate rate (5.3) is for comparisons between animal phyla.
Table 8. Observed amino acid replacements in SOD and expected values, assuming a covarion model
                                            Amino acid differences
Comparison                    My            Observed    Expected
1. Drosophila subgenera       55 ± 10       18 ± 3      19 ± 3
2. Drosophila–Chymomyza       60 ± 10       23 ± 2      20 ± 4
3. Mammalian orders           70 ± 10       27 ± 2      22 ± 4
4. Drosophila–Ceratitis       100 ± 20      31 ± 2      28 ± 3
5. Monocot–dicot              125 ± 20      28 ± 3      31 ± 5
6. Angiosperm–gymnosperm      220 ± 30      29 ± 7      42 ± 5
7. Mammal–amphibian           350 ± 50      49 ± 2      53 ± 6
8. Tetrapod–fish              400 ± 50      44 ± 4      56 ± 7
9. Vertebrate–insect          600 ± 100     59 ± 3      60 ± 6
10. Animal–yeast              1100 ± 200    67 ± 4      66 ± 7
The expected values are obtained using the covarion model with the parameter values given in the text; they are averages of 40 computer simulations for each entry. The data are modified from Fitch and Ayala (19).
of sites that can accept amino acid replacements and the particular replacements that can occur at each site. It remains obscure why greater constraints would occur in Drosophila than in the Chymomyza or Ceratitis lineages (or, indeed, in other animals, plants, and fungi). But, in any case, the issue is not whether biologically ascertainable processes are at work, which of course they are, in GPDH, SOD, or any other enzymes. The issue rather is whether the processes are of such regularity that some sort of molecular clock may be assumed to be at work. The stark contrast between the pattern of evolution of GPDH and SOD may be an aberration rather than representative of prevailing modes of protein evolution, since protein evolution seems so often to behave in a clocklike manner. But the congruence between observations and the clock predictions is often obtained because the data collected do not have sufficient resolution to exhibit likely discrepancies. The operational risks of assuming that protein clocks are fairly reliable are made evident in Table 9. The rate of GPDH evolution is nearly 4 times faster between animals and plants than between Drosophila species, whereas the rate of SOD evolution is 1/5 as fast. If we were to use the observed rate of Drosophila evolution to estimate the time of divergence between plants and animals, GPDH would yield an estimate of 3,990 My, SOD an estimate of 224 My, both grossly erroneous. The practical conclusions to be drawn are that (i) protein clocks should be used cautiously and weighed against any other available evidence, rather than considered decisive; (ii) several protein clocks should be used whenever feasible, particularly
when important evolutionary events need to be determined (44); (iii) whenever possible, synonymous rather than nonsynonymous nucleotide substitutions should be used, given that substitutions that yield amino acid replacements are more constrained by natural selection. The rapid rate of synonymous nucleotide substitutions becomes, however, a problem whenever long evolutionary spans are at stake, because many superimposed substitutions will have occurred so that the differences observed have little statistical reliability for estimating the multiple hits concealed behind the observed differences. The strategy of using as many separate molecular clocks as feasible is grounded on the convergence expected from the ‘‘law of large numbers;’’ statistical and other biases will tend to cancel as the number of observations increases.

I am grateful to Walter M. Fitch and Richard R. Hudson for valuable comments and to the members of my laboratory who participated in the research herein reported, particularly Kevin Bailey, Eladio Barrio, Michal Jaworski, Michal Krawczyk, Jan Kwiatowski, and Douglas Skarecky. Stephen Rich’s help with computer graphics is much appreciated. This research is supported by National Institutes of Health Grant GM42397.
Table 9. Rates of evolution of GPDH and SOD and estimates of divergence time derived from the Drosophila rate
                            Rate of evolution    Normalized rate    Clock estimates, My
Taxa compared               GPDH     SOD         GPDH     SOD       GPDH      SOD
1. Drosophila subgenera     1.1      16.2        1.0      1.0       55        55
2. Mammalian orders         5.3      17.2        4.8      1.1       340       74
3. Dipteran families        4.7      15.9        4.3      1.0       470       98
4. Animal phyla             4.2      5.3         3.8      0.33      2,500     211
5. Kingdoms                 4.0      3.3         3.6      0.20      3,990     224
The rate of evolution is in units of 10⁻¹⁰ amino acid replacements per site per year. The normalized rate is relative to the rate between the Drosophila subgenera. The clock estimates of time divergence use the average amino acid replacements between the particular organisms and assume that they are evolving as a molecular clock that ticks at the Drosophila rate.
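The clock estimates in Table 9 can be approximated with a few lines of arithmetic. The sketch below rests on an assumption inferred from the tabulated values rather than stated in this section: that a rate is the mean number of replacements per site divided by twice the divergence time, because changes accumulate along both diverging lineages. The function names are hypothetical. The input values are the mean distances between kingdoms from Table 7 and the Drosophila rates from Table 9.

```python
def rate_per_site_per_year(replacements_per_100_sites, divergence_my):
    """Rate along one lineage; the distance is split over two lineages (assumed)."""
    return (replacements_per_100_sites / 100.0) / (2.0 * divergence_my * 1e6)


def clock_estimate_my(replacements_per_100_sites, rate):
    """Divergence time in My implied by a distance if the clock ticks at `rate`."""
    return (replacements_per_100_sites / 100.0) / (2.0 * rate) / 1e6


# Mean replacements per 100 sites between kingdoms (Table 7) and the
# Drosophila rates in replacements per site per year (Table 9).
gpdh_estimate = clock_estimate_my(87.0, 1.1e-10)
sod_estimate = clock_estimate_my(72.5, 16.2e-10)

print(f"GPDH clock estimate: {gpdh_estimate:,.0f} My")
print(f"SOD clock estimate:  {sod_estimate:,.0f} My")
```

Because the tabulated rates are rounded to two significant figures, the GPDH value computed this way comes out slightly below the 3,990 My given in Table 9, while the SOD value agrees with the tabulated 224 My. Both are far from the 1,100 My divergence time assumed for the kingdoms, which is the point the table makes.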
1. Zuckerkandl, E. & Pauling, L. (1962) in Horizons in Biochemistry, eds. Kasha, M. & Pullman, B. (Academic, New York), pp. 97–166.
2. Zuckerkandl, E. & Pauling, L. (1965) in Evolving Genes and Proteins, eds. Bryson, V. & Vogel, H. J. (Academic, New York), pp. 97–166.
3. Margoliash, E. (1963) Proc. Natl. Acad. Sci. USA 50, 672–679.
4. Kimura, M. (1968) Nature (London) 217, 624–626.
5. Kimura, M. (1969) Proc. Natl. Acad. Sci. USA 63, 1181–1188.
6. Kimura, M. (1983) The Neutral Theory of Molecular Evolution (Cambridge Univ. Press, Cambridge, U.K.).
7. Kimura, M. & Ohta, T. (1971) Nature (London) 229, 467–469.
8. O’Brien, S. J. & MacIntyre, R. J. (1978) in The Genetics and Biology of Drosophila, eds. Ashburner, M. & Wright, T. R. F. (Academic, New York), Vol. 2a, pp. 395–551.
9. O’Brien, S. J. & MacIntyre, R. J. (1972) Genetics 71, 127–138.
10. Bewley, G. C., Cook, J. L., Kusakabe, S., Mukai, T., Rigby, D. L. & Chambers, G. K. (1989) Nucleic Acids Res. 17, 8553–8567.
11. von Kalm, L., Weaver, J., DeMarco, J., MacIntyre, R. J. & Sullivan, D. T. (1989) Proc. Natl. Acad. Sci. USA 86, 5020–5024.
12. Cook, J. L., Bewley, G. C. & Shaffer, J. B. (1988) J. Biol. Chem. 263, 10858–10864.
13. Lakovaara, S., Saura, A. & Lankinen, P. (1977) Evolution 31, 319–330.
14. Kwiatowski, J., Krawczyk, M., Jaworski, M., Skarecky, D. & Ayala, F. J. (1997) J. Mol. Evol. 44, 9–22.
15. Barrio, E. & Ayala, F. J. (1997) Mol. Phylogenet. Evol. 7, 79–93.
16. Wheeler, M. R. (1981) in The Genetics and Biology of Drosophila, eds. Ashburner, M., Carson, H. L. & Thompson, J. N. J. (Academic, New York), Vol. 3a, pp. 1–97.
17. Grimaldi, D. (1990) Bull. Am. Mus. Nat. Hist. 197, 1–139.
18. Kwiatowski, J., Skarecky, D., Bailey, K. & Ayala, F. J. (1994) J. Mol. Evol. 38, 443–454.
19. Fitch, W. M. & Ayala, F. J. (1994) Proc. Natl. Acad. Sci. USA 91, 6802–6807.
20. Ayala, F. J., Barrio, E. & Kwiatowski, J. (1996) Proc. Natl. Acad. Sci. USA 93, 11729–11734.
21. Dayhoff, M. D. (1978) Atlas of Protein Sequences and Structure (Natl. Biomed. Res. Found., Washington, DC).
22. Kimura, M. (1980) J. Mol. Evol. 16, 111–120.
23. Li, W.-H. (1993) J. Mol. Evol. 36, 96–99.
24. Fridovich, I. (1986) Adv. Enzymol. 58, 61–97.
25. Steinman, H. M. (1988) Basic Life Sci. 49, 641–646.
26. Kwiatowski, J., Hudson, R. R. & Ayala, F. J. (1991) Free Radical Res. Commun. 12–13, 363–370.
27. Smith, M. W. & Doolittle, R. F. (1992) J. Mol. Evol. 34, 175–184.
28. Kwiatowski, J., Skarecky, D., Burgos, M. & Ayala, F. J. (1992) Insect Mol. Biol. 1, 3–13.
29. Kwiatowski, J., Skarecky, D. & Ayala, F. J. (1992) Mol. Phylogenet. Evol. 1, 72–82.
30. Wilson, A. C., Carlson, S. S. & White, T. (1977) Annu. Rev. Biochem. 46, 573–639.
31. Wu, C.-I & Li, W.-H. (1985) Proc. Natl. Acad. Sci. USA 82, 1741–1745.
32. Gaut, B. S., Muse, S. V., Clark, W. D. & Clegg, M. T. (1992) J. Mol. Evol. 35, 292–303.
33. Li, W.-H., Ellsworth, D. L., Kruchkal, J. K., Chang, B. H.-J. & Hewett-Emmett, D. (1996) Mol. Phylogenet. Evol. 5, 182–187.
34. Fitch, W. M. & Markowitz, E. (1970) Biochem. Genet. 4, 579–593.
35. Fitch, W. M. & Langley, C. H. (1976) Fed. Proc. 35, 2092–2097.
36. Bousque, J., Strauss, S. H., Doerksen, A. H. & Price, R. A. (1992) Proc. Natl. Acad. Sci. USA 89, 7844–7848.
37. Gillespie, J. H. (1991) The Causes of Molecular Evolution (Oxford Univ. Press, New York).
38. Wells, R. S. (1996) Proc. R. Soc. London Ser. B 263, 393–400.
39. Kimura, M. & Ohta, T. (1972) J. Mol. Evol. 2, 87–90.
40. Li, W.-H. & Graur, D. (1991) Fundamentals of Molecular Evolution (Sinauer, Sunderland, MA).
41. Kohne, D. E. (1970) Quart. Rev. Biophys. 33, 327–375.
42. Popper, K. R. (1959) The Logic of Scientific Discovery (Hutchinson, London).
43. Ayala, F. J. (1994) Hist. Phil. Life Sci. 16, 205–240.
44. Wry, G. A., Levinton, J. L. & Shapiro, L. H. (1996) Science 274, 568–573.
Proc. Natl. Acad. Sci. USA Vol. 94, pp. 7784–7790, July 1997 Colloquium Paper
This paper was presented at a colloquium entitled ‘‘Genetics and the Origin of Species,’’ organized by Francisco J. Ayala (Co-chair) and Walter M. Fitch (Co-chair), held January 30–February 1, 1997, at the National Academy of Sciences Beckman Center in Irvine, CA.
Evolution of codon usage bias in Drosophila JEFFREY R. POWELL*
AND
ETSUKO N. MORIYAMA
Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT 06520-8106
codons coding for the same amino acid should be equally represented in a large sample of genes. Therefore, unequal usage of synonymous codons must be due to either mutation bias or selection. Drosophila has served as a model multicellular eukaryote in the study of codon usage bias (e.g., ref. 3). As we will document below, there is little evidence that mutation bias is the cause of codon usage bias in Drosophila, and thus we are left with selection as the likely candidate to explain codon usage in these flies. Here we will first review the pattern of codon usage bias in Drosophila, then present data relevant to the cause of the bias, and end by discussing the effect of codon bias on intra- and interspecific DNA variation.
ABSTRACT We first review what is known about patterns of codon usage bias in Drosophila and make the following points: (i) Drosophila genes are as biased or more biased than those in microorganisms. (ii) The level of bias of genes and even the particular pattern of codon bias can remain phylogenetically invariant for very long periods of evolution. (iii) However, some genes, even very tightly linked genes, can change very greatly in codon bias across species. (iv) Generally G and especially C are favored at synonymous sites in biased genes. (v) With the exception of aspartic acid, all amino acids contribute significantly and about equally to the codon usage bias of a gene. (vi) While most individual amino acids that can use G or C at synonymous sites display a preference for C, there are exceptions: valine and leucine, which prefer G. (vii) Finally, smaller genes tend to be more biased than longer genes. We then examine possible causes of these patterns and discount mutation bias on three bases: there is little evidence of regional mutation bias in Drosophila, mutation bias is likely toward A+T (the opposite of codon usage bias), and not all amino acids display the preference for the same nucleotide in the wobble position. Two lines of evidence support a selection hypothesis based on tRNA pools: highly biased genes tend to be highly and/or rapidly expressed, and the preferred codons in highly biased genes optimally bind the most abundant isoaccepting tRNAs. Finally, we examine the effect of bias on DNA evolution and confirm that genes with high codon usage bias have lower rates of synonymous substitution between species than do genes with low codon usage bias. Surprisingly, we find that genes with higher codon usage bias display higher levels of intraspecific synonymous polymorphism. This may be due to opposing effects of recombination.
Levels and Patterns of Codon Usage in Drosophila

Levels of Bias. Several measures of the degree of codon bias for a given gene have been developed. Here we use one termed the effective number of codons, ENC (4). This is analogous to the effective number of alleles and is related to the ‘‘homozygosity’’ for codons—i.e., the probability that two randomly chosen synonymous codons are identical. ENC ranges from 20 if only one codon is used for each amino acid to 61 if all synonymous codons are used equally. ENC can also be calculated for individual amino acids, what we call ENC-X (X = particular amino acid). As originally formulated (4), the contributions of individual amino acids to ENC are dependent upon the number of synonymous codons—i.e., twofold degenerate amino acids can have a maximum ENC-X of 2, fourfold degenerate amino acids have a maximum ENC-X of 4, etc. To allow each amino acid to equally contribute to ENC, we scaled ENC-X to range from 0 (no bias) to 1 (maximum bias) for all amino acids, what we call sENC-X (unpublished work). Fig. 1 shows the distribution of ENC for genes available for several species of both Drosophila and microorganisms. Bacteria and yeast have long been model organisms for the study of codon usage bias (5, 6), and it is clear from Fig. 1 that Drosophila genes are as biased as those of microorganisms. All three species of Drosophila and Escherichia coli have about equal mean ENC, which is somewhat less than for Saccharomyces cerevisiae. However, D. melanogaster has a somewhat greater proportion of very highly biased genes than does E. coli. If we consider extreme bias as an ENC of 35 or less, 8% of D. melanogaster genes and 5% of E. coli genes are in this category. Another way of seeing the same phenomenon is to note that, while having the same mean, D. melanogaster genes display a greater variance (SD) in codon usage bias than do E. coli genes (Fig. 1).

Phylogenetic Persistence of Bias.
Generally, genes remain at a certain level of codon usage bias across species. Fig. 2 shows the correlations between species of Drosophila. It is important to realize that the level of divergence between the species
As far as is known, synonymous mutations are truly neutral with respect to natural selection. The above quotation from King and Jukes (1) was one of the major, and more reasonable, tenets of the neutral theory of molecular evolution. With few exceptions (e.g., ref. 2), even those researchers who tended toward the selectionist view of molecular evolution were willing to concede synonymous substitutions to the neutralists. After all, such mutations do not affect the structure of the primary gene product and therefore should not be able to affect the phenotype, the level at which natural selection acts. One of the more surprising observations provided by the accumulating DNA sequence data has been the evidence that selection can and does affect synonymous substitutions. One of the strongest pieces of evidence of the nonneutrality of synonymous substitutions is codon usage bias, the unequal usage of codons encoding the same amino acid. If synonymous substitutions are neutral and if mutations are truly random (i.e., equal probability of change to all nucleotides), then all
Abbreviation: ENC, effective number of codons. *To whom reprint requests should be addressed. e-mail:
[email protected].
© 1997 by The National Academy of Sciences 0027-8424/97/947784-7$2.00/0 PNAS is available online at http://www.pnas.org.
Colloquium Paper: Powell and Moriyama
Proc. Natl. Acad. Sci. USA 94 (1997)
FIG. 1. Distribution of genes with various degrees of codon usage bias measured by ‘‘effective number of codons’’, ENC; lower ENC is greater bias. The number of genes for each species (Drosophila melanogaster, Drosophila pseudoobscura, Drosophila virilis, Escherichia coli, and Saccharomyces cerevisiae) and the mean ENC ± SD are shown in brackets.
compared is very high; Ks, the synonymous substitutions per site, is greater than 1 for most genes. This indicates enough evolutionary time has elapsed to radically change codon usage in the absence of constraints. Not only does the level of bias remain conserved, but often the actual pattern as well. One example is Alcohol dehydrogenase (Adh), which has been sequenced in more than 50 species of Drosophila. Table 1 shows the pattern of codon usage for three amino acids. The subgenera Sophophora and Drosophila diverged from each other about 50 million years ago (7), so the avoidance of particular codons in Adh, namely AUA (isoleucine), GGG (glycine), and UUA (leucine), has persisted for a very long time. It is not the case that Drosophila simply cannot use these codons; many genes do use them, an example being the very closely linked Adh-related (Adhr) gene shown in the lower part of Table 1. While most genes display evolutionary conservatism for codon bias, other genes do not. Fig. 2 notes a few examples of exceptions which are of some interest. First, Adh in D. virilis is quite unbiased, having an ENC of about 53, while in D. melanogaster and most other species it is quite biased. (Note that even though low in codon usage bias over all the gene, D. virilis Adh still avoids the three codons noted in Table 1, so the avoidance of these codons is not simply due to overall bias.) Adhr also varies in bias between species, being nearly totally unbiased in D. melanogaster but displaying quite high bias in D. pseudoobscura (Fig. 2). Contrariwise, Adh is more biased in D. melanogaster (ENC = 31.4) than in D. pseudoobscura (ENC = 36.7). These two genes are only a few hundred base pairs apart.
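The ENC values quoted for Adh can be made more concrete with a minimal sketch of how such a number might be computed from a coding sequence. The code assumes the commonly cited form of Wright's measure (4): a codon homozygosity F is computed for each amino acid, F is averaged within each degeneracy class, and ENC = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6. The handling of amino acids with too few codons and of empty classes is a simplification made here, and all names are hypothetical; this is not the authors' program.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Synonymous codon families keyed by amino acid (stop codons excluded).
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        SYNONYMS.setdefault(aa, []).append(codon)

# Number of amino acids in each degeneracy class of the standard code.
CLASS_SIZES = {2: 9, 3: 1, 4: 5, 6: 3}


def effective_number_of_codons(sequence):
    """Approximate ENC (20 = maximal bias, 61 = no bias) for an in-frame sequence."""
    seq = sequence.upper().replace("U", "T")
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3))

    homozygosity = {k: [] for k in CLASS_SIZES}
    for codons in SYNONYMS.values():
        k = len(codons)
        n = sum(counts[c] for c in codons)
        if k == 1 or n < 2:
            continue  # Met and Trp are uninformative; tiny samples are skipped
        f = (n * sum((counts[c] / n) ** 2 for c in codons) - 1) / (n - 1)
        if f > 0:
            homozygosity[k].append(f)

    enc = 2.0  # Met and Trp each contribute exactly one codon
    for k, n_amino_acids in CLASS_SIZES.items():
        values = homozygosity[k]
        # Simplifying assumption: a class with no usable data is treated as unbiased.
        mean_f = sum(values) / len(values) if values else 1.0 / k
        enc += n_amino_acids / mean_f
    return min(enc, 61.0)  # values above 61 are capped at 61
```

Passing a sequence that uses a single codon for every amino acid returns 20, and a sequence that uses all synonymous codons equally returns 61, matching the range described in the text. Short genes give noisy estimates, so values from this sketch should be read with that in mind.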
The Serendipity genes, indicated by points Sry-b and Sry-d in Fig. 2, are also of some interest. These genes are part of a gene cluster that contains six transcriptional units in an 8-kb stretch of DNA. In Fig. 3 we compare the codon usage bias of these genes between D. melanogaster and D. pseudoobscura. Some genes in this cluster have remained relatively highly biased (e.g., the ribosomal protein gene M(3)99D) and others remain quite unbiased (e.g., janA, janB, and Sry-a). Interspersed are the two Sry genes that shift in level of codon bias between these species. There is evidence that Sry genes are expressed differently in these two species (8), which may be related to their change in level of codon usage bias.
Pattern of Codon Usage Bias. While ENC and related measures indicate the overall bias, it is also instructive to look more closely at the pattern of codon bias. Generally, Drosophila genes with high codon usage bias have G and especially C at silent positions (9, 10). Table 2 shows the base composition at two- and fourfold degenerate synonymous sites for the approximately 10% highest and 10% lowest biased genes in D. melanogaster. Do all amino acids contribute to the codon usage bias of a gene and, if so, do they all show the same pattern (i.e., an increase in C-ending codons)? Comparing the individual amino acid measure, ENC-X, to overall bias of the gene, we found all amino acids contribute significantly (P < 0.0001) to the overall bias of a gene, although Asp is a clear outlier with relatively little contribution to overall codon usage bias (unpublished work). We then examined whether the pattern of bias for each amino acid is similar; Table 3 shows the correlation of
codon usage bias (ENC-X) of each fourfold and sixfold degenerate amino acid and the base composition at synonymous sites. As expected, for most amino acids the highest positive correlation is an increase in C as bias increases. However, there are exceptions to the preference for C. Val and Leu increase in use of G as bias increases. This pattern is very similar in D. pseudoobscura and D. virilis: a general preference for C except for Val and Leu, which prefer G in the codon third position. For Ile (threefold degenerate) and most twofold degenerate T/C amino acids the highest significant positive correlation is for an increase in C as bias increases. The exception is Asp, which shows no significant correlation in its ENC-X and base composition at the wobble position, in agreement with the previous point. For all A/G twofold degenerate amino acids, G increases as bias increases (unpublished work).
Gene Length. As we discuss below, some explanations of codon usage bias may be affected by the length of a gene. Does the length of a gene in D. melanogaster correlate with the degree of codon bias? To answer this, we need to be certain to use a measure of bias that itself is not biased by sample size (i.e., the number of codons in a gene). Wright (4) performed simulation studies on ENC and found little or no detectable bias with sample size; we have confirmed this finding (E.N.M., unpublished data). Fig. 4 summarizes the relationship between gene length and codon usage bias: smaller genes tend to have higher bias than do longer genes.
Recombination. There is also an effect of the level of recombination on the level of codon usage bias of Drosophila genes: genes in regions of low recombination tend to have low bias (11). This is attributed to the fact that selection can act more effectively at single loci or nucleotide positions when recombination is high, the so-called Hill–Robertson (12) effect.
Causes
Mutation Bias. There is evidence that mutation bias may affect codon usage in warm-blooded vertebrates that have mosaic genomes consisting of long stretches of A+T-rich DNA interspersed with long stretches of G+C-rich DNA. This isochore structure, as it is termed (13), is thought to be due to regional differences in mutation bias (14, 15). The observation is that genes in A+T-rich isochores tend to have A+T predominantly at silent sites, while genes in G+C-rich isochores have G+C more often at silent sites (16, 17). This is shown by a correlation between base content of introns and the exons of the same gene.
FIG. 2. Correlation of codon usage bias across species. The correlation coefficient for the upper graph is 0.42 (n = 31, P = 0.01) and that for the lower graph is 0.42 (n = 63, P < 0.001).
Table 1.
Codon usage for Adh and Adhr

                                                       No. of times codon used
                                                       Isoleucine      Glycine              Leucine
Gene  Subgenus    Group         No. of   Mean ENC      AUU  AUC  AUA   GGU  GGC  GGA  GGG   UUA  UUG  CUU  CUC  CUA  CUG
                                species
Adh   Sophophora  melanogaster     9     31.8 ± 3.2     72  136    0    51   81   36    1     0   30    3   33    0  224
Adh   Sophophora  obscura          7     41.5 ± 5.8     64   93    3    37   87    9    0     1   22    6   16    1  125
Adh   Sophophora  willistoni       6     45.9 ± 0.7     79   53    0    67   33    8    0     2   99    7   13    3   32
Adh   Idiomyia    “Hawaiians”     10     43.8 ± 1.7    113  115    1    66  112    6    0     0   49   43   26   23  103
Adh   Drosophila  repleta          9     44.6 ± 2.7     77  144    4    35   89   29    2     2   23   16   36    6  124
Adh   Drosophila  virilis          8     53.5 ± 1.7     92  100    3    35   64   41    4     0   18   35    8   15  112
Adhr  Sophophora  melanogaster     6     57.1 ± 2.0     42   33   23    28   12   58   11    13   32    6    7   18   42
Adhr  Sophophora  obscura          7     49.5 ± 3.9     48   49   21    45   46   34   14     7   20    6   15   16   84

Numbers in main body are numbers of times each codon is used in that group of species. ENC is effective number of codons, defined in the text, and is presented ± SD.
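As a rough illustration of how the codon counts in Table 1 relate to a bias measure, the sketch below computes the codon homozygosity F for a single amino acid and its reciprocal, an effective number of codons for that amino acid, following the form of Wright's estimator (4). The function name and the choice of the melanogaster-group isoleucine counts are ours, for illustration only; the full ENC combines such values across all degeneracy classes, which is not attempted here.

```python
def codon_homozygosity(counts):
    """Homozygosity F = (n * sum(p_i^2) - 1) / (n - 1) for one amino acid,
    where n is the total codon count and p_i the frequency of each synonymous codon."""
    n = sum(counts)
    if n < 2:
        raise ValueError("need at least two observed codons")
    sum_p2 = sum((c / n) ** 2 for c in counts)
    return (n * sum_p2 - 1) / (n - 1)

# Isoleucine counts (AUU, AUC, AUA) for the melanogaster group, taken from Table 1.
adh_ile = [72, 136, 0]
adhr_ile = [42, 33, 23]

# The reciprocal of F is the effective number of codons for this amino acid (maximum 3 for Ile).
adh_eff = 1 / codon_homozygosity(adh_ile)
adhr_eff = 1 / codon_homozygosity(adhr_ile)
print(f"Adh  Ile: {adh_eff:.2f} effective codons out of 3")
print(f"Adhr Ile: {adhr_eff:.2f} effective codons out of 3")
```

A lower value means fewer codons are effectively in use; the near-absence of AUA in Adh pulls its value well below that of Adhr.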
FIG. 3. Codon usage bias (ENC) for six tightly linked genes in two species. Note how Sry-b and Sry-d change considerably between species, while the other genes change much less.
Could mutation bias account for codon usage bias in Drosophila? For three reasons this seems to be very unlikely. First, Drosophila have no isochores (18). As noted above, very tightly linked genes can be very different in codon usage bias, notably Adh and Adhr, which are only a few hundred base pairs apart, and the very dense Sry cluster. There is no correlation between base content of introns and exons of a gene (9) or only weak correlation with twofold degenerate amino acids (19); in the latter study the correlations were greatest for weakly biased genes, the opposite of what would be expected if mutation bias were a cause of codon usage bias. A second point is that mutation bias in Drosophila is probably toward A+T, whereas codon usage bias increases with use of G and C. As we noted above, genes in regions of low recombination have less codon usage bias than genes in regions of high recombination, a phenomenon thought to be due to the ineffectiveness of selection at individual nucleotide sites. The extreme example of low recombination is the dot fourth chromosome of D. melanogaster, which exhibits no recombination. Thus, in the absence of effective selection, genes on the fourth chromosome should display base composition at synonymous sites reflective of mutation bias. The bottom line in Table 2 shows the base composition at synonymous sites for the seven fourth chromosome genes available for D. melanogaster. A and T are the most common bases at both two- and fourfold degenerate sites. Consistent with A+T mutation bias is the further observation that introns have higher A+T content than do exons in Drosophila (9, 19). A third argument against mutation bias as a major cause of codon usage bias is that not all amino acids display the same pattern of bias. For example, if mutation bias toward C is why highly biased genes have C most frequently at synonymous sites (Table 2), then all amino acids should show this bias.
But as shown in Table 3, two amino acids, Val and Leu, increase in use of G as bias increases. Taken together, these three observations make it unlikely that mutation bias is playing a large role in maintaining codon usage in Drosophila. In the absence of such bias, we are left with some form of selection as an explanation.
Selection for Codon Usage. The most plausible and well-documented selection-based explanation for codon usage bias is selection for efficient translation related to the relative abundance of isoaccepting tRNAs (20, 21). The evidence for this comes primarily from microorganisms, namely bacteria and yeast. Preferred codons are those that can base pair optimally with the most abundant tRNA. Generally this involves Watson–Crick pairing or, when bases are modified in the tRNA, some modifications in optimal binding occur. Codon usage bias in microorganisms is well explained by what have been called “Ikemura's rules” (21) describing optimal binding. There are two observations which support selection for efficient translation in microorganisms: highly expressed genes have greater codon bias (presumably because selection is more intense for efficient translation of such genes), and the relative abundances of isoaccepting tRNAs match the codon usage very well. Is there evidence that a similar mode of selection could be operating in Drosophila? Highly expressed genes in Drosophila do tend to be highly biased in codon usage. Among the approximately 10% highest biased genes can be found larval serum proteins, larval and adult cuticle proteins, yolk proteins, chorions, actins, alcohol dehydrogenase, superoxide dismutase, lysozymes, amylases, and α- and β-tubulins; 23 of the 26 known ribosomal proteins are also in this group. Genes which have two copies that differ in level of expression have more codon usage bias in the more highly expressed copy (3). Also similar to the situation in microorganisms, what is known of relative levels of isoaccepting tRNAs matches quite well the codon bias in Drosophila (Table 4, using data from refs. 22 and 23).
Eleven of the 18 amino acids with redundant codons are shown here; for the other 7 amino acids, the

Table 2. Base compositions among different D. melanogaster gene groups

                                    Base composition, %
Gene group          No. of genes    T4     C4     A4     G4     C2     A2
~10% highest bias       122        16.2   51.1    7.7   25.0   81.0    8.0
~10% lowest bias        127        21.5   29.6   23.9   25.0   49.6   41.7
Fourth chromosome         7        33.3   18.8   32.3   15.7   39.9   63.0

The average base composition at fourfold degenerate sites (T4, C4, A4, and G4) and at the twofold degenerate sites (C2 and A2) are shown. C2 % and A2 % were calculated separately from the T/C and A/G twofold degenerate sites, respectively.
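The contrast among the three gene groups in Table 2 is easier to see when the fourfold-site frequencies are collapsed into G+C and A+T totals. The short sketch below does only that arithmetic on the Table 2 percentages; the dictionary layout and variable names are ours.

```python
# Fourfold-degenerate site composition (%) from Table 2, ordered as (T4, C4, A4, G4).
table2 = {
    "highest bias": (16.2, 51.1, 7.7, 25.0),
    "lowest bias": (21.5, 29.6, 23.9, 25.0),
    "fourth chromosome": (33.3, 18.8, 32.3, 15.7),
}

gc4 = {}
for group, (t4, c4, a4, g4) in table2.items():
    gc4[group] = round(c4 + g4, 1)   # G+C at fourfold sites
    at4 = round(t4 + a4, 1)          # A+T at fourfold sites
    print(f"{group:18s} G+C = {gc4[group]:5.1f}%  A+T = {at4:5.1f}%")
```

G+C falls from roughly three-quarters of fourfold sites in the most biased genes to about one-third on the fourth chromosome, which is the pattern the text attributes to A+T mutation bias acting where selection is ineffective.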
Table 3. Codon usage bias of fourfold and sixfold degenerate amino acids

                        Correlation coefficients of codons
Amino acid   Codons     NNT       NNC       NNA       NNG
Fourfold degenerate
Val          GTN       −0.48*    −0.04     −0.56*     0.71*
Pro          CCN       −0.36*     0.58*    −0.31*    −0.14*
Thr          ACN       −0.33*     0.76*    −0.56*    −0.27*
Ala          GCN       −0.17*     0.79*    −0.60*    −0.40*
Gly          GGN       −0.33*     0.63*    −0.22*    −0.45*
Sixfold degenerate
Leu          CTN, TTR
Ser          TCN, AGY
Arg          CGN, AGR
−0.52*
0.06
−0.28* −0.41* 0.16*
0.45* 0.16* 0.74*
−0.41* −0.51* −0.43*
0.86* −0.42* 0.12*
−0.51* −0.36*
−0.33* −0.41*

The numbers shown are the correlation coefficients between the degree of codon usage bias of the individual amino acid, ENC-X, and the frequency of the base in the third position. The largest positive correlation for each amino acid is highlighted in boldface type. *, Significance at P < 0.001. R = A or G; Y = T or C.
anticodons of the most abundant isoaccepting tRNAs are not known. However, for 5 (Pro, Thr, Ala, Leu, and Ile) nuclear copies of tRNA with optimal binding are known, although their relative abundance has not been studied. No tRNA sequences are known for the two remaining amino acids, Cys and Gln. Therefore, in every case where relative abundance of isoaccepting tRNAs is known, there is a very good match with the most used codons. The tRNA pool/translational efficiency hypothesis to explain codon usage bias in microorganisms is thought to be due to the fact that all genes in these single-celled organisms share
FIG. 4. Correlation between gene length (exons only) and codon usage bias. The correlation coefficient was calculated by using each gene separately, while the graph was constructed by lumping genes into size classes for visual clarity.
Table 4. Most abundant isoaccepting tRNA and preferred codons

              Preferred codon in      Most abundant tRNA anticodon
Amino acid    highly biased genes     Adult    Third instar    First instar
Val           GUG                     CAC      CAC             CAC
Gly           GGC                     GCC      GCC             GCC
Ser           UCC                     IGA      IGA             IGA
Tyr           UAC                     GUA      GUA             GUA
His           CAC                     GUG      GUG             GUG
Asn           AAC                     ?        ?               GUU
Asp           GAC                     QUC      GUC             GUC
Lys           AAG                     CUU      CUU             CUU
Glu           GAG                     SUC      SUC             ?
Phe           UUC                     GmAA     GmAA            GmAA
Arg           CGC                     A*CG     A*CG            A*CG

I = inosine, which binds optimally to C and U; Q = queuosine, which binds C and U about equally; S = 2-thiouridine, which binds optimally to A and G; and Gm = 2′-O-methylguanosine, which binds optimally with C in A/T-A/T-Y codons. Optimal binding rules follow Ikemura (21). Information on tRNAs is from refs. 22 and 23. *Determined from nuclear copy; most often A is modified to I in first anticodon position.
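For the entries of Table 4 in which the anticodon carries no modified base, the match between preferred codon and most abundant tRNA can be checked mechanically: the anticodon should be the reverse complement of the codon. The sketch below assumes plain Watson–Crick pairing only and deliberately skips the modified anticodons (I, Q, S, Gm, A*), whose pairing follows the rules given in the footnote; the helper name is ours.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def watson_crick_anticodon(codon):
    """Return the 5'->3' anticodon that pairs with a 5'->3' codon by Watson-Crick rules."""
    return "".join(COMPLEMENT[base] for base in reversed(codon))

# (amino acid, preferred codon, most abundant anticodon) for unmodified entries of Table 4.
# Asp uses the unmodified GUC anticodon listed for two of the three stages;
# the remaining Asp entry is queuosine-modified (QUC) and is not checked here.
rows = [
    ("Val", "GUG", "CAC"),
    ("Gly", "GGC", "GCC"),
    ("Tyr", "UAC", "GUA"),
    ("His", "CAC", "GUG"),
    ("Asp", "GAC", "GUC"),
    ("Lys", "AAG", "CUU"),
]

matches = {aa: watson_crick_anticodon(codon) == anticodon for aa, codon, anticodon in rows}
for aa, ok in matches.items():
    print(aa, "matches" if ok else "does not match")
```

Each of these six preferred codons pairs exactly with its listed anticodon, which is the sense in which the text describes a very good match.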
the same tRNA pool, and thus converge on a single optimal codon usage pattern. In multicellular eukaryotes, it is conceivable that different tissue types have different tRNA pools, perhaps adjusted to match codon usage of genes most expressed in each tissue. The evidence from Drosophila does not indicate tissue specificity in either level or pattern of codon usage bias. For example, the avoidance of the three codons noted in Table 1 (AUA, GGG, and UUA) is shared by genes expressed primarily in the fat body (Adh), in ovarian egg chambers (chorions), and in the midgut (Amylase), as well as occurring in genes expressed in all cell types, e.g., myosin and Glyceraldehyde-3-phosphate dehydrogenase (unpublished data). If codon usage in Drosophila is explained by selection for efficient translation, is this consistent with the length effect documented above? Mutations to nonoptimal codons will have a greater relative effect in smaller genes compared with larger genes. Assume a nonoptimal codon requires twice as long to incorporate an amino acid as does the optimal codon. In a short gene, with say 100 codons, a mutation to a nonoptimal codon from an optimal one will increase translation time by 1%, whereas a similar mutation in a gene with 1,000 codons would increase translation time by only 0.1%. Alternatively, the length effect could be explained by the fact that highly expressed genes tend to be short. With the present data it is impossible to eliminate either explanation. Efficiency of translation has two interrelated effects. First is speed of translation, presumably especially important for highly and/or rapidly expressed genes. A second aspect affected by codon usage is accuracy of translation, i.e., the relative rate of misincorporation of amino acids. It is known that, at least in microorganisms, nonoptimal codons misincorporate more frequently than do optimal codons (24, 25). The issue of selection for speed vs. accuracy is very difficult to disentangle, and both may well be occurring. Most misincorporation is thought to occur during the waiting time for the “search” for the ternary complex (aminoacyl-tRNA–elongation factor Tu–GTP) matching the codon being translated; the longer the wait, the higher the probability of misincorporation. Therefore genes translated fast are also translated more accurately. Akashi (26) has argued that selection for accuracy may account for at least some of the codon bias in Drosophila. He reasoned that selection would be greatest for misincorporation of amino acids at crucial functional sites in a protein. He identified such amino acids by evolutionary conservation and found that conserved amino
acids had a tendency to have greater codon usage bias than did amino acids that vary across species. This may account for some codon usage bias, but it cannot explain patterns such as in Adh, where certain codons are avoided throughout a gene (Table 1).
Finally, in this section we note that the tRNA abundance/efficiency of translation hypothesis can account for the maintenance of codon usage bias in Drosophila, but this still leaves open the question of the origin of the bias. Did genes evolve to match tRNA pools? Or did tRNA pools evolve to match codon usage of genes? This is something of a “chicken or egg first” question which cannot be answered. It is conceivable that other factors could have initiated the codon usage bias which subsequently led to selection for adjustment of the relative levels of isoaccepting tRNAs; this could set up a feedback cycle. It seems unlikely in Drosophila that mutation bias could have been the initiating factor because, as we argued above, mutation bias is toward A+T, while codon bias is toward G+C. However, factors such as transcriptional efficiency and mRNA stability and processing could also potentially cause codon usage bias and be initiators of the selection for adjustment of tRNA pools.
FIG. 5. Correlation between synonymous DNA divergence between species of Drosophila and codon usage bias of a gene. Numbers of synonymous substitutions per synonymous site were calculated by a method of Moriyama and Powell (37). The mean ENC of the gene from the two species was used. *, P = 0.05; ***, P = 0.001.
FIG. 6. Schematic hypothesis of why, in D. melanogaster, synonymous polymorphisms are not less in genes with higher codon usage bias.
Effects of Codon Usage Bias on DNA Evolution
If selection constrains codon usage to differing degrees in different genes, then we expect to see some differences in evolutionary rates of change at synonymous sites. Genes with little or no codon usage bias should exhibit a higher Ks than do genes with high codon usage bias. This has been shown to be the case in both bacteria (27) and Drosophila (28). Here we update this observation by including many more genes than were previously available and by making three species comparisons at different levels of divergence (Fig. 5). Furthermore, we have developed a method (37) for estimating synonymous substitutions that corrects for base composition, and thus the patterns in Fig. 5 are not simply due to artifacts caused by base composition differences among genes. The expected correlations are best observed for the more distant pairs of D. melanogaster–D. pseudoobscura and D. melanogaster–D. virilis. We suspect the rather large scatter and weak correlation for the D. melanogaster–D. simulans pair is due simply to noise in the data because the divergence time is quite short for this pair,
the Ks being on average about an order of magnitude lower than for the other two pairs. Also, it is possible sharing of polymorphic alleles in closely related species may also be obscuring the picture. Does a similar phenomenon occur for intraspecific polymorphisms—i.e., do more highly biased genes have less synonymous polymorphism within a species? We observed the opposite for 21 genes in D. melanogaster for which data on intraspecific variation were available (29): there is a statistically significant positive correlation between codon usage bias and level of synonymous polymorphism in a gene. How can this be explained? We speculate this may be due to the effect of variation in recombination. Fig. 6 outlines the argument. Genes vary in their levels of recombination dependent upon position in the genome. Increased recombination can have two effects, one of which is to increase codon usage bias (19) and the other is to increase synonymous polymorphisms (30). Both are due to a decrease in hitchhiking effects of linked genes. As mentioned previously, selection at single nucleotide positions is more effective in regions of high recombination, thus allowing for an increase in selection for optimal codons. The effect of recombination on levels of synonymous polymorphism is thought to be due to selective ‘‘sweeps’’ at linked positions; such sweeps take along with them linked sites which then become less variable. Such selection may be positive when a new linked favorable mutation arises and goes to fixation (31), or may be due to negative ‘‘background selection’’ against deleterious mutations (32); it is not clear which of these processes best fits the data, but their effects are similar. From the observation in D. melanogaster that highly biased genes tend to have higher synonymous polymorphism, the arrows at the bottom of Fig. 6 would seem not to have equal strength in their effects. 
The expected decrease in synonymous polymorphism caused by codon usage bias is not great enough to overcome the expected increase in such polymorphisms due to lessening effects of selective sweeps.
Conclusions
While the information available on codon usage bias of both microorganisms and Drosophila provides good evidence that selection can act on what had been considered prime candidates for neutral mutations, are all synonymous substitutions detected at all times? This is highly unlikely, and we argue elsewhere that there is likely a continuum in Drosophila (and other organisms) with codon usage in highly biased genes being primarily affected by selection, whereas other genes may have codon usage controlled primarily by mutation and drift along the lines of models previously proposed (33, 34). This is in agreement with Akashi's (35) observation that the selection for optimal codons in D. simulans has been more effective than in D. melanogaster. D. simulans is thought to have an effective population size greater than D. melanogaster, so the selection coefficients on synonymous mutations (at least on some genes) are sufficiently small as to be sensitive to population size differences among species of Drosophila. This implies the selection coefficients on synonymous mutations are such that N_e s is on the order of 1 (N_e is the effective population size and s is the selection coefficient), consistent with previous studies (35, 36). Further, we note that when selection is ineffective due to reduced recombination, codon usage may well reflect
mutation/drift (Table 2). Nevertheless, at times selection for codon usage must be very effective, as exemplified by the phylogenetic persistence of avoidance of specific codons in specific genes for very long evolutionary periods (Table 1).
We thank the organizers of this colloquium, Francisco J. Ayala and Walter M. Fitch, for the opportunity to commemorate Theodosius Dobzhansky and the publication of arguably the most important book on evolution in the 20th Century. We appreciate the helpful review provided by Charles F. Aquadro. This work was supported by National Science Foundation Grant DEB9318836.
1. King, J. L. & Jukes, T. H. (1969) Science 164, 788–798.
2. Richmond, R. (1970) Nature (London) 225, 1025–1028.
3. Shields, D. C., Sharp, P. M., Higgins, D. G. & Wright, F. (1988) Mol. Biol. Evol. 5, 704–716.
4. Wright, F. (1990) Gene 87, 23–39.
5. Ikemura, T. (1985) Mol. Biol. Evol. 2, 13–34.
6. Sharp, P. M. & Li, W.-H. (1987) Nucleic Acids Res. 15, 1281–1295.
7. Powell, J. R. & DeSalle, R. (1995) Evol. Biol. 28, 87–138.
8. Ibnsouda, S., Schweisguth, F., de Billy, G. & Vincent, A. (1993) Development (Cambridge, U.K.) 119, 471–483.
9. Moriyama, E. N. & Hartl, D. L. (1993) Genetics 134, 847–858.
10. Sharp, P. M. & Lloyd, A. T. (1993) in An Atlas of Drosophila Genes, ed. Moroni, G. (Oxford Univ. Press, New York), pp. 378–397.
11. Kliman, R. M. & Hey, J. (1993) Mol. Biol. Evol. 10, 1239–1258.
12. Hill, W. G. & Robertson, A. (1966) Genet. Res. 8, 269–294.
13. Bernardi, G., Olofsson, B., Filipski, J., Zerial, M., Salinas, J., Cuny, G., Meunier-Rotival, M. & Fodier, F. (1985) Science 228, 953–958.
14. Filipski, J. (1987) FEBS Lett. 217, 184–186.
15. Wolfe, K. H., Sharp, P. M. & Li, W.-H. (1989) Nature (London) 337, 283–285.
16. Bernardi, G. & Bernardi, G. (1986) J. Mol. Evol. 24, 1–11.
17. Aota, S. & Ikemura, T. (1986) Nucleic Acids Res. 14, 6345–6355.
18. Thiery, J.-P., Macaya, G. & Bernardi, G. (1976) J. Mol. Biol. 108, 219–235.
19. Kliman, R. M. & Hey, J. (1994) Genetics 137, 1049–1056.
20. Grosjean, H. & Fiers, W. (1982) Gene 18, 199–209.
21. Ikemura, T. (1992) in Transfer RNA in Protein Synthesis, eds. Hatfield, D. L., Lee, B. J. & Pirtle, R. M. (CRC, Boca Raton, FL), pp. 87–111.
22. White, B. N., Tener, G. M., Holden, J. & Suzuki, D. T. (1973) Dev. Biol. 33, 185–195.
23. Sprinzl, M., Steegborn, Bübel, F. & Steinber, S. (1996) Nucleic Acids Res. 24, 68–72.
24. Precup, J. & Parker, J. (1987) J. Biol. Chem. 262, 11351–11356.
25. Dix, D. B. & Thompson, R. C. (1989) Proc. Natl. Acad. Sci. USA 86, 6888–6892.
26. Akashi, H. (1994) Genetics 136, 927–935.
27. Sharp, P. M. & Li, W.-H. (1987) Mol. Biol. Evol. 4, 222–230.
28. Sharp, P. M. & Li, W.-H. (1989) J. Mol. Evol. 29, 398–402.
29. Moriyama, E. N. & Powell, J. R. (1996) Mol. Biol. Evol. 13, 261–277.
30. Begun, D. J. & Aquadro, C. F. (1993) Nature (London) 365, 548–550.
31. Hudson, R. R. (1994) Proc. Natl. Acad. Sci. USA 91, 6815–6818.
32. Charlesworth, B., Morgan, M. T. & Charlesworth, D. (1993) Genetics 134, 1289–1303.
33. Sharp, P. M. & Li, W.-H. (1986) J. Mol. Evol. 24, 28–38.
34. Bulmer, M. (1991) Genetics 129, 897–907.
35. Akashi, H. (1995) Genetics 139, 1067–1076.
36. Hartl, D. L., Moriyama, E. N. & Sawyer, S. A. (1994) Genetics 138, 227–234.
37. Moriyama, E. N. & Powell, J. R. J. Mol. Evol., in press.
Proc. Natl. Acad. Sci. USA Vol. 94, pp. 7791–7798, July 1997 Colloquium Paper
This paper was presented at a colloquium entitled ‘‘Genetics and the Origin of Species,’’ organized by Francisco J. Ayala (Co-chair) and Walter M. Fitch (Co-chair), held January 30–February 1, 1997, at the National Academy of Sciences Beckman Center in Irvine, CA.
The evolution of plant nuclear genes
(multigene family/alcohol dehydrogenase/chalcone synthase/rbcS/gene duplication)
MICHAEL T. CLEGG, MICHAEL P. CUMMINGS, AND MARY L. DURBIN
Department of Botany and Plant Sciences, University of California, Riverside, CA 92521
ABSTRACT We analyze the evolutionary dynamics of three of the best-studied plant nuclear multigene families. The data analyzed derive from the genes that encode the small subunit of ribulose-1,5-bisphosphate carboxylase (rbcS), the gene family that encodes the enzyme chalcone synthase (Chs), and the gene family that encodes alcohol dehydrogenases (Adh). In addition, we consider the limited evolutionary data available on plant transposable elements. New Chs and rbcS genes appear to be recruited at about 10 times the rate estimated for Adh genes, and this is correlated with a much smaller average gene family size for Adh genes. In addition, duplication and divergence in function appear to be relatively common for Chs genes in flowering plant evolution. Analyses of synonymous nucleotide substitution rates for Adh genes in monocots reject a linear relationship with clock time. Replacement substitution rates vary with time in a complex fashion, which suggests that adaptive evolution has played an important role in driving divergence following gene duplication events. Molecular population genetic studies of Adh and Chs genes reveal high levels of molecular diversity within species. These studies also reveal that inter- and intralocus recombination are important forces in the generation of allelic novelties. Moreover, illegitimate recombination events appear to be an important factor in transposable element loss in plants. When we consider the recruitment and loss of new gene copies, the generation of allelic diversity within plant species, and ectopic exchange among transposable elements, we conclude that recombination is a pervasive force at all levels of plant evolution.
Plant molecular evolution has been dominated by studies of the chloroplast genome (cpDNA). There are several reasons for this focus on a single organelle that by itself accounts for less than 0.1% of the genetic complement of plants. First, cpDNA is an abundant component of total cellular DNA, and this facilitated the early molecular characterization of the cpDNA genome. Second, cpDNA turned out to have a conservative rate of nucleotide substitution (1), and slow rates of molecular evolution are ideal for the study of plant phylogenetic relationships at or beyond the family level. Because plant relationships are most controversial at deeper levels of evolution, cpDNA data promised to provide an important new tool for the reconstruction of plant phylogenies (2, 3). This promise has been borne out by extensive use of cpDNA-encoded genes to study plant phylogeny (4, 5). As a consequence, the bulk of research effort in plant molecular evolution has focused on problems in molecular systematics. Despite a primary focus on molecular systematics, other important topics in cpDNA evolution have also been explored (6). For example, it is well established that cpDNA genes vary in rates of evolution among major plant lineages, violating the molecular clock hypothesis (7, 8). Studies of codon bias in chloroplast genes have uncovered a substantial bias of cpDNA genes favoring codons ending in A or T based on comparisons of liverwort, tobacco, and rice, which span the evolution of land plants (9). Recent studies also show a strong dependence on adjacent nucleotide site composition in transition/transversion rates in noncoding regions of cpDNA (10, 11). In addition, insertion/deletion mutations appear to be context-dependent because their rate of occurrence appears to increase in specific labile regions of both coding and noncoding DNA (9, 12, 13). Taken in combination, these and other studies reveal a complex mutational process that does not accord with simple models of molecular evolutionary change.
What about nuclear genes? Nuclear genes determine the vast range of phenotypes that are responsible for plant adaptation in nature, and yet knowledge of the molecular evolution of these genes is still at rudimentary stages. Our goal in this article is to discuss present knowledge of plant nuclear gene evolution. We do not attempt to be comprehensive; rather, we select examples of genes and gene families that seem to us to illustrate important issues in gene evolution. Accordingly, we will restrict our discussion to a gene family that encodes a chloroplast enzyme component (small subunit of the enzyme ribulose-1,5-bisphosphate carboxylase, rbcS), a gene family that encodes an important component of plant secondary metabolism (chalcone synthase, Chs), and a gene family that encodes a glycolytic enzyme (alcohol dehydrogenase, Adh), and we will discuss pertinent facts that relate to the evolution of plant transposable elements. It will be clear from the discussion that we lack detailed knowledge of the molecular evolution of plant nuclear genes. A comprehensive systematic sampling of plant nuclear genes has yet to be attempted, which makes it difficult to address many of the questions cited above for cpDNA evolution. However, one fact that does emerge from our consideration of plant nuclear gene evolution is the pervasive importance of recombinational processes at all levels of plant gene evolution.
Abbreviations: ADH, alcohol dehydrogenase; CHS, chalcone synthase; TE, transposable element; STS, stilbene synthase; LTR, long terminal repeat.
© 1997 by The National Academy of Sciences 0027-8424/97/947791-8$2.00/0 PNAS is available online at http://www.pnas.org.
Evolution of the rbcS Multigene Family
One interesting class of plant nuclear genes includes those that originally were components of the chloroplast genome but have been transferred to the nuclear genome and subsequently lost from the chloroplast genome. Within this class of genes are those encoding proteins involved in basic cellular processes, such as ribosomal proteins, and genes encoding proteins involved in photosynthesis. The best-studied among these transferred genes is rbcS, which encodes the small subunit of the enzyme ribulose-1,5-bisphosphate carboxylase. The enzyme is responsible for fixation of carbon in photosynthesis by catalyzing the condensation of carbon dioxide with the five-carbon sugar ribulose-1,5-bisphosphate to form two molecules of the three-carbon sugar 3-phosphoglycerate. The functional holoenzyme consists of eight identical, large subunits encoded by the chloroplast gene rbcL, and eight identical, small subunits encoded by the nuclear gene rbcS. Within cyanobacteria, rbcS and rbcL are adjacent and cotranscribed (14). However, sometime prior to the evolution of land plants rbcS was transferred to the nucleus, where it occurs in all lineages examined, including the green alga Chlamydomonas (15, 16). Within diploid Angiosperms characterized so far, the rbcS gene family consists of two to eight copies. These copies are distributed among one or more loci, often on several chromosomes, and individual gene copies are frequently arranged in a tandem array at a locus. The gene typically consists of coding sequence for up to 189 amino acids (aa) that is interrupted by one intron in monocots and two or three introns in dicots. The amino-terminal region of the translation product comprises a transit peptide necessary for targeting the polypeptide to the chloroplast stroma. The transit peptide is cleaved upon, or shortly following, arrival of the polypeptide into the chloroplast, yielding the mature protein, which typically is 120–123 aa in length. The sequence similarity is much higher for the mature protein than for the more variable transit peptide. Both 5′ and 3′ flanking sequences show very little sequence similarity within and between species. Sequence similarity among rbcS genes is hierarchical, with physically adjacent genes within a species showing the highest similarity, often identical in coding sequence, followed by genes at different loci within a species, and genes between species, which are the most diverged. There are no known functional differences associated with the very small differences in mature proteins within a species, and all gene products are considered to be functionally equivalent.
Although only a few species have been studied in detail, recombination appears to play an important role in the evolution of the rbcS gene family at several levels. The number of loci varies across Angiosperms in a pattern that implies both the loss and gain of loci (17). The data indicate expansions and contractions in number of gene copies, perhaps through slipped-strand mispairing, within tandem arrays. For example, locus 3 of petunia contains six copies of rbcS, whereas the homologous locus in tomato contains three (18). The overall view is that both loss and subsequent gain of gene copies, as well as homogenization of gene copies in situ via gene conversion, are important mechanisms that govern rbcS evolution within species. Interlocus gene-conversion events occur even for genes differing in number of introns (19). Thus, if we consider the evolutionary history of gene copies at different loci, we may observe first duplication followed by sequence divergence, and then recurrent returns to a state of complete identity owing to conversion events. This pattern of evolution causes all of the copies within a family or even a genus to be clustered together, consistent with the rbcS phylogeny depicted in Fig. 1.

Evolution of Genes in the Flavonoid Biosynthesis Pathway

Another class of well-studied nuclear genes that is important in plant adaptation and evolution comprises the genes of the flavonoid biosynthetic pathway. The pathway is shown in Fig. 2. The end product and side branches of the pathway lead to a general class of phenolic compounds known as flavonoids. Flavonoids have many functions in plants. The colored pigments localized in the vacuoles of flowers act as attractants to pollinators (23). Polymorphisms in flower color can influence pollinator behavior and, ultimately, genetic transmission (24). Flavonoids are also important in protection against UV light (25) and in defense against pathogens and insects (26, 27). In addition, flavonoids are important in induction of nodulation (28), in auxin transport (29), and in pollen function (30). The products of this secondary metabolic pathway enable the plant to better adapt to a stressful environment. The study of the evolution of these genes thus is essential to our understanding of the processes that determine adaptive evolution.
Proc. Natl. Acad. Sci. USA 94 (1997)

Evolution of the Flavonoid Biosynthetic Pathway and Associated Regulatory Genes. An important question is: ‘‘How do pathways composed of a series of sequential steps evolve to produce an end product?’’ Stafford (31) postulates that initially each step in plant secondary metabolism derived from an enzyme of primary metabolism and resulted in an intermediate product that was temporarily an end product that conferred some advantage to the plant. For example, chalcone synthase (CHS) is thought to share a common origin with fatty acid synthases of primary metabolism (32) (Fig. 2). The reaction catalyzed by CHS utilizes substrates from both the phenylpropanoid and malonyl CoA pathways of primary metabolism. Subsequent steps in the flavonoid pathway presumably ‘‘borrowed’’ hydroxylases, NADPH-reductases, and glutathione transferase from primary metabolism. As the flavonoid pathway expanded, each new intermediate resulted in some selective advantage such as defense from pathogens or herbivores and UV protection (31).

Another layer of complexity in the evolution of a pathway is the regulation of gene expression in terms of timing and tissue-specific expression. Fig. 3 shows the known regulatory genes for two dicot species in the genera Petunia and Antirrhinum (33, 34). The regulatory genes of the flavonoid pathway appear to regulate groups of genes as opposed to individual genes. This may act to organize the pathway into a biosynthetic complex or unit for more efficient functioning of the pathway. Whereas one might envision a single gene coming under new regulatory control by means of chromosomal rearrangement or a transposition event, it is hard to envision a mechanism by which several different genes might come under the control of the same regulatory gene. More evolutionary studies on the regulatory genes of flavonoid biosynthesis are needed to provide insight into the adaptive basis of this complex regulatory network.
Duplication and the Acquisition of New Functions. The role of duplication and differentiation in evolution is well illustrated by the genes that encode the first committed step in flavonoid biosynthesis. This step is initiated by the enzyme chalcone synthase, which catalyzes the condensation of three molecules of malonyl CoA and one molecule of p-coumaroyl CoA to produce the 15-carbon naringenin chalcone molecule, which is then further modified in a series of enzymatic steps leading to the colored anthocyanin end product (33). An example of such an event of duplication and differentiation is stilbene synthase (STS). Only a limited number of amino acid changes are required to convert CHS to STS (35). The resulting stilbene phytoalexins produced by the enzyme STS have antifungal properties that confer defense against plant pathogens (36). Phylogenetic analyses indicate that the recruitment of the stilbene synthase function from chalcone synthase has occurred independently several times in the course of land plant evolution (35). In addition, Durbin et al. (37) and Helariutta et al. (38) both report unusual CHS-like gene sequences that differ from both CHS and STS, suggesting that these enzymes are functionally divergent from both CHS and STS. In the morning glory genus (Ipomoea), several Chs genes are more closely related to the unusual ChsB gene of Petunia than to other Petunia Chs genes (37). It has already been speculated that the ChsB Petunia gene is in the process of either acquiring a new function or being inactivated (39). Within Ipomoea these Chs genes appear to have diverged into at least two distinct groups based on amino acid substitutions. In addition, the ratio of synonymous-to-replacement polymorphism is low (about 5.6:1 in Ipomoea) compared with other taxa (e.g., 10:1 in grasses, 16:1 in legumes, and 42:1 in solanaceous plants; ref. 40).

Molecular Population Genetics of Chs Genes. Huttley et al.
(41) sampled ChsA genes of the common morning glory (Ipomoea purpurea) from 18 lines that originated from broad geographic collections in Mexico and the southeastern United States. No nucleotide sequence diversity was detected at ChsA in the very limited sample of four lines from Georgia and North Carolina. The Mexican-derived materials had much higher levels of nucleotide sequence diversity with 11 distinct haplotypes. Further examination of the ChsA sequences reveals that the majority of variation resides in exons, whereas flanking sequences and introns show very little polymorphism. Finally, a comparison between the ChsA allele polymorphic sites indicates that at many of these sites one of the nucleotide states is present in all non-ChsA gene family members. These observations strongly suggest that both the high level of nucleotide diversity present in the I. purpurea ChsA genealogy and the relatively low ratio of synonymous-to-replacement substitutions between ChsA alleles are probably derived from low to moderate rates of interlocus recombination/gene conversion among the different Chs gene family members.

To summarize, the Chs genes in Ipomoea have duplicated and diverged in function and they have diverged in specific expression patterns during Ipomoea evolution (ref. 37; unpublished data). In addition, the genes are quite variable both within and between species, but, unlike most cases analyzed, the variation is enhanced in coding regions rather than in noncoding regions (5′ and 3′ flanking regions and introns). The Ipomoea Chs genes also appear to be evolving at a very rapid rate (41). All the evidence indicates that some of the Chs genes have been subject to adaptive evolution, but the specific phenotypic effects associated with Chs evolution remain obscure.

FIG. 1. A neighbor-joining tree depicting the relationships of the mature gene products of rbcS. The data were taken from GenBank subject to the following restrictions. (i) Only sequences that differed by 5% or more in primary nucleotide sequence were incorporated in the analysis to avoid the inclusion of allelic sequences. (ii) Only sequences that represented a minimum of 50% of the gene were included to avoid biases associated with very short sequences. Amino acid sequences were aligned and the neighbor-joining tree (20) was constructed based on corrected distances (21) using the program CLUSTAL W (22).

FIG. 2. Anthocyanin and flavonol biosynthetic pathway. The enzymes in bold represent the core genes of flavonoid biosynthesis. PAL, phenylalanine ammonia-lyase; C4H, cinnamate 4-hydroxylase; 4CL, 4-coumarate-coenzyme A ligase; CHS, chalcone synthase; CHI, chalcone isomerase; F3H, flavanone 3-hydroxylase; F3′H, flavonoid 3′-hydroxylase; F3′5′H, flavonoid 3′,5′-hydroxylase; DFR, dihydroflavonol 4-reductase; AS, anthocyanidin synthase; UF3GT, UDP-glucose flavonoid 3-oxy-glucosyl transferase; RT, rhamnosyl transferase. Also shown within the box associated with enzyme designations are the gene superfamilies from which particular enzymes are thought to be derived.

FIG. 3. Regulation of the anthocyanin pathway in two dicot species. (Asterisks denote genes for which clones have been made from Ipomoea.) The regulatory genes that have been identified in Antirrhinum and Petunia are shown along with the genes that they regulate. Abbreviations are listed in the legend for Fig. 2.

The Alcohol Dehydrogenase Gene Family

Alcohol dehydrogenase (Adh) genes encode glycolytic enzymes that have been characterized at the molecular level in a wide range of flowering plant species and in one conifer species. Alcohol dehydrogenase (ADH) is an essential enzyme in anaerobic metabolism (42, 43). Transcription from Adh promoters increases under oxygen stress as well as in response to cold stress in both maize and Arabidopsis and to dehydration in Arabidopsis (43). Two or three isozymes are observed in all flowering plant species (44), with the exception of Arabidopsis, which appears to have a single Adh locus (45).

Duplication and Divergence Among Adh Gene Family Members. Unlike the previous examples of multigene families that we have considered, the Adh genes in flowering plants are encoded by small multigene families that generally appear to have approximately three duplicate members (46). Isozyme surveys covering an array of dicot and monocot species have revealed that most glycolytic enzymes have two forms in all species (44), probably reflecting a small, and stable, number of loci. The narrow range of gene family size for glycolytic enzymes suggests that additional constraints may also act to determine copy number for this important class of genes. Fig. 4 suggests a slow flux of gene duplication and loss that leads to an approximate dynamic equilibrium in copy number.

Molecular Clocks for Adh Genes. The molecular clock hypothesis is one of the fundamental ideas of molecular evolution. The strict molecular clock hypothesis posits a linear rate of accumulation of nucleotide substitution over time (21). Most careful investigations have rejected the strict molecular clock, but a modification known as the generation–time–effect hypothesis, which posits constant rates of nucleotide substitution when time is measured in generation intervals rather than clock time, can also be examined (21, 47). The difficulty with a generation–time-based clock is that it has little practical utility because generation times vary widely both among major lineages and over evolutionary time within lineages.
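The contrast between a strict clock and a generation-time clock can be made concrete with a small numerical sketch. The code below is only an illustration: the function names, the textbook relation K = 2rT for two lineages that split T years ago, and all numerical values are assumptions introduced here and are not taken from the studies cited.

```python
def strict_clock_divergence(rate_per_site_per_year, years_since_split):
    """Expected substitutions per site separating two lineages.

    Under a strict clock each lineage accumulates changes linearly in
    time, so two lineages that split T years ago differ by K = 2 * r * T.
    """
    return 2.0 * rate_per_site_per_year * years_since_split


def per_year_rate(rate_per_site_per_generation, generation_time_years):
    """Convert a per-generation substitution rate into a per-year rate.

    Under the generation-time-effect hypothesis the per-generation rate is
    constant, so lineages with longer generations evolve more slowly per year.
    """
    if generation_time_years <= 0:
        raise ValueError("generation time must be positive")
    return rate_per_site_per_generation / generation_time_years


# Hypothetical values chosen only for illustration (not from the paper).
RATE_PER_GENERATION = 5.0e-9
annual_plant = per_year_rate(RATE_PER_GENERATION, generation_time_years=1.0)
long_lived_plant = per_year_rate(RATE_PER_GENERATION, generation_time_years=5.0)

# Ratio of per-year rates implied by a 5-fold difference in generation time.
print(annual_plant / long_lived_plant)
# Expected divergence after 50 million years for the annual lineage.
print(strict_clock_divergence(annual_plant, 50e6))
```

In this toy setting the two lineages share one per-generation rate, yet their per-year rates differ by the ratio of their generation times. That is why a generation-time clock is hard to apply when generation times are poorly known or have changed over evolutionary time.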
Owing to the extensive database for the chloroplast gene rbcL, it has been possible to thoroughly investigate the molecular clock hypothesis among seed plants for this gene (7, 8). These investigations reject a strict molecular clock, and, within monocots, the data suggest a correlation of synonymous substitution rate with minimum generation time (7). The extent of rate variation within monocots is substantial: rate estimates are approximately 5-fold greater for grasses than for palms. This very large rate contrast presented the opportunity to ask whether synonymous rates showed similar variation for nuclear genes. Because the Adh gene family is one of the most thoroughly studied plant gene families, it was natural to study both relative and absolute rates of nucleotide substitution for this gene family in grasses and palms. Two Adh loci, adh1 and adh2, have been sequenced for maize and rice as well as barley, from which a third locus, adh3, a recent duplication of adh2, has been isolated (citations in ref. 48). In addition, three Adh loci have been fully or partially sequenced from one or more of three palm genera (46, 48). Absolute synonymous rate estimates for the palm Adh loci are 2.6 × 10⁻⁹ in contrast to 7.0 × 10⁻⁹ for grass Adh loci. This difference is significant and indicates a deceleration of synonymous substitution rates in palms in parallel with the rbcL gene; however, the difference is closer to 3-fold rather than the 5-fold difference estimated for the rbcL gene.

Replacement rates are much more complex. There is strong rate variation between gene family members within the grass lineage, suggesting that positive Darwinian selection played a role in determining rates. Thus, for example, the adh2 locus of grasses appears to be evolving at a rate about 3-fold higher than the adh1 locus of grasses, in the time interval since these two genes duplicated from a common ancestral gene. In addition, the palm adhA locus has a replacement rate that is approximately equal to the grass adh1 locus (48), suggesting a much more rapid replacement rate in palms when time is measured in generations.

FIG. 4. Neighbor-joining tree depicting the relationships of ADH amino acid sequences. The data selection criteria and analytical methods are the same as those described for Fig. 1.

Molecular Population Genetics of Adh Genes. Gaut and Clegg (49, 50) have studied the evolution of adh1 within the genus Zea and within Pennisetum glaucum (pearl millet, a grass species in the same subfamily as Zea). Eight adh1 sequences were determined from a wide sample of Zea (including both inbred lines and three land race entries of Zea mays ssp. mays). The Zea sample also included the teosinte species Z. luxurians and Z. diploperennis. The sequence data spanned a 2.1-kb region of the gene between exon 3 and exon 10. Interestingly, the adh1 sequence data from Zea do not discriminate between the different Zea species in the sample. Indeed, the Z. luxurians and Z. diploperennis adh1 lineages were clustered within the range of Z. mays ssp. mays sequence variation. The adh1 data were subjected to a series of statistical tests to detect selection or interallelic recombination within the Zea samples and within 21 pearl millet adh1 lineages. Tests for interallelic recombination revealed that a minimum of two adh1 alleles were derived from interallelic recombination, and it appears that at least one allele of the 21 pearl millet alleles also has a history of interallelic recombination (49, 50). These results establish that interallelic recombination in plants, as is also the case in Drosophila, can be an important source of allelic novelty. Tests for selection did not reject the null hypothesis of neutrality, although these tests have limited power and a failure to reject the null hypothesis for such small samples is to be expected. What is interesting is that both the coalescence times and the effective population sizes estimated from the data appear to be large. A large effective population size is somewhat surprising in view of the recent history of strong selection for domestication in both species (51).

Rates of Duplication and Loss Among Gene Families
What have we learned from a comparison of evolutionary patterns in three different gene families? One question of particular interest is to estimate the rate of recruitment of new genes within gene families and within major plant lineages. Before addressing this question, it is important to first discuss the limitations of the present data. To begin with, the great majority of present data was not collected with the goal of addressing evolutionary questions. Consequently, the data are very unevenly distributed across plant taxa. Second, gene family numbers are almost always underestimated. This follows because most investigators are not interested in an exhaustive description of all members of a gene family, and even when that is the goal the vagaries of cloning or PCR make a complete description difficult to accomplish. Finally, there are two processes that can lead to two identical gene copies at different loci (bifurcation of a gene lineage). The first is the origin of a new locus through duplication, and the second is the conversion of genes at two preexisting loci into identical copies. These two classes of events are indistinguishable when the data are solely based on evolutionary comparisons. Accordingly, we count the number of gene loci that have descended from a particular lineage within plant families, but this count confounds duplications and complete conversions. To ensure that our count is conservative, we assume that two gene sequences represent separate loci if and only if they differ by more than 5% in primary nucleotide sequence. With these strong caveats, we have calculated the average number of independent lineages for the Adh, Chs, and rbcS gene families within each of four plant families (Table 1). These data suggest a more rapid rate of duplication (and perhaps recruitment of new function) for the Chs and rbcS gene families than for the Adh gene family. 
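The 5% criterion for treating two sequences as separate loci can be sketched in a few lines. This is a hedged illustration, not the authors' procedure: the paper states only the pairwise rule, so the uncorrected p-distance, the single-linkage grouping of sequences, the function names, and the toy sequences are all assumptions made here.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Alignment columns with a gap ('-') in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def count_lineages(aligned_seqs, threshold=0.05):
    """Count groups of sequences linked by distances <= threshold.

    Sequences differing by more than the threshold from every member of a
    group are counted as a separate locus (single linkage; an assumption).
    """
    parent = list(range(len(aligned_seqs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(aligned_seqs)):
        for j in range(i + 1, len(aligned_seqs)):
            if p_distance(aligned_seqs[i], aligned_seqs[j]) <= threshold:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(aligned_seqs))})


# Toy 40-bp alignment (hypothetical sequences, not real gene data).
ref = "ACGT" * 10
near = "T" + ref[1:]       # 1 difference in 40 sites (2.5%)
far = ref[:36] + "TGCA"    # 4 differences in 40 sites (10%)
print(count_lineages([ref, near, far]))
```

Here `ref` and `near` fall within 5% of each other and are grouped as one locus, whereas `far` exceeds the threshold against both and is counted separately. As the text notes, a count of this kind cannot distinguish a true duplication from a complete conversion event.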
Table 1. Numbers of new gene recruitments within plant families for the Adh, Chs, and rbcS gene families

                    Adh                  Chs                  rbcS
Family        Lineages  Copies    Lineages  Copies    Lineages  Copies
Poaceae           1        3          1        2          1        4
Asteraceae        1        1          2        4          2        5
Fabaceae          1        1          2        7          2        5
Solanaceae        2        2          2        7          1        5

The data also suggest that the appearance of new gene copies occurs infrequently at the family level. If we take the data at
face value, and assume times of 70 million years (MY) for the origin of the Poaceae, 56 MY for the Fabaceae, 40 MY for the origin of the Asteraceae, and 40 MY for the Solanaceae, we calculate new gene recruitment rates of 2.9 × 10⁻⁸ for Adh, 2.8 × 10⁻⁷ for Chs, and 2.7 × 10⁻⁷ for rbcS. This represents roughly a 10-fold range, with the highest rates observed for the Chs genes in the Fabaceae and the Solanaceae. The actual range of gene family size is one to three copies for Adh, roughly two to seven for Chs, and approximately two to eight for rbcS; so why do we not see higher levels of recruitment for the rbcS genes, which appear to have slightly greater gene family sizes on average? One important consideration is the rate of homogenization among family members through recombination/conversion. The rbcS genes, for instance, are often organized in adjacent arrays where recombinational processes would be facilitated, and we have presented direct evidence for interlocus recombination/conversion in Chs genes in Ipomoea (41). The actual rate of duplication may be considerably above the recruitment rates calculated here, because most new duplicates are unlikely to escape and establish an independent lineage. Instead, the fate of many new duplicates may be conversion back to the sequence of a preexisting gene copy. Other factors such as cosuppression may also act as a barrier to the establishment of new duplicate genes (52). There must be a strong pressure for divergence in function, or in expression patterns, before an escape is favored. It appears that the Chs genes satisfy these constraints, because divergence in function and expression pattern is frequently observed. Recurrent duplication of the rbcS genes may be favored, owing to a requirement for high rates of translation to match the synthesis of the chloroplast-encoded large subunit. Of course, these arguments beg the question of gene loss.
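The text does not spell out how the recruitment rates were obtained from Table 1. One reading that appears to reproduce the three quoted values is to take copies minus lineages as the number of recruitments in each plant family, divide by the assumed age of that family, and sum over the four families. The sketch below checks that reading. The formula, the per-year unit, the variable names, and the row and column assignment of the flattened Table 1 are assumptions made here, not statements from the paper.

```python
# Family ages in years, as assumed in the text.
FAMILY_AGE_YEARS = {
    "Poaceae": 70e6,
    "Asteraceae": 40e6,
    "Fabaceae": 56e6,
    "Solanaceae": 40e6,
}

# (lineages, copies) per plant family, as read from Table 1 (assumed layout).
TABLE_1 = {
    "Adh": {"Poaceae": (1, 3), "Asteraceae": (1, 1),
            "Fabaceae": (1, 1), "Solanaceae": (2, 2)},
    "Chs": {"Poaceae": (1, 2), "Asteraceae": (2, 4),
            "Fabaceae": (2, 7), "Solanaceae": (2, 7)},
    "rbcS": {"Poaceae": (1, 4), "Asteraceae": (2, 5),
             "Fabaceae": (2, 5), "Solanaceae": (1, 5)},
}


def recruitment_rate(gene_family):
    """Sum over plant families of (copies - lineages) / family age."""
    total = 0.0
    for family, (lineages, copies) in TABLE_1[gene_family].items():
        total += (copies - lineages) / FAMILY_AGE_YEARS[family]
    return total


for gene_family in TABLE_1:
    print(f"{gene_family}: {recruitment_rate(gene_family):.1e}")
```

Rounded to two significant figures, this gives 2.9e-08 for Adh, 2.8e-07 for Chs, and 2.7e-07 for rbcS, matching the values in the text. The agreement supports this reading of Table 1 but does not establish that it is the calculation the authors performed.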
We must assume a rough equilibrium between the recruitment of new gene copies and their loss because we observe a rough stability in copy number for each family. So what drives gene loss? Again, it seems likely that recombinational processes are a major factor. We will discuss evidence below that implicates ectopic exchange in transposable element loss, and the same processes are likely also to lead to the production of pseudogenes within gene families that quickly diverge (in evolutionary time) to unrecognizable sequences.

Evolution of Plant Transposable Elements

Transposable elements (TEs) are a heterogeneous assemblage of discrete DNA sequences that are capable of autonomous or semiautonomous movement from one genomic location to another. There have been very few systematic studies of TE evolution within plant species. Most current knowledge relates to element classification, modes of excision/replication, and genome abundance of various element classes.

Element Classification. Transposable elements in plants, as in other organisms, fall into two broad categories, class I and class II. Class I transposable elements, often referred to as retrotransposons, are related to retroviruses but differ from them in that they do not form virion particles, which means they are not intrinsically transmissible between cells. As with all retroelements, the replication of class I elements involves reverse transcription, which is the synthesis of DNA from template RNA. Most of the DNA sequence of these elements encodes protein-coding sequences, including the enzyme reverse transcriptase, which catalyzes reverse transcription, and cis-acting sequences required for replication. Retrotransposons exhibit replicative transposition, i.e., transposition does not require excision of an element; rather, a new additional copy of the element is formed.
These elements do not exhibit a precise excision process but rather appear to be lost through recombination between long terminal repeat (LTR) sequences or deletion events. Many retrotransposons exhibit a pattern of targeted integration whereby element insertions are much more frequent in chromosomal regions away from genes, thus decreasing the relative frequency of transposon-induced mutations (53, 54). Targeted integration may help explain both the very large population sizes and apparently low number of mutant phenotypes associated with class I elements.

Class I elements are divided into several broad categories based on arrangement of protein-coding domains and the presence or absence of LTRs. The non-LTR retrotransposons have been more extensively studied in mammals (the L1 or LINE elements) but do occur in plants, for example, Cin4 of maize (55) and del2 of lily (56). Retrotransposons with LTRs are usually referred to by the names given homologous elements first found in Saccharomyces cerevisiae and Drosophila melanogaster and are present in all plants examined (57, 58). Unrivaled in terms of numbers, LTR retrotransposons compose the largest class of transposable elements in plant genomes. The Ty1/copia elements constitute as much as 50% or more of the maize genome (53), and sizable fractions of the lily genome are composed of Ty3/gypsy elements (59). Within plants these elements have been most thoroughly studied in Arabidopsis thaliana (60–63). With low copy numbers in Arabidopsis and extremely high copy numbers in maize and lily, the abundance of retrotransposons singularly explains most of the observed variation in plant genome size.

There is a broad range of class II transposable elements (also referred to as short inverted repeat elements after a common characteristic of the group). Unlike class I elements, transposition of class II elements is directly from DNA to DNA (i.e., does not involve an RNA transposition intermediate), and replication is conservative or nonreplicative, meaning that transposition is coupled with excision.
However, these elements may increase in copy number when an element-containing DNA strand is used as a template in double-strand gap repair of an empty target site (64). Several structural features typically characterize these elements, including a single open reading frame enclosed by inverted repeat sequences and sometimes containing introns, and often short, target-site duplications that are created upon integration. The protein product of the open reading frame is usually referred to as a transposase, although little is known about its functional attributes. Some members of this class consist of autonomous and nonautonomous elements within the same genome, for example, the Ac/Ds system of maize. The nonautonomous elements are capable of transposition in the presence of autonomous elements through trans activation. Several groups of class II elements can be distinguished based on their amino acid sequences, for example, the Ac/hobo group, which includes the Ac/Ds system of maize and relatives (Zea spp.); the Tam3 of snapdragon (Antirrhinum majus); and hobo of Drosophila. Other examples of class II elements in plants include elements not yet recognized as members of widespread groups, such as Spm and Mu of maize.

Evolutionary Studies. As noted above, there are very few studies whose objective is to describe patterns of plant TE evolution. Among the rudimentary studies that do exist are several of Ac/Ds element evolution in the grass family (65). These data suggest that Ac elements have been in the genome of grasses since the origin of the family, approximately 70 million years ago. They also suggest that ectopic exchange plays a major role in generating nonfunctional elements through illegitimate recombination. Owing to the complete absence of any comprehensive systematic study of plant TE distributions, there is no compelling evidence for or against horizontal transmission of plant transposable elements.
However, factors such as large population sizes, high mutation rates, and frequent recombination make it very difficult to establish compelling evidence for horizontal transmission (66).

Unanswered Questions. We can only speculate about the selective forces responsible for structuring transposable element populations within a genome. Clearly, factors such as negative selection associated with deleterious mutations and metabolic costs resulting from replication of increased genetic material have to be rethought given the very large portion of the genome that is composed of transposable elements. Possible positive selective forces that might influence transposable element dynamics also need to be considered, including the role of transposable elements in recombination and chromosome mechanics. The dynamics of plant transposable element populations have yet to be clearly described. What proportion of the transposable element population is transpositionally active? For example, are most elements capable of transposition, or are there relatively few active elements that are the source for most transposition events, and what is the role, if any, of DNA methylation in regulating transposition? Following the previous points, how can models of transposable element evolution be improved? Most of the previous models are based on class II elements in Drosophila and Escherichia coli and are on the whole very inadequate for capturing principal features of class I elements, especially in plants. Given the evolutionarily and genetically heterogeneous nature of transposable elements, more specific models will have to be developed if congruence with empirical observation is to be improved. Current data do suggest that recombinational processes play a major role in TE replication and loss, so we may conclude that recombinational processes play a pervasive role in the fate of TEs just as recombination plays a major role in the evolution of plant gene families.

Conclusions

The elemental processes that govern plant gene evolution involve nucleotide substitution, the insertion or deletion of strings of nucleotides, and recombination/conversion between gene copies. As a consequence of these processes, we observe increases and decreases in copy number, divergence in function, and divergence in expression patterns.
Our study of plant nuclear gene evolution suggests that these processes are both necessary and sufficient to account for observed patterns of gene evolution. Nevertheless, many questions emerge from these data. In the case of the rbcS gene family, the original genes trace to a prokaryotic ancestor. We must assume that some kind of recombinational process facilitated the incorporation of the original plastid genes into the nuclear genome. Subsequent processes led to the duplication and elaboration of the rbcS gene family, but why are the rbcS genes constrained to an approximate upper bound of 10? Why not have many more copies to match the plastid-based synthesis of the large subunit polypeptide? There is good evidence that occasional gene conversion acts to homogenize rbcS gene family members. What other processes lead to the loss of gene copies? We can speculate that illegitimate recombination occasionally leads to pseudogenes that rapidly decay owing to nucleotide substitution. Is this speculation correct? Unfortunately, present data are too sparse to allow us to measure gene-loss rates to ask whether an approximate equilibrium exists between the loss and gain of copy number.

The Chs gene family arose through the evolution of a novel function, which then precipitated the gradual elaboration of a new biosynthetic pathway. There is good evidence for repeated functional divergence of Chs genes based on patterns of amino acid substitution within flowering plants. Because the Chs gene copy number varies within reasonably narrow limits, we must assume that there are controls on copy number; so what determines the number of Chs gene copies in a typical plant genome? Why is this enzyme so plastic and so easily adapted to new uses? In contrast, alcohol dehydrogenase genes presumably have retained a unitary function and show less evolutionary elaboration through time, but why the relatively narrow limits on family size in flowering plants?
Why is there a low rate of recruitment of new copies? Are low copy numbers characteristic of most or all glycolytic enzymes as speculated by
Morton et al. (46), and, if so, what determines the optimum number of copies?
The data we have reviewed also shed new light on several important questions about plant gene evolution and about organismic evolution. For instance, we have learned that inter- and intra-allelic recombination are important processes in generating allelic novelties. We have rejected the strict molecular clock hypothesis, and we have obtained crude estimates for the rates of recruitment of new gene copies for several important gene families. Finally, we have learned that historical species’ effective population sizes are large for crop plants like maize and pearl millet.
What else have we learned from our brief survey of the ecology of plant genomes that might have surprised Dobzhansky and caused him to modify his views of evolutionary processes? The elemental processes of genetic change, enumerated above, were appreciated by Dobzhansky and his contemporaries, and there appears to be no need to invoke new processes of genetic change. But genes and genomes have been revealed to be much more complex in their organization than suspected during Dobzhansky’s life. Introns were discovered in 1975, the year of Dobzhansky’s death, and molecular proof of transposable elements was also obtained in the mid-1970s. Genetic change now appears to occur at several levels. One level is the nucleotide and its associated influence on protein structure or on DNA-binding signals. A second level is the gene, because new copies may be recruited and elaborated through time. A third level is associated with the activity of mobile elements that infect and mutate genomes. Recombinational processes act at each of these levels to convert sequence information among loci, to disrupt transposon and other duplicate gene sequence continuity, to generate allelic diversity, and to recruit new gene copies.
Perhaps Dobzhansky would have accorded a greater significance to the role of recombination in evolution were he writing today.
This work was supported in part by a grant from the Alfred P. Sloan Foundation.
1. Curtis, S. E. & Clegg, M. T. (1984) Mol. Biol. Evol. 1, 291–301.
2. Ritland, K. & Clegg, M. T. (1987) Am. Nat. 130, S74–S100.
3. Clegg, M. T. (1993) Proc. Natl. Acad. Sci. USA 90, 363–367.
4. Chase, M. W., Soltis, D. E., Olmsted, R. G., Morgan, D., Les, D. H., et al. (1993) Ann. Missouri Bot. Gard. 80, 528–580.
5. Soltis, D. E. & Soltis, P. S. (1995) Evol. Biol. 28, 139–194.
6. Clegg, M. T., Gaut, B. S., Learn, G. H., Jr. & Morton, B. R. (1994) Proc. Natl. Acad. Sci. USA 91, 6795–6801.
7. Gaut, B. S., Muse, S. V., Clark, W. D. & Clegg, M. T. (1992) J. Mol. Evol. 35, 292–303.
8. Bousquet, J., Strauss, S. H., Doerksen, A. H. & Price, R. A. (1992) Proc. Natl. Acad. Sci. USA 89, 7844–7848.
9. Morton, B. R. & Clegg, M. T. (1993) Curr. Genet. 24, 357–365.
10. Morton, B. R. (1995) Proc. Natl. Acad. Sci. USA 92, 9717–9721.
11. Morton, B. R. & Clegg, M. T. (1995) J. Mol. Evol. 41, 597–603.
12. Golenberg, E. M., Clegg, M. T., Durbin, M. L., Doebley, J. & Ma, D. P. (1993) Mol. Phyl. Evol. 2, 52–64.
13. Cummings, M. P., King, L. M. & Kellogg, E. A. (1994) Mol. Biol. Evol. 11, 1–8.
14. Nierzwicki-Bauer, S. A., Curtis, S. E. & Haselkorn, R. (1984) Proc. Natl. Acad. Sci. USA 81, 5961–5965.
15. Goldschmidt-Clermont, M. & Rahire, M. (1986) J. Mol. Biol. 191, 421–432.
16. Simard, C., Lemieux, C. & Bellmare, G. (1988) Curr. Genet. 14, 461–470.
17. Manzara, T. & Gruissem, W. (1988) Photosynth. Res. 16, 117–139.
18. Dean, C., Pichersky, E. & Dunsmuir, P. (1989) Annu. Rev. Plant Physiol. Plant Mol. Biol. 40, 415–439.
19. Meagher, R. B., Berry-Lowe, S. & Rice, K. (1989) Genetics 123, 845–863.
20. Saitou, N. & Nei, M. (1987) Mol. Biol. Evol. 4, 406–425.
21. Kimura, M. (1983) The Neutral Theory of Molecular Evolution (Cambridge Univ. Press, Cambridge, U.K.).
22. Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994) Nucleic Acids Res. 22, 4673–4680.
23. Brouillard, R. (1988) in The Flavonoids, ed. Harborne, J. B. (Chapman & Hall, London), pp. 525–538.
24. Clegg, M. T. & Epperson, B. K. (1988) in The Plant Evolutionary Biology, eds. Gottlieb, L. & Jain, S. K. (Chapman & Hall, London), pp. 255–273.
25. Schmelzer, E., Jahnen, W. & Hahlbrock, K. (1988) Proc. Natl. Acad. Sci. USA 85, 2989–2993.
26. Dixon, R. A., Dey, P. M. & Lamb, C. J. (1983) Adv. Enzymol. 55, 1–136.
27. Lamb, C. J., Lawton, M. A., Dron, M. & Dixon, R. A. (1989) Cell 56, 215–224.
28. Long, S. (1989) Cell 56, 203–214.
29. Jacobs, M. & Rubery, P. H. (1988) Science 241, 346–349.
30. Taylor, L. P. & Jorgensen, R. (1992) J. Hered. 83, 11–17.
31. Stafford, H. A. (1991) Plant Physiol. 96, 680–685.
32. Verwoert, I., Verbree, E. C., Vanderlinden, K. H., Nijkamp, H. J. & Stuitje, A. R. (1992) J. Bacteriol. 174, 2851–2857.
33. Forkmann, G. (1993) in The Flavonoids: Advances in Research Since 1986, ed. Harborne, J. B. (Chapman & Hall, London), pp. 537–564.
34. Moyano, E., Martinezgarcia, J. F. & Martin, C. (1996) Plant Cell 8, 1519–1532.
35. Tropf, S., Lanz, T., Rensing, S. A., Schroder, J. & Schroder, F. (1994) J. Mol. Evol. 38, 610–618.
36. Schroder, J., Schanz, S., Tropf, S., Karfcher, B. & Schroder, G. (1993) in Mechanisms of Plant Defense Responses, eds. Fritig, B. & Legrand, M. (Kluwer, Dordrecht, The Netherlands), pp. 257–267.
37. Durbin, M. L., Learn, G. H., Huttley, G. A. & Clegg, M. T. (1995) Proc. Natl. Acad. Sci. USA 92, 3338–3342.
38. Helariutta, Y., Kotilainen, M., Elomaa, P., Kalkkinen, N., Bremer, K., Teeri, T. H. & Albert, V. A. (1996) Proc. Natl. Acad. Sci. USA 93, 9033–9038.
39. Koes, R. R., Spelt, C. E., van den Elzen, P. J. M. & Mol, J. N. M. (1989) Gene 81, 245–257.
40. Glover, D., Durbin, M. L., Huttley, G. & Clegg, M. T. (1996) Plant Species Biol. 11, 41–50.
41. Huttley, G. A., Durbin, M. L., Glover, D. E. & Clegg, M. T. (1997) Mol. Ecol., in press.
42. Freeling, M. & Bennett, D. C. (1985) Annu. Rev. Genet. 19, 297–323.
43. Dolferus, R., deBruxelles, G., Dennis, E. S. & Peacock, W. J. (1994) Ann. Bot. (London) 74, 301–308.
44. Gottlieb, L. D. (1982) Science 216, 373–380.
45. Chang, C. & Meyerowitz, E. M. (1986) Proc. Natl. Acad. Sci. USA 83, 1408–1412.
46. Morton, B. R., Gaut, B. S. & Clegg, M. T. (1996) Proc. Natl. Acad. Sci. USA 93, 11735–11739.
47. Li, W.-H. (1994) Curr. Opin. Genet. Dev. 3, 896–901.
48. Gaut, B. S., Morton, B. R., McCaig, B. & Clegg, M. T. (1996) Proc. Natl. Acad. Sci. USA 93, 10274–10279.
49. Gaut, B. S. & Clegg, M. T. (1993) Proc. Natl. Acad. Sci. USA 90, 5095–5099.
50. Gaut, B. S. & Clegg, M. T. (1993) Genetics 135, 1091–1097.
51. Clegg, M. T. (1997) J. Hered. 88, 1–7.
52. Jorgensen, R. A. (1995) Science 268, 686–691.
53. SanMiguel, P., Tikhonov, A., Jin, Y.-K., Motchoulskaia, N., Zakharov, D., Melake-Berhan, A., Springer, P. S., Edwards, K. J., Lee, M., Avramova, Z. & Bennetzen, J. L. (1996) Science 274, 765–768.
54. Voytas, D. F. (1996) Science 274, 737–738.
55. Schwarz-Sommer, Z., Leclerq, L., Gobel, E. & Saedler, H. (1987) EMBO J. 6, 3873–3880.
56. Leeton, P. R. & Smyth, D. R. (1993) Mol. Gen. Genet. 237, 97–104.
57. Voytas, D. F., Cummings, M. P., Konieczny, A., Ausubel, F. M. & Rodermel, S. R. (1992) Proc. Natl. Acad. Sci. USA 89, 7124–7128.
58. Flavell, A. J., Dunnbar, E., Anderson, R., Pearce, S. R., Hartley, R. & Kumar, A. (1992) Nucleic Acids Res. 20, 3639–3644.
59. Joseph, J. L., Sentry, J. W. & Smyth, D. R. (1990) J. Mol. Evol. 30, 146–151.
60. Voytas, D. F. & Ausubel, F. M. (1989) Nature (London) 336, 242–244.
61. Voytas, D. F., Konieczny, A., Cummings, M. P. & Ausubel, F. M. (1990) Genetics 126, 713–721.
62. Konieczny, A., Voytas, D. F., Cummings, M. P. & Ausubel, F. M. (1991) Genetics 30, 801–809.
63. Wright, D. A., Ke, N., Smalle, J., Hauge, B. M., Goodman, H. M. & Voytas, D. F. (1996) Genetics 126, 569–578.
64. Engels, W. R., Johnson-Schlitz, D. M., Eggleston, W. B. & Sved, J. (1990) Cell 62, 515–525.
65. Huttley, G., MacRae, A. F. & Clegg, M. T. (1995) Genetics 139, 1411–1419.
66. Cummings, M. P. (1994) Trends Ecol. 9, 141–145.
Proc. Natl. Acad. Sci. USA Vol. 94, pp. 7799–7806, July 1997 Colloquium Paper
This paper was presented at a colloquium entitled ‘‘Genetics and the Origin of Species,’’ organized by Francisco J. Ayala (Co-chair) and Walter M. Fitch (Co-chair), held January 30–February 1, 1997, at the National Academy of Sciences Beckman Center in Irvine, CA.
Evolution by the birth-and-death process in multigene families of the vertebrate immune system
MASATOSHI NEI*, XUN GU, AND TATYANA SITNIKOVA
Institute of Molecular Evolutionary Genetics and Department of Biology, The Pennsylvania State University, 328 Mueller Laboratory, University Park, PA 16802
ABSTRACT Concerted evolution is often invoked to explain the diversity and evolution of the multigene families of major histocompatibility complex (MHC) genes and immunoglobulin (Ig) genes. However, this hypothesis has been controversial because the member genes of these families from the same species are not necessarily more closely related to one another than to the genes from different species. To resolve this controversy, we conducted phylogenetic analyses of several multigene families of the MHC and Ig systems. The results show that the evolutionary pattern of these families is quite different from that of concerted evolution but is in agreement with the birth-and-death model of evolution in which new genes are created by repeated gene duplication and some duplicate genes are maintained in the genome for a long time but others are deleted or become nonfunctional by deleterious mutations. We found little evidence that interlocus gene conversion plays an important role in the evolution of MHC and Ig multigene families. Multigene families whose member genes have the same function are generally believed to undergo concerted evolution that homogenizes the DNA sequences of the member genes (1–3). A good example of this type of gene family is the cluster of ribosomal RNA genes, where all the member genes (several hundred genes) have very similar DNA sequences within species even in nontranscribed spacer regions. For example, the member genes of this cluster in humans are more similar to one another than to most of the genes in chimpanzees (4). This high degree of sequence homogeneity within species is believed to have been achieved by frequent interlocus recombination or gene conversion (Fig. 1A). Concerted evolution was also invoked to explain the diversity and evolution of the multigene families of major histocompatibility complex (MHC) genes (5–7) and immunoglobulin (Ig) genes (8–11). 
In this case the genetic diversity (polymorphism) for a locus or a set of loci is assumed to be generated by introduction of new variants from different loci through interlocus recombination or gene conversion. However, we have questioned these claims of concerted evolution on the grounds that the member genes of MHC and Ig gene families from the same species are not necessarily more closely related to one another than to the genes from different species (12–16). For example, Ota and Nei (14) showed that the Ig heavy chain variable region (VH) genes in humans can be divided into three major groups, and these major groups are shared by mice, amphibians, and teleost fishes, yet gene duplication apparently occurs quite frequently and many duplicate genes subsequently become nonfunctional. Therefore, they concluded that the evolution of this gene family is characterized by the ‘‘birth-and-death model of evolution’’ (Fig. 1B). However, some authors (10, 17–20) still maintain that interlocus gene conversion or recombination plays an important role in the generation and maintenance of genetic diversity of MHC and Ig gene families.
FIG. 1. Two different models of evolution of multigene families. ○, Functional gene; ●, pseudogene.
Abbreviations: MHC, major histocompatibility complex; Ig, immunoglobulin; TCR, T cell receptor; β2m, β2-microglobulin; Myr, million years.
*To whom reprint requests should be addressed. e-mail: nxm2@psu.edu.
© 1997 by The National Academy of Sciences 0027-8424/97/947799-8$2.00/0 PNAS is available online at http://www.pnas.org.
To resolve this controversy, it is important to conduct a comprehensive study of the evolutionary patterns of immune system genes. Fortunately, the amount of DNA sequence data for immune system genes is rapidly increasing, mainly because of the genome projects that are under way in various organisms. We have therefore compiled DNA sequence data for immune system genes from GenBank and other sources and studied the general evolutionary pattern of these genes.
In vertebrates there are three major gene families that play an important role in identifying and removing invading foreign antigens or parasites (viruses, bacteria, and eukaryotic parasites). They are the MHC, T cell receptor (TCR), and Ig gene families. The function of MHC molecules is to distinguish between self and nonself antigens and present foreign peptides to TCRs of T lymphocytes (T cells), whereas the role of TCRs is to interact with peptide-bound MHC molecules and stimulate T cells to kill virus-infected cells or to give signals for B lymphocytes (B cells) to produce immunoglobulins. Immunoglobulins are primarily responsible for humoral immunity, removing foreign antigens or parasites circulating in the bloodstream. The gene families encoding MHC, TCR, and Ig molecules are all evolutionarily related and are composed of several separate gene clusters, each of which spans more than one megabase (Mb) of DNA in the human genome.
The purpose of this paper is to present major findings of our recent study on this subject. The immune system is one of the most complicated genetic systems in vertebrates, and detailed results will be published elsewhere. In this paper, we will be concerned primarily with the evolution of MHC and VH gene families.
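The contrast drawn in the introduction (genes clustering by species under concerted evolution, by locus under birth-and-death evolution) can be illustrated with a small numerical sketch. The function name, gene labels, and distance values below are hypothetical and chosen only for illustration; the comparison of mean distances is a crude stand-in for the tree-based analyses used in this paper, not the authors' method.

```python
from itertools import combinations
from statistics import mean

def within_between(distances, species):
    """Return (mean within-species distance, mean between-species distance).

    distances: dict mapping frozenset({gene_a, gene_b}) -> pairwise distance
    species:   dict mapping gene name -> species label
    """
    within, between = [], []
    for a, b in combinations(sorted(species), 2):
        d = distances[frozenset((a, b))]
        (within if species[a] == species[b] else between).append(d)
    return mean(within), mean(between)

# Hypothetical toy data: two loci (X, Y) sampled from two species.
species = {"human_X": "human", "human_Y": "human",
           "mouse_X": "mouse", "mouse_Y": "mouse"}

# Pattern expected under concerted evolution: genes cluster by species.
concerted = {
    frozenset(("human_X", "human_Y")): 0.02,
    frozenset(("mouse_X", "mouse_Y")): 0.03,
    frozenset(("human_X", "mouse_X")): 0.20,
    frozenset(("human_X", "mouse_Y")): 0.21,
    frozenset(("human_Y", "mouse_X")): 0.20,
    frozenset(("human_Y", "mouse_Y")): 0.22,
}

# Pattern expected under birth-and-death evolution: genes cluster by locus,
# so orthologs from different species are closer than paralogs within one.
birth_death = {
    frozenset(("human_X", "human_Y")): 0.30,
    frozenset(("mouse_X", "mouse_Y")): 0.31,
    frozenset(("human_X", "mouse_X")): 0.10,
    frozenset(("human_Y", "mouse_Y")): 0.12,
    frozenset(("human_X", "mouse_Y")): 0.30,
    frozenset(("human_Y", "mouse_X")): 0.32,
}

for label, table in (("concerted", concerted), ("birth-and-death", birth_death)):
    w, b = within_between(table, species)
    print(f"{label}: within = {w:.3f}, between = {b:.3f}")
```

In the first toy table the within-species mean is the smaller of the two, and in the second it is the larger. Real data would of course require a phylogenetic tree with statistical support rather than a comparison of two averages.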
EVOLUTION OF MHC GENES
MHC molecules in vertebrates can be divided into two groups: class I and class II molecules. The class I MHC molecule consists of an α chain and a β2-microglobulin (β2m). The α chain is encoded by a class I MHC gene, whereas β2m is produced by a gene that lies outside the MHC. The class I α chain has three extracellular domains (α1, α2, α3), a transmembrane portion, and a cytoplasmic tail. The α3 domain associates noncovalently with β2m. The class II MHC molecule consists of noncovalently associated α and β chains, which are encoded by class II α-chain (A) and β-chain (B) loci, respectively. Each chain is composed of two extracellular domains (designated as α1 and α2 in the α chain and β1 and β2 in the β chain), a transmembrane portion, and a cytoplasmic tail.
In humans class I and class II genes are located on chromosome 6 and form two separate clusters (Fig. 2). The class I MHC consists of three highly expressed and highly polymorphic loci, A, B, and C (classical class I or class Ia loci), and 25–50 nonclassical (class Ib) loci, including pseudogenes. Nondefective class Ib genes are usually monomorphic and expressed in limited tissues, and their function is not well understood. The definition of class Ib genes is therefore somewhat vague (23). However, the recently discovered class Ib gene HLA-HH seems to have some important function, because mutants of this gene apparently cause the genetic disease hemochromatosis (21). The human class II gene cluster contains six major gene regions: DP, DN, DM, DO, DQ, and DR. Each of the DP, DM, DQ, and DR regions consists of at least one α-chain and at least one β-chain functional gene. The α-chain and β-chain genes in the regions DP, DQ, etc., are designated as DPA1, DPA2, DPB1, DPB2, DQA1, etc. The class II gene cluster also includes many poorly expressed genes and pseudogenes.
FIG. 2. Simplified genomic organizations of the human and mouse MHC genes. Only relatively well characterized genes are presented here, and there are many other genes or pseudogenes in both organisms. The recently discovered class Ib gene HLA-HH (21) is located about 4 Mb away from gene F on the telomeric side. (The original authors used the gene symbol HLA-H, but we changed it to HLA-HH to avoid the confusion with an already established Ib locus with the same name.) The number of genes also varies with haplotype in both class I and class II regions. Open boxes refer to polymorphic or classical loci, whereas closed boxes stand for monomorphic or nonclassical loci. Class III genes or other genes unrelated to class I and class II genes are not shown. Class II genes A and B refer to class II α- and β-chain genes, respectively. The MHCs in humans and mice are often called HLA and H2, respectively. The gene maps in this figure are based on information from Trowsdale (22) and other sources.
The mouse MHC genes have also been studied extensively. The mouse class Ia genes are not orthologous with the human class Ia genes (24–26), and therefore different gene symbols are used for them (Fig. 2). Actually, most different orders of mammals seem to have nonorthologous class Ia genes. The number of class Ia genes in mammals is usually 1–3, but there are often a large number of class Ib genes. By contrast, class II genes from different orders of mammals usually have orthologous relationships, but the genes from birds and amphibians are not orthologous with the mammalian genes, with a few possible exceptions (27).
Polymorphism. The hallmark of MHC genes is the extremely high degree of polymorphism within loci, the extent of polymorphism being the highest among all vertebrate genetic loci (28). The mechanism of maintenance of this polymorphism has been debated for the last 30 years, and it still remains controversial (15, 19, 29). The hypotheses proposed to explain the polymorphism include those of maternal–fetal incompatibility, mating preference, overdominant selection, frequency-dependent selection due to minority advantage, and interlocus gene conversion. This problem has been discussed by Hughes and Nei several times (15, 16, 30, 31), and we are not going to repeat the discussion here. However, we would like to mention that in our view the simplest explanation is heterozygote advantage or overdominant selection. In this hypothesis, heterozygotes for a locus have a selective advantage over homozygotes, because they can cope with two different types of antigens, whereas the latter can deal with only one type of foreign antigen. Since there are several different functional MHC loci, heterozygotes for all these loci should have substantial selective advantage over homozygotes. Evidence supporting overdominant selection is also increasing (19, 31).
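How overdominant selection can hold several alleles in a population may be seen from a minimal deterministic sketch. It assumes symmetric overdominance (every heterozygote has fitness 1, every homozygote fitness 1 - s), random mating, and no mutation or drift; the selection coefficient and starting frequencies are hypothetical, and the model is a textbook simplification rather than the analysis of refs. 15 and 16.

```python
def next_generation(freqs, s):
    """One generation of selection under symmetric overdominance.

    Allele i meets itself with probability p_i (fitness 1 - s) and any
    other allele with probability 1 - p_i (fitness 1), so its marginal
    fitness is 1 - s * p_i.
    """
    marginal = [1.0 - s * p for p in freqs]
    mean_fitness = sum(p * w for p, w in zip(freqs, marginal))
    return [p * w / mean_fitness for p, w in zip(freqs, marginal)]

def iterate(freqs, s, generations):
    """Apply the selection recursion for the given number of generations."""
    for _ in range(generations):
        freqs = next_generation(freqs, s)
    return freqs

# Hypothetical starting point: one common allele and three rare ones.
start = [0.90, 0.07, 0.02, 0.01]
end = iterate(start, s=0.05, generations=5000)
print([round(p, 4) for p in end])
```

Under these assumptions rare alleles have the higher marginal fitness, so every allele is pushed toward the same frequency (1/k for k alleles) and none is lost. This is the sense in which heterozygote advantage maintains polymorphism once the alleles exist.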
In recent years a number of authors (19, 32) presented evidence that new alleles can be created by interallelic recombination at the B locus. However, interallelic recombination is powerless in producing new alleles unless there are abundant polymorphic alleles in the population, and the MHC polymorphism seems to be maintained primarily by point mutation and overdominant selection (15). Another interesting discovery in recent years is the relatively high degree of polymorphism at a class Ib locus, MICA (33). The function of this gene is unknown, but the average heterozygosity per nucleotide site (nucleotide diversity) for the three extracellular domains (exons 2, 3, and 4) is 0.011. Although this is lower than that for class Ia loci (0.04–0.08), it is considerably higher than that (0.0002–0.007) for other nuclear genes (15). The reason for this high degree of polymorphism is unclear, but it is possibly caused by a hitchhiking effect of overdominant selection operating at the B locus, which is closely linked with this locus. In the past, population geneticists have been primarily interested in the extent of polymorphism within loci. In the MHC, however, there is a substantial amount of polymorphism due to gene duplication, insertion, or deletion. For example, in the class II DRB region of humans there are at least five different haplotypes, and the number of genes per haplotype varies from 2 to 5 (22). Furthermore, there seem to be at least 9 distinct gene copies in this region, and only one gene (DRB1) is shared by all haplotypes. The mouse population is also known to have many different haplotypes, and the type and number of class I genes vary considerably with haplotype. For example, class Ia locus D is missing in most haplotypes, and the number of class Ib genes varies considerably with haplotype (34). Intralocus vs. Interlocus Variation Within Species. 
One way of studying the significance of interlocus recombination or gene conversion is to compare the interlocus and intralocus genetic variation within species. This can be done by constructing a phylogenetic tree for alleles from different loci. If there is any kind of interlocus genetic exchange, one would expect that alleles from each polymorphic locus do not necessarily form a monophyletic cluster in the phylogenetic tree. Fig. 3 shows the phylogenetic tree for different alleles from human MHC (HLA) class
I loci. The numbers of allelic sequences available for class Ia loci A, B, and C are now about 70, 150, and 40, respectively (19), but in this paper we used no more than 8 representative sequences (exons 2, 3, and 4 encoding domains α1, α2, and α3, respectively) from each locus, excluding partial sequences. It is clear that all alleles from the same locus form a single cluster and every allelic cluster is statistically supported. Essentially the same results were obtained when we used all available complete sequences (56 A alleles, 123 B alleles, and 34 C alleles) in the analysis. This indicates that, despite the high level of polymorphism of classical loci, interlocus variation is much greater than intralocus variation and there is no clear indication of interlocus genetic exchange. This conclusion is in agreement with that of Lawlor et al. (36). As mentioned earlier, nonclassical and pseudogene loci are generally monomorphic, and when polymorphic alleles exist, their nucleotide differences are usually very small (Fig. 3). The genes at these loci (genes E, F, G, J, 70, and 92 in Fig. 3) also tend to evolve faster than class Ia genes (A, B, and C) apparently because of less stringent functional constraint. Exceptions are the alleles of locus H, which are believed to be pseudogenes because they have deleterious mutations (37). The nucleotide diversity among these alleles is quite high (0.021), and the alleles have not evolved very fast. These features are unusual for pseudogenes or unimportant genes, but the reason is unclear at this moment. It is possible that this gene became a pseudogene relatively recently or that it has a new but unknown function, since a frameshift mutation that
FIG. 3. Phylogenetic tree of human MHC (HLA) class I genes and alleles. The phylogenetic trees presented in this paper are neighbor-joining trees obtained by the computer software MEGA (35). The tree in this figure was constructed by using Jukes–Cantor distances for the nucleotide sequences (822 bp) of exons 2, 3, and 4, which encode class I MHC domains α1, α2, and α3, respectively. Genes E, F, and G, designated by (b), are functional class Ib loci, whereas genes H, J, 92, and 70, designated by (c), are class Ib pseudogenes. Class Ib genes MICA and HH were not included because they are distantly related (see Fig. 4). Two mouse class Ia genes were used as outgroups. The numbers for interior branches refer to the bootstrap values for 500 replications. The bootstrap values less than 50% are not given. The scale at the bottom is in units of nucleotide substitutions per site.
occurred in this gene generates a stop codon near the end of the gene (37). It should also be noted that the class I gene region contains many other defective genes lacking some exons or parts of exons (38). These are clearly dead genes. Until recently, class Ib genes were thought to be relatively unimportant and possibly in the process of becoming pseudogenes (13, 25). However, some of them are apparently functional, as mentioned earlier, and Figs. 3 and 4 show that most class Ib loci in humans diverged earlier than class Ia loci and thus survived longer than class Ia loci. Therefore, it is possible that nondefective Ib genes play some important roles that are not identified at the present time. As mentioned earlier, class II genes are composed of class II A and class II B genes (loci). The A and B genes seem to have diverged about 500 million years (Myr) ago, nearly at the same time when class I and class II genes evolved (27). We have therefore constructed phylogenetic trees for human class II A and B genes separately. Fig. 5 shows the tree for class II B genes from three polymorphic loci (DPB1, DQB1, and DRB1), two poorly expressed loci (DOB and DQB2), and one pseudogene (DPB2). The alleles from each of the three polymorphic loci again form a single cluster with a bootstrap value of 99–100%, whereas poorly expressed genes have survived for an unexpectedly long time. This pattern of evolution is very similar to that of class I loci, though the evolutionary time involved in gene turnover is much longer in class II loci than in class I loci. Our phylogenetic analysis of class II A genes also showed a similar evolutionary pattern (data not shown).
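The trees in Figs. 3 and 5 are built from Jukes–Cantor distances, and the text quotes nucleotide diversity values for several loci. The sketch below shows how these quantities are conventionally computed from aligned sequences: the Jukes–Cantor distance is d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites, and nucleotide diversity is taken here as the unweighted mean of p over all pairs of sampled sequences. The function names and toy sequences are hypothetical, gap handling is simplified, and this is not the MEGA implementation.

```python
import math
from itertools import combinations

def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Sites with a gap ('-') in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)

def jukes_cantor(seq_a, seq_b):
    """Jukes-Cantor estimate of nucleotide substitutions per site."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def nucleotide_diversity(sequences):
    """Mean pairwise p-distance over all pairs of sampled sequences."""
    pairs = list(combinations(sequences, 2))
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)

# Hypothetical 10-bp alignment with two differing sites (p = 0.2).
allele_1 = "ACGTACGTAC"
allele_2 = "ACGTACGTTT"
print(p_distance(allele_1, allele_2))
print(jukes_cantor(allele_1, allele_2))
```

The correction matters little for closely related alleles, where d and p nearly coincide, but grows as sequences diverge because multiple substitutions at the same site become more likely.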
FIG. 4. Phylogenetic tree of MHC class I genes from various vertebrate species. This tree was constructed by using p-distances (39) for the amino acid sequences (274 residues) for domains α1, α2, and α3. Class Ib genes are denoted by (b) except for Xenopus genes NC4, NC7, and NC8. Other genes are primarily class Ia genes. Asterisks indicate cDNA data, where the identification of loci or alleles is ambiguous. In this paper we used common species or genus names to designate their MHC genes to make the paper understandable for nonspecialists. The wallaby is a marsupial species; Xenopus is Xenopus laevis.
DNA sequences for polymorphic alleles are also available for mice. Lawlor et al. (36) conducted a phylogenetic analysis of class I gene data, but there were some problems in their locus identification. Hughes (40) reanalyzed the data, rectifying the relationships between alleles and loci. This analysis indicated that interlocus genetic exchange is rare in mice as well, though it cannot be ruled out. In the case of class II loci, we conducted our own phylogenetic analysis, which showed that the alleles from the same locus always make a monophyletic cluster with a bootstrap value of 99% or higher (data not shown). Therefore, the pattern of genetic differentiation of alleles and loci appears to be similar to that of humans. In rats there seem to be two class Ia loci (RT1.A and RT1.Eu), but there are many Ib loci and some of them are apparently expressed (41). Rada et al. (7) reported a possible case of interlocus genetic exchange among class I loci, but Hughes’ (40) reanalysis of the data did not support their claim. At any rate, our phylogenetic analyses have shown that polymorphic alleles from the same locus almost always produce a monophyletic cluster. This finding does not rule out the possibility of gene conversion involving short gene segments (20) but indicates that the contribution of interlocus gene conversion or recombination to the entire genetic variation of MHC loci is small. This is in contrast to the apparent occurrence of interallelic gene conversion within loci, but our conclusion is not unreasonable because the likelihood of gene conversion is known to decline as the extent of sequence divergence increases (42).
Long-Term Evolution. To understand the evolutionary dynamics of MHC genes, we need to compare the genes from various organisms. It is already known that the class Ia loci from humans and mice are not orthologous; they were derived from different duplicate genes that originated before separation of the two species (13, 25). Fig.
4 shows a phylogenetic tree for class I genes (generally one allele from each locus whenever the locus is identifiable) from various organisms. As expected from previous studies (13, 25), distantly related organisms (e.g., humans, cats, mice, wallabies, and Xenopus laevis) have different sets of class Ia genes, indicating that the genes have differentiated by duplication
and deletion or dysfunctioning of the original genes. Even the class Ia loci from the New World monkey Saguinus oedipus (tamarin) are different from those of humans, though some nonclassical loci [e.g., tamarin 3 (So-3) and human F in Fig. 4] are orthologous and shared by the two species (43). However, closely related species such as humans, chimpanzees, and gorillas share the same loci A, B, and C. Gibbons and orangutans, which diverged from the human lineage about 17 and 13 Myr ago, respectively, share the loci A and B with humans but not C (43). Interestingly, orangutans appear to have gained a new duplicate A gene (A01) and a new B gene (B02) (43). If interlocus genetic exchange occurs frequently, one would expect that some genes (loci) from a species are more similar to one another than to those from different species. For example, one would expect that the genes from loci A, B, and C of humans are more similar to one another than to those of gorillas. In practice, this is not the case, and the genetic relationships of genes from loci A, B, and C are the same for humans and gorillas (Fig. 4). This indicates that for relatively closely related organisms different loci maintain their genetic identity despite the apparent occurrence of repeated gene duplication and that no or few interlocus genetic exchanges have occurred among different loci. However, the fact that different sets of class Ia genes exist in different orders of mammals indicates that some Ia genes are deleted or become nonfunctional after gene duplications. For this reason, Nei and Hughes (16) proposed that MHC genes are subject to the birth-and-death model of evolution. We also note that some Ib genes have survived in the genome for a long time. For example, the human MICA and HH genes apparently diverged from other human class I genes before the human and chicken lineages separated.
This suggests that these genes have survived in the human lineage at least for about 300 Myr. In fact, Southern blot analysis has suggested that the MICA gene exists in most mammalian orders (33). Human Ib genes E, F, and G seem to have survived longer than Ia genes A, B, and C, as was mentioned earlier. The Ib genes (NC4, NC7, and NC8) in Xenopus also apparently survived for a long time. This suggests that even class Ib genes are not subject to frequent interlocus genetic exchange.
The evolutionary pattern of class II genes is more or less the same as that of class I genes, though the time scale of gene turnover involved is much longer. Fig. 6 shows the tree for class II B genes, which indicates that distantly related organisms (zebrafish, Xenopus, chicken, and mammals) again have different sets of class II B genes. These genes therefore clearly experienced birth-and-death evolution. However, humans and rodents share the orthologous loci DP, DQ, DR, and DO. Similarly, loci DQ and DR are shared by humans, dogs, sheep, and cattle. The DP gene in the mouse is a pseudogene and is not shown in Fig. 6. Sharing of the DQ and DO genes by humans and rodents is understandable, because these two genes apparently diverged about 175 Myr ago (27) and humans and rodents apparently diverged about 100 Myr ago (44). The tree in Fig. 6 suggests that cattle and sheep gained a new locus, DIB/DYB, around the time of mammalian radiation. The phylogenetic position of marsupial (wallaby) genes DAB and DBB is unclear because of low bootstrap values for the interior branches surrounding this lineage, but they are apparently unique to marsupials (45). The sequence data for class II A genes are still scanty, but available data indicate that the evolution of these genes is also characterized by birth-and-death evolution (46).
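The birth-and-death process invoked throughout this section can be sketched as a simple stochastic simulation that tracks only gene counts. The function name and all rates below are hypothetical and are not estimates from the data; the sketch ignores sequence divergence, selection, and genealogy, and is meant only to show how duplication, inactivation, and deletion together produce turnover in a gene family.

```python
import random

def simulate_birth_death(n_start, steps, birth, death, decay, seed=0):
    """Track counts of functional genes and pseudogenes over time.

    Per time step, each functional gene duplicates with probability
    `birth` and is inactivated (becomes a pseudogene) with probability
    `death`; each existing pseudogene is deleted with probability `decay`.
    Returns a list of (functional, pseudogene) counts, one per time point.
    """
    rng = random.Random(seed)
    functional, pseudo = n_start, 0
    history = [(functional, pseudo)]
    for _ in range(steps):
        births = sum(rng.random() < birth for _ in range(functional))
        deaths = sum(rng.random() < death for _ in range(functional))
        lost = sum(rng.random() < decay for _ in range(pseudo))
        functional += births - deaths
        pseudo += deaths - lost
        history.append((functional, pseudo))
    return history

# Hypothetical rates: equal duplication and inactivation, slower deletion.
history = simulate_birth_death(n_start=3, steps=200,
                               birth=0.02, death=0.02, decay=0.01)
print(history[-1])
```

With equal duplication and inactivation rates the number of functional genes drifts without a trend, while pseudogenes accumulate until deletion balances their production. A fuller model would also follow which copy gave rise to which, so that species-specific sets of loci such as those in Figs. 4 and 6 could be compared.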
EVOLUTION OF IMMUNOGLOBULIN GENES
FIG. 5. Phylogenetic tree of human MHC (HLA) class II B genes and alleles. This tree was constructed by using Jukes–Cantor distances for the nucleotide sequences (564 bp) of exons 2 and 3 that encode the class II MHC β-chain domains β1 and β2, respectively. DMB is known to be distantly related to other B genes (Fig. 6).
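As a reminder of what the distance measures named in the figure legends compute, the following minimal sketch implements the p-distance (the uncorrected proportion of differing sites) and the standard Jukes–Cantor correction for nucleotide sequences, d = -(3/4) ln(1 - 4p/3). The function names and the treatment of gaps (sites with "-" in either sequence are skipped) are assumptions made for this illustration, not a description of the software used for the figures.

```python
import math


def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of compared sites that differ; sites with a gap are skipped."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_distance(seq1: str, seq2: str) -> float:
    """Jukes-Cantor estimate of nucleotide substitutions per site."""
    p = p_distance(seq1, seq2)
    if p >= 0.75:
        # The correction is undefined once p reaches its saturation value.
        raise ValueError("p >= 0.75: Jukes-Cantor distance is undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical 10-bp alignment with one difference (p = 0.1).
d = jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAT")
```

The corrected distance is always at least as large as p, because it accounts for multiple substitutions at the same site; for small p the two are nearly equal.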
Immunoglobulin or antibody molecules consist of two heavy chains and two light chains. Both the heavy and light chains are composed of the variable region (V) and constant region (C). The variable region is responsible for antigen binding and the constant region for effector function. The heavy-chain variable region is encoded by the variable-segment (VH), diversity-segment (DH), and joining-segment (JH) genes, but here we consider only VH genes because DH and JH genes are very short and are not useful for our purpose. There are two types of light-chain gene families: κ-chain and λ-chain gene families. Each of these gene families consists of the variable-segment (Vκ or Vλ) and joining-segment (Jκ or Jλ) gene families, but Jκ and Jλ genes are again unimportant for our purpose. In higher vertebrates, the heavy-chain, κ-chain, and λ-chain gene families are located on different chromosomes and form separate clusters. The human genome contains about 90 VH, 80 Vκ, and 50 Vλ genes, but the numbers of VH, Vκ, and Vλ genes vary with haplotype. The numbers of VH, Vκ, and Vλ genes also vary with species, and some organisms (e.g., chicken) have no Vκ genes and others (e.g., mouse) have only a few Vλ genes. In cartilaginous fishes such as sharks and skates the genomic organization of Ig genes is different from that of higher vertebrates, which may be described as (Vn-Dn-Jn-Cn). In these lower vertebrates, the basic unit of repeat of genes is (V-D-D-J-C). This unit of linked genes is repeated several hundred times in the genome, and the repeats are scattered on different chromosomes. Therefore, the Ig genes in these organisms may be represented by (V-D-D-J-C)n. In this paper we will be concerned primarily with the evolution of VH genes.
FIG. 6. Phylogenetic tree of MHC class II B genes from various vertebrate species. The tree was constructed by using p-distances for the amino acid sequences (188 residues) of domains β1 and β2. One allele from each locus was used as long as the locus was identifiable. In chicken and Xenopus the class II B loci have not been clearly identified, so that some sequences used may represent different alleles from the same locus. The human and mouse DM B genes, which are known to be distantly related to other class II B genes (31), were used as outgroups. The orthologous genes such as DQ, DP, etc., in different species are bracketed. M-rat, mole rat.
Diversity Within Species. The diversity of immunoglobulins is primarily caused by multiple copies of VH or VL genes. It has been shown that the rate of nonsynonymous nucleotide substitution is significantly higher than that of synonymous substitution at the complementarity-determining regions of VH and VL genes (47). This observation once suggested the possibility that VH and VL genes might also be subject to overdominant selection, but recent studies have shown that the extent of VH or VL gene polymorphism within loci is much lower than that of MHC genes (48–50). Therefore, it seems that a higher rate of nonsynonymous substitution than that of synonymous substitution in VH and VL genes is caused by directional selection rather than overdominant or balancing selection. A DNA region of 1,100 kb containing the entire human VH repertoire has now been sequenced (51, 52). It contains about 50 functional genes and about 40 pseudogenes. In addition, polymorphic alleles for a substantial number of VH loci have been sequenced. Fig. 7 shows a phylogenetic tree for 40 functional genes and 6 additional polymorphic alleles (3 alleles for each of 3 loci). (Unexpressed nondefective genes with open reading frames were not used.) It is clear that the sequence divergence of polymorphic alleles is very small compared with that of MHC loci. By contrast, the extent of sequence variation among different VH loci is much higher than that of class I or class II MHC genes. The human VH genes have been classified into seven different families according to sequence similarity (51, 52). Each of these seven families of genes forms a monophyletic cluster, though there are three single-sequence families (families 5, 6, and 7). Ota and Nei (14) and Schroeder et al. (55) noted that the VH genes in mammalian species can be classified into three groups (A, B, and C) or three clans (I, II, and III). The genes belonging to each of these groups
FIG. 7. Phylogenetic tree of 40 functional VH genes (loci) and 6 additional polymorphic alleles from humans. This tree was constructed by using Jukes–Cantor distances for the nucleotide sequences (228 bp) of framework regions 1, 2, and 3 (14). Loci 1–2, 3–30, and 4–31 are represented by three alleles. Gene and allelic notations are the same as those of Honjo and Matsuda (53) and the V BASE database (54), from which the sequence data were obtained. The human VH genes are classified into seven families, and the first number of each gene symbol designates the family number, whereas the second number refers to the order of the chromosomal location from the DH gene region.
in humans again form a monophyletic cluster with a high bootstrap value. As will be mentioned later, these groups of genes are shared by amphibians and fishes. Therefore, the three groups of genes have persisted in the human lineage for more than 400 Myr. This evolutionary time is much longer than that for the gene groups in MHC class II A or B clusters. Fig. 7 indicates that the present VH genes in the human genome were generated by repeated gene duplication. One may therefore wonder whether the VH gene repertoire was produced by repeated tandem gene duplication or by the combination of duplication, deletion, and translocation events. If tandem duplication is the major factor, one would expect that closely related genes in the phylogenetic tree are also located closely in the physical map of VH genes on the chromosome. In practice, this is not the case, and the genealogical relationships of the genes are almost independent of the chromosomal locations (Fig. 7). The same conclusion applies to the Vκ gene cluster (50) and somewhat weakly to the Vλ gene cluster as well (56). These results suggest that the VH, Vκ, and Vλ gene families in humans were formed by the combination of events of gene duplication, deletion, and translocation.
Haplotype Polymorphism. If duplication, deletion, and translocation are important factors of the evolutionary change of Ig genes, one would expect that haplotype polymorphism due to these events is observed in extant species. This is indeed the case, and the number of copies of VH or VL genes is known to vary with haplotype. For example, some human haplotypes have only one VH3–30 gene, but others have a group of five duplicate genes in the vicinity of this locus (53). A more dramatic example of gene duplication is observed in the human Vκ gene cluster (50). Human populations have two haplotypes, say H1 and H2. Haplotype H1 has 40 Vκ genes including pseudogenes, whereas H2 has 76 Vκ genes.
Sequencing of these genes has shown that H2 was produced by a block duplication of 36 Vκ genes from haplotype H1, and the sequence similarity between the two sets of duplicate genes is very high. This duplication apparently occurred after humans and chimpanzees diverged about 5 Myr ago, because chimpanzees, gorillas, and orangutans do not have this duplication. The frequency of haplotype H1 in human populations is very low and seems to be about 4% (50).
Long-Term Evolution. The pattern of long-term evolution of VH and VL genes has been studied by a number of authors (12, 14, 57, 58). These studies are primarily based on sequence data from humans and mice and some limited data from other organisms. The general conclusion obtained from these studies is that the vertebrate genome contains a diverse array of VH and VL genes and that they have been formed over hundreds of millions of years. However, recent studies of VH and VL sequences from chicken (59), rabbits (60), sheep (61), etc. suggest that this general conclusion does not apply to all organisms. Fig. 8 shows the phylogenetic tree for representative VH genes from 15 different vertebrate species. According to this tree, the VH genes can be grouped into five major clusters, A, B, C, D, and E (14). The human and mouse genomes contain only group A, B, and C genes (14, 55), but since these three groups of genes are shared by Xenopus, teleost fishes, etc. as well as by other mammals, they must have diverged about 400 Myr ago. This indicates that VH gene diversity in these organisms was generated by gene duplication and diversification during a long evolutionary time. The tree in Fig. 8, however, shows that all VH genes in some mammalian species are closely related and all sequences from a species belong to only one VH group (B or C). These species are cattle, sheep, pigs, and rabbits, all of which are domesticated animals. Particularly interesting are the cattle and sheep gene clusters, which belong to group B.
Since cattle and sheep belong to the same suborder (Ruminantia) and apparently diverged about 20 Myr ago, the divergence of the two clusters occurred quite rapidly. Another interesting observation is that pigs, which belong to the same order (Artiodactyla) as cattle, also have one cluster of closely related VH genes but this cluster belongs to group C genes rather than group B genes. At any rate, these organisms have a limited VH gene repertoire compared with that of humans and mice, but they have no problem in survival. Another organism which has an unusual set of VH genes is chicken. In this organism the gene cluster has only one functional gene (VH1) and about 100 VH pseudogenes, and VH diversity is generated through somatic gene conversion of the functional gene by pseudogenes (59). Interestingly, the VH1 gene and all the pseudogenes are closely related to one another and belong to group C (62). Furthermore, the Vλ gene cluster in chicken also shows essentially the same genetic property (63). A somewhat similar genetic system is observed with the rabbit VH genes (60). In this organism only one VH gene is usually expressed, and other VH genes or pseudogenes are used for somatic conversion of this expressible gene. At the present time, it is unclear how these types of genetic systems have evolved.
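The argument made earlier in this section about tandem duplication (closely related genes should sit close together on the chromosome if tandem duplication dominates) can be sketched as a comparison of pairwise sequence distance with pairwise distance in map order. The sketch below is only an illustration under stated assumptions: the gene names, map positions, and sequences are hypothetical, a plain Pearson correlation is used for simplicity, and because pairwise distances are not independent a formal analysis would need a permutation approach such as a Mantel test.

```python
import math
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of aligned sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def map_vs_genealogy(genes):
    """genes maps name -> (map_order, aligned_sequence).

    Returns the correlation between pairwise sequence distance and pairwise
    difference in map order. A value near 1 is what purely tandem
    duplication would tend to produce; a value near or below 0 indicates
    that relatedness is largely independent of chromosomal location.
    """
    seq_d, map_d = [], []
    for (pos1, s1), (pos2, s2) in combinations(genes.values(), 2):
        seq_d.append(p_distance(s1, s2))
        map_d.append(abs(pos1 - pos2))
    return pearson(seq_d, map_d)


# Hypothetical cluster in which neighbors are the closest relatives.
tandem_like = {
    "g1": (1, "AAAAAAAA"),
    "g2": (2, "AAAAAAAT"),
    "g3": (3, "AAAAAATT"),
    "g4": (4, "AAAAATTT"),
}
# Hypothetical cluster in which two gene families are interleaved on the map.
interleaved = {
    "x1": (1, "AAAAAAAA"),
    "y1": (2, "CCCCCCCC"),
    "x2": (3, "AAAAAAAT"),
    "y2": (4, "CCCCCCCG"),
}
r_tandem = map_vs_genealogy(tandem_like)
r_interleaved = map_vs_genealogy(interleaved)
```

In the first toy cluster sequence distance grows with map distance, whereas in the second the correlation is negative, which is closer to the situation reported for the human VH cluster.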
FIG. 8. Phylogenetic tree of 49 VH genes from various vertebrate species. This tree was constructed by using p-distances for the amino acid sequences of framework regions 1, 2, and 3. Most sequence data are from Ota and Nei (14), and additional sequences for cattle, sheep, pigs, and rabbits are from GenBank. The root of the tree was determined by using light chain genes as was done in ref. 14.
DISCUSSION
When the major aspects of genomic structure of MHC and Ig genes were first clarified in humans and mice, the evolutionary schemes of the two genetic systems appeared quite different. The diversity of MHC molecules is generated primarily by intralocus polymorphism, whereas the immunoglobulin diversity is generated by interlocus genetic variation barring somatic mutation or somatic gene conversion (64). However, as the genes from other organisms are studied, a number of common features have emerged between the two systems despite the difference in genomic organization and function. In both genetic systems the major force of evolution is the combination of events of gene duplication, deletion, and translocation as well as point mutation and selection. Dysfunctioning of genes due to deleterious mutation is also quite common in both systems. Furthermore, some organisms seem to have more homogeneous functional genes than other organisms. In the following, we discuss the general patterns of evolution of MHC and Ig genes in relation to concerted and birth-and-death evolution.
Concerted Evolution. Concerted evolution was originally proposed to explain a high degree of sequence similarity among member genes of a multigene family (1). The first mechanism considered for concerted evolution is interlocus recombination, which generates new duplicate genes and deletes some extant duplicate genes. If this process continues, duplicate genes in a multigene family tend to have similar nucleotide sequences even in the presence of mutation. Later, intergenic gene conversion was introduced as an additional mechanism for homogenizing the member genes of a multigene family (2, 65). Concerted evolution was also invoked to explain the diversity and evolution of Ig and MHC genes, as mentioned earlier. In this second version of concerted evolution, interlocus gene conversion or recombination is regarded as a mechanism of increasing genetic diversity (polymorphism) for a locus or a set of loci by introducing new variants from different loci. Therefore, all loci are assumed to be polymorphic, and the polymorphism at different loci evolves in unison as a result of concerted evolution (5, 6, 11). We have seen that in both MHC and Ig gene families repeated gene duplication occurs quite often and thus generates a set of closely related genes. At first sight, this appears to be consistent with concerted evolution that homogenizes the member genes. In practice, however, this gene duplication does not necessarily lead to deletion of preexisting heterogeneous genes and therefore does not contribute to homogenization of all member genes.
Rather, duplicate genes gradually diverge by mutation and selection, though some of them become nonfunctional by deleterious mutations or are deleted from the genome. In fact, both MHC and Ig gene families contain various member genes that have diverged during tens of millions or hundreds of millions of years. Even the human MHC class Ia loci A and B, which were generated by a recent gene duplication, seem to have a history of about 30 Myr (25). Mammalian group A, B, and C VH genes have a history of about 400 Myr (14), and the extent of nucleotide differences among them is very high. We have seen that the VH genes in cattle, sheep, pigs, and chickens diverged relatively recently, but even these genes seem to have a history of 20–40 Myr. This conclusion is essentially the same as that of Gojobori and Nei (12) and does not lend support to the first version of concerted evolution. Our results also contradict the second version of concerted evolution, in which intralocus polymorphism is generated by interlocus gene conversion or recombination. First, VH genes are not as polymorphic as was previously assumed (11), and thus there is no need to invoke interlocus genetic exchange to promote gene diversity in this gene family. Second, classical MHC loci are certainly highly polymorphic, but the phylogenetic trees do not show any significant intermingling of alleles from different polymorphic loci, suggesting that genetic exchange between loci does not contribute significantly to the extent of polymorphism. Some authors (18) claimed the importance of interlocus genetic exchange by finding that a few alleles from a locus were clustered with the alleles from other loci in MHC gene families. However, they did not perform any statistical test of allelic clusters, and it seems that this intermingling of alleles from different loci was caused by stochastic or sampling errors.
Of course, there are a few clear-cut cases of exon exchanges between different MHC loci (13, 66), but their contribution to MHC gene diversity seems to be minor (67).
Birth-and-Death Model of Evolution. In this model of evolution, duplicate genes are produced by various mechanisms, including tandem and block gene duplication, and some of the duplicate genes diverge functionally but others become pseudogenes owing to deleterious mutations or are deleted from the genome (Fig. 1B). The end result of this mode of evolution is a multigene family with a mixture of divergent groups of genes and highly homologous genes within groups plus a substantial number of pseudogenes. The model of birth-and-death evolution was formally presented by Nei and Hughes (16), but essentially the same process was envisaged at the genome level by Nei (68), who, on theoretical grounds, predicted that the mammalian genome contains a large number of duplicate genes and nonfunctional genes. Nei and his colleagues (12–16, 27) studied this problem statistically when DNA sequences for VH and MHC genes became available and obtained evidence supporting the model. The birth-and-death model of evolution is similar to Klein et al.’s (69) accordion model of MHC evolution, in which the number of MHC genes is assumed to be expanded or contracted, depending on the need to protect the host from ever-changing groups of parasites. The former model, however, allows some MHC genes (e.g., Ib genes) to stay in the genome for a long time or even to acquire slightly modified functions.
The results presented in this paper are consistent with what is expected from the birth-and-death model of evolution. We have seen that both the MHC and VH gene clusters in a species have experienced repeated gene duplication and many duplicate genes have become nonfunctional or been deleted. About 40% of VH genes and 50% of Vκ and Vλ genes in humans are known to be pseudogenes. The exact number of pseudogenes in the MHC cluster in humans is still unknown, but it seems to be quite large. In the class I gene region at least 16 pseudogenes or gene fragments have been identified (38). The mouse genome also contains many MHC pseudogenes (22).
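The verbal description of the birth-and-death process given above can be turned into a toy stochastic simulation. The sketch below is not a model fitted to any data in this paper: the per-step probabilities of duplication, inactivation, and deletion, the starting number of genes, and the function name are all hypothetical choices made only to show the qualitative outcome, namely a family containing old genes, recent duplicates, and pseudogenes.

```python
import random


def simulate_birth_and_death(steps=300, p_dup=0.01, p_pseudo=0.008,
                             p_delete=0.01, n_start=5, seed=1):
    """Toy birth-and-death process for a multigene family.

    Each gene is represented by the step at which it arose (0 for the
    founding genes). At every step a functional gene may give rise to a
    new duplicate and may be inactivated into a pseudogene; pseudogenes
    may be deleted from the genome. All rates are hypothetical.
    """
    rng = random.Random(seed)
    functional = [0] * n_start
    pseudogenes = []
    for t in range(1, steps + 1):
        next_functional = []
        for born in functional:
            if rng.random() < p_dup:
                next_functional.append(t)   # birth: a new duplicate copy
            if rng.random() < p_pseudo:
                pseudogenes.append(born)    # death: the copy becomes a pseudogene
            else:
                next_functional.append(born)
        functional = next_functional
        # Pseudogenes are eventually lost by deletion.
        pseudogenes = [b for b in pseudogenes if rng.random() >= p_delete]
    return functional, pseudogenes


functional, pseudogenes = simulate_birth_and_death()
# Age of each surviving functional gene at the end of the run.
ages = sorted(300 - born for born in functional)
```

Inspecting `ages` for different seeds and rates shows how the balance between duplication and loss determines whether a family retains long-lived lineages alongside recent duplicates or is dominated by one of the two.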
In the early stage of study of evolution of VH genes it appeared that the number of VH genes per genome is roughly the same for different species and that the effects of the birth and death processes in the VH gene cluster are balanced (12). Recent data, however, show that the number of VH genes varies considerably with species and is probably less than 20 in sheep, cattle, and pigs (61, 70, 71). Therefore, many VH genes have apparently been deleted in these species. This suggests that the number of VH genes need not be very large for the survival of an individual and that in these organisms VH diversity may be enhanced by somatic mutation, somatic gene conversion, and deletions/insertions that are generated when the VH, DH, and JH genes are recombined somatically. The number of MHC genes per genome is also known to vary extensively with organism (22). Therefore, gene duplication and deletion seem to be quite common in both Ig and MHC genes.
In retrospect, it is quite reasonable that MHC and Ig genes are subject to birth-and-death evolution, whereas ribosomal RNA (rRNA) genes follow concerted evolution. The latter gene family is used to produce a large quantity of the same gene product (rRNA) and thus natural selection would favor a highly homogeneous group of member genes. Concerted evolution is an efficient way to achieve this homogeneity. By contrast, the function of MHC and Ig genes is to defend the host from various forms of invading parasites, and a great amount of diversity is required for them. Therefore, the evolutionary force needed is for diversification of genes, and this can be achieved by gene duplication, mutation, and diversifying selection. Interlocus genetic variation may increase by independent mutation alone during a long evolutionary time, but in the case of MHC and Ig genes there is some evidence that different member genes are often adapted to cope with different types of parasites.
For example, alleles from human HLA class I loci A, B, and C have different antigen (peptide) specificity (19). The long-term survival of mammalian group A, B, and C VH genes also suggests that these genes are adapted to different groups of antigens. Indeed, recent studies have shown that the human VH family 3 is adapted to cope with a special group of bacterial antigens (72). In this respect chickens are special in having very closely related VH or Vλ sequences, but the paucity of VH and Vλ diversity is compensated by a high rate of interlocus somatic gene conversion, as mentioned earlier. Some authors (10, 73) suggested that these closely related sequences are subject to germ-line as well as somatic gene conversion. This may well be the case, but chicken and rabbit Ig genes are clearly exceptional and do not seem to follow the general pattern of Ig gene evolution in vertebrates.
The genome of higher organisms contains a large number of multigene families, some of which have evolved to produce a large quantity of the same gene product and others of which have evolved to produce a diverse array of proteins. If our conclusion is correct, the former group of gene families would be subject to concerted evolution and the latter to birth-and-death evolution. We hope that a detailed study will be conducted on this subject for many multigene families and the general pattern of evolution of multigene families will be clarified in the near future.
We thank Kei Takahashi for his help in the preparation of the manuscript and Austin Hughes, Jan Klein, Sudhir Kumar, and George Zhang for their comments on an earlier version of this manuscript. This work was supported by National Institutes of Health Grant GM20293 and National Science Foundation Grant DEB-9520832 to M.N.
1. Smith, G. P. (1973) Cold Spring Harbor Symp. Quant. Biol. 38, 507–513.
2. Zimmer, E. A., Martin, S. L., Beverley, S. M., Kan, Y. W. & Wilson, A. C. (1980) Proc. Natl. Acad. Sci. USA 77, 2158–2162.
3. Irwin, D. M. & Wilson, A. C. (1990) J. Biol. Chem. 265, 4944–4952.
4. Arnheim, N. (1983) in Evolution of Genes and Proteins, eds. Nei, M. & Koehn, R. K. (Sinauer, Sunderland, MA), pp. 38–61.
5. Weiss, E. H., Mellor, A. L., Golden, L., Fahrner, K., Simpson, E., Hurst, J. & Flavell, R. A. (1983) Nature (London) 301, 671–674.
6. Ohta, T. (1991) Proc. Natl. Acad. Sci. USA 88, 6716–6720.
7. Rada, C., Lorenzi, R., Powis, S. J., Bogaerde, J. V. D., Parham, P. & Howard, J. (1990) Proc. Natl. Acad. Sci. USA 87, 2167–2171.
8. Hood, L., Campbell, J. H. & Elgin, S. C. R. (1975) Annu. Rev. Genet. 9, 305–353.
9. Bentley, D. L. & Rabbitts, T. H. (1983) Cell 32, 181–189.
10. McCormack, W. T., Hurley, E. A. & Thompson, C. B. (1993) Mol. Cell. Biol. 13, 821–830.
11. Ohta, T. (1983) Theor. Pop. Biol. 23, 216–240.
12. Gojobori, T. & Nei, M. (1984) Mol. Biol. Evol. 1, 195–212.
13. Hughes, A. L. & Nei, M. (1989) Mol. Biol. Evol. 6, 559–579.
14. Ota, T. & Nei, M. (1994) Mol. Biol. Evol. 11, 469–482.
15. Nei, M. & Hughes, A. L. (1991) in Evolution at the Molecular Level, eds. Selander, R., Clark, A. & Whittam, T. (Sinauer, Sunderland, MA), pp. 222–247.
16. Nei, M. & Hughes, A. L. (1992) in 11th Histocompatibility Workshop and Conference, eds. Tsuji, K., Aizawa, M. & Sasazuki, T. (Oxford Univ. Press, Oxford), Vol. 2, pp. 27–38.
17. Huber, C., Schäble, K. F., Huber, E., Klein, R., Meindl, A., Thiebe, R., Lamm, R. & Zachau, H. G. (1993) Eur. J. Immunol. 23, 2868–2875.
18. Brunsberg, U., Edfors-Lilja, I., Andersson, L. & Gustafsson, K. (1996) Immunogenetics 44, 1–8.
19. Parham, P. & Ohta, T. (1996) Science 272, 67–74.
20. Yun, T. J., Melvold, R. W. & Pease, L. R. (1997) Proc. Natl. Acad. Sci. USA 94, 1384–1389.
21. Feder, J. N., Gnirke, A., Thomas, W., Tsuchihashi, Z., Ruddy, D. A., et al. (1996) Nat. Genet. 13, 399–408.
22. Trowsdale, J. (1995) Immunogenetics 41, 1–17.
23. Klein, J. & O’hUigin, C. (1994) Proc. Natl. Acad. Sci. USA 91, 6251–6252.
24. Rogers, J. H. (1985) EMBO J. 4, 749–753.
25. Klein, J. & Figueroa, F. (1986) Crit. Rev. Immunol. 6, 295–386.
26. Hughes, A. L. & Nei, M. (1989) Genetics 122, 681–686.
27. Hughes, A. L. & Nei, M. (1990) Mol. Biol. Evol. 7, 491–514.
28. Klein, J. (1986) Natural History of the Major Histocompatibility Complex (Wiley, New York).
29. Hedrick, P. W. & Kim, T. J. (1997) in Evolutionary Genetics from Molecules to Morphology, eds. Singh, R. S. & Krimbas, C. K. (Cambridge Univ. Press, New York), in press.
30. Hughes, A. L. & Nei, M. (1988) Nature (London) 335, 167–170.
31. Hughes, A., Hughes, M. K., Howell, C. Y. & Nei, M. (1994) Phil. Trans. R. Soc. Lond. B 345, 359–367.
32. Watkins, D. I., McAdam, S. N., Liu, X., Strang, C. R., Milford, E. L., Levine, C. G., Garber, T. L., Dogon, A. L., Lord, C. I., Ghim, S. H., Troup, G. M., Hughes, A. L. & Letvin, N. L. (1992) Nature (London) 357, 329–333.
33. Fodil, N., Laloux, L., Wanner, V., Pellet, P., Hauptmann, G., Mizuki, N., Inoko, H., Spies, T., Theodorou, I. & Bahram, S. (1996) Immunogenetics 44, 351–357.
34. Weiss, E. H., Golden, L., Fahrner, K., Mellor, A. L., Devlin, J. J., Bullman, H., Tiddens, H., Bud, H. & Flavell, R. A. (1984) Nature (London) 310, 650–655.
35. Kumar, S., Tamura, K. & Nei, M. (1993) MEGA: Molecular Evolutionary Genetic Analysis (Pennsylvania State Univ., University Park).
36. Lawlor, D. A., Zemmour, J., Ennis, P. D. & Parham, P. (1990) Annu. Rev. Immunol. 8, 23–63.
37. Zemmour, J., Koller, B. H., Ennis, P. D., Geraghty, D. E., Lawlor, D. A., Orr, H. T. & Parham, P. (1990) J. Immunol. 144, 3619–3629.
38. Geraghty, D. E., Koller, B. H., Pei, J. & Hansen, J. A. (1992) J. Immunol. 149, 1947–1957.
39. Nei, M. (1996) Annu. Rev. Genet. 30, 371–403.
40. Hughes, A. L. (1991) Immunogenetics 33, 367–373.
41. Salgar, S. K., Kunz, H. W. & Gill, T. J. (1995) Immunogenetics 42, 244–253.
42. Liskay, R. M., Letsou, A. & Stachelek, J. L. (1987) Genetics 115, 161–167.
43. Chen, Z. W., McAdam, S. N., Hughes, A. L., Dogon, A. L., Letvin, N. L. & Watkins, D. I. (1992) J. Immunol. 148, 2547–2554.
44. Hedges, S. B., Parker, P. H., Sibley, C. G. & Kumar, S. (1996) Nature (London) 381, 226–229.
45. Schneider, S., Vincek, V., Tichy, H., Figueroa, F. & Klein, J. (1991) Mol. Biol. Evol. 8, 753–766.
46. Slade, R. W. & Mayer, W. E. (1995) Mol. Biol. Evol. 12, 441–450.
47. Tanaka, T. & Nei, M. (1989) Mol. Biol. Evol. 6, 447–459.
48. Li, H. & Hood, L. (1995) Genomics 26, 199–206.
49. Sasso, E. H., Buckner, J. H. & Suzuki, L. A. (1995) J. Clin. Invest. 96, 1591–1600.
50. Zachau, H. G. (1995) in Immunoglobulin Genes, eds. Honjo, T. & Alt, F. W. (Academic, San Diego), pp. 173–191.
51. Cook, G. P., Tomlinson, I. M., Walter, G., Riethman, H., Carter, N. P., Buluwela, L., Winter, G. & Rabbitts, T. H. (1994) Nat. Genet. 7, 162–168.
52. Matsuda, F., Shin, E. K., Nagaoka, H., Matsumura, R., Haino, M., Fukita, Y., Taka-ishi, S., Imai, T., Riley, J. H., Anand, R., Soeda, E. & Honjo, T. (1993) Nat. Genet. 3, 88–94.
53. Honjo, T. & Matsuda, F. (1995) in Immunoglobulin Genes, eds. Honjo, T. & Alt, F. W. (Academic, San Diego), pp. 145–171.
54. Tomlinson, I. M., Williams, S. C., Ignatovich, O., Corbett, S. J. & Winter, G. (1996) V BASE Sequence Directory (MRC Centre for Protein Engineering, Cambridge, U.K.).
55. Schroeder, H. W. J., Hillson, J. L. & Perlmutter, R. M. (1990) Int. Immunol. 2, 41–50.
56. Williams, S. C., Frippiat, J.-P., Tomlinson, I. M., Ignatovich, O., Lefranc, M.-P. & Winter, G. (1996) J. Mol. Biol. 264, 220–232.
57. Andersson, E. & Matsunaga, T. (1995) Res. Immunol. 147, 233–240.
58. Rast, J. P., Anderson, M. K., Ota, T., Litman, R. T., Margittai, M., Shamblott, M. J. & Litman, G. W. (1994) Immunogenetics 40, 83–99.
59. Reynaud, C.-A., Dahan, A., Anquez, V. & Weill, J.-C. (1989) Cell 59, 171–183.
60. Knight, K. L. & Tunyaplin, C. (1995) in Immunoglobulin Genes, eds. Honjo, T. & Alt, F. W. (Academic, San Diego), pp. 289–314.
61. Dufour, V., Malinge, S. & Francois, N. (1996) J. Immunol. 156, 2163–2170.
62. Ota, T. & Nei, M. (1995) Mol. Biol. Evol. 12, 94–102.
63. McCormack, W. T., Tjoelker, L. W. & Thompson, C. B. (1991) Annu. Rev. Immunol. 9, 219–241.
64. Tonegawa, S. (1983) Nature (London) 302, 575–581.
65. Slightom, J. L., Blechl, A. E. & Smithies, O. (1980) Cell 21, 627–638.
66. Holmes, N. & Parham, P. (1985) EMBO J. 4, 2849–2854.
67. Hughes, A. L. (1995) Mol. Biol. Evol. 12, 247–258.
68. Nei, M. (1969) Nature (London) 221, 40–42.
69. Klein, J., Ono, H., Klein, D. & O’hUigin, C. (1993) Prog. Immunol. 8, 137–143.
70. Sinclair, M. C. & Aitken, R. (1995) Gene 167, 285–289.
71. Sun, J., Kacskovics, I., Brown, W. R. & Butler, J. E. (1994) J. Immunol. 153, 5618–5627.
72. Silverman, G. J. (1995) Ann. N.Y. Acad. Sci. 764, 342–355.
73. Roux, K. H., Dhanarajan, P., Gottschalk, V., McCormack, W. T. & Renshaw, R. W. (1991) J. Immunol. 146, 2027–2036.