CBMS-NSF REGIONAL CONFERENCE SERIES IN APPLIED MATHEMATICS

A series of lectures on topics of current research interest in applied mathematics under the direction of the Conference Board of the Mathematical Sciences, supported by the National Science Foundation and published by SIAM.

GARRETT BIRKHOFF, The Numerical Solution of Elliptic Equations
D. V. LINDLEY, Bayesian Statistics, A Review
R. S. VARGA, Functional Analysis and Approximation Theory in Numerical Analysis
R. R. BAHADUR, Some Limit Theorems in Statistics
PATRICK BILLINGSLEY, Weak Convergence of Measures: Applications in Probability
J. L. LIONS, Some Aspects of the Optimal Control of Distributed Parameter Systems
ROGER PENROSE, Techniques of Differential Topology in Relativity
HERMAN CHERNOFF, Sequential Analysis and Optimal Design
J. DURBIN, Distribution Theory for Tests Based on the Sample Distribution Function
SOL I. RUBINOW, Mathematical Problems in the Biological Sciences
P. D. LAX, Hyperbolic Systems of Conservation Laws and the Mathematical Theory of Shock Waves
I. J. SCHOENBERG, Cardinal Spline Interpolation
IVAN SINGER, The Theory of Best Approximation and Functional Analysis
WERNER C. RHEINBOLDT, Methods of Solving Systems of Nonlinear Equations
HANS F. WEINBERGER, Variational Methods for Eigenvalue Approximation
R. TYRRELL ROCKAFELLAR, Conjugate Duality and Optimization
SIR JAMES LIGHTHILL, Mathematical Biofluiddynamics
GERARD SALTON, Theory of Indexing
CATHLEEN S. MORAWETZ, Notes on Time Decay and Scattering for Some Hyperbolic Problems
F. HOPPENSTEADT, Mathematical Theories of Populations: Demographics, Genetics and Epidemics
RICHARD ASKEY, Orthogonal Polynomials and Special Functions
L. E. PAYNE, Improperly Posed Problems in Partial Differential Equations
S. ROSEN, Lectures on the Measurement and Evaluation of the Performance of Computing Systems
HERBERT B. KELLER, Numerical Solution of Two Point Boundary Value Problems
J. P. LASALLE, The Stability of Dynamical Systems - Z. ARTSTEIN, Appendix A: Limiting Equations and Stability of Nonautonomous Ordinary Differential Equations
D. GOTTLIEB AND S. A. ORSZAG, Numerical Analysis of Spectral Methods: Theory and Applications
PETER J. HUBER, Robust Statistical Procedures
HERBERT SOLOMON, Geometric Probability
FRED S. ROBERTS, Graph Theory and Its Applications to Problems of Society
JURIS HARTMANIS, Feasible Computations and Provable Complexity Properties
ZOHAR MANNA, Lectures on the Logic of Computer Programming
ELLIS L. JOHNSON, Integer Programming: Facets, Subadditivity, and Duality for Group and Semi-Group Problems
SHMUEL WINOGRAD, Arithmetic Complexity of Computations
J. F. C. KINGMAN, Mathematics of Genetic Diversity
MORTON E. GURTIN, Topics in Finite Elasticity
THOMAS G. KURTZ, Approximation of Population Processes
Mathematics of Genetic Diversity
J.F.C. Kingman University of Oxford Oxford, England
SOCIETY FOR INDUSTRIAL AND APPLIED MATHEMATICS PHILADELPHIA
Copyright ©1980 by the Society for Industrial and Applied Mathematics. All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688. Library of Congress Catalog Card Number: 80-51290. ISBN 0-89871-166-5. SIAM is a registered trademark.
Contents

Foreword
Terminology and Notation

Chapter 1 THE PROBLEM
1.1 Why mathematics?
1.2 Genes and their inheritance
1.3 Selection
1.4 Mutation

Chapter 2 SURVIVAL OF THE FITTEST
2.1 Balanced polymorphisms
2.2 Multi-locus selection
2.3 Balance between selection and mutation
2.4 The house of cards
2.5 The diploid house of cards
2.6 The resistance of polymorphisms to mutation

Chapter 3 THE NEUTRAL ALTERNATIVE
3.1 Evolution in the absence of selection
3.2 A general model for mutation in finite populations
3.3 The random walk case
3.4 The frequency spectrum
3.5 The Ewens sampling formula
3.6 The Poisson-Dirichlet distribution
3.7 Partition structures
3.8 Testing neutrality

Chapter 4 SELECTION IN FINITE POPULATIONS
4.1 Deleterious mutants
4.2 The Wright-Fisher model
4.3 Wright's formula
4.4 The infinite alleles limit

Appendix I AN INEQUALITY
Appendix II THE GENEALOGY OF THE WRIGHT-FISHER MODEL
Bibliography
Foreword

A Regional Conference held at Iowa State University, in June 1979, under the auspices of the Conference Board of the Mathematical Sciences and of the National Science Foundation, has given me this opportunity of drawing together some mathematical ideas which have in recent years been useful in population genetics. Of course, population genetics covers a wide area of science, and even those aspects which admit some mathematical analysis would be impossible to survey exhaustively in a slim volume. The present account, like the lectures on which it is based, concentrates on a few aspects which seem to be both biologically relevant and mathematically interesting. My choice is entirely personal, and the book should be read as the adventures of one mathematician in the fascinating, but formidably complicated, world of population genetics.

Were it not for the efforts of Oscar Kempthorne and his colleagues at ISU, the conference would not have taken place, nor would it have attracted a distinguished audience whose comments have improved both form and content of these chapters. To them I express my thanks, but I acknowledge too my debt to Warren Ewens and Geoff Watterson, whose profoundly original work has inspired, as their friendship has encouraged, my own contributions to the subject. Finally, I dedicate this book to John and Charlotte, excellent examples of its theme of genetic diversity.

J. F. C. K.
Oxford, 1979
Terminology and Notation

The now standard notation $\mathbf{R}$, $\mathbf{R}^n$, $\mathbf{N}$, $\mathbf{Z}$ is used for, respectively, the real line, $n$-dimensional Euclidean space, the set of natural numbers $1, 2, 3, \ldots$, and the set of (positive and negative) integers. Probability and expectation are denoted by P and E. Words like "positive" or "increasing" are to be understood in the weak sense unless qualified by the word "strictly."
CHAPTER 1
The Problem

1.1. Why mathematics? The essence of the application of mathematics to any branch of science is the recognition and exploitation of pattern or regularity. The regularity may be rigid and striking, like the procession of the planets around the sun, or it may be a dimly observed tendency hardly distinguishable amidst a general confusion. The biological sciences more often yield examples of the latter than of the former kind, but genetics is unusual because, despite the many complex and unpredictable factors which govern the everyday life of a biological organism, it inherits its genetic make-up from its parents according to a mechanism which does possess a definite structure. Indeed, the inductive process by which Mendel inferred the existence of genes and their role in transmitting heritable characteristics was itself a fine example of mathematical reasoning.

As it became accepted that the evolution of a natural population was to be understood in terms of Mendelian genetics and Darwinian natural selection, so also it became clear that this understanding could not be sought only at a qualitative level. The biologist needed to know how fast evolution would be expected to take place, how much genetic diversity would persist in a population, how intrusive would be the effect of various chance factors, and so on. Such questions led to the work of such scientists as R. A. Fisher, J. B. S. Haldane and Sewall Wright, who analysed models of populations in which Mendel's laws were combined with assumptions about the effect of different features of the natural environment.

One of the goals of "mathematical population genetics" has been to explore the mechanisms which maintain diversity in a population. In any group of humans, it is rare to find two who look alike, and any superficial resemblance (save between identical twins) is unlikely to reflect a genetical identity.
A similar observation would be made by an educated fruit fly about his fellow Drosophila; it is a general characteristic of natural populations to exhibit great genetic diversity. Yet if Darwin was right to stress "survival of the fittest", should not the less fit characteristics have been driven out by the fitter? How is the observed diversity to be reconciled with natural selection? This is not a question with a single answer, since there are many different mechanisms that could explain the persistence of variety. The mathematician can construct models to display such mechanisms at work, so that the biologist can compare them with experiment and observation, and so (perhaps) arrive at a judgement as to which mechanism is at work in any particular population. The purpose of the mathematical analysis is to start with a set of biological
assumptions and to explore the way in which a population would evolve if those assumptions were satisfied. Typically this process also clarifies and refines the assumptions themselves, and helps to sharpen the questions which the scientist asks of the world. It also serves as the starting point for the statistician, who can set up sensitive methods for using available data to discriminate between different models, and to estimate the parameters which appear in them.

In all this, the basic framework is supplied by the existing biological concepts and experimental techniques. As science advances, so new mathematical problems are posed and, probably, new mathematical tools must be forged. One such advance is a major theme of these notes: the realisation that, at a particular chromosome locus, the number of possible alleles is almost unlimited. Since the classical theory concentrated on loci with two, or perhaps three or four, alleles, this is clearly a fundamental change of emphasis. But of course mathematics thrives on large numbers, and the new emphasis has led to interesting mathematics, some of which is described herein.

1.2. Genes and their inheritance. In order to establish a framework for the mathematics, it may be helpful to begin with a greatly over-simplified account of the basic genetical theory. Most interesting organisms are diploid, which means that the genetical information which governs their development is carried by several pairs of chromosomes. Thanks to modern advances in molecular biology, we now have a picture of what a chromosome is; it is essentially a very long string of symbols drawn from a four-letter alphabet (the four different nucleotides of DNA). Most of the symbols appear to have no meaning, but at certain particular places on the chromosome there is a meaningful string of (several hundred) symbols. This is called the gene at that particular locus.
Thus we can think of a gene as a message written in this four-letter alphabet, on a tiny label which is attached to the chromosome at a particular place. The possible messages that could be written on that label are called the alleles of that locus. A diploid organism has a chromosome from each of its parents, so at a particular locus it has two genes (which may or may not be the same allele). It often makes sense to concentrate attention on a particular locus, so that the genetic character of an individual is described by specifying its two genes at that locus, but it should always be remembered that the genetics really depends on all loci at once (some aspects of this fact will be mentioned in §2.2). An organism reproduces by mating with another, and usually contributes one of its chromosomes, as does its mate, to each child. Here there are many possible complications. Many species are dioecious, which means that they have two sexes which must be considered separately. The choice of mate may be (and in species like our own invariably is) correlated with genetic characteristics. Moreover, the chromosomes may break up and recombine in the reproductive process, so that what is passed on to the next generation is not a perfect copy of the parental chromosome. These and other effects ought to enter any realistic model of the reproductive
process, but they will not do so in the models discussed in this monograph. We shall assume that our populations are monoecious (have only one "sex") and that each individual chooses its mate randomly from the population. Sometimes the same methods work, and sometimes similar conclusions are valid, for dioecious populations (as, for example, a recent account by Ethier and Nagylaki [1] suggests). Our main interest will be with the gene frequencies in a large population at a particular locus. If there are N individuals in the population, there are 2N chromosomes containing this locus. For each allele, we consider the proportion of the 2N chromosomes at which the gene is of this allele. This gives a probability distribution over the set of possible alleles which clearly describes the genetic make-up of the population as far as the chosen locus is concerned. The problem is to model the dynamical process by which this distribution changes in time, or from generation to generation. In some populations there are well-defined breeding seasons, and it therefore makes sense to talk of successive generations. This is the point of view adopted here, although some species (like our own) breed all the time and the generations soon get mixed up. We then have to work in real time, with allowance for the ages at which mating takes place; the analysis is more complicated but not essentially different. It is convenient to have a definite model for the reproductive process in a monoecious, randomly mating population. Suppose we have a population whose size N is held constant (for example, by constraints on living space or food supply). Direct attention to a particular locus. One can then imagine that each individual produces a very large number of cells called gametes, each of which contains only one gene at the locus. Half the gametes inherit copies of one of the individual's genes, the other half copies of the other. 
All the gametes produced by all the individuals are thrown into a pool, and an individual of the next generation is produced by drawing two gametes at random from the pool and combining them. The N individuals of the next generation are obtained by 2N independent drawings from the pool. This basic model is usually referred to as the Wright-Fisher model; it is of course a greatly simplified one and any conclusion from it needs to be checked for robustness. The gene frequencies in a generation are the same as those in the subsequent gamete pool, so instead of successive generations we can consider the successive intercalated pools. This simplifies the problem by replacing the diploid organism by haploid gametes (having only one set of chromosomes). For this reason it is often sufficient to develop a theory of haploid populations, though care needs to be exercised because many biological mechanisms (notably selection and recombination) act in an essentially diploid way. 1.3. Selection. The genetic composition of an organism determines, by interaction with its environment, the way in which the organism will develop, and, therefore, the way it will adapt to the problems of its existence: how successful it will be in garnering food, in warding off predators, and in mating and
producing offspring. Some alleles, or combinations of alleles at several loci, will tend to be more successful than others, and so tend to increase their frequency in the population at the expense of those less successful. This selective effect, as far as a single locus is concerned, can easily be incorporated into the Wright-Fisher model. One imagines that initially many more than N individuals are produced by drawings of pairs of gametes, and that the probability of survival to maturity depends on the two genes (the genotype) of the individual, so that the N who survive to maturity will typically form a population with gene frequencies weighted according to these survival probabilities. The relative values of these probabilities, normalised in some convenient way, are the relative fitnesses of the possible allele-pairs.

With this model, we can begin to consider the problem posed in §1.1, the reconciliation of observed genetic diversity with postulated selective pressure. Several general answers might be envisaged, of which the most important are the following.

(i) It may very well happen that our population is not in equilibrium, that the less fit alleles are on the way out but have not yet disappeared. Certainly it is important to understand the nonequilibrium behaviour of models like the Wright-Fisher,¹ but it hardly suffices for a general explication of genetic diversity to argue that it does exist, but that by some genetical analogue of the second law of thermodynamics it is steadily decreasing. So a convincing model ought to maintain diversity even when it has reached equilibrium.

(ii) Diversity can be maintained by differences in environment in space (or perhaps time). If one allele is adapted to one environment, and another to another, and there is migration between the environments, then one equilibrium can be established in which both alleles thrive (see, for example, Fleming [1], Gillespie [2]).
(iii) Selection typically acts on allele pairs rather than on individual alleles, and an allele can be favourable in some combinations but not in others. That this can result in a stable configuration with several alleles will be shown in §2.1.

(iv) The last two mechanisms may explain why diversity is maintained, but they give no clue as to how it originated. The usual explanation is in terms of mutation, spontaneous change which means that an offspring inherits not the gene from the parental gamete but a different one. The label has not been accurately transcribed. Such mutations occur with very small probability, but their cumulative effect over many births may provide enough variety to balance the effect of selection. Models for mutation-selection balance will be discussed in §§2.3-2.5, and from a different point of view in Chapter 4.

(v) Some geneticists take the view that the effect of selection has been exaggerated, and that much genetic diversity is to be understood mainly in terms of mutation and the random fluctuations which are inherent in the reproductive process. Such an attitude leads to "neutral" models without selection; these are the subject matter of Chapter 3.

¹ Methods exist for that; see, for example, Kimura [1], [2], Nei and Li [1], Kingman [12].

It should be stressed that there is no question of showing that one of these five answers (or perhaps yet a different one) is universally valid. The behaviour of any real population will probably have ingredients from each, together with other complications. But the biologist's explanations will give prominence to one or more, and fall to be judged in the light of an understanding of models based on them.

1.4. Mutation. Suppose that the alleles at a locus are listed $A_1, A_2, \ldots, A_k$. A parental gene $A_i$ appears also as $A_i$ in the genotype of an offspring, unless mutation has occurred, but mutation may cause the offspring to inherit a different allele $A_j$. If $u_{ij}$ denotes the probability that mutation changes $A_i$ to $A_j$, and

$$u_{ii} = 1 - \sum_{j \ne i} u_{ij},$$
then the $u_{ij}$ $(j \ne i)$ are small and $u_{ii}$ is near 1. The usual type of mutation postulated is the change of a single letter in the "message" which constitutes the gene. For such a mutation, it is natural to suppose that different mutations are statistically independent, so that the mutation model is completely specified by the matrix $(u_{ij};\ i, j = 1, 2, \ldots, k)$. This is the viewpoint which will be adopted here.

There are, however, other types of events which give the same mutation effect but are much more difficult to model. A chromosome break can occur in the middle of a single gene, so that the daughter gene contains part of the messages from each of its parents. The mutation rate then depends on whether the parents had the same genes at that locus, and a complicated correlation between different mutations enters.

Even with the simple picture of "single letter" mutation, it is far from clear what form to assume for the matrix $(u_{ij})$. The alleles $A_i$ are (some or all of) the messages that can be written with the appropriate number of symbols from the four-letter alphabet. Then $u_{ij}$ is zero unless $A_i$ and $A_j$ differ in only one letter. A model reflecting this structure will be described in §3.3. Fortunately, however, the results are insensitive to the detailed properties of $(u_{ij})$ so long as the possibility of recurrent mutation can be neglected. That is to say, once an allele $A_i$ has been corrupted by a mutation, it is supposed unlikely that further corruption will by coincidence restore $A_i$. In §3.5 we explore the consequences of this approximation and the very useful simplification which it brings to the analysis, when selective effects are ignored.

In selective models, one has also to bring in the fitness of the genotype $A_iA_j$, and genetical theory is silent as to how this is likely to depend on the messages of the alleles $A_i$ and $A_j$. To make any progress, a much more stringent hypothesis is made about mutation, namely, that the mutant allele is independent of its parent, so that $u_{ij}$ does not depend on $i$ at all, so long as $i \ne j$. The rationale for this assumption is that it is extremely unlikely that the corrupt message means anything at all. If it does, it probably bears no biological relation to the original (just as there is no direct association between the meanings of the words "chance" and "change"). An understanding of the real nature of mutation will no doubt come as biologists learn the language in which the allele labels are written, but it seems reasonable in the meantime to exploit as a first approximation the simplification which the independence assumption brings.

These arguments assume, however, that the alleles are all distinguishable. Experimental techniques have hardly reached this level. For example, it is common to detect variation at a locus by electrophoretic techniques which give each allele a (positive or negative) integer, representing in a sense the total charge on that part of the chromosome. If this is a sum of contributions from the different letters, then a mutation will probably cause an increase or decrease of one charge unit. Thus it is supposed that the alleles are placed at integer points of a line, that we can only distinguish between alleles at distinct points, and that the mutation probabilities $u_{ij}$ are nonzero only if $A_i$ and $A_j$ are adjacent. This gives a model in which $(u_{ij})$ is the transition matrix of a random walk. This can be generalised to cover the situation in which several independent electrophoretic determinations are made (as for instance in Singh, Lewontin and Felton [1]), when the alleles sit at points of a $d$-dimensional integer lattice $\mathbf{Z}^d$, and $(u_{ij})$ is again a random walk transition matrix. Models of this type are discussed in §3.3, where it is shown that (in the absence of selection) the approximation of nonrecurrent mutation becomes quite quickly accurate as the dimension $d$ increases.
To this extent, the random walk model (also called the "charge-state" or "ladder-rung" model, and first introduced by Ohta and Kimura [1]) is appropriate only to the situation in which alleles are very imperfectly distinguished.
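As a rough illustration of how the pieces of this chapter fit together, the following sketch combines the gamete-pool sampling of §1.2 with the random walk (charge-state) mutation just described, for a haploid pool of 2N genes without selection. It is a minimal sketch under stated assumptions, not a model taken from the text: the function names, the parameter values, the single mutation probability `u`, and the choice of equal probabilities for a step up or down are all assumptions made for illustration.

```python
import random
from collections import Counter


def next_generation(genes, u, rng):
    """One Wright-Fisher generation for a pool of integer-valued alleles.

    Each offspring gene copies a gene drawn at random (with replacement)
    from the parental pool. With probability u it then mutates to an
    adjacent charge state; the equal chance of moving up or down is an
    assumption of this sketch.
    """
    offspring = []
    for _ in range(len(genes)):
        allele = rng.choice(genes)
        if rng.random() < u:
            allele += rng.choice((-1, 1))
        offspring.append(allele)
    return offspring


def gene_frequencies(genes):
    """Proportion of the pool carrying each allele, keyed by charge state."""
    counts = Counter(genes)
    total = len(genes)
    return {allele: n / total for allele, n in sorted(counts.items())}


def simulate(n_individuals=50, u=0.01, generations=200, seed=1):
    """Run the neutral model from a monomorphic start at charge state 0."""
    rng = random.Random(seed)
    genes = [0] * (2 * n_individuals)  # 2N genes at the locus
    for _ in range(generations):
        genes = next_generation(genes, u, rng)
    return gene_frequencies(genes)


if __name__ == "__main__":
    for allele, freq in simulate().items():
        print(f"charge state {allele:+d}: frequency {freq:.3f}")
```

Setting `u = 0` leaves the population monomorphic, while a small positive `u` typically produces a handful of neighbouring charge states whose frequencies fluctuate from run to run. This is the kind of diversity, maintained by mutation and random sampling alone, that the neutral models of Chapter 3 analyse.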
CHAPTER 2
Survival of the Fittest

2.1. Balanced polymorphisms. Throughout this chapter it is supposed that the population considered is so large that random fluctuations can be ignored. Thus the evolution of gene frequencies can be modelled deterministically. In this section we confine our attention to a single locus, and to the effect of selection alone. Suppose that the possible alleles at this locus are listed as $A_1, A_2, \ldots, A_k$. Let the corresponding gene frequencies be $p_1, p_2, \ldots, p_k$, so that in the corresponding gamete pool a gamete chosen at random has probability $p_i$ of carrying a gene of allele $A_i$; clearly

$$p_i \ge 0, \qquad \sum_{i=1}^{k} p_i = 1. \tag{2.1.1}$$

The process of drawing from the gamete pool yields an individual of genotype $A_iA_j$ with probability $p_ip_j$, this reflecting the assumption of random mating. (In fact, nature does not distinguish between $A_iA_j$ and $A_jA_i$, but we can do so for present purposes.) Now suppose that the probability of an individual with genotype $A_iA_j$ surviving to maturity is $w_{ij}$; thus

$$w_{ij} = w_{ji} \ge 0, \tag{2.1.2}$$

and only the ratios of the fitnesses $w_{ij}$ are relevant. The number of such survivors is proportional to $w_{ij}p_ip_j$. This enables us to compute the gene frequencies $p_i'$ in the next generation as

$$p_i' = W^{-1} p_i \sum_{j=1}^{k} w_{ij} p_j, \tag{2.1.3}$$

where

$$W = \sum_{i=1}^{k} \sum_{j=1}^{k} w_{ij} p_i p_j. \tag{2.1.4}$$
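Thus we can plot a point $\mathbf{p} = (p_1, p_2, \ldots, p_k)$ in $\mathbf{R}^k$ to represent the genetic composition of the population. The conditions (2.1.1) confine $\mathbf{p}$ to a subset $\Delta$ of $\mathbf{R}^k$, a $(k-1)$-dimensional simplex. Equations (2.1.3-4) can be written in the form

$$\mathbf{p}' = \Phi(\mathbf{p}), \tag{2.1.5}$$

where $\Phi$ is a function from $\Delta$ into $\Delta$ determined by the fitness matrix $(w_{ij};\ i, j = 1, 2, \ldots, k)$. If $\mathbf{p}(t)$ denotes the point of $\Delta$ describing the $t$th generation, then $\mathbf{p}(t)$ is obtained from $\mathbf{p}(0)$ by $t$ applications of (2.1.5), so that the evolution of the population is determined by the iterates of the function $\Phi$.

It is instructive to consider first the simple case $k = 2$. Here (2.1.5) can be written

$$p' = \phi(p), \tag{2.1.6}$$

where $p = p_1$, $1 - p = p_2$ and

$$\phi(p) = \frac{p\,\{w_{11}p + w_{12}(1-p)\}}{w_{11}p^2 + 2w_{12}p(1-p) + w_{22}(1-p)^2}.$$

The function $\phi$ maps the unit interval onto itself and satisfies $\phi(0) = 0$, $\phi(1) = 1$. It also satisfies $\phi(\hat p) = \hat p$ if

$$\hat p = \frac{w_{12} - w_{22}}{2w_{12} - w_{11} - w_{22}} \tag{2.1.7}$$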
lies in (0, 1), and there are no other solutions of $\phi(p) = p$ in this interval. From this it is easy to see what happens in four different cases (omitting borderline situations in which two of $w_{11}$, $w_{22}$, $w_{12}$ are equal):

(I) $w_{11} < w_{12} < w_{22}$. In this case $p' < p$ unless $p = 0$ or $p = 1$, and $p(t)$ decreases to zero as $t \to \infty$. Thus allele $A_1$ dies out and $A_2$ is "fixed."

(II) $w_{22} < w_{12} < w_{11}$. The opposite to (I); $p' > p$ and allele $A_1$ is fixed.

(III) $w_{12} < w_{11}, w_{22}$. The ultimate outcome now depends on $p(0)$, because $0 < \hat p < 1$ and $p' < p$ if $0 < p < \hat p$ while $p' > p$ if $\hat p < p < 1$. Thus $A_1$ is fixed if $p(0) > \hat p$, but $A_2$ is fixed if $p(0) < \hat p$.

(IV) $w_{12} > w_{11}, w_{22}$. Here $p'$ always lies between $p$ and $\hat p$, and it follows that $p(t) \to \hat p$ as $t \to \infty$, whatever the value of $p(0)$.

Thus the two alleles will coexist indefinitely if the heterozygote $A_1A_2$ is fitter than either of the homozygotes $A_1A_1$ and $A_2A_2$. If this condition is satisfied, then selection alone can account for the variety at the locus.

It is natural to try to generalise this argument to more than two alleles, and the key to doing this is to observe that the "mean fitness" $W$, regarded as a function of $\mathbf{p}$, increases from one generation to the next. To prove this, the Mandel-Scheuer inequality [1], note that the value of $W$ in the daughter generation is

$$W' = \sum_i \sum_j w_{ij}\, p_i' p_j' = W^{-2} \sum_i \sum_j \sum_k p_i\, w_{ij}\, p_j\, w_{ik}\, p_k \Bigl(\sum_l w_{jl}\, p_l\Bigr).$$
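Interchanging the roles of $j$ and $k$ and adding, we obtain

$$\begin{aligned}
W^2 W' &= \tfrac{1}{2} \sum_i \sum_j \sum_k p_i p_j p_k\, w_{ij} w_{ik} \Bigl\{\sum_l w_{jl} p_l + \sum_l w_{kl} p_l\Bigr\} \\
&\ge \sum_i \sum_j \sum_k p_i p_j p_k\, w_{ij} w_{ik} \Bigl(\sum_l w_{jl} p_l\Bigr)^{1/2} \Bigl(\sum_l w_{kl} p_l\Bigr)^{1/2} \\
&= \sum_i p_i \Bigl\{\sum_j w_{ij} p_j \Bigl(\sum_l w_{jl} p_l\Bigr)^{1/2}\Bigr\}^2 \\
&\ge \Bigl\{\sum_i p_i \sum_j w_{ij} p_j \Bigl(\sum_l w_{jl} p_l\Bigr)^{1/2}\Bigr\}^2 \\
&= \Bigl\{\sum_j p_j \Bigl(\sum_l w_{jl} p_l\Bigr)^{3/2}\Bigr\}^2 \\
&\ge \Bigl\{\sum_j p_j \sum_l w_{jl} p_l\Bigr\}^3 = W^3.
\end{aligned}$$

In this chain, the first inequality comes from the fact that the arithmetic mean of two numbers exceeds the geometric mean, the second from the convexity inequality

$$\sum_i p_i x_i^r \ge \Bigl(\sum_i p_i x_i\Bigr)^r \qquad (x_i \ge 0,\ r \ge 1)$$

when $r = 2$, and the third from the same inequality when $r = 3/2$. The argument shows, moreover, that $W' > W$ unless $\mathbf{p}' = \mathbf{p}$. Hence $W(\mathbf{p}(t))$ increases with $t$, and it follows without difficulty that, if $W$ attains its maximum in $\Delta$ at a point $\hat{\mathbf{p}}$ with $\hat p_i > 0$ for all $i$, then $\mathbf{p}(t) \to \hat{\mathbf{p}}$ as $t \to \infty$, whatever the starting point $\mathbf{p}(0)$. The condition for $W$ to have a stationary value at $\hat{\mathbf{p}}$ is that, for some $W$,

$$\sum_j w_{ij} \hat p_j = W \qquad (i = 1, 2, \ldots, k), \tag{2.1.8}$$

and this will be a strict maximum if the fitness matrix is such that

$$\sum_i \sum_j w_{ij} x_i x_j < 0 \tag{2.1.9}$$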
for all $x_i$, not all zero, satisfying $\sum_i x_i = 0$. (See Kingman [1] for further details, and for the conditions under which random fluctuations may safely be ignored.)

If $W$ attains its maximum on the boundary of $\Delta$, this argument fails. Various possibilities exist, but what always happens is that at least one allele dies out. After this has happened, the evolution is governed by equations of the form (2.1.3-4), but with a smaller value of $k$ and a reduced fitness matrix.

Thus selection according to a fitness matrix $(w_{ij})$ can maintain $k$ alleles in equilibrium, but only if the equations (2.1.8) admit a strictly positive solution $(\hat p_i)$, and the negative-definiteness condition (2.1.9) is satisfied. When $k$ is large, these are very demanding conditions (cf. Gillespie [1]), very unlikely to be fulfilled by a fitness matrix written down at random. It might therefore be thought that the simple theory of this section cannot explain a polymorphism with a large number of alleles without some mechanism for evolving a rather special array of fitnesses.

This however is not so. If the $(k \times k)$ fitness matrix does not lead to an internal equilibrium, some alleles will die out, and eventually equilibrium will be established with some smaller number $k'$ of alleles. The number $k'$ depends on the $w_{ij}$ and on the initial gene frequencies, but there seems no reason why it should not be large if $k$ is large. How large then would one expect $k'$ to be for "typical" fitness matrices and initial conditions? One could formulate this question by allowing the $w_{ij}$ $(1 \le i \le j \le k)$ to be, say, independent random variables uniformly distributed on (0, 1), and $\mathbf{p}(0)$ uniformly distributed on $\Delta$. The distribution of $k'$ in terms of $k$ could then in principle be computed, though the problem would be one of great difficulty. An upper bound for $k'$ given in Kingman [1] (see also Karlin [2]) shows that on the average $k'$ is at most H' for $k'$ large, but how sharp this is remains obscure.
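The recurrence (2.1.3-4) and the random-matrix question raised above lend themselves to numerical experiment. The sketch below iterates (2.1.3), and counts the alleles still present after many generations for a fitness matrix with independent uniform entries and a uniformly distributed starting point. The function names, the number of generations and the frequency threshold below which an allele is deemed extinct are assumptions made for illustration; a finite run can only approximate the limiting number k', since alleles that die out slowly may be miscounted.

```python
import random


def mean_fitness(w, p):
    """W = sum_i sum_j w[i][j] * p[i] * p[j], as in (2.1.4)."""
    k = len(p)
    return sum(w[i][j] * p[i] * p[j] for i in range(k) for j in range(k))


def next_generation(w, p):
    """One application of (2.1.3): p_i' = p_i * sum_j w[i][j] p[j] / W."""
    k = len(p)
    big_w = mean_fitness(w, p)
    return [p[i] * sum(w[i][j] * p[j] for j in range(k)) / big_w
            for i in range(k)]


def random_fitness_matrix(k, rng):
    """Symmetric matrix whose entries w_ij (i <= j) are independent uniforms."""
    w = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i, k):
            w[i][j] = w[j][i] = rng.random()
    return w


def random_simplex_point(k, rng):
    """Uniform point of the simplex, via normalised exponential variables."""
    x = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(x)
    return [xi / total for xi in x]


def surviving_alleles(w, p, generations=2000, threshold=1e-6):
    """Count alleles whose frequency still exceeds the (assumed) threshold."""
    for _ in range(generations):
        p = next_generation(w, p)
    return sum(1 for pi in p if pi > threshold)


if __name__ == "__main__":
    rng = random.Random(0)
    k, trials = 6, 10
    counts = [surviving_alleles(random_fitness_matrix(k, rng),
                                random_simplex_point(k, rng))
              for _ in range(trials)]
    print("surviving alleles per trial:", counts)
    print("average:", sum(counts) / trials)
```

Repeating the experiment for several values of `k` gives a crude empirical picture of how the number of surviving alleles grows with the number initially present, against which any theoretical bound can be compared.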
Of course this whole approach is unrealistic in that it ignores the very mechanism by which the $k$ alleles are presumably established: mutation. A general, perhaps too general, approach to the synthesis of mutation and selection in a one-locus diploid context is outlined in §2.5. In §2.6 a much cruder approach is adopted, which suggests that large polymorphisms may be less susceptible to attack by new mutations.

2.2. Multi-locus selection. The elegant theory sketched in the last section is, alas, too good to be true, and much more difficult problems arise when the fitness of an individual depends on the genes at more than one locus. Take the simplest case of two loci. Let the alleles at the first be $A_\alpha$ $(\alpha = 1, 2, \ldots, a)$ and at the second $B_\beta$ $(\beta = 1, 2, \ldots, b)$. Then the possible gametes will be described by pairs $i = (\alpha, \beta)$ and the possible individuals by genotypes $(i, j)$
which are pairs of pairs. The fitness matrix is now of order $ab$, the fitnesses $w_{ij}$ being indexed by the pairs $i = (\alpha, \beta)$, $j = (\alpha', \beta')$. This complication would not be serious if equation (2.1.3) continued to hold, but it does not, because of the possibility of recombination. The gametes contributed to the pool by an individual of type $(i, j)$ are not all of types $i = (\alpha, \beta)$ and $j = (\alpha', \beta')$; some are of the mixed types $(\alpha, \beta')$ and $(\alpha', \beta)$. Moreover, the proportions of these four varieties depend on whether the two loci are on the same chromosome, and if so how far they are apart. The effect is to replace (2.1.3) by a recurrence of the form
where $a_{ijl} \ge 0$ depends on the fitnesses and on the recombination probabilities, and «j, =
The passage from (2.1.3) to (2.2.1) is fatal to the methods of §2.1. Moran [2] showed that the mean fitness does not always increase, and efforts to find a function with this (Lyapunov) property have not prospered. Moreover, Karlin and others (Karlin [1], Karlin and Carmelli [1], Karlin and Liberman [1]) have shown that there can be many stable equilibria with all alleles present, whereas in the one-locus case there can be at most one.

This throws doubt on the application of the results of §2.1 to any natural polymorphism. Observed diversity at a particular locus could be the result of a multi-locus system in stable equilibrium. In principle we could estimate the fitness matrix at that locus, but (2.1.3) could only be applied bearing in mind that the fitnesses will themselves be changing, since they are functions of the gene frequencies at the linked loci. Thus there is an urgent need for a general theory of multi-locus systems and the consequent marginal behaviour at single loci.

Some progress in this direction has been made by Ewens and Thomson [1], using the fact that the evolution of such a system is always governed by an equation of the form (2.2.1) for suitable coefficients $a_{ijl}$. Moreover, these coefficients satisfy certain conditions which can be exploited. Perhaps the most important conclusions of Ewens and Thomson concern the situation at one $k$-allele locus of an arbitrary multi-locus system in stable equilibrium. At that locus it makes sense to define $w_{ij}$ ($i, j = 1, 2, \ldots, k$) as the average fitness of those genotypes which have alleles $A_i$ and $A_j$. Then it is shown that the marginal gene frequencies $p_i$ do satisfy (2.1.8). What is not known, but is conjectured to be true (cf. Karlin and Carmelli [1]), is that (2.1.9) also holds for $w_{ij}$ so defined. For example, when $k = 2$, the equilibrium frequency of $A_1$ is given by (2.1.7).
Since this expression lies outside (0, 1) if $w_{12}$ falls between $w_{11}$ and $w_{22}$, a necessary condition for stability is that either $w_{12} < w_{11}, w_{22}$ or that $w_{12} > w_{11}, w_{22}$.
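The two-allele case is easy to check by direct iteration. The following sketch assumes that (2.1.7) is the standard two-allele equilibrium $p_1 = (w_{12} - w_{22})/(2w_{12} - w_{11} - w_{22})$ and that the recursion is the usual one-locus selection equation; the function names and the numerical fitness values are illustrative only.

```python
def two_allele_equilibrium(w11, w12, w22):
    """Candidate equilibrium frequency of A1 (assumed form of (2.1.7))."""
    denom = 2 * w12 - w11 - w22
    if denom == 0:
        return None  # no isolated internal equilibrium
    return (w12 - w22) / denom

def two_allele_step(p, w11, w12, w22):
    """One generation of selection for the frequency p of A1."""
    q = 1.0 - p
    W = w11 * p * p + 2 * w12 * p * q + w22 * q * q
    return p * (w11 * p + w12 * q) / W

def iterate(p, fitnesses, generations=2000):
    for _ in range(generations):
        p = two_allele_step(p, *fitnesses)
    return p

# Heterozygote fitter than both homozygotes: internal equilibrium, approached from p = 0.1.
advantage = (0.8, 1.0, 0.6)
p_adv = iterate(0.1, advantage)

# Heterozygote intermediate: the formula gives a value outside (0, 1).
intermediate = (0.6, 0.7, 1.0)
p_mid = two_allele_equilibrium(*intermediate)

# Heterozygote less fit than both: internal equilibrium exists but is not approached.
disadvantage = (1.0, 0.5, 0.9)
p_dis = iterate(0.5, disadvantage)

print(two_allele_equilibrium(*advantage), p_adv, p_mid, p_dis)
```

In this example only the heterozygote-advantage case settles at the internal equilibrium; in the heterozygote-disadvantage case the iteration moves away from it towards fixation of one allele.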
CHAPTER 2
What is conjectured is that the former possibility cannot arise from a stable system; it is a measure of the difficulty of the problem that this has not been resolved even for a general two-locus system with two alleles at each locus. Even the problems of existence of, and convergence to, an equilibrium in (2.2.1) are very difficult to resolve in general, though some deep theorems of Kesten [1] (and with random fluctuations, [2]) are sometimes applicable.

2.3. Balance between selection and mutation. Little progress has been made with the analysis of models involving both mutation and selection at the level of generality of the arguments in §2.1. To get useful results it seems that some restriction must be placed on the fitnesses $w_{ij}$. Following Moran [4], [5], [6], we shall assume multiplicative fitness, so that for some $w_i$ ($i = 1, 2, \ldots, k$),

$$w_{ij} = w_i w_j. \qquad (2.3.1)$$
This would hold if selection operated on the gametes rather than on the diploid organism, and more generally it would be a reasonable approximation for loci without dominance, in which the heterozygote $A_iA_j$ has a fitness intermediate between those of the corresponding homozygotes $A_iA_i$ and $A_jA_j$. When (2.3.1) holds, the equations (2.1.3-4), governing gene frequencies under selection alone, simplify to the point of triviality:

$$p_i' = \frac{w_i p_i}{\sum_j w_j p_j}, \qquad (2.3.2)$$

and the composition of the $t$th generation is given by

$$p_i(t) = \frac{w_i^t\, p_i(0)}{\sum_j w_j^t\, p_j(0)}. \qquad (2.3.3)$$

As $t$ increases, all the alleles die out except that (or those) for which $w_i$ is greatest. Mutation can be introduced into this simple model by supposing, as in §1.4, that of the gametes of allele $A_i$ which are contributed to the pool, a proportion $u_{ij}$ mutate to $A_j$ before the selection process which forms the next generation. As before, we define $u_{ii}$ by (1.4.1), so that $(u_{ij};\ i, j = 1, 2, \ldots, k)$ is a stochastic matrix which describes the effect of mutation. Then (2.3.2) must be replaced by the equation

$$p_j' = \frac{\sum_i w_i p_i u_{ij}}{\sum_i w_i p_i}. \qquad (2.3.4)$$
It is to obtain this simple bilinear equation that we adopt the stringent assumption (2.3.1); without it we would be faced with an intractable biquadratic recursion. It reduces our diploid problem to a haploid one; however, we shall return to the truly diploid situation in §2.5. The way to solve (2.3.4) is to follow Moran [5] in defining a matrix $V = (w_i u_{ij})$. Denote by $v_{ij}^{(n)}$ the $(i, j)$th element of the $n$th power $V^n$. Then by induction on $t$,
Thus the behaviour of the gene frequencies in successive generations depends on that of the powers of the positive matrix $V$. For simplicity, suppose that some power of $V$ has no zero element. Then the Perron-Frobenius theorem asserts that $V$ has a maximal eigenvalue $\lambda > 0$ and left and right eigenvectors $(x_i)$ and $(y_i)$ which can be normalised so that
Moreover (see, for example, Seneta [1, Thm. 1.2]),
which, when substituted into (2.3.5), shows that
regardless of the initial frequencies $p_i(0)$. In other words, the equations
have a unique solution $(p_i)$ satisfying (2.1.1) for exactly one value of $\lambda$. This is the limiting set of gene frequencies, from whatever starting point, and the value of $\lambda$ is the limiting (haploid) mean fitness. When $k$ is small, these equations may be approached directly, giving a polynomial equation of degree $k$ for $\lambda$, but when $k$ is large it may be difficult to extract useful information from (2.3.9). To see how this works out in detail, consider a very simple situation in which
$A_1$ is favourable over equally fit alleles $A_2, \ldots, A_k$:
Then (2.3.9) becomes
Summing over j, we obtain
so that (2.3.11) has the iterative solution
in terms of the powers $(u_{ij}^{(n)})$ of the stochastic matrix $(u_{ij})$. Putting $j = 1$, we have the equation
which determines $\lambda$. With this value of $\lambda$, $p_j$ is then given by
Of course, the fact that (2.3.13) has a unique real solution $\lambda > 1$ is obvious, because the left-hand side decreases continuously with $\lambda$, with limits $\sum_n u_{11}^{(n)}$ and 0 as $\lambda \to 1$ and $\lambda \to \infty$, and the positivity condition on $V$ implies that $\sum_n u_{11}^{(n)} = \infty$. Without the positivity condition it might happen that this series converges: indeed
where $f_1$ is the probability, in the Markov chain with transition probabilities $u_{ij}$, of ultimate return to the state 1 (Feller [1]). Thus equation (2.3.13) has a solution only if $s^{-1} < f_1/(1 - f_1)$, that is, if
a nontrivial condition if $f_1 < 1$.
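The convergence described in this section, from any starting frequencies to a single limiting distribution, can be checked numerically for a small example with one favoured allele. The sketch assumes that the recursion (2.3.4) has the form $p_j' = \sum_i w_i p_i u_{ij} / \sum_i w_i p_i$ (the displayed equation is not legible in this copy, so this form is inferred from the surrounding description); the mutation matrix, the advantage $s$ and the function names are hypothetical.

```python
def mutation_selection_step(p, w, u):
    """One generation: gametes weighted by fitness, then mutation (assumed form of (2.3.4))."""
    k = len(p)
    mean_w = sum(w[i] * p[i] for i in range(k))
    new_p = [sum(w[i] * p[i] * u[i][j] for i in range(k)) / mean_w for j in range(k)]
    return new_p, mean_w

def iterate(p, w, u, generations=2000):
    """Return the frequencies after many generations and the last haploid mean fitness."""
    lam = None
    for _ in range(generations):
        p, lam = mutation_selection_step(p, w, u)
    return p, lam

k = 4
s = 0.05                      # hypothetical advantage of allele A1
w = [1.0 + s] + [1.0] * (k - 1)
rate = 0.01                   # hypothetical mutation probability to each other allele
u = [[rate if i != j else 1.0 - rate * (k - 1) for j in range(k)] for i in range(k)]

# Two very different starting points.
p_a, lam_a = iterate([0.25] * k, w, u)
p_b, lam_b = iterate([0.01, 0.01, 0.01, 0.97], w, u)

print("limit from uniform start:", [round(x, 6) for x in p_a])
print("limit from skewed start: ", [round(x, 6) for x in p_b])
print("limiting mean fitness:", lam_a)
```

Under these assumptions both runs settle on the same frequencies, and the limiting mean fitness equals $1 + s p_1$, which is simply $\sum_i w_i p_i$ for this choice of fitnesses.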
For finite Markov chains, $f_1 < 1$ only occurs if return to 1 is forbidden from some other states, which is hardly likely in the context of mutation. The analysis of the special case (2.3.10) does give a danger signal when, for convenience of modelling, we admit the possibility of infinitely many alleles. For example, it was noted in §1.4 that, in the electrophoretic picture, it is natural to regard the "alleles" as the integer points on the line, with the only nonzero $u_{ij}$ as
For this random walk, $f_1 = 1 - |u_+ - u_-|$, so that (2.3.13) has a solution for all $s$ only if $u_+ = u_-$. The situation is even worse when (2.3.17) is replaced by the corresponding values for a $d$-dimensional random walk, since then even the symmetric case has $f_1 < 1$ when $d \ge 3$.

In one sense, this is an artificial difficulty, since $k$ is always finite in real systems, and a model like (2.3.17) can always be truncated by introducing suitable boundaries at larger distances. But the lack of a solution to the infinite model will show itself even in the finite model because the distribution $(p_i)$ will be very thinly spread among the $k$ alleles.

It is therefore an interesting question to ask, given an infinite stochastic matrix $(u_{ij};\ i, j = 1, 2, \ldots)$ and a set of positive numbers $(w_i;\ i = 1, 2, \ldots)$, whether the equations (2.3.9) with $k = \infty$ admit a solution $(p_i)$ satisfying (2.1.1) with $k = \infty$. No necessary and sufficient condition is known, but useful sufficient conditions have been established by Moran [5], [6] and Kingman [5]. Thus it is sufficient that, for some $c$, it should be true that (i) $w_i \le c$ for all but a finite number of $i$, and (ii) there is a finite subset $B$ such that the maximal eigenvalue of the matrix $(w_i u_{ij};\ i, j \in B)$ is strictly greater than $c$. In genetical terms, conditions like this insist that a few alleles are sufficiently advantageous to prevent the population spreading itself too thinly over many different alleles.

2.4. The house of cards. The two special models for mutation which have been most deeply studied are the electrophoretic model (2.3.17) and, at the other extreme, that in which $u_{ij}$ does not depend on $i$ (for $j \ne i$). The latter is more appropriate when the alleles are supposed to be completely distinguished, and when, as in §1.4, the effect of mutation is to bring down the biochemical "house of cards" painfully built up by past evolution.
Whether or not this is regarded as a realistic picture of mutation, it does make a useful contrast to the random walk model, and has the great advantage of admitting fairly explicit analysis. Thus in this section we assume that the mutation matrix is given by
where the $u_j$ are strictly positive numbers with
$u$ is a measure of the overall mutation rate for each gamete, and $u_j/u$ is the probability that a mutant is of allele $A_j$. Substituting (2.4.1) into (2.3.4) we obtain
so that, writing
the gene frequencies $p_j(t)$ in the $t$th generation are given recursively by
If for a moment we regard $W_t$ as known, this linear recursion can be solved separately for each $j$, to give
where
Summing (2.4.5) over j gives the recursion
which determines the $\Pi_t$ and hence the $W_t$. In particular, $W_t \le w = \max_i w_i$, and hence so long as $w|z| < 1$, (2.4.7) can be expressed in the generating function form
which simplifies to
It is easy to check that the right-hand side of this equation is a rational function of $z$ whose nearest singularity to the origin is a simple pole at the unique real point $z$ in $0 < z < [(1 - u)w]^{-1}$ satisfying
This solution will be written $z = W^{-1}$. Expressing (2.4.8) in terms of partial fractions and expanding as a power series in $z$, we see that $\Pi_t \sim CW^t$ as $t \to \infty$, for some $C > 0$, and therefore
Since $W > (1 - u)w \ge (1 - u)w_j$, it follows from (2.4.5) that, as $t \to \infty$,
In other words, whatever the initial gene frequencies, the frequencies in the tth generation converge to those of a stable limiting distribution
the value of $W$ being exactly that which makes the $p_j$ satisfy (2.1.1). It should be noted that (2.4.11) expresses the equilibrium distribution of fitness (which assigns probability $p_j$ to the point $w_j$ in $[0, w]$ for each $j$) in terms of the mutant fitness distribution (which assigns probability $u_j/u$ to each $w_j$) and the mutation parameter $u$. In this interpretation the model is a special case of one studied in Kingman [11], where a new mutant is supposed to have a fitness drawn at random from a given (not necessarily discrete) distribution. For example, if the mutant fitness distribution has a density $f$ on [0, 1], the equilibrium fitness distribution has a density
where $W$ is chosen in $(1 - u, 1)$, if possible, so that
However, it can happen that no value of $W$ satisfies this condition, in which case the equilibrium fitness distribution has an atom of probability at the upper limit of mutant fitness. The only difficulty in using these formulae is in evaluating the limiting mean fitness $W$ from (2.4.9). Once this is done (perhaps approximately, using the fact that $u$ is small) the explicit form of the equilibrium gene frequencies shows very clearly the way in which selective differences affect the genetic structure of the population.

2.5. The diploid house of cards. The engagingly simple conclusions of the last section encourage the hope that it might be possible to adapt the treatment to a truly diploid model, in which the fitnesses are not supposed to take the multiplicative form (2.3.1). We therefore seek a general model without (2.3.1), in which a new mutant always exhibits a completely novel allele, never before encountered, so that in particular there is no recurrent mutation. In this situation the natural way to label the alleles is in the order of their arrival on the scene, taking those which appear between one generation and the next in random order. Thus we envisage an infinite sequence of possible alleles $A_1, A_2, \ldots$, and correspondingly an infinite array of fitnesses $w_{ij}$ satisfying (2.1.2). Because mutation is a random process, the $w_{ij}$ are random variables, and we have to model the random array $(w_{ij})$. Rather than proposing a specific model, it seems more convincing to postulate a simple symmetry property, which asserts that the joint distribution of the $w_{ij}$ would be the same if the alleles happened to arise in some other order. This is reasonable if any dependence between a mutant and its parent allele is ignored. More precisely, it will be assumed that, for any $n$ and any permutation $\pi$ of $\{1, 2, \ldots, n\}$, the joint distribution of the random variables $w_{\pi i, \pi j}$ ($i, j = 1, 2, \ldots, n$) does not depend on $\pi$.
This has been called weak exchangeability, and is a very much weaker property than exchangeability of the $n^2$ variables $w_{ij}$. (It involves invariance under a group of order $n!$ rather than $(n^2)!$.) However, an analogue of de Finetti's theorem has recently been established by Aldous [1], who shows that the most general infinite array, satisfying $w_{ij} = w_{ji}$ and the weak exchangeability property, is of the form
for some function $F$, symmetric in the second and third arguments, where the random variables $\xi$, $\xi_i$ ($i = 1, 2, \ldots$) and $\varepsilon_{ij}$ ($i, j = 1, 2, \ldots$) are independent (except for the constraint $\varepsilon_{ij} = \varepsilon_{ji}$) and uniformly distributed on the interval (0, 1).
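The representation just described is straightforward to sample from. The sketch below is an illustration only: it assumes the array is generated as $w_{ij} = F(\xi, \xi_i, \xi_j, \varepsilon_{ij})$ with independent uniform variables, and the particular function `example_F` is a hypothetical choice, not one taken from the text.

```python
import random

def sample_fitness_array(n, F, rng):
    """Sample an n x n symmetric array w[i][j] = F(xi, xi_i, xi_j, eps_ij).

    xi is shared by the whole array, xi_i belongs to allele i, and
    eps_ij = eps_ji belongs to the unordered pair {i, j}.
    """
    xi = rng.random()
    xi_allele = [rng.random() for _ in range(n)]
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            eps = rng.random()  # one draw per unordered pair enforces eps_ij = eps_ji
            w[i][j] = w[j][i] = F(xi, xi_allele[i], xi_allele[j], eps)
    return w

def example_F(xi, x, y, eps):
    """Hypothetical F, symmetric in its second and third arguments and non-negative."""
    return xi * 0.5 * (x + y) * (0.9 + 0.2 * eps)

rng = random.Random(1)
w = sample_fitness_array(5, example_F, rng)
for row in w:
    print([round(v, 3) for v in row])
```

Because `example_F` is symmetric in its second and third arguments and each pair shares one $\varepsilon$ draw, the sampled array is automatically symmetric; relabelling the alleles only permutes independent, identically distributed inputs, which is the weak exchangeability property.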
This remarkable and deep result can for our purposes be somewhat simplified, since the variable $\xi$ runs through all the alleles, and takes only one value throughout the evolution of the population. The fact that it might take other values in other realisations can never affect the population, so that $\xi$ can be taken as constant and absorbed into the function $F$. Hence it will be assumed that fitnesses are of the form
for some function $F\colon (0, 1)^3 \to [0, \infty)$ which is symmetric in its first two arguments. One can now argue as follows, assuming as throughout this chapter that the population is large enough for random fluctuations to be ignored. Suppose that in a particular generation the $\xi$-values of the genes in the gamete pool have an empirical distribution approximated (the population being large) by a probability density $p(x)$ ($0 < x < 1$). The individuals in the next generation inherit a $\xi$-value from each gamete, and so may be described by a pair of $\xi$-values $(\xi', \xi'')$ whose joint distribution, assuming random mating, is $p(x)p(y)$. Individuals with this description have on the average a fitness $K(\xi', \xi'')$, which by (2.5.2) is given by
so that the joint density in the mature population is
where
Thus, were it not for mutation, the $\xi$-values in the next gamete pool would have density
but mutation at rate $u$ (remembering that the mutant $\xi$-values are uniformly distributed) replaces this by $(1 - u)\tilde{p}(x) + u$. Hence, assuming equilibrium, we have the equation
which together with (2.5.4) is a nonlinear integral equation determining $p$ in terms of $K$ and $u$. Of course, $p$ is not a function of direct interest because the $\xi$-values have no biological significance, but once $p$ is known it determines via (2.5.2) the distribution of quantities defined in terms of the fitnesses. When $K$ is of the form $K(x, y) = k(x)k(y)$ the analysis reduces in effect to that of §2.4. Other special forms of $K$ would be relevant to other biological contexts. For example, recessive selection in which $w_{ij} = \max(w_{ii}, w_{jj})$ implies that $K(x, y) = \max\{k(x), k(y)\}$ for some function $k$, with corresponding simplification to (2.5.5). In the presence of (2.5.5), equation (2.5.4) reduces to
It is convenient to write (2.5.5) as
so the problem is to solve (2.5.7) for $p$ and $\sigma = (1 - u)W^{-1}$ subject to the condition (2.5.6). Note that, when $u = 0$, (2.5.7) is the continuous analogue of (2.1.8), and it might be conjectured to have similar properties. Thus if $W$, regarded as a quadratic form in $p$, attains its supremum over all probability densities at a smooth strictly positive density $p_0$, then $p_0$ satisfies (2.5.7) with $u = 0$, and a perturbation technique would presumably yield a solution $p_u$ as a power series in $u$ for sufficiently small $u$ (and $u$ is always small in biological reality). On the other hand, when $W$ does not have such a smooth maximising $p_0$, as in the multiplicative case, such a simple argument would fail, and the dependence of $p$ on $K$ would be singular at $u = 0$. This seems a possibly fruitful area for future research.

It should be stressed that, in the generality contemplated here, the functions $F$ and $K$ may be very irregular in behaviour. Nevertheless, one could consider more restrictive models in which the fitnesses take the form (2.5.2) and the $\xi_i$ do have some biological meaning. An example of this (complicated by random environmental fluctuations) is the "SAS-CFF" model of Gillespie [2].

2.6. The resistance of polymorphisms to mutation. Suppose a polymorphism at a single locus is in equilibrium under selective forces alone, as in §2.1. Thus alleles $A_1, \ldots, A_k$ are present in frequencies $p_1, \ldots, p_k$, and the fitnesses $w_{ij}$ satisfy
for all $i$, where $W$ is the mean fitness. Suppose for definiteness that the $w_{ij}$ lie
between 0 and 1, and that $W$ (which is the maximum of the quadratic form $\sum_{i,j} w_{ij} x_i x_j$ over all $x$ in $\Delta$) is only slightly below the upper limit 1. This is likely when $k$ is large, because for instance $W$ cannot be less than the maximum homozygotic fitness $\max_i w_{ii}$. Now suppose that mutation introduces a new allele $A_0$. Either this will die out again, or it will establish itself and in doing so create a new polymorphism with $(k + 1)$ or fewer alleles. Let us estimate (rather crudely) the probability $P$ that the latter event takes place, so as to get an idea of how vulnerable the original polymorphism is to mutation. If the genotype $A_0A_j$ has fitness $w_{0j}$, the mean fitness of all the individuals containing the mutant (assuming that the initial population frequency of $A_0$ is negligible) is
Hence the selective advantage of $A_0$ is initially
A celebrated result of Fisher (see, for instance, Ewens [1], §7.1) indicates that the probability that $A_0$ survives, and takes its place in a new equilibrium, is approximately $P = 2\varepsilon^+$, where $\varepsilon^+ = \max(\varepsilon, 0)$. Hence we arrive at the approximate formula
To see how this probability depends on the parameters of the polymorphism, think of the $w_{0j}$ as being independent random variables, drawn from some distribution on (0, 1), and replace $P$ by its expectation $\bar{P}$. For any random variable $y$, and any $\theta > 0$,
so that
If therefore $w = w_{0j}$ has
we arrive at the inequality
Moreover, by analogy with the large deviation theorem of Chernoff [1], one would expect this upper bound to be reasonably sharp. It is a general consequence of the form of (2.6.2) that $\bar{P}$ is least when all the $p_j$ are equal (see Appendix 1), so that the best-defended polymorphisms with given $W$ and $k$ are those with $p_j = k^{-1}$ for all $j$. For these the right-hand side of (2.6.3) is
Hence $\bar{P}$ decays to zero as $e^{-\gamma k}$, where
This is of course a very crude argument indeed, but it does perhaps suggest that large polymorphisms, once established in stable equilibrium, are exponentially more difficult to overturn by mutation. The detailed justification or refutation of this suggestion is left to future research.
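The suggested exponential decay can be probed by a crude Monte Carlo experiment. The sketch below rests on assumptions that should be stated plainly: it takes the selective advantage to be $\varepsilon = \sum_j w_{0j} p_j - W$ and the invasion probability to be $2\varepsilon^+$ (the displayed formulae are not legible in this copy), it draws the $w_{0j}$ uniformly on (0, 1), and it uses equal frequencies $p_j = 1/k$. The value $W = 0.7$ is chosen only to make the effect visible with a modest sample size, and all names are hypothetical.

```python
import random

def invasion_probability(w0, p, W):
    """Approximate invasion probability 2 * max(eps, 0), eps = sum_j w0[j] p[j] - W (assumed)."""
    eps = sum(a * b for a, b in zip(w0, p)) - W
    return 2.0 * max(eps, 0.0)

def expected_invasion_probability(k, W, samples, rng):
    """Monte Carlo estimate of the expectation over uniform(0,1) mutant fitnesses."""
    p = [1.0 / k] * k  # equal frequencies, the best-defended case
    total = 0.0
    for _ in range(samples):
        w0 = [rng.random() for _ in range(k)]
        total += invasion_probability(w0, p, W)
    return total / samples

rng = random.Random(2)
W = 0.7
estimates = {k: expected_invasion_probability(k, W, 20000, rng) for k in (2, 8, 32)}
for k in sorted(estimates):
    print("k =", k, " estimated mean invasion probability =", estimates[k])
```

The estimates fall steeply as $k$ grows, in line with the heuristic argument, although a simulation of this kind says nothing about the exact decay rate.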
CHAPTER 3
The Neutral Alternative

3.1. Evolution in the absence of selection. Models like those described in the last chapter show that there is no necessary conflict between selection and diversity. Selective pressure alone can maintain a balanced polymorphism when heterozygotes are sufficiently favourable, and even when the selective effects are such as to reduce diversity, these can come into balance with the opposite tendency of mutation.

On the other hand, such selective models are not immune from criticism. A stable polymorphism at a single locus requires the fitnesses to satisfy the strong condition (2.1.9), for which there is no clear biological reason. If mutation is an important factor, formulae like those of §2.4 show that selective differences must be small so as not to swamp the comparatively small mutation rates observed in reality. More generally, selective models fall under suspicion because they can too easily explain any observed situation; for example, any set of frequencies $p_j$ can be realised in a balanced polymorphism by a suitable choice of the fitnesses $w_{ij}$. (It suffices to take $w_{ij} = 1$ for $i \ne j$ and $w_{ii} = 1 - ap_i^{-1}$ for small $a$.)

Partly for these reasons, there has been a tendency for some geneticists (see, for example, Kimura [3]) to seek to explain diversity without appeal to selection. More precisely (since no one claims that differences in fitness have no genetical component) it is maintained that at many loci the observed genetic differences contribute little to the overall fitness of the organism. Hence in this chapter we explore the models which have been constructed for the evolution of populations in which selection plays a negligible role. These models all involve mutation and assume that a mutant allele is equally favourable with the existing alleles. This is not true in practice, since most mutants are deleterious.
In effect what we are assuming is that every mutant is either "good", in the sense of being as fit as the existing alleles, or "bad", in that it is so much less fit that its contribution to future generations can be ignored. Thus a bad mutation is here being treated as a death. A more realistic treatment of deleterious mutations will be sketched in Chapter 4 (see Li [3] for a more extended discussion).

All the analysis of Chapter 2 can be specialised to the neutral situation simply by setting the letter $w$ (in lower case or capitals, with or without affixes) equal to 1 whenever it appears, and everywhere this reduces the argument to complete triviality. In §2.1, for example, the evolution equation becomes $p_i' = p_i$, so that gene frequencies apparently remain constant from generation to generation. But this cannot be realistic even in the absence of mutation; if the population evolves for long enough there will be chance fluctuations with no stabilising mechanism to counteract them. Eventually this may cause some alleles to die out, so that diversity will be reduced not by selection but by randomness (or "genetic drift" as the unfortunate biological nomenclature has it). This reduction takes place only slowly in large populations, and it therefore makes sense to imagine it being counteracted by mutation. Hence there is a need to understand a balance between mutation and the random effects which come from the finiteness of the populations considered. Models for this balance are of necessity stochastic, and it turns out that comparatively sophisticated tools of probability theory are needed to formulate and manipulate them with efficiency and economy.

3.2. A general model for mutation in finite populations. Let us explore the effect of introducing mutation into the Wright-Fisher model of §1.2. A population of fixed size $N$ was there postulated, so that a particular generation $G_t$ consists of $N$ diploid individuals. Consider a single locus at which the possible alleles are $A_i$ ($i \in S$), labelled by the elements of a countable set $S$ (which for later convenience will not be assumed finite). As far as this locus is concerned, the genetical structure of the population is then specified by $2N$ elements, not necessarily distinct, of the set $S$, the corresponding $A_i$ being the genes carried by the $N$ individuals. As far as the production of the next generation is concerned, the order in which these elements of $S$ appear is irrelevant, and it helps the mathematics to list them in random order as $X_1(t), X_2(t), \ldots, X_{2N}(t)$. Thus, if the $2N$ genes are sampled without replacement, the $r$th to be drawn is $A_i$, where $i = X_r(t)$. Each gene in the daughter generation $G_{t+1}$ comes from a copy of one of the form $A_i$, where $i$ is equally likely to be any one of the $X_r(t)$. Moreover, this random choice is independent, as between one gene and another. (These assertions are equivalent to those of the Wright-Fisher model, with no selection.)
However, mutation ensures that this daughter gene is of the form $A_j$ with probability $u_{ij}$ (as in §1.4). Hence, conditional on $G_t$, the random elements $X_r(t + 1)$ of $S$ are independent, with a common conditional distribution
(To ease the notation we sometimes write $u(i, j)$ for $u_{ij}$.) The distribution of a typical member of the $t$th generation is given by the probabilities
and because of (3.2.1),
Thus $\pi_t(j)$ is determined by recursion on $t$ by the equation
and this is just the equation for the successive distributions in a Markov chain with transition matrix $(u_{ij})$. Suppose that this transition matrix is irreducible, aperiodic and positive-recurrent (the last being trivially true if $S$ is finite). Then the limits
exist and form a probability distribution on $S$, and do not depend on the initial conditions. The random variables $X_r(t)$ for different $r$ are not independent. Consider the joint distribution
of two members of $G_t$; by the symmetry of the definition of the $X_r$ this does not depend on $r$ and $s$ so long as $r \ne s$. By (3.2.1) and conditional independence,
on splitting the double summation into the two cases $\alpha \ne \beta$ and $\alpha = \beta$. Hence
The presence of the factor 1 - 1/2N < 1, together with (3.2.4), make it easy to
see that the limits
exist, and form a probability distribution on the set $S^2 = S \times S$ of pairs $(j_1, j_2)$. They are determined uniquely (supposing the $\pi(j)$ are known) by the equation (3.2.6) with the suffix $t$ removed. To take the simplest possible example, in which there are only two alleles $A_1$ and $A_2$, and in which $u_{12} = u_{21} = u$, equation (3.2.6) has the solution
In more complicated cases, the solution in explicit terms of (3.2.6) may be very difficult. The argument leading to (3.2.6) generalises to the higher joint distributions
The limits
exist, and are determined by analogues of (3.2.6) whose algebraic complexity increases rapidly with $n$. For each $n$, these equations are to be solved for the $\pi(j_1, \ldots, j_n)$ in terms of $\pi(j_1, \ldots, j_m)$ for $m < n$. The recursion on $n$ ends when $n = 2N$, when it gives the joint distribution of all the $X_r$ in the limit as $t \to \infty$.
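For the two-allele example the pair distribution can be obtained by simply iterating the recursion until it settles. The sketch assumes that (3.2.6) has the form suggested by the derivation above: two distinct daughter genes share a parent gene with probability $1/2N$, and otherwise descend from two distinct parent genes, with mutation acting independently on each copy. The function names and parameter values are illustrative.

```python
def marginal_step(pi1, u):
    """One step of the single-gene distribution: pi1'(j) = sum_a pi1(a) u[a][j]."""
    n = len(u)
    return [sum(pi1[a] * u[a][j] for a in range(n)) for j in range(n)]

def pair_step(pi2, pi1, u, N):
    """One step of the pair distribution (assumed form of (3.2.6))."""
    n = len(u)
    c = 1.0 / (2 * N)  # probability that both genes copy the same parent gene
    new = [[0.0] * n for _ in range(n)]
    for j1 in range(n):
        for j2 in range(n):
            different = sum(pi2[a][b] * u[a][j1] * u[b][j2]
                            for a in range(n) for b in range(n))
            same = sum(pi1[a] * u[a][j1] * u[a][j2] for a in range(n))
            new[j1][j2] = (1.0 - c) * different + c * same
    return new

N = 1000
mu = 0.0005                            # symmetric mutation rate u12 = u21
u = [[1.0 - mu, mu], [mu, 1.0 - mu]]

pi1 = [0.5, 0.5]                       # stationary single-gene distribution
pi2 = [[0.25, 0.25], [0.25, 0.25]]     # arbitrary starting pair distribution
for _ in range(20000):
    pi2 = pair_step(pi2, pi1, u, N)
    pi1 = marginal_step(pi1, u)

limit_12 = pi2[0][1]
approx_12 = 2 * N * mu / (1 + 8 * N * mu)
print("iterated pi(1,2):", limit_12, " large-N approximation:", approx_12)
```

With these values $Nu = 0.5$, and the iterated limit agrees closely with the approximation $2Nu/(1 + 8Nu)$ quoted in the following paragraph.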
This algebraic programme is quite impracticable, but some advantage can be taken from the fact that, in most cases of biological interest, the mutation rates $u_{ij}$ ($i \ne j$) are small while the population size $N$ is large. In the simple case (3.2.8) a good approximation is $\pi(1, 2) = 2Nu/(1 + 8Nu)$, which depends only on the product $Nu$. One may ask whether there is more generally an approximation to (3.2.6) and its higher order analogues when $N$ is large and the $u_{ij}$ are of order $N^{-1}$. This question can be answered by supposing that we can write
for some stochastic matrix $(q_{ij})$ and some parameter $u$ of order $N^{-1}$. Substituting this into (3.2.6) and ignoring terms of order $N^{-2}$, we arrive at the equation
Here $\delta_{ij} = \delta(i, j)$ is the Kronecker delta, and we adopt the conventional notation
Although perhaps (3.2.12) does not look much simpler than (3.2.6), the approximation yields dividends with the higher order equations; the limit (3.2.10) satisfies
where $\nu_a$ is the number of $b \ne a$ with $j_b = j_a$.¹ Methods for extracting useful information from these elaborate equations will be considered in the next three sections. However, there is one general consequence which is worth noting here (Kingman [10, §5]). Equation (3.2.14) solves to give, by induction on $n$, a probability distribution $\pi_n$ on the set $S^n$ of $n$-tuples, and this for every $n \ge 1$ since the restriction $n \le 2N$ evaporates as $N \to \infty$. Because of the symmetric definition of the $X_r$ this distribution is exchangeable in the sense that $\pi(j_1, j_2, \ldots, j_n)$ is a symmetric function of its $n$ arguments. Moreover, the $\pi_n$ ($n = 1, 2, \ldots$) form a consistent family, in the sense that
Hence de Finetti's theorem applies, to show that there is a family of random variables $p(j)$, $j \in S$, satisfying
1
More precisely, what can be shown is that the limits in (3.2.10) themselves converge as $N \to \infty$ to quantities satisfying (3.2.14).
and such that, for any $n$,
The methods of Kallenberg [1] or of Kingman [4] may be used to show that the joint distribution of the $p(j)$ is the limiting distribution (as $N \to \infty$ with $\theta = 4Nu$ fixed) of the empirical distribution of the $2N$ variables $X_r$. We return to this interpretation in §3.7.

3.3. The random walk case. The analysis of the last section allows a quite general transition matrix $(u_{ij})$ describing mutation, and it is not surprising that progress is impossible at this level of generality. To be more specific we must make some detailed assumption about the mutation rates. In §1.4 it was noted that an "electrophoretic" picture suggested that the alleles be labelled by the set $Z$ of (positive and negative) integers, with
and all other $u_{ij}$ zero. Unfortunately this transition matrix is not positive-recurrent, but we can recover this property by truncating $Z$ to the set $Z_m = \{0, 1, 2, \ldots, m - 1\}$ made into an additive group by addition modulo $m$. (Or equivalently, winding the integers round a circle of circumference $m$.) When $m$ is large, the distinction between $Z$ and $Z_m$ will be insignificant. The refinement by which $d$ independent electrophoretic measurements are made leads in an obvious way to the choices $S = Z^d$ or $S = Z_m^d$, the respective direct products of $d$ copies of $Z$ or $Z_m$. It is then natural to generalise (3.3.1) at least to the extent of supposing that $u_{ij}$ depends only on the (componentwise) difference $j - i$.

Thus it might be supposed that various biological interpretations could be modelled by taking $S$ to be a finite abelian group (written additively) with the assumption that $u_{ij}$ depends only on $j - i$. If this is so, then $u = 1 - u_{ii}$ does not depend on $i$, and there is a probability distribution $q_j$ ($j \in S$) such that (3.2.11) holds in the special form
We describe this as the "random walk" case, and will show how (3.2.14) can to some extent be analysed by the Fourier transform in the group $S$. Before doing this, we note that the random walk case covers a simple model of the fine structure of the gene. If indeed a gene is a long word, we can label the word at some initial moment as a string of $d$ zeroes ($d$ being the word length), and indicate a mutation of one letter as a change from 0 to 1. If the effect of multiple mutations of a single letter is ignored, the alleles are then words in the letters 0 and 1, so that $S = Z_2^d$. Moreover, a natural assignment of mutation rates is (3.3.2), where $q_j = d^{-1}$ if $j$ has the symbol 1 exactly once, and
Thus an electrophoretic model has $S = Z_m^d$ for large $m$ and small $d$, while the more detailed picture may be modelled by the same $S$, but with $m = 2$ and large $d$. The trivial case (3.2.8) corresponds to $m = 2$, $d = 1$. Every finite abelian group possesses a Fourier transform, but we shall for simplicity confine attention to the case $S = Z_m^d$, the direct product of $d$ copies of $Z_m$. If $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_d)$ and $j = (j_1, j_2, \ldots, j_d)$ are typical elements of $S$, write
for their inner product, and define the Fourier transform of $\pi_n$ by the formula
Knowledge of $\hat{\pi}(\alpha_1, \ldots, \alpha_n)$ for all $\alpha_1, \ldots, \alpha_n \in S$ determines $\pi$ by the inversion formula
Applying (3.3.4) to (3.2.14) and simplifying, we obtain the equation
which allows the $\hat{\pi}$ to be expressed in terms of
To do this it is convenient to write
then
and so on, so long as the a-variables sum to zero (modulo m). If V" = ] aa ^ 0 in 5, then 77(011, • • • , a,,) = 0, as is evident from the invariance of the problem under translations (i.e., actions of 5 on itself). As a very simple example of the use of these results, consider the "effective number of alleles" v. This is defined by requiring that v~l be the probability that two genes drawn at random display the same allele: Using (3.3.5) and (3.3.9), we obtain
where $\hat q_1(\alpha) = \tfrac{1}{2}[\hat q(\alpha) + \hat q(-\alpha)]$ is the Fourier transform of the symmetrised form
of $q_j$. Hence
so that finally
where
is the probability of being at the origin after $n$ steps of a symmetric random walk on $S$ with transition probabilities (3.3.11). It is perhaps somewhat unsatisfactory that the electrophoretic model is approximated by a random walk on $Z_m$ for large $m$. In fact, the whole analysis can be carried out directly on $Z$, so long as attention is confined to the relative differences of the $X$-variables. The details may be found in Kingman [4]. Also in that paper is an inequality about the genealogy of the Wright-Fisher process which is of some interest in itself (see Appendix 2). Equations (3.3.9) and (3.3.12) remain valid for random walks on $Z$ and $Z^d$, as can be seen either by the direct argument or by letting $m \to \infty$. For example, the electrophoretic model with $u_+ = u_-$ has
and substituting into (3.3.12) gives
The calculations are more difficult for $d$-dimensional random walks. For symmetric walks it is easy to see (cf. Kingman [7]) that
which gives a lower bound for $\nu$. When $d$ is large this lower bound is near the universal upper bound
so that, to a good approximation, $\nu \approx 1 + \theta$ for large $d$. An intermediate case has been suggested by M. Turelli. Consider a random walk on $Z^d$ in which the step distribution is given by with the nonzero component in other than the first place otherwise. For large $d$, the $n$-step probability of return to the origin is given approximately by
and substitution gives
lying between (3.3.13) and the upper bound (3.3.14).
3.4. The frequency spectrum. Returning to the general model of §3.2, let $f$ be any continuous function on the closed interval [0, 1], and consider the functional which sends $f$ into
Because of (3.2.15), this is a positive linear functional which takes the value 1 when $f = 1$, and is hence of the form $\int_0^1 f(x)\,\mu(dx)$ for some probability measure $\mu$ on [0, 1]. If $\mu$ has a density, this is conventionally written as $x\Phi(x)$, where $\Phi$ is a positive function with
Setting g(x) = xf(x), we then have the identity
The function $\Phi$ has been of considerable interest, and is often called the frequency spectrum. It may be interpreted by saying that $\Phi(x)\,dx$ is the probability that there is an allele whose relative frequency in a large population lies between $x$ and $x + dx$. (In the language of point processes, it is the first moment density, or intensity, of the $p(j)$.) If in (3.2.16) all the $j_\alpha$ are set equal, we obtain
Hence, using (3.2.14) and the symmetry of $\pi$, we have
and so
Now, using (3.2.16) again,
where $\psi_j(x)$ is the conditional expectation of $\sum_i p(i)q(i, j)$, given that $p(j) = x$. To proceed further, we assume that the function $\psi_j$ does not depend on $j$. (This is true in all the examples of §3.3, because of group invariance.) Then the first term on the right-hand side of (3.4.3) is
so that we arrive at the identity
If this is written as
it becomes a recurrence relation expressing the $h_n$ in terms of the $k_n$; remembering that $h_1 = 1$, and noting that $k_0 = 1$, we obtain
When we change the variables from $(u, v)$ to $(x, v)$, where $x = 1 - u + uv$, this becomes
Since this holds for all $n \ge 1$, we must have
Differentiating this integral equation converts it into a differential equation for $\Phi$, whose general solution is easily verified to be
where $C$ is a constant of integration, whose value may be determined by (3.4.1). Equation (3.4.6) is essentially equation [11] of Kimura and Ohta [2]. Their argument is difficult to make rigorous, since it applies results of one-dimensional diffusion theory to the non-Markovian components of multi-dimensional diffusions. (Why the invalid argument should lead to the right answer is a question which might repay further thought.) The analysis given here can incidentally be arranged so as to prove that the measure $\mu$ is absolutely continuous, thus establishing the existence of $\Phi$ under the homogeneity assumption $\psi_j = \psi$. Of course, (3.4.6) only determines the frequency spectrum in terms of the regression function $\psi$ of $\sum_i p(i)q(i, j)$ on $p(j)$, and this function is no easier to evaluate than $\Phi$ itself. But properties of $\Phi$ can be read off from the equation. For example, since $\psi \ge 0$,
which limits the singularities which $\Phi$ can have. On the other side, if $q(i, i) = 0$ and $q(i, j) \le q$ for all $i$, $j$ and some constant $q$, then $\psi(x) \le q(1 - x)$, which implies that
Substituting these two inequalities into (3.4.1) gives
which may be substituted back to give upper and lower bounds for $\Phi$ which are quite sharp when $q$ is small.
This whole analysis is based on that of §3.2, in which the matrix of mutation rates was assumed positive-recurrent. Thus it does not apply directly to random walks on $Z$ or $Z^d$, but the methods of Kingman [4] enable a very similar analysis to be carried out, and (3.4.6) remains valid. Kimura and Ohta [2] have some numerical results for this case (when $d = 1$ and $u_+ = u_-$), which suggest for example that $\psi(y)$ has a nonzero limit as $y \to 0$. This is surprising, since it would imply (from (3.4.6), or from the fact that $k_0 = 1$) that
which Kimura and Ohta interpret, perhaps correctly, as a severe limitation on the genic variety to be expected. A rigorous proof or disproof of (3.4.10) for random walk models with infinitely many alleles would be of some interest. The quantities $h_n$ have a useful interpretation, since $h_n$ is the probability that all the genes in a sample of size $n$ are of the same allele. For example, $h_2$ is the quantity written as $\nu^{-1}$ in §3.3. The inequalities (3.4.7)-(3.4.9) imply bounds for the $h_n$, but it is more efficient to use (3.4.4) directly. If, as before, $\psi(x) \le q(1 - x)$, then
Hence
and
leading to the inequalities
which again are fairly sharp if $q$ is small (and $n$ not too large).
3.5. The Ewens sampling formula. Although the discussion of the last two sections encourages the hope that at least some cases of the general mutation model may be amenable to calculation, the algebraic complexities are clearly formidable. Moreover, as usual in applied mathematics, a result, in order to be useful, must not be too sensitive to the detailed assumptions built into the model; one therefore seeks results with a degree of robustness. Looking back with this in mind, note first the suggestion that the expression (3.3.12) for the effective number of alleles takes a value near $(1 + \theta)$ if returns
to the origin in the random walk (3.3.11) are neglected. Since $\nu^{-1} = h_2$ this is consistent with (3.4.11), which implies that $h_2$ differs from $(1 + \theta)^{-1}$ by at most $q(\theta/(1 + \theta))^2$. More generally, (3.4.11) suggests the approximation
which is robust certainly in the sense that it holds in the random walk case whenever all the $q(i, j)$ ($i \ne j$) are small. In fact, more can be said than this, since the approximation holds good for the general model of §3.2 and not just in the random walk case. To see this, set all the $j_\alpha$ in (3.2.14) equal and sum over $j$, to obtain after simplification
This immediately yields the left-hand inequality of (3.4.11). Now suppose that
Then the sum in (3.5.2) is at most q, so that
Hence
These are very crude inequalities (the upper bound always worse than that in (3.4.11)), but they do show that a condition like (3.5.3), with $q$ small, makes the approximation (3.5.1) a good one. More refined analysis shows that (3.5.3) is an unnecessarily strict requirement. What is really happening is that return to the initial state in the Markov chain with transition probabilities $q(i, j)$ is being ignored. Biologically, the assumption is that recurrent mutation, in which a mutational event produces a previously occurring allele, is to be neglected. This would certainly be unrealistic at the electrophoretic level (at least in one dimension), but it accords well with the detailed structure of a gene when different alleles are well discriminated.
If a sample of $n$ genes is taken from the population, one may talk about its allelic partition. Certain alleles will be present as singletons, represented only once; other alleles will be represented by two, three or more genes. If $a_r$ is the number of alleles represented $r$ times ($r = 1, 2, \ldots, n$), then the nonnegative
integers $a_r$ satisfy
and the collection
contains all the information given by the sample unless there is some meaningful way of labelling the alleles. In a common mathematical notation, (3.5.6) represents the partition
of the integer n. To take a concrete example: the data of Singh, Lewontin and Felton [1] can be summarised in the partition
of the sample size
(cf. Watterson [7]). The composition of the sample will be random, both because of the sampling process and because the population is itself subject to random fluctuations. Hence $a$ is a random partition of $n$, and has a distribution $(P_n(a);\ a \in \varpi_n)$, where $\varpi_n$ is the (finite) set of partitions of $n$. In particular, if $a$ is the partition with $a_r = 0$ ($r \le n - 1$) and $a_n = 1$, then $P_n(a)$ is just the probability $h_n$, for which (3.5.1) provides an approximate expression when recurrent mutation is neglected. What is remarkable is that this can be generalised to give in explicit terms a similar approximation for $P_n(a)$ for any partition $a$:
This, the Ewens sampling formula, was established by a partly heuristic argument in Ewens [2]; his argument was made rigorous by Karlin and McGregor [2]. An alternative approach using diffusion theory will be mentioned in the next section. It is also possible to derive (3.5.8) from (3.2.14) in the same way that we have dealt with the special case hn. Though complicated, this can be made to provide explicit error bounds.
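As a minimal numerical sketch, assume (3.5.8) takes the standard form of the Ewens sampling formula, $P_n(a) = \frac{n!}{\theta(\theta+1)\cdots(\theta+n-1)} \prod_r \frac{\theta^{a_r}}{r^{a_r}\, a_r!}$. The function names `partitions` and `ewens_probability`, and the representation of a partition as a dictionary `{r: a_r}`, are illustrative choices and not notation from the text. Exact rational arithmetic is used so that the checks are free of rounding error.

```python
from fractions import Fraction
from math import factorial


def partitions(n, largest=None):
    """Yield every partition of n as a dict {r: a_r} (a_r parts of size r)."""
    if largest is None:
        largest = n
    if n == 0:
        yield {}
        return
    # Choose the largest part r first, then partition the rest into parts <= r.
    for r in range(min(n, largest), 0, -1):
        for rest in partitions(n - r, r):
            a = dict(rest)
            a[r] = a.get(r, 0) + 1
            yield a


def ewens_probability(a, theta):
    """Ewens probability (assumed standard form) of the partition a = {r: a_r}."""
    theta = Fraction(theta)
    n = sum(r * a_r for r, a_r in a.items())
    rising = Fraction(1)
    for i in range(n):                      # theta (theta+1) ... (theta+n-1)
        rising *= theta + i
    prob = Fraction(factorial(n)) / rising
    for r, a_r in a.items():
        prob *= theta ** a_r / (r ** a_r * factorial(a_r))
    return prob


theta = Fraction(3, 2)
total = sum(ewens_probability(a, theta) for a in partitions(6))
print(total)                                # the probabilities sum to 1

# The one-block partition (a_n = 1) gives the probability h_n that all
# n genes are of the same allele, here for n = 5.
h5 = ewens_probability({5: 1}, theta)
print(h5)
```

Under this assumed form, the one-block probability reduces to the product of $r/(r + \theta)$ over $r = 1, \ldots, n - 1$, which for $n = 2$ is the value $(1 + \theta)^{-1}$ discussed above.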
Much of the importance of (3.5.8) resides in its lack of dependence on the underlying model (except through the single parameter $\theta$). This aspect is emphasised if the formula is derived from suitable axioms rather than as the result of extended calculation. For example, it is shown in Kingman [9] that if (i) for each $n$, $P_n$ is a strictly positive probability distribution on $\varpi_n$ which (as between different values of $n$) satisfies a natural consistency condition (cf. §3.7); and if (ii) given that an allele chosen at random from the sample of size $n$ has $r$ representatives, the partition of the remaining $(n - r)$ genes has conditional distribution $P_{n-r}$, then $P_n$ is given by (3.5.8) for some $\theta$. Another characterisation has recently been given by F. P. Kelly (private communication). He imagines the genes to be sampled sequentially, and assumes that, if a newly sampled gene has the same allele as one already represented, the probability that it is a particular allele is proportional to the frequency of that allele in the sample. It turns out that this implies that $P_n$ can be expressed as a mixture of the expressions (3.5.8) for different values of $\theta$ (Bayesians please note). Combining all these different approaches, it can be said with some confidence that the Ewens sampling formula is reliable when (a) the size of the population is large compared to $n$, and the expected total number of mutations per generation is moderate (differing from $\theta$ only by a numerical factor), (b) the population is in statistical equilibrium under mutation and genetic drift, with selection at the locus playing a negligible role, and (c) mutation is nonrecurrent, so that every mutant allele is a completely novel one. The formula may be used in a number of different ways. For example, if observed values of the $a_r$ are substituted, (3.5.8) as a function of $\theta$ is the likelihood function; omitting a constant,
where
is the number of different alleles in the sample and is clearly a sufficient statistic for $\theta$. Standard methods enable estimates and confidence intervals for $\theta$ to be derived in terms of $k$. One can also try to use the Ewens formula as the basis of a test of goodness of fit, to judge whether assumptions (a), (b) and (c) are consistent with experimental data. This is a much more difficult matter, to which we shall return in §3.8.
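One way to make the estimation step concrete is the following sketch. It assumes that the likelihood, up to a constant, is $\theta^k / [\theta(\theta+1)\cdots(\theta+n-1)]$, which is what the standard form of the Ewens formula gives. Setting the derivative of the log-likelihood to zero then yields $k = \sum_{i=0}^{n-1} \theta/(\theta + i)$, so that the maximum likelihood estimate equates the observed number of alleles to its expected value. The function names and the bisection solver are illustrative choices, not the text's method.

```python
from math import log


def expected_alleles(theta, n):
    """Expected number of distinct alleles in a sample of n (assumed Ewens form)."""
    return sum(theta / (theta + i) for i in range(n))


def log_likelihood(theta, k, n):
    """Log-likelihood of theta, up to an additive constant."""
    return k * log(theta) - sum(log(theta + i) for i in range(n))


def mle_theta(k, n, iterations=200):
    """Solve expected_alleles(theta, n) = k for theta by bisection."""
    if not 1 < k < n:
        raise ValueError("a finite positive estimate needs 1 < k < n")
    lo, hi = 1e-12, 1.0
    while expected_alleles(hi, n) < k:      # expand until the root is bracketed
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if expected_alleles(mid, n) < k:    # expected_alleles increases with theta
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


theta_hat = mle_theta(k=5, n=20)
print(theta_hat, expected_alleles(theta_hat, 20))
```

The cases $k = 1$ and $k = n$ are excluded because the likelihood is then maximised at the boundary ($\theta \to 0$ and $\theta \to \infty$ respectively).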
I close with a mathematical remark, which I owe to W. Ledermann (see also Watterson [4]). According to a theorem of Cauchy, the combinatorial part
of (3.5.8) is just the number of permutations of $n$ objects in which the cycle representation has, for each $r$, exactly $a_r$ cycles of length $r$. Thus if such a permutation is chosen at random (each permutation having the same probability $1/n!$ of being selected) and if the partition of $n$ corresponding to its cycle representation is computed, this random partition has the Ewens distribution with $\theta = 1$. The connection between the algebra and the biology is obscure, though the characterisation theorem of Kingman [9] can be used to give a simple proof of Cauchy's theorem.
3.6. The Poisson-Dirichlet distribution. The well-informed reader may be surprised that this point has been reached with only passing reference to the diffusion approximations which form so large a part of the literature of mathematical population genetics. That these methods have been set aside is partly because they need no further advertisement from me, having been very persuasively advocated by Kimura [2] and many others. Partly too I have been influenced by the very considerable difficulty of justifying the approximations, which since the pioneering work of Watterson [1], [2] has tended to steer mathematicians towards problems only indirectly of genetical significance (Sato [1], Ethier and Norman [1], Ethier and Nagylaki [1]). In fact, although diffusion methods seem indispensable for the study of dynamic problems (cf. Kingman [12]), I am not convinced that they are the best approach to problems of populations in statistical equilibrium. Having said that, it is appropriate to record a particularly illuminating approach to the Ewens formula, by way of diffusion theory, due to Watterson [5]. Watterson starts by considering a $K$-allele model, which is the special case of that described in §3.2 in which $S = \{1, 2, \ldots, K\}$ and
where of course $u = (K - 1)v$. For large $N$ and fixed $\theta = 4Nu$, he approximates the process by a $(K - 1)$-dimensional diffusion, and notes that, in equilibrium, the population frequencies $p_1, \ldots, p_K$ of the alleles have a joint distribution of Dirichlet form, with probability density
with respect to Lebesgue measure on the $(K - 1)$-dimensional simplex
where
From this, the distribution of the allelic partition in a sample of size n can be computed directly. For example,
and standard properties of the Dirichlet distribution yield
If $K$ is allowed to tend to infinity with $\theta$ fixed, (3.6.5) converges to (3.5.1). Similar (but of course algebraically complicated) calculations can be carried out for $P_n(a)$ when $a$ is a general partition of $n$, and Watterson shows that the resulting expressions converge, as $K \to \infty$, to the Ewens values (3.5.8). The interesting feature of this derivation of the Ewens formula is that it suggests that the corresponding population frequencies should be regarded as having approximately a Dirichlet distribution with a large value of $K$ and a correspondingly small value of $\alpha$, and the precise formulation of this idea was given in Kingman [6]. To describe the underlying concept, first note that there is no direct sense in which the distribution (3.6.2) has an interesting limit as $K \to \infty$, since $E(p_j) = K^{-1} \to 0$. If however, the $p_j$ are arranged in descending order as
then the distribution of the largest frequency $p_{(1)}$, and likewise the joint distribution of the $n$ largest $p_{(1)}, p_{(2)}, \ldots, p_{(n)}$, converge to nondegenerate limits as $K \to \infty$. This somewhat surprising result was proved in Kingman [3], but the credit for applying it to genetical problems belongs to Watterson, who in [5] gives interesting properties of the limiting distributions. The precise statement is as follows. Let $\nabla$ be the set of all sequences $(x_n;\ n \ge 1)$ satisfying
Then for every $\theta > 0$, there is a probability distribution on $\nabla$, called the
Poisson-Dirichlet distribution with parameter $\theta$, for which the marginal distribution of the first $n$ components is (for any $n \ge 1$) the limiting joint distribution, as $K \to \infty$ and $K\alpha \to \theta$, of the order statistics $p_{(1)}, p_{(2)}, \ldots, p_{(n)}$ from the Dirichlet distribution (3.6.2). Unfortunately, these marginal distributions (see Watterson [5]) are complicated in form and convey little intuitive idea of the nature of the Poisson-Dirichlet distribution. The distribution can, however, be described in other ways, which give alternative avenues of approach to its properties. Consider first a nonhomogeneous Poisson process on the positive half line having rate function
so that for instance the number of points in the interval (a, b) has a Poisson distribution with mean
Notice that this integral converges when $a > 0$ even if $b = \infty$, but diverges if $a = 0$. For this reason 0 is a limit point of the process but $\infty$ is not, and the points of the Poisson process may be labelled in descending order as $z_1 > z_2 > \cdots$. The sum $s = \sum_r z_r$ has expectation
and is thus almost certainly finite. It is therefore possible to define a random point of $\nabla$ by
Then the distribution of $x$ is the Poisson-Dirichlet distribution with parameter $\theta$. For another (though closely related) description, consider a gamma process, that is a random process $y(t)$ ($t \ge 0$) such that (i) $y(0) = 0$, and $y$ is a nondecreasing function of $t$, (ii) for $t_1 < t_2$, the increment $y(t_2) - y(t_1)$ on the interval $(t_1, t_2)$ has the probability density
where $t = t_2 - t_1$, and (iii) the increments of $y$ on disjoint intervals are independent random variables.
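Properties (ii) and (iii) can be explored with a small simulation. The sketch below assumes that the increment over an interval of length $t$ has the gamma density with shape $t$ and unit scale. It divides $(0, \theta)$ into $K$ equal cells and draws one independent increment per cell. Dividing the increments by their sum gives a symmetric Dirichlet vector with $K\alpha = \theta$ (a standard property of independent gamma variables), so for large $K$ the sorted frequencies approximate the order statistics whose limit was described earlier in this section. The function name and parameters are hypothetical.

```python
import random


def dirichlet_order_statistics(theta, K, rng, top=None):
    """Sorted frequencies from K independent gamma increments on (0, theta).

    Each cell has length theta / K, so its increment is Gamma(theta / K, 1)
    (assumed form of the increment density). Normalising by the total gives
    a symmetric Dirichlet vector with parameter alpha = theta / K.
    """
    alpha = theta / K
    increments = [rng.gammavariate(alpha, 1.0) for _ in range(K)]
    total = sum(increments)                 # value of the process at time theta
    freqs = sorted((g / total for g in increments), reverse=True)
    return freqs if top is None else freqs[:top]


rng = random.Random(1)                      # fixed seed for reproducibility
x = dirichlet_order_statistics(theta=1.0, K=2000, rng=rng, top=5)
print(x)                                    # the five largest frequencies
```

Most of the $K$ frequencies are negligibly small when $K$ is large, and a handful of cells carry almost all of the mass, which is the qualitative behaviour the limiting distribution is meant to capture.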
It is well known that such processes exist, and that they have sample functions which increase only in jumps, but they have infinitely many jumps in every finite interval. For any $\theta > 0$, list the heights of the jumps of $y$ in $(0, \theta)$ in descending order as $\eta_1 \ge \eta_2 \ge \eta_3 \ge \cdots$; the assertion that $y$ increases only in jumps means that
Hence $x_n = \eta_n / y(\theta)$ determines a random point in $\nabla$, and once again it turns out that this point has the Poisson-Dirichlet distribution with parameter $\theta$. Faced with a problem involving random sequences having this distribution, one therefore has a choice of approaching it as a limit of finite-dimensional Dirichlet distributions, as a nonhomogeneous Poisson process, or in terms of the gamma process. Examples of calculations carried out in these different ways may be found in Kingman [3] and Watterson [5]. Watterson's approach shows that, in the symmetric $K$-allele mutation model with large $K$, the Poisson-Dirichlet distribution describes the joint distributions of the population frequencies, when these are arranged in descending order. Kingman [6] shows that, not only does the Poisson-Dirichlet property of population frequencies imply the Ewens sampling formula, but, conversely, a population all samples from which have the Ewens property enjoys the Poisson-Dirichlet property. Moreover, if a sequence of populations has the Ewens property in the limit, then the population frequencies (in descending order) converge in joint distribution to the Poisson-Dirichlet distribution.
3.7. Partition structures. Passing reference was made in §3.5 to the need for the distribution of the allelic partition in a sample of size $n$ to be consistent as between different values of $n$. The time has now come to develop this systematically, so as to understand the special position which the particular distribution (3.5.8) enjoys and to explore possible alternatives to it. We shall for the moment adopt a less specific terminology and speak of "colours" rather than "alleles". Suppose we have a population of (finite or infinite) size $N$ each of whose members exhibits one of a (finite or infinite) number of possible colours. The mechanism by which these colours have been assigned may or may not be random.
If a random sample of size $n$ is taken (without replacement), the colours of the $n$ members selected will define a partition of $n$,
where, for each $r$, $a_r$ colours are represented exactly $r$ times. Unless we have names for, or relationships among, the colours, the partition $a$ summarises all the information available in the sample.
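The passage from a sample to its partition is easy to state in code. The following sketch takes a list of colour labels and returns the counts $a_r$ as a dictionary `{r: a_r}`; the function name and the use of single characters as colour labels are purely illustrative.

```python
from collections import Counter


def colour_partition(sample):
    """Return {r: a_r}, where a_r colours are represented exactly r times."""
    multiplicity = Counter(sample)              # colour -> number of members
    a = Counter(multiplicity.values())          # r -> number of colours seen r times
    return dict(sorted(a.items()))


sample = list("aabcccd")                        # seven members, four colours
a = colour_partition(sample)
print(a)                                        # {1: 2, 2: 1, 3: 1}
print(sum(r * a_r for r, a_r in a.items()))     # 7, the sample size
```

Relabelling the colours leaves the result unchanged, which reflects the remark that the partition carries all the information in the sample when the colours have no names or relationships.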
Let $P_n$ denote the distribution of the random partition $a$, so that $P_n(a)$ is the probability of its taking a particular value $a$, and
Since we can envisage taking a sample of size $n$ for any $n$ in $1 \le n \le N$ (or $1 \le n$ if $N = \infty$), $P_n$ is defined for all such $n$. One way of taking a random sample of size $n$ is first to take a sample of size $(n + 1)$, and then to select an item at random from this sample and discard it. This means that $P_n$ can be written down explicitly in terms of $P_{n+1}$:
We shall here consider the case $N = \infty$, so that $P_n$ is defined for all $n \ge 1$. Then our argument shows that any sensible model must lead to partition distributions $P_n$ satisfying (3.7.2) for all $n \ge 1$. For example, we would be gravely disturbed if the Ewens formula (3.5.8) did not satisfy (3.7.2); the reader will easily check that it does. A family of probability distributions $P_n$ ($n \ge 1$) over the respective finite sets $\varpi_n$ is said to form a partition structure if (3.7.2) holds for all $n \ge 1$. An obvious question is: What is the most general partition structure? This was answered in Kingman [13]. To describe the solution, first let $\bar\nabla$ be the set of all sequences $x = (x_n;\ n \ge 1)$ satisfying
For any $x \in \bar\nabla$, define
and note that $\nabla$ is just the subset of $\bar\nabla$ consisting of those $x$ with $x_0 = 0$. For any $x \in \bar\nabla$, we define a particular partition structure consisting of distributions $P_n^x$. These are defined by colouring the elements of an infinite population; for each $n \ge 1$ a proportion $x_n$ is given the colour $C_n$. By (3.7.4) this leaves a proportion $x_0$ uncoloured; these are all given colours different from one another and from all the $C_n$. If a sample of size $n$ is then taken, it will have a
colour partition; we now define
to be the probability that this colour partition is $a$. It is clear that $P_n^x$ satisfies (3.7.2), and that $(P_n^x)$ is therefore a partition structure for each sequence $x = (x_1, x_2, \ldots)$ in $\bar\nabla$. Not every partition structure is of this form. Because (3.7.2) is linear, any mixture of the form
where $p$ is a probability measure on $\bar\nabla$, is again a partition structure. Kingman [13] proves that every partition structure is of the form (3.7.5) for some such $p$. Moreover $p$, the representing measure of the structure, is uniquely determined. For example, the Ewens structure (3.5.8) can be represented in the form (3.7.5), where it turns out that the measure $p$ is concentrated on the subset $\nabla$ of $\bar\nabla$ and is just the Poisson-Dirichlet distribution with the same parameter $\theta$. (This is an immediate consequence of the general theory and of the calculations of Watterson cited in §3.6.) The measure $p$ in the representation formula (3.7.5) has an explicit interpretation. In a sample of size $n$, list the colours present in descending order of frequency, and let $x_r(n)$ denote the relative frequency in the sample of the $r$th most frequent colour. Thus for instance $x_1(n) = j/n$, where $j$ is the largest integer with $a_j \ge 1$. Each $x_r(n)$ is likewise a function of the partition $a \in \varpi_n$, and
is a random element of $\bar\nabla$, and has a distribution $p_n$ which can be computed in terms of $P_n$. It turns out that $p_n \to p$ as $n \to \infty$, in the usual sense of "weak" convergence (relative to the product topology on $\bar\nabla$). (To see why we need to work in $\bar\nabla$ rather than the more natural space $\nabla$, consider the partition structure with
for all $n \ge 1$.) Applying this analysis to the genetical context, we of course read "colours" as alleles (or as equivalence classes of alleles which can be discriminated experimentally). If a model admits in principle the possibility that a sample of size $n$ may be taken, for any value of $n$, then the distribution of the allelic partition of that sample must be given by (3.7.5) for some probability measure $p$ on $\bar\nabla$. Thus $p$ labels all the possible models. It may be interpreted as the joint distribution of the frequencies, arranged in descending order, of the alleles in a hypothetical infinite population.
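The consistency requirement (3.7.2) discussed earlier in this section can be checked mechanically for the Ewens formula. The sketch below assumes the standard form of (3.5.8) and implements the deletion mechanism as described in words: from a sample of size $n + 1$ with partition $a$, one item is discarded uniformly at random, so a class of size $r$ is hit with probability $r a_r/(n + 1)$ and shrinks to size $r - 1$. All function names are illustrative, and partitions are stored as dictionaries `{r: a_r}`.

```python
from fractions import Fraction
from math import factorial


def partitions(n, largest=None):
    """Yield every partition of n as a dict {r: a_r}."""
    if largest is None:
        largest = n
    if n == 0:
        yield {}
        return
    for r in range(min(n, largest), 0, -1):
        for rest in partitions(n - r, r):
            a = dict(rest)
            a[r] = a.get(r, 0) + 1
            yield a


def ewens(a, theta):
    """Ewens probability (assumed standard form) of the partition a."""
    n = sum(r * c for r, c in a.items())
    rising = Fraction(1)
    for i in range(n):
        rising *= theta + i
    prob = Fraction(factorial(n)) / rising
    for r, c in a.items():
        prob *= theta ** c / (r ** c * factorial(c))
    return prob


def delete_one(a):
    """Yield (b, weight): partitions obtained by discarding one random item."""
    m = sum(r * c for r, c in a.items())
    for r, c in a.items():
        b = dict(a)
        b[r] -= 1                       # one class of size r loses a member
        if b[r] == 0:
            del b[r]
        if r > 1:
            b[r - 1] = b.get(r - 1, 0) + 1
        yield b, Fraction(r * c, m)     # chance the discarded item lies in such a class


def projected(n, theta):
    """Distribution on partitions of n induced from the Ewens law for n + 1."""
    out = {}
    for a in partitions(n + 1):
        p_a = ewens(a, theta)
        for b, w in delete_one(a):
            key = tuple(sorted(b.items()))
            out[key] = out.get(key, Fraction(0)) + p_a * w
    return out


theta = Fraction(5, 3)
for n in range(1, 7):
    induced = projected(n, theta)
    assert all(p == ewens(dict(key), theta) for key, p in induced.items())
print("consistency holds for n = 1, ..., 6")
```

Because the arithmetic is exact, the equality test is a genuine verification for the sample sizes tried, although it is of course no substitute for the algebraic check.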
The significance of this representation for the practical problem of testing neutrality will be explored in the next section, but something should be said about the nature of the result from a mathematical viewpoint. Let $\Pi_n$ denote the set of all probability distributions over $\varpi_n$. This is a finite-dimensional simplex, whose dimension is one less than the number of partitions of the integer $n$. Equation (3.7.2) defines a linear mapping $\sigma_n$ from $\Pi_{n+1}$ into $\Pi_n$, thus defining a projective system
For such a system, the projective limit is defined as the set of all sequences $(P_n;\ n \ge 1)$ with
and so is just the set $\Pi_\infty$ of all partition structures. General theory (see, for instance, Choquet [1]) tells us that $\Pi_\infty$ must be a Choquet simplex, and that the general partition structure is accordingly given by an integral representation of the form (3.7.5) with $\bar\nabla$ replaced by a suitable set of labels for the extreme points of $\Pi_\infty$. Thus the burden of (3.7.5) is that the extreme partition structures (i.e., those that cannot be expressed as mixtures of other partition structures) are those of the form $P^x$ for $x \in \bar\nabla$. Choquet theory cannot (I think) help us prove this; the only proof so far known is a formalisation (using martingale theory) of the description of $p$ given above. There are affinities between this and work of Martin-Löf [1] and Lauritzen [1] on extreme models in statistics, and indeed with de Finetti's theorem on exchangeable random variables (cf. Kingman [10]). It is a fact of some technical importance that the relation between the partition structure $(P_n)$ and the representing measure $p$ is continuous in both directions. To make this precise we give $\bar\nabla$ the weakest topology in which the coordinate mappings $x_n$ ($n = 1, 2, \ldots$) are continuous, noting that this makes $x_0$ only upper-semicontinuous (a price we are happy to pay to achieve compactness of $\bar\nabla$). Convergence of probability measures on $\bar\nabla$ is weak convergence relative to this topology (Billingsley [1]). Then if a sequence of probability measures $p_r$ converges to another measure $p$, the corresponding partition structures converge pointwise:
for all $n \ge 1$, $a \in \varpi_n$. Conversely, suppose that $(P_n^{(r)};\ n \ge 1)$ is for each $r$ a partition structure, and that the limit
exists for all $n \ge 1$ and $a \in \varpi_n$. Then $(P_n)$ is a partition structure, and the corresponding representing measures converge in the sense described above. In Kingman [9, §5] an even stronger result is proved in which (roughly speaking) the $P_n^{(r)}$ come from finite populations of possibly random size tending to infinity in probability. This is only stated for structures for which the representing measure is concentrated on $\nabla$, but it extends at once to the general case. However, the greater generality obtained by allowing $p$ to extend over $\bar\nabla$ has, so far as I know, no biological application.
3.8. Testing neutrality. It was remarked in §3.5 that the Ewens formula is very robust, and appears to be valid for all models satisfying the conditions (a), (b) and (c) of that section. For this reason it is very tempting to use it as a null hypothesis, to determine whether data in the form of an allelic partition are consistent with (3.5.8) for some value of $\theta$. If not, this is to be taken as evidence that one of the assumptions does not hold; the most vulnerable of the three is neutrality. For this reason, the Ewens formula has been used to embody a null hypothesis of neutrality, to be tested against selective alternatives. For example, Watterson [7] has used the Singh, Lewontin, Felton [1] partition
and that of Coyne [1],
and has concluded that they are, as they stand, incompatible with (3.5.8). Notice that, although formally each is a single partition a, each includes a good deal of information (much in the same sense that a single realisation of a random process can be informative). The decision that a partition a is inconsistent with the Ewens formula is made on the basis of a test statistic, which may be selected in a number of ways. Thus Watterson uses (among others) the homozygosity statistic
which is the Neyman-Pearson criterion for testing (in a diffusion approximation) a neutral model against a diploid model with
A choice based on Neyman-Pearson theory is clearly more justifiable than a purely ad hoc selection (and of course infinitely more so than one based on the data), but is nevertheless open to criticism. It will be powerful against the particular alternative considered, but not against others, which may be biologically at least as plausible, even if mathematically less simple. Indeed the class of sensible alternatives is so large that any conceivable data (even the most probable under (3.5.8)) will lead to rejection of the null hypothesis for some choices of Neyman-Pearson test statistic. There is scope for further numerical study of these questions, but the picture is to some extent clarified by the theory developed in the last section, since this describes all the possible models that could be contemplated as alternatives to (3.5.8). If the sample size $n$ is so much smaller than the population size $N$ that the latter can be taken as infinite, then every model is of the form (3.7.5) for some probability measure $p$ on $\bar\nabla$. Thus $p$ labels the possible models, and (3.5.8) corresponds to the Poisson-Dirichlet distribution with parameter $\theta$ (which will for this purpose be denoted by $p_\theta$). Using this fact we can reformulate the statistical problem as follows. The unknown "parameter" is a probability measure $p$ on $\bar\nabla$. The observed data are generated by a two-stage process. First a point $x$ is chosen at random in $\bar\nabla$, according to the distribution $p$. This is used to colour an infinite population in the manner described in §3.7, and a sample of size $n$ is taken from this population; the observed partition $a$ (of $n$) is the colour partition of the sample. The problem is to make inferences about $p$, and in particular to test whether the data are inconsistent with the null hypothesis that $p = p_\theta$ for some $\theta$. Whatever his philosophy of statistical inference, it is quite clear that the statistician is worse off knowing $a$ than he would be if per impossibile he could observe the point $x$ directly, since $a$ is obtained from $x$ by a random mechanism independent of $x$ and not involving $p$. Hence one may replace the real problem by the idealised, and less demanding, one of making inferences about $p$ from a knowledge of $x$.
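The two-stage process just described can be imitated in a few lines. In this sketch the first stage draws $x$ from a symmetric Dirichlet distribution with $K$ cells and $K\alpha = \theta$, used here only as a stand-in for $p_\theta$ when $K$ is large; the second stage colours each sampled member independently, giving colour $C_k$ with probability $x_k$ and a brand-new colour with the remaining probability $x_0$. Independent draws are appropriate because the population is taken to be infinite. The function names and the choice of $K$ are assumptions made for illustration.

```python
import random
from collections import Counter


def approximate_pd_point(theta, K, rng):
    """Stage 1: sorted symmetric Dirichlet frequencies with K * alpha = theta."""
    gammas = [rng.gammavariate(theta / K, 1.0) for _ in range(K)]
    total = sum(gammas)
    return sorted((g / total for g in gammas), reverse=True)


def sample_partition(x, n, rng):
    """Stage 2: colour n sampled members using x and return {r: a_r}."""
    labels = []
    fresh = 0
    for _ in range(n):
        u = rng.random()
        acc = 0.0
        label = None
        for k, x_k in enumerate(x):
            acc += x_k
            if u < acc:
                label = k                   # member receives colour C_k
                break
        if label is None:                   # falls in the uncoloured proportion x_0
            fresh += 1
            label = ("new", fresh)          # a colour shared with no other member
        labels.append(label)
    sizes = Counter(labels).values()        # number of members of each colour
    return dict(sorted(Counter(sizes).items()))


rng = random.Random(7)                      # fixed seed for reproducibility
x = approximate_pd_point(theta=2.0, K=500, rng=rng)
a = sample_partition(x, n=50, rng=rng)
print(a)
print(sum(r * a_r for r, a_r in a.items()))  # 50, the sample size
```

Only the partition `a` would be available to the statistician; the point `x` produced by the first stage is hidden, which is the sense in which the real problem is harder than the idealised one.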
This idealised problem is in a sense the limit of the real one as the sample size $n$ increases. The null hypothesis is composite because of the nuisance parameter $\theta$, but this complication disappears in the idealised problem, since $\theta$ is (almost surely) a function of $x$. In other words, at most one of the components $p_\theta$ of the null hypothesis is consistent with the data $x$. We therefore arrive at the conclusion that the real problem is worse than the following: given an observation $x$ arising from an unknown distribution $p$, is this consistent with the particular distribution $p_{\theta(x)}$? The only esoteric feature is that $x$ takes its values in the infinite-dimensional space $\bar\nabla$ rather than, say, the real line.
The first reaction of the statistician to this problem would be to complain at having to work with a single observation $x$, and to demand replication. But replication is probably impossible to achieve, since it means working with populations which have been separate long enough to achieve equilibrium, but whose biological environments have been identical over that period. Granted that only one value of $x$ is available, the idealised problem seems soluble (in the sense that useful information about $p$ is contained in $x$) only if there is very strong prior information about the alternatives to $p_\theta$. In particular,
if a possible choice of $p$ is the distribution $\epsilon_x$ concentrated at $x$, this will presumably be preferred to any diffuse distribution such as $p_\theta$. The intrusion of $\epsilon_x$ is very germane because of the general conclusions of Chapter 2. If in a selective model the selective differences are large compared with mutation rates and with the reciprocal of population size, then analysis of that model will predict the gene frequencies, and $x$ will thus be determined without error. Moreover, the selection coefficients can always be so adjusted as to yield any prescribed point $x$ in $\nabla$ (though perhaps not in $\bar\nabla$). Thus it is plausible to regard the distributions $p = \epsilon_x$ ($x \in \nabla$) as the strongly selective alternatives to $p_\theta$. For any observed $x$ in $\nabla$ we will reject $p_{\theta(x)}$ in favour of the degenerate distribution $\epsilon_x$. (Moreover, if $x$ falls in $\bar\nabla - \nabla$ we must reject $p_\theta$, which is concentrated on $\nabla$.) It therefore seems that, in the idealised problem, we will always reject the null hypothesis whatever the data unless there are strong a priori restrictions on the alternatives envisaged, restrictions which rule out in particular most strongly selective models. In the real problem, we will fail to reject the predictions of the neutral theory only if the sample size is not large enough, or if a limited choice of test statistic implies restrictions on the alternatives. For this reason I am inclined to be sceptical about analyses of (unlabelled) gene frequencies, even in large samples, which purport to find significant differences from the Ewens formula. This is not of course to say that that formula is always valid, but it does indicate that the explanation of departures from neutrality is not urgent until the significance of such discrepancies has been established by other means.
CHAPTER 4
Selection in Finite Populations
4.1. Deleterious mutants. It was noted in §3.1 that no completely neutral model can ever be realistic, because most mutants are known to be much less fit than the existing genes. Unless deleterious mutants are immediately lethal, the population will contain descendants of such mutants on their way to extinction. These will poison any sample, and cast doubt on the validity of the Ewens sampling formula. For example, the sufficient statistic for θ, the total number of alleles, will be inflated by deleterious mutants of low frequency. Such ideas have often been put forward by advocates of a neutralist position wishing to protect it from evidence apparently incompatible with the sort of model described in Chapter 3. As we have seen, such a defensive attitude may be premature. Nevertheless, it remains necessary to assess the effect of the deleterious mutations on the composition of both population and sample. Ideally this would be done by synthesising the selective models of Chapter 2 with the nondeterministic models of Chapter 3; in the next section this is attempted for the multiplicative fitness case. Even with the simplest assumptions this leads to mathematical difficulties at present insurmountable. A rather less ambitious approach is described in §4.3. This depends on a heuristic formula of Sewall Wright that has recently been exploited to good effect by Li [1], [2] and Ewens and Li [1]. It is possible to set up a truly diploid model by combining these approaches with the device used in §2.5, but this has not yet led to useful results. The discussion in this chapter will be much more speculative and less definitive than in preceding ones. If the reader is encouraged to do better, then it will have served its purpose. Before embarking on the problem, it is worth assessing its significance in a rough way by means of the results of §2.4.
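The rough assessment referred to here can be explored numerically. The sketch below is not taken from the text. It assumes that the equilibrium of §2.4 is the fixed point of the recursion p'(x) = (1 − u) x p(x)/W + u f(x), so that p(x) = uW f(x)/(W − (1 − u)x) with W fixed by normalisation, and it takes f to be uniform on (0, 1). The function names are hypothetical.

```python
import math

def mean_fitness_uniform(u, tol=1e-15, max_iter=200):
    """Solve the assumed normalisation condition for W, f uniform on (0, 1).

    Assumed condition (a reconstruction, not quoted from the text):
        u*W/(1-u) * log(W / (W - (1-u))) = 1,
    rewritten as delta = W * exp(-(1-u)/(u*W)) with W = 1 - u + delta.
    Returns (W, delta); delta is returned separately because it can be far
    smaller than floating-point resolution near 1 when u is small.
    """
    delta = 0.0
    for _ in range(max_iter):
        w = 1.0 - u + delta
        new_delta = w * math.exp(-(1.0 - u) / (u * w))
        converged = abs(new_delta - delta) <= tol * max(new_delta, 1e-300)
        delta = new_delta
        if converged:
            break
    return 1.0 - u + delta, delta

def quantile(q, u):
    """q-th quantile of the assumed equilibrium density u*W/(W - (1-u)*x)."""
    w, _ = mean_fitness_uniform(u)
    # The CDF is u*W/(1-u) * log(W / (W - (1-u)*x)); invert it for x.
    return w * (-math.expm1(-q * (1.0 - u) / (u * w))) / (1.0 - u)
```

Under these assumptions the excess of W over 1 − u is exponentially small in 1/u, and with u = 0.05 the median fitness differs from 1 by roughly e^{-10}, in line with the remark that the quantiles differ from 1 by exponentially small terms.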
It will be recalled that, if a mutant has fitness (measured, say, on the interval (0, 1)) having a probability density f, then the equilibrium fitness distribution in the population has density
where u is the total mutation rate and W (the mean fitness) is determined by the equation
To see how this works out in a simple case, let f be the uniform density on (0, 1). Then (4.1.2) becomes
which when u is small has the approximate solution
Substituting in (4.1.1) gives
which when u is small has a very sharp peak at x = 1. All the quantiles differ from 1 by terms of the order of e^{-1/u}. Similar conclusions hold whenever the mutant fitness density is approximately uniform near the upper limit of fitness. They perhaps suggest that seriously deleterious mutants may not be present in sufficient numbers to make a noticeable difference. However, it should be stressed that this argument is a very crude and approximate one. Starting from rather different assumptions, Li [2] is led to different conclusions on the basis of extensive computations.
4.2. The Wright-Fisher model. Let us see how the general model described in §3.2 can be modified to allow for selective differences. Thus consider a diploid population of fixed size N, and a particular locus at which the possible alleles are A_i (i ∈ S). In a particular generation G_t, list the genes present in the population, in random order, as
We shall assume here that selection acts at the genic level so that the fitness of the genotype A_iA_j is of product form w_i w_j for positive constants w_i (i ∈ S). A simple way of building this into the model is to suppose that, in the daughter generation G_{t+1}, each gene is a copy of one of the form A_i, where i is X_r(t) with probability proportional to w_{X_r(t)}. This means that the Wright-Fisher model is modified by replacing the symmetric multinomial distribution of offspring by an asymmetric multinomial, weighted according to fitness. The mutation mechanism is supposed to act as before, and we follow through the argument of §3.2 to conclude that, conditional on G_t, the random elements X_r(t + 1) which make up G_{t+1} are conditionally independent, with
When all the w_i are equal, this reduces to (3.2.1). In deriving (4.2.1), we have been vague about the exact biological mechanisms and life cycle by which mutation and selection act. It is easy to complicate matters further, but this seems premature in light of the great difficulty of even this simple model. To be rather more precise, the foregoing argument regards the successive generations G_1, G_2, ..., G_t, ... as a Markov sequence, each G_t being a sequence of 2N elements X_r(t) of S, where r = 1, 2, ..., 2N. The italicised conclusion above defines the transition matrix of the Markov chain, with state space S^{2N}, in terms of the allele fitnesses w_i and the u_ij. When S is finite and (u_ij) is irreducible and aperiodic, it can be shown that this chain has a proper limiting distribution, so that a unique joint distribution for the X_r(t) which is independent of t (and necessarily exchangeable in r) exists. The corresponding question when S is infinite is more delicate, but the methods of §2.3 can be applied to show that the conditions there described as sufficient for the stability of the deterministic model also guarantee the existence and uniqueness of the stationary joint distribution for the present model. The interesting problem, however, is how to get useful information about this distribution out of (4.2.1). Section 3.3 essentially describes a trick for doing this when the w_i are all equal, but this fails in general because of the nonconstant denominator on the right-hand side. I know of no way of generalising this trick to cover unequal w_i, even for simple mutation rates. If we are concerned with determining the likely proportion of deleterious mutants in a sample, it might make sense to confine our attention to the distribution of fitness in the population. Thus for any y, we can consider the proportion
of genes with fitness not exceeding y; each F_t is a discrete distribution function. This would be particularly appropriate if the mutation rates took the symmetrical form (3.6.1), because the sequence of functions F_1, F_2, ... would then itself possess the Markov property. It would then be possible to seek the stationary distribution of this Markov chain. To pursue this ambitious programme somewhat further: the stationary distribution depends on the number of alleles K, the mutation rate u, and the empirical distribution of the fitnesses w_i; it might be expected to take a simpler limiting form if K → ∞, N → ∞, 4Nu → θ and the empirical distribution converges. I do not know if these conjectures are valid but the analysis of the next two sections provides some circumstantial evidence. If any direct approach to these problems is possible, it is probably by way of a diffusion approximation, so that F_t is regarded as a random function varying in a Markovian way as a diffusion on a suitable space of functions. The machinery for analysing such processes probably exists, and rather similar problems have been successfully attacked in different contexts (see, for example, Fleming and Viot [1]). When even the simplest model of a physical system resists analysis, it is
worth asking whether one of the assumptions is causing the difficulty, and if so whether this is so compelling that a more tractable model which violates it would yield no useful information. In this case, the trouble seems to stem from the requirement of fixed population size N, and there is a very simple model without this restriction first contemplated by Karlin and McGregor [1]. This is a continuous-time formulation, in which new alleles arrive randomly in time according to a Poisson process of rate λ. Each allele is then supposed to fluctuate in abundance according to some random process, and the crucial assumption is that these fluctuations are independent from one allele to another. The equilibrium structure of this process is well understood (see, for example, Kingman [2]). If ν_r denotes the number of alleles present with frequency r at a particular instant, then the random variables ν_1, ν_2, ... are independent, and ν_r has a Poisson distribution with mean λg_r, where g_r is the expected total time for which an allele has frequency r before extinction. Note that the total number of genes is no longer the constant 2N but the random variable
\[ M = \sum_{r \ge 1} r\,\nu_r, \]
which has expectation
\[ E(M) = \lambda \sum_{r \ge 1} r\,g_r \]
and variance
\[ \operatorname{var}(M) = \lambda \sum_{r \ge 1} r^2 g_r. \]
Thus this process represents a population whose size fluctuates substantially, rather than one constrained by external factors. Of course, in any real situation the population is partly regulated from outside and partly by internal factors, and one might expect behaviour somewhere between the Wright-Fisher and the Karlin-McGregor pictures. The resulting population structure clearly depends critically on the numbers g_r, which in turn depend on the process by which each allele population fluctuates before extinction. If, for example, this is supposed to be a simple linear birth and death process with birth and death rates b and d, then it is easy to compute that, if b < d so that extinction is certain,
and thus that
\[ g_r = \frac{1}{br}\left(\frac{b}{d}\right)^{r}. \tag{4.2.6} \]
If (4.2.6) is substituted into the expression
\[ \prod_{r \ge 1} e^{-\lambda g_r}\,\frac{(\lambda g_r)^{\nu_r}}{\nu_r!} \]
for the joint distribution of the ν_r, we obtain an expression of the form
\[ C(M) \prod_{r \ge 1} \frac{1}{\nu_r!}\left(\frac{\theta}{r}\right)^{\nu_r}, \tag{4.2.8} \]
where C(M) depends on the ν_r only through M, and θ = λ/b. Comparing this with (3.5.8), we see that the conditional joint distribution of the ν_r, given that M = m, is just given by the Ewens sampling formula for a random partition of m. Now suppose that a sample of size n ≤ m is taken (without replacement). Then, because of the consistency property enjoyed by (3.5.8), the allelic partition in the sample again has the Ewens distribution, conditional on M = m. Since this now does not depend on m, we may conclude that, given only that M ≥ n, a sample of size n yields a random partition with the Ewens distribution (3.5.8), with θ = λ/b. This conclusion of course depends critically on (4.2.6), but when E(M) is large a similar conclusion will hold approximately whenever (4.2.6) holds asymptotically for large r, as it does, for example, if the fluctuations of the allele population follow a branching process (Athreya and Ney [1]). It is really rather striking that the Ewens formula should turn up again in this way, in a process which does not in any very direct way reflect an assumption of neutrality. If, however, the different alleles have, in some sense, different fitnesses w, then presumably b and d depend on w, and if the fitnesses of different alleles are independently drawn from a probability density f (as in §2.4), then (4.2.6) must be replaced by
With this form of g_r, the cancellation leading to (4.2.8) no longer takes place, and the Ewens formula no longer holds in general. Indeed, the computation of the distribution of the allelic partition in a sample becomes a complicated matter, though perhaps amenable to approximate treatment when M is large compared with n. The standard result already used to give the equilibrium joint distribution of the ν_r actually tells us more, namely that the w-values of the ν_r alleles present with frequency r are independent random variables with a probability density proportional to f(w)g_r(w), where g_r(w) is the same as g_r, but evaluated for alleles with fitness w. Superposing these w-values for different r, we find that the fitnesses
of the alleles represented form a Poisson process with rate function
and that such an allele, with fitness w, is represented in the population r times with probability
Moreover, the abundances of different alleles, conditional on their fitnesses, are independent. In particular, the total numbers of genes with fitnesses falling in several disjoint intervals are independent random variables. Moreover, in the argument we have nowhere used the fact that the word "fitness" has a biological meaning, and could equally refer to some other characteristic of the alleles. Thus it is quite generally true that, in the Karlin-McGregor model, if alleles are classified into a number of categories, then the total numbers of genes in the different categories are independent random variables. This obviously facilitates the calculation of the distribution of the partition induced by this classification in a sample.
4.3. Wright's formula. Recent work of Li [1], [2], [3] has revived interest in an approach to the problem of selection in finite populations which goes back to Wright [1]. This relates to a large population of N diploid individuals, and to a K-allele locus with mutation rates given by (3.6.1) (although a slightly more general form is possible). If the genotype A_iA_j (i, j = 1, 2, ..., K) has fitness w_ij, then it is asserted that the probability density of the population frequencies x_j of the alleles A_j takes the form
\[ C\,W^{2N} \prod_{j=1}^{K} x_j^{\alpha-1}, \tag{4.3.1} \]
where
\[ W = \sum_{i=1}^{K}\sum_{j=1}^{K} w_{ij}\,x_i x_j, \]
α is given by (3.6.4), and C is chosen so that
For the most part, we shall confine attention to the multiplicative case
\[ w_{ij} = w_i w_j, \tag{4.3.4} \]
though, as already remarked, the methods of §2.5 perhaps permit a more general treatment. The proofs of (4.3.1) given by Li [1] and by Watterson [8] depend on diffusion approximation yielding a Fokker-Planck (or Kolmogorov forward) equation which (4.3.1) is found to satisfy. Thus they accept as given the adequacy of the diffusion approximation, although it is far from clear under what conditions the argument is valid. Indeed, the difficulty is underlined because the factor W^{2N} appears in Watterson's work as e^{2NW}, showing that there is an implicit assumption that W is nearly constant in the significant region of integration. Perhaps this is not too much of a problem; recall
but it does suggest a need to examine the status of (4.3.1) further. An important clue is contained in the fact that, when (4.3.1) is shown to satisfy the forward equation, this is done by verifying a stronger "detailed balance" condition characteristic of reversibility; the diffusion process is reversible. Now the Wright-Fisher model is not reversible even in the neutral case, but a modification due to Moran [1] is. (See Trajstman [1], Kelly [1], [2], Watterson [6].) This suggests that if Moran's model is set up to include the effect of selection, then it might prove reversible, with a stationary distribution approximating to (4.3.1). Moran's process is formulated in continuous time, and regards the population as consisting of a fixed total number M of individuals (haploid organisms, or gametes of a diploid organism) each of which displays one of the possible alleles A_1, A_2, ..., A_K. It evolves as a Markov process, the transitions of which represent a birth, followed by possible mutation, and a compensating death. Thus we assume that, in a small interval of length Δt, there is a probability β Δt of a birth occurring, the parent being equally likely to be any of the M individuals. Thus, if n_i is the number of A_i individuals at the beginning of the interval, the probability that the birth occurs to an A_i individual is (n_i/M) β Δt. Taking account of the possibility of mutation, assuming the mutation rates (3.6.1), we ascertain that the probability that the individual produced is of allele A_j is
We now introduce selection by assuming that the new individual survives to maturity with probability w_j. The population is maintained at total size M by displacing one of the old individuals, each such being equally likely to die.
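The birth, mutation, survival and replacement steps just described can be written down directly for a small population. The sketch below makes two assumptions that are not quoted from the text: a mutant is taken to be equally likely to be any of the K − 1 other alleles (one reading of the symmetric mutation scheme), and the candidate stationary weight is a product of negative-binomial-type terms. All function names are hypothetical. Exact rational arithmetic is used so that detailed balance can be tested as an identity rather than to within rounding error.

```python
from fractions import Fraction
from itertools import product

def states(M, K):
    """All allele-count vectors (n_1, ..., n_K) with entries summing to M."""
    return [n for n in product(range(M + 1), repeat=K) if sum(n) == M]

def offspring_prob(n, j, u):
    """Probability that a newborn carries allele j: the parent is chosen
    uniformly, then mutates with total probability u, spread evenly over
    the K - 1 other alleles (an assumed mutation scheme)."""
    M, K = sum(n), len(n)
    return Fraction(n[j], M) * (1 - u) + Fraction(M - n[j], M) * u / (K - 1)

def rate(n, i, j, w, u, beta=1):
    """Rate of the move n_i -> n_i - 1, n_j -> n_j + 1 (i != j): a birth of
    type j that survives with probability w[j] and displaces a type-i individual."""
    M = sum(n)
    return beta * offspring_prob(n, j, u) * w[j] * Fraction(n[i], M)

def weight(n, w, u):
    """Assumed unnormalised stationary weight: prod_k w_k^n_k (c)_(n_k) / n_k!."""
    M, K = sum(n), len(n)
    c = (M * u / (K - 1)) / (1 - u * K / (K - 1))
    total = Fraction(1)
    for nk, wk in zip(n, w):
        for r in range(nk):
            total *= (c + r) * wk / (r + 1)
    return total

def detailed_balance_holds(M, K, w, u):
    """Check weight(n) * rate(n -> m) == weight(m) * rate(m -> n) for all moves."""
    for n in states(M, K):
        for i in range(K):
            for j in range(K):
                if i == j or n[i] == 0:
                    continue
                m = list(n)
                m[i] -= 1
                m[j] += 1
                m = tuple(m)
                if weight(n, w, u) * rate(n, i, j, w, u) != weight(m, w, u) * rate(m, j, i, w, u):
                    return False
    return True
```

For instance, `detailed_balance_holds(5, 3, [Fraction(1), Fraction(9, 10), Fraction(1, 2)], Fraction(1, 20))` examines every admissible move among the 21 states of a population of five individuals with three alleles.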
Hence the net effect is that in an interval Δt, n_i is reduced to n_i − 1 and n_j is increased to n_j + 1, with probability γ_ij Δt, where
depends of course on the state n = (n_1, n_2, ..., n_K) at the beginning of the interval. In the terminology of Kingman [2], this describes a closed Markov population process with migration rates (4.3.6). The special form of the γ_ij (linear in each of n_i and n_j) brings it within the special class studied by Karlin and McGregor [3], and their methods yield representations for the transition probabilities (see also Griffiths [2], [3]). For our purposes, however, it suffices to note that, if a probability distribution (p(n); n ∈ Δ_M) over the set
can be found so that, with
the identity
holds, then the process is reversible with p as its stationary distribution. It can be checked by direct substitution that (4.3.7) is satisfied by
where
and C_M is chosen so that
C_M^{-1} is the coefficient of z^M in the series expansion of
Hence this variant of the Moran model is reversible, and we have an explicit expression for the joint distribution of the numbers n_j of the different alleles in the population: it is the distribution of independent negative binomial variables conditioned on their sum. Note that, although the Moran process is biologically oversimplified, the formula (4.3.8) is an exact consequence of the assumptions of the process, and does not depend on any diffusion approximation. In the neutral case when all the w_j are equal, it is well known and easy to show that (4.3.8) converges to the Dirichlet distribution (3.6.2) of gene frequencies n_j/M. The parameter α is given for small u by (3.6.4), so long as the Ewens parameter θ is now defined by
Thus in order to reconcile the neutral Moran model with the results of Chapter 3, we need to take
and not M = 2N as might seem more natural. Equation (4.3.8) shows that the density of the distribution of n, relative to that when all the w_j are equal, is proportional to
where w̃ is the weighted geometric mean of the w_j. Thus when N is large, (4.3.8) gives a distribution of gene frequencies of the form
This is to be compared with (4.3.1), which reduces under (4.3.4) to
where
is the corresponding arithmetic mean. Hence the Moran process justifies the Wright approximation in the case of multiplicative fitness, so long as the fitnesses are sufficiently close in value for the geometric and arithmetic means to be nearly equal, even when raised to the
high power 4N. However, (4.3.13) is more realistic when some mutations are severely deleterious. In this section and the next, we shall work with (4.3.13) and its explicit form
However, all our conclusions hold if the arithmetic mean w̄ is replaced by the geometric mean w̃, or indeed by the Watterson analogue. One consequence, indeed, holds even for the exact formula (4.3.8). Suppose that several of the alleles, A_1, A_2, ..., A_L, say, have equal fitnesses w_1 = w_2 = ... = w_L = w. Then the joint distribution of n_1, n_2, ..., n_L, conditional on n_1 + n_2 + ... + n_L = m, is given by summing (4.3.8) over values of n_{L+1}, ..., n_K which sum to M − m, and then renormalising. In the result the w_j cancel, and we are left with a conditional joint distribution
This is the same as in the neutral case, so that the relative frequencies of the alleles A_1, ..., A_L are not affected by the different fitnesses of A_{L+1}, ..., A_K. For large m, the frequencies n_j/m (j ≤ L) have Dirichlet joint distribution and all the consequences described in Chapter 3 remain valid. Thus the presence of deleterious mutants does not disturb the relative proportions of equally fit alleles. In particular, the Ewens sampling formula remains true; if we could recognise and discard the deleterious members of our samples, the predictions of the neutral theory would apply to those remaining. In view of the discussion in §4.2, one might ask whether distributions like (4.3.1), (4.3.13) or (4.3.14) can arise from processes of Karlin-McGregor type. The answer is no: only in the neutral case can the effect of the unconstrained population size be ignored. This means of course that one must ask before applying the formulae of this section whether it is really true in practice that a population is controlled primarily by external factors.
4.4. The infinite alleles limit. The starting point of this section is the approximate formula (4.3.15) for the joint distribution of the allele frequencies x_1, x_2, ..., x_K. For its validity N needs to be large enough for the x_j to be regarded as continuous variables, but the expression cannot be a limit as N → ∞ since it contains N explicitly. It could be converted into a formal limit by setting w_j = 1 + N^{-1}s_j for constants s_j, but this seems artificial and it is probably better to proceed directly but with caution. As in Chapter 3, we rearrange the x_j in descending order as x_(1) ≥ x_(2) ≥ ... ≥ x_(K) and then ask whether the joint distributions of x_(1), x_(2), ... converge to meaningful limits as K → ∞ and Kα → θ. From (4.3.15), the joint distribution of x_(1), x_(2), ..., x_(k) is, for any K ≥ k, a symmetric function of w_1, w_2, ..., w_K, and we must therefore make some assumptions about these fitnesses in order to study the limiting behaviour of the distribution. The simplest such, which suffices for this exploratory treatment, is to suppose that the w_j are independent random variables drawn from a probability density f on, say, the interval (0, 1). If then we take (4.3.15) as the conditional density of the x_j given the w_j, the unconditional density is
so that
where
Actually, this is an approximate result, since the "constant" C in (4.3.15) depends on the w_j, but it is not difficult to see that it converges in probability to a constant as K → ∞, and so in the limit may be brought outside the expectation. Since φ is symmetric, it is also the probability density of the x_(j), if it is restricted to the region x_1 ≥ x_2 ≥ ... ≥ x_K and the constant C is correspondingly adjusted by a factor K!. Hence we conclude that the joint distribution of the x_(j) has a density, with respect to that in the neutral case, which is proportional to
Proceeding formally to the limit as K → ∞, the joint distribution is seen to converge to a limiting distribution on the space V which has a density with respect to the Poisson-Dirichlet distribution (with parameter θ) proportional to
Since x_(j) tends to zero geometrically fast with probability one (Kingman [3]), this infinite product converges unless f has an exceptionally strong peak near w = 0. To give a rigorous justification for this tentative argument would not be a trivial matter, but if it could be done it would allow, in principle, the computation of the distribution of any function of the ordered gene frequencies, and in
particular that of any symmetric function of the gene frequencies themselves. For example, in the notation of §3.7, the probability of obtaining a partition a in a sample of size n would be given by the formula
Whether such formulae can ever be made to yield information of real biological interest must remain, for the time being, an open question.
APPENDIX I
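An Inequality
Let X_1, X_2, ..., X_n be random variables whose joint distribution is exchangeable (i.e., unchanged by permutation, as for instance when the variables are independent and identically distributed). Let p_1, ..., p_n satisfy
\[ \sum_{j=1}^{n} p_j = 1, \tag{AI.1} \]
and let g be a convex function. Then the expectation
\[ E\,g\Big(\sum_{j=1}^{n} p_j X_j\Big) \]
is least when all the p_j are equal, i.e.,
\[ E\,g\Big(\sum_{j=1}^{n} p_j X_j\Big) \ge E\,g\Big(\frac{1}{n}\sum_{j=1}^{n} X_j\Big). \tag{AI.2} \]
The proof of this inequality depends, as it must, on Jensen's inequality. The exchangeability means that the distribution of
\[ \sum_{j=1}^{n} p_{\pi(j)} X_j \]
does not depend on the permutation π of {1, 2, ..., n}. In particular,
\[ E\,g\Big(\sum_{j=1}^{n} p_{\pi(j)} X_j\Big) = E\,g\Big(\sum_{j=1}^{n} p_j X_j\Big). \]
Summing over all permutations π, and using Jensen's inequality, we obtain
\[ E\,g\Big(\sum_{j=1}^{n} p_j X_j\Big) = \frac{1}{n!}\sum_{\pi} E\,g\Big(\sum_{j=1}^{n} p_{\pi(j)} X_j\Big) \ge E\,g\Big(\frac{1}{n!}\sum_{\pi}\sum_{j=1}^{n} p_{\pi(j)} X_j\Big), \]
which is equivalent to (AI.2) since, for each j,
\[ \sum_{\pi} p_{\pi(j)} = (n-1)!\sum_{k=1}^{n} p_k = (n-1)!. \]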
See §2.6 for a particular application of this simple result.
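The inequality can be checked exactly in a small case. The sketch below is an illustration rather than part of the text: it assumes the variables are independent and uniformly distributed on a small finite support (a special case of exchangeability), and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product

def expected_g(weights, support, g):
    """Exact E g(sum_j p_j X_j) for X_1, ..., X_n i.i.d. uniform on `support`.

    Every one of the len(support)**n outcomes is enumerated, so the result
    is an exact rational number when the weights are Fractions.
    """
    n = len(weights)
    total = Fraction(0)
    for xs in product(support, repeat=n):
        total += g(sum(p * x for p, x in zip(weights, xs)))
    return total / len(support) ** n

# Unequal weights summing to one, compared with equal weights.
unequal = (Fraction(1, 2), Fraction(1, 3), Fraction(1, 6))
equal = (Fraction(1, 3),) * 3
support = (0, 1, 4)
```

With the convex function g(s) = s², `expected_g(equal, support, lambda s: s * s)` is strictly smaller than the value for `unequal`, since the variance of the weighted sum is proportional to the sum of the squared weights.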
APPENDIX II
The Genealogy of the Wright-Fisher Model
Some of the properties of the general model for neutral mutation formulated in §3.2 do not depend on the mutation rates, but simply on the family tree established by the Wright-Fisher reproductive mechanism. It may be helpful to isolate these strands in the argument, and indeed the theory may be of interest in its own right or in other contexts. It is often convenient to think about the Wright-Fisher model backwards in time. Each of the 2N members¹ of the t-th generation G_t chooses a parent from among the 2N members of the (t − 1)th generation G_{t−1}, each such parent being equally likely to be chosen, and the choices of different members of G_t being independent. Moreover, the whole random mechanism by which G_t selects its parents is independent of the corresponding mechanisms for other generations. Now fix t, and concentrate attention on r (≤ 2N) particular members of G_t. In an earlier generation G_{t−n} each of these r members will have an ancestor, but there is no reason why these should all be distinct. Denote by ξ_n the number of members of G_{t−n} who are ancestors of at least one of the r selected individuals in G_t. Then
The transition from ξ_n to ξ_{n+1} takes place (independently of ξ_0, ..., ξ_{n−1}) when each of the ξ_n ancestors in G_{t−n} chooses a parent independently from G_{t−n−1}; ξ_{n+1} is the number of distinct parents. It then follows that (ξ_n) is a Markov chain whose transition probability
\[ p_{ij} = P(\xi_{n+1} = j \mid \xi_n = i) \]
is the probability that, when i balls are placed at random in 2N boxes, exactly j boxes are nonempty. This has the explicit form
\[ p_{ij} = \frac{2N(2N-1)\cdots(2N-j+1)}{(2N)^{i}}\,S(i,j) \tag{AII.3} \]
in terms of the Stirling numbers S(i, j) of the second kind (Riordan [1]). It should be noticed that p_ij does not depend on r, t or n.
¹ We use 2N for the total population size in accordance with genetical custom, but there is no reason why 2N should not be an odd number.
If as usual p_ij^(n) denotes the (i, j)th element of the nth power of the matrix (p_ij; i, j = 1, 2, ..., 2N), then
so that the p_ij^(n) contain a good deal of information about the genealogical relationships between the r individuals. Unfortunately these quantities have no simple algebraic form for general values of the variables; there is, nevertheless, approximate information which is surprisingly accurate. The algebra can be carried out exactly for small values of r. In particular, it can be checked by induction that
\[ p_{21}^{(n)} = 1 - \lambda^{n}, \]
where
\[ \lambda = 1 - \frac{1}{2N}. \]
Thus the probability that two individuals in G_t have the same ancestor in G_{t−n} is 1 − λ^n. More generally, r individuals in G_t will have a single common ancestor in G_{t−n} unless at least one pair has no common ancestor, so that
\[ 1 - \lambda^{n} \ge p_{r1}^{(n)} \ge 1 - \binom{r}{2}\lambda^{n}. \tag{AII.7} \]
The right-hand inequality is, however, rather crude when r is at all large. In particular, it is of little use when r = 2N, to estimate the probability p_{2N,1}^{(n)} that the whole of G_t has a common ancestor in G_{t−n}. For this reason it is important that the binomial coefficient \binom{r}{2} can in fact be replaced by a factor which is bounded by an absolute constant. The trick is to prove that the transition probability p_ij satisfies the inequality
This is a straightforward exercise in the manipulation of Stirling numbers (see Kingman [4] for details). It implies that
so that
by induction on n. If
1, then
and hence
This strengthens very markedly the right-hand inequality of (AII.7), and implies in particular that
for any r, n, 2N. Because the stochastic matrix (p_ij; i, j = 1, 2, ..., 2N) is lower-triangular, its eigenvalues are the diagonal elements
which form a strictly decreasing sequence as j increases. Hence (noting that p_22 = λ) there are constants c_r, r ≥ 2, such that, as n → ∞,
and in particular,
Since
we have
Writing this in the form
for r ≥ 3, we determine the c_r recursively (starting with c_2 = 1). Because of (AII.3), the result is a rational function of 2N. To stress this fact we write it as c_r(2N). The leading term in this rational function can be calculated once we
note that
and that
Hence
whence by induction
This shows that in (AII.10) the numerical constant 3 is best possible, for if
for all r, n, 2N, (AII.11) implies that
Letting 2N → ∞ with r fixed gives
and letting r → ∞ then shows that C ≥ 3. In particular, the probability that the whole of G_t has a single common ancestor in G_{t−n} lies between 1 − 3λ^n and 1 − λ^n, and the constant 3 cannot be replaced by any smaller universal constant.
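These bounds can be checked exactly for a small population. The sketch below assumes the standard occupancy formula p_ij = 2N(2N − 1)···(2N − j + 1) S(i, j)/(2N)^i for the transition probabilities and λ = 1 − 1/(2N); the function names are hypothetical, and `size` plays the role of 2N.

```python
from fractions import Fraction
from math import comb

def stirling2(i, j):
    """Stirling number of the second kind S(i, j), by the usual recurrence."""
    table = [[0] * (j + 1) for _ in range(i + 1)]
    table[0][0] = 1
    for a in range(1, i + 1):
        for b in range(1, min(a, j) + 1):
            table[a][b] = b * table[a - 1][b] + table[a - 1][b - 1]
    return table[i][j]

def transition_matrix(size):
    """P[i-1][j-1] = probability that i balls thrown at random into `size`
    boxes occupy exactly j boxes."""
    P = [[Fraction(0)] * size for _ in range(size)]
    for i in range(1, size + 1):
        for j in range(1, i + 1):
            falling = 1
            for k in range(j):
                falling *= size - k
            P[i - 1][j - 1] = Fraction(falling * stirling2(i, j), size ** i)
    return P

def common_ancestor_probs(size, generations):
    """rows[n-1][r-1] = probability that r individuals have a single common
    ancestor n generations back, i.e. the (r, 1) entry of the n-th power."""
    P = transition_matrix(size)
    v = [Fraction(1)] + [Fraction(0)] * (size - 1)  # indicator of state 1
    rows = []
    for _ in range(generations):
        v = [sum(P[i][j] * v[j] for j in range(size)) for i in range(size)]
        rows.append(v)
    return rows
```

Because the arithmetic is rational, each entry of `common_ancestor_probs(6, 12)` can be compared directly with 1 − λ^n, 1 − 3λ^n and the pairwise bound for every r from 2 to 6.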
Bibliography D. J. ALDOUS [1] Representations for partially exchangeable arrays of random variables, J. Mult. Anal., to appear. K. B. ATHREYA AND P. E. NEY [1] Branching Processes, Springer, Berlin, 1972. H. CHERNOFF [1] A measure of asymptotic efficiency for tests of a hypothesis based on a sum of observations, Ann. Math. Statist., 23(1952), pp. 439-507. G. CHOQUET [1] Lectures on Analysis, Benjamin, Reading, MA, 1969. J. A. COYNE [1] Lack of genie similarity between two sibling species of Drosophila as revealed by varied techniques, Genetics, 84(1976), pp. 593-607. S. N. ETHIER [1] A class of degenerate diffusion processes occurring in population genetics, Comm. Pure Appl. Math., 29(1976), pp. 483-493. S. N. ETHIER AND T. NAGYLAKI [1] Diffusion approximations of Markov chains with two time scales and applications to population genetics, Adv. Appl. Prob., to appear. S. N. ETHIER AND M. F. NORMAN [1] Error estimate for the diffusion approximation of the Wright-Fisher model, Proc. Nat. Acad. Sci., 74(1977), pp. 5096-5098. W. J. EWENS [1] Population Genetics, Methuen, London, 1969. [2] The sampling theory of selectively neutral alleles, Theor. Pop. Biol., 3(1972), pp. 87-112. W. J. EWENS AND W.-H. Li [1] Frequency spectra of neutral and deleterious alleles in a finite population, to appear. W. J. EWENS AND G. THOMSON [1] Properties ofequilibra in multi-locus genetic systems, Genetics, 87(1977), pp. 807-819. W. FELLER [1] An Introduction to Probability Theory and Its Applications, Vol. 1, Wiley, New York, 1957. W. H. FLEMING [1] A selection-migration model in population genetics, J. Math. Biol., 2(1975), pp. 219-233. W. H. FLEMING AND M. VIOT [1] Some measure-valued population processes, to appear. J. H. GlLLESPIE
[1] A general model to account for enzyme variation in natural populations. Ill: Multiple alleles, Amer. Naturalist, 110(1976), pp. 809-821. [2] A general model. . . . V: The SAS-CFF model, Theor. Pop. Biol., 14(1978), pp. 1-45. R. C. GRIFFITHS [1] On the distribution of allele frequencies in a diffusion model, Theor. Pop. Biol., 15(1978), pp. 140-158. [2] A transition density expansion for a multi-allele diffusion model, Adv. Appl. Prob., 11 (1979), pp. 310-325. [3] Exact sampling distributions from the infinite neutral alleles model, Adv. Appl. Prob., 11 (1979), pp. 326-354. O. KALLENBERG [1] Canonical representations and convergence criteria for processes with interchangeable increments, Z. Wahrscheinlichkeitsth., 32(1973), pp. 309-321. 67
68
BIBLIOGRAPHY
S. KARLIN
[1] General two-locus selection models: Some objectives, results and interpretations, Theor. Pop. Biol., 7(1975), pp. 364-398.
[2] Theoretical aspects of multi-locus selection balance, Studies in Math. Biology, 16, pp. 503-587.
S. KARLIN AND D. CARMELLI
[1] Numerical studies on two-loci selection models with general viabilities, Theor. Pop. Biol., 7(1975), pp. 399-421.
S. KARLIN AND U. LIBERMAN
[1] Representation of nonepistatic selection models and analysis of multi-locus Hardy-Weinberg equilibrium configurations, to appear.
S. KARLIN AND J. L. MCGREGOR
[1] The number of mutant forms maintained in a population, Proc. Fifth Berkeley Symp., 4(1967), pp. 415-438.
[2] Addendum to a paper of W. Ewens, Theor. Pop. Biol., 3(1972), pp. 113-116.
[3] Linear growth models with many types and multidimensional Hahn polynomials, Theory and Application of Special Functions, Academic Press, New York, 1975.
F. P. KELLY
[1] On stochastic population models in genetics, J. Appl. Prob., 13(1976), pp. 127-131.
[2] Exact results for the Moran neutral allele model, Adv. Appl. Prob., 9(1977), pp. 197-201.
H. KESTEN
[1] Quadratic transformations: A model for population growth, Adv. Appl. Prob., 2(1970), pp. 1-82 and 179-228.
[2] Limit theorems for stochastic growth models, Adv. Appl. Prob., 4(1972), pp. 193-232 and 393-428.
M. KIMURA
[1] Random genetic drift at a tri-allelic locus: Exact solution with a continuous model, Biometrics, 12(1956), pp. 57-66.
[2] Diffusion models in population genetics, J. Appl. Prob., 1(1964), pp. 177-232.
[3] The neutral theory of molecular evolution and polymorphism, Scientia, 112(1977), pp. 687-707.
M. KIMURA AND T. OHTA
[1] A model of mutation appropriate to estimate the number of electrophoretically detectable alleles in a finite population, Genet. Res., 22(1973), pp. 201-204.
[2] Stepwise mutation model and distribution of allelic frequencies in a finite population, Proc. Nat. Acad. Sci., 75(1978), pp. 2868-2872.
J. F. C. KINGMAN
[1] A mathematical problem in population genetics, Proc. Camb. Phil. Soc., 57(1961), pp. 574-582.
[2] Markov population processes, J. Appl. Prob., 6(1969), pp. 1-18.
[3] Random discrete distributions, J. Roy. Statist. Soc. B, 37(1975), pp. 1-22.
[4] Coherent random walks arising in some genetical problems, Proc. Roy. Soc. A, 351(1976), pp. 19-31.
[5] On the properties of bilinear models for the balance between genetic mutation and selection, Proc. Camb. Phil. Soc., 81(1977), pp. 443-453.
[6] The population structure associated with the Ewens sampling formula, Theor. Pop. Biol., 11(1977), pp. 274-283.
[7] A note on multi-dimensional models of neutral mutation, Theor. Pop. Biol., 11(1977), pp. 285-290.
[8] Remarks on the spatial distribution of a reproducing population, J. Appl. Prob., 14(1977), pp. 577-583.
[9] Random partitions in population genetics, Proc. Roy. Soc., 361(1978), pp. 1-20.
[10] Uses of exchangeability, Ann. Prob., 6(1978), pp. 183-197.
[11] A simple model for the balance between selection and mutation, J. Appl. Prob., 15(1978), pp. 1-12.
[12] The dynamics of neutral mutation, Proc. Roy. Soc. A, 363(1978), pp. 135-146.
[13] The representation of partition structures, J. Lond. Math. Soc., 18(1978), pp. 374-380.
S. L. LAURITZEN
[1] Sufficiency, prediction and extreme models, Scand. J. Statist., 1(1974), pp. 128-134.
W.-H. LI
[1] Maintenance of genetic variability under mutation and selection processes in a finite population, Proc. Nat. Acad. Sci., 74(1977), pp. 2509-2513.
[2] Maintenance of genetic variability under the joint effect of mutation, selection and random drift, Genetics, 90(1978), pp. 349-382.
[3] Maintenance of genetic variability under the pressure of neutral and deleterious mutations in a finite population, to appear.
S. P. H. MANDEL AND P. A. G. SCHEUER
[1] An inequality in population genetics, Heredity, 13(1959), pp. 519-524.
P. MARTIN-LÖF
[1] Repetitive structures and the relation between canonical and microcanonical distributions in statistics and statistical mechanics, Proceedings of the Conference on Foundational Questions in Statistical Inference, Aarhus, 1973.
P. A. P. MORAN
[1] Random processes in genetics, Proc. Camb. Phil. Soc., 54(1958), pp. 60-72.
[2] On the nonexistence of adaptive topographies, Ann. Hum. Genet., 27(1964), pp. 383-393.
[3] Wandering distributions and the electrophoretic profile, Theor. Pop. Biol., 8(1975), pp. 318-330.
[4] A selective model for electrophoretic profiles in protein polymorphisms, Genet. Res., 28(1976), pp. 47-54.
[5] Global stability of genetic systems governed by mutation and selection, Proc. Camb. Phil. Soc., 80(1976), pp. 331-336.
[6] Global stability . . . II, Proc. Camb. Phil. Soc., 81(1977), pp. 435-441.
M. NEI AND W. H. LI
[1] The transient distribution of allele frequencies under mutation pressure, Genet. Res., 28(1976), pp. 205-214.
J. RIORDAN
[1] An Introduction to Combinatorial Analysis, Wiley, New York, 1958.
K. SATO
[1] Diffusion processes and a class of Markov chains related to population genetics, Osaka J. Math., 13(1976), pp. 631-659.
[2] Convergence to a diffusion of multi-allelic model in population genetics, Adv. Appl. Prob., 10(1978), pp. 538-562.
E. SENETA
[1] Non-Negative Matrices, Allen & Unwin, London, 1973.
R. S. SINGH, R. C. LEWONTIN AND A. A. FELTON
[1] Genetic heterogeneity within electrophoretic 'alleles' of xanthine dehydrogenase in Drosophila pseudoobscura, Genetics, 84(1976), pp. 609-629.
A. C. TRAJSTMAN
[1] On a conjecture of G. A. Watterson, Adv. Appl. Prob., 6(1974), pp. 489-493.
G. A. WATTERSON
[1] Some theoretical aspects of diffusion theory in population genetics, Ann. Math. Statist., 33(1962), pp. 939-957.
[2] The application of diffusion theory to two population genetic models of Moran, J. Appl. Prob., 1(1964), pp. 233-246.
[3] The sampling theory of selectively neutral alleles, Adv. Appl. Prob., 6(1974), pp. 463-488.
[4] Models for the logarithmic species abundance distributions, Theor. Pop. Biol., 6(1974), pp. 217-250.
[5] The stationary distribution of the infinitely-many neutral alleles diffusion model, J. Appl. Prob., 13(1976), pp. 639-651.
[6] Reversibility and the age of an allele. I: Moran's infinitely many neutral alleles model, Theor. Pop. Biol., 10(1976), pp. 239-253.
[7] An analysis of multi-allelic data, Genetics, 88(1978), pp. 171-179.
[8] Heterosis or neutrality?, Genetics, 85(1977), pp. 789-814.
S. WRIGHT
[1] Adaptation and selection, in Genetics, Paleontology and Evolution, Jepsen, Simpson and Mayr, eds., Princeton University Press, Princeton, NJ, 1949, pp. 365-389.