Biostatistics (2001), 2, 2, pp. 147–162 Printed in Great Britain
Mapping quantitative trait loci in oligogenic models HSIU-KHUERN TANG, D. SIEGMUND† Department of Statistics, 390 Serra Mall, Sequoia Hall, Stanford University, Stanford, CA 94305-4065, USA Email:
[email protected] S UMMARY We discuss strategies for mapping quantitative trait loci with emphasis on certain issues of study design that have recently received attention: e.g. genotyping only selected pedigrees and the comparative value of large pedigrees versus sib pairs. We use a standard variance components model and a parametrization of the genetic effects in which the ‘segregation’ parameters are locally orthogonal to the ‘linkage’ parameters. This permits simple explicit expressions for the expectation of the score statistic, which we use to compare the power of different strategies. We also discuss robustness of the score statistic. Keywords: Gene mapping; Genome scan; Quantitative trait; Variance components.
1. I NTRODUCTION The goal of genetic mapping is to locate the genes affecting particular traits by analysis of the correlation between phenotypic values and genetic markers distributed throughout the genome. The traits can involve a 0–1 phenotype (e.g. human diseases) or can be based on quantitative measurement. One expects relatives who have similar phenotypes to have similar genotypes at marker loci close to genetic loci that influence the trait, while the markers behave stochastically according to the rules of Mendelian inheritance at distant loci. Until recently, the theory and practice of mapping quantitative trait loci (QTLs) in humans has been relatively undeveloped. The purpose of this paper is to discuss, essentially from first principles, the statistical theory for mapping QTLs in humans under the simplifying assumptions of an oligogenic model of inheritance and completely informative genetic markers. Although this model has the deficiency of a strong assumption of normality, its interpretability and its computational tractability in the incorporation of covariates, multivariate phenotypes, interactions, etc. is sufficiently attractive to have encouraged substantial recent development (e.g. Amos (1994); Fulker and Cardon (1994); Kruglyak and Lander (1995); Fulker and Cherny (1996); Almasy and Blangero (1998); Page et al. (1998); Williams and Blangero (1999)). We try insofar as possible to give explicit analytic accounts of a number of issues that have usually been treated in the literature by numerical methods or by simulation, especially the relative values of large versus small pedigrees (cf. Almasy and Blangero (1998); Page et al. (1998); Williams and Blangero (1999)) and of genotyping selected versus random sibships (Risch and Zhang, 1995; Eaves and Meyer, 1994). In the process we generalize the notion of a discordant sib pair to a discordant sibship. In the following section we describe the model to be studied. We discuss sib pairs in Section 3. Sibships of arbitrary size are the subject of Section 4. Selective genotyping is the subject of Section 5. In Section 6 † To whom correspondence should be addressed.
c Oxford University Press (2001)
148
H.-K. TANG AND D. S IEGMUND
we give a brief discussion of robustness of the components of variance model when a biallelic major gene model is correct. Our main conclusions are discussed in Section 7. To facilitate our calculations we introduce a parametrization in which the ‘segregation’ parameters (those that can be estimated from segregation data) and ‘linkage’ parameters (those requiring data from linked markers for their estimation) are orthogonal under the null hypothesis of no linkage (Cox and Hinkley, 1974, p. 324). We also use the standard asymptotic framework of statistical large sample theory, that the sample size N is large and the noncentrality parameter for a single observation is inversely proportional to N 1/2 . These techniques allow us to compute score statistics, their asymptotic expectations, and Fisher information matrices comparatively explicitly. 2. M ODEL We assume Hardy–Weinberg equilibrium throughout. Suppose there is a QTL at the locus τ , which is an unknown parameter. The phenotypic value Y is assumed to be given by Y = µ + αU + αV + δU,V + e.
(2.1)
The mean value µ may be expanded as a linear model with only minor changes to what follows. The parameter αa denotes the additive genetic effect of allele a at locus τ , δa,b the dominance deviation of alleles a, b. A subscript U denotes the allele contributed by the mother while a subscript V refers to the father. The phenotypic variance is σY2 = E[(Y − µ)2 ]. The variances of the additive and dominance 2 . Implicitly we expect that there are effects associated with the QTL at τ are σ A2 = 2EαU2 and σ D2 = EδU,V several QTLs, which may interact. For this paper we assume that other QTLs are on other chromosomes, are in linkage equilibrium with and do not interact with the QTL at τ . Under these assumptions their contribution to the phenotype Y can be assumed to be a part of the residual term e, which we assume is uncorrelated with the other terms in (2.1) and has variance σe2 . Then σY2 = σ A2 + σ D2 + σe2 . The locusspecific heritability associated with the QTL at τ is h 2 = (σ A2 + σ D2 )/σY2 . The term ‘oligogenic’ of the title is intended to suggest that at least one QTL has a substantial locus-specific heritability, say h 2 > 0.1, and hence might be detectable by linkage analysis with reasonable sample sizes. In order to have some idea of the magnitudes of different components of variance and their relations to more directly interpretable genetic parameters, we shall occasionally be interested in the special case of two alleles A1 and A2 with frequencies of p and q = 1 − p. Using a and −a for the genotypic values of the homozygotes A1 A1 and A2 A2 , respectively, and d for the dominance deviation of the heterozygote, we have the standard formulas: σ A2 = 2 pq[a + (q − p)d]2 ,
σ D2 = 4 p 2 q 2 d 2 .
(2.2)
Consider a pair of siblings satisfying the model (2.1). Recall that at any locus two relatives share alleles identical by descent if they inherit the same alleles from a common ancestor. Two siblings can share 2, 1 or 0 alleles identical by descent depending on whether they inherit the same alleles from both mother and father, from one but not both, or from neither. Let ν = ν(τ ) denote the number of alleles identical by descent at τ . Letting Yi denote the phenotypic value of the ith sibling (i = 1, 2), we have (Fisher, 1918; Kempthorne, 1955) Cov[Y1 , Y2 |ν] = σe2r + σ A2 ν/2 + σ D2 1{ν=2} ,
(2.3)
where r = corr(e1 , e2 ) accounts for the correlation between sibs arising from other QTLs and from a shared environment.
Mapping quantitative trait loci in oligogenic models
149
Taking the expectation of (2.3), we find the unconditional covariance. Hence we can rewrite (2.3) as Cov[Y1 , Y2 |ν] = Cov(Y1 , Y2 ) + [(σ A2 + σ D2 )/2](ν − 1) − (σ D2 /2)[1{ν=1} − 12 ],
(2.4)
so the terms involving ν have mean 0 and are uncorrelated. In the following we will be interested in marker loci t distributed throughout the genome and ν(t) for a sib pair as a stochastic process in t. For markers t1 and t2 on different chromosomes, ν(t1 ) and ν(t2 ) are stochastically independent. For markers on the same chromosome Cov[ν(t1 ), ν(t2 )] = 2−1 [1−2φ], where φ is a function of the recombination frequency. We find it convenient to assume that recombination follows the Haldane model of no interference, so 1 − 2φ = exp(−4λ|t1 − t2 |). Here the marker location t denotes genetic distance in centimorgans (cM) from a designated end of the chromosome, and λ = 0.01/cM. 3. L IKELIHOOD THEORY FOR SIB PAIRS In this section we consider genome scans to detect the QTL that is explicitly modelled in (2.1), although we acknowledge that several QTLs may contribute to the trait. Since the location τ of the QTL is unknown, our statistic takes the form of a stochastic process indexed by markers at loci t distributed throughout the genome. Large values of this process indicate the likelihood of a QTL located near to the markers where the large values occur. Initially we consider N pairs of siblings, and for simplicity we assume that markers are completely informative, so the values of ν(t) are known with certainty. Consider a sample of N sib pairs with phenotypes (Y11 , Y12 ), . . . , (Y N 1 , Y N 2 ) and data ν1 (ti ), . . . , ν N (ti ) from markers ti distributed throughout the genome. Our basic modelling assumption is that (Yn1 , Yn2 ) are independent with a common distribution, which conditional on ν = νn (τ ) is bivariate normal with common means µ, variances σY2 , and correlation (cf. (2.4)) ρν = ρ + {α0 (ν − 1) − δ0 [1{ν=1} − 12 ]}/σY2 ,
(3.1)
where α0 = [σ A2 + σ D2 ]/2,
δ0 = σ D2 /2.
(3.2)
It is convenient to define Dn = (Yn1 − Yn2 )/21/2 and Sn = (Yn1 + Yn2 − 2µ)/21/2 . For simplicity we assume initially that the parameters µ and σY2 are known and equal to 0 and 1, respectively. As we show below, this has no effect on the asymptotic theory. We also make the working assumption that the QTL τ is one of the ti (cf. Remark (iii) below), so the marginal log likelihood function, = (τ, α0 , δ0 , ρ), is given by = −2−1 [log(1 − ρν2n ) + Dn2 /(1 − ρνn ) + Sn2 /(1 + ρνn )]. (3.3) n
Here νn = νn (τ ), and ρν is given by (3.1). This can be regarded as the conditional likelihood of the phenotypic data given (νn , n = 1, . . . , N ) or as the unconditional joint log likelihood of the phenotypic data and the (νn ). All expectations are taken with respect to the joint distribution. Partial derivatives with respect to unknown parameters are denoted by appropriate subscripts. Let Cn = Cn (α0 , δ0 , ρ) be defined by Cn = ρνn /(1 − ρν2n ) + Sn2 /2(1 + ρνn )2 − Dn2 /2(1 − ρνn )2 .
(3.4)
150
H.-K. TANG AND D. S IEGMUND
Regarding τ momentarily as known, we obtain the components of the efficient score: α (τ ) = [νn (τ ) − 1]Cn ,
(3.5)
n
δ (τ ) =
n
[ 12 − 1{νn (τ )=1} ]Cn ,
and ρ =
(3.6)
Cn .
(3.7)
n
When α0 = 0 (hence also δ0 = 0), the three coordinates of the score vector are uncorrelated, so the Fisher information matrix is diagonal, with easily computed entries. When it is assumed that there is no dominance variance, so δ0 = 0, the score statistic for testing α0 = 0 at a putative QTL τ = t is 1/2 Z 1 (t) = α (t)/Iαα , (3.8) + where now Cn = Cn (0, 0, ρ) ˆ and ρˆ = N −1 is the maximum likelihood estimator under n Y1,n Y2,n the null hypothesis. Since the true value of τ is unknown, to scan an entire genome for linkage, we can use
max Z 1 (t),
(3.9)
t
where the max is taken over all marker loci. Linkage is detected whenever maxt Z 1 (t) b for a suitable threshold b. Thresholds to control the false detection rate have been discussed by Feingold et al. (1993) for markers equally spaced at distance 0. See also Lander and Kruglyak (1995) and Lander and Schork (1994). To allow for the possibility that δ0 > 0, we can define 1/2
(3.10)
[Z 12 (t) + Z 22 (t)]1/2 ,
(3.11)
Z 2 (t) = δ (t)/Iδδ . and use the two degree of freedom statistic
constrained by the relation that 0 δ0 α0 (cf. (3.2)). Dupuis and Siegmund (2000) discuss this possibility in the context of sib pair analysis of qualitative traits. R EMARKS (i) For unknown σY2 , µ, and ρ one uses µˆ = (2N )−1 n (Y1,n + Y2,n ), σˆ Y2 = + 2 (2N )−1 n [(Y1,n − µ) ˆ 2 + (Y2,n − µ) ˆ 2 ] and ρˆ = N −1 ˆ 2,n − µ) ˆ /σˆ Y . The asymptotic n (Y1,n − µ)(Y results are unchanged since the scores for the ‘segregation’ parameters σY2 , µ, ρ are uncorrelated with those for the ‘linkage’ parameters α0 , δ0 under α0 = δ0 = 0 and asymptotically for local alternatives. (ii) The components of variance analysis uses the relation for normally distributed X that the variance of X 2 is twice the square of the variance of X , and hence it is not robust to violations of the assumed normality. A simple device to obtain a more robust test would be to refer the statistic given above to ˆ This would make the type I its conditional distribution given (C1 , . . . , C N ), where Cn = Cn (0, 0, ρ). error probability nonparametric with respect to the distribution of the phenotypes while maintaining full asymptotic efficiency if the normality hypothesis is satisfied. Asymptotically it amounts to replacing (3.8) and (3.10) by 1/2 1/2 Z 1 (t) = α (t)/ Cn2 , Z 2 (t) = δ (t)/ Cn2 , (3.12) n
n
Mapping quantitative trait loci in oligogenic models
151
Fig. 1. Expected value of the score statistic: QTL location is τ , expectation at τ is ζ , detection threshold is b, and the × denote simulated values at equally spaced marker loci.
which has the effect of using fourth moments of the phenotypic values to estimate variability. (iii) We have assumed completely informative markers in order to simplify the analysis and have made the working assumption that the QTL τ is one of the markers. If either of these assumptions fails to be true, the likelihood function involves a mixture based on the conditional distribution of νn (τ ) given the marker data, say Mn , in the nth family. A convenient representation for the likelihood function is E 0 [exp((τ, α0 , δ0 , ρ) − (0, 0, ρ))|M, Y ],
(3.13)
where M = (M1 , . . . , M N ), Y = (Y11 , Y21 , . . . , Y1N , Y2N ), is given by (3.3), and the subscript 0 denotes that the expectation is computed under the assumption that α0 = δ0 = 0. For the case of partially informative markers, on the evidence of simulations, Fulker and Cherny (1996) report very good success with the test to detect an additive effect that simply replaces νn in (3.3) by νˆ n = E 0 (νn |Mn , Yn ). Equation (3.5) provides an explanation for the Fulker–Cherny results, since the efficient score for testing α0 = 0 is precisely (3.5) with Cn = Cn (0, 0, ρ) ˆ and with νn replaced by νˆ n . To prove this, differentiate the logarithm of (3.13) with respect to α0 and evaluate the result at α0 = δ0 = 0. See Teng and Siegmund (1998) for a discussion of the impact of marker informativeness and intermarker distance on the power to detect linkage. The same calculation gives the efficient score when the QTL τ is located between markers. In principle one might extend the max in (3.9) from marker loci to all loci t, but in most cases there seems to be very little power gained by this device (cf. Darvasi et al. (1993); Dupuis and Siegmund (1999)), so we do not pursue the more complicated analysis. Display (4.8) derived below gives an expression for the noncentrality ξ = E[Z 1 (τ )]. At a marker t linked to τ , E[Z 1 (t)] = ξ exp[−4λ|t − τ |]. These relations are illustrated in Figure 1, where ξ > b, so the probability of detection is large. The asymptotic squared noncentrality of (3.11) at the QTL τ is given by (4.8) with α02 /2 replaced by α02 /2 + δ02 /4. In Table 1 we have used (4.8) to evaluate the sample size necessary to achieve 90% power for the score statistic (3.8) and various values of α0 and ρ. We consider a genome scan at 1 cM and assume a genome of 23 chromosomes of average length 140 cM. This yields a detection threshold of b ≈ 3.91 (cf. Feingold et al. (1993) or the Appendix). We assume the trait locus is at zero recombination distance from
152
H.-K. TANG AND D. S IEGMUND Table 1. Number of sib pairs for 90% power. The overall correlation between two siblings is ρ, and the locus-specific heritability is 2α0 /σY2 . For = 1, the QTL is assumed to lie at a marker; a 2% larger sample size is required for a QTL midway between two flanking markers. For = 10, the QTL is assumed to lie midway between two markers ρ 0.25
α0 /σY2 0.25 0.15 0.10 0.05
N ( = 1) 662 1838 4136 16544
N ( = 10) 780 2168 4878 19512
0.40
0.25 0.15 0.10 0.05
487 1352 3041 12166
574 1594 3587 14348
the nearest marker. For 90% power the approximation of Feingold et al. (1993) given in the Appendix, leads to the value of N that makes the squared noncentrality parameter given by (4.8) about equal to 25. The required sample size would increase by about 2% if the QTL is midway between markers. We have also included the sample size required for 90% power when markers are spaced at 10 cM and the QTL is midway between markers. The detection threshold decreases to approximately b = 3.6; but for a QTL midway between markers, we need an approximately 16–18% increase in sample size compared to the 1 cM spacing. Simulations indicate that these approximations are very accurate. An interesting comparison is available from Page et al. (1998), who have estimated the sample size required to have a probability of 0.9 for the LOD score evaluated at a single marker to exceed the conventional level of 3, or equivalently b = 3.72 on the normal scale, which we are using here. Although this is in principle quite different from the situation that arises when multiple, closely spaced markers are used, in the special case of 1 cM spacing and the QTL exactly at a marker, by a numerical accident their definition also leads to the value of N that makes the squared noncentrality parameter equal to (3.72 + 1.28)2 = 25. Hence our analytic approximations can be compared with the results in Table 4 of Page et al. (1998), which were based on 5 000 000 simulations of an additive model with one biallelic major gene and a polygenic component. The agreement is generally excellent. If the true mode of inheritance is roughly additive, the two-dimensional statistic based on (3.11), which under the conditions given above requires a threshold of about b = 4.11 (Dupuis and Siegmund, 2000), has less power than the corresponding one-dimensional statistic. Some numerical experimentation indicates that (3.11) is more efficient than (3.8) only for a fairly rare recessive allele, which must have a large effect for the QTL to be detectable with a reasonable sample size. Hence the simpler one-dimensional statistic would seem adequate in most situations. 4. L IKELIHOOD THEORY FOR SIBSHIPS OF ARBITRARY SIZE Starting with Blackwelder and Elston (1982), a number of authors have observed on the basis of simulations that sibships of size s provide considerably more power than sib pairs, perhaps as much as s(s − 1)/2 independent sib pairs. See also Page et al. (1998); Williams and Blangero (1999). Suppose
Mapping quantitative trait loci in oligogenic models
153
we have a sample of N sibships, each of size s. We index sibs within a sibship by i and j and sibships by n = 1, . . . , N . The subscript n is often suppressed in our notation. For ease of exposition we again assume that µ = 0 and σY2 = 1. Let νi j (t) denote the number of alleles shared identical by descent at the marker locus t by the ith and jth sibs in the nth sibship. Let Aν denote the s × s matrix with entries νi j − 1 for i = j and zeros along the diagonal. Let ν = E(Y Y |Aν ). The log likelihood for a single QTL at τ is = (τ, α0 , δ0 , ρ) given by = −2−1
N
{log |ν | + trν−1 Y Y },
(4.1)
n=1
where ν = ν(τ ). Recall that if G is a nonsingular matrix depending on x, then ∂ log |G|/∂ x = tr(G −1 ∂G/∂ x) and −1 ∂G /∂ x = −G −1 ∂G/∂ x G −1 . By differentiation of (4.1) we obtain the score equations α = 2−1 {−tr(ν−1 Aν ) + tr(ν−1 Aν ν−1 Y Y )} (4.2) n
and ρ = 2−1
n
{−tr(ν−1 B) + tr(ν−1 Bν−1 Y Y )},
where B = ∂ν /∂ρ = 11 − I . We omit the similar expression for δ (cf. (3.6)). Also −αα = {−2−1 tr(ν−1 Aν ν−1 Aν ) + tr(ν−1 Aν ν−1 Aν ν−1 Y Y )}.
(4.3)
(4.4)
n
Similar expressions for ρρ , αρ , δδ , etc. are easily obtained. Let E 0 denote expectation under the hypothesis that α0 = 0 (hence also δ0 = 0) and let = E 0 (Y Y ) = (1 − ρ)I + ρ11 . It is easy to see that Iαα = E 0 (−αα ) = (N /2)trE 0 ( −1 Aν −1 Aν )
(4.5)
and Iαρ = E 0 (α ρ ) = 0. Hence the asymptotic noncentrality of the score statistic 1/2 Z t = α (t, 0, 0, ρ)/I ˆ αα (0, 0, ρ) ˆ
(4.6)
can be evaluated by taking the expectation of (4.6) with ρˆ replaced by ρ. From (3.1) and (4.2) we obtain for t = τ E[α (τ, 0, 0, ρ)] = (N /2)α0 trE( −1 Aν −1 Aν ). Hence by (4.5) the noncentrality of (4.6) is α0 {(N /2)trE 0 ( −1 Aν −1 Aν )}1/2 . To evaluate (4.7), we first observe that −1 = K ρ {[1 + (s − 1)ρ]I − ρ11 },
(4.7)
154
H.-K. TANG AND D. S IEGMUND
where K ρ = {(1 − ρ)[1 + (s − 1)ρ]}−1 . See, for example, Rao (1973, p. 67). It is easy to see that νi j = ν ji , and for i < j the νi j are pairwise independent whenever they differ for at least one subscript. Hence E A2ν = [(s−1)/2]I . By straightforward and somewhat tedious algebra we find that E(Aν 11 Aν ) = (s/2 − 1)I + (1/2)11 . Combining these results, we obtain trE 0 (
−1
Aν ) = 2
K ρ2
s {[1 + (s − 2)ρ]2 + ρ 2 }, 2
which can be substituted into (4.7) to obtain the square of the asymptotic noncentrality parameter for N sibships of size s: s {[1 + (s − 2)ρ]2 + ρ 2 } ξ =N 4 . 2σY 2 {(1 − ρ)[1 + (s − 1)ρ]}2 2
α02
(4.8)
Although (4.8) increases rapidly with s, there are dependencies among the νi j within a sibship, so α has a skewed distribution when s > 2, and hence a larger threshold is required to maintain a fixed false positive error rate. In addition, standard asymptotic theory to the effect that the asymptotic variance of the score statistic for small, positive α0 is effectively the same as when α0 = 0 is not a reasonable approximation for large s (hence relatively small sample sizes). Consequently, the increase in power with increasing s is less than is suggested by considering only the increase in the asymptotic noncentrality parameter. Numerical examples are given in Table 2, for which we used a more refined asymptotic analysis, which we describe in an Appendix and have spot checked for accuracy by simulations. (Even for s = 2 there is a small discrepancy between the sample sizes in Tables 1 and 2 because of the different approximations used.) We see from
Table 2 that, except for large s or large α0 , a sibship of size s turns out to be roughly as powerful as 2s independent sib pairs. For small sibships Williams and Blangero have also obtained the noncentrality parameter (4.8) and have computed sample sizes directly from (4.8) without the corrections mentioned in the preceding paragraph. These can be misleading, especially if carried out for larger s. For example, for the third row of Table 2 their method would yield N = 100; for the sixth row it would yield N = 12. Our more precise asymptotic analysis seems to be consistent with the simulations of Page et al. (1998). Exactly as for sib pairs, we can obtain a distribution-free false positive rate if we consider the conditional distribution of α given the phenotypic values. Asymptotically that means that α should be standardized by {E 0 [2α |Y1 , . . . , Y N ]}1/2 , which is given by the square root of a sum of terms of the form E 0 {[−tr( −1 Aν ) + Y −1 Aν −1 Y ]2 | Y } s 2 µ22 sµ4 4sµ3 Y¯ 2s(s − 3)µ2 Y¯ 2 =− − + + (1 − ρ)4 [1 + (s − 1)ρ](1 − ρ)3 [1 + (s − 1)ρ]2 (1 − ρ)2 (1 − ρ)4 2sρµ2 s(s − 1){ρ[1 + (s − 1)ρ] + (1 − ρ)Y¯ 2 }2 − + , 3 [1 + (s − 1)ρ](1 − ρ) [1 + (s − 1)ρ]4 (1 − ρ)2
(4.9)
where µk = s −1 (Yi − Y¯ )k for k = 2, 3, 4. In the following section we find (4.9) useful for a completely different purpose. Our methods can be adapted to study extended pedigrees, although it is more difficult to obtain explicit analytic results. An exception is the case of nuclear families consisting of parents and their children. In addition to the inter-sib correlation ρ, we let ρ p denote the parental correlation (due to shared environment only) and ρ˜ = (σ A2 /2 + r˜ σe2 )/σY2 the parent–sib correlation. For N nuclear families, each containing s
Mapping quantitative trait loci in oligogenic models
155
Table 2. Sample sizes required for 90% power. Sibships of size s; sib pair correlation is ρ; sample size is N ; threshold is b; noncentrality is ξ ; rel. eff. is the number of independent sib pairs needed to have the same power as one sibship of size s; see the text for definitions of other parameters ρ 0.25
α0 /σY2 0.25
s 2 3 4 5 6 10
b 3.91 4.02 4.11 4.19 4.26 4.48
ξ 5.05 5.34 5.60 5.86 6.10 7.00
N 675 237 125 80 56 23
σ02 1.13 1.52 1.97 2.48 3.05 5.86
σ12 1.08 1.28 1.48 1.68 1.89 2.74
rel. eff. 1.0 2.8 5.4 8.5 12 30
0.25
0.125
2 3 4 5 6 10
3.91 3.97 4.02 4.07 4.12 4.27
5.01 5.15 5.29 5.43 5.57 6.07
2 658 881 445 272 186 69
1.03 1.20 1.39 1.60 1.81 2.84
1.02 1.11 1.19 1.28 1.38 1.75
1.0 3.0 6.0 9.8 14 39
0.25
0.0625
2 3 4 5 6 10
3.91 3.94 3.97 4.00 4.03 4.12
4.99 5.07 5.14 5.21 5.29 5.55
10 546 3 417 1 679 1 001 670 230
1.01 1.09 1.17 1.26 1.36 1.77
1.01 1.04 1.09 1.13 1.17 1.34
1.0 3.1 6.3 11 16 46
0.40
0.15
2 3 4 5 6 10
3.91 3.99 4.07 4.13 4.19 4.38
5.02 5.23 5.44 5.63 5.83 6.54
1 363 446 225 137 95 37
1.07 1.33 1.64 1.98 2.36 4.18
1.05 1.17 1.31 1.46 1.61 2.21
1.0 3.1 6.1 10 15 38
0.40
0.125
2 3 4 5 6 10
3.91 3.98 4.04 4.10 4.16 4.33
5.01 5.19 5.36 5.53 5.70 6.34
1 954 632 314 191 130 49
1.05 1.26 1.50 1.78 2.07 3.46
1.03 1.14 1.25 1.36 1.48 1.97
1.0 3.1 6.2 10 15 40
sibs, the squared noncentrality equals N α02 s [1 − 2ρ˜ 2 /(1 + ρ p ) + (s − 2)(ρ − 2ρ˜ 2 /(1 + ρ p )]2 + (ρ − 2ρ˜ 2 /(1 + ρ p ))2 . [1 − 2ρ˜ 2 /(1 + ρ p ) + (s − 1)(ρ − 2ρ˜ 2 /(1 + ρ p ))]2 (1 − ρ)2 2σY2 2 This is somewhat larger than (4.8) for the siblings alone. For a completely additive trait with ρ = 0.25, for two sibs the squared noncentrality with parents included is 15% larger than without parents. For five sibs it is only 7% larger.
156
H.-K. TANG AND D. S IEGMUND 5. S ELECTIVE GENOTYPING
In some cases it may be relatively easy and inexpensive to phenotype individuals. When the cost of phenotyping is indeed small compared to the cost of genotyping, it is possible to achieve an increase in power by genotyping only sib pairs with particularly favourable phenotypes. See, for example, Risch and Zhang (1995), who recommend using sib pairs with discordant phenotypes, and Xu et al. (1999) for an application. In this section we study Risch and Zhang’s suggestion. We begin by considering instead of α the simpler statistic suggested by Risch and Zhang, (ν − 1)/(N /2)1/2 ,
(5.1)
where the summation extends over all genotyped sib pairs. Also assume for simplicity that µ = 0, σY2 = 1. For a given sib pair, by Bayes’ formula one can easily show that E(ν − 1|D, S) is given by a fraction, the numerator of which equals P(ν = i)(i − 1)ϕ[S/(1 + ρi )1/2 ]ϕ[D/(1 − ρi )1/2 ]/(1 − ρi2 )1/2 while the denominator is a similar expression without the factor (i − 1). For small values of α0 , we see from the first term of a Taylor series expansion that E(ν − 1|D, S) ∼ α0 E 0 (ν − 1)2 [ρ/(1 − ρ 2 ) + S 2 /2(1 + ρ)2 − D 2 /2(1 − ρ)2 ].
(5.2)
It is evident from (5.2) that sib pairs with large values of D are particularly informative. If we genotype only those sib pairs whose phenotypes satisfy |D| > t, for each genotyped pair we have at the QTL an asymptotic noncentrality of 21/2 E(ν − 1||D| > t) ∼ −[α0 /2 · 21/2 (1 − ρ)]t ϕ(t )/{1 − (t )},
(5.3)
where t = t/(1 − ρ)1/2 . For a numerical example suppose that ρ = 0.25 (corresponding to a heritability of 0.50 for a purely additive trait) and t = 1.96. Then (5.3) equals −2.16α0 , while the noncentrality of a random sib pair is 0.78α0 . Hence only about one-eighth as many discordant sib pairs need be genotyped as random sib pairs. On the other hand, roughly 20 random sib pairs must be phenotyped to find one discordant sib pair. Other methods of selecting the sib pairs to be genotyped can be handled by modifications of the preceding argument. For the situation described above, the preferred definition of Risch and Zhang (1995) is about 82% as efficient as our suggestion. The case of concordant sib pairs, defined by max(Y1 , Y2 ) < −t or min(Y1 , Y2 ) > t, can be treated similarly. Now the value of t corresponding to the preceding examples is 1.19, and the asymptotic noncentrality is 1.49α0 . Instead of the ad hoc statistic (5.1), one might consider the score statistic (3.8). Now the unknown nuisance parameter ρ (and in general µ and σY also) must be estimated. This poses no problem if, as we assume, the sib pairs to be genotyped are selected from a random sample of sib pairs of known phenotypes, which are available to estimate the nuisance parameters. This will typically be a very large sample, ensuring that the nuisance parameters are estimated accurately. By some routine Taylor series expansions one sees that for purposes of asymptotic analysis one can regard them as known. Before proceeding, it is worth noting, however, that the situation would be quite different if the sib pairs are ascertained through their phenotypes, so the natural estimates of nuisance parameters are biased, perhaps severely. See, for example, Beaty and Liang (1987) for ascertainment corrections. An advantage of the score statistic over (5.1) is that it generalizes naturally to the case of larger sibships. For the score statistic (4.2), the arguments given above show that N E(α |Y1 , . . . , Y N ) ∼ (α0 /4)n=1 E 0 {[−tr( −1 Aν ) + Y −1 Aν −1 Y ]2 |Y }.
(5.4)
Mapping quantitative trait loci in oligogenic models
157
We define a discordant sibship to be one in which the squared norm of the (s − 1)-dimensional vector of orthogonal contrasts exceeds a threshold t. To facilitate the analysis of (5.4) we make the (Helmert) orthogonal transformation from Y = (Y1 , . . . , Ys ) defined by Z s = (Y1 + · · · + Ys )/s 1/2 = s 1/2 Y¯ , and for i = 1, . . . , s − 1 Z i = [(i + 1)i]−1/2 [ ij=1 Y j − iYi+1 ]. Under the probability P0 , Z 1 , . . . , Z s are independent and normally distributed with Var0 Z s = 1+(s −1)ρ, and Var0 Z i = 1−ρ, i = 1, . . . , s −1. The variables Z 1 , . . . , Z s−1 are the orthogonal contrasts in the Y , so our definition of discordant is that 2 Z 12 + · · · + Z s−1 > t. An expression for the expectation in (5.4) is given in (4.9). To find the asymptotic noncentrality 2 parameter, we evaluate the expectation of (4.9) given that Z 12 + · · · + Z s−1 > t. It is straightforward s k to compute each term, except possibly those involving i=1 Yi for k = 3, 4. Since our definition of discordance is symmetric in the Y , all terms in the sum have the same (conditional) expectation, and from the inverse of the Helmert transformation, we see that Ys = −(s − 1)1/2 Z s−1 /s 1/2 + Z s /s 1/2 . Hence these expectations are also readily evaluated, and we obtain the following expression: 2 E 0 {[−tr( −1 Aν ) + Y −1 Aν −1 Y ]2 | Z 12 + · · · + Z s−1 > t} 2 3 2 1 3(s − 1) 2t f s−1 (t ) (s − 3s + s + 3)ρ + s 2 − 3 2t f s−1 (t ) = 1 − + s(s + 1) Fs−1 (t ) Fs−1 (t ) (1 − ρ)2 s[1 + (s − 1)ρ](1 − ρ)2
+
s(s − 1){[1 + (s − 2)ρ]2 + ρ 2 } , [1 + (s − 1)ρ]2 (1 − ρ)2
where t = t/(1 − ρ), and f s , Fs are respectively the density and right tail distribution functions of a χs2 variable. Some examples of the efficiency gained by selective genotyping of the most discordant 10% of a sample of sibships of size s are given in Figure 2. For small s the gain in efficiency compared to genotyping random sibships is quite large; but as the size of the sibship increases the relative value of selective genotyping decreases while, as we saw in the preceding section, the unconditional value of the sibship increases. For example, a random sibship of size 4 is about as powerful as a selected sib pair from the most discordant 10% of the population. 6. ROBUSTNESS As we indicated above, by conditioning on the phenotypic values it is possible to make the statistics we have considered nonparametric with respect to the false-positive error rate. In this section we make a brief study of robustness of the power of these tests when a different model is assumed to be true—in particular for the model having a major gene with two alleles and a normally distributed residual. For simplicity we consider only sib pairs. The standardized ‘nonparametric’ version of the score statistic is α /[n Cn2 /2]1/2 . Since the expected value of the numerator, E(α ), is computed without making distributional assumptions, to evaluate the noncentrality parameter we need only evaluate E 0 C 2 , which under the assumption of normality equals (1 + ρ 2 )/(1 − ρ 2 )2 . Let = E 0 C 2 −(1+ρ 2 )/(1−ρ 2 )2 . Algebraic expressions for are straightforward, albeit somewhat complicated for the case of a single biallelic QTL and normally distributed residual e. Table 3 contains numerical examples of the percentage increase in E 0 C 2 , which is also the percentage increase in sample size that would be required to maintain the same power as determined for the assumed components of variance model. For the most part, the impact of using the components of variance when a biallelic major gene model is correct has a negligible effect on the power, but the effect can be substantial in the case of a rare recessive allele of large effect.
H.-K. TANG AND D. S IEGMUND
0
100
200
300
400
500
600
700
158
2
4
6
8
10
Fig. 2. Number of genotyped sibships for selected (proportion = 0.1) and unselected genotyping.
It is possible that a likelihood analysis of the correct model would produce a completely different and more efficient statistic. To simplify the notation we assume there is no dominance deviation. Let gi denote the indicator that the allele inherited from the ith parent is A1 . Then we can write Y = µ + a(g1 + g2 − 2 p) + e. Let p(g1 , g2 |ν) denote the conditional distribution of (g1 , g2 ) given ν. The likelihood function for a single pedigree is the mixture D − a(g1 − g2 )/21/2 p(g1 , g2 |ν) S − a(g1 + g2 − 2 p)/21/2 g1 ,g2 2 ϕ . (6.1) ϕ σe (1 − r 2 )1/2 σe (1 + r )1/2 σe (1 − r )1/2 If we take the first two terms of the Taylor series expansion of (6.1) about a = 0, we obtain after some calculation that (6.1) ϕ[S/σe (1 + r )1/2 ]ϕ[D/σe (1 − r )1/2 ] σe2 (1 − r 2 )1/2 2 2 D − σe (1 − r ) 1 S 2 − σe2 (1 + r ) 3 × 1 + α0 − (ν − 1)/2 + + (ν − 1)/2 , 2 2 σe4 (1 − r )2 σe4 (1 + r )2 ≈
(6.2)
where, as above, α0 = σ A2 /2 = pqa 2 . The efficient score for testing α0 = 0 is the logarithmic derivative of the likelihood function evaluated at α0 = 0. This is just the coefficient of α0 in (6.2). Any term not involving ν can be omitted, and unknown parameters must be estimated, so after summing over all sib pairs the statistic becomes
D 2 − σˆ e2 (1 − rˆ ) S 2 − σˆ e2 (1 + rˆ ) (ν − 1) − . (6.3) + 2σˆ e4 (1 − rˆ )2 2σˆe 4 (1 + rˆ )2
Mapping quantitative trait loci in oligogenic models
159
Table 3. Percentage increase in sample size for biallelic major gene. The column headed % change gives the percentage increase in sample size for a biallelic major gene (cf. (2.2)) relative to the assumed oligogenic model with the same variance components. The major gene contributes 25% of the trait variance; its additive effect is a, dominance deviation is d, and allele frequency is p; the sib correlation is ρ a 0.707
d 0
p 0.50
ρ 0.125 0.250
0.817
0
0.25
0.125 0.250
1.0 1.6
1.179
0
0.10
0.125 0.250
6.7 5.9
% change −0.3 0.6
0.577
−0.577
0.50
0.104 0.208
2.0 2.8
1.033
−1.033
0.25
0.088 0.175
20.6 17.7
2.513
−2.513
0.10
0.074 0.148
137 116
0.504
0.504
0.25
0.116 0.232
−0.3 1.2
0.637
0.637
0.10
0.122 0.243
4.3 4.9
This is exactly of the form of the score statistic for the components of variance model, except that σˆ e2 appears in place of σˆ 2Y and rˆ appears in place of ρˆ (cf. (3.5) ff.). However, the estimates for these parameters are calculated under the condition α0 = 0, and in spite of the difference in notation the parameters, hence the estimates, are the same under that hypothesis. Thus the score statistics for the twoallele model with normal residuals and for our components of variance model are the same statistic. This provides further evidence of the robustness of our components of variance test. 7. D ISCUSSION For an oligogenic model with normally distributed phenotypic data, we have introduced a parametrization that makes the ‘linkage’ parameters orthogonal to the ‘segregation’ parameters and hence allows us to compute explicitly score statistics, Fisher information matrices, and noncentrality parameters in a number of important special cases. We have evaluated the asymptotic noncentrality parameter for sibships of arbitrary size, which suggests what others have observed as a result of simulations, that a sibship of size s can be roughly
as powerful as 2s independent sib pairs. Our more precise analysis shows that this assessment is overly optimistic when s and α0 are large, but large sibships are, nevertheless, extremely valuable even in these cases.
160
H.-K. TANG AND D. S IEGMUND
We have evaluated the power of genotyping a selected subset of sibships defined by their phenotypes, which we select from a large random sample of sibships. The relative value of selective genotyping decreases rapidly as the sibship size increases. The Haseman–Elston regression statistic (Haseman and Elston, 1972) can be derived as a special case of the calculations of Section 3. One ignores S1 , . . . , S N and starts from the likelihood function for D1 , . . . , D N , then uses the robust version of the score statistic suggested in (3.12). Compared to the fully efficient likelihood analysis, for moderate phenotypic correlation (0.25) between sibs, Haseman–Elston regression is about 75% efficient when the mode of inheritance is additive. The modified Haseman–Elston statistic (Elston et al., 2000) is more (less) efficient than the classical for small (large) correlation between sibs. It also is about 75% efficient for moderate correlation. See also Teng (1996) and Wright (1997). In sibships the dominance component of variance contributes to the noncentrality of the score statistic designed to detect an additive component. Consequently, even if there is a large dominance component, but we model only the additive components, our loss of efficiency is usually relatively modest. Even for rare recessively acting alleles of relatively large effect, the loss rarely exceeds 10–20% of the sample size. Based on a conditioning argument, we have suggested a modified statistic, which is nonparametric under the hypothesis of no linkage and can be expected to be robust to moderate departures from normality. We have briefly discussed robustness against a true model involving a (biallelic) major gene. It appears that our model is robust if the allelic substitutions have small phenotypic effect, or modest effect but small dominance deviation. An interesting case deserving more careful attention involves a major gene having rare alleles of large effect. We expect to discuss gene–gene and gene–environment interactions in a future paper. ACKNOWLEDGEMENTS This research was partly supported by NIH Grant HG-00848. The authors thank two referees and the associate editor for their thoughtful suggestions. A PPENDIX : BETTER APPROXIMATIONS FOR SIBSHIPS OF SIZE s Because of the dependence among the νi j (which are pairwise independent), the null distribution of α is skewed when the number of siblings is s 3. To deal with a similar problem involving qualitative traits Tu and Siegmund (1999) suggested a p-value approximation that uses the third moment to correct for skewness. Let −β be the one-sided derivative at 0 of Cov(Z 0 , Z t ), γ be N 1/2 times the third moment of Z t under the hypothesis of no linkage, θ = [−1 + (1 + 2bγ /N 1/2 )1/2 ]/γ , and ν = ν[b(2β)1/2 ] the special function defined by Siegmund (1985, p. 82). The following is a slight modification of the approximation of Tu and Siegmund (1999): for a chromosome of length L with markers equally spaced at distance , P0 { max Z i > b} 0i
−1/2
≈ [2π(1 + γ θ )]
{1/θ N 1/2 + νβ Lb} exp[−N θ 2 (1 + 2γ θ/3)/2].
Substantial calculation shows that the value of γ equals
[3/2] 3s {[1 + (s − 2)ρ]3 + (3s − 10)ρ 3 + 3ρ 2 }
. {[ 2s /2][(1 + (s − 2)ρ)2 + ρ 2 ]}3/2 As a function of ρ this ratio is practically constant, so in evaluating the thresholds in Table 2 we have used the value for ρ = 0.
Mapping quantitative trait loci in oligogenic models
161
To determine the sample sizes in Table 2, we have used suitable versions of the power approximations provided by Feingold et al. (1993): P(max Z k > b) ≈ 1 − ((b − ξ )/σ0 ) 2ν ν2 , + ϕ((b − ξ )/σ0 ) − bσ0 /σ12 + (ξ − b)/σ0 2bσ0 /σ12 + (ξ − b)/σ0 where ν = ν[b(2β)1/2 /σ1 ]. This approximation, which is valid when there is a marker at the QTL τ , involves (i) the probability that the statistic Z τ is above the detection threshold b (or that the statistic exceeds b at one or both of the two markers flanking τ in the case that τ is not itself a marker) and (ii) the probability that the statistic is below the threshold at τ but Z t b at some nearby marker t. To implement this approximation we require the mean and variance (σ02 ) of Z τ and the conditional mean and variance (σ12 ) of Z t − Z τ given Z τ . See Tang (2000) for details. R EFERENCES A LMASY , L. AND B LANGERO , J. (1998). Multipoint quantitative-trait linkage analysis in general pedigrees. American Journal of Human Genetics 62, 1198–1211. A MOS , C. I. (1994). Robust variance-components approach for assessing genetic linkage in pedigrees. American Journal of Human Genetics 54, 535–543. B EATY , T. H. AND L IANG , K. Y. (1987). Robust inference for variance components models in families ascertained through probands: I. conditioning on the proband’s phenotype. Genetic Epidemiology 4, 203–210. B LACKWELDER , W. C. AND E LSTON , R. C. (1982). Power and robustness of sib-pair linkage tests and extension to larger sibships. Commun. Statist.- Theor. Meth. 11, 449–484. C OX , D. R. AND H INKLEY , D. V. (1974). Theoretical Statistics. London: Chapman and Hall. DARVASI , A., W EINREB , A., M INKE , V., W ELLER , J. I. AND S OLLER , M. (1993). Detecting marker-QTL linkage and estimating QTL gene effect and map location using a saturated genetic map. Genetics 134, 943–951. D UPUIS , J. AND S IEGMUND , D. (2000). Boundary crossing probabilities in linkage analysis. In Thomas Bruss, F. and Le Cam, L. (eds), Game Theory, Optimal Stopping, Probability and Statistics, Hayward, CA: Institute of Mathematical Statistics, pp. 141–152. D UPUIS , J. AND S IEGMUND , D. (1999). Statistical methods for mapping quantitative trait loci from a dense set of markers. Genetics 151, 373–386. D UPUIS , J., B ROWN , P. AND S IEGMUND , D. (1995). Statistical methods for linkage analysis of complex traits from high resolution maps of identity by descent. Genetics 140, 843–856. E AVES , L. AND M EYER , J. (1994). Locating human quantitative trait loci: guidelines for the selection of sibling pairs for genotyping. Behavior Genetics 24, 443–455. E LSTON , R., B UXBAUM , S., JACOBS , K. B. Epidemiology 19, 1–17.
AND
O LSON , J. M. (2000). Haseman and Elston revisited. Genetic
F EINGOLD , E., B ROWN , P. O. AND S IEGMUND , D. (1993). Gaussian models for genetic linkage analysis using complete high resolution maps of identity-by-descent. American Journal of Human Genetics 53, 234–251. F ISHER , R. A. (1918). The correlation of relatives on the assumption of Mendelian inheritance. Proc. Roy. Soc. Edinburgh. F ULKER , D. W. AND C HERNY , S. S. (1996). An improved multipoint sib pair analysis of quantitative traits. Behavior Genetics 26, 527–532.
162
H.-K. TANG AND D. S IEGMUND
F ULKER , D. W. AND C ARDON , L. R. (1994). A sib-pair approach to interval mapping of quantitative trait loci. American Journal of Human Genetics 54, 1092–1103. H ASEMAN , J. K. AND E LSTON , R. C. (1972). The investigation of linkage between a quantitative trait and a marker locus. Behavior Genetics 2, 3–19. K EMPTHORNE , O. (1955). Genetic Statistics. New York: Wiley. K RUGLYAK , L. AND L ANDER , E. S. (1995). Complete multipoint sib pair analysis of qualitative and quantitative traits. American Journal of Human Genetics 57, 439–454. L ANDER , E. S. AND K RUGLYAK , L. (1995). Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results. Nature Genetics 11, 241–247. L ANDER , E. S. AND S CHORK , N. J. (1994). Genetic dissection of complex traits. Science 265, 2037–2048. PAGE , G. P., A MOS , C. I. AND B OERWINKLE , E. (1998). The quantitative LOD score: test statistic and sample size for exclusion and linkage of quantitative traits in human sibships. American Journal of Human Genetics 62, 962– 968. R AO , C. R. (1973). Linear Statistical Inference and Its Applications, 2nd edn. New York: Wiley. R ISCH , N. AND Z HANG , H. P. (1995). Extreme discordant sib pairs for mapping quantitative trait loci in humans. Science 268, 1584–1589. S IEGMUND , D. (1985). Sequential Analysis: Tests and Confidence Intervals. New York: Springer. TANG , H.-K. (2000). Using variance components to map quantitative trait loci in humans, Ph.D. Thesis, Stanford University. T ENG , J. (1996). Statistical methods in linkage analysis, Ph.D. Thesis, Stanford University. T ENG , J. AND S IEGMUND , D. (1998). Multipoint linkage analysis using affected relative pairs and paritally informative makes. Biometrics 54, 1247–1265. T U , I-P ING AND S IEGMUND , D. (1999). The maximum of a function of a markov chain and application to linkage analysis. Advances in Applied Probability 31, 510–531. W ILLIAMS , J. T. AND B LANGERO , J. (1999). Power of variance component linkage analysis to detect quantitative trait loci. Annals of Human Genetics 63, 545–563. W RIGHT , F. (1997). The phenotypic difference discards sib-pair QTL linkage information. American Journal of Human Genetics 60, 740–742. X U , X., ROGUS , J. J., T ERWEDOW , H. A., YANG , J., WANT , Z., C HEN , C., N IU , T., WANT , B., X U , H., W EISS , S., S CHORK , N. J. AND FANG , Z. (1999). An extreme-sib-pair genome scan for genes regulating blood pressure. American Journal of Human Genetics 64, 1694–1701. [Received March 6, 2000; revised June 28, 2000; accepted for publication June 29, 2000]