This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!
^ with the W zero biased distribution in coordinates a, (3. Then for a three times differentiate test function h : TV —> R, Z a standard 0, as e -> 0, (3.16) where fe+{x) := sup| y |< £ /(:r + y), and / e _(z) :=mi\y\<ef{x + y). So, in particular, if T — 0, all densities qsi{x,w) are r + 1 times differentiable in x, and their derivatives are absolutely integrable, then |£r(/)| (x) is continuous in the points x = a,x = b (Mz+, /x) in the sense that fi = v o &*1. Recall that on (R z +, /x) we have EDiF(X) = E(X,l)F(X). We use the isometry where Dfe = ^lDi is the pullback of Dt by <£ and D^l(W) = (X, I). We need to calculate Dh and £)£1 on the path space PO(R) in terms of h and W. From (4.1) we see immediately that the pullback Dh = ^71D[ is given by + 1/2, Y =c UaY + (1 - U)aY' + tp(U)
10
Larry Goldstein and Gesine Reinert
normal vector in R p and Nh = Eh(Z),
lEhp-^W) - Nh\ < ^HE-^fllD^H £ \sa0\ E\\W - W^||. a,/3=l
Proof: Following Gotze (1991), and Barbour (1990) (see also Goldstein h Rinott (1996)), a solution / to the differential equation trSL>2/(W) _ w • V/(W) = /i(£- 1 / 2 W) - Nh exists and satisfies -j-^
/(w) < ^ I I S - ^ l l ^ l ^ / i l l ,
k = 1,2,3.
(3.1)
Applying Definition 1.2, |E/i(S- 1/2 W) - Nh\ = \E (tiED2f(W)
- W • V/(W)) |
= E J2 «a/»(/a./j(W) -
faAW*'0))
a,0=1
= E J2 sa0Vfa,0(U0) a, 0=1
• (W -
wa P
')
.
where £Qj/g is on the line segment between W and WQi/3. The proof is completed by applying the triangle inequality and the bound (3.1). • Given a collection of vectors X, = (-X^j,..., Xp^) € R p , i = 1,..., n let n
W = ^ Xj,
and set Oji%i = Cov(Xjti, XiA).
i=l
In order the generalize Proposition 2.1 to higher dimension, we make the following Definition 3.2: The collection of vectors X* = ( X M , . . . ,Xp
11
Zero biasing in one and higher dimensions
Example 3.4: If the components of each vector X, € W,i = l , . . . , n are obtained by a simple random sample of size p from a population with characteristics Ai, \Ai\ = Ni > p, satisfying
a€Ai
then
^ =^ E «
2
and
°M=Ni{N^i)
E«2
fi»jv.
and hence the collection (Ki)i=it...in has constant sign covariance. When a collection of vectors (Xj)j=i n has constant sign covariance then for every j , I = 1,... ,p we may define a probability distribution on indices i = 1,..., n by n
P ( p ' = i ) = f2Li>
With
Sjl=Vffji]1,
(3.2)
and in addition, if for i ^ i' we have Cov(Xj, Xj/) = 0, then Sjj = Cov(Wj, Wi)
where
n W"j = ^ X j t i . i=l
Theorem 3.5: Let (X.i)i=i,...,n be a collection of mean zero constant sign covariance vectors, independent over i, and for each i = 1,..., n and j , I = 1,..., p suppose that X^ has the Xj zero biased distribution in coordinates j , I and is independent ofX.j for j ^ i. Then with Pl having distribution (3.2) and independent of all other variables
VF' = W - X/i. + Xf3l = X.% + E
X
*
has the W zero biased distribution in coordinates jl. In particular, when W has positive definite covariance matrix S = (SJI), then for any three times differentiate test function h : W —* R, and Nh = Eh(Z) for a standard normal vector Z e Rp,
\Eh(s-^2W) - Nh\ < ^Hir1/2!!3!!!?3/*!! ]T | S j 7 |S||x p l - x ^ | | .
12
Larry Goldstein and Gesine Reinert
Proof: It suffices to consider a smooth function / with compact support, for which
£f^(W)=£f^I,,/ J (W)=^fl J ,/ i (X, + ^X() j=l
j=l i = l
i = l j=l
trfti
= ^EJ^xufiVk + j > ) = j^E I £f>,,/,,(x*+x>) 1
= E £ E>* E p{-pl = o/iKxj'+E x*) = E E E ^/i»(wJI). The second assertion follows directly from Theorem 3.1, completing the proof. • 4. Construction using exchangeable pairs Theorem 4.1 demonstrates how to construct the zero biased vectors X^' € R p , j , I = 1,... ,p for a mean zero vector X G R p with components satisfying conditions similar to those imposed to prove Proposition 2.1 and Theorem 2.1 in Goldstein & Reinert (1997); we note (X'pX'j) in (4.1) is an embedding of a univariate exchangeable pair in a multidimensional vector. Theorem 4.1: Let X' = (X[,X2, • • • ,Xp) be a vector of mean zero finite variance variables with Var(X'j) = cr2 > 0 and EX'jX't = K for j ^ /, cr2 > K. Assume that for every j there exists X" such that Xj = (X'1,...,X'j,X",...,X'p)
= d(X'1,...,X",X'j,...,X'p);
(4.1)
let dF- "(x) denote the distribution ofX'/. By (4.1) for all j ^ I, (4.2)
E{X'!X[) = K;
assume that (4.2) holds for j = I as well. In addition, assume that, for some A, v (4.3) E(X'!\X!) = \Y' where Y' = ^ X'm. m=l
By the foregoing, the positive quantity v2 = E(X'j - X'J)2 = 2(a2 - K)
(4.4)
Zero biasing in one and higher dimensions
13
does not depend on j , and we may consider thep+1 vector Xj = (X1,..., Xj, Xj , . . . , Xp) with distribution dFj
(x) =
2 (X'-X") 3 3> v2
,„ dFj (X1,...,X'j,X?,...Xp),
(4.5)
and X^ and X" the p-vectors obtained by removing X" and Xj from X^- , respectively. Then, with Uj a uniform U(Q, 1) variable independent of {X^, X"}, and X" =
UJX'J
+ (1 - Uj)X'!,
(4.6)
with Vji Bernoulli random variables P(Vji = 1) = P(Vji = 0) = 1/2 independent of { X " , X " } , the collection X* = VjtX» + (l-Vji)Xu,
j,l =
l,...,p
has the X zero bias distribution in coordinates j , I. Since 0 < Var(X^ + Y') = (p + 1)(CT2 + pn), it follows that pn > -cr2 and hence that Var(F') = p(
14
Larry Goldstein and Gesine Reinert
Using (4.5) and (4.6), with / any smooth function and, suppressing j , letting X' and X" denote the vector obtained by removing X" and X'j from Xj" respectively, Efj(X») = EfjiUjXr + (1 - Uj)X'!)
I =
X'i-X'i'
)
±E{X'j-X'/){f(X')-f{X"))
= ^E(X>f(X')-X>'f(X>)) = ^E (X^(X') - XY'f(X'))
by (4.3).
Now, taking expectation over Vji and noting that fji = fij, we have
j=l
1=1
= a2J2 EfviX") + ^J2 E { ^ « ( X " ) + £/*(X")}
+ SEE £; (^(x')-Ay7/(x')) r> 2
P
= -^-E £ (^( x ')-^7i(x')) n
P
E
(
P
X
P
X
y
+ ^ Y, { E^/'( ') - 'M*) - AE '//(X') + AF'/,(X') U=l
3=1
= 2^TAilE
1=1
(XftW-XY'MX'))
3=1 Okr
P
+ ^E
E
3=1
( P
P
}
£*;/<(X')-A£^(X') . U =l
(=1
J
)
J
Zero biasing in one and higher dimensions
Employing (4.4) for the first term, and letting div/(z) = J2i fi(x)i expression can be written p
15 tnis
„
J^EX'jfjiX') - XEY'dWf(X') + ^E(Y'dWf(X') - XpY'divf(X')) 3= 1
= ±EX'MX') + (-^ 2
+ 2 (1 Ap)
; "
) £Y'div/(X')
= £^;/,-(x'), since by (4.4) and (4.7), -XV2+2K(1-\P)
= 2 (-A(CT2 -K) + K(1 - Ap)) = 2 ( « - A ( < 7 2 + ( P - 1 ) K ) ) = 0.
This completes the proof.
•
Example 4.3: Independent Variables. It is not difficult to see directly from Definition 1.2 that a vector X of independent random variables can be zero biased in coordinate j by replacing Xj by a variable XJ having the Xj zero biased distribution, independent of Xi,l ^ j ; this construction is equivalent to the special case of Theorem 4.1 when taking X" in (4.1) to be an independent copy of Xj. In particular, in this case
iix-x«ii = i^--*;i. Prom this observation and calculations parallel to those in Proposition 2.2 we obtain the following corollary of Theorem 3.5. Corollary 4.4: Let (Xi)i=i...ifl be an independent collection of mean zero random vectors in R p whose coordinates Xj:i,j = 1,... ,p are independent with variance Ujj^ and finite third absolute moments. Then for n
n
x
w = J2 i
and
s
jj = Y, *»>"
and any three times differentiate test function h, lEhCL-WW) - Nh\
< ^(minpSjj)-^\\D3h\\ ±±E fajX^ + ^ | 3 ) ,
16
Larry Goldstein and Gesine Reinert
and when Xjti, i = 1,..., n are identically distributed with variance 1 for allj = l,...,p,
lEhp-WW) - Nh\ < ^||£> 3 /i|| J2E\XjA?In our next example we consider vectors having non-independent components. Example 4.5: Simple Random Sampling. Let X' g R p be a vector whose values are obtained by taking a simple random sample of size p from a population having characteristics A, \A\ = N > p, with J2aeA a ~ ®- Taking one additional observation X" we form an enlarged vector that satisfies (4.1). In the notation of Theorem 4.1, (4.8)
and (4.2) is satisfied with K the value (4.8), and
WI X ') = W±-PY'^ so (4.3) is satisfied with A = —l/(N — p). Hence the hypotheses of Theorem 4.1 hold, and the construction so given can be used to produce a collection with the X' zero biased distribution. We now arrive at Theorem 4.6: Let (Xj)i=i...in be an independent collection of mean zero random vectors in R p obtained by simple random sampling as in Example 3.4. Suppose \a\ < M for all a e A%, i'• = 1,..., n, and let n
W = ^ X,
and sjt = Cov(Wj ,W{).
»=i
Then for a three times differentiable test function h,
\EhCs-^2w)-Nh\ < ^ M / I I S - 1 / 2 ! ! 3 ! ! ^ ! ! E l^'l3,1=1
Proof: As shown in Example 3.4, the collection of vectors have constant sign covariance, and hence Theorem 3.5 applies. Using the construction given Theorem 4.1 and the bound from Remark 4.2 we have ||X7j, - X * | | < 2 M ,
Zero biasing in one and higher dimensions
and the conclusion follows.
17
•
In typical situations, £ and Sji will have order n, resulting in a bound of the correct order, n~ 1//2 . Acknowledgement: The authors would like to thank the Institute for Mathematical Sciences for their support during the program 'Stein's method and applications: A program in honor of Charles Stein', where this work was partially completed. GR was also supported in part by EPSRC grant no. GR/R52183/01. We also wish to thank a referee for very helpful remarks. References 1. A. D. BARBOUR (1990) Stein's method for diffusion approximations. Probab. Theory Rel. Fields 84 297-322. 2. T. CACOULLOS (1982) On upper and lower bounds for the variance of a function of a random variable. Ann. Appl. Probab. 10, 799-809. 3. T. CACOULLOS & V. PAPATHANASIOU (1992) Lower variance bounds and a new proof of the central limit theorem. J. Multiv. Analysis 43, 173-184. 4. T. CACOULLOS, V. PAPATHANASIOU & S. UTEV (1994) Variational inequalities with examples and an application to the central limit theorem. Ann. Probab. 22, 1607-1618. 5. L. H. Y. CHEN & Q.-M. SHAO (2004) Stein's method and normal approx-
6. 7. 8. 9. 10. 11. 12.
imation. In: An introduction to Stein's method, Institute for Mathematical Sciences Lecture Notes Series No. 4, ch. 1. World Scientific Press, Singapore. F. GOTZE (1991) On the rate of convergence in the multivariate CLT. Ann. Probab. 19, 724-739. L. GOLDSTEIN (2004) Normal approximation for hierarchical structures. Ann. Appl. Probab. 14, 1950-1969. L. GOLDSTEIN (2005) Berry Esseen bounds for combinatorial central limit theorems and pattern occurrences, using zero and size biasing. J. Appl. Probab. 42 (to appear). L. GOLDSTEIN & G. REINERT (1997) Stein's Method and the Zero Bias Transformation with Application to Simple Random Sampling. Ann. Appl. Probab. 7, 935-952. L. GOLDSTEIN & G. REINERT (2004) Distributional transformations, orthogonal polynomials, and Stein characterizations. J. Theoret. Probab. (to appear). L. GOLDSTEIN & Y. RINOTT (1996) On multivariate normal approximations by Stein's method and size bias couplings. J. Appl. Probab. 33, 1-17. N. PAPADATOS & V. PAPATHANASIOU (2001) Unified variance bounds and a Stein-type identity. In Probability and Statistical Models with Applications, eds. Ch. A. Charalambides, M. V. Koutras & N. Balakrishnan, pp. 87-100. Chapman and Hall/CRC, New York.
18
Larry Goldstein and Gesine Reinert
13. M. RAIC (2003) Normal approximations by Stein's method. In Proceed-
ings of the Seventh Young Statisticians Meeting, ed. A. Mvrar, pp. 71-97. Metodoloski zveski, 21, Ljubljana, FDV. 14. C. STEIN (1972) A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. Proc. Sixth Berkeley Symp. Math. Statist. Prob. 2 583-602. Univ. California Press, Berkeley. 15. C. STEIN (1986) Approximate Computation of Expectations. IMS, Hayward, CA.
Poisson limit theorems for the appearances of attributes
Ourania Chryssaphinou, Stavros Papastavridis and Eutichia Vaggelatou Department of Mathematics, University of Athens Panepistimioupolis, GR-157 84, Athens, Greece E-mail: [email protected], [email protected], [email protected] We examine the number of appearances of certain types of events in a sequence of independent (but not necessarily identical) trials. We introduce a concept of Renewal which, in this context, in certain ways generalizes Feller's Theory. Under quite general conditions, Poisson limit theorems are obtained, which extend known results. Key tools in that are Stein-Chen method and probabilistic arguments.
1. Introduction In the past, many authors have studied various questions around the following general situations: we have a sequence of independent trials (usually identical), each trial producing a letter from certain alphabets. We have a set of words from these alphabets, under consideration, and we examine the number of appearances of these words etc. In this paper, we consider a general setup for the case of independent and identical experiments. We study attributes in the sense given by W. Feller (1968, p.308) and under quite general conditions we present limit theorems to the Poisson distribution. This work is organized as follows: In Section 2 we give the basic definitions and notations with examples concerning attributes and the related random variables under consideration. In Section 3 we examine the case of renewal appearances of one attribute and we present a Poisson limit theorem (Theorem 3.1) for the number of such appearances. Finally, in Section 4, applications of the above results are given. 19
20
O. Chryssaphinou, S. Papastavridis and E. Vaggelatou
2. Preliminaries and notation Let us consider the triangular array of probability spaces
and (f2n,fn,Pn),
n>\
be the product probability space i.e., fin = ^n,i x . . . x nn,n, Fn = Fn,l ® • • • ® ?n,n *n
-*n,l X . . . X / n , n «
~
Let also ln be a positive integer such that ln < n . We consider the event En,i i
En,iQ
(f2nJ x . . . x
\J
flnii)
(2.1)
i=max(l, i-i n + l)
and we define Sn = (EnA,...,En,n),
(2.2)
which we call the £n attribute. As W.Feller wrote, a£n itself is a label rather than an event". Attributes may be runs of symbols (events, patterns, words etc.) or sequences of numbers. In what follows we consider independent and identical trials. The model studied by Guibas & Odlyzko, ("String Overlaps, Pattern Matching and Nontransitive Games", (1981)), is the case where we have totally i.i.d. trials, i.e. i? nii = Q and Pn
J~m
*n)i
21
Poisson limit theorems for the appearances of attributes
where Qn = Qn = n x ... x Q,
Fn = T®...®T,
Pn = P x ... x P.
We consider a subset
Enc\J n\ i=\
It is En,i = En and the attribute is simplified to <S"n =
(En, • • •
,En).
Example 1: To illustrate the above definitions we consider the very simple case of a sample space Q = {S, F}. It is Qni = a = {S, F}, for i = 1,..., n, so J7n = Qn. Let En = En,i = {SSS...SSS} U {FFF...FFF} U
ln-\
c a1--1 u a1" cuj';^. Then the attribute of interest is £n =
(En,...,En).
We introduce the following events: El. E*i = {(wi, •.. ,wn) G On : 3 j > 1, i - ln + 1 < j < i such that (WJ, ... ,Wj) e £•„,,} € .Fn,
(2.3)
We call the event E^t "the appearance of the attribute £n = (En,...,En) at the i-th trial".
22
O. Chryssaphinou, S. Papastavridis and E. Vaggelatou
E2. Next we introduce the event E^t G Tn as follows B
n,l — ^n.l
En,i = {(wi, • • •, wn) € ttn : 3 j > 1, i - ln + 1 < j < i such that (ujj,... ,u>i) € £„,, i-l
and(wi,...,wn)^UE^}'*^2-
(2'4)
We call the event E^ "the appearance of the attribute £n = (En,.. .,En) as a Renewal at the i-th trial". Finally we define the following event, which is very useful in proving our results. E3. Let the event E^^ € Tn be defined as follows:
^=^f)(
U
j=max(i, i-in+i)
^.) c ,fort>2.
(2.5)
We call the event E^i "the appearance of the attribute Sn = (En,... ,En) as a Quasi-Renewal at the i-th trial". Example 2 (continuation of Ex.1): Let us assume that ln > 3 and consider the sequence {F&F&SS.SSS SFSFFF /n-l
.FFFFFSFSS}. t,
It is clear that the sequence of our example belongs to rpA
r?A
&2ln + n,ln+2' ^2ln + U, 2ln+5
^21^ + 11, /rl+3> ^21 n + ll, 2(n+6>
^ 2 / n + ll, 2/ n +7-
23
Poisson limit theorems for the appearances of attributes
For example we have —D3
•pA
£/
2in+n, in+2 ~
i i
x{{S}x{S}x{S}...{S}x{S}x{S})
x nl-+9 u n2 x{{F}x{F}x{F}...{F}x{F}x{F}) In
and rpA _Oin+7 2ln + ll, 2ln+7 —il
a
x {{F} x {F} x {F} ... {F} x {F} x {F})
x n4 u oln+8 x {{5} x {5} x {S}... {S} x {S} x {S}} x /?4 i n -i
In our example we have that the sequence belongs to ^2ln + U,ln+2
a n
^2ln + ll,ln+2
an<
d
^2ln + ll,2ln+5
and to ^ ^2ln + ll,2ln+5-
Example 3 (continuation of Ex.1): To clarify the difference between the defined events I, II and III somewhat more we consider the case of i.i.d. with Q = {S, F}, 20 trials and let the outcome {FSFSSFFSFFSFSFFSFSFS}. Supposing that we are interested in En:i = {SFS} U {SFFS} CQ3uf24C 3
\JA=lQ\
4
Then En%i = En, so the attribute is the sequence £n = (En,..., Then we have that the above considered sequence belongs to T?A
T?A
TTIA
rpA
T?A
T?A
jpA
•^20,4) -^20,8) ^20,11) ^ 2 0 , 1 3 ' -^20,16' -^20,18' -^20,20)
En).
24
O. Chryssaphinou, S. Papastavridis and E. Vaggelatou
to frR JTlfl JpR T?R •^20,4 i -^20,8 ' -^20,13 ) - ^ O . I S >
since R
R
R
R
{F 'SFS 'SFFS FF SFS FF 'SFS FS} and to -^20,4 >
^20,8'
since Q
Q
{FSFSSFFSFFSFSFFSFSFS}. 3. One attribute We define the ^-measurable function Rn : Qn —* R, which counts the number of appearances of the £n as Renewal, i.e. if {w\,...,ujn) € fin, then the value Rn(u>i,..., u)n) is the cardinality of the set {i:li,..., ujn) is the cardinality of the set {i:ln) is the cardinality of the set {i:l
In our last example we have that An = 7, Rn - 4, while Qn = 2. Notice that since E^ C E^ C E^t for alH = 1,2,..., n then Qn
Poisson limit theorems for the appearances of attributes
25
A. It is
limJnP(£tf,J = 0. n
'
B. If condition A is valid, then E(An) =
nP(E*ln)
is bounded. We notice that the above conditions are quite natural and well recognizable, as they appear in many known results. The main result of this paragraph concerns the limiting distribution of the random variable Rn and it is as follows: Theorem 3.1: Let a real number A be such that lim E(Rn) = X. n
If the conditions A and B are valid, then Rn Z Po(A) where Po (/i) denotes the Poisson distribution with parameter /i. Tie rate of convergence is
y
j=max(l, i-ln
+ l)
)
with respect to the Total Variation Distance. The proof is based on the following lemma: Lemma 3.2: Under conditions A and B, it holds true that a) limn E(Rn - Qn) = 0 ^Qn)=0. b) limn P(Rn
26
O. Chryssaphinou, S. Papastavridis and E. Vaggelatou
Proof: a) We have the following E(Rn - Qn)
= EP {E*>n ( O ) = EP\ E n,n [ i=l
i=l
n
= E^(^ f i=l
[
i=l
(
<) ) JJ
C
U «,) W 0
\j=max(l, i-ln + l)
{
0
\i=max(l, i-ln + l)
\j=max(l, i-i n + l)
/
\j=max(l, i-«n + l)
)
\j=max(l, i-ln + l)
[«,)cn^]|+o.
[j
= f^plE^n i=l
[
j=max(l, i-ln + l)
^)1 / J
/
)
(3.1)
J
Moreover, we obtain that i-l
i-l
[(E^) c n^.]c
U
j=max(l, i-ln + l)
U
£«( (3.2)
t=max(l, i-2ln+2)
because the appearance of £n at the j-th experiment is not renewed, so there must be a preceding renewed appearance at a t-th experiment with j - ln +1 < t < j -1, which together with the inequality i-ln + l < j < i-l gives i - 2ln + 2 < t < i - 1). Therefore, combining (3.1) with (3.2), we obtain that
£CRn-Qn)<]T
£
P(£&n££ t ).
i=l t=max(l, i-2ln+2)
We distinguish two cases: i)Ifie{i-2In + 2)...,t-Zn}thenP(£;*in£;«t)
£(i^"Qn)<E
E
i=l t=max(l, i-2ln+2)
< 2E(An) sup J
P{E*JP(E*t)
J]
^ ^ " [t=max(l,i-In + l)
P
«t) \ J
Poisson limit theorems for the appearances of attributes
27
which tends to zero under the assumptions A and B. b) Let I[Rn > Qn] be the indicator r.v. of the event [Rn > Qn}. Recall that Rn>Qn. We then have P(Rn * Qn) = P(Rn > Qn) = £(J[fln > Qn\) < E{Rn - Qn) because the r.v. i?ra — Qn takes values in {0,1, 2,...} with probability 1 and I[Rn > Qn] < Rn — Qn- Applying the result of a) the proof is complete. • Proof of Theorem 3.1: Our aim is to prove that \imdTV{C-{Rn), n
Po(E(Rn)))=0.
Usually, the numerical evaluation of E(Rn) is quite difficult, while the E(Qn) can be evaluated more easily. Thus, taking advantage of the triangular inequality dTv(£(Rn),
Po(£(;?„)))
(3.3)
and of Lemma 3.2, it is enough to prove that limdTV(C(Qn),
Po(E(Qn)))=0.
dTV(C(Rn),C(Qn))
< P(Rn ± Qn),
n
It is which by Lemma 3.2 b) gives us (£(Rn),C(Qn)) limd n TV
=0
under the conditions A and B. Besides, it is known that dTV(Po(E(Qn)),
Po(E(Rn))) < E(Rn - Qn)
which also vanishes to zero by virtue of a) of Lemma 3.2. In order to prove that limn dTv(£(Qn), Po(E(Qn))) = 0, we shall apply the Stein-Chen method (Arratia et al. (1989, 1990), Barbour et al. (1992)). For that purpose we consider the index set /„ = {1,2, ...,n} and Xn^ to be the indicator r.v. of the event E^{. We also define the so called "neighborhood" Bn^ of the index % as follows BnA:={jeIn:\i-j\<2(ln-l)}. The Stein-Chen method gives dTV(C(Qn),
Po(E(Qn)))
(3.4)
28
O. Chryssaphinou, S. Papastavridis and E. Vaggelatou
where 6
*= E i€In
h=E
E E(Xnii)E(XnJ) j€B n , 4
E
E(XniiXntj)
b3 = ^ £;|S{X nii - £(X n>i )|| a(X n i i : j ^ Bn,i)} I •
For j ^ JBn,i the r.v.'s X nii and Xnj are independent and therefore 63 = 0.
(3.5)
The terms b\ and 62 depend upon n. In order to complete the proof we will show that both quantities tend to zero as n —> +00. The term b\. It is n
6
1= E »6/»
E ^(Xn,i)£7(Xnii) = £ j6B n ,i
J2
P(En,i)P(Enj)
i=l j=max(l, i-2I n +2)
< 4 E P < ) SUp I i=1
min(n, i+2ln-2)
P E
J2
( nM'
l
J
so that
bi < 4E(Qn) sup i
^
P(E^) I
^ ^ " lj=max(l, i-I» + l)
= O[ sup J
53
which vanishes to zero under our assumptions.
P«A),
J
(3-6)
29
Poisson limit theorems for the appearances of attributes
The term 62- It is min(n, i+2ln-2)
n
b2 = Z ieln n
=E
E
i=1
i-ln
=E j=l
n min(n, i+2ln-2)
E
^n ,n^) + E
E
E
P(En,i)P(E^) + Yl
E
»=1 j=max(l, i-2ln+2) n min(n, j+2ln-2)
E
P(ElnE^)
j=max(l, i-2ln+2)
Q
i=l j=max(l, i-2ln+2) n i-ln
<E
E
E(Xn,iXnJ) = Y:
j £ Bn,i
P(^,^^J)
i=l 3= i+U n min(n, i+2ln-2) P E
i=l j=i+ln n min(n, i+2ln-2)
P(E^)P(E^) + Y^
1= j+ln
i=l
E
j=i + ln
( n,i)P(Enj)
nE^mKi)-
Applying condition A we finally get
b2 < 2E(Qn) sup J Ki
= O [ sup J
p £
E
I .
,A . , , ,,
( nj) \ I
^j=max(l, i-ln + l)
E
)
P E
( nj) I I -
(3-7)
which also goes to zero as n tends to infinity. Relation (3.4) together with (3.5), (3.6) and (3.7) complete the proof of Theorem 3.1. • An analogue to the above theorem is the following: Theorem 3.3: Assume that conditions A and B are satisfied. If X is a real number such that limn E{Qn) = A, then Rn % Po(A) at rate 0
U ( IS Ki
\
E
. ~~.i-l , +, ,,l) j=max(l, n
with respect to the Total Variation distance.
P(EnM/
I
30
O. Chryssaphinou, S. Papastavridis and E. Vaggelatou
The proof of Theorem 3.3 is analogous to the proof of Theorem 3.1, except for one step: now one has to deal with the triangular inequality dTV(C(Rn),Po(E(Qn)))
< dTv(C(Rn),C(Qn)) + dTv(C{Qn),Po(E(Qn))).
Generalizing the case of one attribute we can consider a finite set of attributes and to derive analogous to the above results as well as new ones. That work is under preparation and its applications are interesting. In the next section we give some applications concerning the case of one attribute. 4. Applications 4.1. Runs of symbols Although the presented results concern the case of independent and identical trials we are going to give a simple application supposing non-identical trials, since the results are valid and their proofs are quite similar to the previous ones. We believe that clarifies more the definitions given by relations (2.1) and (2.2). Suppose that the sample spaces Qn^ are discrete and choose an a € n n >i (~l"=1 Qn,i- Moreover, take En,i :={a,...,a}C
Qnti_ln+l
x . . . x f2Uii
for all i = 1,2,..., n. Then the attribute £n = (En
p(PAs_f°, and P(E^) = (1 -
{
P^l7i(a))P(E^)
0,
i < ln
(1 - Pi-U(a)) UU-in+i Pj(*), i > ln. Clearly, the expected value E(Qn) is given by
E(Qn) = J2P(En,i) = UPi^)+ E ( i - ^ - ^ ) ) «=1
.7 = 1
i=ln + l
j=i-ln+l
II
P
»>
which is close to E{Rn) according to Lemma 3.2. Note that in the case of i.i.d. trials it is known that
31
Poisson limit theorems for the appearances of attributes
Supposing that lim/n = oo, n
and denning
\imE(Qn) n
=X
p := sup sup Pi(tj) < 1 the conditions A, B become \ixnlnpln = 0 and npl" = O(l). n
Thus Theorem 3.1 yields that the random variable i?n under conditions A, B converges in distribution to a Poisson random variable with parameter A at rate O(lnPl") with respect to the Total Variation distance. The i.i.d. case is due to Von Mises (cf. Feller (1968), p. 341) who proves convergence to the Normal distribution. His proof is based on the knowledge of a double generating function, a tool which is not available in the non identical case. 4.2. Increasing and decreasing sequences of real numbers Let Xi, X2, • • •, Xn be independent random variables defined on the same probability space and having a continuous distribution. In order to avoid cumbersome notation, the symbol P will refer to whatever probability space we are working on. The attribute £n, we are interested in, may refer to an increasing or to a decreasing sequence which consists of ln real numbers produced under the above scheme. When such a sequence occurs we say that the attribute £n appears. We study the number of renewal appearances of £n. 4.2.1. Increasing sequences of real numbers Set En,i := {(ai,a2,
-,ain)
: ax < a2 < ••• < atn} £ K ; "
and the attribute £n corresponds to an increasing sequence of length ln. Then E
n,i = {(Wl,W 2 ,-,Wn) € K" : (Wi-ln + l, ^-ln+2, = {(wi,o> 2 , ••-,<*>„) e R™ : u)i-in+i
• • • , "i) €
< u>i-in+2 < . . . < u>i}
EnA}
32
O. Chryssaphinou, S. Papastavridis and E. Vaggelatou
and hence,
P(E^) = P(Xi_u+1 < Xi-w < ... < X,) = I °l ) < J" In this example, we may easily calculate the mean E(Rn) following the same procedure as the one of Blom & Thorburn (1982). We derive the probability generating function g(t) of the waiting time of the appearance of the attribute, which is given by
From the above generating function we obtain the mean recurrence time fi /l = w
(
1 +
z ^ i
+
-
+
aB + i ) . . 1 ( 2 i B - i ) )
which combined with relation (6.8) of Feller (1968), p. 321, gives approximately E(Rn)
« — — ^. n 1+ ' ' V £ + 1 + • • • + (in + l)....(2/ n -l)J
We assume that 1. lim n ln = oo 2. limnE(Rn) = X. Under the above hypothesis we easily prove that conditions A, B are satisfied and therefore by Theorem 3.1 we conclude that the random variable Rn converges in distribution to Po(A) at rate
with respect to the Total Variation distance. Besides, someone may use the mean E(Qn), instead of E(Rn). former is given by
*«w-q+ <-«.) (cri)i because
P(EQ)-$°'
i
The
33
Poisson limit theorems for the appearances of attributes
and P{E%) = P((Xi-ln
> Xi-ln+1) n (Xi_, n+ i < ... < Xt))
ln
i >> I (in + 1)!' "' Then by applying Theorem 3.3, we come to a similar conclusion. Due to symmetry, the previous results also hold true for the case of decreasing sequences of the above type. We note that Pittel (1981) proved a similar result for the number of decreasing sequences with length larger or equal to ln using the method of moments. Theorems 3.1 and 3.3 give us the rate of convergence, as well as the possibility to study some more relevant aspects as the next problem. -
4.2.2. Increasing or decreasing sequences of real numbers The basic model is the same and let us consider the attribute £n which corresponds to both an increasing and decreasing sequence of length ln. We are interested in the number of renewal appearances of either an increasing or a decreasing sequence of length ln. For that purpose we define En,i
: = { ( « i , a 2 , •••,ain) /- Tain t IK
• a\ < a2 < • •• < ain o r a\ > o 2 > ... >
ain)
and therefore,
o r uJi-i,,+i
> Ui^in+2
> ...
>
UJi}.
Obviously P(E^)
= P{Xi_ln+1
<...<Xi)
+ P(Xi-ln+1
>...>Xi)
=
^-
for i > ln.
The calculation of E(Rn) is not possible, but we can estimate the mean E(Qn). The event E®'i occurs whenever an increasing or decreasing sequence of length /„ takes place at the i-th trial and no other suchlike sequence has occurred at the ln - 1 former trials. We have P(EQ
)_(
°.
i < l
n
34
O. Chryssaphinou, S. Papastavridis and E. Vaggelatou
Furthermore, for /„ + 1 < i < 2ln - 2 we have that P{E%)
=P[{Xi.in + P[(X^ln
>Xi-ln+l)n(Xi-ln+1 < Xi.ln+1)
<...<Xi)\ n (*i-«n+1
>...>Xi)]
2ln
~(Zn + l)!' Finally for i > 2ln — 1 we get that = p[(x i _ 2 I n + 2 > • • • > Xi-,JC n (Xi_«n > x,_ / n + 1 ) n (X,_ (n+1 < ... < Xi)] + P[(Xt-2in+2 < < Xi-in)c n (Xi_jn < Xi_, B+ i) n ( x ^ ^ + i > > Xi)] = 2P{(Xi-2ln+2
>•••
> . X i _ « J c n ( X i _ , n > X t _ , n + 1 ) n (Xi-ln+1
<...<
_ 2/w _ ^l ;„_! ) " ( Z n + 1)! (2Jn-l)! 2/» 2 (Zn + 1)!
(2/n-l)[(/n-l)!]2-
Using the above result we obtain that E{Qn)
+
2
2ln{ln-2)
-Ji.
(in + iy.
+2
<"-a- + 2>{^TTji-(a,-i)[k-w}-
Supposing that 1. limn ln = oo 2. limn E(Q n ) = A it can be verified that conditions A, B are satisfied, thus by a straightforward application of Theorem 3.3 we deduce that the random variable Rn converges in distribution to Po(A) with rate of order
The above result has been proved by Wolfowitz (1944) using the method of moments. Later, David & Barton (1962) proved an analogous result concerning sequences with length greater or equal to ln. In both cases the rate of convergence was not obtained. Acknowledgement: This work was fully supported by the Research Account of the University of Athens, Greece.
X^}
Poisson limit theorems for the appearances of attributes
35
References 1. D. J. ALDOUS (1989) Probability approximations via the Poisson clumping heuristic. Springer, New York. 2. R. ARRATIA , L. GOLDSTEIN & L. GORDON (1989) Two moments suffice for
Poisson Approximations: The Chen-Stein method. Ann. Probab. 17, 9-25. 3. R. ARRATIA, L. GOLDSTEIN & L. GORDON (1990) Poisson Approximation
and the Chen-Stein method. Statist. Science 5, 403-434. 4. A. D. BARBOUR, L. HOLST & S. JANSON (1992) Poisson Approximation.
Oxford University Press. 5. G. BLOM & D. THORBURN (1982) How many random digits are required until given sequences are obtained? J. Appl. Probab. 19, 518-531. 6. L. H. Y. CHEN (1975) Poisson approximation for dependent trials. Ann. Probab. 3, 534-545. 7. O. CHRYSSAPHINOU & S. PAPASTAVRIDIS (1988) A limit theorem for the number of nonoverlapping occurrences of a pattern in a sequence of independent trials. J. Appl. Probab. 25, 428-431. 8. O. CHRYSSAPHINOU, S. PAPASTAVRIDIS & E. VAGGELATOU (1999) On the number of appearances of a word in a sequence of i.i.d. trials. Meth. Comput. Appl. Probab. 1:3, 329-348. 9. O. CHRYSSAPHINOU, S. PAPASTAVRIDIS & E. VAGGELATOU (2001) On the number of nonoverlapping appearances of several words in a Markov chain. Combin. Probab. Comp. 10, 293-308. 10. F. N. DAVID & D. E. BARTON (1962) Combinatorial Chance. Hafner, New York. 11. W. FELLER (1968) An Introduction to Probability Theory and its Applications. New York, Wiley, 3nd edition. 12. L. GuiBAS & A. ODLYZKO (1981) String Overlaps, Pattern Matching, and Nontransitive Games. J. Comb. Theory Ser. A 30, 183-208. 13. B. G. PlTTEL (1981) Limiting behavior of a process of runs. Ann. Probab. 9, 119-129. 14. J. WOLFOWITZ (1944) Asymptotic distribution of runs up and down. Ann.
Math. Statist. 15, 163-172.
Normal approximation in geometric probability
Mathew D. Penrose and J. E. Yukich Department of Mathematical Sciences, University of Bath Bath BA2 7AY, United Kingdom E-mail: [email protected] and Department of Mathematics, Lehigh University Bethlehem PA 18015, USA E-mail: [email protected] We use Stein's method to obtain bounds on the rate of convergence for a class of statistics in geometric probability obtained as a sum of contributions from Poisson points which are exponentially stabilizing, i.e. locally determined in a certain sense. Examples include statistics such as total edge length and total number of edges of graphs in computational geometry and the total number of particles accepted in random sequential packing models. These rates also apply to the 1-dimensional marginals of the random measures associated with these statistics.
1. Introduction In the study of limit theorems for functionals on Poisson or binomial spatial point processes, the notion of stabilization has recently proved to be a useful unifying concept (Baryshnikov & Yukich (2005 [BY2]), Penrose & Yukich (2001 [PY1], 2003 [PY3]). Laws of large numbers and central limit theorems can be proved in the general setting of functionals satisfying an abstract 'stabilization' property whereby the insertion of a point into a Poisson process has only a local effect in some sense. These results can then be applied to deduce limit laws for a great variety of particular functionals, including those concerned with the minimal spanning tree, the nearest neighbor graph, Voronoi and Delaunay graphs, packing, and germ-grain models. Several different techniques are available for proving general central limit theorems for stabilizing functionals. These include a martingale ap37
38
Mathew D. Penrose and J. E. Yukich
proach [PY1] and a method of moments [BY2]. In the present work, we revisit a third technique for proving central limit theorems for stabilizing functionals on Poisson point processes, which was introduced by Avram & Bertsimas (1993 [AB]). This method is based on the normal approximation of sums of random variables which are 'mostly independent of one another' in a sense made precise via dependency graphs, which in turn is proved via Stein's method: see the monograph Stein (1986). It has the advantage of providing explicit error bounds and rates of convergence. We extend the work of Avram & Bertsimas in several directions. First, whereas in [AB] attention was restricted to certain particular functionals, here we derive a general result holding for arbitrary functionals satisfying a stabilization condition which can then be checked rather easily for many special cases. Second, we consider non-uniform point process intensities and do not require the functionals to be translation invariant. Third, we improve on the rates of convergence in [AB] by making use of the recent refinement by Chen & Shao (2004) of previous normal approximation results for sums of 'mostly independent' variables. Finally, we apply the methods not only to random variables obtained by summing some quantity over Poisson points, but to the associated random point measures, thereby recovering many of the results of [BY2] on convergence of these measures, and without requiring higher order moment calculations. We add to [BY2] by providing information about the rate of convergence, and relaxing the continuity conditions required in [BY2] for test functions and point process intensities. A brief comparison between the methods of deriving central limit theorems for functionals of spatial point processes is warranted. Only the dependency graph method used here, to date, has yielded error bounds and rates of convergence. On the other hand, our method requires bounds on the tail of the 'radius of stabilization' (i.e., on the range of the local effect of an inserted point). The martingale method, in contrast, requires only that this radius be almost surely finite, and for this reason is applicable to some examples such as those concerned with the minimal spanning tree, for which no tail bounds are known and which therefore lie beyond the scope of the present work. The moment method [BY2] and martingale method (Penrose (2004 [Pe2])), unlike the dependency graph method, provide information about the variance of the Gaussian limits. Whereas the moment method requires exponential tail bounds for the radius of stabilization, one of our central limit theorems (Theorem 2.5) requires only that this tail r(t) decay as a (large) negative power of t.
Normal approximation in geometric probability
39
With regard to ease of use in applications, the dependency graph method and method of moments require checking tail bounds for the radius of stabilization, which is usually straightforward where possible at all. The method of moments requires a more complicated (though checkable) version of the bounded moments condition (2.5) below (see [BY2]). The dependency graph method requires some separate calculation of variances if one wishes to identify explicitly the variance of the limiting normal variable. The martingale method requires the checking of slightly more subtle versions of the stabilization conditions needed here [Pe2, PY1]. 2. General results Let d > 1 be an integer. For the sake of generality, we consider marked point processes in Rd. Let (M, FM^M) be a probability space (the mark space). Let £((x, s); X) be a measurable R-valued function defined for all pairs ((x, s),X), where X c Rd x M is finite and where (x, s) € X (so l e i 1 1 and s G M). When (x, s) G (Rd x M)\X, we abbreviate notation and write £((x, s); X) instead of £,((x, s); X U {(x, s)}). Given X C RdxM, a > 0 andy G Rd, we let aX := {(ax,i) : (x,t) e X} and y + X := {(y + x,t) : (x,t) G X}; in other words, translation and scalar multiplication on Md x Ai act only on the first component. For all A > 0 let £x({x, s); X) := i{{x, s); x + X1/d(-x
+ X)).
We say £ is translation invariant if £((y + x, s); y + X) = £((x, s); X) for all y G Rd, all (x,s) £ Md x M and all finite X C Md x M. When £ is translation invariant, the functional £A simplifies to £\((x,s);X) = Let K be a probability density function on Rd with compact support A C Rd. For all A > 0, let V\ denote a Poisson point process in Rd x M with intensity measure {\n{x)dx) x ^M{ds). We shall assume throughout that K is bounded with supremum denoted ||K||OOLet (A\, A > 1) be a family of Borel subsets of A. The simplest case, with A\ = A for all A, covers all examples considered here; we envisage possibly using the general case in future work. The following notion of exponential stabilization, adapted from [BY2], plays a central role in all that follows. For x G W1 and r > 0, let Br(x) denote the Euclidean ball centered at x of radius r. Let U denote a random element of M with distribution FM , independent of V\.
40
Mathew D. Penrose and J. E. Yukich
Definition 2.1: £ is exponentially stabilizing with respect to K and (AA)A>I if for all A > 1 and all x € Ax, there exists an a.s. finite random variable R := R(x, A) (a radius of stabilization for £ at x) such that for all finite X C (A \ Bx-\/dR{x)) x M, we have 6 ((x, U); [Vx n (B A -./d fl (i) x M)] U A") = £A ((x, £/); PA n (BA-i/-fl(x) x A*)),
(2.1)
and moreover the tail probability r(t) denned for t > 0 by r(i) := satisfies
sup
P[R{x, A) > t]
(2.2)
A>1, x€Ax
limsupfMogr^) < 0 .
(2.3)
t—>oo
For 7 > 0, we say that £ is polynomially stabilizing of order 7 if the above conditions hold with (2.3) replaced by the condition limsup£'l'T(£) < 00. t-KX>
Condition (2.1) may be cast in a more transparent form as follows. Each point of X is a pair (x,U), with x G Rd and (7 € .M, but for notational convenience we can view it as a point x in Rd carrying a mark U := Ux. Then we can view X as a point set in Rd with each point carrying a mark in M.. With this interpretation, (2.1) stipulates that for all finite (marked) X C A \ Bx-i/dji(x), we have & (x; (Vx n B A -i/- fl (x)) U #) = & (z; Vx n B A - V - H W ) •
(2-4)
Roughly speaking, .R := i?(x, A) is a radius of stabilization if the value of ^x^'jVx) is unaffected by changes to the points outside Bx-i/dR(x). Functionals of spatial point processes often satisfy exponential stabilization (2.1) (or (2.4)); here is an example. Suppose M = [0,1] and PM is the uniform distribution on [0,1]. Suppose that A is convex or polyhedral, and K is bounded away from zero on A. Suppose a measurable function (q(x),x 6 A) is specified, taking values in [0,1]. Adopting the conventions of the preceding paragraph, for a marked point set X C Md let us denote each point x G X as 'red' if Ux < q(x) and as 'green' if Ux > q(x). Let £(x; X) take the value 0 if the nearest neighbor of x in X has the same color as x, and take the value 1 otherwise. Note that £ is not translation invariant in this example, unless q(-) is constant. For x G A let R := R(x, A) denote the distance between A ^ x and its nearest neighbor in \l^dVx • Then
41
Normal approximation in geometric probability
stabilization (2.4) holds because points lying outside Bx-i/dR(x) will not change the value of £\{x; V\), and it is easy to see that R has exponentially decaying tails. This example is relevant to the multivariate two-sample test described by Henze (1988). See Section 3 for further examples. Definition 2.2: £ has a moment of order p > 0 (with respect to K and (Ax)x>i) if (2.5) sup E[|6((z,t/);V]
x£Ax
For A > 0, define the random weighted point measure p,x on Rd by (x,s)eTxri{AxxM)
and the centered version Jrx := /x| — E [fJ-x]Let B(A) denote the set of bounded Borel-measurable functions on A. Given / 6 B(A), let (f,fy := / A fd,i{ and ,/Z«) := /^ fdj^. Let $ denote the distribution function of the standard normal. Our main result is a normal approximation result for (/, A*f), suitably scaled. Theorem 2.3: Suppose ||K||OO < oo. Suppose that £ is exponentially stabilizing and satisfies the moments condition (2.5) for some p > 2. Let q G (2,3] with q < p. Let f e B(^4) and put T\ := (/, f/x). Then there exists a finite constant C depending on d,^, K, p, q and f, such that for all X > 2, SU
P P f m ? w a ^ f l " $ W ^ C ( lQ g A)d9A(VarTA)-«/2. (2.6)
Separate arguments are required to establish the asymptotic behavior of the denominator (Var(TA))1/2 in (2.6). When A\ = A for all A, it is typically the case for polynomially stabilizing functionals satisfying moments conditions along the lines of (2.5) that there is a constant
(2.7)
A—»-oo
For further information about
(f,\-1/2Jli)^Af(O,a2(f, 2
£,K)),
(2.8)
where M{0,cr ) denotes a centered normal distribution with variance a2 if
42
Mathew D. Penrose and J. E. Yukich
In many applications (2.7) holds with cr2(f,£, n) > 0, showing that the case q = 3 of (2.6) yields a rate of convergence O((log A)3rfA~x/2) to the normal distribution. In other words, we will make frequent use of: Corollary 2.4: Suppose ||K||OO < oo. Suppose that £ is exponentially stabilizing and satisfies the moments condition (2.5) for some p > 3. Let f G B(A) and put T\ := (f,fJ,{). If (2.7) holds with
tern
[{V&TTX)1'2
J
Our methods actually yield normal approximation and a central limit theorem when the exponential decay condition is replaced by a polynomial decay condition of sufficiently high order. We give a further result along these lines. Theorem 2.5: Suppose ||K||OO < oo. Suppose for some p > 3 that £ is polynomially stabilizing of order j with 7 > d(15Q + 6/p), and satisfies the moments condition (2.5). Let f e B(A) and put T\ := (f,^x). Suppose that (2.7) holds for some a2 > 0. Tiien (2.8) holds and if a2 := <x2(/,£, K) > 0 there exists a finite constant C depending on d, £, K, p and f, such that for all A > 2,
~ E J )1/2 sup P [ ^Var tem. L( -'A)
< t] - $(t) < CA< 150pd+6d -^/ 2 ^- 6d ). J
(2.9)
Remarks:
(1) Our results are stated for marked Poisson point processes, i.e., for Poisson processes in Rd x M where M is the mark space. These results are reduced to the corresponding results for unmarked Poisson point processes in Rd by taking Ai to have a single element (denoted m, say) and identifying Rd x M with Rd in the obvious way by identifying (x, m) with x for each x G M.d. In this case the notation (2.4) is particularly appropriate. Other treatments, such as [BY,Pe2,PYl] and Penrose & Yukich (2002 [PY2]), tend to concentrate on the unmarked case with commentary that the proofs carry through to the marked case; here we spell out the results and proofs in the more general marked case, which seems worthwhile since examples such as those in Section 3.3 use the results for marked point processes. Our examples in Sections 3.1, 3.2,
Normal approximation in geometric probability
(2)
(3)
(4)
(5)
(6)
43
and 3.4 refer to unmarked point processes and in these examples we identify Rd x {m} with Rd as indicated above (so that V\ is viewed as a Poisson process in Rd). We are not sure if the logarithmic factors can be removed in Theorem 2.3 or Corollary 2.4. Avram & Bertsimas [AB] obtain a rate of 0((logA) 1+3/ ( 2d )A- 1/4 ), for the length of the fc-nearest neighbors (directed) graph, the Voronoi graph, and the Delaunay graph (see Sections 3.1 and 3.2). Our method for general stabilizing functionals is based on theirs, but uses a stronger general normal approximation result (Lemma 4.1 below). If (2.7) holds with
3. Applications Applications of Corollary 2.4 to geometric probability include functionals of proximity graphs, germ-grain models, and random sequential packing models. The following examples are for illustrative purposes only and are not meant to be encyclopedic. For simplicity we will assume that Rd is equipped with the usual Euclidean metric. While translation invariance is not needed in the general results in Section 2, most of the examples treated in this section involve translation invariant functionals £. However, the ex-
44
Mathew D. Penrose and J. E. Yukich
amples can be modified to treat the (non-translation-invariant) situation where Rd has a local metric structure. 3.1.
k-nearest neighbors graph
Let k be a positive integer. Given a locally finite point set X c Rd, the knearest neighbors (undirected) graph on X, denoted kNG(X), is the graph with vertex set X obtained by including {x, y} as an edge whenever y is one of the k nearest neighbors of x and/or x is one of the k nearest neighbors of y. Thefc-nearestneighbors (directed) graph on X, denoted kNG'(X), is the graph with vertex set X obtained by placing a directed edge between each point and its k nearest neighbors. Let Nk(X) denote the total edge length of the (undirected) k-nearest neighbors graph on X. Note that Nk{X) = Y,xex €k(x'>x)' where (,k(x; X) denotes half the sum of the edge lengths in kNG(X) incident to x. If A is convex or polyhedral and n is bounded away from 0 on A, then £fc is exponentially stabilizing (cf. Lemma 6.1 of [PY1]) and has moments of all orders. Moreover, as shown in [BY2] (see e.g. display (2.11), Theorem 3.1), at least when / is continuous and A\ = A for all A, lim A-1Var(/,Mi) =V^ f f {xf K{X)^'^
A—oo
'd dx,
JA
where V^ denotes the limiting variance for the total edge length of the fc-nearest neighbors graph on >}ldV\ when K is the uniform distribution on the unit cube. Since V^ is strictly positive (Theorem 6.1 of [PY1]), it follows that (2.7) holds with (T2(f,£k,n) > 0. We thus obtain via Corollary 2.4 the following rates in the CLT for the total edge length of Nk(\1/dV\) improving upon Avram & Bertsimas [AB] and Bickel & Breiman (1983). A similar CLT holds for the total edge length of the fc-nearest neighbors directed graph. Theorem 3.1: Suppose A is convex or polyhedral and n is bounded away from 0 on A. Let Nx := Nk{\l/dV\) denote the total edge length of the l d k-nearest neighbors graph on \ ^ V\. There exists a finite constant C depending on d,£k, and K such that
p
(1) SC(logA)3 1/2
d [wM-* l s
^ -
(31)
Similarly, letting £ (:r; X) be one or zero according to whether the distance between x and its nearest neighbor in X is less than s or not, we
Normal approximation in geometric probability
45
can verify that £s is exponentially stabilizing and that the variance of SxeA'/dpA €s(x': ^d/P\) is bounded below by a positive multiple of A. We thus obtain rates of convergence of O((logA)3dA~1/2) in the CLT for the one-dimensional marginals of the empirical distribution function of k nearest neighbors distances on >}IAV\, improving upon those implicit on p. 88 of Penrose (2003 [Pel]). Using the results from section 6.2 of [PYl], we could likewise obtain the same rates of convergence in the CLT for the number of vertices of fixed degree in the k nearest neighbors graph. Finally in this section, we re-consider the non-translation-invariant example given in Section 2, where a point at x is colored red with probability q(x) and green with probability 1 - q(x), and £(x; X) takes the value 0 if the nearest neighbor of x in X has the same color as x, and takes the value 1 otherwise. We can use Corollary 2.4 to derive a central limit theorem, with O((log A)3dA~1^2) rate of convergence, for J2xev /( x )£( x ! ^M> where / is a bounded measurable test function. 3.2. Voronoi and sphere of influence graphs We will consider the Voronoi graph for d = 2 and the sphere of influence graph for all d. Given a locally finite set X C M2 and given x € X, the locus of points closer to x than to any other point in X is called the Voronoi cell centered at x. The graph consisting of all boundaries of Voronoi cells is called the Voronoi graph generated by X. The sum of the lengths of the finite edges of the Voronoi graph on X admits the representation J2xex £( x ' ^)i w n e r e £(#; X) denotes one half the sum of the lengths of the finite edges in the Voronoi cell at x. If K is bounded away from 0 and infinity and A is convex, then geometric arguments show that there is a random variable R with exponentially decaying tails such that for any x € V\, the value of £(x;V\) is unaffected by points outside Bx-i/dR(x) [BY2,PY1,PY3]. In other words, £ is exponentially stabilizing and satisfies the moments condition (2.5) for all p > 1. Also, the variance of the total edge length of these graphs on \lldV\ is bounded below by a multiple of A. We thus obtain O((log A)3dA~1/2) rates of convergence in the CLT for the total edge length functional of these graphs on \1/ldP\, thereby improving and generalizing the results of Avram & Bertsimas [AB]. Given a locally finite set X C M.d, the sphere of influence graph SIG(A') is a graph with vertex set X, constructed as follows: for each x € X let Bx be a ball around x with radius equal to mm.yex\{x){\y ~ XW- Then Bx
46
Mathew D. Penrose and J. E. Yukich
is called the sphere of influence of x. We put an edge between x and y iff the balls Bx and By overlap. The collection of such edges is the sphere of influence graph (SIG) on X. The total number of edges of the sphere of influence graph on X admits the representation Ylxex £(x> %), where £(x; X) denotes one half the degree of SIG at the vertex x. The number of vertices of fixed degree 8 admits a similar representation, with £(x; X) now equal to one (respectively, zero) if the degree at x is 8 (respectively, if degree at x is not 5). If K is bounded away from 0 and infinity and A is convex, then geometric arguments show that both choices of the functional £ stabilize (see sections 7.1 and 7.3 of [PY1]). Also, the variance of both the total number edges and the number of vertices of fixed degree in the SIG on X^dV\ is bounded below by a multiple of A (sections 7.1 and 7.3 of [PY1]). We thus obtain O((log A)3dA-1/2) rates of convergence in the CLT for the total number of edges and the number of vertices of fixed degree in the sphere of influence graph on V\. 3.3. Random sequential packing models The following prototypical random sequential packing model is of considerable scientific interest; see [PY2] for references to the vast literature. With N(X) standing for a Poisson random variable with parameter A, we let B\,i,B\t2, • ••, BA,AT(A) be a sequence of d-dimensional balls of volume A"1 whose centers are i.i.d. random
Normal approximation in geometric probability
47
Let £((x, s); X) be either 1 or 0 depending on whether the ball centered at x at times s is packed or discarded. Consider the re-scaled packing functional £\({x, s); X) = £{{\1/dx, s); \l'dX), where balls centered at points of \l'dX have volume one. The random measure N{\) i=l
is called the random sequential packing measure induced by balls with centers arising from n. The convergence of the finite dimensional distributions of the packing measures fj,^ is established in Baryshnikov & Yukich (2003 [BY1]),[BY2]. £ is exponentially stabilizing [PY2,BY1] and for any continuous / e B([0, l]d) and K uniform, the variance of (/, fi^) is bounded below by a positive multiple of A [BY2], showing that (/, /u^} satisfies a CLT with an O((log A)3dA~1/'2) rate of convergence. It follows easily from the stabilization analysis of [PY2] that many variants of the above basic packing model satisfy similar rates of convergence in the CLT. Examples considered there include balls of bounded random radius, cooperative sequential adsorption and monolayer ballistic deposition. In each case, the number of particles accepted satisfies the CLT with an O((logA)3dA~1/2) rate of convergence. The same comment applies for the number of seeds accepted in spatial birth-growth models. 3.4. Independence number, off-line packing An independent set of vertices in a graph G is a set of vertices in G, no two of which are connected by an edge. The independence number of G, which we denote /3(G), is denned to be the maximum cardinality of all independent sets of vertices in G. For r > 0, and for finite or countable X C Rd, let G(X, r) denote the geometric graph with vertex set X and with edges between each pair of vertices distant at most r apart. Then the independence number P(G(X, r)) is the maximum number of disjoint closed balls of radius r/2 that can be centered at points of X; it is an 'off-line' version of the packing functionals considered in the previous section. Let b > 0 be a constant, and consider the graph G{V\,b\~lld) (or equivalently, G(X1/'d'P\,b)). Random geometric graphs of this type are the subject of the monograph [Pel], although independence number is considered only briefly there (on page 135). A law of large numbers for the independence number is described in Theorem 2.7 (iv) of [PY3].
48
Mathew D. Penrose and J. E. Yukich
For u > 0, let 7iu denote a homogeneous Poisson point process of intensity u on Rd, and let H^ be the point process ~HU with a point inserted at the origin. As on page 189 of [Pel], let Ac be the infimum of all u such that the origin has a non-zero probability of being in an infinite component of G(Hu,l). If 6d||«;||oo < Ac, we can use Corollary 2.4 to obtain a central limit theorem for the independence number /3{G{V\, b\~l^d)), namely
™S
P
\P(G(VX,bX-W))
[
-E(3(G(Vx,b\-1/d))
(Var/J(G(PAl6A-V-)))Va
^ .1
* *J "
a,,.
$W
< C(logA)3dA-1/2.
(3.2)
We sketch the proof. For finite X C Kd and x £ X, let £(x; X) denote the independence number of the component of G(X, b) containing vertex x, divided by the number of vertices in this component. Then Ylxex £( x ' ^) is the independence number of G(X,b), since the independence number of any graph is the sum of the independence numbers of its components. Also, our choice of £ is translation-invariant, and so we obtain
(3{G{Vx,\-l/db)) = P{G{\1'dVx,b))= E
i{^fdx\\lldVx)
= J2 &(*;PA) = 4,/>, x€Vx
where we here take the test function / to be identically 1 and take A\ to be A for all A. Thus a central limit theorem holds for P(G{V\,\-1/db)) by application of Corollary 2.4, if £ and K satisfy the conditions for that result. We take R(x, A) to be the distance from \lldx to the furthest point in the component containing \l/dx of G(\1/dV\,b), plus 26. Since £\{x;V\) is determined by the component of G^^Vx^) containing A1/dx, and this component is unaffected by the addition or removal of points to/from V\ at a distance greater than \~l/dR(x, A) from x, it is indeed the case that R(x, A) is a radius of stabilization. The point process \1/dV\ is dominated by H^w^ (in the sense of [Pel], page 189). Hence, P[R(x,\) > t] is bounded by the probability that the component containing x of G(7i\\K^!x> U {\l^dx),b) has at least one vertex outside Bt-2b(^l^dx). This probability does not depend on x, and equals the probability that the component of G(H®d,.,, ,1) containing the origin includes a vertex outside i?(i/(>)_2- By exponential decay for subcritical continuum percolation (Lemma 10.2 of [Pel]), this probability decays exponentially in t, and exponential stabilization of £ follows. The moments
Normal approximation in geometric probability
49
condition (2.5) is trivial in this case, for any p, since 0 < £(z; X) < 1. Thus, Corollary 2.4 is indeed applicable, provided that (2.7) holds in this case, with a2 > 0. Essentially (2.7) follows from Theorem 2.1 of [BY2], with strict inequality a2 > 0 following from (2.10) of [BY2]; in the case where K is the density function of a uniform distribution on some suitable subset of Kd, one can alternatively use Theorem 2.4 of [PY1]. We do not go into details here about the application of results in this example, but we do comment further on why the distribution of the 'add one cost' (see [PY1,BY2]) of insertion of a point at the origin into a homogeneous Poisson process is nondegenerate, since this is needed to verify a1 > 0 and this example was not considered in [PY1] or [BY2]. The above add one cost is the variable denoted A(oo) in the notation of [PY1], or A?(u) in the notation of [BY2]. It is the independence number of the component containing the origin of G(Hu;b) minus the independence number of this component with the origin removed (we need only to consider the case where bdu is subcritical). This variable can take the value 1, for example if the origin is isolated in G(TCu;b), or zero, for example if the component containing the origin has two vertices. Both of these possibilities have strictly positive probability, and therefore A(oo) has a non-degenerate distribution. 4. Proof of theorems 4.1. A CLT for dependency graphs We shall prove Theorem 2.3 by showing that exponential stabilization implies that a modification of (/, Jjrx) has a dependency graph structure, whose definition we now recall (see e.g. Chapter 2 of [Pel]). Let Xa, a £ V, be a collection of random variables. The graph G := (V,£) is a dependency graph for Xa, a 6 V, if for any pair of disjoint sets A\,Ai C V such that no edge in £ has one endpoint in A\ and the other in A2, the sigma-fields a{Xa,a G Ai}, and cr{Xa,a e A2}, are mutually independent. Let D denote the maximal degree of the dependency graph. It is well known that sums of random variables indexed by the vertices of a dependency graph admit rates of convergence to a normal. The rates of Baldi & Rinott (1989) and those in Penrose [Pel] are particularly useful; Avram & Bertsimas [AB] use the former to obtain rate results for the total edge length of the nearest neighbor, Voronoi, and Delaunay graphs. In many cases, the following theorem of Chen & Shao (2004, Theorem 2.7) provides superior rate results. For any random variable X and any
50
p>0,\et\\X\\p
Mathew D. Penrose and J. E. Yukich
= (E[\X\P})1/P.
Lemma 4.1: Let 2 < q < 3. Let Wi, i G V, be random variables indexed by the vertices of a dependency graph. Let W = ^ i 6 V Wj. Assume that E [W2] = l,E[Wi] = 0, and \\Wi\\q < 0 for alii e V and for some 6 > 0. Then sup \P[W
$(*)| < ^D^-^^e11.
(4.1)
t
4.2. Auxiliary lemmas To prepare for the proof of Theorem 2.3, we will need some auxiliary lemmas. Throughout, C denotes a generic constant depending possibly on d, £, and K and whose value may vary at each occurrence. We assume A > 1 throughout. Let (p\,X > 0) be a function to be chosen later, in such a way that px —> co and \~1ldp\ —> 0 as A —* oo. Given A > 0, let s\ := \~1ldp\, and let V := V(A) denote the number of cubes of the form d
Q = JJiJiSx, (ji + !)**),
w i t h a11
i=l
U e Z,
such that fQn(x)dx > 0; enumerate these cubes as Qi,Q2, • • • ,Qv(\)Since n is assumed to have bounded support, it is easy to see that V(A) = O(\pid) as A -> oo. For all 1 < i < V(A), the number of points of V\ O (Q, x At) is a Poisson random variable Ni := N(ui), where Vi : = A /
« ( x ) d x < ||K||OO/0A-
(4-2)
Assuming Vi > 0, choose an ordering on the points of V\ n (Qj x A'l) uniformly at random from all (A^)! possible such orderings. Use this ordering to list the points as (Xiti, C/j,i), ••-, (Xi,Nn ^i.AfJ, where conditional on the value of Ni, the random variables Xij, j = 1,2,... are i.i.d. on Qi with a density Kj(-) := K ( - ) / / Q . n{x)dx, and the f/jj are i.i.d. in A^ with distribution P x , independent of {Xjj}. Thus we have the representation V\ = ^^{(Xij, Uij)}^. For all 1 < i < V(X), let Vi :— V\ \ {(XijjUij)}^ and note that Vi is a Poisson point process on Rd x M with intensity \K(x)l^d\Q.(x)dx x Pyn(ds).
51
Normal approximation in geometric probability
We show that the condition (2.5), which bounds the moments of the value of £ at points inserted into V\, also yields bounds on ^IMiXijiUi^VxWlA^Xi^l^N,}. More precisely, we have: Lemma 4.2: If (2.5) holds for some p > 0, there is a constant C := C(p) such that for all X > 1, all j > 1 and 1 < i < V(X), E[MXi.ii'Px)
• UAXi,i)lj
(4.3)
Proof: If Ni = n, then denote {{Xitl, £/ u ),..., (XitNi, Ui>Ni)} by Xn. By definition, E [\U(Xij, Uu); VX) • lAx(Xitj)l,
H
.
where the expectation on the right hand side is with respect to U, Xn-\ and Vi. The above is bounded by i/^
E[|6((x,^);^-iUPOIPMa:)^--7
/ oo
.
= ViJ2 m=0JQ,nAx
V
E [\t\((x,U);Vx)\p I |PAn(Q i xM)| = m]Kl(x)dx
xP{\Vxn(QixM)\=m} = Vi [
JQinAx
E[\tx((x,U);V\)\p}Ki(x)dx < const, x
Vi,
where the last inequality follows by (2.5). By (4.2), this shows (4.3).
•
For 1 < i < V, and for j e {1,2,...}, we define c. . .= ftxttXij'UijbVx) ' \0
?4>j
if Ni > j,Xitj otherwise
G .4A
Lemma 4.3: If (2.5) holds for some p > q > 1, then there is a constant forl
X>,il ^=1
< Cpfp+1)/p.
(4.4)
52
Mathew D. Penrose and J. E. Yukich
Proof: Fix i < V(\) and write ^ for £itj. Clearly, with N := Ni and v •=
i/u
oo
oo
i=l
/
V
i=l
g
oo
OO
\
/
fc=O
OO
q
OO
j=lfc=O
J= l
q
g
Since a.s. only finitely many summands in the double sum are non-zero, by subadditivity of the norm, the above is bounded by oo
oo
M
fc=O j = l oo fe=0
^
l
J=
+1
|2* "J
?
H
j=l
g
oo L2fc+1>/j
j=l
q
M
< E E llo-i^^ii^ + Ei^- 1 ^!!,' fc=O j = l
j= l
(4-5)
where here and elsewhere \x\ denotes the greatest integer less than or equal to x. With rj := (1/q) — (1/p), Holder's inequality followed by (4.3) yields
Ifo • lN>2*v\\q < IIOIIP • (P[N > 2kv]T < Cpdx/P(P[N
> 2kv\y.
(4.6)
Substituting into (4.5) we obtain oo
oo
i=i
\v\
< Cpf P Z)i/2 fc+1 • (P[N > 2fcH)" + £ |& • l;v<,||p • (4.7)
E U 9
fe=o
3=1
By tail estimates for the Poisson distribution (see e.g. (1.12) in [Pel]), since e2 < 8 we have (P[N > 2kv])r> < (exp (-2 fc - 1 i/log(2 /c ))) r ' = exp(-fc2 fc - 1 ^log2), Hence, (P[N > 2kv])r> < 2"2fc,
k > max(3,2 - log^rji/)).
k > 3.
53
Normal approximation in geometric probability
Hence, since i) is a constant, for v > 0 we have oo
Y^ 2fe+i • {p\N > 2ku])r> < k=0
fc<max(3,2-log2(T)i/))
2k+1
Y^
+
Yl
2l k
fc>max(3,2-log2(t)i/))
~
4.3. Proof of Theorems 2.3 and 2.5 We prove Theorems 2.3 and 2.5 together. When proving Theorem 2.3 we assume that £ is exponentially stabilizing and (2.5) holds for some p > 2, and we choose q G (2,3] with q < p. When proving Theorem 2.5 we assume that £ is polynomially stabilizing of order 7 with 7 > d(150 + 6/p), and that (2.5) holds for some p > 3, and we set q = 3. Throughout this section, we fix / 6 B(A) and set Tx := (/,/4). We follow the setup of the preceding section, with the support of K covered by cubes of side \~lfdp\, and we now choose p\. With the tail probability r(t) defined at (2.2), we choose p\ in such a way that there is a constant C such that for all A > 1, pd>!P{\T{px)){q-2)/{2q)
and
T(PX)
(4.8)
and also p{
(4.9)
In the exponentially stabilizing case (Theorem 2.3) we achieve this by taking p\ = a log A for some suitably large constant a. In the polynomially stabilizing case of order 7 (Theorem 2.5), we take p\ = CXa with 25p /7 d\ 25 ^ 7 so a[---)=—. 4.10 P7 - 6d \6 pj 6 This implies that (4.8) holds with q = 3, and that (4.9) holds (to obtain the last conclusion we use our assumption on 7, which implies 7 > d(25+56/p).) For all 1 < i < V and all j = 1,2,..., let Rij denote the radius of stabilization of £ at (Xij,Uij) if 1 < j < Ni and Xij € A\\ let Rij be zero otherwise. Let Eitj := {Rij < p\}. Let E\ := nY=1 n^Lj Eitj. Then by Markov's inequality and standard Palm theory (e.g. Theorem 1.6 in [Pel]) 0=
54
Mathew D. Penrose and J. E. Yukich
P[ECX]<E £ l > ? . =A /
P[R{x,\)>px}K{x)dx < AT(PA).
(4.11)
For each A, and x € Rd, set /A(X) := /(ar)lAA(a0- Recalling the representation V\ = U i=1 {^i,j}jli, we have V(\)
Ni
T*=Y, T,M(Xi,j,Ui,i)\'Px) • h{Xi,i). To obtain rates of normal approximation for T\, it will be be convenient to consider a closely related sum enjoying more independence between terms, namely VW Ni T
A := E X > ( f e Ui,j);Vx) • IEV • fx{Xitj).
For all 1 < i < V(X) define Ni
Si := SQi := (VarTA)-x/2 ^ ^ ( ( X i J ; ^ , i ) ; ^ ) • 1^,, • h(Xij) j=\
and put 5 := (VaiTxyx/2(Tx - ETJ[) = E ^ ^ i - E5 4 ). It is immediate that Var5 = E 5 2 = 1. We define a graph G\ := (V\,£\) as follows. The set VA consists of the subcubes Q\, —,Qv(\), and edges (Qi,Qj) belong to the edge set £x if d(Qi,Qj)
< 2a\-l/dpx,
where d{Qi,Qj)
:= inf{ja: - y\: x e Qt, y e Qj}.
By definition of the radius of stabilization R(x, A), the value of Si is determined by the restriction of Vx to the A~1/dpA-neighborhood of the cube Qi. By the independence property of the Poisson point process, if Ai and At are disjoint collections of cubes in VA such that no edge in £x has one endpoint in A\ and one endpoint in A2, then the random variables {SQ1,Q1 & AI} and {SQ^QJ € A2} are independent. Thus Gx is a dependency graph for To prepare for an application of Lemma 4.1, we make five observations: (i) V(X) := |VA| - O(Xpxd) as A ^ oo.
Normal approximation in geometric probability
55
(ii) Since the number of cubes in Qi,.-.,Qv distant at most 2\~1/dp\ from a given cube is bounded by 5d, it follows that the maximal degree D satisfies D := Dx < 5 d . (iii) The definitions of Si and £jj and Lemma 4.3 tell us that for all 1 < i < V(X)
115,11, < CCVarTl)-1^ £ | £ . . |
< C{Y^T'xr^pfp+l)/p.
(4.12)
(iv) We can bound Var[T{] as follows. Observe that T'x is the sum of V(A) random variables, which by the case q = 2 of Lemma 4.3 each have a second . Thus the variance of moment bounded by a constant multiple of px each of the V(X) random variables is also bounded by a constant multiple of px ( p + 1 ) / p . Moreover, the covariance of any pair of the V(\) random variables is zero when the indices of the random variables correspond to non-adjacent cubes. For adjacent cubes, by the Cauchy-Schwarz inequality the covariance is also bounded by a constant multiple of ^ ( P + I J / P rp m g shows that Var[Tj;]=O(pf p + 2 ) / p A).
(4.13)
(v) Var[T{] is close to Var[2\] for A large. We require more estimates to show this. Note that \T'X - T\\ = 0 except possibly on the set Ex. Lemma 4.3, along with Minkowski's inequality, yields the upper bound V(X)
Ni
EEi^^'^);^)!1^^)
(4.14)
Since T\ = T'x on event E\, the Holder and Minkowski inequalities yield
<(\\Tx\\q +
\\Tx\\q)P[Ecx^-2^2^.
Hence, by (4.8), (4.11), and (4.14), IITA - T'x\\2 < CXpdx/p(\T(px))^-2^^
< C\~3
(4.15)
which implies that E[|Jl-TA|]
(4.16)
56
Mathew D. Penrose and J. E. Yukich
which we use later. Since Var[TA] = Var[T{] + Var(TA - T'x) + 2Cov(T{,Tx - T'x), by (4.15), (4.13), (4.9) and the Cauchy-Schwarz inequality, we obtain |Var[rA]-VarK]|
(4.17)
Given the five observations (i)-(v), we are now ready to apply Lemma 4.1 to prove Theorem 2.3. By (4.13) and (4.17), Var[TA], as a function of A, is bounded on bounded intervals. Hence, it suffices to prove that there exists Ao > 2 such that (2.6) holds for all A > Ao, since we can then extend (2.6) to all A > 2 by changing C if necessary. Trivially, (2.6) holds for large enough A when Var[T\] < 1, and so without loss of generality we now assume Var[T\] > 1. To establish the error bound (2.6) in this case, we apply the bound (4.1) to Wi-.= Si-ESi: l
C{V&iT'x)-1'2pdx{p+1)/p.
Our choice of 8 is applicable by (4.12). We clearly have E [Wi] = 0 and E [(EH(iA) wi)2} = 1- W i t h S = Y^~i] Wi, Lemma 4.1 along with observation (i) above yields sup \P[S
${t)\ <
C\Pxd(YarTx:)-q/2pxkip+1)/p
< CX(VnrTx)-^2pdxq,
(4.18)
where the last line makes use of the fact that Var[T(] > Var[TA]/2, which follows (for A large) from (4.17). Now if j3 > 0 is a constant and Z any random variable then by (4.18) we have for all t e R P[Z < t] < P[S < t + j3] 4- P[\Z < $(* + (3) + CX{Va.xTxyq/2pdxq
-S\>0\
+ P[\Z
< $(t) + C0 + CA(VarTA)-"/2p*> + P[\Z
-S\>0\ -S\>0\
by the Lipschitz property of $. Similarly for all ( e l , P[Z
CA(VarT A )- 9/2 / 9f - P[\Z - S\ > /?].
In other words s u p \P[Z
$ ( t ) | < C0 + CX(VnrTx)-q/2pdq
+ P[\Z -S\>
/?]. (4.19)
57
Normal approximation in geometric probability
Now by definition of 5, |(VarTjO-1/2(7A -ETX)-S\
= \{V&xT'x)-l'2{(Tx - ETA) - (TA - ET'X)}\ - T'x\ + E [\TX - TA|]}
< {\*vT'x)-^{\Tx
which by (4.16) is bounded by C\~3 except possibly on the set Ex which has probability less than CX~2 by (4.11) and (4.8). Thus by (4.19) with Z = (VarTj;)-1/2(rA _ E T A ) and 0 = C\~3 sup |P[(Varr A )- 1/2 (T A - ETA) < t] - $(*)| t
< C\{Va.xTx)-q/2pdxq + C\-2.
(4.20)
Moreover, by the triangle inequality sup IPKVarT;,)- 1 / 2 ^ - ETA) < t] - $(t)
-"+[^t-/H]-('\/il) + (()
<4-2I)
T|*(VI1!)-* -
Since for all s < t, we have |$(s) - $(<)| < (t — s) maxs(u) where 0 denotes the standard normal density, and since by (4.17) there is a constant 0 < C < oo such that for all A > 0 and all t 6 R V VarT' ~ * - '*' VarT' ¥
A
~ A2
A
we get
sup $U^±
| -$(t)
t \]j VaiT'x J ~ Thus by (4.20) and (4.21), sup P | " r * - E r * < J t I vVarT A J
$(t)
t
max
\X2 \ue[t-tc/\\t+tc/\2]
^(u)^ y
< CA(VarT A )-"/ 2 ^ 9 + CA"2.
'))
< £ A2
(4.22)
Finally, it can be deduced from equations (4.13) and (4.17) that we have VarTA = O(Xpdx(p+2)/p). Hence, under the assumptions of Theorem 2.3, in (4.22) the first term in the right hand side dominates, thus yielding the desired bound (2.6), and the proof of Theorem 2.3 is complete.
58
Mathew D. Penrose and J. E. Yukich
Under the assumptions of Theorem 2.5, provided cr 2 (/,£, n) > 0 the right hand side of (4.22) is bounded by C\~l^2p^, and since in this case we set p\ = Aa with a given by (4.10), some elementary algebra yields (2.9). Our assumption that 7 > d(150 + 6/p) then yields the central limit theorem behavior (2.8), which is also trivially true in the case where cr2(f, £, K) = 0. This completes the proof of Theorem 2.5. • Acknowledgements: We began this work while visiting the Institute for Mathematical Sciences at the National University of Singapore, and continued it while visiting the Isaac Newton Institute for Mathematical Sciences at Cambridge. We thank both institutions for their hospitality. JEY's research was supported in part by NSA grant MDA904-01-1-0029 and NSF grant DMS-0203720. References 1. F. AVRAM & D. BERTSIMAS (1993) [AB] On central limit theorems in geometrical probability. Ann. Appl. Probab. 3, 1033-1046. 2. P. BALDI & Y. RINOTT (1989) On normal approximations of distributions
in terms of dependency graphs. Ann. Probab. 17, 1646-1650.
3. Yu. BARYSHNIKOV & J. E. YUKICH (2003) [BY1] Gaussian fields and ran-
dom packing. J. Statist. Phys. I l l , 443-463.
4. Yu. BARYSHNIKOV & J.E. YUKICH (2005) [BY2] Gaussian limits for random
measures in geometric probability. Ann. Appl. Probab. 15, 213-253. 5. P. J. BICKEL & L. BREIMAN (1983) Sums of functions of nearest neighbor distances, moment bounds, limit theorems and a goodness of fit test. Ann. Probab. 11, 185-214.
6. L. H. Y. CHEN k. Q.-M. SHAO (2004) Normal approximation under local
dependence. Ann. Probab. 32, 1985-2028. 7. N. HENZE (1988) A multivariate two-sample test based on the number of nearest-neighbor type coincidences. Ann. Statist. 16, 772-783. 8. M. D. PENROSE (2003) [Pel] Random Geometric Graphs. Oxford University Press. 9. M. D. PENROSE (2004) [Pe2] Multivariate spatial central limit theorems with applications to percolation and spatial graphs. Ann. Probab. (to appear). Electronically available via http: //arxiv. org 10. M. D. PENROSE & J. E. YUKICH (2001) [PY1] Central limit theorems for some graphs in computational geometry. Ann. Appl. Probab. 11, 1005-1041. 11. M. D. PENROSE & J. E. YUKICH (2002) [PY2] Limit theory for random sequential packing and deposition. Ann. Appl. Probab. 12, 272-301. 12. M. D. PENROSE & J. E. YUKICH (2003) [PY3] Weak laws of large numbers in geometric probability. Ann. Appl. Probab. 13, 277-303. 13. C. STEIN (1986) Approximate Computation of Expectations. IMS, Hayward, CA.
Stein's method, Edgeworth's expansions and a formula of Barbour
Vladimir Rotar Department of Mathematics and Statistics, San Diego State University San Diego, CA 92182-7720, USA and Central Economics and Mathematics Institute, Russian Academy of Sciences Nakhimovsckii prospect - 47, Moscow 117418, Russia E-mail: [email protected] This paper concerns an approach to providing rates and asymptotic expansions in the CLT, based on the representation r
E{Wf(W)}= J2 :l~kE{f{rn\w)}
+ R,
m=0
where / is an arbitrary r times differentiable function, f^m' is its mth derivative, W is a r.v. with finite first r + 1 moments, 7m is the m-th cumulant of W or a characteristic close to it, and R is a remainder which may be small under suitable conditions. If W is a sum of r.v.'s, a good bound for the remainder should reflect the dependency structure between the summands. We review and discuss here known results on such bounds, and provide a new result that essentially widens possibilities of applications.
1. Introduction Starting from the pioneering paper by C. Stein (1972), there was a great deal of interest to Stein's method in limiting theorems, extended, developed and modified in many directions by C. Stein himself and many others during these thirty years. Most of these modifications were reflected, to a certain extent, in monographs or surveys by Stein (1986), Barbour, Hoist & Janson (1992), Barbour (1997), Chen (1998), Reinert (1998,1999), Rinott & Rotar (2000), Raic (2003). It is worthwhile also to mention here the approach from Tikhomirov (1980) which was not covered in the surveys above but which, 59
60
Vladimir Hotar
though involves characteristic functions, is close to Stein's method. Recently this approach was applied to limit theorems for quadratic forms in Gotze & Tikhomirov (1999, 2002). The present paper has mainly a methodological character and concerns one particular approach to the CLT and asymptotic expansions in it. Namely, we study the following representation, the possibility of which was first pointed out and justified in a certain case in Barbour (1986). Let / be an r times differentiate function, and W be a r.v. with finite first r + 1 moments. Then the representation mentioned is r
E{Wf(W)} = J2 -^E{f(m\W)}
+ R,
(1.1)
m=0
where -ym is the m-th cumulant of W, /( m ) is the m-th derivative of / , and R is a remainder. If f(w) = wr, then R = 0 (see also (2.5) below), and (1.1) gives just the well known relation between moments and cumulants. In the general case, R does not vanish but may be small, as we will see, under suitable conditions. If E{W} = 0, and E{W2} = 1, then (1.1) implies that r
E{Wf(W)} - E{f'(W)} = ^ 7m±LE{f(m\W)} + R. m=2
m
(1.2)
"
For a given function h, denote by S(h) the Stein function / solving the differential equation f'(w) — wf(w) = h(w) — $(/i) (see, e.g., Stein (1972, 1986)), that is,
S{h)(x) = ^JX
[h(t)-*(h)Mt)dt,
where ip is the standard normal density, and #(/i) = / ^ h(t)ip(t)dt. For / = S(h), from (1.2) it follows that r
E{h(W)} - *(/i) = - J2 ^-E{f(m\W)} m=2
m
- R.
(1.3)
-
The main term in (1.3) specifies the proximity to normality in terms of cumulants which are small under very mild requirements, provided W is close to normal. If the class of functions h under consideration is closed with respect to <S, regarding the main term one can use induction when establishing asymptotic expansions for E{h(W)}. This was shown first in Barbour (1986), and later in a more general situation in Rinott & Rotar
Stein's method for Edgeworth's expansions
61
(2003). In Section 4 we consider it in more detail, and right now we just note that, if one wants to write the Edgeworth expansion for E{h{W)} of a length of r, it suffices to write for E{f^m\W)} in (1.3) the expansion of the length r — m, which allows to use induction in r. Certainly, it presupposes a smoothness of h, though as was shown in Rinott & Rotar (2003), this requirement may be replaced, for some dependency structures, by a requirement on the smoothness of the distribution of W. Thus, the main problem concerns the remainder. The essential difficulty here lies in the fact that, as a rule, it is not possible to estimate R efficiently in terms of cumulants or some other characteristics of W itself as a non-decomposable r.v. (see Section 2 for details), so R should involve the "interior structure" of W. Considering independent summands, Barbour (1986) wrote down the representations (1.1) for each summand separately, and then combined them adroitly, using independence in a crucial way. For dependent summands the problem becomes essentially more complicated. For a generalized local dependency scheme the representation (1.1) was considered in Rinott & Rotar (2003). This paper has two aims. First, we systematize and discuss some known results concerning (1.1), second, we prove a new result that essentially widens possibilities of applications. In Section 2 we review results considering W as & non-decomposable random variable. Section 3.1 concerns the case when W is the sum of independent r.v.'s. In Sections 3.2-3.4 we consider dependent summands with a general dependency structure. Sections 3.2-3.3 concern results that use the so called ^-mixing characteristics. Section 3.4 contains a new result involving the so called strong mixing dependence measure a. Conditions on dependency structures in terms of the latter measure are, as a rule, much weaker than those in terms of
sup | / ( m ) ( t ) - / ( m ) ( 0 ) | . |t|<W
62
Vladimir Rotar
Proposition 2.1: (Barbour, 1986). Suppose f is r times differentiable, and E{\W\r+1ur(f;W)}
< oo.
(2.1)
Then (1.1) is true with a remainder R = Rr such that ^\-ya+1\E{\Wr-Ur(f;W)}
l^l^2^-^i
(7^7)\
v
s=0
where
+
E{\W\^ur(f;
H
'
W)}
(2 2)
-
(23)
d
' = t^y-
absolute constants cr are defined as Cl
= 2'
Cs
=
SUP
E{\Y'-'E{Y}\'Y
^
S
~ 2'
m
and sup in (2.4) is over all r.v. 's Y with the s-th cumulant -ys. (In particular, dx = 3, d2 = 5/2, d3 = 5/3). The function um(f;x) has been introduced in order to cover the case when £{|W| r + 2 } = oo while an intermediate moment E{\W\r+1+a} < oo, a < 1, is finite, as well as when /( r + 1 ) does not exist but a fractional derivative /( r + a > does. On the other hand, if ||/ (m+1) ||oo and E{\W\r+2} are finite, since «m(/;z)<||/ ( m + 1 ) ||ooM, we can write
,*,<_,,<-»„„(t^-wrp+sra). (,5)
As is easy to see, the bound for the remainder in (2.2) has three terms, namely, E{\W\rur(f;W)}
E{\W\^ur{f-W)} ~i ' (2.6) which do not vanish when W gets closer to a normal. Certainly, if £ { W } is zero, the first term equals zero, but the last two still may be not small. |7i|
7[
, 172
E{\Wr'ur{f;W)} (—-),
a n d
63
Stein's method for Edgeworth 's expansions
One way to fix the problem is to consider large r's if it is possible: then the denominators in (2.6) will make these terms small. The limiting case r = oo may be considered in the following way. Corollary 2.2: Suppose that W has a moment generating function, that / € Coo; and that, for all m = 0,1,.., ||/ ( m ) ||oo
(2.7)
E{Wf(W)} = J2 ^^-Eif^iW)}. m=0
Proposition 2.1 and Corollary 2.2 were proved in Barbour (1986) with use of Taylor expansions, and probably this is the best way to do that in the general case. Nevertheless, it is interesting to consider the following proof of (2.7) which uses Fourier transforms. Let Q(dx) and Q(t) be the distribution and the characteristic function of W, respectively, and f(t) be the Fourier transform of / provided that it exists. Assuming that all operations below are proper, we have oo
oo
E{Wf(W)} = ±-{ I Q'(t)f(-t)dt = ±-{ j {\nQ(t))'Q(t)f{-t)dt —oo
-. oo
= hH1-TT
7
—oo
oo
7
Q(t)(it)kf(-t)dt = E ^TT / fw(*)Q(dx)
CO
Conditions under which the above operations can be carried out are a bit more restrictive than in Corollary 2.2 (for example, the cumulant generating function should exist for all t), but this proof looks more illustrative and
64
Vladimir Rotar
makes presentations (1.1) and (2.7), so to speak, less unexpected than they may look at first glance. We turn now to the case when W is a sum of r.v.'s. 3. Representations for sums of r.v.'s Henceforth, W = Y17=i -^i> w n e r e -^' s a r e r.v.'s. To simplify some proofs and examples below we will always assume E{Xi} = 0 for all i. 3.1. Independent summands For a function / and afixedinteger r set /,(*) = / ( * + *), L(f;p,a) = sup{|/(z) - f(y)\/[\x - y\<*(l + \x\* + !#)]}, (3.1)
G = L(f^;p,a).
Let Wi = W - Xi. Proposition 3.1: (Barbour, 1986). Let f ber + 1 times differentiable, and G < oo for some p > 0,and a € [0,1]. Suppose that r.v.'s Xi, i = 1, ...,n, are independent, and E\Xi\r+p+a+1 < oo, 1 < i < n. Then (1.1) is true with the remainder R such that n
\R\ < dr^EUX^Urifw^XJ} n
< (2p + 2)2?Gdr J2^{I^l[+1+Q(l
+ l^il P + E{\W\P})}.
i=l
2
In particular, if E{W } = 1 , n
\R\ < A8Gdr ^2 E{\X\:+1+a(l
+ \Xi\")}.
forp<2. First, note that in the case p > 0 the above proposition covers unbounded functions / too. Second, we consider an example illustrating that the remainder "has a proper order". Below, for two r.v.s X and Y we say that X < Y in distribution if P(X > x) < P(Y > x) for any x.
65
Stein's method for Edgeworth's expansions
Example 3.2: Assume that W has been already normalized in some way, and after the normalization for all i = 1,..., n \Xi\ < Y/^Jn in distribution, where Y is a positive r.v. with a finite moment of the order (r + 1 + a). Then for p = 0, \R\ < 48Gd r .n- ( r - 1 + a : | / 2 £{|>T + 1 + a } =
O{n-^-1+a^2),
while the order of the last, r-th, term in (1.1) is equal to the order of jr+i(W), that is, due to the independency is O(n~^r~1^2). Set a = 1, which means in view of (3.1) that we require the finiteness of ||/'r+1^||oo (more precisely, we require / ^ r ' to satisfy the Lipschitz condition). Then • R = O(n-r/2). Now we turn to dependent summands. The framework below to a large extent is that from Rinott & Rotar (2003, abbreviated below as RR). 3.2. Decompositions
and dependency-neighborhoods wnere
s
chains
are
In all following sections W = J27=i ^i> X' arbitrary r.v.'s with zero mean. An essential feature of the approach below is that it does not concern a particular dependency structure but just provides a way to describe dependencies of a rather general nature. We fix an integer r, and for each summand Xj introduce r -f 1 formally arbitrary decompositions W = Wki + Wkl,
fc
= l,...,r + l.
(3.2)
Example 3.3: Let Wki be a partial sum of summands defined by Wki — J2jetf Xj, where for each i a sequence {Nki}rjX\ is a sequence of sets of indices such that i £ Mu Q Nii ••• Q Nr+i,%- One can view it as if Xi does not depend or weakly depends on {X,; j ^ A/H}, the vector {Xi\ i € A/H} does not depend or weakly depends on {X,; j ^ A/^}, and so on. In this case A/ii is called a dependency neighborhood of the first order, A/^i - that of the second order, and so on. Note that for different Xj's the chains of dependency neighborhoods are different. In the case of independent X's it suffices to set Jsfki = {i} for all k, and hence in this case all Wk% = Xi. The most typical example of the above scheme is mixing on graphs, that is, when the parameter indexing the summands, which is usually thought
66
Vladimir Rotor
of as a "time" or "space" parameter, has values which may be identified with vertices of a graph. If the graph is a usual integer valued lattice in Zk with edges connecting only nearest vertices, we deal with the usual mixing scheme for random fields, and for k = 1 - with a process on a line. If the graph is arbitrary, the scheme is more complicated. In the case of a graph-related dependence, the neighborhoods Afki, for k = l,...,r + 1, may be the usual "geographical" neighborhoods of the vertex i with respect to the graph, provided that the dependence of the variables Xi on {Xj\ j ^ Nki\ is weak, and the larger k, the weaker this dependence. • It is worthwhile emphasizing, however, that the construction used in Example 3.3 is a particular case, and the decompositions introduced are arbitrary. In particular, Wk may be the sum of a random number of summands, this number may depend on the values of X's, etc. For example, graphs mentioned in the last example can be random, and their structure may depend on the values of summands. Some non-trivial examples may be found in Section 2 of RR. As a matter of fact, the decompositions (3.2) play the role of a free parameter in the representations below. For related decompositions see also Stein (1986), Barbour, Karonski & Ruchiski (1989), Rinott & Rotar (1996), RR, and references therein. We set WOi = 0, and Usi = Wsl - W S _ M for s > 1. Then it follows that Wki = Uu + ... + Uki, and Wki = Uk+\,i + Wfc+i,» • So, in the situation of Example 3.3, Uu = Y,jeMu Xh a n d u*i = X^&vvw.-i.i XJ f o r s > 2In the independence case, Uu = Xi, and USi = 0 for s > 2. Let Tu = cr{Xi}, and let Tsi = a{Xi:
Uu, Uii, •••, USi} be the cr-algebra
generated by the random variables {Xi, Uu,U2i,---, Usi}; then define T'si = ^{Us+i^i,..., Ur+iti, Wr+lii}, s = 0,1,..., r, and set fr+i,i = cr{Wr+lti}. The term local dependency is used if jF"si and Fs+i,i are independent for all s = 1, ...,r and i — 1,... ,n. Example 3.4: Consider the scheme of Example 3.3, and assume that there exist sequences {A/"fci}^j such that a-algebras TSi and Ts+i,i are independent for all s and i, and for a number m max
max
{#(7Vli)) # ( ^ ^ . - 1 , 0 } < m,
where #(•) stands for cardinality. Such a case is naturally to be called a generalized m-dependence. •
67
Stein's method for Edgeworth's expansions
Next, we define a characteristic of dependence. First we will deal with the measure of dependence cj>(A, B) := sup {\P(B\A) - P(B)\;
A e A, B € B, P{A) > 0},
(3.3)
where A and B are two
(3.4)
s,i
In the local dependency case, T = 0. In general, T is a counterpart of a well known quantity in the mixing framework.
3.3. A representation in terms of the measure
1
= E{(\Xi\
+ \Uli\ + --- + \U{r+1)i\) },
p,i=Y^Vu
k
and fjk = J ^ / i , .
»=i
i=i
(3.5)
Example 3.5: Consider the situation of Example 3.3, assuming that the neighborhoods Nki are fixed non-random sets of indices with the property that i £ A/ii C A/iji... for each i. Assume that for an M > 1 #(A/"r+i,t) < M
for all i.
(3.6)
As in Example 3.2, suppose also that W has been already normalized in some way, and after the normalization for all i — 1,..., n \Xi\
in distribution
(3.7)
for a r.v. Y with a sufficient number of finite moments. Then, as is easy to see, 2ln-l/2MlE{\Y\1}.
fiH < Consequently, III < 2n1/2ME{\Y\} s+2
jis+2 < 2
s 2
n- / M
s+2
= O (My/n),
E{\Y\s+2}
= O(Ms+2n~s/2)
(3.8)
68
Vladimir Rotar
for all s > 0. As to fjk, it involves the first absolute moment pi, so this quantity is "large"; more specifically, fjk = O(MkJn).
(3.9)
In the remainder in the representation below it will be multiplied by T which is supposed to be sufficiently small. • Next we consider a modification of cumulants. As was shown in RR, when W is a sum of dependent summands, the cumulants 75 in (1.1) may be replaced by slightly different characteristics in a way that would lead to a simpler representation and a better remainder. Moreover, these characteristics may work when the traditional setup does not. To illustrate why it can happen, consider the simple Example 3.6: (RR). Let Zi,Zi, ...be independent r.v.'s with the same distribution as a r.v. Z for which EZ = 0, EZ2 = 1, and for a fixed n the vector ,
(iln,...,
inn)
. _ ( (Zi,..., Zn) |
^
^z
)
with probability 1 - 1/ra, w
.
t h p r o b a b m t y
1
^
n
Let Sn = Yin + ••• + Ynn. Obviously, ESn = 0, Var{Sn} = 2n - 1, and the normalization by ^Var{Sn} is not proper since {Sn/^/n) converges in distribution to a standard normal r.v. On the other hand, if we set Xi = Xin = Yin/y/n, then W will be asymptotically normal. Hence, instead of standard deviation we should use another scale characteristic. • We will return to this example once we define the characteristics under discussion. If Xjhad been independent of Wu for each i, since E{Xi} = 0, the variance of W would have been equal to Y17=i E{XiUu}. We denote this quantity by 72, and adopt it as the modification of the variance in the general case. Note that 72 depends on the choice of the decompositions (3.2). In the case of the local dependency 72 coincides with Var{VF}. Thus, when proving a limit theorem or establishing asymptotic expansions in it, we will normalize W in a way that leads not to the unit variance but to 72 = 1. To define the modification of the third cumulant we should again consider the case of local dependency and write the expression for the third cumulant of W in terms of mixed moments of the third order for r.v. Xi,Uu, and U-n. This expression gives the definition of the characteristic under discussion - denote it by 73 - in the general case, and so on for
Stein's method for Edgeworth's expansions
69
all 7fe, k = l,...,r + 1, which coincide with the corresponding cumulants in the local dependency case. We relegate the precise, though somewhat cumbersome, formula for 7fc to Section 5.1 (formula (5.4)), and now just note that |7fc| < C(k)fik, where the constant C(k) depends only on k; see Section 5.1 for detail. In the situation of Example 3.5, this implies that 72 = O(l), 77+2 — O(n~1/2), for I > 0. Example 3.7: We return to Example 3.6, and let W^ = Wki = Xi in (3.2) for all k = 1, ...,r + 1; i = 1, ...,n. Then Uu — Xi, and hence 72 = n~x Yl2=i -^"{^in} ~ 1' that ^si gives a correct normalization. It is easy to verify that in this case T = 1/n, and as was shown in RR the representation (1.1) with the replacement of 7^ by % implies eventually that Eh(W) *(/i) = O(l/y/n), that is, the right rate in the CLT. • 3.3.2. A type (1.1) representation in terms of (p For a sufficiently smooth / set /^maxH/Wlloo, r f ( / ) = max max sup ess sup \E{f{l){Wsi + x) \ ^ } ( w ) | , (3.10) l
x
u
provided that the above quantities are well denned. Clearly, r , ( / ) < /,.
(3.11)
Henceforth symbol C ( ) denotes a constant, perhaps different in different formulas, depending only on the argument in (•). Proposition 3.8: (RR) For any sufficiently smooth f r
_
E{Wf(W)} = ] T ^2+lE{f(m\W)} TfX.
+ Br(f),
(3.12)
where \Br(f)\ < C(r)[Ar + 2 r r + 1 (/) + fjr+lfrT\.
(3.13)
Corollary 3.9: In view of (3.11), \Br(f)\ < C(r)/ r + 1 [/2 r + 2 + fjr+1T}.
(3.14)
70
Vladimir Rotar
Example 3.10: Consider the scheme of Example 3.5. In view of (3.8) and (3.9) the bound (3.14) implies that for sufficiently smooth / Br(f) = O{Mr+2n~r'2
+ Mr+1yfHT).
(3.15)
Certainly in a typical situation T involves a characteristic of the size of dependency neighborhoods, say, M, and the choice of dependency neighborhoods (which play a role of a free parameter) is a trade-off between the first and the second terms in (3.15). Assume, for example, that for each M we can choose a system of neighborhoods such that T = T(M) < CM~k, where C, k are positive constants. Then, as is straightforward to calculate, the optimal value for M will be M = C(/c,r)n (1+r)/2(1+fc) , and in this case
(
r ,.
(l + r-)(2+r) \
Thus, for sufficiently large k the order of Br is close to O(n~r/2) while the order of the last term in the main term of (1.1) is (^(n"^" 1 ^ 2 ). It is worthwhile to note that T(M) considered above vanishes as a power function, which is less restrictive than the exponential rate appearing in most of the literature. On the other hand, if we can choose the system of neighborhoods for which T(M) is an exponential function, then the optimal M will be of the order of lnp M for some p. This leads to • Br(f) = O ((lnn)p n~5 j for some p' which can be easily calculated. Example 3.11: Consider now the case of a generalized m-dependency (see Example 3.4). Such a dependence structure may be still rather complicated (for example, if it is connected with a complicated graph), but the r.h.s. of (3.14) in this case is simple. Indeed, since we deal with m-dependency, T-0, and M < m(r + 1). So, in the case (3.7) in view of (3.8)
\Br(f)\ < C{T)fr+ln-r'2mr+2E{\Yr2}.
m
Next, we observe that in the local dependency case, that is, when T = 0, conditions on smoothness of / may be weakened if we assume the distributions of H^'s to be sufficiently smooth. More specifically, denote by qSi(-,uj) the conditional density of WSti with respect to Tai, provided that it exists for a.e. ui in the probability space. We omit obvious issues of non-uniqueness and regularity. Assume that for some m, 1 < m < r + 1 , all densities qSi(x, w) are m times differentiable in x, and suppose ipm := max
max
max ess s u p | | ^ (-,w)||i < oo.
71
Stein's method for Edgeworth's expansions
Set ^o = l- Recall also that if for a differentiable function p(x) its derivative is Lebesgue integrable, then p(x) is absolutely continuous and has limits at ±00 (see, e.g., Bruckner, Bruckner k Thomson (1997)). If p is integrable, then p(±oo) = 0. So, in E{f^(Wsi+x) Tsi}{uj) from (3.10) we can integrate by parts getting that \E{fW(Wsi+x)\Fsi}(uJ)\
= J fV-minMHy
+ x)q(™n{m>l)\y,u)dy
< /r+l-mVW Hence, in this case (cf. (3.11)) IV+lU) < /r+l-mVW The same integration by parts may be carried out in the terms E{f^m\W)} in (3.12) too. Formally it does not mean yet that we have reduced the necessary order of the finite derivatives of / , since when integrating by parts we still used the boundedness of all derivatives, but to make the reduction under discussion rigorous it suffices to use standard smoothing argument: to smooth / , to write the expansion (3.12) for the smoothed / , and then to come back to the original / . In the last step it suffices to use, for example, the smoothing Lemma 11.4 from Bhattacharya & Ranga Rao (1987), requiring - to meet conditions of this lemma - that sup f (fe+{x + z)- / e _(x + z)) (1 + |z| 3r+1 )
(3-17)
In RR the above circumstance was used for obtaining asymptotic expansions for non-smooth functions h. 3.4. A representation in terms of the measure a Now we turn to a new result consisting in switching from the dependency measure ^ to a strong mixing measure a(A, B) := sup {\P{AB) - P{A)P{B)\;
A € A, B e B} .
(3.18)
72
Vladimir Rotor
Clearly, making use of a-mixing characteristics can lead to an essentially better remainder in (1.1). We skip here particular examples when a(A, B) is essentially smaller than
(3-19)
s,i
Let now for a fixed r r+1
n
) Jh = Pl(r) = Yi(Jr%? ,
and
K(r) = £ ^ ( ^ 1 "' / ( r + 2 ) . (3.20)
i=i
1=1
Proposition 3.12: For any sufficiently smooth f,
E{Wf{W)} = j ^ ^E{f^(W)}
+ Br(f),
(3.21)
ro=l
where
\Br{f)\ < C(r)[flr+2rr+l(f) + frK{r)\.
Example 3.13: To understand the order of K(r) consider again the case = OiMn-1'2), and hence of Example 3.5 when n\'^2) vx = 0{My/n),
u2 = O(M2),
and
9l = O (
(^)/2)
for I > 2.
73
Stein's method for Edgeworth's expansions
Consequently, /
fc'(r-l)/(r+2)
K(r) = O M ^ < r + 1 > / ( r + 2 ) + M2Kr^r+V +M 4
K(r-2)/(r+2)
+ M3-
=
Rl/(r+2)
+ • • • + Mr+1
,
\
) ,
and if M is uniformly bounded in n, /
R(r) = o (vfar
(p+1)/(r+2 +
K"(r-l)/(r+2)
r+2
> + ir/( > + - — = —
K(r-2)/(r+2)
— n —
+
+
^l/(r+2) \
--- ^m)-
(3-22)
Thus, we have switched to the measure a at a cost that K (which is a counterpart of T from (3.4)), has been raised to a power less than one. On the other hand K
4. Type (1.1) representations and asymptotic expansions Obtaining asymptotic expansions is not an aim of this paper. So, we give here just a general scheme how to get expansions having a representation of the type (1.1). As was noticed already, making use of the representation mentioned is based on induction. Let W = Xi + ... + Xn, and E{X{\ = 0 for all i. As to normalization it can be carried out in two ways: we can assume either E{W2} = 1, and in this case we will deal with usual cumulants of W; or we can suppose that W has been normalized in a way that 72 = 1, and in this case we work with characteristics 7^. Let Qo(dx) = $(dx), and {Qv{dx),i> > 1}, are known signed measures involved in asymptotic expansions, namely, Qv(dx) = £ where Lm(dx) =
M
p^Lv+zsidx),
(4.1)
(-l)mip^(x)dx, V
I
/
O
m=ikj\^T^.)
\
*!m
'
(4 2)
-
74
Vladimir Rotar
s = s(k) =fci+ ... +fcy,the summation in ]C(K) *S o v e r a u vectors of nonnegative integers k = (&!,...,/;„)such that k\ + 2k2 + ••• + vkv = v (see, e.g., Bhattacharya & Ranga Rao (1986), Petrov (1995)) Coefficients /?; in (4.2) are either cumulants of W, or modified characteristics 7J defined in Section 3.3.1 and in Section 5.1 below, depending on what characteristics we decided to use. Let again r denote an integer. Assume that for a function h we have proved already that for 1 < t < r — 1 t-i
Eh(W) = J2
(4.3)
h(x)Qv(dx) + Rt(h),
v=0J
where (4.4)
\Rt(h)\ < Dt(h),
for an appropriate bound Dt(h). Our goal is to prove (4.3)-(4.4) for t = r. Suppose that we have at our disposal the representation
E{Wf(W)} = J2 £=±l£{/M(W)} + Br(f),
(4.5)
m=l
where / is a sufficiently smooth function, Br(f) is a remainder, and j3m is again either the m-th cumulant of W or the modified characteristic 7 m . Let now / = S(h), the Stein function. Then, since /?i = 0, and /52 = 1 for both choices of /?'s,
Eh(W) - *(/*) = - ]T ^±E{f^(W)} 771.
- Br(f).
(4.6)
By (4.6) and the induction hypothesis, applied to the function / ( m ' and taking t — 1 = r — m, Eh{W) - *(A) r a fr-m .
im
M
\
= - E ^ T E / f Hx)QAdx) + Rr-m+l(f ) 771 = 2
r
'
\l/=0 J
- Br(f) )
.
r-m
= -E^rEE ( /-// ( m ) ww^) R
m-2
•
i^=0
J
-E^T^- + i(/ ( m ) )-5 r (/). m=1
(4.7)
75
Stein's method for Edgeworth 's expansions
Thus, regarding the remainder, one should prove that ]
J2
-^Dr-m+1(fW) + \Br(f)\ < Dr{h).
m=2
This is possible only if one manages to represent Dt(-) in a special form, in terms of the function h itself. including a possibility to estimate Dt(f^) In the case of dependence measure 4>, and for dependency-neighborhoods structures, it has been done in RR, which required long calculations. Next consider the main term in (4.7), that is, r
„
r-m
£%r££(W
m=2
-
/(m)(x)L
v=0
^ («**)•
Using the formula (see, e.g., Barbour (1986)) / f<m\x)Lv(dx) = J
- i — / h(x)Lv+m+1(dx), m +v+1 J
(4.8)
it is straightforward to verify that r
„
r-m
.
" £ ^ T £ E M « - f(m)(x)L»+2s(dx) m=2
(4.9)
i/=0
- E _E d t i ^ r E M "-TT^ny / "(l)L'«<-«»(lfa^ So, the induction will be completed, if we prove that the latter expression coincides with
Y, fh(x)Q,(dx) = Y, I'Kx)YJ{l)P^iLi+2s{dx). l=i
J
i=i
J
To make sure that the last claim is true, it suffices to show (setting m = i + 1 in (4.9)) that for any I = 1 , . . . , r - 1:
Il=i (^T)! £(,) P k " TT^TT)L;+2(s+1)
=
£(o Pk/L|+25'
where in t h e left-hand side sum 1 < i < I a n d 0
(410)
Identity (4.10)
76
Vladimir Rotor
5. A proof of Proposition 3.12 5.1. Some definitions First, we introduce some notation from RR. Consider a summand X = Xi. For while, if it does not cause a misunderstanding, we will suppress the index i in Xi itself and in the corresponding decomposition terms Wki, Usi, and moments Hki ( se e Sections 3.2-3.3.1 for detail). Note first that we have Wk = U\ + ... + Uk, and W = Wk + Wk. Define VS=(WS, Us+1,..., Ur, Ur+i), U=V 0 =(E/i,..., Ur, Ur+1). For s + I < r + 1, p < I we define
»^)
and *«+i(p) = EM.lii^Um}.
=E | ^ T O
where X^fm|=i ' s *^e s u m o v e r a ^ v e c t o r s m = ( m ii • • • ,mr+i), such that m, € {l,...,r + 1} for all i < p, and m* = 0 for p < i < r + 1, and such that |m| = mi + • • • + mr+1 = 1; we also define U m = nl=i ^i"') u£--+' and m! = mx\ • ... • mr+1\. It is easy to see V m = wjniUm*i... that
\xi(s,P)\ < E[f l=I ^[^{1^1} ^ ^{(l^.l + \u.+i\ + - + l^+i-il)'} < ^{(ic/!i +... + i^+j-ii)1} < yum
+... + \ur\)1}
<E{(\X\ + \U1\ + ... + \Ur+1\)l} = LH. Likewise |0|(*)l<W-
(5.1)
Set Too = li T;o = 0 for / > 1, and for m > 1
?lm = Tlm(3) =
£
(5-2)
E - E
Jl+--.+Jm=iPl=l
Pm = l
Xh (s,Pl)Xh(s + pi,p 2 ) • • • XJm(s + Pi + ... + p m -l,p m ). For I < r set [1/2]
[1/2]
c(s,I) = E (-l) Tlm(s), c(s,0 = E IT'-(S)Im
m=0
( c ( s , 0 ) = 1, c ( s , l ) = 1,
jn=O
s>0).
(5-3)
77
Stein's method for Edgeworth's expansions
Recalling that the above quantities depend on the suppressed index i, define i V~~* V ^ o
Tm+l,t
E
—^T=
i \ /
, •. i\
2^^+i,i(s)ci(s + l,t),
t+l-m,l>l s=l n
Im+i = ^lm+\,h
m = l,...,r.
(5.4)
The last formula gives a formal definition of the quantities 7 m introduced in Section 3.3.1.
5.2. Proof of Proposition 3.12 Below we use the following inequality from Davydov (1970), see also, for example, Bradley (2002): \E{XY}-E{X}E{Y}\<%[a{A,B))1~^\\X\\P\\Y\\q, where p,q 6 (l,oo], and \ + \ < 1; A,B are cr-algebras; X € Y e Lq(B); and as in Section 3.4, a(A,B):=svp{\P(AB)-P(A)P(B)\\ A £ A,
(5.5) LP(A),
BeB}.
In particular, for q = oo
< 8[a(A,B)]1~'\\X\\P\\Y\\oo.
\E{XY} - E{X}E{Y}\
(5.6)
We turn to the direct proof of Proposition 3.12. For while we again omit the index i in Xi, W
(s)
f(W) = f{Wi) + E E E
^jU m /(l m l)(W p+1 ) + Rr,
(5.7)
i=l s=l | m | = j
where r+l
Rr^Yl
1
(p)
Yl
m
P=l |m|=r+l
~^mh--t){m^l)f{k+l]{wv+tup)dt, '
{
which follows by repeated Taylor expansions. Hence, r
I
E{Xf{W)} = Y,Y.
E
{s)
-^E{XXJm}E{fil)(Ws+1)}+E{XRr}+Tr,
1=0 s=l | m | = I (5.8)
78
Vladimir Rotar
where Tr = J2T,
E
W
-^T [E{X\Jmf^(Ws+1)} - E{XUm}E{f^(Ws+1)}] . m
1=0 s=l \m\=l
'
It is straightforward to calculate that \E{XRr}\ < M r + 2 r r + 1 ( / ) .
(5.9)
nfe = nw(r) = X ) ^ a ) i f 1 " l / ( r + a ) ,
(s.io)
Let now
where K is defined in (3.4). Using Davydov's inequality (5.6), we have 8^r2+2)f0K^/(r+V \Tr\ < + 8
E E E <*>^/#21)/(r+2)/itfi-
1=1 s=l \m\=l
(5.11)
Next we need the following lemma whose proof is relegated to Section 5.3. Lemma 5.1: For any k
E{f(Ws)} = Y,<*,P)Hf{
P
HW)} + Msk(f)
p=0 k
= E{f(W)} + '£/c{s,l)E{f(' l\W)} + Msk(f), (5.12) 1=2
where c(s,p) is defined in (5.3), and \Msk(f)\ < c(k)ink+irk+1(f)
+ /fcnfc].
(Note that we again suppressed index i in Xit Witt,, Usi, Ci(s,p), etc.) Next we apply Lemma 5.1 to the term E{f^l\Ws+i)} in (5.8) with replacing in (5.12) / by / ^ , s by s + 1, and k by r — I. Thus we write r-l
E{fW(Wt+1)}
= Y,c(s + l,t)E{f«+l\W)}
+ Ma+lir-l(fW)
(5.13)
r-l
= E{fW(W)} + ^2c(s + l,t)E{f(t+l\W)} t=2
+ Ms+1,P_«(/<'>),
79
Stein's method for Edgeworth's expansions
where
IM.+i.r-K/*0)! < c(r)[ M r . I + 1 r r _ i + 1 (/«) + / r n r -!]
= c(r)K_ i + 1 r r + 1 (/) + / r n r _ i ],
since by definition r f c ( / O ) = r f c + i ( / ) . It remains to insert (5.13) into (5.8). The main term after this operation is treated exactly as in RR (p. 549), so we turn to the remainder which we denote by Br. (Note that it still depends on i). In the next calculations we use (5.9), (5.11), (5.1), the inequality WMfc < W+fe>
(5-14)
and that K < 1. We keep also the notation j m for j m i in (5.4), and again allow C(r) to vary between equations. Thus, we have
|Br| := \E{Xf(W)} - j ^ ?2±LE{flm\W)}\ m=0
<
Mr+2 r r+1 (/) + c(r)/ r n r + 1
1=0 s=l \m\=l r
I
< c(r){/i r+2 r r+1 (/) + / r n r + 1 + £ ] T > + 1 ( S ) I • |M s+1 , r _,(/ (i) )|} 1=0 s=l r
< c(r){/i r+2 r r+1 (/) + / r n r + 1 + ]T w+1 [^ P _, +1 r r+ i(/) + / r n r _ ; ]} < c(r){M r+2 r r+1 (/) + / r n r + 1 + f Vu(l+1)/(r+2Ti//(r+%1->/(r+2)l (=1
< C(r){iir+2Tr+1(f)
»=1
+ frUr+1
+ /r E E ^ 2 + 1 ) / ( r + 2 ) A' 1 - ( ' + ' +1)/(r+2) } i=l s=l r
r+1
< C(r)|/x r+2 l r+ i(/) + / r ll r + 1 + / r 2 ^ 2^ ^+2 J=l p=l+2
< c(r) {/i r+2 r r+1 (/) + / r n r + 1 } .
K
/
80
Vladimir Rotar
This holds for X = X,. We recall now that the characteristics above depends on i, so /i; = /x/^, and 11*. = Ilfe^. It remains to sum over i, using that n
E^
n fci
»=1
n
=%, ^2/J.ki= pk, ^ n r + M = K(r), i=l
i=\
where K(r) is denned in (3.20).
•
5.3. Proof of Lemma 5.1 Before proving the first equality in (5.12), note that the second follows simply from the facts that c(s, 0) = 1, c(s, 1) = 0. In RR, with use of expansion (5.7), it was shown that for any k < r+1 — s E{f(W)} k
i
E
= E{f(Ws)} + Y/Y,
ip)
^E{VT}E{f{l)(Ws+P)}
+ Rks+Tks
J=2p=l|m|=l k I
= E{f(Ws)} + ^2J2 Xl(s,p)E{f( 1HWS+P)} + Rks(f) + Tks(f),
(5.15)
1=2 p = l
where
{
E E (p) S v ? 1
fc+i
P=l |m|=fc+l
j{1^t)(mP-l)f(k+l){Ws+p_i+tUs+p_i)dA 0
)
< (ik+iTk+iif), and k T
l
"s(f) = E E E
( P )
1=1 p=l |m|=(
^ [£{VS"7( l)(Ws+P)} - E{V™}E{f( l\Ws+p)}] .
Again making use of (5.6) we have
81
Stein's method for Edgeworth 's expansions
(p)
^\\V?\\(r+2yJiKl-l'(r+2)
|Tfc3(/)| < 8 ] T ] r £ Z=lp=l|m|=i
<8/*EEE
ip)
^l%+2)K^^
1=1 p=l |m|=i
where IT^ is defined in (5.10).
Denoting Hks(f) = Rks(f) + Tks(f) and HOs = 0, we have
\Hks(f)\ < c(k)(fik+1rk+1(f)
+ An,).
Furthermore
E{f(Ws)} = E{f(W)} - Hks(f) - YsJlxiis^Eif
l
\Ws+p)}
1=2 p=l
k
I
= E{f(W)} - Hk.(f) - E E * ( S - P ) [ £ { / ( I ) W } - ^-J, s+ p(/ (0 )] 1=2 p = l *:
(
k-h
h
+ E E Xi^s.Pi) E E X(2(s+Pi,P2)^{/('1+'2)(^+Pl+p2)}! (5-16) i!=2pi = l
(2=2P2 = 1
where the last expression was obtained by applying the first relation to /W. Observe that \Hk-l
= |flfc_«iS+p(/W) + rfc_«,8+p(/W)| < c(*){Mfc-j+irfc+1(/) + / fc n fc _ ; }.
(5.17)
The main part in (5.16) is treated exactly as in RR by repeating the above scheme in (5.16). Consider the remainder in the above expression for E{f(Ws)}, which we denoted by Msk(f). Again repeating the same scheme, and using the bound from (5.17), it is straightforward to verify (see also RR where similar calculations were carried out in the case of measure (f>) that [fc/2]
k
\Msk(f)\
< c(k) Y, c(s, ot/ifc+i-iiW/) + An*.,], /=0
82
Vladimir Rotor
where Tj m (s) and c(s,l) were defined in (5.2)-(5.3). Note that using (5.14) it is easy to verify that c(s,l)
(5.18)
Making use of (5.18), again (5.14) and the fact that K < 1, we have
\Msk(f)\
(=0
/
(=0
k
k-l
< C(A;) I l f c + i ( / ) ^ f c + i +/fc 2 ^ 2 ^ ^ + 2 V (=0 8=1
/
\
K
) /
h^^r+2)K^^+A
< c(k) (rk+1(f)»k+1 + \
s=l
1=0 s=l
/
i=o P=i+t
J
•
Acknowledgements: I would like to offer my warm thanks to the organizers of the excellent Program in honor of C. Stein, Singapore, August, 2003. I am grateful to R. Bradley for a very useful discussion on mixing characteristics. References 1. A. D. BARBOUR (1986) Asymptotic expansions based on smooth functions in the central limit theorem. Probab. Theory Rel. Fields 72, 289-303. 2. A. D. BARBOUR (1997) Stein's method. In: Encyclopedia of Statistical Science, 2nd ed. Wiley, NY. 3. A. D. BARBOUR, L. HOLST & S. JANSON (1992) Poisson approximation. Oxford Studies in Probability 2, Clarendon, Oxford. 4. A. D. BARBOUR, M. KARONSKI & A. RUCINSKI (1989) A central limit theorem for decomposable random variables with applications to random graphs. J. Combin. Theory, Ser. B 47, 125-145. 5. R. N. BHATTACHARYA & R. RANGA RAO (1986) Normal approximation and asymptotic expansion. Krieger, Melbourne, Fl. 6. R. C. BRADLEY (1985) Basic properties of strong mixing conditions. Dependence in probability and statistics. In: Progr. Probab. Statist. 11, (Oberwolfach, 1985), 165-192. Birkhauser, Boston, Mass. 7. R. C. BRADLEY (1998) On the simultaneous behavior of the dependence coefficients associated with three strong mixing conditions. Rocky Mountain J. Math. 28, 393-415.
Stein's method for Edgeworth's expansions
83
8. R. C. BRADLEY (2002) Introduction to Strong Mixing Conditions, Volume 1. Technical Report, Department of Mathematics, Custom Publishing of IU, Indiana University, Bloomington. 9. A. M. BRUCKNER, J. B. BRUCKNER &; B. S. THOMSON (1997) Real Anal-
ysis. Prentice Hall, NJ. 10. L. H. Y. CHEN (1998) Stein's method: some perspectives with applications. In: Probability towards 2000, Lecture Notes in Statistics 128, 97-122. Springer, New York. 11. Yu. A. DAVYDOV (1970) The invariance principle for stationary processes. Theor. Probab. Appl. 15, 487-498. 12. Yu. A. DAVYDOV (1973) Mixing conditions for Markov chains. Theor. Probab. Appl. 18, 312-328. 13. F. GOTZE & A. N. TIKHOMIROV (1999) Asymptotic distribution of quadratic forms. Ann. Probab. 27, 1072-1098. 14. F. GOTZE & A. N. TIKHOMIROV (2002) Asymptotic distribution of quadratic forms and applications. J. Theoret. Probab. 15, 423-475. 15. I. A. IBRAGIMOV (1961) Spectral functions of certain classes of stationary Gaussian processes. Soviet Math. Dokl. 2, 403-405. 16. I. A. IBRAGIMOV & Yu. A. ROSANOV (1978) Gaussian Random Processes, Springer-Verlag, New York. 17. H. KESTEN & G. L. O'BRIEN (1976) Examples of mixing sequences. Duke Math. J. 43, 405-415. 18. A. N. KOLMOGOROV & Yu. A. ROZANOV (1960) On strong mixing conditions for stationary Gaussian processes. Theor. Probab. Appl. 5, 204-208. 19. V. V. PETROV (1995) Limit Theorems of Probability Theory. Sequences of independent random variables. Clarendon Press, Oxford. 20. M. RAIC (2003) Normal approximation by Stein's method. In: Proceedings of the seventh young statisticians meeting, Metodolozki zvezki 21, Lublana FDV. 21. G. REINERT (1998) Couplings for normal approximations with Stein's method. In: Microsurveys in Discrete Probability, eds: D. J. Aldous & J. Propp. pp. 193-207. Dimacs series, AMS. 22. G. REINERT (1999) An introduction to Stein's method and application to empirical measures. In: Modelos Estocasticos, eds: J. M. Gonzalez Barrios & L. G. Gorositza, Sociedad Matematica Mexicana, pp. 65-120. Proceedings of the Symposium on Probability and Stochastic Processes, Guanajuato, Mexico. 23. M. ROSENBLATT (1971) Markov Processes, Structure and Asymptotic Behavior. Springer-Verlag, Berlin. 24. Y. RINOTT & V. I. ROTAR (1996) A multivariate CLT for local dependence with n ~ 1 ' 2 l o g n rate, and applications to multivariate graph related statistics. J. Multiv. Anal. 56,, 333-350. 25. Y. RlNOTT & V. I. ROTAR (2000) Normal Approximations by Stein's method. Decisions in Economics and Finance, 23, 15-29.
84
Vladimir Rotar
26. Y. RlNOTT k. V. I. ROTAR (2003) On Edgeworth Expansions for Dependency- Neighborhoods Chain Structures and Stein's Method. Probab. Theory Rel. Fields 126, 528-570. 27. C. STEIN (1972) A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. Proc. Sixth Berkeley Symp. Math. Statist. Prob. 2, pp. 583-602. Univ. California Press, Berkeley. 28. C. STEIN (1986) Approximate Computation of Expectations. IMS, Hayward, Calif. 29. A. N. TlKHOMIROV (1980) Convergence rate in the central limit theorem for weakly dependent random variables. Theor. Probab. Appl. 25, 800-818.
Stein's method for compound Poisson approximation via immigration—death processes
Aihua Xia Department of Mathematics and Statistics University of Melbourne, VIC 3010, Australia E-mail: [email protected]. In dependent systems, rare events have a tendency to appear in clusters which make Poisson distribution a less favourable model as approximation errors are too large to use. For such situations, a compound Poisson distribution seems to be a more suitable choice. Stein's method for the compound Poisson approximation was introduced in Barbour, Chen & Loh (1992) but the approach yields relatively useful Stein factors only when the approximating compound Poisson distribution Z = Yl'jLi JNj satisfies jXj I 0 as j —> oo, where Nj ~ Po(Aj), j > 1 are independent, because under this condition, a Markov immigration-death process with multiple births and unit per capita death can be brought in to estimate the Stein factors. In this paper, we present a framework of Stein's method for the compound Poisson approximation using immigrationdeath processes with multiple births and multiple deaths, as well as some preliminary studies on this approach.
1. Introduction Since the pioneering work of Stein (1971) for normal approximation to the sum of dependent random variables, Stein's method has been extended to many other settings such as Poisson approximation [Chen (1975)], Poisson process approximation [Barbour (1988), Barbour & Brown (1992)], compound Poisson approximation [Barbour, Chen & Loh (1992)], binomial approximation [Ehm (1991)], negative binomial approximation [Brown & Phillips (1999)], compound Poisson process approximation [Barbour k. Mansson (2002)], discrete Gibbs measures [Eichelsbacher & Reinert (2003)] and polynomial birth-death distribution approximations [Brown k. Xia (2001)], etc. An integer-valued compound Poisson random variable can be written 85
86
Aihua Xia
as Z — Yl'jLiJNj, where Nj ~ Po(Aj), j > 1 are independent. We use CP(A) to denote the distribution of Z, with A = (Ai, A2, • • • )• We assume EZ = X ^ i i^i < 00 in the paper. Barbour, Chen & Loh (1992) established the following Stein equation for the compound Poisson approximation: 00
J2^l9(n + l)-ng(n) = f(n)-CP(X)(f)
(1.1)
1=1
for suitable class of test functions / on Z + := {0,1,2, • • • } , where CP(A)(/) := E / ( Z ) . It is shown in Barbour, Chen & Loh (1992) that for every bounded function / we can find a solution g = g/ of (1.1) that is also bounded and this g is then used to investigate the errors of the compound Poisson approximation as follows. Let W be any random variable on Z + . Suppose that it can be shown that
lElf2lXig(W + l)-Wg(W)\<eo\\g\\+s1\\Ag\\
(1.2)
for all bounded g : Z+ —> R, where, for any function h, we define \\h\\ := sup t u e Z + \h(w)\ and Ah(w) :— h(w + l) - h(w). Then it follows that for any class of bounded test functions Q, sup \TEf(W) - CP(A)(/)| < £0 sup || 5 / || + El sup \\Agf\\. feg /es /eg The class of test functions TTV '•= {/ = 1A '• A c Z + } defines the total variation metric for probability measures Qi and Q2 on Z + : drv(Qi,Q2)=
sup
fe^rv
/ / d Q i - [ fdQ2
J
J
= sup |Qi(^4) - Q 2 (4)|, ACZ+
and the corresponding Stein factors for the solution to the Stein equation (1.1) are ||A5/||<(lA-i)e££i\
|| S/ ||<(iA-l) e ££i\
(1.3)
[Barbour, Chen & Loh (1992), Proposition 1 and Theorem 4]. These estimates are obtained by an analytical method and are in general of the right order [Barbour, Chen & Loh (1992), page 1852]. Hence, it is natural to ask for better estimates in various special circumstances. One of such is to view the approximating compound Poisson as the equilibrium distribution of a certain Markov immigration-death process so that the coupling technique
Stein's method for compound Poisson approximation
87
can be employed to estimate the Stein factors. To do so, we need to assume j\j I 0 as j t oo so that Vi := i \ — (i + l)Aj+i > 0, for i > 1. We can then rephrase the Stein equation (1.1) in terms of a function h such that g(w + 1) = Ah(w), in the form oo
Y^[h(n + t) - h(n)\Vi + n[h(n - 1) - h(n)] = f(n) - CP(A)(/),
(1.4)
i=l
where the left hand side is the generator of an immigration-death process Z with unit per capita death rate and with immigration in batches at intensity Ai, a batch of size j coming with probability ^-. It is a routine exercise to check that the immigration-death process Z has the equilibrium distribution CP(A), and the Stein equation (1.4) has solution hf given by r°° hf(n) = - / \Ef(Zn(t)) - CP(A)(/)]cft, (1.5) where Zn is a Z-process with Zn(0) = n. Note that Zn, Zn+\ and Zn+2 can be realized on the same probability space by taking T\ and T^ to be two independent standard exponential random variables which are also independent of Zn, and setting Zn+\{t) = Zn(t) + l{Tl>t}, Zn+2(t) = Zn(t) + l{T1>t) + !{r2>t}With these realizations, when two processes meet, they will stay together thereafter, a property used by Barbour, Chen & Loh (1992) [Theorem 5 and Proposition 2] to prove reasonable bounds for the test functions / in TTV'-
\\gj\\ = \\Ahf\\ < 1A -?=, ||A 5/ || = ||A2MI < 1A ^ (J^ + log + (2^)) , while Barbour & Xia (2000) obtained neat bounds for test functions of / = l[o,fc]i k G Z + , appropriate to Kolmogorov metric:
\\gf\\ = \\Ahf\\ < 1 A y X
||A5/|| = IIA^/H < \ A ^-i-y.
However, if the assumption j\j | 0 does not hold, there is no probabilistic interpretation. In this case, even for the Kolmogorov distance, much hard effort is needed to improve the Stein factors in (1.3) [see Barbour & Utev (1998)] with good bounds in restricted range only, let alone the total variation metric. Another way of assessing the compound Poisson approximation is through the Poisson process approximation and then use the results of
88
Aihua Xia
Barbour & Brown (1992), Brown & Xia (2001) and Chen & Xia (2004) to obtain the estimates. This idea is also used in Arratia, Goldstein & Gordon (1990), Section 3 and Barbour, Hoist & Janson (1992), pages 238-241. In Section 2 of this paper, we consider the compound Poisson approximation to the distribution of the total number of points of a point process by first "grouping" the points into "clusters" and then laying the clusters on a new metric space to form a new point process. We then formulate a Stein equation based on the generator of immigration-death processes with births and deaths occurring in clusters. In Section 3, we use the immigration-death processes to assist in estimating the Stein factors for the compound Poisson approximation in terms of the total variation metric and present a general theorem for the compound Poisson approximation to the distribution of the number of points of a point process. When applied to the sums of independent non-negative integer valued random variables, which is an exercise of long history [see Le Cam (1965), Barbour k. Utev (1999) and references therein], the result gives the following estimate. Example 1.1: Let Xi, • • •, Xn be independent non-negative integer valued random variables with P(Xi = 0) = 1 - Pi, F{Xi = k)= Pidtk, k = 1,2, • • • ; i = 1, • • • , n. Let W = YH=i xii xk = Yn=iPiaik, then
dTV(C(W),CP(X)) < (l A (j- (ln+(2A1) + ^j))
EE(X, 2 )EPQ.
This prototypical example shows how promising the approach is. If we use the approach of Barbour, Chen & Loh (1992), we will need to assume i\i [ 0, and even if the condition holds, the bounds are generally not as good as here with the Stein factor a function of Ai — 2A2 rather than Ai. Likewise, Poisson signed measures based on Poisson perturbation in Barbour and Xia (1999) will not achieve better results either. On the other hand, if we use declumping according to the values of Xi's and tackle the exercise using a multivariate Poisson approximation in Barbour (1988) [see Barbour, Hoist & Janson (1992), page 240], then it seems impossible to get the Stein factor in terms of Ai. In Barbour & Utev (1999), bounds of better form are obtained, but it is at the cost of unspecified constants (which are often big) multiplied to the bounds. Noting that the ln + factor should be superfluous, we hope that future studies can replace it by a constant [see Remark 3.4].
Stein's method for compound Poisson approximation
89
In various special cases such as that observed in Michel (1988) [see also Le Cam (1965)] where for each k > 1, a\k — a^k = • • • = ank, the total variation distance should be not bigger than the total variation distance between the distribution of Y = £"=1 l[X;>i] and Po(E(y)), the general result here is too crude. Although, for this particular case, it is possible to use Stein's method proposed here to tackle the problem directly, we acknowledge that in order to apply the method to various types of dependent structures, much more studies are needed to perfect the approach. 2. Stein's equation for the compound Poisson approximation Suppose X is a fixed locally compact second countable Hausdorff topological space. If J is a finite or infinitely countable index space, one can use the discrete topology. To achieve the idea of "grouping", we define an artificial carrier space F = {ao, a\, • • • } with the discrete topology. Let Tij and Wr be the spaces of all finite point measures on X and F respectively. For each point measure £, we use |£| to stand for the total mass of £. We recall that E is a point process on I if it is a random element on a probability space with values in "Hx [see Kallenberg (1983)]. Our aim is to estimate the errors of the compound Poisson approximation to the distribution of |H|, denoted by£(|2|). We partition X into Borel sets {Ui, i S N} and for each £ £ TCi, we group the points of £ into clusters by denning an associated member oo
(2.1)
where 5X is the Dirac measure at x. Hence, for a point process H, the associated member S is a point process denned on F and |H| = |H|. The grouping into clusters depends on the choice of {Ui, i € N} and is reflected in the upper bound of the errors of the compound Poisson approximation. The idea of grouping and laying the grouped random variables in a new carrier space extends the declumping idea in Arratia, Goldstein & Gordon (1990, Section 3.1), which was later refined by Chen (1998, Section 5) and formulated into a concept of lifting in Chen & Xia (2004). Another way of viewing E, is through a marked point process by treating ct=(Ui) a s locations and E(Ui) as marks [Daley & Vere-Jones (2003), pages 194-210], then apply the theories in the Poisson process approximation [Barbour & Brown (1992), Brown & Xia (2001) and Chen & Xia (2004)] to the
90
Aihua. Xia
ground process Ng = X ^ i ^ s ( u . y Noting that all of these approaches will inevitably throw away lots of information and this paper aims to make use of some information neglected in these approaches. Define a generator on the bounded functions on Tir as oo
oo
Ah(0 = J2 J W £ + «««) - M0] + ! > ( £ - «««) - h(Oie{«i}- (2-2) (=1
1=1
Then A is the generator of an immigration-death process oo
Zt(t) = 52z®(t)8ai,t>0,
(2.3)
*=i
which is constructed as follows. For each fixed I, Z^ (•) is the number of particles at location ai and it is an immigration-death process with initial state Zi (0) = £{«;}, the immigration rate l\i for batches of I immigrants (coming to location a;), and the death rate I for every batch of I particle^ to die together, independently of the others. The processes Z^ , I > 1 are independent. It is possible to argue directly as in Ethier & Kurtz (1986) [Chapter 4 and Problem 15 of page 263] that Z^Kt) constructed above satisfies Z^(0) = £, is Markovian and is the unique solution to the generator A with stationary distribution CP(A) [cf Preston (1975)]. We omit the proof as it is a routine exercise. The immigration-death process used here is to be compared with the previous studies [e.g., Barbour, Chen & Loh (1992)] where the death can only occur with unit per capita rate. When F isfiniteand 5 is a simple point process, i.e., P(H{x} = 0 or 1, for all x) = 1, then the choice of Ux = {x} results in a unit per capita death rate. Thus, in essence, our interpretation is more flexible and likely to have more applicability. Moreover, there is always a probabilistic interpretation using Markov immigration-death process for such an approach without any restrictive conditions imposed, hence enabling us to get sharper bounds for the Stein factors [cf Barbour, Chen k Loh (1992)]. The reason for using an immigration-death process with deaths occurring independently for each cluster and births arriving independently of everything is that we can construct such a process effectively according to our needs, as shown in the proofs of Propositions 3.2 and 3.3. If we use a marked Poisson approximation to tackle the compound Poisson approximation by looking at a Poisson process approximation of the ground process Ng only, then lots of valuable information in the marks will
Stein's method for compound Poisson approximation
91
be lost. It is more hopeful that working on the actual values of the random variables proposed here will get us better estimates. We now establish the Stein equation and find its solution. Lemma 2.1: Under the assumption that ^ i > 1 i\i < oo, for each bounded function f on HT, the integral
h(0 = - f°° E[/(Z£(t)) - CP(A)(/)]dt Jo
(2.4)
is well-defined and satisfies
Ah(O = f(0 - CP(X)(f).
(2.5)
Proof: We write Z£(t) = Dt(t) + Z0(t) and ZCP{t) = D(t) + Z0(t), where Dj and D are two pure death processes (no immigration) with death rates as described above and initial distribution D^(0) = £ and .D(O) ~ CP(A), respectively, and ZQ is a copy of the above immigrationdeath process with initial state 0 and independent of D^ and D. Then ZCp{t) ~ CP(A) for all t > 0. Define Si = inf{i: Ds(t) = 0} and S2 = inf It : D(t) = o | , then we have |E[/(Z € (t)) - CP(A)(/)]| < E|/(Z £ (t)) - f(ZCp(t))\ <2\\f\\W(S1vS2>t), so that r°°
/ |E[/(Z£(t))-CP(A)(/)]|dt < 2||/||E(5iVS2) < 2||/||E(5i+5 2 ) Jo
< 2||/||f]e{a,}A2 + 2||/||f;E(iV0// < oo, i=i
i=i
where Ni ~ Po(A(). To show Stein's equation (2.5), conditional on the first time of change of states r = inf {t : Z^(t) ^ £}, it follows from the strong Markov property
92
Aihua Xia
that, for each 0 < u < oo,
- f[E/(Z £ (t))-CP(A)(/)]dt Jo = - E [U{f(Zs(t)) - CP(X)(f)]dt Jo
= - [ / ( 0 - CP(A)(/)]E[r Au] - E jf" [/(Zc(t)) - CP(A)(/)]d* = -[/(O-CP(A)(/)]E[TAU]
- r £ ^ ° ° a ! , . . . r E[/(z€+i,Q; w) - cp(A)(/)]dtP(T e d«) /•u °°
~
Ev°°
Pin \
pu—v
x 4.1.1 /
E[/(Z € _ W a i (i))-CP(A)(/)]dtP(T€d«).
Letting u —+ oo gives
HO = -[/(O - CP(A)(/)]
1
°°
+ 5]h(£ + I6ai)
which implies Stein's equation (2.5) by some reorganisation.
1\
'
•
To estimate the Stein factors, there are two ways to couple the immigration-death processes starting from different states [see Remark 3.4]. Erom Poisson random variable approximation we learned that the best coupling for estimating the Stein factors is in terms of the time for one Markov chain to run until it reaches the initial state of the other [Xia (1999), Brown and Xia (2001)], and since the Markov chain used for Poisson random variable approximation has single birth and single death, it is possible to represent the mean of the time in terms of the stationary distribution of the Markov chain. For the compound Poisson approximation, when j\j J, 0 as j —> oo, Barbour, Chen & Loh (1992) introduced a Markov chain with multiple births and single death to estimate the Stein factors. Although it seems impossible to have any explicit expression for the mean time of the Markov chain to travel from one state to another, when the test functions are of the form / = l[o,fc], k € Z + , Barbour & Xia (2000) managed to exploit this coupling technique to obtain some neat bounds. The difficulty here is multiple births and multiple deaths. For such immigration-death
93
Stein's method for compound Poisson approximation
processes, little is known about effective couplings [Mufa Chen (2000), personal communication]. 3. The compound Poisson approximation in total variation As in the Poisson random variable approximation case, the most valuable part, which is also the most difficult part, is to obtain the best possible Stein factors. To achieve this, we first need a technical lemma. Lemma 3.1: For any two positive integers j , I with j < I, we have
xj-xl<^J-x,
0
J Proof: It suffices to show that V(x) := x^1 — x'" 1 — -^- < 0 for all choices 0 < x < 1. With this end in view, noting that ^V(x) = 0 gives a solution xo = (i^r)
J
.we
ha
ve V(0) < 0, V(l) < 0 and
the proof is hence completed.
•
Now, we are ready to work on the first estimate of the Stein factors. Proposition 3.2: For all /(£) = U(|£|) with A C Z+,
{
^
A
i-43|i-j|
ifl>j>l
<(lA^)|l-j|:=c(Ai)|l-j|.
(3.1)
Proof: If j — I, the claim is obvious. Hence, we first assume 0 < j < I. It follows from (2.4) that /•OO
\h{£ + I5ai) - h(€+j8ai)\
= /
nf(Zi+i5Ol
(*)) - f(Zi+jsa]
(t))]dt •
(3.2) Let T\ ~ exp(j') and T2 ~ exp(/ — j) be independent and independent of Zj (£), then we can write Z
i+l5a, (*) = Zdt) + lSai lTiAT2>t, Zi+jSaj (t) = Zg(t) + j6aj l T l > t -
94
Aihua Xia
Now, if T\ < t, then f(Zz+lsai(t))-f(Zi+jSa.(t))=O, hence, by conditioning on T\ > t, then on T^ < t and T2 > t, we get nf(zi+iSai{t))
-
f{zMSa.{t))\
= e-*E[/(Z c (t) + lSai ln>t) - f(Z€(t) + j8aj)] = e-"E[/(Z £ (t) + I6ai) - f{Zt{t) +e-*(l
(3.3)
+j8aj)\
_ e-('-^)[/(^(*)) - f(Zdt) +JSaj)].
(3.4)
It follows from (3.2) and (3.3) that f°° 1 \HZ + I6ai) - h(£+j5aj)\ < / e~jtdt < -. Jo J
(3.5)
Noting that f(rj + iSai) = f(r] + iSai),
for all rjeHr, ie Z+,
(3.6)
we have from (3.4) and Lemma 3.1 that
\nf(zi+iSai(t))-f(zMSoj(t))}\ < e~lt \nf(Zdt) + lK)-f(Zdt) +{e-jt
+jSai)]\
__ e-it} |E[/(Z£(t)) - fXZs(t)+J8ai)]\
lt
< e~ |E[/(Z € (t) + WQ1) - /(Z £ (t) + jSai)}\ + i ^ l e - * |E[/(ZC(*)) - /(Z € (t) +j6ai)]\.
(3.7)
Now, bearing the representation (2.3) in mind, we can decompose Z^\t) = D^\t) + Z^1](t), t > 0, where D^ is a pure death process with Di (0) = ^{ai} and unit per capita death rate, Zg
is a birth-
death process with Zg (0) = 0, birth rate Ai and unit per capita death rate, D[ , ZQ and {Z\ , I > 2} are independent. Hence, if we define
Rdt) = TZ2 Zf\t)8ai + D^(t)Sai, then Ze(t) = Z£\t)8ai + Rt(t), t > 0.
(3.8)
Stein's method for compound Poisson approximation
95
Noting that Z^\t) ~ Po(Ai(l - e"*)), we obtain |E[/(Z£(t) + W a i )-/(Z c (t)+j6 a i )]|
< £ nRdt) = v)\m(v + (Z^(t) + l)Sai)--Ef(r,+ (Z^(t)+j)Sai) < £ nJk(t) = ri)drv(C(Z^(t)),C{Z^(t) + l-j)) i)£Hr
< (i - ^drvmz^it^ciz^it)+1)) < (/ - j) m a x P ^ 1 ^ ) = k) < (I - j) min ( l , - _ L = \ ,
(3.9)
where Ait = Ai(l — e~l) and the last inequality is from Proposition A.2.7, Barbour, Hoist &. Janson (1992). Similarly, |E[/(Z { (t))-/(Z C (*)+#a 1 )]l
< Y. F(^(*) = V) |E/(T, + Z^W^) - E/fa + (4J)(t) + j)*ai) < ^ p(^(t) = v)dTV(c(z^(t)),c(z^(t) +j)) V<EHr
<jdTV(C(Z^(t)),£(Z^(t) + l)) (3.10)
<jm^T(Z^(t) = k)<juan\l,-^==). Combining (3.2), (3.7), (3.9) and (3.10) yields, for Ai > i ,
|Mf + ««,) - Ht +J8ai)\ <(l~ j) /°°(e- 2t + e"*) min ( l , _ L = j dt = ( / - j ) / (2-s) min { l , - = ! = = } cfc
= (I - j) /2eAl (2 - s)ds + {l- j) f Jo
JT\-
(2 -
s)-J~ds
V^eXis
while for AX < i , i f > 1 > I . Next we assume 0 = j < I, noting that the second term of (3.4) is 0, we have /•oo
\h(€ + l6ai)-h(S)\<
/ Jo
e~tdt = l.
96
Aihua Xia
Also, it follows from (3.4) and (3.9) that, for Ai > ^ ,
\KZ + l6ai)-h(Z)\
f
f
/"V'miiJl.-^^ldt
Jo 1 1
I v2eAit J Z"5^ Z"1
= / / min < 1, — = = > ds = I Jo I V2eAisJ 70 \/2_ 0.8578/
yeAi
ds + 1 V ^
1 v2eA!S
ds
vAi
8
and for Ai < ^ , °-^I > 1, completing the proof.
•
The estimate of the second difference of the function h is given in the following proposition. Proposition 3.3: For all j,l > 1, /(£) = U ( | f | ) with A C Z+,
\h(Z+3Sai+l6ai)-h(Z
SJMI! < |l A ^
+ j6aj ) - h ( £ + I5ai) + h(01
("<"•>+£)) (ln+(2A1) + 4 ^ " ) ) } J' : = ^(Ai)^.
(3.11)
Remark 3.4: It seems that the term ln+(2Ai) should be superfluous and it appeared simply because we used a coupling of four Markov birth-death processes starting at different states and evolving at the same time [see Barbour (1988), Barbour & Brown (1992)]. We already know from Xia (1999) and Brown k, Xia (2001) that a better coupling is to divide the four processes into two groups, and in each group, let one Markov birth-death process run until it reaches the initial state of the other, and then compare the averages of the two coupling times. It is hoped that in future research, such a coupling technique could be adapted to replace the term by a constant. Remark 3.5: When Ai is small, the Stein factors ci(Xi) and c2(Ai) are both equal to 1 and we might get a better bound if we tackle the problem using a multivariate Poisson approximation as in Barbour (1988) [see also Barbour, Hoist & Janson (1992), page 240] or, instead of focusing on Ai, we focus on other A^'s in our approach. Remark 3.6: In many complicated applications, it may be necessary to estimate combinations of h(^+j5aj + I5ai) — h(£+j5aj) — h(£ + I5ai) + h(£). For example, if we have a\k = a,2k = • • • = ank = flfe for all k > 1 in
97
Stein's method for compound Poisson approximation
Example 1.1, it is more desirable to estimate M£ + Yi8aYi + Y26aY2 )-h(£ + Y!SaYi) - ft(£ + Y25aY2) + ft(O, where the random variables Y\ and Y2 are independent and identically distributed with P(Yi = k) — ak for k > 1, and the bounds will be a function of YliLi ^< rather than Ai. Hence, there is some room to improve on the bounds if the sizes of the clusters satisfy more conditions. Proof of Proposition 3.3: It follows from (2.4) that | h(£ + jSaj + I8ai )-h(£ + jSaj )-h(£ + lSai) + h(01
(3.12)
/-OO
= / nf(Zi+iSai+jsa.(t)) - f(Zi+l5ai(t)) - f(Zi+jS {t)) + f(Z^t))]dt • Jo Let Si ~ exp(i) and 5 2 ~ exp(j) be independent and independent of Z^(t), then we can write % w O | (*) = Z^(t) + Zi+j6aj(t)
=
l5ailSl>t,
Zi(t)+jSajlS2>u
z
i+isai+jsaj (t) = Zt(t) +
lSailSl>t+jSajls2>t-
Hence, if either Si < t or S2 < t, then f{zi+lSai+j5aj
(t)) - f(zi+lSai
(0) - f(z€+jSa.
(t)) + /(z € (t)) = o.
This, together with (3.6), implies 1E\f(Zs+i5ai+j5aj (t)) - f(Zi+lSai (<)) - f(Zi+Jsaj
(*)) + f(Zt(t))]
+
= e-(' ^'E[/(Z € (t) + itfQl + j5aj) - /(Z c (t) + Ztfai) -/(Z £ (t)+j<J a ,) + /(Z € (t))] = e-('+^tE[/(Z?(<) + (l+j)Sai) - f(Z^t) + lSai) -f{Ze(t)+j5ai) = e-{l+3)t E E
(3.13)
+ f(Zz(t))) E
^ W
+ (l'+J')S°i) ' 2f(Zdt)
+ (l'+f ~ l)«aj
+/(Z € (t) + ( i / + j ' - 2 ) J a i ) ] .
(3.14)
Using (3.13), we obtain from (3.12) that |h(£ + J«QJ. + WQI) - h(t + j5 Qj ) - h(£ + Wai) + h(0 < / 2e-{j+l)tdt ~ Jo
< -. ~j +l
(3.15)
98
Aihua Xia
Next, it follows from (3.8) and (3.14) that
E [/ [Zi+l5ai +j5aj (t)) - f (Z€+Mai (t)) - / (Zi+jta. (t)) + / (Ztf))] = e-(W£Y,
£ P(^(*)=»/)E[/(f? + (^1)(t)+i' + i')*a1)
j ' = ll'=l?j€«r
-2/(r ? +(z( 1 ) (t) + i ' + / - l ) 5 Q l ) + /(i?+(41)(*)+'/+i'-2)«a1)],
(3-16)
which in turn yields E[/(Z £ + M a | + ^. (*)) - f{Zi+Uai (*)) - / ( ^ + . s (*)) + /(^e(*))] < j/e-
(3.17)
On the other hand, E[/fa + (4 X) (*) + 2)« ai ) - 2/(7? + (4 1 } (t) + l)<Jai) + /(r? + Z{01](t)6ai)} oo
= S ^ + k5°i [F(Zo1}W = *) - 2F(Z^1)(t) =fc- 1) fe=0
+ P(41}(t) = fc-2)]| < 0 . 5 ^ IPCZ^t) =fc)- 2P(Z^(0 =fc- 1) + P(z£\t) = k-2)
< o , | i P W , ^ + 1 )((tti- 1 ) 2 + ^ ) ^ , where, again, An = Ai(l - e~'). Therefore,
E[/(r? + (Z^(t) + 2)5ai) - 2f(v + (Z$\t) + l)5ai) + /(r? + Z™(t)8ai)]\
^ 2 A Aro^y
( 3 - 18 >
99
Stein's method for compound Poisson approximation
Combining (3.12), (3.17) and (3.18) gives | KZ + j6aj + I5ai) - h(£ + jSaj )-h(£ + I5ai) + h{£) |
(3.19)
Kjlfe-^^j^L^y It is clear that the integral in (3.19) is bounded by e 2t 2 A
/
~ (
T77^ n) dt = A 1 - s) f2 A T-) ds
Jo \ ^i(l-e"*)/ Jo V - W = I"1 2(1 - s)ds + f (1 - s)/(XlS)ds = i - fln(2Ai) + - M , ^o
JJ-
Ai \
4Ai/
valid for Ai > 0.5. However, because of (3.15), direct verification ensures that (3.11) is true for Ai < 0.5 as well. • To state the main theorem, we need the Wasserstein distance between two probability measures Qi and Q2 on Z + defined as < M Q i , Q a ) = sup \ffdQife?w \J
J
f fdQ2
=inf|Yi-y2|,
where Tw — {f '• Z + >-> R, \f(x)-f(y)\ < \x — y\} and the second equation is due to the Kantorovich-Rubinstein duality theorem [Rachev (1991), Theorem 8.1.1, p. 168] and the infimum is taken over all couplings of (Yi,Y2) such that Yi ~ Q,, i = 1, 2. We also need the Palm distributions of a point process [Kallenberg (1983)]. For a point process H on X with finite mean measure fj,, we define probability measures {Px, x € J}, called the Palm distributions, on the Borel sets of Tlj equipped with the weak topology, satisfying Px(.) =
L[
~;j ;
fi{dx)
n
,
x e I H - almost surely.
When H is a simple point process, namely, P(E{x} = 0 or 1 for all x € 1) = 1, the Palm distribution Px can be heuristically interpreted as the conditional distribution of the process H given H{x} = 1. A point process Hx on X is called a Palm process of S at location x if its distribution equals Px. T h e o r e m 3 . 7 : Assume S is a point process on T with finite mean measure \i, {Ex, x e X} are the Palm processes of E, A; = £ ~ i P(H(C/j) = l),l> 1,
100
Aihua Xia
and Ex is the associated member ofEx, Ex = Ex - Ex(Ui)5aSx{U), oo
then
.
dTV(£(\E\),CP(\)) < ^(Ai)^; / EfllS-SillJEpS,^)]/!^) 7=1. ^ oo
.
+ ci(Ai)V;/ ^ ^ ( H ^ ) ^ ) , ^ ^ ) ) ) / ^ * ) .
(3.20)
Remark 3.8: When F is a finite set and S is a simple point process, the choice of Ux = {x} gives a unit per capita death rate and the second term of (3.20) simply vanishes. Proof of Theorem 3.7: We write H* = E - E{Ui)SasiUi), be independent copies of H and Ex respectively, then
let H' and E'x
EAh(E) oo
oo
oo
= E £ l\t [h(E + I5ai) - h(E)} + E ^ ^ [ ^ ( 5 - Wai) - /.(S)]/!^^)^ (=i x=i
i=i
= £ { £ [ ^ ( 5 + H'(I/t)<W(Ui)) " ^(H)]S'(^) i=l
+ E[/i(3') - /i(H' + 2(^)*a a(l/i) )]3(y i )} =£
/" E[/i(3 + 3;(C/0*aa,(l/i)) - ft(S)]/i(di) OO
.
+ 5 ] / E[/i(3i) - ft(3i + 3x(I/0Jas,(l/i))]**(<*«:)
t=i ^ ^
+ [h(Si) - h(Ex + E'^U^^^idx) oo
(3.21)
.
+ Y^ / E[A(3i + 3 ; ^ ) ^ ^ . , ) -KK +Sx(Ui)6aaxlu.))]p(dx). i=iJui
By Proposition 3.3, the first term of (3.21) is bounded by oo
c2(X1)J2 n\\Z-&x\\K(Vi)Mdx) fct Jut = C2(A!) J ] / E[||S - EiWinSxWMdx),
(3.22)
101
Stein's method for compound Poisson approximation
while, it follows from Proposition 3.2 that the second term is bounded by oo
ci(Ai)£/
fcf
= ci(Ai)f] /
dw(£(Sx(Ui)\*),£(S*(Ui)))n(dx).
-
\Y-Ex(Ui)\n(dx) (3.23)
Hence, the proof is completed by collecting the bounds in (3.22) and (3.23). • Proof of Example 1.1: Take r = {1, • • • , n } , C/j = {i}, then, due to the independence of X,'s, the second term of (3.20) is 0, leaving the first term which equals C2{Xi)52nz(Ui)]
I ES x (CA i )M(«fa)=c 2 (A 1 )^E[S(I/ i )]E|S 2 (^)].
•
Acknowledgement: I started using Markov immigration-death processes with multiple births and multiple deaths in 1995 when I was working at the Department of Statistics, the University of New South Wales and reworked on the idea in 2003. Part of the work was done when I visited the Department of Statistics and Applied Probability at the National University of Singapore. References 1. R. ARRATIA, L. GOLDSTEIN & L. GORDON (1990) Poisson approximation and the Chen-Stein method. Statist. Science 5, 403-434. 2. A. D. BARBOUR (1988) Stein's method and Poisson process convergence. J. Appl. Probab. 25 (A), 175-184. 3. A. D. BARBOUR & T. C. BROWN (1992) Stein's method and point process approximation. Stock. Procs. Applies 43, 9-31. 4. A. D. BARBOUR, L. H. Y. CHEN & W. LOH (1992) Compound Poisson approximation for nonnegative random variables via Stein's method. Ann. Probab. 20, 1843-1866. 5. A. D. BARBOUR, L. HOLST & S. JANSON (1992) Poisson Approximation. Oxford Univ. Press. 6. A. D. BARBOUR & M. MANSSON (2002) Compound Poisson process approximation. Ann. Probab. 30, 1492-1537. 7. A. D. BARBOUR & S. UTEV (1998) Solving the Stein equation in compound Poisson approximation. Adv. Appl. Prob. 30, 449-475. 8. A. D. BARBOUR & S. UTEV (1999) Compound Poisson approximation in total variation. Stock. Procs. Applies 82, 89-125.
102
Aihua Xia
9. A. D. BARBOUR & A. XIA (1999) Poisson Perturbations. ESAIM: Prob. & Statist. 3, 131-150. 10. A. D. BARBOUR & A. XIA (2000) Estimating Stein's constants for compound Poisson approximation. Bernoulli 6, 581-590. 11. T. C. BROWN & M. J. PHILLIPS (1999) Negative Binomial Approximation with Stein's Method. Meth. Comput. Appl. Probab. 1, 407-421. 12. T. C. BROWN & A. XIA (2001) Stein's method and birth-death processes. Ann. Probab. 29, 1373-1403. 13. L. H. Y. CHEN (1975) Poisson approximation for dependent trials. Ann. Probab. 3, 534-545. 14. L. H. Y. CHEN (1998) Stein's method: some perspectives with applications. In: Probability towards 2000, pp. 97-122. Lecture Notes in Statistics 128. 15. L. H. Y. CHEN k. A. XIA (2004) Stein's method, Palm theory and Poisson process approximation. Ann. Probab. 32, 2545-2569. 16. D. J. DALEY & D. VERE-JONES (2003) An introduction to the theory of
point processes. Springer, New York. 17. W. EHM (1991) Binomial approximation to the Poisson binomial distribution. Statist. Probab. Lett. 11, 7-16. 18. P. EICHELSBACHER & G. REINERT (2003) Stein's method for discrete Gibbs measures. Preprint: http://www.stats.ox.ac.uk/~reinert/. 19. S. N. ETHIER & T. G. KURTZ (1986) Markov processes: characterization and convergence. John Wiley & Sons, New York. 20. O. KALLENBERG (1983) Random Measures. Academic Press, New York. 21. L. LE CAM (1965) On the distribution of sums of independent random variables. In: Proc. Internat. Res. Sera., Statist. Lab., Univ. California, Berkeley, pp. 179-202. Springer-Verlag, New York. 22. R. MICHEL (1988) An improved error bound for the compound Poisson approximation of a nearly homogeneous portfolio. ASTIN Bulletin 17, 165169. 23. C. J. PRESTON (1975) Spatial birth-and-death processes. Bull. Int. Statist. Inst. 46, 371-391. 24. S. T. RACHEV (1991) Probability metrics and the Stability of Stochastic Models. John Wiley & Sons, New York. 25. C. STEIN (1971) A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. Proc. Sixth Berkeley Symp. Math. Statist. Prob. 3, 583-602. 26. A. XIA (1999) A probabilistic proof of Stein's factors. J. Appl. Probab. 36, 287-290.
The central limit theorem for the independence number for minimal spanning trees in the unit square
Sungchul Lee and Zhonggen Su Department of Mathematics, Yonsei University Seoul 120-749, Korea E-mail: [email protected] and Department of Mathematics, Zhejiang University Hangzhou 310028, P.R.China E-mail: [email protected] Assume that V\ is a Poisson point process of intensity 1 in R 2 . Let fn be a minimal spanning tree (MST) on V\ n [-n 1 / 2 /2,n 1 / 2 /2] 2 , and let N(fn) be the independence number of fn, i.e., the size of largest independent sets. In this paper, we prove a central limit theorem for
N(fn). 1. Introduction Beardwood, Halton & Hammersley (1959) showed that the length Ln of the travelling salesman tour on the n random points in R d satisfies a law of large numbers. Furthermore, Steele (1988) and Redmond & Yukich (1994) found that this type of asymptotics is commonly observed in several combinatorial graphs including the minimal matching and the minimal spanning tree. The reader may look at Steele (1997) and Yukich (1998) for the grand view of this kind of the law of large numbers. The subject of this paper is on the central limit theorem. The possibility of a central limit theorem for the travelling salesman tour was already suggested in Beardwood, Halton & Hammersley (1959); the BHH conjecture. However, until now only a very few attempts have been tried and among these trials Avram & Bertsimas (1993) and Kesten & Lee (1996) are worth to look carefully. The remarkable observation of Avram & Bertsimas (1993) is that conditioned on certain event Gn, the length Ln of several combinatorial graphs 103
104
Sungchul Lee and Zhonggen Su
on the n random points in R d can be written as the sum of locally dependent random variables and that the probability of G£ is negligible. So, they found the way to represent the random variable Ln as the sum of locally dependent random variables and they showed in this setting how to apply Stein's method which is the main theme of this special volume. By using Stein's method they proved the central limit theorem for the length Ln of several combinatorial graphs including the nearest neighbor graph. This is the first real progress in the direction toward the possible full justification of the BHH conjecture. However, the method of Avram k Bertsimas (1993) does not work for the length Ln of the more complicated combinatorial graphs. Kesten & Lee (1996) provided a new idea for analyzing the structure of the more complicated combinatorial graphs—the (space) truncation method. They wrote Ln — ELn as a sum X^fe=i ^-k of martingale differences and they approximated Afc by a truncated random variable Ak,L- Their main work was to show that this approximation is good enough for the study of the length Ln of the minimal spanning trees. In this way, they were able to apply Levy martingale central limit theorem and proved the central limit theorem for the length Ln of the minimal spanning tree. The reader may look at Alexander (1996) and Lee (1997, 1999) for the related results. The truncation method of Kesten & Lee (1996) which is called the stabilization method has been successfully adopted for many other problems; see for example Penrose & Yukich (2001), Baryshnikov & Yukich (2005). However, still this method is not good enough to prove the BHH conjecture. In this paper we report a new method of constructing yet another truncated random variable Ak,L in the hope that this method may provide a new insight for the BHH conjecture. Let {xi,..., xn} be a finite subset of R d , d > 2. A minimal spanning tree (MST) on {xi,... ,xn} is a spanning tree T({x\,... ,xn}) on {x\,... ,xn} such that 22 eeX({xi,...,xn»
|e| = min I 2 J |e| : T a spanning tree on {xi,...
,xn}>,
e€T
where |e| = |x, - Xj\ is the Euclidean length of the edge e = (xi, Xj). There are several interesting central limit theorems in connection with an MST; the length of an MST (see Alexander (1996), Kesten & Lee (1996) and Lee (1997)), the number of vertices of a given degree a in an MST (see Lee (1997, 1999)). In this paper we add one more story, the central limit theorem for the independence number of an MST. This problem is moti-
The independence number for minimal spanning trees
105
vated by the section 2 of David Aldous's "My Favorite 6 Open Problems" which is available at his homepage. See also Aldous & Steele (2003). An independent set in a graph G is a set of vertices, no two of which are linked by an edge and the independence number N(G) of a graph G is the maximal size of an independent set in the graph. As one of the most basic parameters for random graphs, the independence number has attracted attentions of a number of authors in the literatures. Meir & Moon (1973) were the first to study the independence number for uniform random trees and obtained a weak law of large numbers. Later on, Pittel (1999) used the recurrence relation for the moment generating functions of uniform random trees and established the central limit theorem. In the case of G(n,d/n), Frieze (1990) used the method of bounded differences and proved that for any e > 0, and for sufficiently large d > do(e) and n > no(e), P(N(GL,1\\
-^(Iogd-loglogd-log2 + l) < ^ ) - 1 .
In this direction, Boucheron, Lugosi & Massart (2000) used a powerful tool of log-Sobolev inequality and got a sharp concentration inequality which is independent of d. In the case of random geometric graphs, Penrose (2003), Penrose & Yukich (2005) verified the exponential stabilization property for some random geometric graphs and obtained a law of large numbers and the central limit theorem. In this paper we prove the central limit theorem for the independence number of an MST when the random points are from the Poisson point process. Our results are as follows: Theorem 1.1: Suppose that V\ is a Poisson point process of intensity 1 in R 2 . Let fn be the MST on Vx n [-n 1 / 2 /2, n 1 / 2 /2] 2 and denote by N(fn) the corresponding independence number. Then there are two positive finite constants c\ and C2 such that cxn < VarN(fn) < c2n and N(Tn) is asymptotically normally distributed, that is, asn->oo
N(fn)-EN(fn)^ (VariV(Tn))i/2 in distribution. Following the framework of Kesten & Lee (1996), we write the quantity N(Tn) — EN{Tn) as a sum of martingale differences, and then apply the
106
Sungchul Lee and Zhonggen Su
Levy martingale central limit theorem. In this way, the proof of the central limit theorem for N(Tn) is reduced to a kind of weak law of large numbers estimate for certain conditional variances. To get a weak law we need some independence and this required independence is usually obtained by the truncation method using the stabilizing property (see Lee (1997), Penrose & Yukich (2001), Baryshnikov & Yukich (2005), Penrose & Yukich (2005), for the stabilizing property and it applications). However, for our random variable N(Tn) this stabilizing property does not hold or at least we don't know how to prove it. So, we need a new idea; instead of studying the original random variable N(Tn) we artificially create an approximating random variable, say N^. By its construction this approximating random variable N'n has the stabilizing property so we can follow the steps of Kesten & Lee (1996) and get the central limit theorem for N'n. Furthermore, the approximation error turns out to be small enough to dig out the central limit theorem for N{Tn) from the central limit theorem for N'n. One may hope to prove that Theorem 1.1 holds for the non-Poisson case. Unfortunately, since we don't have the stabilization property for the independence number N(Tn), we cannot follow the de-Possionization step of Kesten & Lee (1996) and even worse we cannot prove the existence of ~VaiN(Tn). Our main tool is the percolation technique regarding the existence of the open and closed circuit at criticality. So, our approximation method works only for d = 2 and hence the problem is still open for high dimensions. In Section 2, we review MSTs and continuum percolation for further use. Since there are a large number of literatures on MSTs and continuum percolation, we will omit the details. In Section 3, we prove Theorem 1.1. In this paper, there are lots of strictly positive but finite constants whose specific values are not of interest to us. We denote them by a, C(9). 2. Minimal spanning trees and continuum percolation In this section, we recall several facts regarding minimal spanning trees and continuum percolation for further use. Lemma 2.1: Let G — (V, E, w) be a connected weighted graph satisfying \V\ < oo, \E\ < oo. If there exists a path -K = {v\,V2,- • •,vn) in G with (VJ,VJ+I) € E, 1 < j < n — 1, from vi e V to vn e V, such that also Vj - Vj+i\ < A, 1 < j < n - 1, then for any MST T on G there exists a path 7r' = (v[,v'2 ...,v'm) in G with ( v ^ + 1 ) G T, 1 < j < m - 1, from vi = v[ to vn = v'm such that \v'j - v'j+11 < A, 1 < j < m - 1.
The independence number for minimal spanning trees
Proof: See Lemma 2 of Kesten & Lee (1996).
107
•
Lemma 2.2: There exists a finite constant Dd, which depends only on the dimension d, such that, for any MST T(A) on any finite subset A of Kd, T(A) has maximum vertex degree bounded by D&. In particular, D2 — 6. Proof: See Lemma 4 of Kesten & Lee (1996).
•
Now, we review continuum percolation. For a set W of points of R 2 and for r > 0, x € R 2 , denote by {x <-^-» oo in W} the event that there exist an infinite path x = XQ,X-\_,X2, • • • , x n , • • •, and |a;n| -* oo in W such that \xi — Xi-\\ < r,i = 1,2, • • •. The fundamental theorem of continuum percolation says that there exists a constant 0 < rc = rc{2) < oo such that r urn <—> • in r>\V\) = / ><0 if r > rc, P(0 oo [0 iir < rc.
3. Proof of Theorem 1.1 Let us start with the proof of the statement that the variance Var7V(Tn) is asymptotically the same order as n. By a standard block argument like that of Kesten & Lee (1996) Theorem 2, pp.525-527, we easily see that VariV(fn) > Cln.
(3.1)
Suppose that {Xi : i > 1} are i.i.d. uniform points on [-n 1 / 2 /2,n 1 / 2 /2] 2 . Let an be a Poisson random variable with mean n, then X\, X2, • • •, Xan is identical in distribution to V\ n [-n 1 / 2 /2, n 1 / 2 /2] 2 . Now Let Tn be the MST on {Xi : 1 < i < n} and denote by N(Tn) the corresponding independence number. To stress the dependency to the points, we denote by N(S) the independence number of the MST through S. By Steele, Eddy & Shepp (1987) Lemma 2, it follows for any m > 1 and X\, X2, • • •, xm \N(x1,x2,...,xm)
-N(x1,x2,..-,xm-i)\
Thus by Efron-Stein inequality, n
VaiN(Tn)
(3.2)
108
Sungchul Lee and Zhonggen Su
Also, note that E(N(fn)
- N(Tn))2 < c5E(an - nf = c5n.
This, together with VarAT(TTl) < c^n, implies VarJV(fn) < c2n.
(3.3)
Next, let us turn to prove the asymptotic normality. Fix 1/4 < a < j3 < 1/3. Let An(x) be the moat with center x — {x\,X2), inner radius na, outer radius vP, that is 2
A°n(x):=\{[xk-n0,xk+n% fc=i 2
4 l (z):=II[ a : *- n a < : r f c + n Q ]' An(x):=A°n(x)\Ai(x). We say a finite point set A is nice if all inter point distances in A are distinct. For any nice set of points A in R 2 , let fn(A) be the MST through A n [-n 1 / 2 /2,n 1/ ' 2 /2] 2 , ATn(^4) the corresponding independence number. For the critical value rc of Poisson continuum percolation, call an edge e = (x, y) open in A if |e| < rc. Now we would like to have the following properties for the point configuration A on the moat An(x); (PI) there exists a closed circuit C\ in An(x) at criticality whose interior contains A%n(x), i.e., there exists a continuous path TTI : [0,1] —> An(x) with TTI(O) = 7Ti(l) such that U0
(P2) there exists an open circuit O An(x) with 7T2(0) = 7T2(1) such that there is an m with ^(j/m) € A, and |7T2((j + l)/m) - n2(j/m)\ < rc for 0 < j < m,
The independence number for minimal spanning trees
#2 := U0
109
An(x),
the bounded component of R 2 \ 7f2 contains ff i,
(P3) there exists a closed circuit C3 in An(o;) at criticality whose interior contains 02, (P4) there exists an open circuit O4 in An(x) at criticality whose interior contains C3. If (PI), (P2), (P3) and (P4) hold simultaneously, we say that the property Gn(x) (or Gn(x;A) to denote the dependency of the property to the point configuration A) holds. Note that the property Gn(x) is completely characterized by A D An(x). If for x with An(x) C [-n 1 / 2 /2, n x / 2 /2] 2 the property Gn(x) holds, then without loss of generality we may and do assume that O2 and O4 are the most inner and most outer open circuits respectively. By Lemma 2.1, we know that in the MST Tn(A) there exists a unique self-avoiding path it = (2/1,2/2, • • •, 2/m) from 02 to O4 with 2/1 £ 02, ym € O4, 2/j ^ O2 U O4 for 2 < j < m — 1. Suppose there is another such path 7T* = (zi, 22,..., z;). In addition, without loss of generality suppose i
max i \Zj - zj+1\ > ^ m a x ^ \Vj \zio - Zjo+i\ = j ^ ^
\ZJ -
yj+1\, zj+1\.
Note that since the path vr should cross the closed circuit C3, maxi<j< m _i \yj — yj+i\ > r c . Now, we travel from Zj0 to O2 using zi), from zx to 2/1 using only the points in O2, from 2/1 to y m (zjo,zjo-i,..., using 7T, from i/m to zi using only the points in O4, from z; to zJO+i using (z^, z^_i,..., ZJO+I). In this way we find a path n** = (wi,W2, • • •, Wk) from wi = zjo to Wk = zjo+i with maxi<j< fe _i \WJ - Wj+i|_< \zjo - zjo+i\. By Lemma2.1, the edge (zJO, z JO+1 ) cannot be in the MST Tn(A) and this leads to a contradiction. Therefore, there is indeed a unique path n = (2/1, • •., ym) In this case, we find the first 2/i0 m ^ from 02 to 04 in the MST fn(A). with jump size greater than the critical value, that is \Vi-y%+i\
110
Sungchul Lee and Zhonggen Su
Now, we call such points x/i0 big jump points and we denote the set of the big jump points in [-n 1 / 2 /2,n 1 / 2 /2] 2 by Bn. When we consider the independent set, we would like to exclude those big jump points. In other words, instead of Nn(A) we now consider N^(A) where N^(A) is given by Nn(A) = max{\A\; A is an independent set in fn(A),
A D Bn = 0}. (3.4)
Since for each big jump point j/j0 there is a corresponding open circuit O2 whose interior K(yi0) has its volume at least (2na)2, and since by the choice of the big jump points in [-n 1 / 2 /2,n 1 ^ 2 /2] 2 for two distinct big jump points yio andy^, the number |B n | of big jump points is at most (3.5)
Also, by the maximality of Nn(A) we have Nn(A) > N^(A). Now, note that any independent set A can be written as A = (A n Bn) U (A C\ B£). Since a subset A n £?£ of the independent set A is still an independent set, by the maximality of N^(A) we have \A n B^\ < N'n(A) and also \A\ - \A n Bn\ + \AK Bcn\ < \Bn\ + N'n{A). Hence, \Nn(A) - K(A)\ < \Bn\ = o{nx'2).
(3.6)
Next let us focus upon the Poisson point configuration V\. Observe first that by the standard argument of the RSW inequality in continuum percolation (see Alexander (1996) and Grimmett (1999) for the detail) in R 2 , P(Gn(x;Vi)) -»lasn-»oo uniformly in x. Write for simplicity N'n =: N'n{V{) and < 2 = VarA^. By (3.6), it is easy to see that bm <*. . 1 n^oo VarAT(Tn) and so c§n < a'2 < c^n and to prove Theorem 1.1, it suffices to prove the central limit theorem for A^, that is, a s n - » o o N'n - EN'n -*N(0,l) &
(3.7)
111
The independence number for minimal spanning trees
in distribution. Now, we represent N^ — EN^ as a sum of martingale differences and we apply Levy's martingale central limit theorem to the sum of martingale differences. In this way, the proof of the central limit theorem for N^ is reduced to a kind of weak law of large numbers estimate for certain conditional variances. Even though a weak law of large numbers is much easier to obtain, in general, than a central limit theorem, it still requires some independence. The required independence is obtained by approximating the conditional variances by quantities which depend only on the local configuration of V\. This approximation method is motivated by the bounded rooted dual of Redmond & Yukich (1994) and works well due to the existence of the open and closed circuits at criticality. For each v = (i>i,u2) € Z2, we let Q(v) = I l L i K ~ ^vk + | ] - W e order the points v of Z2 with Q(v) D [n 1 / 2 /2, n 1 ^ 2 /2] 2 ^ 0 in some way, say lexicographically, as v(1), • • • ,v(l), and we define Tk by Tk = <J(VI n [Ui
N'n-EN'n = Yd*k, fc=i
where Afe = E{N'n\Fk) -
EiN'J?^)
= / P(dak,--
,dat)
[K([\Ji
/ P(dak,--
U [U i>feai ]) - N'n([Ui
,da{)
[Dn([Ui
- Dn([Ul
U [Ui>fcOi],Ofe)]
=: fDn,kP(dak,...,dat),
(3.9)
where Ai — V\ H Q(v(i)), 1 < i < I, where P(dak,-P{Ak € dak, • • • ,Ai G dai), and where Dn{A,B)
=
N'n{A)-N'n{A\B).
,dai)
is short for
112
Sungchul Lee and Zhonggen Su
By virtue of the representation of N^ — EN'n as a sum of martingale differences and by Theorem 67.2 of Levy (1937), Theorem (2.3) of McLeish (1974), or Theorem 3.2 of Hall & Heyde (1980), to prove (3.7), it suffices to verify the following three relations: 1 ' —fi ^2 A2. —> 1 in probability as n —> oo, °n k —- max |Afc| —> 0 in probability asn-> oo, an i
~^E( max A?) is bounded in n. <2
i
(3.10)
(3-11) (3.12)
To this end, we still need more notations. Now, for L = nP we define DL{X; W) by the following. If Gn(x) happens under the point configuration W, that is, if Gn(x; W) happens, we let DL(x; W) = £>„([
1/2
1/2
—, - y - ] 2 n W, Q(x) n W).
If Gn{x; W) does not happen, we let 2
£ L (z; W) = Dn(l[[xk -L,xk + L}n W,Q(x) n W). fe=i
Note that the event Gn(x;W) completely depends on the point configuration J7fc=1 [a;jt - L,Xk + I ] n W . Also, if Gn(x; W) happens, then DL(X; W) completely depends on the point configuration nfc=i[x*; — L, xjt + L] n W. In this case, even though DL(X; W) = D n d - n 1 / 2 ^ , « 1/2 /2] 2 n W, Q(x) n W) looks as if it has dependency running over the whole point configuration [—n1/'2/2,n1/2/2]2 n W, this is not the case. In fact, when we calculate the modified independence number ^ ( [ - n 1 / 2 / 2 , n 1 / 2 / 2 ] 2 n W ) and K ( ( h n l / 2 / 2 > n1/2/2}2 \ Q(x)) n W), the big jump point does not change, the graph structure outside the big jump point does not change, and the vertices outside the big jump point which are included in the modified independence set also do not change. Therefore, the point configuration outside the big jump point has no influence on DL(X;W). Of course, if Gn(x;W) does not happen, by definition DL(X;W) completely depends on the point configuration nfe=i[xfe - L,Xk + L]n W. Therefore, in either case DL(X;W) completely depends on the point configuration
113
The independence number for minimal spanning trees
nLifr* - L,xk+L]nW and if Gn(x;W) happens, then DL(x;W) = Dn([-n^2/2, nll2/2]2 n W, Q(s) n W). Now, to proceed, we write AfciL for the expression arising in (3.9) when Ai(_[Ui
Afe,L = /
P(dak,...,dai)
x [£>L(u(fc); [U< fc A] U [UixtOi]) - DL{v(k); [\Ji
(3.13)
:= I DnXLP{dak,...,dai). Note that AA depends on Pi n [-n 1 / 2 /2,n 1 / 2 /2] 2 whereas AkyL
depends
only on U.[(v(k))i - L, (v(k))i +L]n Vx. (3.11) and (3.12) can be easily verified by Chebyshev's inequality and by the moment estimates in Lemma 3.1 below. Lemma 3.1: For each q > 0, there exists a constant C(9) such that E\Dn{Vx,Vi
n Q{x))\" < C(q) uniformly in x.
(3.14)
0
For L = n , E\DL{x; Vi)\q < C(9) uniformly in x.
(3.15) c
Proof: To have the moment estimate in (3.15), in the case G n{x\V\) first we construct an MST T(l) on (Y[[xi - L,xt + L]\ Q{x)) O V\. Second we construct an MST T(2) on Y[\xi ~ L, X{ + L] D V\ by applying the revised add and delete algorithm (see Kesten & Lee (1996) and Lee (1997)) with T(2)\T(1) to T(l). When we construct T(2) from T(l), the local structure of the graph around some vertices changes. However, since by Lemma 2.2 each v 6 Q(x) n V\ has degree bounded by 6 in T(2), by the revised add and delete algorithm the number of vertices for which the edges in T(2) emerging from the particular vertex are different from those in T(l), is bounded by 12|Q(a;) C\Vi\. Also, by the choice of the big jump point the number of big jump point may change at most eg. So, it follows from the revised add and delete algorithm that
\DL{x;V1)\
114
Sungchul Lee and Zhonggen Su
does not change. Indeed, suppose that A and B are two subsets of points and A C B. If x, y € A and e = (x, y) is one of edges of an MST of B, then e = (a;, y) is one of edges of an MST of A, too (see Eddy, Shepp h Steele (1987)). Hence the addition or the deletion of the points from Q{x) have no influence upon the existence of a unique path from the most inner open circuit to the most outer open circuit in V\ D An{x). So, in this case we have \DL{x;Vi)\
1 ^En ^™k
A
~~ ~
1 ' ^ ~ ^2 Z ) EAk
1S b
" k=l
°Unded
in
»•
So the heart of the matter lies in the proof of convergence in (3.10). Define a n,L = Sfc=i E^&,L- I n order to prove (3.10), we shall verify the following two relations: 1 '
-££|A£-A^|^0,
(3.16)
k=i
1 /2
' y ^k,L —* 1 in probability as n —» oo.
(3-17)
Recall that
P(Gn(x-Vi)) -*lasn-»oo uniformly in x. If Gn(v(k); (Uj
So, with en,k = Dn,k - Dn,k,L
&1-&U - (J Dn,kdP)2 - ( J Dn>ktLdP)2 = (j Dn,kdpf - n Dntkdp = 2(JDn,kdP){Jen,kdP)
f£n,kdpy
- ( Jen,kdP)2.
115
The independence number for minimal spanning trees
Now, since £„,* = £n,k1-Gz(v(k);(ui
E(jen,kdP)2
< E jelkdP <
we have
uniformly in k,
[JenMdP)1'2(p{Gcn{0)))1/2
-> 0.
(3.18)
Thus we have
k:An(v(k))nR2\{-n1/2/2,n1/2/2]29i0
k=l
k:An(v(k))c[-nV2/2,nl/2/2]2
-> 0.
(3.19)
Thus, since cgn < a'^ < cin, it follows that hm -~ = 1 and
cgn < o'^L < CIQU.
Now (3.17) follows from Chebyshev's inequality using the fourth moment. Since the events Afc^ and Afc^i, are independent as long as An(v(k)) n A^(v(k')) = 0, and since for each k the number of k' with K(v(k)) n A°n(v(k')) ^ 0 is of order n2/3, E(J2(^IL it=i
~ £Afc,L))4 < cn[n + n 1+2/3 + n1+i0 + n1+20} = O(n1+6^)
and hence
p(\-^y:AlL-EAlL > £ ) < £ i ^ - , 0 .
^
'
'
>- M4
Now, Theorem 1.1 follows. Acknowledgements: SL was supported by the BK21 project of the Department of Mathematics, Yonsei University, the interdisciplinary research program of KOSEF 1999-2-103-001-5 and Com2MaC in POSTECH. ZS was supported by NSFC 10371109, 10071072.
116
Sungchul Lee and Zhonggen Su
References 1. D. J. ALDOUS k. J. M. STEELE (2003) The objective method: Probabilistic combinatorial optimization and local weak covergence. In: Probability on Discrete Structures, Vol. 110 of Encyclopaedia of Mathematical Sciences, ed: H. Kesten, pp. 1-72. Springer, New York. 2. K. S. ALEXANDER (1996) The RSW theorem for continuum percolation and the CLT for Euclidean minimal spanning trees. Ann. Appl. Probab. 6, 466494. 3. F. AVRAM & D. BERTSIMAS (1993) On central limit theorems in geometric probability. Ann. Appl. Probab. 3, 1033-1046. 4. Y. BARYSHNIKOV & J. E. YUKICH (2005) Gaussian limits for random measures in geometric probability. Ann. Appl. Probab. 15, 213-253. 5. J. BEARDWOOD, J. H. HALTON & J. M. HAMMERSLEY (1959) The shortest
path through many points. Proc. Cam. Philos. Soc. 55, 299-327. 6. S. BOUCHERON, G. LUGOSI & P. MASSART (2000) A sharp concentration inequlaity with applications. Rand. Struct. Alg. 16, 277-292. 7. A. FRIEZE (1990) On the independence number of random graphs. Discrete Math. 81, 171-176. 8. G. R. GRIMMETT (1999) Percolation, 2nd edition. Springer, New York. 9. P. HALL & C. C. HEYDE (1980) Martingale Limit Theory and Its Application. Academic Press, New York. 10. H. KESTEN & S. LEE (1996) The central limit theorem for weighted minimal spanning trees on random points. Ann. Appl. Probab. 6, 495-527. 11. S. LEE (1997) The central limit theorem for minimal spanning trees I. Ann. Appl. Probab. 7, 996-1020. 12. S. LEE (1999) The central limit theorem for minimal spanning trees II. Adv. Appl. Prob. 31, 969-984. 13. P. LEVY (1937) Theorie de I'addition des variables aleatoires. GauthierVillars, Paris. 14. D. L. MCLEISH (1974) Dependent central limit theorems and invariance principles. Ann. Probab. 2, 620-628. 15. A. MEIR & J. W. MOON (1973) The expected node-independence number of random trees. Nederl. Akad. Wetensch. Indag. Math. 76, 335-341. 16. M. D. PENROSE (1996) The random minimal spanning tree in high dimensions. Ann. Probab. 24, 1903-1925. 17. M. D. PENROSE (2003). Random Geometric Graphs. Oxford University Press. 18. M. D. PENROSE & J. E. YUKICH (2001) Central limit theorems for some graphs in combinatorial geometry. Ann. Appl. Probab. 11, 1005-1041. 19. M. D. PENROSE & J. E. YUKICH (2005) Normal approximation in geometric probability. In: Stein's method and applications, Eds: A. D. Barbour & L. H. Y. Chen, pp. 37-58. IMS NUS Lecture Notes Series Vol. 5, World Scientific, Singapore. 20. B. PlTTEL (1999) Normal convergence problem? Two moments and a recurrence may be the clues. Ann. Appl. Probab. 9, 1260-1302.
The independence number for minimal spanning trees
117
21. C. REDMOND & J. E. YUKICH (1994) Limit theorems and rates of convergence for Euclidean functionals. Ann. Appl. Probab. 4, 1057-1073. 22. J. M. STEELE (1988) Growth rates of Euclidean minimal spanning trees with power weighted edges. Ann. Probab. 16, 1767-1787. 23. J. M. STEELE (1997) Probability Theory and Combinatorial Optimization. SIAM. 24. J. M. STEELE. L. A. SHEPP & W. F. EDDY (1987) On the number of leaves of a Euclidean minimal spanning tree. J. Appl. Probab. 24, 809-826. 25. J. E. YUKICH (1998) Probability Theory of Classical Euclidean Optimization Problems. Lecture Notes in Mathematics no. 1675, Springer, New York.
Stein's method, Markov renewal point processes, and strong memoryless times
Torkel Erhardsson Department of Mathematics, KTH S-100 44 Stockholm, Sweden E-mail: [email protected] We show (leaving out most of the details) how Stein's method can be used to obtain bounds for the total variation distance between the distribution of the number of points in (0, t\ of a stationary finite-state Markov renewal point process, and a compound Poisson distribution. Two bounds are presented, of which the first makes use of an embedded renewal-reward process, where visits to a particular single state serve as renewals. For the second bound, the original point process is first embedded into a process of the same kind, but with an added state, representing "loss of memory" in the original process. An explicit expression for the second bound is given for the case when the original process is a renewal process.
1. Introduction
Here we shall describe an application of Stein's method to the following problem. Let $ be a stationary Markov renewal point process on a finite state space S. (By such a process we mean a special kind of marked point process on the real line, with mark space S. The definition is given in Section 2.) Let B be a subset of S. Let W = *((0,t] x B), i.e., W is the number of points of \P in the interval (0, t] with marks in B. Some special cases of interest are the number of points of a stationary renewal process in (0,t], and the number of visits by a stationary irreducible Markov chain on the state space 5 to the set B during (0,t\. We want to find an explicit bound for dTV(J?(W),POIS(ir)), 119
(1.1)
120
Torkel Erhardsson
where drv{-r) is the total variation distance, and POIS(TT) is a suitably chosen discrete compound Poisson distribution. The reason why we want to bound (1.1) is that we expect that, under an additional condition, J£{W) is close to a compound Poisson distribution. The additional condition, which should be kept in mind although it is not necessary for any of the theorems to hold, is that the set B is rare in the sense that points of ^ with marks in B are rarely found. These points can then be partitioned into clumps, the sizes of which are approximately I.I.D., and the number of which is approximately Poisson distributed, an idea which is known as the Poisson clumping heuristic; see Aldous (1989). Before proceeding, we recall a few basic definitions. POIS(TT), where n is a finite measure on the positive integers, is defined as _2f (%2i=i ^i)> where all variables are independent, _2?(Tj) = TT = 7r/||7r||, and U ~ Po(||7r||). TT is called the compounding measure. The total variation distance (ITV{', •) is a metric on the space of probability measures on the nonnegative integers, denned by drv(j/i,(/2) = sup \vi(A) - v2{A)\. ACZ+
We use the following notation: K = the real numbers, Z = the integers, R+ = [0,oo), R'+ = (0,oo), R'_ = (-oo,0), Z+ = {0,1,2,...}, and Z'+ = {1,2,...}. The distribution of any random element X in any measurable space (S, y) is denoted by _S?(X). The Borel cr-algebra of any topological space S is denoted by 38$• Finally, we point out that the results presented here can be found also in Erhardsson (2000, 2004), together with detailed proofs. 2. Marked point processes We first recall the definition of a Markov renewal point process on a finite state space. Let {((f i^+i);* e Z} be a stationary Markov chain taking values in M'+ x S, where S = {l,...,n}, with a transition probability p which only depends on the second coordinate. Assume that {Vf;i G Z} is irreducible, and that E(£jf) < oo. For each A C S, let {(Cf^^^i e Z} have the distribution ^((Cf >^+i);» G Z|VOS € A), and define {UtA;i £ Z} by
feoCf, _V*
ift>l; CA
if ? < - 1
121
Strong memoryless times
Define a point process ^A on R x 5 by * A (-) = E i e z / { ( c / / 1 ' V%A) e }• This point process is called a Palm version with respect to marks in A of a Markov renewal point process. By definition, tyA is a marked point process on R with marks in S, and the basic theory for such processes (see Pranken et al. (1982) or Rolski (1981)) tells us that there exists a unique stationary version \P(-) = Eiez-MX^*' ^*) e '}> satisfying the Palm inversion formula: for each measurable function g : 9t —» R + (where 9t is the space of locally finite counting measures on R x 5),
E
,
m
_ E ( J ^ g(ft(**))*)
M<7W) -
jg^j
where ... < TAY < T£ = 0 < TA < ... are the ordered random integers {i G Z; V^-4 G A}, and flt : 9t -> «Tt is the shift operator. Some important special cases deserve mentioning. If S = {1}, then SP1 and \Er are Palm and stationary versions, respectively, of a renewal process. Likewise, the stochastic process {r?t;i e M} defined by r/t = Vmax{i€z;Ui
if i = 0;
X?=l0,
l-EJ-i3?,
if*<"l-
The point process £°(-) = E i e z - ^ P ^ ' ^ i 0 ) G •} on R x Z+ is called a Palm version of a renewal-reward process. The corresponding stationary version £ satisfies the Palm inversion formula: for each measurable function g : 9t->R+,
E(S(O) -
E(X?)
•
The point of introducing the renewal-reward process is that for this process, a useful total variation distance bound exists, due to Erhardsson (2000). The result is contained in the following theorem.
122
Torkel Erhardsson
Theorem 2.1: Let £(•) = X^ez-MX-^>^) € •} be a stationary renewal€ reward process, and £°() = J^iez^U^i^) "} tne corresponding Palm version. Then dTV(J?([
S
l W
J(O,t}xZ+
E(X°)
l
ydt;(x,y)),POIS(Tr))
E(*°) 2
E(X°)
>'
0
where irk = f^njPC^o = fc) for fc > 1, and '(lA^-Je'^ll,
always;
+ log + (2(7ri-27r 2 ))J, 1 . (1-20)A '
where X = X ^ i ^
if ITT, - (i + l)7ri+i > 0 Vi > 1; if fl •e' i ^ 2'
an<
^ ^ = A S S 2 *(* ~ l)71"^
A sketch proof of Theorem 2.1 is included here, to show how Stein's method is used. A full proof is given in Erhardsson (2000). Proof: [Sketch] We write (/>(£) = // 0 t , x Z yd£(x, y), for brevity of notation. The definition of total variation distance gives drv(JSf(^(0).POIS(7r)) OO
= sup |E(£*7TfcAMO + *))-0(OyUM0)l. where JA is the solution of the Stein equation for the distribution POIS(TT) with h = IA; see Theorem 1 1 ^ in Erhardsson (2003). For each (x,y) in K x Z+, let a Palm process £(x
sup \E(J2 knkfA(
^cz+
fc=i
l/E(/^(0(O + l/) - //1(0(?a;'y>)))^(x, y)|
= sup I / < Hiin) [
J(0,t]xZ+
yE|0(O - 0(^' y ) ) + y|di/(x,y),
123
Strong memoryless times
where "<•> = *(«•)) = i ^ * M - ) ,
(. is the Lebesgue measure on K, and ^y = _£?(F0°). Here, in the inequality, we used the fact that |[_A /^H < #i(7r); see Theorems 11.5 and 15.1 in Erhardsson (2003). Let f(°'») = EiezHiX^^}0'^) € •}, where {(X^+Xi-XuYJ, A
( i
y
> i
%>1;
) - S (0,2/),
i = 0;
»<-l,
[(Xi-X0,Yi),
where jSf(xf' y) ) = JS?(Xi|y0 = 2/), and x{°' y) is independent of f. Let £(x,y) = g_x(^o,v))t and j e t £(*.») = ^ ( g ) . This gives:
=
/
vd£{u,v)-
J(-x,t-x]xZ+
f
vd&°>ri(u,v) + y
J{-x,t-x]xZ+
< Yo + f
J(t-x-X[°'y\t-x]xZ+
+ I
vdt(u,v) vdZ{0'y\u,v)
J (U'__n(-x,~x+\X0\])xZ+
+ f
vd^°>y\u,v).
i(E' + n((-i-x 1 ,t-i])xz +
The terms on the right-hand side can be bounded using the Palm inversion formula and the independence assumptions. For details, see Erhardsson (2000). • 3. A partial solution Using the renewal-reward process defined in the previous section, a solution can be found for the problem considered in Section 1. The solution does not cover all the cases that we are interested in. We assume that the subset Bc is non-empty, and pick a £ B c . As in the previous section, let $ a (-) = J2 i g Z /{([/", V?) 6 •} be a Palm version of the Markov renewal point process with respect to marks in {a}. Let ... < X'L1 < Xft = 0 < Xf < ... be the ordered random variables {f/f; V? = a,i & Z}, and define Y? = \{k e Z;X? < U£ < X?+1,Vka <= B}\.
124
Torkel Erhardsson
Then, by the strong Markov property, £"(•) = Y^izzI{(XiiYi') € •} is a Palm version of a renewal-reward process. Similarly, let *(•) = J2iez 1{(ui: vi) € •} be a stationary version of * ° . Let . . . < X-1 < XQ < 0 < X\ < . . . be the ordered random variables {Ui; Vi = a,i € Z}, and define Yi = \{k eZ;Xi
Xi+1,Vk e B}\.
Then, £(•) = ^2i&I,I{(Xi,Yi) e •} is a stationary version of £ a . Using the triangle inequality for the total variation distance, Theorem 2.1, and the basic coupling inequality, we get drv(.S?(*((0,t]xB)),POIS(7r)) < dTV(^(
[
ydZ{x,y)),
POIS(TT))
V(O,t]xZ +
+drv(^(*((0,*] x B)),jSf( f
yd£{x,y))),
J(O,t}xZ+
-
U ;
MEQtf) (nx?Y0°) E((Xf)2)E(y0°) V E(Xf) E(Xf) E(Xf)2 ' +2P(*((0,X 1 ]x J B)>0),
(3.1)
where 7rfe = ^^P(Y^ = jfc) for k > 1. The bound (3.1) is often easy to compute numerically. For example, if 3?((fi\V,? = i) = exp(7i) for all i € S (so that {r]t;t e M} is a Markov process), then all quantities involved can be computed simply by solving a handful of systems of linear equations of dimension at most \S\. The drawback of (3.1) is that even though B is rare, the bound is not in general small unless also points of ^ with marks equal to a are frequently observed. This technical condition is satisfied in many interesting cases, but certainly not always. To remove it we introduce in the next section the concept of strong memoryless times. 4. Strong memoryless times The strong memoryless times were introduced in Erhardsson (2004), from which we take the following definition and theorems, and to which we refer for further details and proofs. Definition 4.1: Let (£, V) be a R + x 5-valued random variable, where S can be any measurable space. A strong memoryless time ^ is a random
125
Strong memoryless times
variable defined on the same probability space as (C,^), which satisfies P(0 < C < 0 = 1. an< 3 sucn that, for some 7 > 0 and some probability measure fi on S, JSf((C - t,V)\C
exp( 7 ) x /1
Vi G R+.
Theorem 4.2: For each function a : R + —> [0,1] such that a(t)< W
"
inf
A€3SM,+ X3SS (exp(7)x/x)(.4)>0
^ r f ' ^ w ^
(exp( 7 ) X /l)(i4)
Vt€M+, +
'
(4.1) V
'
and such that cr(t)e'Y* is nondecreasing and right-continuous, there exists a strong memoryless time £ such that P(C<*,C t). Moreover, the right-hand side in (4.1) can be used to deSne a. Theorem 4.3: The random variable £ is a strong memoryless time if and only if P(0 < < < C) = 1 and ^ ( C , C - C, V)|C < 0 = -S?(CIC < 0 x exp( 7 ) x M . Moreover, for
P(C < *. C < 0 = *•(*) + 7 / ^(w)^w; Jo
P(C = C < < , V e - ) = P ( ( < t , y e - ) - 7 M ( 0 /
P((C — t V) s A) -7 , . ' w . s = e- 7 'ess i n f l 6 ( t > o o ) x s / ( x ) ,
A€S9R, xSSs (exp(7) X /U)(J4) (exp(7)XM)U)>0
where / is the Radon-Nikodym exp(7) x fi.
v, ;
\
derivative of JS?(C, V") -with respect to
126
Torkel Erhardsson
5. A more general solution We first give a brief description of how strong memoryless times can be used to bound (1.1), followed by a more detailed account; for full proofs, see Erhardsson (2004). We begin with the Markov chain {«f, V ^ J j i e Z } on the state space M.'+ x S, where S = {1,..., n}, from which ty is constructed as described in Section 3. Using strong memoryless times, we embed this Markov chain into another Markov chain {(Cf,^+i);* € Z} on the extended state space R'+ x S, where S = {0,1,..., n}, and use the embedding chain to construct a stationary Markov renewal point process f on R x S, such that its restriction t o R x S has the same distribution asty.The points of \If with mark 0 represent a loss of memory in ^ , in the sense that the distance from such a point to the next point with mark different from 0, and the mark of the latter point, are independent of the past. Furthermore, the faster ^ loses its memory after a point with mark in S has been observed, the more frequently observed points with mark 0 will be in #. The final step is to re-use the solution of Section 3, but now applied to <J>, and with the choice a = 0. More formally, we first assume that B = S. This entails no loss of generality, since instead of ^ we may consider the restriction of \P to R x B, which is also a stationary Markov renewal point process. Next, let 7 > 0, let /j, be a probability measure on S = {l,...,n}, and, for each i € S, let a* : M.+ -> [0,1] satisfyjhe conditions of Theorem 4.2, for &((, V) = -^(Co\ Vf\Vos = i). Let {(Cf, Vf+1); i € Z } b e a stationary Markov chain on the state space R + x 5, where S = {0,1,..., n}, with a transition probability p defined by (for i ^ 0, j'• ^ 0): ft
p(i, [0,t] x {0}) = <7i{t) + 7 / <Ti(x)dx; Jo
P(h [0, t] x {j}) = p(i, [0, t] x {j}) - ti(j)~f f at(x)dx; Jo
p(0, [CM] x {0}) = (1 - £)(1 - e " ^ ) ; p(0,[0,t] * {j}) = utiMl - e-We)t), where e €(0,1) is arbitrary. Let {(Cf, V^+i);* e z ) n a v e -^((Cf, Vgi); * e Z\Vf e 5), and define {Uf; i € Z} by
feoCfUf = < 0,
ifi>l; if i = 0;
the
distribution
127
Strong memoryless times
Define the point process * s on R x 5 by $ s (-) = E i G z 7 { ( ^ f - ^ S ) € }• Clearly, <3>s is a Palm version with respect to marks in 5 of a Markov renewal point process. Moreover, from the definition of p, Theorem 4.3, and the strong Markov property, the restriction of $ s to 1 x S has the same distribution as ^fs. Also, the stationary version V? satisfies the Palm inversion formula: for each measurable function g ; 9t —> R+,
where ... < r ^ < TQ = 0 < -rf < ... are the ordered random integers {i e Z; V/ € 5}. The restriction of $ to R x 5 has the same distribution as *. For the final step, let *°(.) = £ i 6 2 2_{(tf°, V?) G •} be a Palm version with respect to marks in {0}, let ... < X°_ 1 < X$ = 0 < X° < ... be the ordered random variables {Uf; V® = 0, i e Z}, and define y,° = \{k e Z;X? < U°k < X?+1,V? e S}\. $°(') = Siez I{(X?, y^) £ •} is a Palm version of a renewal-reward process. Similarly, let *(•) = X)»ez ^{(^>^») € "} b e a stationary version of * ° , let ... < X_ i < Xo < 0 < X\ < ... be the ordered random variables {Ui- Vi = 0,ie Z}, and define Yi = \{k eZ;Xi
Xl+1,Vk e S}\.
^(•) = X^iez^{(^»'^) £ •} is a stationary version of £°. Proceeding as in Section 3, drv(^(*((0,t] x 5)),POIS(7r)) = d T y(if(*((O,t] x 5)),POIS(7r))
yd£{x,y)),-POIS(ir)) V
i(0,t]xZ +
y
+ drV(#m(0,t]xS)),J?([ ~
E(xf) V E (*i°) + 2P(*((0,Xi]xS)>0), 0
where 7rfc = i7 i 5 7P(r o = k) forfc> 1.
'
yd£(x,y))), ECx?)2
y (5.1)
128
Torkel Erhardsson
6. Number of points of a stationary renewal process The bound (5.1) does not at first sight seem explicit, but using the Markov property and solving a handful of systems of linear equations of dimension at most N, it is possible to express all quantities appearing in (5.1) in terms of 7, //, {E(Co S /{^ S = s'}\V0S = *); (s, s')eSx
{/
and
as{t)dt;s£ s\
{/
{E((C0s)2| ^
S),
/
= s); s € S},
as(t)dtdu; s € s\.
Below, we compute (5.1) in the important special case when S = {1}, which means that * is a stationary renewal process. We refer to Erhardsson (2004) for details. Theorem 6.1: Let $ be a stationary renewal process with interrenewal time C- Let 7 > 0 and let a : R + —> [0,1] satisfy the conditions of Theorem 4.2. E.g., we might take <j(t) = e"7*ess inf xe(tiOo) / c (a;)
Vt G R+,
where fe is the Radon-Nikodym derivative ofJ£(£) with respect to exp(j). Define c0 = 7 /0°° a(x)dx and c\ = 7 /0°° J^° a(x)dxdu. Then, drv(JSf(*((0,<])),POIS(7r))< 3i ^ ( 0 - 7 - ^ 0 ^ E(Q - Cl E(C2) - 2 7 - ^ ! HlM n^{—cQ— + ^ ^ + — W ) — +
2(E(C)-c 1 )(E(C)-7- 1 co)\ , 2(E(C)- 7 - 1 co)
^E(0
wiiere ||7r|| = tco/E((), '( H M << U ; " I
l A
)
+
1(0
w
_ _
W€R+I
fc 1
7ffc = (1 — co) ~ co for eacii A; > 1, and M^)
e M
'
ll7rllco(2c°-1)l4ll7rllc<'(2co-1) +log + (2||7r||co(2 C o -l))),
ifco 6(0,1];
l A
ifcoG[i,l];
2
. lk||(s"o-4)'
ifc0G(|,l].
Proof: [Sketch] The quantities involved in (5.1) can be expressed in terms of 7, E(C), E(C2), Co and ci. For example, letting rf = min{i > 1; V° = 0}, P(r o ° = k)= F{T° = A: + 1) = e(l - co)*" 1 ^
VA; > 1.
Strong memoryless times
129
In particular, E(?o°) = E/CQ. Also, define h : {0,1} -> R + by
Ms) = E(£c s IKf = *)> »=o where T? = min{i > 1; Vf = 0}. Then, E(X°1) = h(0)=e/1
+ eh(l),
and /i(l) = E(CoW = 1) + (1 - co)/i(l) = E(C) - co/7 + (1 - co)/i(l). It follows that E(Xj') = eE(C)/co. The other quantities are treated similarly. We finally let e -» 0. • As a last example, we consider the case when the interrenewal time has a mixed exponential distribution. This is relevant e.g. when studying the number of visits to a single state by a reversible Markov chain. Example 6.2: Let / c (x) = £ " = 1 ai7ie~ 7iI , where 0 < 71 < ... < j n , o.i € [0,1] for i = l , . . . , n , and Y^i=i ai — *• Choosing 7 = 71, we get a(t) = ct\e nt, Co = a.\ and c\ = ai71~1. Moreover, in this case E(£) = If C ~ exp(7), then Co = 1 and c\ = 7""1, making the bound 0. Remark 6.3: In a forthcoming paper, bounds for (1.1) in the case when W is the number of visits to a rare set by a stationary irreducible finite state Markov chain, will be computed using both the partial solution in Section 3 and the more general solution in Section 5. Numerical comparisons can then be made easily. References 1. D. J. ALDOUS (1989) Probability approximations via the Poisson clumping heuristic. Springer, New York. 2. T. ERHARDSSON (2000) On stationary renewal reward processes where most rewards are zero. Probab. Theory Rel. Fields 117, 145-161. 3. T. ERHARDSSON (2003) Stein's method for Poisson and compound Poisson approximation. In: An introduction to Stein's method, Institute for Mathematical Sciences Lecture Notes, Vol. 4, ch. 2. World Scientific Press, Singapore. 4. T. ERHARDSSON (2004) Strong memoryless times and rare events in Markov renewal point processes. Ann. Probab. 32, 2446-2462.
130
Torkel Erhardsson
5. P. FRANKEN, D. KONIG, U. ARNDT & V. SCHMIDT (1982) Queues and point processes. Wiley, Chichester. 6. T. ROLSKI (1981) Stationary random processes associated with point processes. Lecture Notes in Statistics 5, Springer, New York.
Multivariate Poisson—binomial approximation using Stein's method
A. D. Barbour Angewandte Mathematik, Universitat Zurich Winterthurerstrasse 190, CH-8057 Zurich, Switzerland E-mail: [email protected] The paper is concerned with the accuracy in total variation of the approximation of the distribution of a sum of independent Bernoulli distributed random d-vectors by the product distribution with Poisson marginals which has the same mean. The best results, obtained using generating function methods, are those of Roos (1998, 1999). Stein's method has so far yielded somewhat weaker bounds. We demonstrate why a direct approach using Stein's method cannot emulate Roos's results, but give some less direct arguments to show that it is possible, using Stein's method, to get very close.
1. Introduction The Stein-Chen method for Poisson approximation (Chen, 1975) has found widespread use, being easy to apply, even for sums of dependent indicators, and yet typically delivering bounds for the approximation error which are accurate at least as far as the order is concerned. For instance, considering the very simplest setting of a sum W of independent Bernoulli random variables (Xj, 1 < j < n), with Xj ~ Be (pj), the distance in total variation between £(W) and the Poisson Po(A) distribution with the same mean A := Y?j=i Pj 1S bounded by n
dTV(C(W),Po(X)) <min(\-l,l)J2pj,
(L1)
which is of the correct order; in particular, if all the pj are equal to p, the bound is just min{p, np2}, and if n tends to infinity while p remains fixed, the bound does not become progressively worse, but stays at the value p. 131
132
A. D. Barbour
For the multivariate analogue, take a sequence of independent Bernoulli random d-vectors (Xj, j > 1), with T[Xj = ^)=pf,
l
T[Xi=0) = l-Pi:=l-Yip®,
where eW denotes the i'th coordinate vector in E.d. Setting W := 2 " = 1 Xj> it was shown in Barbour (1988, Theorem 1), using Stein's method for Poisson process approximation in the multivariate setting, that
drv(C(W),V) < f > i n I A " 1 ^ ( V {pf}^ ,p$\ ,
(1.3)
where v :— C((Z\,..., Zd)) and the (Zi, 1 < i < d) are independent Poisson random variables with EZj = X^; here, A := 2?=i Pj> Mi : = A"1 n = 1 j ^ ' ' and cA:=i+log+(2A).
(1.4)
This bound is in many ways rather similar to that of the one dimensional version, but not quite as good. For instance, for identically distributed Xj, reduction to the one dimensional case shows that dTV(C(W),v)<min{P,np2}
(1.5)
is a valid upper bound in the multivariate case, too (Le Cam (1960), Michel (1988)). In contrast, the bound (1.3) translates to min{cnpp,np2}; for np large, this has an extra factor cnp x log(np), making the bound less accurate than that of (1.5), even if only logarithmically with np, and, for large enough n and fixed p, eventually destroying the bound altogether. This suggests that (1.3) could perhaps be improved in general, and replaced with a bound which would not become larger just because the size of the problem increased. Using two quite different methods, based on generating functions, Roos (1998, Corollary 1; 1999, Theorem 1) proved two bounds which demonstrated that this is indeed possible. His first bound was derived from a multivariate Charlier asymptotic expansion, and looks a little different from (1.3):
dTV(£(W),v)<—n
fd i £
"—n2
rmn{^\2e\}X-^{pf}
\ . (1.6)
Multivariate Poisson-binomial approximation
133
However, for identically distributed Xj and all n large enough, it takes the value (1.7)
remaining stable as n increases. However, Roos was unsatisfied with this result, precisely because of the comparison with (1.5) in the case of identically distributed Bernoulli random vectors. The problem is that the factor multiplying p in (1.7) is not bounded uniformly over all probability distributions (/jj, i > 1) over Z+; in d dimensions, it can be as large as d/(2 - \/3), and any attempt to approximate an infinite dimensional system for which J2i>i y/JM = oo is likely to be comparatively ineffective. His second bound, obtained by an elegant multivariate modification of Kerstan's (1964) method, gives a result which is satisfactory in all respects:
dTV(C{W),v) < c f > i n /A" 1 £ L'1 {pf)2\ ,p>\ ,
(1.8)
with a constant c which is always smaller than 8.8, but can be chosen close to 1 if maxi<j
134
A. D. Barbour
approximation has been shown to be true, and it seems conceivable that, as in the univariate case, a uniform bound of order O(A -1 ) for the weighted second differences could in fact be found. The main purpose of this paper is to show that such a bound is impossible. We exhibit a rather simple subset A of Z+ with the property that the solution to the bivariate Stein equation corresponding to the test function 1A has a weighted second difference whose supremum is of strict order O(X~1 log A). This implies that the elementary Stein arguments based on suprema can never yield a better order than that given in Barbour (1988, Theorem 1). However, we go on to show that there are other arguments here also, based on Stein's method, which yield bounds matching Roos's (1.6), and coming quite close to his (1.8).
2. Results Stein's method for approximation with the multivariate Poisson distribution v := Po (A/ii) x Po (A^2) x • • • x Po (A/id) starts with the observation that, for any given bounded function / : Z^ —> R, the function h defined by h(j):=-
Jo
j = (ji,...,jd)&Zd+
E«{/(V(t)) -„(/)}dt,
(2.1)
satisfies the equation Ah(j) = f(J) - »/(/),
(2.2)
where V is a multivariate immigration-death process with infinitesimal generator A defined by d
d
X
(Ah)(j) ••=Y, ^{h(j + e^)-h(j)} + Y/j^{h(j-e^)-h(j)}, i=l
(2.3)
i=l
and E"') denotes expectation conditional on V(0) = j ; note that v is the equilibrium distribution of the process V. Thus, for a random non-negative integer valued d-vector W, it follows that dTV{C(W),v)=
sup E{(AhA)(W)},
ACZ*.
(2.4)
135
Multivariate Poisson-binomial approximation
where hA is defined by (2.1) with / = 1A. Taking W := £ " = 1 xi it follows by routine arguments that
as in
(L2)>
W,{(Ah){W)} = £ £ p«pj'>E{A^(^)},
(2.5)
where W,- := £ w * j , Ai/i(Jb) :=fe(fc+ e « ) - ft(fc) and Aith := Ai(Aj/i). The right hand side of (2.5) can then be bounded by using the inequality d
(
1
d
2
d
sup sup V otiOnAuhAti) <mini \~ cxY]—,
)
Y^OL}}
,
(2.6)
which is proved for any a € JRd in Barbour (1988, Lemma 3), and (1.3) follows. The bound given in (2.6) is the key to establishing (1.3), and if it could be improved in such a way that the factor c\ were replaced by a A-independent constant, a bound comparable with Roos's (1.8) would result. Unfortunately, for large A, the order of the A-dependence in (2.6) is best possible whenever d > 2. It is shown in the following theorem that it is enough to consider sets A of the form A(r, s) := {j € Zd+ : 0 < h < r, 0 < j2 < s}. Theorem 2.1: Suppose that \i £ JR^_ is such that /ii,/Z2 > 0, and also that A > (e/327r)(/ii A/i 2 )" 2 - Then we have |Ai 2 ^ ( m i , m 2 )(i)| > c O i i ^ r ^ A - M o g A for any j such that (ji, j 2 ) = (wii,m2), where mi := [_A/x»J, i = 1,2, and
Remark: The lower bound for the A-dependence in (2.6) now follows by considering the choices a = e^ + e^2\ a — e^ and a = s^K Proof: We evaluate A12/MC?) by using the formula (2.1), realizing copies of processes V starting in the states j , j + £(1), j + e(2) and j + £(1) + £(2) simultaneously by setting y(D(i) : = y(°)(i) + J[£i > t]ew,
W(t)
:= V^(t)+I[E2
and y(12)(f) : = y(»)(t) + I[E! > t]eW + I[E2 > t]e{2\
> t]e(2)
136
A. D. Barbour
where V^ is started in j , and E\ and £ 2 are independent of each other and of V^°\ and have standard negative exponential distributions. This then gives A12MJ) = - | ° ° E [lA{V^2\t))
- lA(V{2\t)) - lA(V{1\t)) + lA(V^(t))}
= - I"0 e- 2t E {lA(V^(t)
+ eW + eM) - lA(V^(t)
dt
+ s^)
-lA(V<-°\t) + e^) + l^(V(°)(t))} dt,
(2.7)
and, for A = A(r,s) andfc€ Z%, l^(fc + e ( 1 ) + £ ( 2 ) ) - l A ( f c + £ ( 2 ) ) - U ( f c + £ ( 1 )) + l^(A;) = l {(rjS)} ((fci,fc2)). Hence, for any j such that (j'1,.7'2) — (011,7712), it follows that /•OO
A12MJ) = - / e- 2t P W) [Vi(t) = mx,V2{t) = m2]dt Jo = - / e-2tqi(t)q2(t)dt, Jo where
(2.8)
C(Vi(t) I V-(O) = mO - Bi (mi, e'1) * Po ( m ^ ) ) = £(mi - X t (i) + y t ( l ) ), where Xj ~ Bi (rnj, 1 — e~t) and Y^ which it follows that
~ Po (rhi(t)) are independent, from
qi(t)=F[xP = Yti)] = E{y?>(xli>)}, where
0 < y?\k) := Po(fhiimk} <
J—,
y/2emi(t) this last by Barbour & Jensen (1989, p. 78). However, <W(Bi [mi, 1 - e-'), Po (m<{l - c"'})) < 1 - e~\ from (1.1), and c/TV(Po(mi{l -e-*}),Po(mi(*))) < 1 - e~*
137
Multivariate Poisson-binomial approximation
from Barbour, Hoist & Janson (1992, Theorem l.C), using the fact that rrii - \fii\ < 1. Hence, and from Abramowitz & Stegun (1964, 9.6.10 and 9.6.16), ftW>E{»««
(0
)}-^^ y/2em,i(t)
2
= e - -W/ 0 (2m^)) - ^ = ^ (2 9)
- vl^yi2^~4^}'
-
as long as 1 — e~* < \\J~^ and fhi(t) > 1; that is, whenever
Thus it follows from (2.8) and (2.9) that, for j such that (ji,J2) — (w^ii ^2), \A12hA(j)\ > - — = /
y-^i-j,)^
AVM1M2 7l/Amin( Ml , M2 ) 4 - ^ ^
- 4 A ^ i 0 - i V^) {l0gA + logmin^' "»> + log ( i ^ ) } for c as defined above, if logA > -2hogm\n(ti1,fj,2)+iog(-J-^J
>.
• Although the direct argument by way of (2.5) and a bound such as (2.6) thus cannot yield an estimate of d,Tv{£(W), v) that is as good as those of Roos, alternative ways of exploiting (2.5) can be effective. Our next theorem uses (2.5) to give an analogue of Roos's bound (1.6). To state it, we define
A 2 «:=I>?} 2 ; j=\
^=maxpf(l- P f). -
3
-
n
Theorem 2.2:
dTV{c{w),V) < 2 y (1 A
l
=]
yPfPf. (2.10)
138
A. D. Barbour
Proof: The proof again begins from (2.4) and (2.5), but uses the simple inequality |E/(X) - E/(7)| < 2\\f\\dTV(C(X),C(Y)) applied with X = Wj and Y = Wj + e^ to \EAuh(Wj)\,
(2.11)
giving
+£<*>)).
lEAuhiWj)] < 2\\Alh\\dTV(C(Wj),C(Wj
(2.12)
By Barbour & Jensen (1989, Lemma 1), and because C(Wj ) is unimodal, we have
drviCiW^UWj+eV))
< (l A \ ) . \ 2 x /A/ii-A 2 (i)-T i y
(2.13)
Then, using (2.1) and the processes V^0) and Vw of Theorem 2.1, it follows easily that, for any A C Z^., ||A,M < min j l , j T e-tdTV,(£(rtW),£(yf(') + l ) ) d t | , where 1^
~ Po(m;(^)); this implies that
{
f°°
e~l
1, / and the theorem follows.
1
f
/ o 1
= d t \ = m i n < l , J — — },
If X i , . . . , Xn are identically distributed, having py 1 < i < d, then it follows that i————
•
= p^ for all
Q
a bound similar in spirit to (1.7). Further comparisons with (1.6) are contained in the following corollary. Corollary 2.3: Suppose that, for all 1 < i < d, Hi) < \^u
(2.14)
n < (1 - e/4)(A/ii - A2(i)).
(2.15)
139
Multivariate Poisson-binomial approximation
Then it follows that
drvWW),") < 2 £ J E (l A W ^ ) ^ ^|E/(^/^)W|
(2.16) •
(2.17)
Proof: The bound (2.16) follows from (2.14) because, under conditions (2.14) and (2.15), flA—. = | < (1 A W — — ] , \ 2 x / A / i i - A 2 ( i ) - r i y - V VeXmJ
Ki
(2.17) is then implied by the Cauchy-Schwarz inequality.
•
Remark: Expression (2.17) is in fact somewhat smaller than Roos's (1.6), although of exactly the same asymptotic order, so that a satisfactory counterpart has been derived using Stein's method in all cases in which (2.14) and (2.15) are satisfied. If (2.14) is violated for any i, the bound (1.6) is greater than 1, so (2.17) cannot be any worse. If (2.14) holds, but (2.15) is violated for some i, then XfXi < 2/(4 - e), and hence 2 < l/{(2 — y/3)\fii}; thus taking 1 for such i in (2.10) still yields a smaller bound than (1.6). Hence Theorem 2.2 yields a bound which is uniformly smaller than (1.6). It may, at times, be substantially better: if, for instance, n = md and pj = p for all j such that [j/rn\ = i — 1, p* = 0 otherwise, then (for large m) (2.16) gives a bound 4e~1dp, which is much smaller for large d than the value d2p/(2 - y/i) given by (1.6). We now turn to an analogue of Roos's (1.8), which is not entirely satisfactory. Recalling (1.3), we define £\ := Y^=i Vi(j)t where
mti) := min A"1** £ ^ A p ? ; I
i=i
Mi
(2.18)
)
note that, from (2.6), d
sup sup ]T pf pf AuhA(j)
<m(j)-
(2-19)
140
A. D. Barbour
In addition, we define the related quantity
„(,):= mm | - g - J — , p .J, and set
m{J)--=2Y,v?™Ai,(i*0 ~
[
\
\ ... ) ) •
A
2y/Xni-\2(%)-Ti) J
Then we have the following theorem. Theorem 2.4: Assume that e\ < 1/2. Then it follows that dTV(C(W), u) < (ei2 + e 3 )/(l - 2£i), where n
1
7 2
£12 := XN ^) ? ^);
£3 :
n
= ^Z^O')-
j=i
i=i
Proof: The argument again starts from (2.4) and (2.5), but now with (2.11) used to show that, for any bounded / ,
|E/(Wg - E/(W)| < 2||/|| drvmW^CiW)) d
< 2\\f\\J2pf)dTv(C(Wj),C(Wj+e^)). i=l
Then it is immediate from (2.13) and (2.19) that n
d
j=l
i,l=l
£ Y, pfpTiV&iMWj) - EAahA(W)} < e12.
(2.20)
Similarly, it follows from (2.11) that n
d
E E pfpTi^iMW) 3 = 1 i,(=l
- u(AuhA)} < 2e1dTV(C(W),u). (2.21)
The final step is to note that V is stationary if C(V(0)) = u, so that,
Mvltivariate Poisson-binomial approximation
141
from (2.7) and by Barbour (1988, (2.12) and (2.13)),
i,l=l
= i|P[Z + £ W + £(() € A] - P[Z + e « 6 A] - P [ Z + £(i) e,4] + F [ Z € A]| (2.22)
where Z ~ i/; the bound | E ^ P ^ ' M A W M I < P2- follows from Barbour (1988, Lemma 2). Hence,' from (2.20) to (2.22), it follows that dTV(C(W),
i/)(l - 2ei) < £12 + £3,
completing the proof.
•
The bound is in many ways rather good, the only drawback being the requirement that s\ be smaller than and, effectively, not too close to 1/2. If we define K := 2c\ max 7?2(j), it is immediate that
1 < j
dTV(C{W),v)<ez(l + K)l{l-2el). For Xi,..., Xn identically distributed, having py = p/j.i for all i and with X — np> 1/2, one has £3 = p/2, and
K<2pc np mini2, f ^ V M 7 J /V(n - l)p(l - p) 1 is typically rather small for n large — for instance, of order p(np)~ 1 / 2 c tlp if Ei=i Vt*i < °°- Moreover, if pCnP < 1/4, it follows that K < 1 and that £1 < 1/4, so that the bound is then at most 2p; similarly, if pCnP —> 0, then both K and £1 tend to zero, and the bound is asymptotic to p/2, half as big as Roos's bound under such circumstances. However, although pcnp < 1/4 whenever p < 1/(4 log n), an inequality satisfied in many applications, it is still violated for anyfixedp whenever n is large enough.
142
A. D. Barbour
Acknowledgement: This work was supported in part by Schweizerischer Nationalfonds Projekt Nr. 20-67909.02. I would also like to thank the Institute for Mathematical Sciences at the National University of Singapore for their support, during the current program. References 1. M. ABRAMOWITZ & I. A. STEGUN (1964) Handbook of mathematical func-
tions. Dover, New York. 2. R. ARRATIA, L. GOLDSTEIN & L. GORDON (1989) Two moments suffice for Poisson approximation: the Chen-Stein method. Ann. Probab. 17, 9-25. 3. A. D. BARBOUR (1988) Stein's method and Poisson process convergence. J. Appl. Probab. 25A, 175-184. 4. A. D. BARBOUR, L. HOLST & S. JANSON (1992) Poisson approximation. Clarendon Press, Oxford. 5. A. D. BARBOUR & J. L. JENSEN (1989) Local and tail approximations near the Poisson limit. Scand. J. Statist. 16, 75-87. 6. T. C. BROWN, G. WEINBERG & A. XIA (2000) Removing logarithms from Poisson process error bounds. Stoch. Procs. Applies. 87, 149-165. 7. T. C. BROWN & A. XIA (1995) On Stein-Chen factors for Poisson approximation. Statist. Probab. Lett. 23, 327-332. 8. T. C. BROWN & A. XIA (2001) Stein's method and birth-death processes. Ann. Probab. 29, 1373-1403. 9. L. LE CAM (1960) An approximation theorem for the Poisson binomial distribution. Pacific J. Math. 10, 1181-1197. 10. L. H. Y. CHEN (1975) Poisson approximation for dependent trials. Ann. Probab. 3, 534-545. 11. J. KERSTAN (1964) Verallgemeinerung eines Satzes von Prochorow und Le Cam. Z. Wahrscheinlichkeitstheorie verw. Geb. 2, 173-179. 12. R. MICHEL (1988) An improved error bound for the compound Poisson approximation of a nearly homogeneous portfolio. ASTIN Bulletin 17, 165-169. 13. B. ROOS (1998) Metric multivariate Poisson approximation of the generalized multinomial distribution. Teor. Veroyatnost. i Primenen. 43, 404-413. 14. B. ROOS (1999) On the rate of multivariate Poisson convergence. J. Multivariate Analysis 69, 120-134.
An explicit Berry-Esseen bound for Student's t-statistic via Stein's method
Qi-Man Shao Departments of Mathematics and of Statistics and Applied Probability National University of Singapore 6 Science Drive 2, Singapore 117543 and Department of Mathematics, University of Oregon Eugene, OR 97403, USA E-mail: [email protected] Let {Xi, 1 < i < n} be independent random variables with zero means and finite second moments. Bentkus, Bloznelis & Gotze (1996) obtained the Berry-Esseen bound for the Student t-statistic with an unspecified absolute constant. In this note we use Stein's method to give a direct proof of the bound with an explicit absolute constant.
1. Introduction Let Xi, X2, ..., Xn be a sequence of independent random variables with EXi = 0 and EX? < 00. Put
t=l
i=\
t=l
Define the Student ^-statistic by T
_
"
Vn s
where s2 = ^ j ]CiLi(-^i ~ Sn/n)2. The study of Berry-Esseen bound for Student's statistic has a long history (see, e.g., Bentkus & Gotze (1996) and references therein) and the first optimal result is due to Bentkus & Gotze (1996) for i.i.d case, which is extended to the non-i.i.d case by Bentkus, 143
144
Qi-Man Shao
Bloznelis &; Gotze (1996). In particular, they show that (l.i)
sUp\P(Tn
n 2
n 3
< dJ5- ^£;x27 { | X i | > B r v / 2 } +C 2 B- ^^l^| 3 /{|x i |
i=l
where C\ and C2 are absolute constants. The bound coincides the classical Berry-Esseen bound for the standardized mean Sn/Bn up to an absolute constant. Their proof is based on the characteristic function approach. The main purpose of this note is to give a direct proof of (1.1) with explicit values of C\ and Ci via Stein's method. The method wasfirstintroduced by Stein in 1972 for normal approximation and has been successfully applied to many other probability approximations, notably to Poisson, Poisson process, compound Poisson and binomial approximations. See, for example, Arratia, Goldstein & Gordon (1990), Barbour, Hoist & Janson (1992). See also Chen (1998) for an account of Stein's method and a brief history of its developments. Chen & Shao (2001) recently proved the non-uniform Berry-Esseen bound for the standardized sum through Stein's method. It is known that the Student t-statistic is closely related to the selfnormalized sum Sn/Vn via the following identity (Sn ( n \1/2\ f T ^ -, {T >x} = I— >x[———-) \. n lVn \n + xz — 1 / J Hence we state our main result in terms of the self-normalized sum. Theorem 1.1: We have sup \P(Sn/Vn < z) - $(z)| < 10.202 + 25/33, z
where
n
p2 = B-2^EXfl{\Xi>0.5Bn}
n
andfa= B~3 ^
(1.2)
E\Xi\3I{lXi\<0.5Bn}.
i=l
i=l
(1.3) In particular, we have n
sup \P(Sn/Vn
(1-4)
»=i
for 2 < p < 3. Remark: For i.i.d case, Nagaev (2002) also obtained an explicit bound similar to (1.4), but with a bigger value of the constant.
A Berry-Esseen bound for the t-statistic
145
2. Proofs Throughout this section we assume that {JQ, 1 < i < n} are independent with EXi = 0 and finite second moments. Without loss of generality, assume Bn = 1 and z > 0. The proof is carried out via (i) truncation and (ii) applying the Stein's method to the truncated variables. The main difficulty arises from the variable Vn in the denominator which can not be simply replaced by {E{V^))1'2. (1.2) is obviously true if /32 > 0.1 or /33 > 0.04. So we can assume that (32 < 0.1 and & < 0.04.
(2.1)
Recall a Hoeffding (1963) type inequality for non-negative independent random variables (see, e.g., the proof of Lemma 3.1 in Chen & Shao (2001)). Let {Yi,l < i < n} be independent non-negative random variables with for0<x
A direct corollary of (2.2) is: for c 2 < 1 - /?2 n
P(Vn
Er = l^^|X.I<0.5 } -C 2 ) 2 X (2.3)
Hence for z > 0 P ( 5 n / K > z) < P(Vn < 0.7) + P(Sn > 0.7z) t (l-0.1-0.49) 2 \ 1 Obi ) + 1 + (0.7z)2 ^6XP(
£O
-°15+rMo^
and sup(P(Sn/Vn
>z)-(\-
$(«)))
z>0
< 0.015 + 0.585 = 0.6.
(2.4)
146
Qi-Man Shao
Thus, (1.2) holds if /32 > 0.6/10 = 0.06 or (33 > 0.6/25 = 0.024. Therefore we can further assume /?2 < 0.06 and /?3 < 0.024.
(2.5)
Let Xi = XiI{\Xi\<0.5}i Sn = Yli=lXi, V£ = E?=i *l
Vn = max(Vn,0.8),
(2 6)
6 = Xi/v:, w = Er=i 6 = 5n/vB*, m
"!»(*) = Ci-f{-«j
(*) = E?=1 "^iW-
Note that \P(Sn/Vn < z) - P(W < z)\ < P( max \Xi\ > 0.5) + P(Vn < 0.8) l<j
<4fe7e.p(-"-^;0-64»') < 4/32 + /?3
sup
:
- exp (
0<x<0.024 X
V
[by(2.3)]
—)
X
/
< 4/32 + 0.98/33.
(2.7)
Now we use the Stein method to bound P(W < z) — $(z). Observe that for any absolute continuous function / n
W7(W) - $ > / ( W - 6)
(2-8)
i=l
1=1 n
i=l
= E ^ /
J 1
1=1 "
=E
"/-5i
.i ~
/-I
/
/ ' ( ^ + 0[-f{-«;
/ ' ( W + *)»ni(*)* = /
f'(W + t)m{t)dt.
Noting that ^
m(«)dt = E ^ 2 = ^ 2 /KT 2 = 1 - (1 - K2/0-64)/{1?7i<0.8},
A Berry-Esseen bound for the t-statistic
147
we have f'(W)
- Wf(W) = f'(W)(l
- VB2/0.64)J{V.n<0.8}
(2.9)
- £ &HW - 6) + / lf'(W) ~ f'(W + t)\m{t)dt. Now we let / := fz be the solution of the following Stein equation: f'(w) - wf(w) = I(w < z) - $(z).
(2.10)
P(W
(2.11)
Then + R3,
where R, = Ef'z(W)(l
- K 2 /0.64)/ { v- n<0 . 8} ,
n
R2 = YlEZifz(W-Zi),
(2.12)
(2.13)
8=1
i?3 = 1 1 E{{f'z(W) -f'z(W + t))m(t)}dt.
(2.14)
It is known that (see [Stein (1986), p.23]) 0 < fz{w) < V2TT/4 < 0.627,
(2.15)
l/»l
(2-16)
< 1.
\wfz(w) -(w + t)fz{w +1)\ < (\w\ + 0.627)|t|.
(2.17)
It follows from (2.16), (2.3) and the proof of (2.7) that I-Ril < P(Vn < 0.8) < 0.98/33.
(2-18)
Now by Lemmas 2.1 and 2.2 below, |i? 2 | < 1.568/32 + l-85/33 and \R3\ < 4.59/32 + 21.92/33. This proves (1.2) by combining the inequalities above.
•
We now present two lemmas used in the main proof. Lemma 2.1: Under (2.5), we have \R2\ < 1.568/32 +1.85/%.
(2.19)
148
Qi-Man Shao
Proof: Note that £» and W—& are almost independent and the dependence is mainly because of the denominator V£ in & and W — &. To eliminate the dependence, we introduce
% = (
* ' ) 1 / 2 ' V$) = max(%,0.8), Wft = ^ - ^ .
E l
(2.20)
(*)
Since E(Xi) = 0, |E(S n - Xt)\ = | ^ £ ( X , - 7 { | ^ | > 0 . 5 } | < 2/32.
Thus, under (2.5), we have
XW2
E\Sn - XJ < (E(Sn -
<(X:EX? + (£?(5 n -^)) 2 ) 1/2 < (l+4/?|) 1 / 2 < 1.008.
(2.21)
Noting that by (2.15) and (2.16)
= \afz(bx) + abxf'z{bx)\ < 0.627|a| + |afc|/0.8
\-^(axfz(bx))\
for \x\ < 1/0.8 and for constants a and b, and ••
- %
, y
«
v*2
-
V*2
^:)(V;*+F(:))
-wtM+w-^-0-98*1
(2 22)
-
149
A Berry-Esseen bound for the t-statistic
we have
< ^^{(0.6271^1 + l ^ ( ^ - ^ ) l ) 0 9 8 ^ 2 | ~-~i
I
0.8
n
i
n
< 0.615 Y, E\Xi\3 + (0.98/0.8)^E\Xi\ 3 E\S n - Xt\ i=l
i=l n
+0.627(0.8)-1 Y, \EXi\ i=l
[^ (2-15)]
< (0.615 + (1.008)(0.98)/0.8)/?3 + 0.627/0.4/32 < 1.568/?2 + 1.85/?3, as desired.
[by (2.21)] (2.23)
•
The next lemma provides an estimate for R3. Lemma 2.2: Under (2.5), we have
\R3\ < 4.59/?2 + 21.06/%.
(2.24)
Proof: First we prove that
R3 < 4.59/?2 + 21.92/fe.
(2.25)
The proof of —R3 < 4.59/?2 + 21.92/?3 is similar and hence will be omitted.
150
Qi-Man Shao By (2.10) and (2.17),
R3 < E f m(t)(\W\+0.627)\t\dt + E [ m(t)(I{w
n
3
< 0.5E ((\W\+0.627) J2\ti\ )+J2 3
n
3
i=l
m(t)I{z_t<w
+0.627)) +J2E(£I{'+Zi<w
< 0.98Y,E{\Xi\3(0.627 +\Sn\/0.8))
i=l := i?3,i + R3,2-
«i n
< 0.5(0.8)- '£E(\Xi\ (\Sn\/0.8 n
E
n
i=l
+Y/E(^I{z-lxii/o.s<w
i=l
(2.26)
By (2.21),
n
#3,1 < 0.98^] {(0.627 + 0.5/0.8)E\Xif + E\Xi\3E\Sn - X^/Os} i=l
< 0.98(0.627 + 1/1.6 + 1.008/0.8)/33 < 2.47^.
(2.27)
To bound i? 3 , 2 , let
n £ = 0.75^|6| 3 , ^ = 1^1/0.8 i=l
and define
{
-6 - r)i/2
x-(2zS + r)t/2
tor x < z - rji - 8
r]i)/2 for z-T]i-5<x
+5
(2.28)
A Berry-Esseen bound for the t-statistic
151
Then, by (2.8) n 3=1
= f m(t)h^5(W + t)dt> f J-i
I{z-n,<w
J\t\<&
~ ~
n
= I{z-Vi<W
n 3=1
= Iiz-v^w^iV*/
mzx(Vl0.82) - 1/3)
> (2/3)7 { ,_ r , i < vv < 0} - I{vn<0.8},
(2-29)
where in the last but second inequality, we used the fact that: min(a, b) > a - a2/(4b)
for any a > 0, b > 0.
Hence n
(2/3)J2E£l{z-rH<W
< EI{yn<0.8} j^g + E EgWhitS(W) - E E ^^'^^(^ - 6) i=l
i=l
j=l i=l
< P(V n < 0.8) + #3,3 + fl3]4 + #3,5 (2-30)
< 0.98/?3 + #3,3 + #3,4 + #3,5 by (2.18), where
Rz,3=J2E^\W\(S
+
Vi/2)),
n 1=1
^3,5 = - E E E^hiAw - o). i=l
jjii
152
Qi-Man Shao
From (2.21) we obtain that n
\R3,3\ < E(\W\5)
+0.5^2E(^rii\W\) i=l
n
< (0.5 + 0.75)0.8- 4 J3^|X?5 n | < 1.25(0.8)"4(0.5 + 1.008)03 < 4.603/33
(2.31)
and n
l i ^ l ^ ^ l O S + o.s^) n
3
2
<0.75f;(^|^| )
n
4
+0.5(0.8)" ^£;|Xi|4
1=1
1=1
n
n
i=l
i=l
< (0.75/1.6)£(^|&| 3 ) + 0.6104^£|Xi| 3 < l-53/33,
(2.32)
where we used the fact that £ " = 1 1 & | 3 < (0.5/0.8)XILi C.? < 1/1-6. To bound i?3,5, we define V^ as in (2.20) and h^ as in (2.28) with
^• = 0.75 J2 \X'\3/Vur Observe that j_
i
vf - v&
rv-2
1.5 / 2 Vidt _
1 ^1 Y I 2
Jv* V
V
U)
n VU)
<
1 t) A
- l jl
Vn V(j)
153
A Berry-Esseen bound for the t-statistic
Clearly for any x (2.33)
\hi,s(x) - hitSj(x)\ < \8-Sj\
< ^i^+0.75X:i^| 3 (l/^ - I / O 1*3
3 < 0.75|X 3 -|
1.5(0.75)[X,[2 ^
3
^ 0.3751X,[2 , l - m ^ l V ^ ^ 0.83 + VTr20.83 Z . °-5X< < 1.832X?. Write i?3i5 = i?3?6 + i?3,7, where
^3,6 = E E £ ; { ^ ^ ( W - ^) - k.*^ - ^ } ) }'
By (2.33),
|i? 3 ,6|<1.832^E^ 2 ^l^ 2 } < 1.832^ E{\^\Xf} < (1.832/0.8)/33 = 2.29/?3. Notice that l^i,^ I < k + r/,/2 < 0.75(0.5/0.8) + 0.5/1.6 = 0.782.
(2.34)
154
Qi-Man Shao
Since Xj and {V/*), S n - Xj,Xi} are independent for j ^ i, we have
i=l
j/i
»=1 J#»
-
^^.(I/V^)3^^,
((sB - ^,-)/^5))) |
< 0.782(0.8)- 3 EE £J l^l / {l^l>0-5}
E
*i
+ E E | £ ^ ( ^ r - 3 - V373)^A ((5ra - *,)/!$,)) | + E E | £ ( ^ 2 ^ y « ~ 3 { ^ ((5« - *i)ivz) <3.06/32+ 0.782 E E ^ (
i
^ ^ )
+EE*(^m-*il) < 3.06/?2 + 2.292/33 + - ^ E ^ l
3
[by(,22)1 ^
5
" " ^il
^ (2'21)1
< 3.06/32 + 3.56/?3.
(2.35)
Now combining (2.30), (2.31), (2.32), (2.34) and (2.35) yields #3,2 < 1.5(0.98/33 + 4.603/33 + 1.53/?3 + 2.29^ 3 + 3.06/32 + 3.56/33) < 4 . 5 9 ^ + 19.45/33. This proves (2.25) by (2.26), (2.27) and (2.36).
(2.36) •
A Berry-Esseen bound for the t-statistic
155
Acknowledgement: The author would like to thank Professor Bentkus and the referee for their valuable comments on the paper. The work was partially done when the author was visiting the Institute of Mathematical Sciences, National University of Singapore in August 2003. This research was partially supported by NSF grant DMS-0103487 and grant R-1555-000035-112 at the National University of Singapore.
References 1. R. ARRATIA, L. GOLDSTEIN & L. GORDON (1990) Poisson approximation and the Chen-Stein method. Statist. Sci. 5, 403-434. 2. A. D. BARBOUR, L. HOLST & S. JANSON (1992) Poisson Approximation. Oxford Studies in Probability 2, Clarendon Press, Oxford. 3. V. BENTKUS, M. BLOZNELIS & F. GOTZE (1996) A Berry-Esseen bound for Student's statistic in the non-i.i.d. case. J. Theoret. Probab: 9, 765-796. 4. V. BENTKUS & F. GOTZE (1996) The Berry-Esseen bound for Student's statistic. Ann. Probab. 24, 491-503. 5. L. H. Y. CHEN (1998) Stein's method: some perspectives with applications. Probability Towards 2000, eds: L. Accardi & C. C. Heyde, pp. 97-122. Lecture Notes in Statistics 128, Springer Verlag. 6. L. H. Y. CHEN & Q. M. SHAO (2001) A non-uniform Berry-Esseen bound
via Stein's method. Probab. Theory Rel. Fields 120, 236-254. 7. W. HOEFFDING (1963) Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58, 13-30. 8. S. V. NAGAEV (2002) The Berry-Esseen bound for self-normalized sums. Siberian Adv. Math. 12, 79-125. 9. C. STEIN (1972) A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. Proc. Sixth Berkeley Symp. Math. Statist. Prob. 2, 583-602. Univ. California Press, Berkeley, Calif. 10. C. STEIN (1986) Approximation Computation of Expectations. IMS Lecture Notes 7, Inst. Math. Statist., Hayward, Calif.
An application of Stein's method to maxima in hypercubes
Z. D. Bai, L. Devroye and T. H. Tsai Department of Statistics and Applied Probability National University of Singapore 6 Science Drive 2, Singapore 117543, Republic of Singapore E-mail: [email protected] School of Computer Science, McGill University Montreal H3A 2K6, Canada E-mail: [email protected] and Institute of Statistical Science, Academia Sinica Taipei 115, Taiwan E-mail: [email protected] We show that the number of maxima in random, samples taken from [0,1] is asymptotically normally distributed. The method of proof relies on Stein's method and gives a convergence rate. 1. Introduction A point y in M.d is said to be dominated by another point x if the (vector) difference x — y has only nonnegative coordinates. We write y -< x. The points in a given sample that are not dominated by any other points are called maxima. We derive in this short note a central limit theorem (CLT) for the number of maxima in random samples independently and identically distributed (iid) in the hypercube [0, l]d. A proof with the same rate was given previously in our paper Bai et al. (2004). We provide an alternative proof here using more original ideas introduced by Stein, which, in addition to methodological interest, also sheds more light on the complexity of the problem. For concrete motivations and information regarding dominance and maxima, we refer the reader to our paper Bai et al. (2004). 157
158
Z. D. Bai, L. Devroye and T. H. Tsai
Maxima in hypercubes. Let x i , . . . , Xn be a sequence of iid points chosen uniformly at random from [0, l]d, d > 2. Denote by Kn = Kn^d the number of maxima in {xi,..., x ra }. The mean of Kn is known to be
Wn4] = ^
y (1 4- O(aogn)-1)),
(1.1)
for bounded d; see Bai et al. (2004) and the references therein for more information. The variance satisfies (see Bai et al., 1998)
(i^"((^T)!+'")(1
+o
(^")"I»'
where Kd
v-
1 1^d_2kKd-i-k)\(k-iy.(d-2-k)\
Jo Jo
x + z - xz
is a bounded constant for d > 2. An asymptotic expansion for V[/fn] was derived in Bai et al. (2004). A Berry-Esseen bound for Kn,d- Suppose that Yi, Y 2 ,... is a sequence of random variables. Write {Yn} G CLT(rn), if Yn sn Z^]<x)-Hx) Pp( x
\ V *I/«J
/
=O(rB),
where rn —> 0 and $(i) is the standard normal distribution function. A sequence rn will be referred to as a convergent sequence if rn —> 0. We will construct a sequence of random variables KniW satisfying the following two theorems. Theorem 1.1: For a convergent sequence rn > Q,((lnn) 2~~), {Kn} e CLT(rn) iff {Kn,w} e CLT (r n ). Theorem 1.2: The centred and normalized random variables K* w defined by K^w :— (Kn,w - E[KntW])/^V[Kn!W] converge to the standard normal distribution with a rate d! {K*Wn,X) = O ((loglogn) M (logn)-^) ,
An application of Stein's method to maxima in hypercubes
159
where X denotes the standard normal distribution and dx(X,Y) := sup (|E[fc(X)] - E[h{Y)]\ : sup \h(x)\ + sup \h'(x)\ < l ) . From Theorem 1.2, it is easy to derive a rate for the Kolmogorov distance between the distribution of (Kn — E[Kn])/y/V[Kn] and that of a standard normal. Theorem 1.3: {Kn} G CLT ((loglogn) d (logn)-^) . Indeed, Theorem 1.3 follows from Theorem 1.2 and the fact that
where
{
,/rZ, iiu<x, 0, if u > x + y/r^, linear, otherwise,
and v / r ^ = (log log n) (logn) 4 . The proof given in this short note is similar to that given in Bai et al. (2004) since they both rely on the log-transformation first introduced by Baryshnikov (2000) and Stein's method. The main difference is that we give here a more self-contained proof based on Stein's original procedures (instead of just applying a theorem formulated in the book Janson et al. 2000). Also the conditioning arguments used in Bai et al. (2004) are replaced by more direct calculations. 2. CLT for Kn As in Bai et al. (2004), the proof of Theorem 1.1 is divided into several steps. The log-transformation. Assume now that x i , . . . , Xn are iid points uniformly distributed in the cube (—l,0)d. The transformation x = (xi,...,Xd) —> y = (j/i, • • •, yd), where (see Baryshinikov, 2000) Vi = -\og{-Xi),
i=
l,...,d,
160
Z. D. Bai, L. Devroye and T. H. Tsai
from ( - l , 0 ) d to Rf = {x : xt > 0 for all i = l,...,d}, preserves the dominance relation and thus transforms exactly maximal point to maximal point. Denote by y i , . . . , y n the images of x x , . . . , x n under such a transformation. Then the components of yi are i.i.d. with exponential distribution (A = 1). We define ||x|| = xx H 1- xd for x e M+. Then ||yi|| has a gamma distribution with parameter (d, 1), i.e., ||yi|| has the density function ?d_1\\e~x. Approximation to Kn by the number of maxima in a strip. Let Ba = {x : ||x|| > a } n R f and Bca = {x : ||x|| < a } n R f . Take a = Inn - ln(4(d - 1) In Inn), /3 = lnn + 4 ( d - l ) l n l n n . Let Kn be the number of maxima of the points falling in the strip T := BanBp. We prove that for a convergent sequence rn > fi((lnn) ~ ) , {Kn} e CLT{rn) iff {Kn} € CLT (rn).
(2.1)
To prove (2.1), we use the following Lemma, whose proof is omitted. Lemma 2.1: Let Xn,Yn be two sequences of random variables and rn be a convergent sequence. Suppose that (i) the total variation distance d(Xn, Yn) between Xn and Yn is bounded above by O(rn), (ii) \E[Xn}-E[Yn]\
=
O(rn^Y[X^}),
\V[Xn] - V[Yn}\ =
O(rny/V[Xn]).
and {in)
Then {Xn} € CLT(rn) iff {Yn} e
CLT(rn).
Let Nn(A) denote the number of points of { y i , . . . , y n } falling in A. Denote by Kn(A) the number of maxima of { y i , - . . , y n } falling in A and by Vn the event that no points of { y i , . . . , y n } fall in Bp. Clearly, Kn{A) < Nn(A). Note that maximal points contributing to Kn may not be maximal points contributing to Kn when Nn(Bp) > 0. However, we have Kn{Ba)lvn = Knlvn, which implies that Kn = KnlVn
+ Kn(Bca)lVn
+ KnlVc.
(2.2)
161
An application of Stein's method to maxima in hypercubes
To apply Lemma 2.1, we need to estimate the following quantities: d(Kn,Kn)
< F(Knlv.
> l)+P(Kn(Bca)lVrl
> 1) + F(KnlVc > 1),
E[Kn] - E[Kn] < E[Knlv,} + E[Kn(Bca)} + E[KnlV£], V[Kn] - V[Kn] < \E[Kl] - E [ ^ ] | + \E[Kn] - E[^ n ]| (E[Kn) + E[Kn}) , where
|E[^] - E [ ^ ] | =
\E[KZ1V*]+E[KZ(BI)1VJ-E[KZIV*]
.
Observe that 1 ^ < Nn(B0), E[Nn(B0)]=nV(\\yi\\>f3) /•oo
x
d-l
= O (n/T'-V)
= O (pnn)-3
= O (nad-le-a) = O((lnn) d - 1 lnlnn). Recall that E[Kn] x j l n n ) ^ 1 and E[K%\ x (lnn) 2 ^" 1 ); see (1.1) and (1.2). We show that E[/fn] and E[X^] have the similar asymptotic order. By (2.2), Kn = KnlVc +Kn-
Kn(Bca)lVf, -
KnlVc
< KnlVc + Kn < Nn(T)lv,+Kn. Therefore, EKn < E[Nn(T)Nn(B0)]+E{Kn] = E[Kn] + n(n - l)P( y i e T)P(y2 € B0) = E[Kn] + E[ATn(B/3)]E[iVn_i(r)] = O((lnn) d - 1 ),
162
Z. D. Bai, L. Devroye and T. H. Tsai
and E[K*\ < 2E[N2(T)Nn(B0)) + 2E[i£] = 2E\K2n\ + 2n(n - l)P( y i € T)P(y2 6 B0) + 4n(n - l)(n - 2)P 2 ( yi € T)P(y2 € B0) = 2E[Kl\ + 2E[Nn(B0)}E[Nn^{T)} + 4E[ATn(5/3)]E[Arn_1(T)]E[A/n_2(T)] = O((lnn) 2 ( d - 1 )). Estimates needed. We now claim that (i) E[X n (B-)]=O((lnn)- 3 ( d - 1 )), (it) E[Knlv.}=O{{\nn)-*d-V), (ii1) E[ifnlv=] = O((lnn)-2<'1-1)), (m) E [ ^ i v ; c ] = o ( ( i n n ) - ( d - i ) ) j {Hi') E[Kllv,} = O((ln-n)-('1-1)) and
WEKKll^Wlnr.)-^-1)). Prom these it follows that d(Kn,Kn) =
O((lnn)-2(d-V),
E[Kn}-E[Kn]\=O((\nn)-2(d-V), and V[Kn]-V[Kn]\=O((\nn)-ld-V). Proof of (i). If y is a maximal point, then there are no points in the region Cy = {z : Zi > yiti < d}. The probability that yi falls in Cy is J\\y\\ Therefore, for large n
(d-1)!
E[Kn(B
- O (n(n - l)-^ 1 *-^-^- 1 ^"") = 0((lnn)- 3 ^- 1 J).
An application of Stein's method to maxima in hypercubes
163
Proof of (ii). Note that Gn-.n •= {yi is a maximum in {y!, • • • ,y n }} C G n :n-i := {yi is a maximum in {yi, • • • ,yn_1}}. Thus E[KnNn(B0)\ < nP(|| y i || >0) + n(n - l)P(G n:n n {yn e B0}) < nF(\\yx \\>0) + n(n - l)P(G n:n _i n {yn e B0}) = E[Nn(Bp)] + E[Kn-i]E[Nn(B(,)\ = O ((lnn)-3^-1)) + O ((lnn)^ 1 )) O ((lnn)- 3 ^ 1 )) = o((lnn)- 2 ( d - 1 )). Proof of (in). E[K%Nn(B0)] < nP(||yi|| > 0) + 3n(n - l)P(G n:B n {yn € B/,}) + n(n - l)(n - 2)F(Fn:n n {yn £ S^}) < E ^ ^ ) ] + 3E[Xn_1]E[iVn(B/3)] + n(n - l)(n - 2)P(F n;n _i n {yn € S^}) < E[Nn(B0)] + 3E[Xn_1]E[iVn(B/3)] +E[K2_1]E[iVn(B/3)] = 0((lnn)"(d-1)), where Fn:n '•= {yi,Y2 are two maxima in {yi, • • • ,y n }} C Fn:n-i := {yi,y2 are two maxima in {yx, • • • ,y n _i}}. Proof of {ii'). Similarly as above, we have E[KnNn(B0)} < n(n - l)P(yi e T not dominated by points in T and y 2 G B0) <
E[Kn_^[Nn{B0)]
= 0((lnn)- 2 ( d - 1 )). Proof of {in'). E[K^Nn{B0)} < 2n(n — l)P(yi € T not dominated by points in T and y2 € B0) + n(n — l)(n - 2)P(y!,y 2 e T not dominated by points in T and y3 G B^) < 2E[Kn-1]E[Nn(B0)} d 1
= 0((lnn)-( - ^.
+E[Kl_1]E[Nn(B0)}
164
Z. D. Bai, L. Devroye and T. H. Tsai
Proof of (iv). Given yi,y2, the conditional probability that y3 falls in Cyi U Cy2 is P(C yi ) + F(Cy2) - P(C yi n Cy2) > i (e-II^H + e-l^l") ; the conditional probability that both yi and y 2 are maximal is less than ( l - - (V l|yi11 + e - | | y 2 l l ) V
< e-§(«-2)(e-»yiii+e-»"ii)_
We thus have E
[-^n( S a)] = E [X/, = 1 ^y* is maximal and ||yi||
< E[Kn{B°a)) + [(d"21)!]2 (J).)1
<E{K
" a (l°") a(
,
fJ\xy)d^e-^-^~*+^-*-ydx c - f n-2)e-°
<E[Kn(BJj+[(n_2)(d_1)!]2e ^((lnn)-2^-1)). Approximation by Poisson process. Construct a Poisson process {W n } on T with intensity function An = ne~HwH. Denote by Nw the number of points of the Poisson process falling in T. Also, let Kn,w denote the number of maxima of the Poisson process and Nn be the number of points of {yii • • • i]fn) that£alls in T. It is easy to see that the conditional distribution of Kn given Nn = m is identical to the conditional distribution of Kn,w given Nw = m. Thus, the total variation distance between Kn and KntW satisfies sup F{Kn eA)- F(KntW e A) A
= sup J2 p (^« = mMKn e A\Nn = m) A
0<m
-
J2
V(NW = rn)F(Kn,w € A\NW = m)
0<m
< 51 |P(Nn=m)-P(7V1B=m)|+ 0<m
< O(pn),
YI V(Nw=m) n<m
165
An application of Stein's method to maxima in hypercubes
(see Prohorov, 1953) where
P. =-P(x. e r , . f j £ L . ~ * - o ("°'""1'°'""). Similarly, we have
7a (d-1)!
V
«
/
E[ff n ]-E[jr n i l 0 ]|
(2.3)
3. A central limit theorem for Kn,w We prove in this section Theorem 1.2. We first give a lemma on Stein's method. Let h(x) be a function such that sup \h{x)\ + sup \h'{x)\ < 1. X
(3.1)
X
Let / be the solution of the differential equation xf(x) - f'(x) = h(x) - Eh, where
1 Eh = E[h(X)] = -=
C°° \ h(x)e~x
V2TT 7-oo
/2
dx,
where X is the standard normal variable. Let Zv be a set of random variables and Uv = {j : Zj is dependent of Zv],
Vv = ^ Zj, jeuv
Uvj = {k : Zk is dependent of Zv or Zj},
s = y j z v,
s v — s — v v,
Vvj = V^ ^k'
sv<j = s — vvj.
V
Lemma 3.1: Use the above notation and assume that E[ZV] = 0 and E{S2] = ZVE[ZVVV] = 1. Then
di (S, X) < CJ2 5 3
J2
v j€U» keUVljuU»
(E\ZvZjzk\ + E\ZvZj\E\Zk\).
166
Z. D. Bai, L. Devroye and T. H. Tsai
The lemma is essentially the same as Theorem 6.31 of Janson et al (2000, Page 158). Split R+ into cubes of edge-length Sn where Sn is a small positive number to be specified later. Let Zv denote the number of maxima of the Poisson process falling in the cell Tv (only cubes intersecting with T are counted). Set Kn,w = /
J
Zv.
v
and K,w = (Kn<w - E[Kn,w])/y/y[KntW]
= J2(ZV -
E[Zv])/y/v[KntV,].
v
Replacing Zv in Lemma 3.1 by (Zv — TE[ZV])j'\/V[KnjW], we obtain
E
(nzvZjZk)
jeuvkeuVijuuv
+E[ZvZj]E[Zk] + E[Zv]E[Zj]E[Zk]).
(3.2)
We now show that (i) If v ^ j , then nzev*Z?) < E[Zev*Nf>] < E[Z£]E[N?]
(h,h = 1,2),
where Nj is the number of Poisson process points falling in the region This follows from the fact that E[Z^1 \Nj — m] is decreasing in m. (ii) If v, j , k are pairwise distinct, then E[ZvZjZk] < E[ZvZjNk] < E[Z,,]E[j\r,-]E[^Vfc]. Similar to the proof for (i), ¥.[ZvZj\Nk = m] is a decreasing function of m. Thus, E[ZvZjNk] < E[ZvZj]E[Nk}. Then (ii) follows from (i). Substituting these into (3.2), we obtain di(K!W,X)
< CV[Kn,w}-l i^E[Z3v] + ]TE[^] 2 J2 E[N,} \
V
V
j£Uv
+Y,nzv]1ZnNi] v
jeUv
E k£UvtjUUv
E
^ ] | - (3-3) J
167
An application of Stein's method to maxima in hypercubes
Recall that V[Kn,w] x (Inn)"*-1. Define V
pv = / ne"l|y'ldy, D
( l n n ) ^ 1 ne _ a
-llyllj
f
Pn = J^e M* ~ -^3T)f
a(lnn)''- 1 lnlnn
(rf-i)!
'
Then we have v
u m>l
- E E E(Z"liV- = m]m2^e-P" v m>l
< 9 V V EfZ^lAT, = m]^e-P" + V V m 3 ^e-P" f
m>l
'
« 7n>4
<9Mn + 5 ^ p ' <9Mn + 5maxp^Pn. V
(Recall that a = Inn - ln(4(rf - 1) In Inn)). If we choose Tv (i.e. 6n) so small that maxp^Pn < 1/5, V
then
£E[^ 3 ]<9M n + l. Similarly, we can prove that V
Combining the above estimates, we have diiK'^X)
< CV[KniVI))-i (Mn(l + Qi + Ql) + 1),
where Qi = max Y" E[Nj] Q2 = max
Y feGt/UlJ-Ut/^
E[Nk}.
168
Z. D. Bai, L. Devroye and T. H. Tsai
On the other hand, Q\ < Q2 and Q2 = OUlnlnnf-1
j
ne~xdzJ = O((lnlnn)d).
Therefore, we conclude that di(K*iW,X) = O ( ( l n l n n ) 2 d ( l n n ) - ^ ) . References 1. Z.-D. BAI, C.-C.
CHAO, H.-K.
HWANG & W.-Q.
LIANG (1998) On the
variance of the number of maxima in random vectors and its applications. Ann. Appl. Probab. 8, 886-895. 2. Z.-D. BAI, L. DEVROYE, H.-K. HWANG & T.-H. TSAI (2004) Maxima in hypercubes. Rand. Struct. Alg. (to appear). 3. Z.-D. BAI, H.-K. HWANG, W.-Q. LIANG & T.-H. TSAI (2001) Limit theorems for the number of maxima in random samples from planar regions. Electron. J. Probab. 6, Article 3, 41 pp. 4. A. D. BARBOUR k. A. XiA (2001) The number of two dimensional maxima. Adv. Appl. Prob. 33, 727-750. 5. O. BARNDORFF-NIELSEN & M. SOBEL (1966) On the distribution of the
number of admissible points in a vector random sample. Theor. Probab. Appl. 11, 249-269. 6. Y. BARYSHNIKOV (2000) Supporting-points processes and some of their applications. Probab. Theory Rel. Fields 117, 163-182. 7. S. JANSON, T. LUCZAK & A. RUCINSKI (2000) Random Graphs. John Wiley & Sons, New York. 8. Yu. V. PROHOROV (1953) Asymptotic behavior of the binomial distribution. In: Selected Translations in Mathematical Statistics and Probability, Vol. 1, pp. 87-95. ISM and AMS, Providence, R.I. (1961); translation from Russian: Uspehi Matematiceskih Nauk 8 no. 3 (35), 135-142.
Exact expectations of minimal spanning trees for graphs with random edge weights
James Allen Fill and J. Michael Steele Department of Applied Mathematics and Statistics The Johns Hopkins University, Baltimore MD 21218-2682 E-mail: [email protected] and Department of Statistics, Wharton School University of Pennsylvania, Philadelphia PA 19104 E-mail: [email protected] Two methods are used to compute the expected value of the length of the minimal spanning tree (MST) of a graph whose edges are assigned lengths which are independent and uniformly distributed. The first method yields an exact formula in terms of the Tutte polynomial. As an illustration, the expected length of the MST of the Petersen graph is found to be 34877/12012 = 2.9035.... A second, more elementary, method for computing the expected length of the MST is then derived by conditioning on the length of the shortest edge. Both methods in principle apply to any finite graph. To illustrate the method we compute the expected lengths of the MSTs for complete graphs. 1. Introduction to a formula Given a finite, connected, simple graph G, we let v(G) denote the set of vertices and let e(G) denote the set of edges. For each edge e G e(G) we introduce a nonnegative random variable £e which we view as the length of e, and we assume that the edge lengths {£e : e e e(G)} are independent with a common distribution F. If S(G) denotes the set of spanning trees of G and I denotes the indicator function, then the random variable LMST(G)=TminG)££eI(eeT) is the length of the minimal spanning tree of G. In Steele (2002) a general formula was introduced for the expected value of LMST(G). 169
170
James Allen Fill and J. Michael Steele
Theorem 1.1: If G is a finite connected graph and the Tutte polynomial of G is T(G;x,y), then for independent edge lengths that are uniformly distributed on [0,1], one has Mr
f1(l-p)Tx(G;l/p,l/(l-p))
(W
E[LMST(G)]
=
r(G;iMi/(i-p)) * '
Jo - T
(L1)
where Tx(x, y) denotes the partial derivative ofT(x, y) with respect to x. The next few sections will provide a self-contained proof of this result. The formula is then applied to several concrete examples, including—for novelty's sake—the famous Petersen graph. We then focus on Kn, the complete graph on n vertices. In particular, we show how a conditioning argument based on first principles can be used to compute several expectations which were first found from Tutte polynomials. 2. MSTs and connected components For any finite graph G and any subset A of the edge set e(G), we let k(G, A) denote the number of connected components of the graph with vertex set v(G) and edge set A. If each edge e G G is assigned length £e, it is often useful to take A to be the set of short edges denned by et(G) := {e £ e(G) : £e < t}. For continuously distributed edge lengths, the minimum spanning tree is (almost surely) unique, and the sum
NMST(G,t)=
J2
!(&<*),
e£MST(G)
gives us the number of edges of the MST of G that are elements of et(G). Now, if G is a connected graph, then by counting the number of elements of et(G) in each connected component of (G,et(G)) one finds the formula NMST(G,t) +
k(G,et(G))=n,
and this yields a convenient representation for LMST{G)- Specifically, we
Expected length of an MST
171
have /-I
LMST{G) = Y^^HeeMST{G))^Yl e£G
/ J (* < &> e e MST(G) )dt
v 0
e€G '
/•I
= / 5^[l(e€MST(G))-I(C e <«, eeMST(G))]
= / [n - 1 - iVMST(G, t)] (ft = / [k(G, et(G)) - 1] dt, Jo Jo so we find / k{G,et(G))dt. (2.1) Jo Versions of this formula go back at least to Avram & Bertsimas (1992), but here we take away a new message. For us the main benefit of the formula (2.1) is that through k{G,et(G)) it suggests how one should be able to calculate E[LMST(G)] with help from the Tutte polynomial of G. 1 + LMST(G)=
3. The Tutte Polynomial: a review of basic facts Given a graph G (which may have loops or parallel edges), the Tutte polynomial T(G; x, y) may be computed by successively applying the four rules: (1) If G has no edges, then T(G; x, y) = 1. (2) If e is an edge of G that is neither a loop nor an isthmus*, then T(G; x, y) = T{G'e;x, y) + T{G'l\ x, y), where G'e is the graph G with the edge e deleted and G" is the graph G with the edge e contracted. (3) If e is an isthmus, then T(G; x, y) = xT(G'e;x, y). (4) If e is a loop, then T(G;x,y) = yT{G"e\x,y). SOME INSTRUCTIVE EXAMPLES
Rules (1) and (3) tell us that the Tutte polynomial of K2 is just x. In fact, by n — 1 applications of Rule (3) one finds that the Tutte polynomial of any tree with n vertices is xn~1. The rules are more interesting when contractions are required, and it is particularly instructive to check that the Tutte polynomial of the triangle graph K3 is x + x 2 + y. *A loop is an edge from a vertex to itself, and an isthmus is an edge whose removal will disconnect the graph.
172
James Allen Fill and J. Michael Steele
For a more complicated example, one can consider the Tutte polynomial of a bow tie graph G which is denned by joining two copies of K3 at a single vertex. For the bow tie one finds
T(G;x,y) = (x + x2 + y)2, and this formula has an elegant (and easily proved) generalization. Specifically, if G and H are two graphs that have one vertex and no edges in common, then T(G U H; x, y) = T(G; x, y)T(H; x, y).
(3.1)
As a corollary of this observation we see that the product of Tutte polynomials is always a Tutte polynomial. T H E RANK FUNCTION AND MEASURES OF CONNECTEDNESS
The rank function r ( ) of a graph G is a function on the subsets of e(G) which associates to each A C e{G) the value
r(A) = \v(G)\-k(G,A), where, as before, k(G, A) is the number of connected components of the graph with vertex set v(G) and edge set A C e(G). The rank function provides a measure of the extent to which the graph (v(G), A) is disconnected, and it also provides an explicit formula for the Tutte polynomial, T(G;X,y)=
^
(x - l)r(e(G))-r(A)(y _ ^\A\-r(A)m
(33)
ACe(G)
One simple consequence of this formula is that T(G;2,2) = 2^G^, a fact that is sometimes used to check a Tutte polynomial that has been computed by hand.
4. Connection to the probability model For a connected graph G, one has
r(A) = \v(G)\-k(G,A)=n-k(G,A)
and r{e(G)) = n - 1,
so if we set m = \e(G)\ then the sum formula (3.2) may be written as T{G y)=
£ (V-l)M[(*-V
^ (x-l)(v-l)n V
ni>
'
ACe(G)
(z-i)0/-i)"^G)V y )
\y)
[{x
1)iy
1)J
•
173
Expected length of an MST
If one now sets p=
y
and l - p = - , y
(4.1)
one gets a sum of the form
J2 PW(l-p)m-Wf(A).
(4.2)
ACe(G)
Any such sum has an obvious interpretation as a mathematical expectation, and such reinterpretations of the Tutte polynomial are often used in statistical physics. More recently they have been used in studies of the computational complexity of the Tutte polynomial (see, for example, Welsh (1999) or Lemma 1 of Alon, Frieze, & Welsh (1994)). 5. A moment generating function and a moment The reformulation in formula (4.2) also turns out to give an almost automatic path to a formula for E[LMST(G)}. The first factor in the sum is equal to the probability (under the uniform model) that one has £e < p for exactly those edges in the set A, so if one takes A = ep(G) = {e:ee
e(G), & < p}
then the moment generating function V (t)
=
can be written in terms of T(G; x, y). Specifically one has
^t)=pn-1(l-Pr-n+1etT^G;l
+ et^,T^j,
(5.1)
and this formula tells almost everything there is to say about the distribution of k(G, ep(G)). In particular, it yields a quick way to calculate E[k(G,ep(G))]. If we retain the abbreviations (4.1), then (taking 1 + [e*(l -p)/p] for x) we have
When we let t — 0, we find for x = 1/p and y = 1/(1 - p) that
w
«-H!rS
(52)
174
James Allen Fill and J. Michael Steele
Finally, from the definitions of x and y together with the representation (2.1) for LUST(G) in as an integral of k(G, ep(G)), we find w\r
^1
/•1(l-P)r,(g;l/p,l/(l-P))
E[W(G)1 = JQ -—
Y——F-—
(5.3)
dp,
and the proof of Theorem 1.1 is complete. 6. Simple—and not-so-simple—examples As noted in Steele (2002), there are some easy examples that help to familiarize the formula (5.3). In particular, if G is a tree with n vertices, then T(G;x,y) = a;""1 and the integral works out to be (n — l)/2, which is obviously the correct value of E[LMST(G)]. This fact may also be seen as a corollary of the product rule (3.1); if graphs G and H share only one common vertex and no edges, then the graph GUH has Tutte polynomial T(G; x, y)T(H; x, y) and formula (5.3) tells us that E[LMST(G U H)\ = E [ L M S T ( G ) ] + E [ L M S T ( # ) ] ,
a fact which can also be seen directly from the definition of the MST. For the complete graph on three vertices we have already seen that one has T(K3) = x + x2 + y, and the integral (5.3) yields E[LMST(K3)} = 3/4, which one can also check independently. Also, for K4 the Tutte polynomial can be found by hand to be T(Ki; x, y) = 2x + 2y + 3x2 + 3y2 + Axy + x3 + y3, and when this polynomial is used in the integral formula (5.3), we find E[LMST(K4)}
=
g.
A still more compelling example than K4 is given by the famous Petersen graph that is illustrated by Figure 6.1. The Tutte polynomial T(G; x, y) of the Petersen graph is not easily found by hand, but the Maple program tuttepoly reveals that it is given by 36x + 120a:2 + 180x3 + 170a;4 + 114a:5 + 56a:6 + 21x7 + 6xs + x9 + 36y + 84y2 + 75y3 + 35y4 + 9y5 + y6 + 168xy + 2A0x2y + 170x3y+ 70x4y + 12x5y + 171xy2 + 105x2y2 + 30x3y2 + 65xy3 + 15x2y3 + 10xy4. From the formula (1.1) for the expectation, symbolic calculation then gives E[LMsT(Petersen)] = ^ ^
= 2.90351....
(6.1)
175
Expected length of an MST
Fig. 6.1. The Petersen graph can be used to illustrate several graph-theoretic concepts. In particular, it is non-Hamiltonian and nonplanar, yet it can be embedded in the projective plane.
7. Complete graphs and two problems For the complete graph Kn the Tutte polynomial T(Kn;x,y) is known for moderate values of n, but the complexity of the polynomials grows rapidly with n. The polynomials for n = 2 , 3 , . . . , 8 are given by Gessel & Sagan (1996) and Gessel (personal communication) has extended the computation to n = 15. These polynomials and formula (1.1) can be used to compute E[LMST(Kn)}.
n ~2 3 4 5 6 7 8 _9_
E[Z/MST(-Kn)]
Numerical Value
Forward Difference |
T]2 3/4 31/35 893/924 278/273 30739/29172 199462271/184848378 126510063932/115228853025
0.50000 0.75000 0.88571 0.96645 1.01832 1.05372 1.07906 1.09790
0.250000 0.135714 0.080736 0.051864 0.035401 0.025343 0.018844 j
176
James Allen Fill and J. Michael Steele
From the tabled values it is natural to conjecture that E[LusT(Kn)] forms a monotone increasing concave sequence. One might hope to prove this result with help from the integral formula (1.1) and the known properties of the Tutte polynomials for Kn, but this does not seem to be easy. This problem was raised at the conference Mathematics and Computer Science II at Versailles in 2002, but no progress has been made. The tabled values may also tell us something about a remarkable result of Frieze (1985) which asserts that \\mE\LMST{Kn)}
= C(3) = 1-202....
The rate of convergence in Frieze's limit is not known, but the declining forward differences in this table for E[LusT(Kn)] suggest that the rate of convergence in Frieze's theorem may be rather slow. Unfortunately, the computational complexity of the Tutte polynomial makes it unlikely that any of these problems will be resolved by the integral formula (1.1). As a consequence, it seems useful to consider alternative approaches to the calculation of E[LMST(G)} which are closer to first principles. 8. A recursion and its applications If the random variables U\, U2, • • •, Un are independent and uniformly distributed on [0,1], and if E denotes the event that {Un = min(tf l t U2, ...,Un)
and Un = y},
then conditional on E the random variables U\, U2, • • •, Un-\ are independent and uniformly distributed on [y, 1]. This elementary observation suggests a recursive approach to the computation of E[Z, M S T(G)]. The recursion is perhaps most easily understood by first considering K3. For the moment, let D denote the graph with two vertices v\ and V2 and two parallel edges between v\ and V2- Under the uniform model U[0,1] for edge lengths we obviously have EU[0'1][LMST(D)]
=Ef/[°'1][min(C/1,[/2)] = 1/3.
On the other hand, if the edge lengths of D are chosen uniformly in the interval [y, 1], then we have EU^[LMST(D)}
=y + (l-
y)Eu^[LMST(D)}
= (1 + 2y)/3.
Since the shortest edge in a graph is always in the minimal spanning tree, these observations suggest that if we condition on the length of the shortest edge, then we might well expect to find a useful recurrence relation.
177
Expected length of an MST
If Y denotes the length of the shortest edge in K3 under the uniform model, then the density of Y on [0,1] is just 3(1 — y)2. Also, after we take the shortest edge (u, v) of K3 as an element of our MST, the rest of the cost of the MST of K3 is simply the cost of the MST of the graph obtained from Kz by identifying u and v and removing the resulting loop. This observation gives us the identity E[£MST(^3)]
= / E[LUST(K3)\Y Jo =f
= j,] {3(1 - y)2} dy
[y + Eu^[LUST(D)})
3(1 -y)2dy
= -A,
(8.1)
which fortunately agrees with the value given twice before. To apply this idea more broadly one just needs to introduce appropriate notation. NOTATION AND A GENERAL RECURSION
If A = (a,ij)nxn is a symmetric n-by-n matrix with a^ S {0,1,...} and all diagonal entries equal to zero, then we let G(A) denote the loopless graph with vertex set {v\,v2, • • • ,vn) such that for each 1 < i < j < n there are a^ parallel edges between Vi and Vj. Also, for each 1 < i < j < n such that aij > 1, we let G ( A ^ ) denote the graph with n — 1 vertices which is derived from G(A) by first identifying vertices i and j and then removing the resulting a^j loops. If G(A) is connected, then the length of the MST is well-defined for both G(A) and G(A (ij) ), and the quantities # A ) = EU^{LMST(G(A))}
and 0(A<«>) = E ^ - ^ X M S T ^ A ^ ) ) ]
are connected by a simple linear identity. Theorem 8.1: For all n > 2, one has the recursion ,(A)=-1;^-<^(A"J'). 1
+ Z^l
(8.2)
Proof: To follow the pattern that we used for K3, we first let Y denote the length of the shortest edge in G(A). We then let m = £^ a^ > 1 denote the number of edges in G(A) and condition on both the value of Y and the event that it is edge e which has length Y. This gives us the integral formula
f Eu^[LMs'T(G(A))\Y = y,Ce =
g Jo
y}m(l-y)m-1dy.
178
James Allen Fill and J. Michael Steele
Now, if e is one of the a^ edges between Vi and Vj, then we have Et/t°'1][iMST(G(A))|y = y,ie = y) = y +
E^'^LMSTO^A^))]
= y+ {(n-2)j/+(l-y)0(A^)}. Thus, after working out the integral, we find that >(A) is equal to
I m
E
i
aij {HLZl + -JtL-KAM)) 1 m
W
+!
=n 1
l
+E^"Q^(Afa 1 + El^^nflO-
) )
• •
A PRELIMINARY EXAMPLE
It is obvious from first principles that for a tree Tn with n vertices one has E ^ ' ^ L M S T ^ ) ] = (n - l)/2, but it is instructive to see how the
recursion (8.2) recovers this fact. If G(A) = Tn, then for each 1 < i < j < n such that a^ = 1 the graph G(A^) is a tree with n — 1 vertices. By the recursion (8.2) and induction, one then sees that Ec/[°'1][LjviST(T'n)] depends only on n—and not on the particular structure of the tree Tn. The recursion (8.2) therefore asserts
E^[W(r n) ] = ("-iXi + ^ W T ^ ) ] ^ and this leads us directly to E£/[°'1][LMST(Tn)] = (n - l)/2. MORE REVEALING EXAMPLES
The analysis of the complete graphs calls on more of the machinery of the recursion (8.2) for >(A), but the story begins simply enough. For n = 2 the strict upper triangle of A has just the single entry o = ai2, and in this special case write (f>(a) instead of
EU^[LMST(K3)]
= ^ ( X J ) = [ 2 + 30(2)]/4 = 3/4,
179
Expected length of an MST
a result that is already quite familiar. The analogous calculation for K4 is only a little more complicated. One application of the recursion (8.2) gives
=
Vu[0
and a second application gives 0
(2 l)
= [2 + # ( 3 ) + m ] / 6 = 8 / 1 5 >
so in the end we find E ^ ' ^ L M S T C - K ^ ) ] = 31/35, which fortunately agrees with the table of Section 7. For K5 the computation is not so brief, but the pattern is familiar. To begin we have /I 1 1 1\
V
1/
and one application of the recursion (8.2) brings us to 1 F Ev[0'1][LMsT(/f5)] = YJ 4 + 1 0 0
/222V 11 .
The second application now gives
s+
'C")-M «*O"(")]and we also have / 3 3 \ _ 2 + 6^>(4) + <^>(6) _ 2 + (6/5) + (1/7) _ 117 ^ \ l) ~ 8 ~ 8 ~280 and / 4 2 \ __ 2 + 40(4) + 40(6) _ 2 + (4/5) + (4/7)
^ V 2J ~
9
~
9
118
~ 315'
from which we find , /
2 2 2
\
*{ ' J J
3 + (351/140)+ (118/105)
=
io
=
557
Wo-
180
James Allen Fill and J. Michael Steele
After feeding these intermediate quantities into our first identity, we find at last ,„» 4 + (557/84) 893 vu[o,ihT 9. Concluding remarks Neither the elementary recursion (8.2) nor the integral formula (1.1) seems to provide one with an easy path to the qualitative features of E[LMST(-Kn)]i though they do provide tantalizing hints. The sequence E[LusT(Kn)] is quite likely monotone and concave, but a proof of these properties would seem to call for a much clearer understanding of either the Tutte polynomials T(Kn; x, y) or the intermediate quantities
References 1. N. ALON, A. FRIEZE & D. WELSH (1994) Polynomial time randomized approximation schemes for the Tutte polynomial of dense graphs. In: Proceedings of the 35th Annual Symposium on the Foundations of Computer Science, ed: S. Goldwasser, pp. 24-35, IEEE Computer Society Press. 2. F. AVRAM & D. BERTSIMAS (1992) The minimum spanning tree constant in geometric probability and under the independent model: a unified approach. Ann. Appl. Probab. 2, 113-130. 3. A. M. FRIEZE (1985) On the value of a random minimum spanning tree problem. Discrete Appl. Math. 10, 47-56. 4. I. M. GESSEL & B. E. SAGAN (1996) The Tutte polynomial of a graph, depth-first search, and simpicial complex partitions. Electron. J. Combinatorics 3, Research Paper 9, (36 pp.). 5. J. M. STEELE (2002) Minimal Spanning Trees for Graphs with Random Edge Lengths. In: Mathematics and Computer Science II: Algorithms, Trees, Combinatorics and Probabilities, eds: B. Chauvin, Ph. Flajolet, D. Gardy, and A. Mokkadem, Birkhauser, Boston. 6. D. WELSH (1999) The Tutte polynomial. Rand. Struct. Alg. 15, 210-228.
Limit theorems for spectra of random matrices with martingale structure
F. Gotze and A. N. Tikhomirov Fakultdt fur Mathematik, Universitat Bielefeld D-33501 Bielefeld, Germany E-mail: [email protected] and Faculty of Mathematics, Syktyvkar State University Oktjabrskyi prospekt 55, 167001 Syktyvkar, Russia E-mail: [email protected] We study two classical ensembles of random matrices introduced by Wigner and Wishart. We discuss Stein's method for the asymptotic approximation of expectations of functions of the normalized eigenvalue counting measure of high dimensional matrices. The method is based on differential equations for the densities of the limit laws.
1. Introduction The goal of this note is to illustrate the possibilities of Stein's method for proving convergence of empirical spectral distribution functions of random matrices. We consider two ensembles of random matrices: real symmetric matrices and sample covariance matrices of real observations. We shall give a simple characterization of both semicircle and Marchenko-Pastur distributions via linear differential equations. Using conjugate differential operators, we give a simple criterion for convergence to these distributions. Furthermore we state general sufficient conditions for the convergence of the expected spectral distribution functions of random matrices. As applications we recover some well-known theorems for bounded ensembles of random matrices (Corollaries 1.7 -1.8 and 2.7-2.8 below). For a more detailed discussion of the literature we refer the readers to surveys by Pastur (2000, 2004), Bai (1999) and the publications of the authors, Gotze & Tikhomorov (2003a; 2004a). The mentioned results are based on the paper Gotze & Tikhomorov (2003b, 2004b). 181
182
F. Gotze and A. N. Tikhomirov
1.1. Real symmetric matrices Let Xjk, 1 < j < k < oo, be a triangular array of random variables with EXjk
= 0 a n d EXfk
= a2jk,
a n d l e t Xkj = Xjk,
forl<j
oo. For a
fixed n > 1, denote by Ai < . . . < \ n the eigenvalues of a symmetric n x n matrix W n = (Wn(j, k))lk=1,
Wn(j, k) = -j^Xjk, for 1 < j < k < n,
(1.1)
and define its empirical spectral distribution function by
(1.2)
Fn(x) = -f^I{Xjix},
where I{B} denotes the indicator of an event B. We investigate the convergence of the expected spectral distribution function, EFn(x), to the distribution function of Wigner's semicircle law. Let g{x) and G{x) denote the density and the distribution function of the standard semicircle law, that is
1 /
9(x) = — V 4 - x 2 /{|X|<2}, Z7r
G(x) =
fx J-oo
g(u)du.
(1.3)
1.2. Stein's equation for the semicircle law Let C(R), resp. C 2 (5), B e l , denote the class of continuous functions on R resp. the class of differentiate functions on B with bounded first derivatives on all compact subsets of B. Introduce now the following class of functions CJ-2,2} = {/ : R - K : / € CX(R \ {-2,2}); 5m|»Hooll//(y)l <»=>; limsup|4-2/ 2 ||/'(y)| < C}. The following lemma guarantees the existence of a solution of a differential equation equation, which will be motivated below. Lemma 1.1: Assume that a bounded function f(x) without discontinuity of second kind satisfies the following conditions tp(x) is continuous in the points x = ±2
(1.4)
and r2
I
(1.5)
183
Spectra of random matrices with martingale structure
Then there exists a function f € C|_ 2 2 , such that, for any x ^ ±2,
(4-x2)f(x)-3xf(x)=
(1.6)
If(p(±2) = 0 then there exists a continuous solution of the equation (1.6). Similar to C. Stein's (1986) approach to proving the Central Limit Theorem, we show the following characterization of Wigner's limit law via an adjoint differential equation to the first order equation (4-x2)g'(x)=xg(x),
(1.7)
for Wigner's limit density g. As a a immediate consequence of the last equation, partial integration and the above Lemma we get Proposition 1.2: The random variable £ has distribution function G{x) if and only if the following equality holds, for any function f € Cj_ 2 2 ,, E(4 - Z2)f'(Q - 3E£/(0 = 0.
(1.8)
1.3. Stein criterion for random matrices
Let W denote a symmetric random matrix with eigenvalues Ai < A2 < • • • < An. If W = U~ 1 AU, where U is an orthogonal matrix and A is a diagonal matrix, one defines / ( W ) = U~ 1 /(A)U, where /(A) = diag(/(Ai),...,/(A n )). When f has a distribution approximating Wigner's distribution, (1.8) will be approximately zero. For example, let £ be conditionally uniform distributed on the spectra {A 1; ..., An} of the given random matrix W. It is easy to see that P{£ < x} — EFn(a;). Note that for indicator functions
faffi = Ite < x) with x # ±2 and ?(£) := £(fate) - G(x)) w e h a v e
(1.9)
Eipte) = EFn(x) - G(x).
By averaging the lefthand side of (1-6) with respect to the expected empirical distribution Fn and rewriting the mean with respect to Fn by traces of functions of symmetric matrices we arrive at the following result. Theorem 1.3: Let W denote a sequence of random matrices of order nxn such that, for any function f £ C|_ 2 2 ,, iETr(4I n - W 2 )/'(W) - -ETrW/(W) -+ 0, n n
as n -> 00.
(1.10)
184
F. Gotze and A. N. Tikhomirov
Then An := sup \EFn(x) - G(x)\ -> 0, as n -> oo.
(1.11)
X
Here and throughout the paper Tr denotes the trace of an n x n matrix. A sufficiently rich class of differentiable functions for which it suffices to check the condition in the above Theorem is given by the imaginary part of x —> l/(x — z), z € C \ R. In the following we will study this class in detail. 1.4. The Stieltjes transform of a semicircle law Introduce the Stieltjes transform of a random variable £ with distribution function F(x), for any z = u + iv, v / 0, via
T(z) = E-i- = r ?-•*
—dF{x).
(1.12)
J-^x-z
Note that T(z) is analytic for non-real z and satisfies the conditions ImT-Imz>0,
Imz^O,
supj/|T(iy)| = 1.
(1.13)
v>i
It can be shown that for any continuous function ip(\) with compact support /•OO
-I
/'OO
/
S(s) = I"0 —dG(x) = E/z(0, S'(z) = -E(/Z(O)L =
Vj-^,
(1.15) where £ is a random variables with distribution function G(x). The differential equation for the Stieltjes transform of the semicircle law which follows from (1.10) is given by (z2 - A)S'{z) - zS(z) - 2 = 0.
(1.16)
185
Spectra of random matrices with martingale structure
1.5. Resolvent criterion for the spectral function of random matrix
distribution
We shall now relax the convergence condition in Theorem 1.1. to the special class of functions / introduced above. Introduce the resolvent matrix for a symmetric matrix W, for any nonreal z, (1.17)
R(z) = (W-zir\
where I denotes the n x n identity matrix. Then Theorem 1.1. may be strenghened as Theorem 1.4: Assume that, for anyv ^ 0, ftn(W)(z) := iETr(4I - W 2 )R 2 (z) + -ETrWR(z) -> 0, as n -> oo n n (1.18) uniformly on compacts sets in C\R. Then A n -> 0, a s n - » oo.
(1.19)
We shall now discuss the main application of this approach for ensembles of matrices with possibly dependent entries. 1.6. Convergence of expected random spectral to the semi circular law
distributions
We shall assume that EXjt = 0 and a2jL := EX2,, for 1 < j < I < n. Introduce cr-algebras Til = cr{Xkm : 1 < k < m < n, {k,m} ^ {j, I}}, 1 < j < I < n, and T> = a{Xkm : 1 < k < m < n, k ^ j and m ^ j}, 1 < j < n. We introduce as well Lindeberg's ratio for random matrices, that is for any r > 0, L
(1.20)
n(T):=^Y,VXll{]Xjl>T^}. j,l = l
Classical martingale limit theory motivates the following result. Set X^
= XjtI{\XjiI
< TV^} - XjtI{\XjtI
< ry^},
r > 0.
186
F. Gotze and A. N. Tikhomirov
Theorem 1.5: Assume that the random variables Xji,\ < j < I < n, n > 1 satisfy the following conditions, for any r > 0: E{4 T ) |JP'} = 0, £ ]
n
(1.21) E|E{X 2 ,|JP}-<7 2 ,|-+0
E
•= \
43) - i E
as n -»oo,
(1.22)
E E|E{((X^)2-E(X«)2) x ((*i T ) ) 2 - E(X^ T) ) 2 )|^,}| -> 0 as n -» oo,
max
1 n max — c7 ^ cr% < C < 00, J
l
there exists a 2 > 0 such that
4 2) :=^ E I 4 - " 2 ! - 0 ^ n^oo,
(1.23)
1<J<2<JI
and L n (r) -+ 0 as n —> 00 for any fixed T > 0.
(1-24)
Then A n := sup |EFn(a:) - G{x<7~1)\ ^ 0
as n -> 00.
(1.25)
Remark: Note that condition (1.23) implies that lim •iE||W n || 2 = (7 2
(1.26)
n—>oo n
where we denote by ||W|| the Frobenius norm of a matrix W:
HW||2 = f > / = J2 WttM2-
(1-27)
Usually one considers only the case cr2; =
187
Spectra of random matrices with martingale structure
scheme of enlarging W by an (n + l)st random row and column, which are martingales with respect to the cr-algebra generated by W. Corollary 1.6: Let Xij, 1 < I < j < oo be independent and EX;., = 0, EX^. = a2. Assume that, for any fixed r > 0, Ln(r) -> 0,
as n -> oo.
(1.28)
Then the expected spectral distribution function of a matrix W converges to the distribution function of the semicircle law, that is An :=sup|EF n (x)-G(x<J- 1 )| -> 0,
as n -> oo.
(1.29)
Another application of Theorem 1.5 concerns the distribution of spectra for the unitary invariant ensemble of real symmetric nx n matrices W n = (Xij) induced by the uniform distribution on the sphere of the radius %/iV in RN with TV = "lit!), that is
J2 (4 n) ) 2 = iV-
l
( L3 °)
Rosenzweig (1962) has defined this class of random matrices as a fixed trace ensemble and proved the semicircle law using Wigner's method of moments. This class has been described in Mehta (1991), Ch. 19, as well. Corollary 1.7: Let X\•" , 1 < I < j < n be distributed as above, for any n > 1. Then A n := sup \Fn{x) - G(x)\ -> 0 as n -> oo.
(1.31)
X
We may consider the ensemble of real symmetric n x n matrices determined by the uniform distribution in the ball of radius y/N in RN with N = " ( " 2 +1) , that means that
£
(Xlf)2
(1.32)
l
This class of random matrices was introduced by Bronk (1964) as a bounded trace ensemble. The eigenvalue density for the bounded trace ensemble is identical to the density of zeros of Hermite-like polynomials. Using this fact, Bronk proved the semicircle law for such matrices. See also Mehta (1991), Ch. 19. Theorem 1.5 implies the following result.
188
F. Gotze and A. N. Tikhomirov
Corollary 1.8: Let x[™ , 1 < I < j < n be distributed uniformly in the ball of the radius VN in RN with N = " ( " 2 +1) , for anyn>l. A n -> 0,
as n -> oo.
Then (1.33)
The arguments used above may be extended to the following case of sample correlation matrices. 2. Sample covariance matrices Let Xjk, 1 < j,k < oo, be random variables with EXjk = 0 and EX?fc =
/ l
(2.1)
•
Denote by Ai < . . . < An the eigenvalues of the symmetric n x n matrix Wn = -XKT, P and define its empirical spectral distribution function by
(2.2)
^W^E 7 ^*}.
(2-3)
where I{B) denotes the indicator of an event B. We investigate the convergence of the expected spectral distribution function EFn(x) to the distribution function of Marchenko-Pastur law. Let ga(x) and Ga(x) denote the density and the distribution function of the Marchenko-Pastur law with parameter a € (0, oo), that is ga(x) = — y / { x - a ) { b - x )
I{xe[a,b}},
Ga(x) = /
ga{u)du, (2.4)
where a = (1 - y/a)2, b = (1 + ^/a)2. As in the preceding section, Stein's method may be applied here as well. 2.1. Stein's equation for the Marchenko-Pastur
law
Introduce a class of functions
C\aM = {/ : R -> R : / € CX(R \ {a, &}); B5| vHoo |y/(y)| < oo; limsup | i ( a - 6 ) 2 - ( y - i ( a + 6))2|
\f(y)\
189
Spectra of random matrices with martingale structure
First we state the following Lemma 2.1: Let a ^ 1. Assume that a bounded function
(2.5)
and ,6
/ ip{u)ga{x)du = 0. (2.6) Ja Then there exists a function f € CI 6 , such that, for any x ^ a or x ^ b, (x - a)(b - x)xf(x)
- 3x(x - i ( a + b))f(x) =
(2.7)
If ip(a) = 0 (ip{b) = 0) then there exists a continuous solution of the equation (2.7). Remark: Note that for a = 1 we have a = 0 and 6 = 4. Here we may rewrite equality (2.7) as follows z2(4 - x)f'(x) - 3x(x - 2)f(x) = ip{x).
(2.8)
Proposition 2.2: The random variable £ has distribution function Ga(x) if and only if the following equality holds, for any function f £ C) b ,,
m - a)(b - t)tf'{0 - Z^{i - \{a + 6))/(0 = 0.
(2.9)
2.2. Stein's criterion for sample covariance matrices Let W denote a sample covariance matrix (as before) with eigenvalues 0 < A! < A2 < • • • < An. If W = U ^ A U , where U is an orthogonal matrix and A is a diagonal matrix, one defines / ( W ) = U~ 1 /(^-)U, where /(A) = diag(/(A 1 ),...,/(A B )). We can now state a result for the convergence to the Marchenko-Pastur law for the spectral distribution function of such empirical covariance matrices. Theorem 2.3: Let Wn denote a sequence of sample covariance matrices of order n x n such that, for any function f e Cj o 6 i, iE1V(Wn - aln)(bln - W n )W n /'(W n ) n - -ETVW n (W n - ^ J / f W , ) -» 0, n I
as n -» oo. (2.10)
190
F. Gotze and A. N. Tikhomirov
Then A n := sup |EFn(a;) - Ga(x)\ -> 0, as n -> oo.
(2.11)
2.3. The Stieltjes transform of a Marchenko-Pastur law Consider again the functions fz(x) = ~r£i for any non-real z. Recall (fz(x))'x = ~(fz{x))'z = — (xlz\i and denote by Sa(z) the Stieltjes transform of the Marchenko-Pastur law with parameter a. Applying equation (2.9), we obtain E^-a)(6-O+3E^-^)=0
(2.12)
This equation yields similarly as above a differential equation for the Stieltjes transform of the Marchenko-Pastur law, that is z(z - a){b - z)S'a(z) + {\z{a + b) - ab)Sa(z) + z - \{a + b) + 2 = 0. (2.13) Restricting the class Cl Stieltjes transforms.
6,
to special functions as above we again use
2.4. Resolvent criterion for sample covariance matrices As in the case of Wigner matrices we may restrict ourselves to the resolvent R(z) of the sample covariance matrices W. We have Proposition 2.4: Assume that, for any v ^ 0, Hn(W)(z) : = i E T r W ( W - al)(61 - W)R 2 (z) n + -ETr(W - ^ I ) W R ( z ) -> 0, as n -» oo, n 1 uniformly on compacts sets in C\R. Then A n -> 0, a s n - > oo.
(2.14)
(2-15)
This result allows to prove convergence in case of dependent matrix entries as well. 2.5. Convergence of random spectral distributions to the Marchenko-Pastur law We shall assume that EXjL = 0 and a2jt := EX?;, for 1 < j < n and 1 < I < m. Introduce the cr-algebras Pl = a{Xkq :l
{j, I}}, 1 < j < n, 1 < I < m,
191
Spectra of random matrices with martingale structure
and Tl = a{XJ3 :l<j
1 < I < m.
As above we introduce as well Lindeberg's ratio for random nxm matrices; that is, for any r > 0, -
L
n m
( 2 - 16 )
^) = — E E ^ w i ^ 3 = 1 1=1
Furthermore, introduce {X^ := X^I^^^y
-EX^I^.^,^}
and $? = E{XJ/T)|Fij}, and also the vectors X,(T) = {X$,..., X$)T ana^
- ^ 1 ] ( ,. • - , ? „ , ( ; •
Theorem 2.5: Let m = m(n) depend on n in such a way that ^—^- -> a e (0,1), as n -> oo. Assume that the random variables {Xji,l satisfy the following conditions:
(2.17)
< j < n,l < I < m}, n,m > 1
E{X i i|.F l } = 0,
(2.18)
n m
1
^^—EE^i^l^}-^!-0
as
"-oo, (2.19)
j=i(=i
there exists d1 > 0 such that (2.20) -
3)
m
n
4 - i E E *|*{«4T)>a - E ( 4 T ) ) 2 ) ( ( ^ ) 2 -E(^i})2)h} -» 0, ,
(2.21) m
^ ==ij] n m
E
E
J=l l<j^fc
(EKKJft2 - E«W)a)((4r))a - E«LT))a)\Fi\
-+ 0,
(2.22)
and £TI(T)
-> 0
as n -> oo for anyfixedT > 0.
(2.23)
Then A n : = s u p | E F n ( x ) -GcCicr" 1 )! - • 0 as n -^ oo. X
(2.24)
192
F. Gotze and A. N. Tikhomirov
Remark: Note that condition (2.20) implies that lim - E T r W n = a2 < oo.
n—»oo n
(2.25)
Remark: The conditions (2.18) and (2.19) are similar to those in Theorem (1.5). The condition (2.22) requires a weak correlation of conditional expectations of truncated observations whereas condition (2.21) guarantees a conditional weak correlation of squares of truncating observations. These conditions are needed for the approximation of terms of type EXfUXi in this martingale type setup. Since we assumed that Xij have second moments only, we to truncate all random variables. Corollary 2.6: Assume (2.17). Let, for any n,m > 1, Xji, 1 < j < n,l < I < m < oo, be independent and EXtj = 0, EXfj = a2. Assume that, for any fixed r > 0, -> 0,
Ln{r)
as n -> oo.
(2.26)
Then the expected spectral distribution function of the sample covariance matrix W converges to the Marchenko-Pastur distribution, that is An
:=sup|EF n (x) - G^xa'1)]
-> 0,
as n -> oo.
(2.27)
X
Another application of Theorem 2.5 concerns the distribution of spectra for the n x m matrices X n = (X^) induced by the uniform distribution on the sphere of the radius y/N in M.N with iV = nm , that is n
m
E E ( 4 n ) ) 2 = iV-
3=11=1
(2-28)
Here, we obtain from Theorem 2.5: Corollary 2.7: Assume (2.17). Let X^\ 1 < j < n, 1 < I < m be distributed as above, for any n > 1 and m > 1. Then An
:= sup \Fn(x) - Ga(x)\ -> 0 as n -> oo.
(2.29)
X
We may again consider the ensemble of real nx m matrices determined by the uniform distribution in the ball of the radius y/N in RN with N = nm, that means n
m
EEPO2^0 = 1 1=1
(2.30)
Spectra of random matrices with martingale structure
193
Using Theorem 2.5, we arrive at the following result. Corollary 2.8: Assume (2.17). Let xfr\ 1 < I < j < n be uniformly distributed in the ball of the radius vN in M.N with N = nm , for any n > 1. Then A n -> 0,
as n -> oo.
(2.31)
Acknowledgements: Part of the work was done while FG was visiting the Institute for Mathematical Sciences, National University of Singapore in 2003. The visit was supported by the Institute. ANT was partially supported by Russian Foundation for Fundamental Research Grant N 02-0100233, and both FG and ANT were partially supported by INTAS grant N 03-51-5018. References 1. Z. D. BAI (1999) Methodologies in spectral analysis of large dimensional random matrices: a review. Statistica Sinica, 9 611-661. 2. B. V. BRONK (1964) Topics in the theory of random matrices. Ph.D. thesis, Princeton University. 3. F. GOTZE & A. N. TIKHOMIROV (2003a) Rate of convergence to the semicircular law. Probab. Theory Rel. Fields 127, 228-276. 4. F. GOTZE & A. N. TIKHOMIROV (2003b) Limit theorems for spectra of random matrices with martingale structure. Bielefeld University, Preprint 03-018. www.mathemathik.uni-bielef eld.de/fgweb/preserv.html 5. F. GOTZE k. A. N. TIKHOMIROV (2004a) Rate of convergence in probability to the Marchenko-Pastur law. Bernoulli, 10, 503-548. 6. F. GOTZE & A. N. TIKHOMIROV (2004b) Limit theorems for spectra of positive random matrices under dependence. Zap. Nauchn. Sem. POMI v. 311. Probability and Statistics 7, 92-124. 7. P. HALL & C. C. HEYDE (1980) Martingale limit theory and its application,
Academic Press, New York. 8. M. L. MEHTA (1991) Random Matrices, 2nd ed., Academic Press, San Diego. 9. L. A. PASTUR (2000) Random matrices as paradigm. Math. Phys. 2000, 216-265. 10. L. A. PASTUR (2004) Random matrices: asymptotic eigenvalues statistics. Seminaire de Probability XXXVI, 135-164. Lecture Notes in Math. 1801, Springer. 11. N. ROSENZWEIG (1962) Statistical mechanics of equally likely quantum systems. In: Statistical physics (Brandeis Summer Institute, Vol. 3), Benjamin, New York, 12. C. STEIN (1986) Approximate computation of expectations. IMS Lecture Notes Series 7, Institute of Mathematical Statistics, Hayward, CA.
Characterization of Brownian motion on manifolds through integration by parts
Elton P. Hsu Department of Mathematics, Northwestern University Evanston, IL 60521, USA E-mail: [email protected] Inspired by Stein's theory in mathematical statistics, we show that the Wiener measure on the pinned path space over a compact Riemannian manifold is uniquely characterized by its integration by parts formula among the set of probability measures on the path space for which the coordinate process is a semimartingale. Because of the presence of the curvature, the usual proof will not be readily extended to this infinite dimensional setting. Instead, we show that the integration by parts formula implies that the stochastic anti-development of the coordinate process satisfies Levy's criterion.
1. Introduction The basis of Stein's approach to the central limit theorem is the fact that the equation (1.1)
Ef'(X) = EXf(X)
characterizes the standard normal distribution N(0,1). More precisely, for a real-valued random variable X, if the above equality holds for all realvalued functions / such that both xf(x) and f'{x) are uniformly bounded, then X has the standard normal distribution: F{X <x} = - 1 = / X V27T J-oo 195
e-y2/2dy.
196
Elton P. Hsu
(1.1) is an integration by parts formula for the standard Gaussian measure /x because it moves the differentiation away from the function / . Let D
=7T' ax
D
* = ax ~T+X-
Then D* is the adjoint of D with respect to fi:
(Dg,f)fl = (g,D*f)^, and (1.1) can be simply written as MD*f(X) — 0. In this article, we consider an infinite dimensional extension of this characterization of the standard Gaussian distribution. Consider a real-valued Brownian motion W = {Wt, 0
EDhF(W) = E \F(W) f hsdWs] for a large class of functions F on PO(K) and directions h. More generally, let M be a compact Riemannian manifold M and o £ M a fixed point. The pinned path space over M is Po(M) = Co([0,l],M). Consider the probability space (PO(M), 3S{P(M)), v), where v is the Wiener measure. Let X = {Xt, 0 < t < 1} be the coordinate process on P0{M), namely X(-y)t = 7t for 7 e PO(M). Thus under the probability u, the process X is a standard Brownian motion on M. The corresponding integration by parts formula, due to Bismut (1984) and Driver (1992), is
EDhF(X) = E [F(X)JO
(hs + \Ricuixhhs,dWs^
.
The purpose of this article is to show that this integration by parts formula characterizes Brownian motion among the set of M-valued semimartingales. 2. One dimensional case The relation Ef'(X) = EXf(X) for a standard Gaussian random variable can be verified directly by using the density function fix(dx) dx
e~*2/2 y^
Characterization of Brownian motion on manifolds
197
However, the following point of view is more fruitful. The differentiation operator D = djdx generates the the translation semigroup: Ttx = x + t. For the shifted Gaussian measure fxx+t = fix o Tf1 we have HX+t(dx) Hx(dx)
= ete_*V2
Now we have
Ef(X + t)= f
f(x)fix+t(dx)
= f
f{x)e^2'^x{dx)
= E [etX-*''2f(X)\ • The relation E/'(X) = EXf(X) is then obtained by differentiating with respect to t and letting t = 0. An operator D on a function space is called derivation if it satisfies D(fg) = 52?/ + fDg. For any derivation operator D, the adjoint operator has the form D* = -D + D*l. Thus finding an integration by parts formula is equivalent to calculating D*l, the divergence of 1 (the unit vector). The general integration by parts formula takes the form {Dg,f)li = -{g,Df)ti
+ {g,{D*l)f)tl.
If we take g = 1 and let X be a random variable with the law /i, then we have EDf{X)=Ef(X)D*l(X). In our case, the underlying Gaussian measure /i has a density function p(x) = e~x /2/\/27r with respect to the Lebesgue measure, which is invariant under the translation group {Tt, t G R}, and we have D*l(x) = - — lnp(x) =x. In infinite dimensional situation we will not have a measure invariant under translations. Nevertheless, as we will see, the expression D*l (the derivative of the logarithmic "density") still makes sense.
198
Elton P. Hsu
The fact that the equation Ef'(X) = EXf(X) implies that X has the distribution JV(0,1) can be proved using Stein's equation. Let
be the distribution function of N(0,l). The general Stein's equation has the form
f'(x) - xf(x) = h(x) - f h(z) d$(z). Ju Take the special case h(x) = /(_oo)2](:r) (z fixed). The equation /'(x)-i/(a;)=/(_OOiZ](i)-#(z) can be solved explicitly:
f(x) = J— I*{X){1
7{X>
~ *{Z))>
&{x) \ * ( z ) ( l - # ( x ) ) ,
X Z]
~
x>z.
It is easy to verify that both xf(x) and f'(x) are uniformly bounded. Using this / in Ef'(X) = EXf(X) we have immediately F{X
(2)(w;w)t
= {(wi,wj)t}1^n
Here In is the n x n identity matrix.
= in.t.
3. Infinite product Gaussian measure The one dimensional theory in the preceding section can be extended directly to the product Gaussian measure on Rz+. Probabilistically, this corresponds the case of a sequence X — {Xn} of i.i.d. random variables with the standard Gaussian distribution N(0,1). We consider the set ^ of cylinder functions F(x)=f(x°,x1,---,xn).
Characterization of Brownian motion on manifolds
199
Here x = (xo,x\,X2, • • •). Consider the gradient
DF(x) = (fX0(x),fXl(x),---). More conveniently, consider the set of directional derivatives: oo
DtF(x) = (DF(x),l) = £ji/ X i (x). Each / = (l\, I2, • • •) is a direction of differentiation. It is easy to see that E DiF(X) = E (X, I) F(X),
(3.1)
where (X,I) = Y^=oXnln is the inner product in I2CZ+). This equation characterizes the product Gaussian measure on Rz+. To see this, let Yi = (X,l) = loXo + --- + lnXn and F ( x ) = f ( l o x o + h x i + ••• + l n x n ) = f ( ( x , I ) ) .
Then
DtF(X) = \l\lfW) and (3.1) becomes
\l\\ Ef'(Yt) = EYiffXt). Hence by the one dimensional result, Y; has the distribution N(0, |/||), the Gaussian distribution of mean zero and variance |/||. It is an easy exercise to show that if Yt = (X,l) has the law N(0, |/||) for all I e 12(Z+), then is i.i.d. with the distribution JV(0,1). X = (Xi,X2,...) 4. One dimensional Brownian motion Consider the path space PO(R) = Co([0,1],R) and the map 0 : PO(R) -> Ez+ defined by
$(W) - {Xn, neZ+} = If
endWt, n e Z+j .
Here {e n , n & R+} is an orthonormal basis for L2[Q, 1], which we may take to be eo(t) = 1 and en(t) = v2cos7rnt,
n = l,2,....
200
Elton P. Hsu
Here we assume that W = {Wt, 0 < £ < l } i s a semimartingale so that the stochastic integration makes sense. Let
MO = / en(s)ds. Jo
The inverse
-1
z
: R + —> PO(IR) can be loosely described as OO
(4.1) n=0
If 1/ is the Wiener measure v on Po(lR) (the law of Brownian motion W) and fi the product Gaussian measure on R z +, then we have an isometry
DhG(W)=HG^
^ -
G
^ ,
where oo
,t r oo
X ) '"e"(S^
/»(*) = 1 ] Jn/ln(t) = / n=0
dS
-
^° L=0
On the other hand, oo
l X
oo
ln
{X, l) = J2 ^ n = J2 n=0
.i
n=0 -70
e
i
S dW
"( ) * = / ^(S) dW'^0
Introducing the Cameron-Martin space ^ = { / i e P 0 ( R ) : |ft| j e .
201
Characterization of Brownian motion on manifolds
if the integral is defined and \h\jp — oo otherwise, we can write symbolically
D'hl(W) = (h,W)je=
fhsdWs.
Jo Now that we have found D^ll(W) as a stochastic integral, the integration by parts formula has been transplanted from the product space Rz+ to the path space PO(K) to read EDhG(W) = E (h, W)^ G{W). We can prove the above integration by parts formula from another point of view. Let nw be the law of Brownian motion W and ^w+th be the law of W + th (Brownian motion shifted by th) with h G Jf. Then we have the Cameron-Martin-Maruyama-Girsanov theorem: j
w+th
r
z-i
f2
/-l
-i
Differentiating the identity r EG(W
duw+th~\
+th)=E^G(W)^-^-^
with respect to t and then letting t = 0 we have '
EDhG{W) = E \G(W) I hs dw} . This relation holds for all h S JV and nice functions G (e.g., cylinder functions). Note that h does not have to be deterministic, for it suffices to assume that it be ^-adapted and that E|/i|^, < oo, 33* = {3§S) 0 < s < 1} being the standard Borel filtration on the path space PO(R). Theorem 4.1: Let W be a real-valued continuous semimartingale. It is a Brownian motion if and only if the integration by parts formula EDhG{W) = E (h, W)x
G{W)
holds for and all cylinder functions G and all adapted h such that (h, W)jg, G(W) is integrable. Proof: We have shown that the above integration by parts formula holds if W is a Brownian motion. To show the implication in the other direction, we can resort to the map $ and work on the product space Mz+. Of course, this proof will not extend to the path space over a compact manifold (non-flat path space). A better way is to use Levy's criterion.
202
Elton P. Hsu
Taking G = 1, we have DhG = 0 and E (h, W)x
= E / h. dWs = 0 Jo
for all adapted A such that the stochastic integral is integrable. This implies that W is a (local) martingale from the fact that W, as a continuous semimartingale, is the sum of a local martingale and a process of local bounded variation. Now we take G(W) = W\. Prom the definition of D/, we have DhG(W) = H ^
W
+ t h
^
G
^=
h l
.
The integration by parts formula becomes Ehi =E{h,W)jg,Wi. The right side is E I Wi f hs dWs] = E I hsd{W, W)s. Hence, equating the expressions, we have E / hsds = E f hsd(W, W)s. Jo Jo This holds for all suitable adapted h, from which we conclude immediately that (W, W)t = t. By Levy's criterion W is a Brownian motion. •
5. Brownian motion on a Riemannian manifold We briefly describe Brownian motion on a Riemannian manifold and its integration by parts formula. For a detailed discussion, the reader is referred to the monograph Hsu (2002). Let M be a compact Riemannian manifold (or more generally, a complete Riemannian manifold with uniformly bounded Ricci curvature) and o a fixed point on M. Let u be the Wiener measure on the path space PO{M). On the probability space (Po(M),@(Po(M)),u), the coordinate process U = {IIt, 0 < t < l } i s a Brownian motion on E™. Let 0{M) be the orthonormal frame bundle of M and TT : 6{M) —> M the canonical projection. Let Hi,-- • ,Hnbe the canonical horizontal vector
Characterization of Brownian motion on manifolds
203
field on @(M). Fix a frame Uo € ir~1o and consider the Ito type stochastic differential equation on &(M): n
i=l
Its unique solution is called a horizontal Brownian motion. The projection X = •KU is a Brownian motion on M starting from o, whose law is the Wiener measure v on the path space (P0(M), &(P0(M)). The map J : Po(Rn) —> PO(M), which we call the Ito map, is the stochastic equivalent of the development map (rolling without slipping) in differential geometry. In differential geometry, the map J carries straight lines (with uniform speed) to geodesies. In the context of stochastic analysis, it carries a euclidean Brownian motion on R" to a Riemannian Brownian motion on M. As in the case in differential geometry, J is invertible and the inverse image W = J~lX is again obtained by solving an Ito type stochastic differential equation driven by X. The Wiener measure (the law of Brownian motion X on M) is v = (i o J~1 and we have an isometry: J:
(Po(Rn),v)^(Po(M),v).
Since W = J~XX is obtained from X by solving an Ito type stochastic differential equation, W is well defined when X is a semimartingale. We have the following basic fact (see Hsu (2002)). Proposition 5.1: X is a Brownian motion on M if and only if its stochastic anti-development W — J~XX is a Brownian motion on Rn. Let x e M and TXM the tangent space at x. Let Ricx : TXM —> TXM be the Ricci transform at x of the Levi-Civita connection of M. Let u e &{M) be an orthonormal frame at x — TTU. Then u : Rn —> TXM is a linear isometry. The scalarized Ricci transform at u is Ricu d= u-lRicxu : Rn -> W1. The following famous integration by parts formula is due to Bismut (1984):
E(Vf(Xt),ht)=E\f(Xt)J
(hs + ^Ricu{xhhs,dWs\\
for h € Jf, f £ C°°(M) and 0 < t < 1. This formula can be extended to a more general integration by parts formula for directional derivative operators Dh on the path space PO(M).
204
Elton P. Hsu
We briefly recall the definition of the directional derivative Dh on the path space P0(M). For a path h e P0(^n) and a path 7 € P0(M) for which the horizontal lift U^j) makes sense, we define Dh{i)s = U(-f)shs. We can regard Dh as a vector field on the path space PO(M). The directional derivative Dh along the direction h can be defined as
where {$, t € M} is the flow generated by the vector field Dh- It takes a fair amount of effort to define Dh in this manner, for the existence of the flow used in the definition is by no means obvious. However, for a cylinder function G(7) = g(jSlr-- >7«i) with 0 < s\ < s2 < • • • < sj < 1 and g G C°°(M'), DhG(~f) can be simply defined by the formula: 1
DhGi-r) =
J2(uh)7i1^f(f)^Si).
Starting from Bismut's integration by parts formula, we can show by induction on the number of time dependencies the following integration by parts formula for the Wiener measure v on the path space P0(M\. for all cylinder functions G and h € Jff,
EDhG(X) =E [G(X) jT (h, + ±Bicuw.ha,dWs\\ . This form of integration by parts formula was first proved in Driver (1992). It is the proper generalization in the path space PO(M) of the relation Mf'(X) = E X / ( X ) for a standard Gaussian variable X. In the next section we show that this equation characterizes Brownian motion in the set of semimartingales. 6. Characterization of Riemannian Brownian motion In this section we prove the main result of the article. Theorem 6.1: Let X be an M-valued semimartingale on a Bltered probability space (fi,^"*,P). TJien it is a Brownian motion on M if and only if
EDhG(X) = E \G(X)J
(hs + \RicU{xhhs,dWs\\
for all cylinder functions G and all ^-adapted h G J f such that the random variables on the two sides are integrable. Here W and U are the antidevelopment ofX in M.n and the horizontal lift of X in &{M), respectively.
Characterization of Brownian motion on manifolds
205
Proof: Recall that X is a Brownian motion on M if and only if the continuous semimartingale W = J~lX is a Brownian motion on W1. All we need to show is that the semimartingale W satisfies Levy's criterion. To show that W is a local martingale, we again let G = 1. From Dh\ = 0 we have
Let W = M + A be the Doob-Meyer decomposition of the semimartingale and introduce the notation
\W\.= Yl \{M\Mi)\s+ Y, \A'\1<*.J<"
f6-1)
l
Consider the set &(W) of adapted h = {hs, 0 < s < 1} such that
E f \h.\2d\W\.+ (£ \ha\d\W\a\ I
0 < s < l
is again in &(W). Therefore, E / (9s,dWs) = 0 Jo for all g € &(W). This easily implies immediately that W is a local martingale. The verification that the quadratic variation (W*,W)t = In -t,
/„ = (n x n) - identity matrix
is more difficult. First of all for a fixed 0 < u < t < 1 we take the function G to be G(X) = Wt-Wu = (J-'X^ - (J- 1 *)^ Of course, G is not a cylinder function on PO(M), However, from general approximation theory for stochastic differential equations (see Ikeda & Watanabe (1989)) one can approximate G(X) =Wt — Wu by a sequence
206
Elton P. Hsu
of cylinder functions in a very strong sense. More precisely, suppose that h € &(W), i.e., it satisfies the condition E f \hs\2d\W\. < oo, Jo see (6.1). Let
(hs+l-R\cu(x)hs,dw}j.
h(X) = J
Then there is a sequence of cylinder functions Gn such that lim Gn(X)h(X) = G(X)h(X) 71—»OO
and lim DhGn(X) = DhG{X),
n—*oo
both limits taking place in L2(PO{M),^(PO{M)), u). Here DhG(X) is obtained by calculating formally the pushforward of the vector field Dh through the map J" 1 : DhG(X) = Dh{J-lX)t
- {J~lX)u = {J;lDh)t -
{J:xDh)u.
This calculation can be found in Driver (1992) and Hsu (1995) when W is assumed to be a Brownian motion, but only slight modifications are needed if W is only assumed to be a local martingale. We have
(J;1Dh)t-(JZ1Dh)u= + f\had8 Ju
{
+\
[ (6(h)s,dWs)
Ju
Y, l
^j(U(X)s)(Hei,Hhs)d(Wi,W^)s\. J
Here Q is the scalarized curvature tensor (or the curvature form on the orthonormal frame bundle 0{M)) and Hf = Yl7=i fi-Hi 1S the horizontal vector defined by / e R n . The explicit expression of the integrand 6{h) involves the curvature tensor and its derivatives and is not important for our purpose. What is important is that under the condition imposed on h the stochastic integral on the right side is a martingale. Now we have
EDhGn(X) = E \cn(X) J (h. + ^Biculxhht,dWa\]
.
207
Characterization of Brownian motion on manifolds
We take the limit as n —> oo. On the left side we have
£
Ef\h3ds+l Ju
nj(U(X)s)(Hei,Hhs)d(Wi,W^)
[ i
\. J
E \(Wt - Wu) J (hs + ^Ricu.h.,dWs\] . Because W is a local martingale, the above expression is equal to E ^ d(W*,W)ihs
+ ^RicUshs\.
Equating the two sides, the resulting equality has the following form: E /" {Inds-d(W*,W)s}hs+E
f Cijk(s)hksd(W\Wi) = 0. (6.2)
Here C^ = {Cijh(s), 0 < s < 1} is continuous and adapted, whose actual expression is not needed in the following discussion. Using hs = I hT dr Jo
and changing the order of integration we find that the second term is the sum of
E f hkTdr f
Cijk(s)d(W\Wi)s
E j hkTdr f
C%]k{s)d{W\Wj)s.
J0
Ju
and Ju
JT
Note that the second expectation here and the first term in (6.2) do not involve the values hT for 0 < T < u, hence the first expectation must vanish for all permissible hT, 0 < s < u, and we therefore must have
E
ry cijk(S)d(w\wi)s
J?T]=o
for a l l O < T < u < £ < l . This fact implies in turn that both terms in (6.2) vanish. It follows that E / {Inds-d(W*,W)s}hs=0, Ju which implies that (W*, W)t = In-t. By Levy's criterion, W is a Brownian motion on W1. By Proposition 2.1, X is a Brownian motion on M. u
208
Elton P. Hsu
7. Concluding remarks and acknowledgements
We have shown that the integration by parts formula for the Wiener measure on the path space over a Riemannian manifold characterizes uniquely the Wiener measure among the set of probability measures on the path space under which the coordinate process is a semimartingale. This may be regarded as the first step towards exploring in the context of stochastic analysis the circle of ideas surrounding the well known Stein-Chen technique in mathematical statistics. The success of the Stein method in the theory of central limit theorems may point to a possible parallel theory in the current context. In particular, one wonders whether it is possible to introduce a useful concept of distance between a semimartingale and a Brownian motion on a fixed Riemannian manifold. In addition, in view of the importance of Stein's equation, it may also be worthwhile to explore this equation in an infinite dimensional setting. An early draft of this article was written and presented at the conference in honor of Professor Charles Stein in August, 2003, at the National University of Singapore. The final version was completed during the author's visit at the Institute of Applied Mathematics of the Chinese Academy of Sciences in Beijing. The research of this article was supported in part by the NSF grants DMS-010479 and DMS-0407819. The author gratefully acknowledges the financial support from the above mentioned sources. References 1. J. BISMUT (1984) Large deviation and Malliavin calculus. Birkhauser, New York. 2. B. DRIVER (1992) A Cameron-Martin type quasi-invariance theorem for Brownian motion on a compact Riemannian manifold. J. Func. Anal. 110, 272-376. 3. E. P. Hsu (2002) Stochastic Analysis on Manifolds. Graduate Studies in Mathematics, Vol. 38, American Mathematical Society, Providence, RI. 4. E. P. Hsu (1995) Quasiinvariance of the Wiener measure and integration by parts in the path space over a compact Riemannian manifold. J. Func. Anal. 134, 417-450. 5. N. IKEDA & S. WATANABE (1989) Stochastic Differential Equations and Diffusion Processes, 2nd edition, North-Holland/Kodansha.
On the asymptotic distribution of some randomized quadrature rules
Wei-Liem Loh Dept. Statistics and Applied Probability, National University of Singapore 6 Science Drive 2, Singapore 117543, Republic of Singapore E-mail: [email protected] This article gives a brief survey of some of the popular unbiased randomized quadrature rules that have been proposed in the statistical literature. In particular, the asymptotic distributions (as well as other related properties) of the sample means based on Latin hypercube sampling, randomized orthogonal arrays and scrambled nets are discussed.
1. Introduction Let / : [0, l ) s —> R be a square-integrable function. The aim is to estimate the integral f(x)dx,
fi = [
using a fixed number n of function evaluations, say f(x\), • • • , f(xn). usual estimate for /i is then given by the sample mean
(1.1) The
Clearly there is a design issue in the choice of x\, • • • ,xn. This problem arises, for example, in the context of computer experiments [see, for example, Sacks, Welch, Mitchell k. Wynn (1989), Koehler k Owen (1996) and Santner, Williams & Notz (2003)], when / is a known computer algorithm and an estimate of the average value of the output of / is desired. Davis & Rabinowitz (1984), page 417, noted that for the integral in (1.1) with s > 15, sampling or (deterministic) equidistribution methods are generally preferred. They are time consuming, but with some care, 209
210
Wei-Liem Loh
are reliable. Other sophisticated methods of variance reduction appear to exhibit a dimensional effect and are probably ruled out in this range. In a sampling or Monte Carlo method, the points x\, • • • ,xn are selected via random sampling. In the special case that xi, • • • ,xn are n independent identically distributed random vectors from the uniform distribution on [0, l) s , we have |/i — fj,\ = O p (n^ 1 / 2 ) as n —> oo. It should be noted that this error bound is a probabilistic one and the convergence rate is independent of the dimension s. Equidistribution methods (also known as quasi-Monte Carlo methods) can be described as deterministic versions of a Monte Carlo method. Here the points x\, • • • ,xn are chosen deterministically and typically deterministic (worst case) error bounds are available. Using a suitably well chosen set of points, it is known that quasi-Monte Carlo methods can result in |/2 — fi\ = O{n~l(logri)s~l) for an integrand / with a relatively low degree of regularity. This article focuses on three classes of unbiased randomized quadrature rules, namely Latin hypercube sampling, randomized orthogonal array sampling designs and scrambled nets. These three rules give different sampling schemes in choosing the points x\, • • • ,xn. They can also more appropriately be interpreted as hybrids of Monte Carlo and quasi-Monte Carlo methods. The rest of this article is organized as follows. In Sections 2 to 4, Latin hypercube sampling, randomized orthogonal array sampling designs and scrambled nets are described respectively. In 1972, Charles Stein introduced a powerful and general method for obtaining an explicit bound for the error in the normal approximation to the distribution of a sum of dependent random variables. A unifying theme behind Sections 2 to 4 is that Stein's method appears to be an extremely useful way of determining the asymptotic distribution of ft based on each of the three above-mentioned randomized quadrature rules. However lots of questions and open problems remain unanswered. Some of these shall be noted at appropriate parts of Sections 2, 3 and 4. One of the hopes of this brief survey article is to attract other researchers to this area and take on some of the unresolved problems raised here. 2. Latin hypercube sampling McKay, Beckman & Conover (1979) proposed Latin hypercube sampling as an attractive alternative to simple random sampling in computer experiments. The main feature of Latin hypercube sampling is that, in contrast
211
Randomized quadrature rules
to simple random sampling, it simultaneously stratifies on each input dimension. More precisely, for positive integers s and n, let (1) 7Ti, • • • ,TTS be random permutations of {1, • • • , n } each uniformly distributed over all the n! possible permutations; (2) C/j li ... i j s j, 1 < ii,--- ,is < n, 1 < j < s, be (0,1] uniform random variables; (3) the £/,!,... ,i s ,j's and TIVS be all stochastically independent. A Latin hypercube sample of size n, taken from the s-dimensional hypercube [0, l ) s , is denned to be .*.(*)) : 1
{X(*i(i),ir2(i),---
where for all 1 < i\, • • • , is < n, ,is) = (ij -Uili...iisj)/n,
Xj(ii,--X{i\,---
,is) = (Xi(h,---
,is),---
VI
<j
,Xs{h,---
,is)Y-
Thus to estimate /x in (1.1), we use the following sample mean based on the above Latin hypercube sample, namely ALH5 = - ] T /(X(7n(fc), • • • ,7Ts(fc))). U
(2.1)
k=l
It is clear that [ILHS is an unbiased estimate of /i. Using an ANOVA decomposition [see, for example, Efron & Stein (1981)], Stein (1987) re-expressed the function / as s
f{x)=/j,
+ Y,ak(xk)+r{x), fc=i
V i = ( i i , - - - ,x,)'€
[0,l)s,
where
ak{xk)= f
[f(x)-n]l[dXj.
(2.2)
In statistics terminology, ak is the 'main effect' function of xk, the fcth covariate of x, and r is the 'residual from additivity' of / . It can easily be seen that / ak{xk)dxk Jo
[
•/[o,!)- 1
=0,
r(x) FT dxj = 0, j /
k
VI < k < s,
212
Wei-Liem Loh
identically in Xk for all 1 < k < s. Stein (1987) further proved that as n —> oo, If 1 Vai((iLHs) = r2(x)dx + o(~), n n J[o,i)° whereas 1 /" 1 s f1 2 Var(A//jo) = - / r (x)dx + - Y" / a2k(xk)dxk. Here fiiw denotes the sample mean based on an i.i.d. sample of size n. Thus the asymptotic variance of (J,LHS is never greater than that of fiiiD and the reduction can be substantial if some of the a2, are large. Owen (1992a) proved the following central limit theorem for (XLHSTheorem 2.1: If f is a bounded real-valued function on [0, l) s , then n1/2((iLHS - M) - N 10, / r2(x) dx) , \ J[o,D) in distribution as n —> oo. Furthermore Owen (1992a) generalized Theorem 2.1 with the following multivariate central limit theorem. Theorem 2.2: Let / = (/i, • • • , fP)' be a bounded function from [0, l)s to Rp. Then with JILHS as in (2.1), TI1I2((ILHS - (J-) tends in distribution to the p-variate normal distribution Np(0, S) where the (i,j)th element of the covariance matrix E is given by Eij = / ri(x)rj{x)dx, J[o,\y and ri is the residual from additivity for fi whenever 1 < i < p. On a more personal note, the author first stumbled onto this field in 1994 when he came across Owen's 1992 paper in J. Roy. Statist. Soc. Ser. B. Among the many results on Latin hypercube sampling in that paper, Theorem 2.1 above caught his attention. It occurred to him then that in dimension two (i.e. s = 2), Theorem 1 is exactly Hoeffding's combinatorial central limit theorem [see Hoeffding (1951)]. Interestingly both proofs used essentially the method of moments. However the motivation of Hoeffding was not high dimensional numerical integration but rather nonparametric statistics (in particular rank test statistics).
213
Randomized quadrature rules
Hoeffding's 1951 paper has generated a great deal of interest over the years and a number of researchers have refined Hoeffding's result by getting bounds on its convergence rate to normality as well as by relaxing the moment assumptions. The following are some of the papers found in the literature on Hoeffding's combinatorial central limit theorem. Motoo (1957) used a Lindeberg-type proof while von Bahr (1976) used characteristic functions. However the sharpest results were achieved by Ho & Chen (1978) and Bolthausen (1984) using Stein's method [see Stein (1972,1986)]. In particular, Bolthausen (1984) obtained a Berry-Esseen type bound on the convergence rate to normality. By adapting appropriate parts of Bolthausen's proof in combination with the multivariate normal approximation version of Stein's method [see, for example, G6tze (1991) and Bolthausen & Gotze (1993)], Loh (1996b) generalized Bolthausen's Berry-Esseen type bound to the Latin hypercube sampling setting (i.e. for arbitrary but fixed s). More precisely, Loh (1996b) proved Theorem 2.3: Let / : [0,l) s -> Rp be such that J[Q 1)s ||/(z)||£ < oo, where \\.\\p denotes the usual Euclidean norm in Rp. Let £ be as in Theorem 2.2 and define M(n,' •' - is) = Ef{X(ii, 1
• • • , is)),
VI < ii, • • • , is < n,
" j#fci3-=l
Assuming that E is nonsingular, we have for sufficiently large n, sup{|£I{n 1 / 2 ir 1/2(/xLffs -n)eA}-
E1{ZP e A}\ : A e A} < Cs,p(33,
where CSiP is a constant depending on s and p but not on n, I{.} is the indicator function, Zp denotes a random vector having the p-variate standard normal distribution, A represents the class of all measurable convex sets in Rp, and
^3 = ^ 1
E l
E{\\n-1'2X-1'2\f{X(ilr--,i.))-'£Mik)+(s-l)A\\3p fc=l
While the asymptotic distribution of (XLHS is reasonably well understood now, the same cannot be said of the randomized quadrature rules in the next two sections.
214
Wei-Liem Loh
3. Randomized orthogonal array sampling designs Let s, n and t be positive integers with t < s. An orthogonal array of strength* is an nxs matrix with elements taken from the set {0,1, • • • ,<7-l} such that in any n x t submatrix, each of the q* possible rows occurs the same number of times. Hence n/q1 must be an integer. The class of all such arrays is denoted by 0A(n, s, q, t). Excellent accounts of orthogonal arrays can be found in Raghavarao (1971) and Hedayat & Sloane (1999). Owen (1992b) and Tang (1993) independently suggested the use of randomized orthogonal arrays in sampling designs for computer experiments on [0, l)s. The main attraction behind these designs is that they, in contrast to simple random sampling, stratify on all i-variate margins simultaneously. Let (1) 7i"i, • • • ,irs be random permutations of {0,1, • • • ,q — 1}, each uniformly distributed on all the q\ possible permutations; (2) t/i1,...,i,,j,0 < i\, • • • ,is < q — 1,1 < j < s, be [0,1) uniform random variables; (3) the Uii,---,u,j's and TTfe's be all stochastically independent. Suppose A e OA(n,s,q,t) jt
,is) = -(ij + Uiu...}i3!J),
X ( i \ , - - - ,is) - {Xi(ii,---
VI < j < s ,
,is),--- ,Xs(h,---
,is))'.
This class of designs includes Latin hypercube sampling since the class of randomized orthogonal arrays generated by OA(n, s,n, 1) is identical to Latin hypercube sampling. To estimate \i in (1.1), the usual estimate based on the above randomized orthogonal array sampling design is 1 n !X 7r a (J-OAS = ~X]/( '( l( ».l)'7r2(Oi,2)»--- ,7Ts(ai,s)))n
i=l
Owen (1992b), page 446, observed that randomized orthogonal arrays of strength t > 2 require quite large sample sizes for modest q and hence
215
Randomized quadrature rules
would seem to be of less practical use at present. Consequently for the remainder of this section, we shall take t = 2 . As in Latin hypercube sampling, / has the following ANOVA decomposition: for all x = (xi, • • • , x3)' € [0, l)s, s
f(x)=n + J2<Xk(xk)+ ^2 k=l
l<j
ajtk(Xj,xk)+f(x),
where a^, as in (2.2), is the 'main effect' function of Xk and aj>k{xj,xk)
= / •VD-2
[f{x)-fjL-aj(xj)-ak(xk)}
Y[ dxt, ,^
l<j
k
In statistics terminology, o^k is the 'two-factor interaction' between the covariates Xj and XkOwen (1992b) proved that for A e OA(q2,s,q,2), i.e. if n = q2,
Var(iioAs) = 4 /
9 J[o,i)>
r2{x)dx + O(^), 1
as q —> oo. Since
/ r2(x)dx = V / a2k{xj,Xk)dxjdxk+ r2(x)dx, s / 1 s 01 3 •W) i<j
216
Wei-Liem hob.
we define, for 0 < i\, 12, h < Q - 1, MHI*2,«3)
= £[/(X(ii,i 2 ,i 3 ))], 9-1
1
*"1
1
1
2
W = [^Var^oAs)]- ^[/^(TrxKi),^2(^,2),^3(^,3))) - / i 3
" " H ^ ^ K J ) ) j=l
~
51
l
Mfc,*(7rfc(at,fc).?n(ai>j))]-
Then assuming that J,Q 1)3 f2(x)iia; > 0, we have S\ip{\P(W
<x)-
$ ( z ) | : - 0 0 < X < OO} = O ( 9 - ( m - 2 ) / ( 2 m - 2 ) ) )
as q —> 00. Here $ denotes the cumulative distribution function of the standard normal distribution. The proof of Theorem 3.1 uses Stein's method (via the introduction of an exchangeable pair of random variables). The above theorem is unsatisfactory in many ways. Firstly, this convergence rate to normality is unlikely to be optimal. It would certainly be nice to figure out the optimal convergence rate. Also more importantly, there is a need in applications to extend this result to one for arbitrary but fixed s (and not just for s = 3). Tang (1993) proposed the following alternative way of constructing random orthogonal array samples that are also Latin hypercubes. More precisely, for a given A e 0A(n, s,q,t), a randomized orthogonal array is obtained from Owen's algorithm. Then for each column of this randomized n x s array, we replace the nq~l positions with entry k by a random permutation (with each such permutation having an equal probability of being taken) of knq~l + 1, knq"1 + 2, • • • , knq~l + nq'1 - (k + l)nq~l for all k G {0,1,- • • ,q - I}. This procedure generates a random sample that is both a randomized orthogonal array and a Latin hypercube. As such, Tang (1993) called this random sample a random OA-based Latin hypercube. Let floLH denote that corresponding sample mean based on a random OAbased Latin hypercube.
217
Randomized quadrature rules
The following theorem is due to Tang (1993). Theorem 3.2: Let A e OA(q2, s, q, 2) and f : [0, l ) s -> R be a continuous function. Then V&r{(iOLH) = ~2 / f2(x)dx Q J[o,i)>
+ o(-~),
as q —> oo. Tang (1993) further proved that if / is additive, i.e.
f{x) =n + Y, ak{xk), fc=i
vx = (*!,-••, Xs y e [o, iy,
then Va.r((t,oAs) - Var(//oLif) > 0. Thus the sample mean based on a random OA-based Latin hypercube would do as well asymptotically as the sample mean based on the corresponding randomized orthogonal array but the former would perform much better if the underlying integrand is additive. The asymptotic distribution of (JLOLH has not been determined and is an open problem. 4. Scrambled n e t s Let b > 2 and s > 1 be integers. An elementary interval in base 6 is a subset of [0, l)s of the form
for integers Cj, kj with kj > 0 and 0 < c,; < bkj — 1. DEFINITION. Let 0 < t < m be integers. A finite sequence of points {Ai : i = 1, • • • , bm} from [0, l ) s is a (t, m, s) net in base b if every elementary interval £ in base b of s-dimensional Lebesgue measure bt~m satisfies
Y,l{Al€S} = bt. i=l
DEFINITION. For t > 0, an infinite sequence {At : i — 1,2, • • • } of points from [0, l ) s is a (t, s) sequence in base 6 if for all integers k > 0 and m>t, the finite sequence {Ai : i = kbm + 1, • • • , (k + l)6 m } is a (t, m, s) net in base b.
218
Wei-Liem Loh
The usual (t, m, s) net or (t, s) sequence estimate of /j, in (1.1) is given by
A = ± £/(*)• If / is of finite total variation in the sense of Hardy and Krause, it follows from the Koksma-Hlawka inequality that |/i — /J,\ = O((logb n)s~1/n) if {Ai : i = 1, • • • ,n} is a (t,m,s) net in base b and |/i-/u| = O((\ogbn)s/n) if {Ai : i > 1} is a (£, s) sequence in base 6. For more precise statements as well as details of the above statements, we refer the reader to Owen (1997a), page 1887, and Niederreiter (1992), Theorems 4.10 and 4.17. Owen (1995) introduced the idea of scrambled (t, m, s) nets as follows. S u p p o s e t h a t {Ai : (AiA,---
: i = I , - - - ,bm} is a (t,m,s)
,AitS)'
net in
base b. We observe that Aitj can be expressed as oo
A
',j
=Ylai>i'kb~k
fc=i
for suitable integers 0 < a^k < b — 1. Let {^j,^j;ai,^j;aua2r
' ' : 1 < j < S, 0 < Ofc < 6 - 1,fc= 1, 2, . . .}
be a set of mutually independent random permutations of the integers {0,1,..., b — 1}, where each of these permutations is uniformly distributed over its b\ possible values. Now a scrambled (t, m, s) net in base b has the form {Xi = (Xiti,- • • , Xit.y : i = 1, • • • , bm}, where oo
x
i,i = YlXi>i
and, for 1 < i < bm, 1 < j < s, Xij,i = 7rj(ai,j,i),
Owen (1995), pages 307-308, further showed that {X, : i = 1, • • • ,bm} is also a (t,m, s) net in base b with probability 1 and that for each i, Xi has the uniform distribution on [0, l)s. To estimate fi, the usual estimator based on the scrambled (t, m, s) net {Xi-.i
=
l,---,bm}is 1
fcm
i=l
219
Randomized quadrature rules
Since Xi is uniformly distributed on [0, l) s , fLt,m,s is an unbiased estimator for /j.. For simplicity, we let crfms denote the variance of fit,m,s- Owen (1997a, 1998) obtained some finite sample size bounds on of m s . Generalizations of Owen's work can be found in Yue (1999), Yue &; Mao (1999) and Hickernell k Yue (2000). s DEFINITION. A real-valued function / on [0, l) is smooth and has a Lipschitz continuous mixed partial of order s if there exist finite constants B > 0 and /3 € (0,1] such that
d^r/w
- a ^ W H * B*x - »«•'
v
*'yG i°-1)s-
The following theorem, due to Owen (1997b), gives the exact rate of convergence for
I
L d\ /(*)] dx>0.
Then there exist positive constants c, C such that cms-lb-3m
< ulms
< Cm-'i"3",
as m —* oo. Owen (1997b), page 1555, noted that it is an open question whether the rate in Theorem 4.1 holds under weaker smoothness conditions. We would also like to point out that in the case of (0, m, s) nets and (0, s) sequences, the condition b > max{s, 2} is not overly restrictive as Lemma 4.2 below shows. Lemma 4.2: For m > 2, a (0, m, s) net in base b can exist only ifs < 6+1. Furthermore a (0, s) sequence in base b can exist only ifb > s. The proof of the above lemma can be found in Corollaries 4.21 and 4.24 of Niederreiter (1992). As in the case of randomized orthogonal arrays, there seems to be almost no results on the asymptotic distribution of fit,m,s- The only result that we are aware of is due to Loh (2003) below. Theorem 4.3: Let b > max{s,2}, / : [0, l) s —> R be smooth and have a Lipschitz continuous mixed partial of order s such that
f
J[o,iy
L L
9a;
9
\
i •••9x8
f(x)} dx>0. J
220
Wei-Liem Loh
Then a^ s(/io,m,s — A*) tends in distribution to the standard normal distribution as m —> oo. Again the proof of the above theorem uses Stein's method. The proof of Theorem 4.3 gives a logarithmic upper bound on the convergence rate of the distribution of /io,m,s to normality. Hong, Hickernell &; Wei (2003) exhibited empirical evidence that the central limit theorem for scrambled nets can take place for reasonable sample sizes. Thus we conjecture that the optimal convergence rate to normality for the law of /io,m,s is likely to be significantly sharper and it would be nice to know what that is. More recently, Faure & Tezuka (2002) and Tezuka & Faure (2003) proposed alternative ways of scrambling (t, m, s) nets and (t, s) sequences. Yue & Hickernell (2002) studied the discrepancies of Owen's and Faure-Tezuka's scrambled nets. The asymptotic distributions of the sample means based on these new scrambled nets have not been determined as far as we know. Acknowledgements: I would like to thank Andrew Barbour and Louis H. Y. Chen for the very kind invitation to contribute an article to this volume in honor of Charles Stein. References 1. E. BoLTHAUSEN (1984) An estimate of the remainder in a combinatorial central limit theorem. Z. Wahrsch. verw. Geb. 66, 379-386. 2. E. BOLTHAUSEN & F. GoTZE (1993) The rate of convergence for multivariate sampling statistics. Ann. Statist. 21, 1692-1710. 3. P. J. DAVIS & P. RABINOWITZ (1984) Methods of Numerical Integration, 2nd edition. Academic Press, Orlando. 4. B. EFRON & C. M. STEIN (1981) The jackknife estimate of variance. Ann. Statist. 9, 586-596. 5. H. FAURE & S. TEZUKA (2002) Another random scrambling of digital (t,s)sequences. Monte Carlo and quasi-Monte Carlo methods, 2000 (Hong Kong), pp. 242-256. Springer, Berlin. 6. F. GOTZE (1991) On the rate of convergence in the multivariate CLT. Ann. Probab. 19, 724-739. 7. A. S. HEDAYAT & N. J. A. SLOANE (1999) Orthogonal Arrays: Theory and Applications. Springer, New York. 8. F. J. HICKERNELL & R. X. YUE (2000) The mean square discrepancy of scrambled (t, s)-sequences. SIAM J. Numer. Anal. 38, 1089-1112. 9. S. T. Ho & L. H. Y. CHEN (1978) An Lp bound for the remainder in a combinatorial central limit theorem. Ann. Probab. 6, 231-249. 10. W. HOEFFDING (1951) A combinatorial central limit theorem. Ann. Math. Statist. 22, 558-566.
Randomized quadrature rules
221
11. H. S. HONG, F. J. HICKERNELL & G. W E I (2003) The distribution of the
discrepancy of scrambled digital (t, m, s) nets. 3rd IMACS Seminar on Monte Carlo Methods-MCM 2001 (Salzburg). Math. Comput. Simulation 62, 335345. 12. J. R. KOEHLER & A. B. OWEN (1996) Computer experiments. Design and Analysis of Experiments, Handbook of Statistics 13, pp. 261-308. NorthHolland, Amsterdam. 13. W. L. LOH (1996a) A combinatorial central limit theorem for randomized orthogonal array sampling designs. Ann. Statist. 24, 1209-1224. 14. W. L. LOH (1996b) On Latin hypercube sampling. Ann. Statist. 24, 20582080. 15. W. L. LOH (2003) On the asymptotic distribution of scrambled net quadrature. Ann. Statist. 31, 1282-1324. 16. M. D. MCKAY, R. J. BECKMAN & W. J. CONOVER (1979) A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21, 239-245. 17. M. MOTOO (1957) On the Hoeffding's combinatorial central limit theorem. Ann. Inst. Statist. Math. 8, 145-154. 18. H. NIEDERREITER (1992) Random Number Generation and Quasi-Monte Carlo Methods. SI AM, Philadelphia. 19. A. B. OWEN (1992a) A central limit theorem for Latin hypercube sampling. J. Roy. Statist. Soc. Ser. B 54, 541-551. 20. A. B. OWEN (1992b) Orthogonal arrays for computer experiments, integra' tion and visualization. Statist. Sinica 2, 439-452. 21. A. B. OWEN (1995) Randomly permuted (t, m, s)-nets and (t, s)-sequences. Monte Carlo and quasi-Monte Carlo methods in scientific computing (Las Vegas, NV, 1994), Lecture Notes in Statistics 106, pp. 299-317. Springer, New York. 22. A. B. OWEN (1997a) Monte Carlo variance of scrambled net quadrature. SIAM J. Numer. Anal. 34, 1884-1910. 23. A. B. OWEN (1997b) Scrambled net variance for integrals of smooth functions. Ann. Statist. 25, 1511-1562. 24. A. B. OWEN (1998) Scrambling Sobol' and Niedereitter-Xing points. J. Complexity 14, 466-489. 25. D. RAGHAVARAO (1971) Constructions and Combinatorial Problems in Design of Experiments. Wiley, New York. 26. J. SACKS, W. J. WELCH, T. J. MITCHELL & H. P. WYNN (1989) Design and analysis of computer experiments (with discussion). Statist. Sci. 4, 409-435. 27. T. J. SANTNER, B. J. WILLIAMS & W. I. NOTZ (2003) The Design and Analysis of Computer Experiments. Springer, New York. 28. C. M. STEIN (1972) A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. Proc. Sixth Berkeley Symp. Math. Statist. Prob. 2, pp. 583-602. Univ. California Press, Berkeley.
222
Wei-Liem Loh
29. C. M. STEIN (1986) Approximate Computation of Expectations. IMS Lecture Notes, Monograph Series, Vol. 7. Hayward, California. 30. M. L. STEIN (1987) Large sample properties of simulations using Latin hypercube sampling. Technometrics 29, 143-151. 31. B. TANG (1993) Orthogonal array-based Latin hypercubes. J. Amer. Statist. Assoc. 88, 1392-1397. 32. S. TEZUKA & H. FAURE (2003) /-binomial scrambling of digital nets and sequences. Information-Based Complexity Workshop (Minneapolis, MN, 2002). J. Complexity 19, 744-757. 33. B. VON BAHR (1976) Remainder term estimate in a combinatorial central limit theorem. Z. Wahrsch. verw. Geb. 35, 131-139. 34. R. X. YUE (1999) Variance of quadrature over scrambled unions of nets. Statist. Sinica 9, 451-473. 35. R. X. YUE & F. J. HICKERNELL (2002) The discrepancy and gain coefficients of scrambled digital nets. J. Complexity 18, 135-151. 36. R. X. YUE & S. S. MAO (1999) On the variance of quadrature over scrambled nets and sequences. Statist. Probab. Lett. 44, 267-280.
The permutation distribution of matrix correlation statistics
A. D. Barbour and Louis H. Y. Chen Angewandte Mathematik, Universitat Zurich Winterthurerstrasse 190, CH-8057 Zurich, Switzerland, E-mail: [email protected] and Institute for Mathematical Sciences, National University of Singapore 3 Prince George's Park, Singapore 118402 E-mail: [email protected] Many statistics used to test for association between pairs (yi, Zi) of multivariate observations, sampled from n individuals in a population, are based on comparing the similarity ay of each pair (i, j) of individuals, as evidenced by the values j/j and yj, with their similarity b^j based on the values z, and Zj. A common strategy is to compute the sample correlation between these two sets of values. The appropriate null hypothesis distribution is that derived by permuting the ZJ'S at random among the individuals, while keeping the yi's fixed. In this paper, a Berry-Esseen bound for the normal approximation to this null distribution is derived, which is useful even when the matrices a and b are relatively sparse, as is the case in many applications. The proofs are based on constructing a suitable exchangeable pair, a technique at the heart of Stein's method. 1. Introduction When testing for linear association between the x- and y-values in data consisting of n pairs (xx, yx),..., (xn, yn) of real numbers, a classical procedure compares the value of Pearson's product moment correlation coefficient Er=i(ai-5)(i/i-y)
( l l )
with a critical value derived from its null distribution. In the parametric and version, the null distribution is that resulting when (xi,X2,---,xn) (yi,y2, • • • ,Vn) are independent random samples from the standard nor223
224
A. D. Barbour and Louis H. Y. Chen
mal distribution. Alternatively, to avoid making such strong assumptions, one can use the permutation null distribution, obtained when the values x\,...,xn are paired at random with the values j / i , . . . , yn. This is the distribution of T := Y^i=ixillir(i)i where TT denotes a uniformly distributed random element of the permutations Sn of the set [n] := { 1 , 2 , . . . , n}. The combinatorial central limit theorem of Wald & Wolfowitz (1944) shows that this distribution is asymptotically normal under limiting conditions reminiscient of those for the usual central limit theorem. The same is true for more general statistics T' of the form T" := Y17=i z(hn(i))i f° r z a m a t r i x of reals, as was proved by Hoeffding (1951). Later, in a celebrated application of Stein's method, Bolthausen (1984) established the Berry-Esseen bound n
n
dK{C{a-\T' - ET')), *) < Ca-Zn-1 J2 E !**(*' *)|3 i=l 1=1
( L2 )
2
for a universal constant C, where a := VarX" and
Z*(i, i) : = Z(i, o
- n - 1 1 E z ^ o + E z ^ k ) \+n~2 E E zU> k)-<
[j=i fc=i J j=ifc=i here, di< denotes the Kolmogorov distance between probability distributions, dK(P, Q) := sup |P{(-oo, x]} - Q{(-oo,a:]}|. X
Specialized to the context of T, this implies that n
n
dK(C{a-\T - ET)), $) < Ca^n-1 E N 3 E ^ '
(L3)
for the same constant C, where
t=l
and ±i := x, - x, yt := j/j - y. There are other measures of linear association, of which Kendall's T is an example, which can be viewed mathematically as a 2-dimensional or matrix analogue of Pearson's statistic. However, statistics of this form, which were first systematically studied by Daniels (1944), turn out to have much greater importance when applied as measures of spatial or spatiotemporal association. Their practical introduction began with the work of
Matrix correlation statistics
225
Moran (1948) and Geary (1954) in geography and of Knox (1964) and Mantel (1967) in epidemiology; see also the book by Hubert (1987). They typically take the form W := Y^.. aijhj, where ay- and b^ are two different measures of closeness of the i'th and j-th. observations, which may or may not be related; the null distribution of interest is then that of W
:
= J2ij °
where n is uniformly distributed on Sn. Here and throughout, the notation Yj'i,j denotes the sum over all ordered pairs of distinct indices i, j e [n]; a similar convention is adopted for more indices. In this and the following section, we assume that the matrices a and b are symmetric, as is natural for such applications. Other choices are possible, and Daniels (1944) himself considered anti-symmetric matrices; these, together with other, more general constructions, are covered by the results of Section 3. Various conditions under which the null distribution of W should be asymptotically normal have been advanced in the literature. The first approach, using the method of moments, is that of Daniels (1944). Another approach involves approximating bij by b(n~li,n~lj) for a piecewise Amonotone function b on [0,1]2 (see Jogdeo (1968)), and then approximating W by J2'i j a,ijb(Ui, Uj), where the random variables C/j, 1 < i < n, are independent and uniformly distributed on [0,1]. This direction was taken by Shapiro and Hubert (1979), who based their arguments on the work of Jogdeo (1968). Both of these approaches give rise to unnecessarily restrictive conditions; in particular, both require that the leading, linear term in a Hoeffding projection of W should be dominant. This requirement greatly restricts the practical application of the theorems, and alternative conditions, avoiding the difficulty, were proposed in Abe (1969) and in Cliff & Ord (1973). Barton (unpublished manuscript) gave counter-examples to these proposals. He then studied the case in which a is the incidence matrix of a graph and b^ = b(c(i), c{j)) is a function of the colours c(i), c(j) assigned to the vertices i and j of the graph, and gave a correct, although still rather restrictive, set of conditions for asymptotic normality in this context. Satisfactory conditions for asymptotic normality in the general case were given in Barbour k Eagleson (1986a). We begin by stating their Theorem 1,
226
A. D. Barbour and Louis H. Y. Chen
for which some more notation is required. Define 1
Ao := {{nh}'
^
/
n
<»«; M2 := n" ^ { < } 2 ;
A22:={(n)2}-1J2'iJal
1
and
^is^n-^KI3,
where (n) r denotes n(n — 1) • • • (n — r + 1), a* := (n - 2)" 1 ^
(aij ~ -^o),
and a^ := a^- - a* - a* - Ao,
and make similar definitions for b. Then they showed that di(£({VarW}- 1/2 (W - EW)), $) < *Ti(<$i + J2)
(1.4)
for a universal constant K\. In this bound, the quantities Si and 52 are given by x
4 - 3 , I D
<Si := n V ^AuBu
and
jr
and
/ ^22-622
<J2 := W—:—^—, V
nA\2LS\2
/n
_N
(1.5)
(1.6)
n—1 di denotes the bounded Wasserstein metric: for probability measures P and Q on R, di(P,Q):=sup / / d P - / / d Q ,
where JF is the set of bounded Lipschitz functions satisfying ||/|| < 1 and having Lipschitz constant at most 1. This result implies that a sequence W^ of statistics of this form is asymptotically normal if both S]"'' and <$2™ t e n d to 0. Although the argument is still based on a Hoeffding projection of W, the conditions are both simpler and less restrictive than those previously given, and the error in the approximation, measured in terms of the metric d\, is bounded by directly calculable quantities. The leading, linear component in the projection is a 1-dimensional statistic to which the Wald-Wolfowitz theorem can be applied with error of order O(6i); the quantity 62 accounts for the error involved in neglecting the remaining, quadratic component of W. The distance (1.4) between £({VarW}- 1 / 2 (W-EW)) and $, measured with respect to the metric d\, automatically implies a bound with respect
227
Matrix correlation statistics
to the Kolmogorov distance &K of order O(S\'2 + 8\ ); see, for example, the brief remarks in Erickson (1974, pp. 527,528). Zhao et al. (1997), working directly in terms of Kolmogorov distance, proved that dK{C{a-l'2{W
- EW)), *) < K2(6, + S3),
(1.7)
where 63:=a~3n4A23B23
and A23 := n~2S^
\a.ij\3;
this bound is the specialization to our problem of their more general theorem giving the rate of convergence to the normal for the statistic X := Si^-C(i,j;7r(i),7r(j)), where C is an arbitrary 4-dimensional array. For Kolmogorov distance, (1.7) can be asymptotically sharper than the rate deduced from (1.4). Suppose, for instance, that c$ = a{i/n,j/n) and b^ = b(i/n,j/n), 1 < i,j < n, for smooth symmetric functions a, b : [0,1]2 -»JR. Then i
'
^^TTyE^S
f1
0
a*(n) ~ ai(i/n),
r1
- I I a(x,y)dxdy=:a0; where ai(x) := / {a(x, y) - ao}dy; Jo
"•$ ~ o,2{i/n,j/n),
A™ ~
Jo
where a2(x,y) := a(x, y) -
falWdx-AM;
4 ? ~ t [1a22(x,y)dxdy=:AW; Jo Jo
A<$ ~
Jo
pMxtfdx^AM;
d£ ~ f I \a2(x,y)\3dxdy=:A^23\ Jo Jo
and similar asymptotics hold also for 6. Hence, if the A'(m' and I?['ml are all positive, for I = 1,2 and m = 2,3, it follows that
,
1
~
n
_1/2JW_
( J 4[12] B [12])3/2
lfl
f 4 [22]
D[22]
-. X / 2
"
_1/2
n
*"~»- {3H§Br} s"""2'
(L8
>
228
A. D. Barbour and Louis H. Y. Chen
and
*3~»-"2(jS^-»-1/2-
(19
»
1 2
Thus both 5\ + 82 and 8% + 83 are of order O(n~ / ), but for CLK the bound (1.4) then only implies a rate of O(n - 1 ' 4 ), which is worse than the order O(n~l/2) implied by (1.7). However, as remarked above and as discussed at greater length in Barbour & Eagleson (1986b), these asymptotics are not those relevant in many statistical applications. To illustrate this, we specialize to graph-based measures of association, such as those considered by Knox (1964), Bloemena (1964) and Friedman & Rafsky (1983), and take both a (n) and 6(n) to be incidence matrices of graphs. In examples concerning spatial and temporal correlation, increasing n typically means increasing either the area of the study or its timespan; in either case, the number of neighbours of a vertex is unlikely to increase proportionally to n, as the above asymptotics would suggest, but rather more slowly. So suppose that the typical vertex degree in a^ is of order na for some 0 < a < 1, and in b^ of order n13 for some 0 < /3 < 1. This means that each row and column of A (B) typically contains O(na) (O(n^)) l's, all other entries being 0. Then the corresponding asymptotics for the Ars typically become
4^xn-2+2a;
A£>xn-1+a;
A{£> x n~3+3a
and A™ x n~1+a,
(1.10) though for balanced graphs both A^' and ^4^ may be of smaller order; analogous formulae hold for the Brs. Note that these relations agree with the previous asymptotics only when a = (3 = 1. It now follows from (1.10) that Thus 5\ = 0(62) and 82 = 0(^3) unless a = (3 = 1. Furthermore, 5\ = o(53) whenever a + (3 < 13/7, so that, even for dx, the bound (1.4) implies a better result than (1.7) in this range; indeed, the bound in (1.7) tends to zero as n —> 00 only if a + (3 > 7/4, whereas that in (1.4) tends to zero so long as a + (3 > 1. In Theorem 2.4, we improve the &K bounds to the same order 0(61+62) as that given in (1.4) for the metric di. This then gives a convergence rate which is uniformly better than that given in Zhao et al. (1997) whenever a + f3 < 2, and is as good when a + (3 = 2. Indeed, the quantity 5\ is the same as that appearing in Bolthausen's Lyapounov third moment bound
229
Matrix correlation statistics
for approximating the linear component in the projection, and is of order 0{n~1/2) in all the asymptotics considered above. The quantity <52 is the ratio of the standard deviations of the quadratic and linear components, and is a natural measure of the effect of neglecting the quadratic component in comparison with the linear component. Hence nothing essentially better than our theorem is to be hoped for from any argument that is based, as are both ours and that of Zhao et al. (1997), on using the linear projection for the approximation and ignoring the quadratic component. Our argument, motivated by Stein's method, proceeds by way of a concentration inequality, and is much in the spirit of Chen and Shao (2004). The concentration inequality is used to bound the probability that the linear component takes values in certain intervals of random length; this in turn enables the error incurred by neglecting the quadratic component to be controlled. We also use the same technique in the final section, to prove a dx-bound of analogous order for the error in the normal approximation of X = J2i 12 j C(i, j ; n(i),ir(J)), the context studied by Zhao et al. (1997): see Theorem 3.3. The resulting bounds are again always of asymptotic order at least as good as those of Zhao et al. (1997), and are frequently much better. Example: Each of n individuals is assigned two sets of values, one describing them in socio-economic terms, the other in terms of consumption pattern. Thus individual i is assigned values yi & y and Zi £ Z, where y and Z are typically high dimensional Euclidean spaces. The individuals are then clustered twice, once using the j/j values, and once using the Zi, and it is of interest to know whether the two clusterings are related. To test this, one can set ciij(bij) = 1 if i and j belong to the same y~ (Z-) cluster, and 0 otherwise, and compare the value of the statistic W := 12i j aij^ij t° its null distribution. To fix ideas, suppose that there are ky ^-clusters, with sizes roughly uniformly distributed between 1 and 2n/ky. Then An x ky2,
A22~ky1,
and
An~ky3.
Assume that analogous relations hold also for the ^-clustering. Then ^xn-1/2
and S2 ~
{n^kykz}1'2.
As n increases, more data being available, it is natural to allow for a more informative analysis, by letting both ky = ky and kz = k£ increase with n. Then the normal approximation to the null distribution of W is guaranteed by Theorem 2.4 to have error of order O({n~1ky k^ j 1 ' 2 ) ,
230
A. D. Barbour and Louis H. Y. Chen
which is o(l) as long as ky kg = o(n). This represents a substantial improvement over the order to be obtained from the theorem of Zhao et al. (1997), which is o(l) only so long as k^k^ = ofa 1 / 4 ). Remark: It is nonetheless interesting to note that there is a second dibound given in Barbour & Eagleson (1986a), of more complicated form than those given above, which does not rely on projection, and which establishes asymptotic normality in yet wider circumstances. Under the asymptotics of (1-10), it yields the result diiCttV&iWj-^iW
-TEW)),®) =O(n-^m'm^a+0'1)),
(1.11)
establishing asymptotic normality for all choices of a, /3 > 0, and not just for a + (3 > 1. This then also implies that dK(£({Va,rW}-1/2(W
- EW)), $) = O(n~i «*»(°+fi.i))t
(1.12)
of smaller order than <5i + 62 if a + /3 < 3/2, indicating that the approach by way of projection, while much simpler in construction, loses some precision. 2. Matrix correlation statistics Let A and B be symmetric n x n matrices, and define
W:=W(n):=^[jaijb7T{iMj),
(2.1)
where n is a uniform random permutation of [n] := {1,2,..., n) and J2[ • denotes the sum over pairs (i,j) of distinct elements of [n]. Thus, in the definition of W, the diagonal elements play no part, so that we may assume that an = ba = 0 for all i € [n]. We then note that
where
M := T T E
^2,
a b
(71)2 *-~'i,3 *-~'l,m
ij im\
we use the notation from the introduction throughout. We note also that, with a* and a^- defined as before, we have n
Y, < = 0; J2~aio= E fi« = 0t=l
i:i^j
j-.j^i
(2-2)
231
Matrix correlation statistics
With these definitions, and with their analogues for the matrix B, it was shown in Barbour &: Eagleson (1986a) that W = ft + V + A, where
(2.3)
n
V:=2(n-2)Y,a*iK(i) and A := ^ i=l
a^K^^y,
(2.4)
it was also shown that
*>:=VzrV=An2{n-2)2A12B12, n —1
VarA^ B l f c i ) ! ^ ^ , n—3
and that Cov (V, A) = 0. In this section, we derive a concentration inequality bounding the probability of V :=• (j~xV belonging to an interval of random length |A|, where A := a~1A, and use it to deduce a normal approximation to W. The argument is based on ideas from Stein's method, using an exchangeable pair. Broadly speaking, we show that E{F/A(F)}«E{/A(V)}, for a function / A which is chosen so that f'A(v) « 1[ Z)Z +A]( V )> a n d which also satisfies | | / A | | « | | A | : see also Chen & Shao (2004). To avoid trivial exceptions, we assume that Ai2Bi2 > 0; otherwise, V = 0 a.s. The first step is to construct an exchangeable pair (Stein 1986, p. 2). We do this by defining a pair of independent random variables / and J, each uniformly distributed on [n], which are independent also of n, and then defining n' by (ir(a) ifa?I,J; 7r'(a) = i TT(J)
if a = 7;
[TT(J)
i f a = J.
(2.5)
We then set n
V := 2a- (n - 2) £>*&;,(i); A' := a"1 ^ 1
. a,Awa) :
it follows immediately that (V, A) and (V,A') are exchangeable. We use this to prove the following lemma. Lemma 2.1: We have E{V - V | TT} = 2rr 1 V;
JE{(V - V')2} = 4n" 1 lEy 2 = An'1;
JE{(A - A') 2 } = Sn~2(n - 1)JEA2 = 8n" 2 (n - 1)<5|,
232
A. D. Barbour and Louis H. Y. Chen
where p ._ EA2 _ VarA _ (n - I ) 3 A22B22 2 '" " Var? " 2n(n - 2)2(n - 3) AUB12
lg2
2*2-
Proof: If U and U' are exchangeable, then E{([/' - t/)(C/' + U)} = 0, which in turn implies that E{(£/' - U)2} = 2E{U(U - U')}.
(2.6)
Applying this first with U — V, we have E{(V" - F) 2 } = 2E{yE(V - V I TT)} = Aa-\n - 2)E{VE(fl-(/, J; TT) | TT)},
(2.7)
where ff(/,J;7r):=/i(J,J,7r(J),7r(J)) :
= a*iK(i) + a*jK(j) - a*iK(j) -
a b
J l(i)'
because V-V
= 2a-1(n-2)H(I,J;n).
(2.8)
Now H(I, I; TT) = 0, and
E ( a ; » ; ( J ) i { / ^ j } 1 T) = » - J ^
a-t;,,, =
2 n
, ^
2 )
,
from (2.2) and (2.4). Hence it follows that E(tf(/,7;7r)|7r) = (7V/{n(n-2)},
(2.9)
which, from (2.8), is equivalent to E{y — V' \ TT} = 2n~ 1 y, and hence, from (2.7), that E{(V"-V) 2 } = 4n- 1 EV 2 , as claimed. For the second part, apply (2.6) with U = A; this gives E{(A' - A)2} = 2<7-1E{AE(tf (/, J; TT) | TT)},
233
Matrix correlation statistics
where K(I, J;ir) := ^
+ ajj(6 w ( j ) T ( j ) - K(i)*(j))}
{aij{K(I)7r(J) - K{J)ir(J))
+ 2_j {ai/(67r(t)7r(/) -
&
x(i)7r(J)) + "ij(^7r(i)7r(J) ~ &,r(t)ir(J))}
+ «Jj(l».(J)i(/) - ^7r(/)7r(J)) + au(K(I)K(J)
~
K(J)K(I))-
Now
= n~2{n
- 2)CTA,
where the sum J^^ ; m is over all distinct choices of the indices, and
in view of (2.2). Hence E(K(J, J; ?r) I TT) = 4n~ 2 (n - 1)CTA,
and the lemma follows because E A 2 — %\.
•
Now suppose that 8 is chosen such that \nE{\V - V\ min(|V - V\, 6)} > | ;
(2.10)
this is possible, since ^nE{|y' — y|min(|y' — F|,<5)} is a continuous and increasing function of 5 > 0, taking all values between 0 and 1. Then we have the following bound. Lemma 2.2: For S such that (2.10) is satisfied, it follows that max{P[2:-|A|
V
< (l + V2)d2 + 26 + 2Varr)(5)
for all z, where V(S)
:= \nE{\V
- V\ min(|V - V\,S) \ TT}.
234
A. D. Barbour and Louis H. Y. Chen
Proof: We give the argument for TP[z < V < z + |A|]. Define the function 2) /A = /i by
{
-(±|A|+<5)
ifx
-±|A|+a;-z
if z - 6 < x <
| | A | + <J
if * > z + |A| + «S.
z+\A\+5;
Then, by the exchangeability of (V, A) and ( V , A'), we have 1E{(V'-V)lfA(V') + fA,(V)}} = 0; adding 2E{(V — V')fA(V)}
to each side and multiplying by n/4, this gives (2.11)
\nK{{V - V')fA(V)}
= inE{(K' - V)[fA(V) - fA(V)}} + \nE{(V - V)[fA,(V) - fA(V)}}. The left-hand side can be rewritten as in Lemma 2.1 as inE{(V - V')fA(V)} = inE{/ A (F)E(F - V \ n)} =
E{V/A(V)},
this last from Lemma 2.1, implying in turn that inE{(V - V')fA(V)}\ < IE|FA| + SE\V\ (2.12)
from the Cauchy-Schwarz inequality, from the definition of <5>2, and because W/2 = 1. The second term on the right-hand side of (2.11) is easily bounded by \\nlE{(V' - V)[fA,{V) - fA(V)}}\ < |nE{|V" - V\ ||A'| - |A||} < inE{|V'-V||A'-A|} (2.13)
< 62/V2, by Lemma 2.1. For the first, we write
ln-E{(V'-V)[fA(V>)-fA(V)]} = \vMUV'-V)£
V
f'A(V + t)dt\
= E y°° f'A(V + t)M(t)dty where M(t) := \n(V - V)[1{V
-V>t>0}-
1{V - V < t < 0}] > 0
235
Matrix correlation statistics
a.s. for all t. Recalling the definitions of /& and i](S), it thus follows that \nE{{V -V)[fA{V)
- fA{V)}}
> E J/ l{z-5
+ t
S}M(t)dt\ J
> E J \{z < V < z + |A|} / M{t) dt \ { J\t\<s ) = \nE{l{z
< V < z + \A\}\V - V\min(|y' - V\,S)}
= ^{r,(6)l{z
(2.14)
> | P [ z < V < z + |A|] - \JE{(V(S) - E»?(<5))1{2 < V < z + |A|}}(. But now, by Cauchy's inequality applied to the quantities y/2\r](5) -Erj(d)\ a,nd l{z < V < z + \A\}/V2, | E { ( T J ( 5 ) - Er,(S))l{z
+ |A|}}|
< i{5 F ^<^< z +l A l]+ 2 V a r ^)}and the lemma follows from (2.11), (2.12), (2.13) and (2.14).
•
The next step is to choose a suitable value of 6, and then to bound V&ir)(6). For the first, note that min(x, 6) > x — x2/(AS) in x > 0, so that \nlE{\V'-V\mm(\V'-V\,8)} > i n E {|V - V\(\V -V\- \V - V\2/{U))}
= i_^ E r -v|»>| if 5 > \nE,\V - V\3. Now, from (2.8), E | F ' - V\3 = 8a~3(n - 2)3E\H(I, J; TT)|3 < 8 ( n - 2 ) a - 3 ^ . TE)h(i, JMQ,*U))\*
^ S n - V - 3 ^ ' . T[
\h(i,j,l,m)f.
(2.15)
< 16{|a*i.n3 + |a*6^| 3 + |a*6^| 3 + \a*bt\3},
(2.16)
Then, from the definition of h, \h(i,j:l,m)\3
236
A. D. Barbour and Louis H. Y. Chen
implying that E | V - V\3 < Sn^tr" 3 64(n - I) 2 nAu nBn; hence (2.10) is satisfied with the choice 8 = 128<5i, where <*i := ^a^A^B^.
(2.17)
The bound on Vaxr)(5) follows from the next lemma. L e m m a 2.3: Suppose that n > 4 a n d t h a t d(i, j,r,s), 1 > ) = 0> ^ ^ I < i,j < n- Then it follows, setting for •n uniformly distributed on Sn, that Xij := d(i,j,-rr(i),n(j))
Var SjT' • xij} ^ 8n3D2(l + 5n~l), where
D2:={(nh}-2£ij Proof: T h e conditions on d(i,j,r,s) that
We now note that, for i,j,l,m
Y,'rtt
all distinct,
E,{XijXim} = v^r^2rstu
d(i,j,r,s)d(l,m,t,u),
where, as usual, the sum runs over all ordered quadruples of distinct elements of [n]. But now, because of the conditions on d(i,j,r, s), we have \n)A —r's x < - 2 ^ d(l,m,r,u) — ^ ^ u:u^r,s
u:u^s,r
d(l,m, s,u) - ^ t:t^tr
d(l,m, t,r) — y ^ d(l,m,t, s)) , t:t^s
so that, summing over all distinct choices of i, j , I, m, and also recalling the
J
237
Matrix correlation statistics
inequality \xy\ < (x2 + y2)/2, we have
~
( n ) 4 ^-'i.j ^ f , m ^ r , s
x| ^
|d(t,j,r,s)d(i,m,r,u)|+ ^
|d(i,j,r,s)d(i,m,s,u)|
+ Y2 \d(i,j,r,s)d(l,m,t,r)\+ Y2 \d(i,j,r,s)d(l,m,t,s)\ > t:t^r r=l 3
=
4n (n-l) (n) 4
J
t:t^s s:s^r
4 D2
(2 18)
'
-
If indices among i,j,l,m are equal, the argument does not use cancellation in the d-sums. First, for / = i, we have E{XijX im } = r ^ - ^
^ u d(i,j, r, s)d(i, m, r, u),
so that
l E i ^ nXtjXm}\ -.
n
(2.19) n
^ Tn)~Y.Y. Yl 12 E E \d(i,j,r,s)d(i,m,r,u)\ \
Z'5 j = l r = l
j,^
i s
,s^
T
= ^ i ^ D 2 . (2.20) (n)3 The same bound also holds for the three remaining arrangements with one pair of equal indices. Finally, we have
E-• ***«} = E-• E' T V ^ ' ^ ^ T T 1 ^ 2 ' <2-21> A
—'*,3
J
t-^iiJ
^—^r,s
(n)2
and the same bound follows also for J2i j E{XijXji}.
[71)2
238
A. D. Barbour and Louis H. Y. Chen
Collecting the results in (2.18) - (2.21), we have shown that l-^«,j
i
(n)4
[
(n) 3
(n)2
J
and the lemma follows by using elementary computations.
•
Now rj(5) is almost of the form considered in the previous lemma: r)(5) = jn-1^^
w(i,j,Tr(i),n(j)),
where :=4a-2(n-2)2\h(i,j,r,s)\mm\\h(i,j,r,s)\,—^—-\. I 2(n - 2) J
w(i,j,r,s)
Hence we can apply Lemma 2.3 to calculate Va.TTj(5) by defining d(i,j,r,s) := in" 1 {«;(;, j , r , s ) - (n)~l ^
(
^ w(i,j,/,m)|,
noting that then
/ n - 2 \ 2 J2 „ „ =
7
-0^12^12,
\n — 1 / a 2 giving Varjj(J) < 2J2(1 + 5n- 1 )n/(n - 1) < 252(1 + Sn" 1 ),
(2.22)
uniformly in n > 4. This completes the preparation. Theorem 2.4: For W defined as in (2.1), we have sup \P[W -ii
${z)\ < (2 + C)8 + 12<52 + (1 + y/2)l2,
z
where 5 := 1285i, 5\ — n4a~3A\2Bi2, 62 is as defined in Lemma 2.1, $ denotes the standard normal distribution function, and G is the constant in Bolthausen's (1984) theorem.
239
Matrix correlation statistics
Proof: We use the chain of inequalities -P[z - |A| < V < z] = ~1P[V
- fi) < z] - P[V < z\\ < (1 + V2)62 + 25+ 1262,
from Lemma 2.2 and (2.22), using the choice 5 = 1285i from (2.17). This, together with Bolthausen's bound (1.3) sup|P[V
completes the proof.
•
Remark: Note that our approximation, in common with that of Zhao et al. (1997), normalizes W with a, and not with •v/VarW. However, because VarA/VarV = <5|, we could use VVaiW instead, without affecting the order of the error. 3. General arrays We now consider the more general statistic n
n
*:=XXC(U;7r(i),ir(j))
(3.1)
studied by Zhao et al. (1997), where C is an arbitrary 4-dimensional array. The first step is to split the sum into a linear and a quadratic part, as in (2.3) and (2.4). Noting that the terms with i = j fall naturally into the linear part, we adopt a notation which treats them separately. We define 'off-diagonal' sums, in which replacing an index at any position by a '+' sign denotes summation over all indices at this position which, combined with the other index of the (first or last) pair, do not result in a pair of identical indices: thus C(+,j;l,m):= £
C(t, j;J,m);
C(+, +; l,m) := ^
i:iftj
and C(i,+;+,m):= ]T ]T C(i,j;l,m). j-.j^i
i.l^m
C(i,j;l,m)
240
A. D. Barbour and Louis H. Y. Chen
We then define the 'diagonal' sums, in which replacing a pair of indices by '(++)' indicates summation over all pairs of equal values: thus n
n
n
C((++); I, m) := £ C(i, i; I, m) and C((++); (++)) := £ Z ) C(i, *; I, /). i=l
i=l 1=1
With this notation, we set C(i,j;l,m) := C(i,j;l,m) ?~ ox {(», +; Z, m) + C(+, j ; i, m) + C(i,j; I, +) + C(i, j ; +, m)}
-
Ivy it
Zi )
-
n,_2){g(+,»;
+
(n-lKn-2){C(+'+;/'m) +
+
(^))2cW + ; ^ V [ 2 ]
g[4] n ( n - 2)2
i, m) + C(j, +; /, m) + C(i, j ; +, i) + C(t, j ; m, +)}
C[5] n(n-l)(n-2)2
C (
^'+'+)}
+
^^co]
C(+,+;+,+) (n - l) 2 (n - 2)2
and 2(n-2)c*(M) := ^ ~ _ 1 2 ) { C ( t , + ; ; , + ) + C(+,r,+,0}
- —f
Tii( it
^{C(i,+;+,+) + C(+, i;+,+) + C(+,+; l,+) + C(+,+;+, I)} Zi)
+
n 2 ( n 2 - 2 ) C ( + ' + i + ' + ) + W'*'1'1) -n- 1 {C(t,t;(+ + )) + C((++);/,/)}+n- 2 C((++);(++)),
where C[l]:=C{i,+;l,+) + C(i,+;+,m) + C(+,j;l,+) + C(+,j;+,m); C[2] := C(+, i; I, +) + C(+, i; +, m) + C(j, +; /, +) + C(j, +; +, m) + C(i, +; +, I) + C(i, +; m, +) + C(+, j ; +, 0 + C(+, j ; m, +); C[3] := C(+, t; +, Z) + C(+, i; m, +) + C(j, +; +, J) + C(j, +; m, +); C[4] := C(i, +; +, +) + C(+,j; +, +) + C(+, +; I, +) + C(+, +; +, m) and C[5] := C(+,i;+,+) + C(j,+;+,+) + C(+,+;+,0 + C(+,+;m,+).
Matrix correlation statistics
241
C(+,j; l,m) = C(i, +; I, m) = C(i,j; +, m) = C(i,j; l,+) = Q
(3.2)
Note that then
and that
£y(i,o = xy(M) = o.
(3-3)
With these definitions, X can be decomposed into linear and quadratic parts: (3.4)
X = n + V + A, where
n
V = V(7r):=2(n-2)£y(t,7r(t)); i=l
A = A(TT) := J2[. C(i,j;n(i),KU))-
(3-5)
This decomposition is the analogue of that given in (2.3) and (2.4), and is slightly different from the version given in Zhao et al. (1997); here, the terms V and A are uncorrelated. We write
C12 := n- 2 £ f > * ( i , W2; C22 := {(n)2}~2 £ | • E | m {C(i, j ; I, m)}2, assuming that C12 > 0, to avoid trivial exceptions. Lemma 3.1: With the definitions above, we have _
_
EV = EA = E{VA\ = 0;
2
_
2
a := EV =
and n-3
An2(n—r>\2 (
n-1
' C12
242
A. D. Barbour and Louis H. Y. Chen
Proof: We indicate the calculations for E A 2 , which are the most complicated. First, for i,j,l,m distinct, using (3.2), we have
E{C(t, j ; 7r(i), *(j))C(l, m; ir(l), n(m))}
= 7 V E ' , C(i,j;r,s)C(l,m;t,u) (71)4 •*—'r,s,t,u
= - T V E ' tC(i,j;r,s){C(l,m;t,r) + C(l,m;t,s)} = Jn^Yl'rs C{i,j;r,s){C(l,m;s,r) + C(l,m;r,s)}. Then, adding over i, j , l,m distinct and again using (3.2), it follows that |^ i i > | > m E{C(t,j;7r(i),7r(j))C(i,m;7r(i),7r(m))}|
= T V E ! E ' {C(l,m;r,s) + C{m,l;r,s)}{C(l,m;s,r) +C(l,m;r,s)} 4n(n - 1) ^ ^(n-2)(n-3)C22' the last line following because \xy\ < \{x2 + y2). Similar calculations for the cases in which the pairs (i,j) and (I, m) have one index in common give a total contribution of at most 4n(n — l)(n — 2)~1C>22, and the terms with (l,m) = (i,j) and (l,m) = (j,i) yield at most 2n(n - l)C22- Adding these three elements gives the stated bound for EA 2 . • We now write V = V(ir) := cr~1V(n) and A = A(n) := cr~1A(7r), and set l\ := VarA = a" 2 EA 2 < / ( n ~ ]f ^2. 2 - 2 n ( n - 2 ) 2 ( n - 3 ) C12 We then define IT' as in (2.5), and write V = V(TT') and A' = A(TT'), giving the exchangeable pair (V, A), (V, A'). Lemma 3.2: Witi the above definitions, we have E{V -V'\ir} = 2n~1V! E{(V - V')2} = An-1, E\V - V'\3 < 5l2n3a-3C13 and then finally JE{(A - A') 2 } < Sn^JEA 2 = Sn'11\, where
Ci3:=n-2]T]r|c*(U)|3. *=i 1=1
Matrix correlation statistics
243
Proof: Much as in the proof of Lemma 2.1, we have
V -V = 2a-\n-2) X{C*(I, 7T(J)) - C*(I, 7T(/)) + C*{J, 7r(J)) - C*(J, 7T(J))}1{/ ^ J } , giving E { V - V" | TT} = ^ 2 ^ E ^ { C *(*.'(0) - ^ M C O ) + C*(j,7T(j)) - C*(j,7r(i))} =
2 ( n ^ l^ c . ( .^ ( . ) )
+ (n_1)^c.(.j7r(.)) + n ^ c , ( j > c . ) ) l
as required, by way of (3.3); the formula for E { ( ^ — V ) 2 } now follows from (2.6). The argument for TE\V - V'\3 then matches that given in (2.15) and (2.16), leading to the expression
W,\V-V'\3 ^Sn^a'3 xl6
E ' • E i {IC'(*'"OI3 + I^^'OI3 + |c'(i,0l3 + k*(j,rn)|3},
from which the result follows. The corresponding formula for A' — A is a(A'-
A) = J2
{C(I,J;K(J),K(J))-C(IJ;K(I),TT(J))}
+ J2 {C(i, I; *(>), TT( J)) - C(i, I; Tr(i), 7r(/))} + J^
{C(J,j;7r(/),7r(j))-C(J,j;7r(J),7r(j))}
+ J^ {C(»,J;ir(i),7r(J))-C(t,J;7r(i),7r(J))} + C(/, J; TT( J), TT(/)) - C(I, J; * (I), TT( J))
+ C(J,/;7r(/),7r(J)) - C(J,7;7r(J),7r(/)). The expectation conditional on 7r, using (3.2), now yields E{A - A' | TT} = 4n" 2 (n - 1)A + 2n" 2 A - 2n"2A*,
244
A. D. Barbour and Louis H. Y. Chen
where A* := a"1 ^
.C(i,j;ir(j),ir(i)). Hence, from (2.6), it follows that
E{(A' - A) 2 } = 2E{A(A - A')} = 4 n - 2 E { ( 2 n - l ) A 2 - A A * } < Sn^EA2, completing the proof.
•
The rest of the argument is exactly as for Theorem 2.4, with Lemma 2.2 and its proof remaining true in this context, with the new definitions of V, V and 82, with h(i, j,r, s) replaced by c*(i, r) — c*(i, s) + c*(j, s) — c*(j,r), and with 8 = 128<5i being a suitable choice for 8, now with the new definition 5i := n 4 cr~ 3 Ci3. This leads to the following theorem. Theorem 3.3: For X as defined in (3.1), pt as in (3.5) and a as in Lemma 3.1, we have sup \P[X - H<<JZ]-
$(z)| < (2 + C)6 + 1262 + (1 + V2)82,
Z
where 6 := 128Slt Sx := nV- 3 Ci 3 , l\ = EA2 < constant in Bolthausen's (1984) theorem.
2n
[^Z^22,
and C is the
As before, Si is the order of the error bound in Bolthausen's BerryEsseen bound for the normal approximation to the linear component V, and 82 represents the error incurred by neglecting A. In the bound given by Zhao et al. (1997), our 82 is replaced by a term of order much like
the difference lying only in the slight difference between their decomposition and ours. Now, by Holder's inequality, we have <J38s{(n)2}~2 > C22 and 81 > n" 1 / 2 . Hence <53 > {(n)2}2<7-3C232/2 x n'8l implying that £3 is of larger order than 82 if 82 3> n~1^2. Thus the bound given in Theorem 3.3 is always asymptotically of at least as small an order as that of Zhao et al. (1997), and is of strictly smaller order whenever 82 3> ^i-
Acknowledgement: ADB gratefully acknowledges support from the Institute for Mathematical Sciences of the National University of Singapore, while part of this work was completed.
Matrix correlation statistics
245
References 1. O. ABE (1969) A central limit theorem for the number of edges in the random intersection of two graphs. Ann. Math. Statist. 40, 144-151. 2. A. D. BARBOUR & G. K. EAGLESON (1986A) Random association of symmetric arrays. Stock. Analysis Applies 4, 239-281. 3. A. D. BARBOUR & G. K. EAGLESON (1986B) Tests for space time clustering. In: Stochastic spatial processes, Ed. P. Tautu, Lecture Notes in Mathematics 1212, 42-51: Springer, Berlin. 4. A. Ft. BLOEMENA (1964) Sampling from a graph. Mathematisch Centrum, Amsterdam. 5. E. BOLTHAUSEN (1984) An estimate of the remainder in a combinatorial central limit theorem. Z. Wahrsch. verw. Geb. 66, 379-386. 6. L. H. Y. CHEN k. Q.-M. SHAO (2004) Normal approximation for non-linear
statistics using a concentration inequality approach. Preprint. 7. A. D. CUFF & J. K. ORD (1973) Spatial autocorrelation. Pion, London. 8. H. E. DANIELS (1944) The relation between measures of correlation in the universe of sample permutations. Biometrika 33, 129-135. 9. R. V. ERICKSON (1974) L\ bounds for asymptotic normality of m-dependent sums using Stein's technique. Ann. Probab. 2, 522-529. 10. J. H. FRIEDMAN & L. C. RAFSKY (1983) Graph theoretic measures of multivariate association and prediction. Ann. Statist. 11, 377-391. 11. R. C. GEARY (1954) The contiguity ratio and statistical mapping. The Incorporated Statistician 5, 115-145. 12. W. HOEFFDING (1951) A combinatorial central limit theorem. Ann. Math. Statist. 22, 558-566. 13. L. J. HUBERT (1987) Assignment methods in combinatorial data analysis. Marcel Dekker, New York. 14. K. JOGDEO (1968) Asymptotic normality in nonparametric methods. Ann. Math. Statist. 39, 905-922. 15. G. KNOX (1964) Epidemiology of childhood leukaemia in Northumberland and Durham. Brit. J. Prev. Soc. Med. 18, 17-24. 16. N. MANTEL (1967) The detection of disease clustering and a generalized regression approach. Cancer Res. 27, 209-220. 17. P. A. P. MORAN (1948) The interpretation of statistical maps. J. Roy. Statist. Soc. B 10, 243-251. 18. C. P. SHAPIRO AND L. J. HUBERT (1979) Asymptotic normality of permutation statistics derived from weighted sums of bivariate functions. Ann. Statist. 7, 788-794. 19. C. STEIN (1986) Approximate computation of expectations. IMS Lecture Notes - Monograph Series, vol. 7, Hayward, CA. 20. A. WALD & J. WOLFOWITZ (1944) Statistical tests based on permutation of the observations. Ann. Math. Statist. 15, 358-372. 21. L. ZHAO, Z. BAI, C.-C. CHAO & W.-Q. LIANG (1997) Error bound in a cen-
tral limit theorem of double-indexed permutation statistics. Ann. Statist. 25, 2210-2227.
Applications of Stein's method in the analysis of random binary search trees Luc Devroye School of Computer Science, McGill University 3480 University Street, Montreal, Canada H3A 2K6. E-mail: [email protected] Under certain conditions, sums of functions of subtrees of a random binary search tree are asymptotically normal. We show how Stein's method can be applied to study these random trees, and survey methods for obtaining limit laws for such functions of subtrees.
1. Random binary search trees Binary search trees are almost as old as computer science: they are trees that we frequently use to store data in situations when dictionary operations such as INSERT, DELETE, SEARCH, and SORT are often needed. Knuth (1973) gives a detailed account of the research that has been done with regard to binary search trees. The purpose of this note is twofold: first, we give a survey, with proofs, of the main results for many random variables defined on random binary search trees. Based upon a simple representation given in (1.4), we can obtain limit laws for sums of functions of subtrees using Stein's theorem. The results given here improve on those found in Devroye (2002), and this was our second aim. There are sums of functions of subtree sizes that cannot be dealt with by Stein's method, most notably because the limit laws are not normal. A convenient way of treating those is by the fixed-point method, which we survey at the end of the paper. Formally, a binary search tree for a sequence of real numbers x\,..., xn is a Cartesian tree for the pairs (l,xi), (2,12), • • • > (n,xn). The Cartesian tree for pairs (ti,xi),...,(tn,xn) (Frangon, Viennot k Vuillemin (1978) and Vuillemin (1978)) is recursively defined by letting s = argmin{£j : 1 < i < n}, 247
248
Luc Devroye
making (ts,xs) the root (with key xs), attaching as a left subtree the Cartesian tree for (ti,Xi) : ti > ts,Xi < xs, and as a right subtree the Cartesian tree for (ti,Xi) : ti > ts,Xi > xs. We assume that all t^s and Xj's are distinct. Note that the ti's play the role of time stamps, and that the Cartesian tree is a heap with respect to the U's: along any path from the root down, the tj's increase. It is a search tree with respect to the keys, the x,'s. Observe also that the Cartesian tree is unique, and invariant to permutations of the pairs (ti,Xi).
Fig. 1.1. The binary search tree for the sequence 15, 5, 3, 25, 9, 6, 28, 16, 11, 20, 2, 1, 18, 23, 30, 26, 17, 19, 12, 10, 4, 13, 27, 8, 22, 14, 24, 7, 29, 21. The ^-coordinates are the time stamps (the y-axis is pointing down) and the x-coordinates are the keys.
The keys are fixed in a deterministic manner. When searching for an element, we travel from the root down until the element is located. The complexity is equal to the length of the path between the element and the root, also called the depth of a node or element. The height of a tree is
Random binary search trees
249
the maximal depth. Most papers on the subject deal with the choice of the ii's to make the height or average depth acceptably small. This can trivially be done in time O(nlogn) by sorting the Xi's, but one often wants to achieve this dynamically, by adding the z,'s, one at a time, insisting that the n-th addition can be performed in time O(logn). One popular way of constructing a binary search tree is through randomization: first randomly and uniformly permute the Xj's, using the permutation (
(1.1)
incrementally. This is the classical random binary search tree. This tree is identical to (1.2)
(t1,xi),(t2,x2),...,(tn,xn)
where (ti,t2, • • • ,tn)is the inverse random permutation, that is, ^ = j if and only if Gj = i. Since (ti,t2, • • •, tn) itself is a uniform random permutation, we obtain yet another identically distributed tree if we had considered {Uuxl),{U2,x2),...,{Un,xn)
(1.3)
where the C/j's are i.i.d. uniform [0,1] random variables. Proposal (1.3) avoids the (small) problem of having to generate a random permutation, provided that we can easily add (£/„, xn) to a Cartesian tree of size n — 1. This is the methodology adopted in the treaps (Aragon & Seidel, 1989, 1996), in which the C/,'s are permanently attached to the xz's. But since the Cartesian tree is invariant to permutations of the pairs, we can assume without loss of generality in all three representations above that xi
< x2
<
••• <
xn,
and even that Xi = i. We will see that one of the simplest representations (or embeddings) is the one that uses the model ([/ 1 ,l),(C/ 2 ,2),...,([/ n ,n).
(1.4)
Observe that neighboring pairs (Ui,i) and (Ui+i,i + 1) are necessarily related as ancestor and descendant, for if they are not, then there exists some node with key between i and i + 1 that is a common ancestor, but such a node does not exist. We write j ~ i if (Uj,j) is an ancestor of (Uj, i). Finally, note that in (1.1), only the relative ranks of the xai variables matter,
250
Luc Devroye
so that we also obtain a random binary search tree as the Cartesian tree for (l,C/ 1 ),(2,C/ 2 ),...,(n ! C/ n ).
(1.5)
2. The depths of the nodes Consider the depth A of the node of rank i (i.e., of (Ui,i)). We have
But for j < i, j ~ i if and only if Uj=min(Uj,Uj+1,...,Ui). This happens with probability l/(i - j + 1).
Fig. 2.1. We show the time stamps Uj, t/j+i,...,[/» with the axis pointing down. Note that in this set-up, the j-th node is an ancestor of the i-th node, as it has the smallest time stamp among Uj,Uj+i,... ,Ui.
251
Random binary search trees
Therefore,
i
1
n-i+1
..
i=2
J
=£- +v ± i=2
J
= Hi 4- Hn-i+i — 2 where /fn — 52^L1(l/i) is the n-th harmonic number. Thus, sup |E{A} - log(i(n - i + 1))| < 2. l
We expect the binary search tree to hang deeper in the middle. With a in (0,1), we have E{Dan} = 21ogn + log(a(l - a)) + 0(1). Let us dig a little deeper and derive the limit law for Di. It is convenient to work with the Cartesian tree in which Ui is replaced by 1. Then for j
Ui-i),
which has probability l/(i - j). Denoting the new depth by D[, we have obviously D[ > D^. More importantly, X^j
z?; = ^B(i/fc) + ;£ij'(i/fc) fe=i fc=i
where B(p) and B'(p) are Bernoulli (p) random variables, and all Bernoulli variables in the sum are independent. We therefore have, first using linearity of expectation, and then independence, E{DQ = H^
+ Hn-i,
= Hi-i + Hn_i - 6 where 0 < 6 < TT2/3. It is amusing to note that 0 < E{£>{ - Di} = 2 + H^
+ Hn-i -Hi-
Hn_l+1 < 2,
252
Luc Devroye
so the two random variables are not very different. In fact, by Markov's inequality, for t > 0,
m _ A > t } <Mi. V
b
The Lindeberg-Feller version of the central limit theorem (Feller, 1968) applies to sums of independent but not necessarily identically distributed random variables X^. A particular form of it, known as Lyapunov's central limit theorem, states that for bounded random variables (that is, such that supj \Xi\ < c < oo a.s.), provided that Y^i=i Var{X,} —> oo,
E^PQ-ETO)
normal(Q 1}
as n —> oo, where Zn —>c Z denotes convergence in distribution of the random variables Zn to the random variable Z, that is, for all points of continuity x of the distribution function F of Z, lim T{Zn <x} = F(x) = W{Z < x).
n—»oo
The normal distribution function is denoted by $. Applied to D\, we have to a bit careful, because the dependence of i on n needs to be specified. While one can easily treat specific cases of such dependence, it is perhaps more informative to treat all cases together! This can be done by making use of the Berry-Esseen inequality (Esseen, 1945, see also Feller, 1968). Lemma 2.1: (Berry-Esseen inequality). Let Xi,X2, • • • ,Xn be independent random variables with zero means and finite variances. Put
IfE{\Xi\3}
< oo, then
where C = 0.7655. The constant C given in Lemma 2.1 is due to Shiganov (1986), who improved on Van Beek's constant 0.7975 (1972). If we apply this Lemma to i— 1
n—i
D[ - E{D(} - J2(B(l/k) - (1/fc)) + ^(-B'(IA) - (1/*)), fc=l
k=l
Random binary search trees
253
then the the "B n " in Lemma 2.1 is nothing but yJVax{D[}, which we computed earlier: it is y/log(i(n — i + 1)) + 0(1) uniformly over all i. Its minimal value over all i is ^logn — 0(1). Finally, recalling that E{|B(p) - p| 3 } < E{(B(p) - p)2} - p(l - p) < p, we note that x,f
\ v7^^}
-
/
- iP (Var{Da)3/2 _ Cx2Hn
~ (^EgH-O(l)r = O(l/y/logn). The uniformity both with respect to x and i is quite striking.
Fig. 2.2. This figure illustrates the shape of a random binary search tree. The root is at the top, while the horizontal width represents the number of nodes at each level, also called the profile of a tree. Up to 0.3711... logn, all levels are full with high probability. Also, the height (maximal depth) is 4.31107.. .logn + O(loglogn) with high probability. The limit law mentioned above implies that nearly all nodes are in a strip of width O(y'logn) about 2 logn. Finally, the profile around the peak of the belly is gaussian.
254
Luc Devroye
Having obtained a limit law for A!> it should now be possible to to the same for A , which of course has dependent summands. We observe that for fixed t,
F { A - E{A} < x^VarfD^j < P iD'i - E{£>^} < xyJVariDfi +1 + 2 j + P {\D[ - E{£»a - (A - E{A})| > t + 2} < $(z) + 0(1) + P{A^ - A > t} <*(i)+o(l) + | , which is as close to <J?(:r) as desired by choice of t. A similar lower bound can be derived, and therefore,
lim sup P 1 Di~ ^{Dil
<x\-
$(z) = 0.
One can even obtain a rate by choice of t that is marginally worse than
O(l/VI5g^). The depth of the last node added (the one with the maximal time stamp) in the classical random binary search tree (1.1) is distributed as D'N, where N is a uniform random index from 1, 2,..., n, is also asymptotically normal:
Jfo^P JAV - E{AV) < z^VarlAVjj = *(*). This follows from the estimates above with a little extra work (see Devroye (1988) for an alternative proof). It is also known that E{AV} = 2(Hn-l)
and
Ya,r{D'N} = ~E{D'N} - O(l).
We note here that the exact distribution of D'N has been known for some time (Lynch, 1965; Knuth, 1973):
P{AV =fc}= ^ [ n ~ 1 ] 2 f c , l < f c < n , where [•] denotes the Stirling number of the first kind. In these references, we alsofindthe weak law of large numbers: D'N/(2 logn) —> 1 in probability. The same law of large numbers also holds for Av, the depth of a randomly selected node, and the limit law for D^ is like that for D'N (Louchard, 1987).
Random binary search trees
255
3. The number of leaves Some parameters of the random binary search tree are particularly easy to study. One of those is the number of leaves, Ln. Using embedding (1.4) to study this, we note that (Ui, i) is a leaf if and only if (£/»_i, i - 1) and (Ui+i,i + 1) are ancestors, that is, if and only if Ui = max(J/i_i, Ui, E/j+i). Thus, n-l
Ln =C l{£A>£/2} + 2 ^ l{tfi>max(t/i_i,J/i+l)} + l{t/ n >t/ n _i} i=2
= : 1{C/i>£/2} +L'n + 'i-{Un>Un-1}-
This is an n-term sum of random variables that are 2-dependent, where we say that a sequence X\, X2,..., Xn is fc-dependent if Xi,..., Xg is independent of Xt+k+i, • • •, Xn for all t. O-dependence corresponds to independence.
Fig. 3.1. For a tree with these time stamps (small on top), the leaves (marked) are those nodes whose stamps are larger than those of their neighbors.
Let Af{0,
256
Luc Devroye
Lemma 3.1: Let Z\,..., Zn,... be a stationary sequence of random variables (that is, for any t, the distribution of (Zi,..., Zi+e) does not depend upon i), and let it also be k-dependent with k held fixed. Then, if E|Zi| 3 < oo, the random variable vn where fe+i
a2 = V a r ^ } + 2 £(E{ZiZi} - E{Zi}E{Z<}). i=2
The standard central limit theorem for independent (or O-dependent) random variables is obtained as a special case. Subsequent generalizations of Lemma 2.1 were obtained by Brown (1971), Dvoretzky (1972), McLeish (1974), Ibragimov (1975), Chen (1978), Hall & Heyde (1980) and Bradley (1981), to name just a few. Clearly, _ . . r _. 2 n - 2 n + 1 E{L
">=2+ — =
—
a result first obtained by Mahmoud (1986). We study L'n as \Ln — L'n\ < 2. Its summands are identically distributed and of mean 1/3. Thus, £
"-^ 2 ) / 3 ^(oV) y/n - 2
where 4
a2 = Var{Z2} + 2 ^ ( E { Z 2 Z i } - E{Z 2 }E{Z j }). i=3
Here Zi = l{Ui} = max(C/j_i, Ut, Ui+i). As Zi is Bernoulli (1/3), we have Var{Zi} = (1/3)-(2/3) = 2/9. Furthermore, Z2Z3 = 0 as neighboring nodes cannot both be leaves. Thus, a2 - 2(TE{Z2Zi} - E{Z 2 }E{Z 4 }). We have E{Z 2 Z 4 } = ^ by a quick combinatorial argument. Thus, a2 = 2/45. It is then trivial to verify that \fn
\
45/
Random binary search trees
257
Let On denote the number of nodes with one child and let Tn be the number of nodes with two children. The quantities Ln, On and Tn are closely related, since Ln + On + Tn = n,Ln = Tn + 1. This implies that On = n +1 - 2 L n , Tn = Ln-1. Thus, EL n ~ n/3 implies the same thing for E 0 n and EX n . Furthermore, Var(On) ~ 4Var(Xn) ~ Var(Tn). Therefore, Tn follows the same limit laws as Ln, while for On we find that (0n — n/3)/y/n converges in distribution to a normal distribution with zero mean and variance 8/45. 4. Local counters Next we consider parameters that describe the number of nodes having a certain "local" property. The purpose of this section is to generalize the method of the previous section towards all local parameters of a tree, also called local counters. Besides Ln, we may consider Vjtn, the number of nodes with k proper descendants, or Lfcn: the number of nodes with k proper left descendants. Aldous (1990) gives a general methodology based upon urn models and branching processes for obtaining the first order behavior of the local quantities; his methods apply to a wide variety of trees; for the binary search tree, he has shown, among a number of other things, that Vkn/n —* 2/(fc + 2)(fc + 3) in probability as n —> oo. We will give a short proof of this, and obtain the limit law for Vfcn as well. We call a random variable Nn defined on a random binary search tree a local counter of order k if in embedding (1.4) it can be written in the form 71
wn = x;/([/i_fe,...,i/i,...)tfi+fc), i=l
where k is a fixed constant, £/, = 0 if i < 0 or i > n, and / is a {0,1}-valued function that is invariant under transformations of the E/j's that keep the relative order of the arguments intact. Clearly, Ln is a local counter of order one. Local counters have two key properties: A. The t-th and j-th terms in the definition of Nn are independent whenever \i — j \ > 2k. B. The distribution of the i-th term is the same for all choices of indices i € {k + I,... ,n — k}. Thus, we have the convenient representation Nn = An + Y17~k+i Z^ w h e r e ° < ^n < 2fc, and where the Z,'s are identically distributed and 2/c-dependent.
258
Luc Devroye
As a corollary of Lemma 3.1, we have the following limit law. Theorem 4.1: (Devroye, 1991). Let Nn be a local counter for a random binary search tree, with fixed parameter k. Define Zi = f(U{,..., Ui+2k), where U\, U2, . • • is a sequence ofi.i.d. uniform [0,1] random variables. Then ^-nE{Zl} y/n
where 2k+l
a2 = Var{Z!} + 2 £ (EfZ^} - E ^ E ^ } ) . i=2
If JE{Z\} j^ 0, then Nn/n —> E{Zi} in probability and in the mean as n —•> 0 0 .
Proof: The random variable Nn - 2k is distributed as Y^i=ik %%•> a n d satisfies the conditions of Lemma 3.1. Thus, Nn-2k-(n-2k)nZi}^ma2) Here we used the fact that the Zi's are 2fc-dependent. But Nn - nE{Zi}
Nn - 2k - (n - 2k)E,{Zx}
4fc
so that the first statement of Theorem 4.1 follows without work. The second • statement follows from the first one. Let k be fixed, independent of n. Simple considerations show that 14 n , the number of nodes with precisely k descendants, is indeed a local counter. Note that all the proper descendants of a node (Ui,i) are found by finding the largest 0 < j < i with Uj < Uit and the smallest £ greater than i and no (Ue-i,i-l), more than n such that Ue < J7j. All the nodes (Uj+i,j + l),..., (Ui,i) excluded, are proper descendants of (Ui,i). Thus, to decide whether (Ui,i) has exactly k descendants, it suffices to look at the nodes (f/i-fe-i, % - k - l ) , . . . , (Ui+k+i,i
+ k + 1),
so that Vfen is a local counter with parameter k + 1. Theorem 4.1 above implies the following (Devroye, 1991): Theorem 4.2: Let Vkn be the number of nodes with k proper descendants. Then 2 . , , .... Vkn
IT-*Pfc=:(fc
+
3)(fc + 2 )
>»*«*»*>**
Random binary search trees
259
Fig. 4.1. The subtree rooted at the i-th node consists of all the nodes in the shaded area. To the left and right, the subtree is "guarded" by nodes with smaller time stamps, Uj and Ut.
and —^—7=
* N(0,crl)
in distribution
as n —> oo, where o\ =: p fc (l - pk) + 2(fc + Ifpk
- 2(fe + 2)pt
and Ph
5fc + 8 ~' (fc + l)2(A; + 2)2(2fc + 5)(2fc + 3) '
Remark: The first part of this theorem is also implicit in Aldous (1990). Remark: When k = 0, we get p0 = 1/3, p0 = 2/15, cr$ = 2/45. For k = 1, we obtain P l = 1/6, p: = 13/1260 and o\ = 23/420. Proof: We have the representation n
1=1
260
Luc Devroye
where fc j=0
and Zi(j, £) is the indicator of the event that (X^), [/,) has j left descendants and £ right descendants. Assume throughout that 1
{[Ui+i+1
A simple argument shows that for i, j , £ as restricted above, 1
l U Jf
'
(j + £ + 3)!
(j+£ + 3)(j + e + 2)(j+£ + l) •
Thus, for 1 < i - k - 1, i + k + 1 < n, ^
" ~ j)} = (fc + 3)(fc + 2)(fc + l)
^
= :%
'
and
E{Z,} = g E W , * - i ) } = g * = (fc + 3)2(fc + 2) • It is clear that Vfcn is a local counter for a random binary search tree, so we may apply Theorem 4.1. To do so, we need to study E{ZjZ i+r }, where 1 < i - f c - l , i + r + k + l
{
0,
ilr
+ 2;
Pk ,
ifr = k-j+£
+ 2;
iir>k-j+£
+ 2,
E{Zi(j,
where
k\2
=
k - j)}M{Zi+r{£,
k-i)},
((2k + 4\
nf2k
+ 3\
5fc + 8 (fc + l)2(fc + 2)2(2fc + 5)(2fc + 3) '
n(2k
+ 2\\
261
Random binary search trees
The last expression is obtained by noting that of the (2k + 5)! possible p e r m u t a t i o n s of U i - j - i , . . . , U i + r + k - e + i , w i t h r = k - j + £ + 2, t h e r e are
only pk{2k + 5)! such that Zi{j, k - j)Zi+r(t, k - £) = 1. The three terms in the expression of pk are obtained by considering A. Ui+k-j+i is smaller than both #i_j-i and Ui+r+k-e+iB. Ui+k-j+i is smaller than one of t/j-j-i and Ui+r+k-e+iC. Ui+k-j+i is larger than both f/i-j-i and Ui+r+k-e+iIf r > 2fc + 2, then Zj and Zj + r are independent. Thus, we need only consider the case 1 < r < 2fc + 2. Let L, J be independent random variables uniformly distributed on {0,..., k}. k ) f k TE{ZiZi+r} = Et^Zi{j,kj) £ Zi+r(l, k-t)\
[j=o
k
J
e=o
k
k
k
= J2J2Pkl{[r = k - j + £ + 2}} +J2^l1{[r j=o €=o
j=o e=o
> k ~ j + £ + 2}}
= (k + l ) 2 p k P { r = k - J + L + 2 } + ( k + l)2qlTP{r
> k - J + L+ 2}.
Summing, this gives 2k+2
J2 (^{ZiZi+r} - E{Zi}E{Zi+r}) r=l
2fc+2
= (k + 1) V Yl F{ J - L = k + 2 - r} r=l 2 2
2fe+2
+{k + l) q k Y, V{J - L > k + 2 - r} - (2fc + 2)p2k r=l
= (fc + 1 ) V + P ! 5ZlP{^ - i > it + 2 - r} r=l 2fe+2
+pl Y^ ^i-7 -L> k + 2-r}-(2k r=k+2 fc+1
l)2pk+PlYV{J-L>r}
= {k +
r=l A;
+plYnJ-L>-r}~(2k r=0
+ 2)p2.
+ 2)p\
262
Luc Devroye
Rewriting this, so as to get some cancellation, we have 2k+2
]T (E{ZiZi+r} - E{Zi}E{Zi+P}) r=l
fc+2
l)2pk+p2kJ2nJ-L>r}
= (k +
r=2
k
+p2k £ ( 1 - P{ J - L > r}) - (2k + 2)p2k r=0 1 2
= (k + l) P fc -pl
£ > { • / ~L^
r
}+Pl(k
+ 1) - ( 2 * + 2)p2
r=0
= (fc + l) 2 p fe - p2 + Pfc(* + 1) - (2* + 2)p\ = {k + \fPk
-(k
+ 2)p2k .
By Lemma 2.1, Vfc n /n —> pjt in probability as n —> oo and —2——
vn
where
> A/"(0, Ufe)
in distribution,
erg = pfc(l - Pfe) + 2(* + l)2/9fe - 2{k + 2)p^ .
•
5. Urn models
The limit law for Ln can be obtained by several methods. For example, Poblete & Munro (1985) use the properties of Polya-Eggenberger urn models for the analysis of search trees. Bagchi & Pal (1985) developed a limit law for general urn models and applied it in the analysis of random 2-3 trees. In a binary search tree with n nodes, let Wn be the number of external nodes with another sibling external node, and let Bn count the remaining external nodes. Clearly, Wn + Bn = n + l, Wo = 0 and Bo = 1. When a random binary search tree is grown, each external node is picked independently with equal probability (see, e.g., Knuth & Schonhage, 1978). Thus, upon insertion of node n + l, we have: f (0,1) [ (2, -1)
with probability WW»B ; with probability w"^Bn •
Random binary search trees
263
This is known as a generalized Polya-Eggenberger urn model. The model is defined by the matrix
(::)-Gi) For general values of a, b, c, d, the asymptotic behavior of Wn is governed by the following Lemma (Bagchi & Pal, 1985) (for a special case, see e.g. Athreya & Ney, 1972): Lemma 5.1: Consider an urn model in which a + b = c + d =: s > 1, Wo + Bo > 1, 0 < Wo, 0 < Bo, a ^ c, b, c > 0, a - c < s/2, and, ifa<0, then a divides both c and Wo, and if d < 0, then d divides both b and BoThen W c " > T-— almost surely, Ur Wn+Bn b+c and .n d i s t r i b u t i o D ) Wn - E{W n } _^ ^ ( Q > ^ where _
bc
{s
_
b
_
c)2
2
(6 + c) 2fe + 2c - 5 ' 2
For L n , we have cr = 8/45. Since Ln = Wn/2, the variance of L n is one fourth that of W,,, so Theorem 4.1 follows immediately from Lemma 5.1 as well. Additionally, Lemma 5.1 implies that Wn/n —» 1/3 almost surely for our way of growing the tree. 6. Berry-Esseen laws for k-dependent sequences For local counters in which k grows with n, a normal limit law can be obtained by any of a number of generalizations of the Berry-Esseen inequality to Ar-dependent sequences. One of the most versatile ones, given forfc-dependentrandom fields, is due to Chen & Shao (2004). We say that a sequence of random variables Xi,..., Xn isfc-dependentif {Xi : i £ A} is independent of {Xi : i e B} whenever min{|i — j \ : i G A, j S B} > k. Lemma 6.1: If X\, X2, • • •, Xn is a k-dependent sequence with zero means and E{|Xj| 3 } < 00 for alii, then, setting n
Sn = J2Xi>B" = VV*r{Sn}, i=l
264
Luc Devroye
we have
sup p{£ <*)-*(x) <
^x+^nw)
x \Hn ) Bn Shergin (1979) had previously obtained this result with a non-explicit constant. The paper by Chen & Shao (2004) also offers a non-uniform bound in which the factor 60 is replaced by C/(l + |x|) 3 . Lemma 6.1 can be used to obtain limit laws for counters that have A; —> oo. In a further section, we give a related limit law, which, like Lemma 6.1, can be derived from Stein's method. We will describe its applications there.
7. Quicksort Quicksort is a fast and simple sorting method in which we take a random element from a set of n elements that need to be sorted, make it the pivot, and split the set of elements about this pivot (see figure below) using n — 1 comparisons.
Fig. 7.1. Basic quicksort step: the pivot about which a set of elements is split is the first element of the set.
It is well-known that the standard version of quicksort takes ~ 2nlogn comparisons on the average. A particular execution of quicksort can be pictured as a binary tree: the pivot determines the root, and we apply the data splits recursively as in the construction of a binary search tree from the same data. It is easy to see that the binary search tree that explains quicksort is in fact a random binary search tree. We can count the number
Random binary search trees
265
of comparisons by counting for each element how often it is involved in a comparison as a non-pivot element. For an element at depth Di in the binary search tree, this happens precisely Dt times. Thus, the total number of comparisons (Cn) is given by n
cn = Y/Di> where D\,...,
Dn is as for a random binary search tree. Therefore, E{CU = £ E { A } 4=1
n
= Y,{Hi +tfn-i+1- 2) »=1 n
= £2(^-1) i=l
= 2nHn + 2Hn - An ~ 2nlogn . A problem occurs when we try looking at Var{Cn} because the Z),'s are very dependent. Thus, the variance of the sum no longer is the sum of the variances. If it were, we would obtain that Var{Cn} ~ 2nlogn. But other studies (e.g., Sedgewick, 1983; Sedgewick and Flajolet, 1986) tell us that
Var{C n }~(j-^)n 2 . Furthermore, the asymptotic distribution of (Cn —Inlogn)/n is not normal as one would expect after seeing that Cn can be written as a simple sum. In other words, the study of Cn requires another approach. We will tackle it by the contraction method in a future section, and show that it is just beyond the area of application of Stein's method. It is noteworthy that n
Cn - Y,^ - !)' i=l
where Ni is the size of the subtree rooted at the i-th node in the random binary search tree. In other words, Cn is a special case of a toll function, a sum of functions of subtree sizes. These parameters will be studied in the next few sections.
266
Luc Devroye
8. Random binary search tree parameters Most shape-related quantities of the random binary search tree have been well-studied, including the expected depth and the exact distribution of the depth of the last inserted node (Knuth, 1973; Lynch, 1965), the limit theory for the depth (Mahmoud & Pittel, 1984, Devroye, 1988), the first two moments of the internal path length (Sedgewick, 1983), the limit theory for the height of the tree (Pittel, 1984; Devroye, 1986, 1987), and various connections with the theory of random permutations (Sedgewick, 1983) and the theory of records (Devroye, 1988). Surveys of known results can be found in Vitter & Flajolet (1990), Mahmoud (1992) and Gonnet (1984). Consider tree parameters for random binary search trees that can be written as u
where / be a mapping from the space of all permutations to the real line, S(u) is the random permutation associated with the subtree rooted at node u in the random binary search tree, which is considered in format (1.4), and the summation is with respect to all nodes u in the tree. The root of the binary search tree contains that pair (Ui, i) with smallest Ui value, the left subtree contains all pairs (Uj,j) with j < i, and the right subtree contains those pairs with j > i. Each node u can thus (recursively) be associated with a subset S(u) of (U\, 1),..., (Un, n), namely that subset that correponds to nodes in its subtree. With this embedding and representation, Xn is a sum over all nodes of a certain function of the permutation associated with each node. As each permutation uniquely determines subtree shape, a special case includes the functions of subtree shapes. Example 1: The toll functions. In the first class of applications, we let N(u) be the size of the subtree rooted at u, and set f(S(u)) = g(\S(u)\):
* n = £ 5 (JV(u)). u
Examples of such tree parameters abound: A. If g(n) = 1 for n > 0, then Xn = n. B. If g(n) = l{n=fc} for fixed k > 0, then Xn counts the number of subtrees of size k. C. If g(n) — l{ n = i}, then Xn counts the number of leaves.
Random binary search trees
267
D. If g(n) = n — 1 for n > 1, then Xn counts the number of comparisons in classical quicksort. E. If g{n) = log2 n for n > 0, then Xn is the logarithm base two of the product of all subtree sizes. F. If g(n) = l{ n = i} - l{n=2} for n > 0, then Xn counts the number of nodes in the tree that have two children, one of which is a leaf. Example 2: Tree patterns. Fix a tree T. We write S(u) « T if the subtree at u denned by the permutation S(u) is equal to T, where equality of trees refers to shape only, not node labeling. Note that at least one, and possibly many permutations with 15(^)1 = \T\, may give rise to T. If we set x
n = J2 1{S(.u)*T} u
then Xn counts the number of subtrees precisely equal to T. Note that these subtrees are necessarily disjoint. We are tempted to call them suffix tree patterns, as they hug the bottom of the binary search tree. Example 3: Prefix tree patterns. Fix a tree T. We write S(u) D T if the subtree at u denned by the permutation S{u) consists of T (rooted now at u) and possibly other nodes obtained by replacing all external nodes of T by new subtrees. Define X
n = /I 1{S(u)DT} u
For example, if T is a single node, then Xn counts the number of nodes, n. If T is a complete subtree of size 2k+1 - 1 and height k, then Xn counts the number of occurrences of this complete subtree pattern (as if we try and count by sliding the complete tree to all nodes in turn to find a match). Matching complete subtrees can possibly overlap. If T consists of a single node and a right child, then Xn counts the number of nodes in the tree with just one right child. Example 4: Imbalance parameters. If we set /(5(«)) equal to 1 if and only if the sizes of the left and right subtrees of u are equal, then Xn counts the number of nodes at which we achieve a complete balance. To study Xn, we first derive the mean and variance. This is followed by a weak law of large numbers for Xn/n. Several interesting examples illustrate this universal law. A general central limit theorem with normal limit is obtained for Xn using Stein's method. Several specific laws are obtained for particular choices of / . For example, for toll functions g as in
268
Luc Devroye
Example 1, with g(n) growing at a rate inferior to y/n, a universal central limit theorem is established in Theorem 11.1. 9. Representation (1.4) again We replace the sum over all nodes u in a random tree in the definition of Xn by a sum over a deterministic set of index pairs, thereby greatly facilitating systematic analysis. We write : = (i,Ui),...,(i
+
k-l,Ui+k-i),
so that |cr(i,fc)| = k. We define a*(i, k) = cr(i-l,fc + l), with the convention that (0, Uo) = (0,0) and (n + 1, Un+i) = (n + 1,0). Define the event Ai,k = [c(i, k) defines a subtree] . This event depends only on cr*(i, k), as Aitk happens if and only if among Ui-i,... ,Ui+k, Ui-i and Ui+k are the two smallest values. We will call these the "guards": they cut off and protect the subtree from the rest of the tree. We set Y^k — 1{A; k}- Note that if we include 0 and n + 1 in the summation, then 5Zi
Xn = ^f(S(u)) = J2Y, i=l
u
Yi,kfWi,k)) . fc=l
For example, in the example with toll function g, this yields 5 u
n n—i+1
*n = 5>(l ( )D = £ £ Yitkg{k). u
t=l
fe=l
The interest in this formula is that the Y^k are Bernoulli random variables with known mean:
{
1
if i = 1 andi +fc= n + l;
-j^i if i = 1 or i + k = n + 1 but not both; (fc+2)%+i) otherwise. There are about n2 of them, and they are dependent, but at least, all the randomness is now tightly controlled.
269
Random binary search trees
10. Mean and variance for toll functions Let
and Mk =
sup |/((T)| .
Note that |/xfc| < r^ < Mk- In the toll function example, we have Hk = g{k) and Tfc = Mfc = |fir(fc)|. We opt to develop the theory below in terms of these parameters. Lemma 10.1: Assume |/ifc| < oo for all k, ^ = o(k), and
k=\
Define fe=i(fc
+ 2)(fc + l) •
Then hm
n—*oo
Tl
= fi .
If also |/xfc| = O(Vk/log k), then E{Xn} - fin = 0(^72). Proof: We have n
E{Xn} = E n n-i
n-i+1
E 2
Ey
( a}Mfc ri-1
n
+
E E (jb+2)(fc+i)W + E ^ 1 ^ E ^—T2 M - i + l + ^ 1—Z K—1
K—1
i=l
270
Luc Devroye
It is trivial to conclude the first part of Lemma 10.1. For the last part, we have: \E{Xn -
M n}|
oo
n
2
°°
2
fc—1 fc—1
The expression for /j, is quite natural, as the probability of guards at positions i - 1 and i + k is TE{Yitk} = 2/((fc + 2)(fc + 1)). The following technical Lemma is due to Devroye (2002). Lemma 10.2: Assume that Mn < oo for all n and that / > 0. Assume that for some b > c > a > 0, we have /j,n = O(na), rn = O(nc), and Mn = O(nb). If a + b < 2, c < 1, then Var{* n } = o{n2). If a + b < 1, c < 1/2, then Var{Xn} = O(n). If f is a toll function and Mn - O(nb), then Var{Xn} = o(n2) ifb
(E{ZaZ0) - E{Za}E{Z0}) .
(a,/3)GE
We apply this fact, taking A to be the collection of all pairs (i, k), with 1 < i < n and l
271
Random binary search trees
one integer m. But to bound Var{Xn} from above, since / > 0, we have Var{Xn} < Y,
Var{yiifc/(<7(i,fc))}+
E{r ilfc /((7(t,fc))^/(ff(j ) 0)}
Y. ((i,fc),(j/))6E
(i,k)EA
= / + //.
By the independence of Yi
E{Yiikf(a(i,k))Yjfef(a(j,e))} =E{Yi,kYj,e}we
If none of the intervals contains 1 or n, then a brief argument shows that 4 E{y
n
,
n
£ ^ ( f c + ^ + 3)(fc + l)(^+l) ~
^ ^ ( f c + 3)(fc + l)(£ + l )
n
.
£ i £ j ( * + * + 2)(A:+l) +
+
^Z.
( f c +
2)(fc + l ) -
If ^ n = O(n a ) for a > 0, then it is easy to see that the three sums taken together are O(n2a). We next consider nested intervals. For properly nested intervals, with [i, i + k — 1] being the bigger one, we have
E{yiifc/(
272
Luc Devroye
Summed over all allowable pairs (i,k), (j, £) with the outer interval not covering 1 or n, and noting that in all cases considered, a < 1, this yields a quantity not exceeding n
, ,, ,
k
^(i
k=1
+ 2)(e + i)(k + 2)(k + i)
^
lf6 + a |O(nlogn)
^'
if6 + o = l.
The contribution of the border effect is of the same order. This is o(n2) if a + 6 < 2. It is O(n) if a + b < 1. Finally, we consider nested intervals with i = j and £ < k. Then E{y i ; f c /Ki,fc))r^/( C T (j,£))} < E { y a } M
f c
^ .
Summed over all appropriate (i,k,£) such that the outer interval does not cover 1 or n, we obtain a bound of
/C — 1 € — 1
The border cases do not alter this bound. Thus, the contribution to / / for • these nested intervals is o(n2) if a + b < 2 and is O(n) if a + b < 1. 11. A law of large numbers The estimates of the previous section permit us to obtain a law of large numbers for Xn, which was defined at the top of section 8. Theorem 11.1: Assume that Mn < oo for all n and that / > 0. Assume that for some b > c > a > 0, we have \in — O(na), rn = O(nc), and then Mn = O(nb). Ifa + b<2,c
*V n in probability. If f is a toll function and Mn = O(nb), then Xn/n probability when b < 1.
—> fi in
Proof: Note that a < 1. By Lemma 2.1, we have ~E{Xn}/n —> u. Choose e > 0. By Chebyshev's inequality and Lemma 2.1, F{|X n - E{X n }| > en} < ^
^
= o(l) .
Random binary search trees
Thus, Xn/n - E{Xn}/n
-» 0 in probability.
273
•
The result does not apply to the number of comparisons in quicksort as a = b = 1, a border case just outside the conditions. In fact, for quicksort, we have Xn/(2n\ogn) —> 1 in probability. For nearly all "smaller" toll functions and tree parameters, Theorem 11.1 is applicable. Three examples will illustrate this. Example 1: We let / be the indicator function of anything, and note that the law of large numbers holds. For example, let T be a possibly infinite collection of possible tree patterns, and let Xn count the number of subtrees in a random binary search tree that match a tree from T. Then, as shown below, the law of large numbers holds. There is no inherent limitation to T, which, in fact, might be the collection of all trees whose size is a perfect square and whose height is a prime number at the same time. Let Xn be the number of subtrees in a random binary search tree that match a given prefix tree pattern T, with \T\ = k fixed. Theorem 11.2: For any non-empty tree pattern collection T, we have
in probability, and E{Xn}/n
Xn n —> \x, where oo n=l(n
o
+ 2)(n + l)
and /j,n is the probability that a random binary search tree of size n matches an element ofT. Proof: Theorem 11.1 applies since / is an indicator function. Then, by Lemma 10.1, we obtain the limit /i for ~E{Xn}/n. • Note that Theorem 11.2 remains valid if we replace the phrase "matches an element of T" by the phrase "matches an element of T at its root", so that T is a collection of what we called earlier prefix tree patterns. Example 2: Perhaps more instructive is the example of the sumheight Sn, the sum of the heights of all subtrees in a random binary search tree on n nodes. Theorem 11.3: For a random binary search tree, the sumheight satisfies E{«Sn}
274
Luc Devroye
in probability. Here
where hk is the expected height of a random binary search tree on k nodes. Proof: The statement about the expected height follows from Lemma 10.1 without work. As the height of a subtree of size k is at most k — 1, we see that we may apply Theorem 11.1 with Mfc = A; — 1. By well-known results (Robson, 1977; Pittel, 1984; Devroye, 1986,1987), we haveE{if£} = 0(log2 n) where Hn is the height of a random binary search tree. Thus, we may formally take a and c arbitrarily small but positive, and 6 = 1 . • Example 3: Consider Xn = T,u(N(u))0> 0 < /? < 1. Recall that £ u N(u) is the number of comparisons in quicksort, plus n. Thus, Xn is a discounted parameter with respect to the number of quicksort comparisons. Clearly, Theorem 11.1 applies with a = b = c = j3, and thus, Xn/n —* fi in probability, and E{Xn}/n tends to the same constant fi. In a sense, this application is near the limit of the range for Theorem 11.1. For example, it is known that with Xn = ^2u(N(u))1+e, s > 0, there is no asymptotic concentration, and thus, Xn/g(n) does not converge to a constant for any choice of g(n). Also, for Xn = Y^,u N(u), the quicksort example, we have Xn/2nlogn —> 1 in probability (Sedgewick, 1983), so that once again Theorem 11.1 is not applicable.
12. Dependence graph We will require the notion of a dependency graph for a collection of random variables (Za)aev, where V is a set of vertices. Let the edge set E be such that for all disjoint subsets A and B of V that are not connected by an edge, (Za)a€A and (Za)aeB are mutually independent. Clearly, the complete graph is a dependency graph for any set of random variables, but this is useless. One usually takes the minimal graph (V, E) that has the above property, or one tries to keep \E\ as small as possible. Note that necessarily, Za and Zp are independent if (a,/?) ^ E, but to have a dependency graph requires much more than just checking pairwise independence. We call the neighborhood of N(a) of vertex a € V the collection of vertices j3 such that (a, /?) e E or a = /3. We define the neighborhood N(a\,... ,a r ) as Wj=1N(ctj).
Random binary search trees
275
Fig. 12.1. The dependence graph is such that any disjoint subsets that have no edges connecting them correspond to collections of random variables that are independent.
A. Consider now for V the pairs of indices (i,k) with 1 < i < n and 1 < k < n — i + 1. Let our collection of random variables be the permutations a(i,k), (i,k) € V. Let us connect (i,k) to (j,i) when i + k > j - 1 and j +1 > i. This means that the intervals (i,i + k — l) and (j, j + £ — 1) correspond to an edge in E if and only if they overlap or are disjoint and separated by exactly zero or one integer m. We claim that (V, E) is a dependency graph. Indeed, if we consider disjoint subsets A and B of vertices with no edges between them, then these vertices correspond to intervals that are pairwise separated by at least two integers, and thus, (cr(i, k))^tk)eA a n d {a(J>ty)(j,e)eB &re mutually independent. B. Consider next the collection of random variables Yitkg{k). For this collection, we can make a smaller dependency graph. Eliminate all edges from the graph of the previous paragraph if the intervals defined by the endpoints of the edges are properly nested. For example, if we have i < j < j + £—l
276
Luc Devroye
tribution, then given that Z\ and Zn are the two largest values, then Z2,..., Zn^\ are i.i.d. and uniform on [0, min(Zi, Zn)]; the permutation of Z2,..., Zn-\ is independent. Thus, for properly nested intervals as above, Yi^g(k) is independent of Yj^g{t).
13. Stein's method Stein's method (Stein, 1972) allows one to deduce a normal limit law for certain sums of random variables while only computing first and second order moments and verifying a certain dependence condition. Many variants have seen the light of day in recent years, and we will simply employ the following version derived in Janson, Luczak & Rucinski (2000, Theorem 6.21): Lemma 13.1: Suppose that (Sn)y° is a sequence of random variables such that Sn = S a e v Zna, where for each n, {Zna}a is a family of random variables with dependency graph (Vn,En). Let N(.) denote the neighborhood of a vertex or vertices. Suppose further that there exist numbers Mn and Qn such that
Y, E{\Zna\} < Mn,
aEVn
and for every n and r > 1, sup """
ar6V
Y, "^w(«lia,
«,)
1£{\Zn0\\Znai,Zna2,...,Znar}
for some number Br depending upon r only. Let a\ — Var{5n}. Then yVarlSn} if for some real s > 2,
lim^^=0.
Lemma 13.2: (Devroye, 2002). Define Xn = £ u s(|S(u)|) for a random binary search tree on n nodes. The following statements are equivalent: A. Var{Xn} = O(n), i.e., there are positive numbers a,N, such at Var{Xn} > an for alln>N.
277
Random binary search trees
B. Var{Xfc} > 0 for some k > 0. C. The function g is not constant on {1, 2 , . . . } . To study
Xn = 5>(|S(u)|) u
we define C(n) = maxi<j
E{Yi,k\9(k)\}<(v + o(l))n
by computations not unlike those for the mean done in Lemma 10.1, where ^ 2\g(n)\ ^ ( n + 2)(n + l) ' To apply Lemma 13.1, we note that we may thus take Mn = O{n). We also note that o\ — il{n), by Lemma 13.2, since g is not constant on {1,2,...}. Since we can choose s in Stein's theorem after having seen e, it suffices thus to show that Qn — O(n1^2~e) for some e > 0. Note that conditioning on Yitk9{k) is equivalent to conditioning on l^fe. For the bound on Qn, we need to consider any finite collection of random variables Yitk- We will give the bound in the case of just two such random variables, and leave it to the reader to verify the easy extension to any finite number, noting that the quantity Br in Lemma 13.1 can absorb the bigger proportional constant one would obtain then. We may bound Qn by Qn
sup
^
WO.U.<)ev»(p,r)eAr((i,fc),(i,<))
^{Yp,r\YxM,Yjtt\
.
278
Luc Devroye
We show that the sum above is uniformly bounded over all choices of (t ) *),tt,*)by0(log 2 n). Consider the set S = { 0 , 1 , . . . ,n,n + 1} and mark 0,n + l,i - 1, i + k,j — l,j + (. (where duplications may occur). The last four marked points are neighbors of the intervals represented by (i,k) and (j, f). Mark also all integers in S that are neighbors of these marked numbers. The total number of marked places does not exceed 3 x 4+2 x 2 = 16. The set S, when traversed from small to large, can be described by consecutive intervals of marked and unmarked integers. The number of unmarked integer intervals is at most five. We call these intervals Hi,..., H5, from left to right, with some of these possibly empty.
Fig. 13.1. The marking of cells is illustrated. The top two lines show the original intervals. In the first round, we mark (in black) the immediate neighbors of these intervals, as well as cells 0 and n + 1. In the second round, all neighbors (shown in gray) of already marked cells are also marked. The intervals of contiguous unmarked cells are called Hi. There are not more than five of these.
Set H = \JiHi. Define Hc = S - H. Consider Yp>r for r < n fixed. Let s=p + r-lbe the endpoint of the interval on which YPtr sits. Note that YP:r depends upon {Ui}p-i,r) $ N((i,k),(j,£)). B. If p,s G Hc, then we bound E{Yp>r\Yitk,Yj>t} by one.
279
Random binary search trees
C. If p or s is in Hc and the other endpoint is in Hi, then we bound as follows:
nrr,AY,*,Yu)
+ Win\p
t)|
because we can only be sure about the i.i.d. nature of the L/j's belonging to Hi n {p,..., s}, together with the two immediate neighbors of this set. D. If p € Hi, s € Hj, i < j , then we argue as in case C twice, and obtain the following bound: E{y p , r |y, f e ,y^} <
1 + |ftn{Pi...>a}l
*
1 + |2fjn{p>...t,}|
•
The above considerations permit us to obtain a bound for Qn by summing over all (p,r) € N((i,k), (j,t)). The sum for all cases (A) is zero. The sum for case (B) is at most 162 = 256. The sum over all (p,r) as in (C) is at most " 2 x 16 x 5 x ^
1
< 1601og(n + 1) .
r=l
Finally, the sum over all (p, r) described by (D) is at most
G)(f:^T)2^°iog*(»+1). The grand total is O(log2 n), as required. This argument can be repeated for conditioning on finite sets larger than two, as required by Stein's theorem. We leave that as an exercise (see also Devroye, 2002). •
14. Sums of indicator functions In this section, we take a simple example, in which u
where An is a non-empty collection of permutations of length A;, with k possibly depending upon n. We denote pn^ = l-A^J/fc!, the probability that a randomly picked permutation of length k is in the collection An. Particular examples include sets An that correspond to a particular tree pattern, in which case Xn counts the number of occurrences of a given tree pattern of size k (a "terminal pattern") in a random binary search tree. The interest
280
Luc Devroye
here is in the case of varying k. As we will see below, for a central limit law, k has to be severely restricted. Theorem 14.1: We have
E
<*»> = (*r|ffTT) +0 < 1 >
regardless of how k varies with n. If k = o(logn/loglogn), then it follows -> 1 in probability, and that E { X n } -» oo, Xn/E,{Xn} X-E{Xn} yVar{Xn}
K
'
]
Proof: Observe that n-fc+l Xn = 2_^ Yi,kZi • i=l
where Zi — l{a{itk)eAny,
Thus,
nxn} = g
2 (fe + 2) (fc + 1 )
^ }+ 2 x ^EIZ,}
_ 2(n - fc - l)pn,fc (Ar + 2)(jfc + l)
2pn,fc ib + 1 '
This proves the first part of the theorem. The computation of the variance is slightly more involved. However, it is simplified by considering the variance of n-fc
Yn = ]T YitkZi , and noting that \Xn —Yn\ < 2. This eliminates the border effect. We note t h a t YiykYjtk
+ k. T h u s ,
= Oifi<j
n-k
E{Fn2} = Y, ^{Yi,kZi} + 2 *=2
= JE{Yn} + 2
]T
EiYitZiYj+Zj}
2
J]
E{y i , fe Z i y i+t+ i, Jb Z j+fc+ i}
2
+2
J2
EiYiMEiYjtZj}
2
= (n-k-l)0
+ 2(n — 2fc — 2)a + (n - 2fc)2/32 + (lOifc + 6 - 5n){32
281
Random binary search trees
where a = ^{Y2,kZ2Y3+k,kZ3+k},
and /3 = E{Y 2 , fc Z 2 }. Also,
(E{y n }) 2 = ( ( n - f c - l ) / 3 ) 2 . Thus, Var{Yn} = 2(n - 2jfc - 2)a + (n - Jfe - l)/3 + ((n - 2k)2 - (n - k - I ) 2 + (lOfc + 6 - 5n)) /32 = n (2a + p - {2k + 3)/32) + 0{ka + kp + fc2/32) . We note that /? = 2p n ^/(fc+2)(fc+l). To compute a, let A, B, C be the minimal values among U1:..., Uk+i; 14+2, and I/fe+3,..., f/2)t+3, respectively. Clearly, a = p2ifcE{Y2,fcY3+fc,fc}. Considering all six permutations of A,B,C the latter expected value as 2
1
1
1
1
separately, one may compute
1
2
2k + 3 2k + 2k + l +2k + 3k + lk + l 5k + 3 ~ (2fc + 3)(2fc + l)(fc + l) 2 '
+
1
1
2k + 32k + 22k + l
Thus, (5fc + 3)p 2 fc (2k + 3)(2fc + l)(k + I ) 2 "
01
We have Var{Yn} = n ^ - , * ( 2 f c 2
+ 3)(2fc +1)(Jfe
+1)a
8fc + 12
2
\
" P - fe (fc + 2)2(fc + l) 2 + ^ f c ( f c + 2)(fc + l ) J Note that regardless of the value of pnfk, positive. Indeed, the coefficient is at least / \(k
2
+ 2) (k + l) (2k
\
+ 3)(2k + l)J
"
Thus, there exist universal constants c\, c2, c3 > 0 such that
Var{Fn} > cinplk/k2
^*/*> •
the coefficient of n is strictly
8fc4 + 18fc3 + 4fc2 -6k 2
+O
- c2pn,k/k ,
and Var{Yn} < c3npntk/k2
.
282
Luc Devroye
We have Yn/E,{Yn} -> 1 in probability if Var{y n } = o(E 2 {Y n }), that is, if k = o(^npntk)- Using pn,k > 1/fc!, we note that this condition holds if k = o(log n / log log n). Finally, we turn to the normal limit law and note that r
"" E ™^(o,i)
y/Yax{Yn} if fc = o(logn/loglogn). For the proof, we refer to Lemma 13.1 and its notation. Clearly, Mn = O(n). Furthermore, with r as in Lemma 13.1, we have Qn = O(rk). Finally, we recall the lower bound on the variance from above. Assume that npn,k/k —> oo. The condition of Lemma 13.1, applied with s = 3, holds if lim
nk2 , ,—jpr = 0 .
n—oo n6'zPn
A k/k
Since pn^ > 1/fc!, this in turn holds if A;5fc!3 lim — p - = 0.
n-»oo y ' n
Both conditions are satisfied if k = o(log n / log log n).
•
The limitation on k in Theorem 14.1 cannot be lifted without further conditions: indeed, if we consider as a tree pattern the tree that consists of a right branch of length k only, then E { X n } —> 0 iffc> (1+e) log n/log log n for any given fixed e > 0. As Xn is integer-valued, no meaningful limit laws can exist in such cases.
15. Refinements and bibliographic remarks Central limit theorems for slightly dependent random variables have been obtained by Brown (1971), Dvoretzky (1972), McLeish (1974), Ibragimov (1975), Chen (1978), Hall & Heyde (1980) and Bradley (1981), to name just a few. Stein's method (our Lemma 13.1, essentially) is one of the central limit theorems that is better equipped to deal with cases of considerable dependence. It is noteworthy that Berry-Esseen inequalities exist for various models of local dependence such as fc-dependence (Shergin (1979), Baldi & Rinott (1989), Rinott (1994), Dembo k Rinott (1996)). The paper of Chen & Shao (2004) covers all of these and provides Berry-Esseen inequalities in terms of a generalized notion of dependence graph.
Random binary search trees
283
Stein's method offers short and intuitive proofs, but other methods may offer attractive alternatives. Using the contraction method and the method of moments, Hwang & Neininger (2002) showed that the central limit result of Theorem 13.3 holds with G(n) = O{na), a < 1/2. Our result is weaker, but requires fewer analytic computations. The ranges of application of the results are not nested—in some situations, Stein's method is more useful, while in others the moment and contraction methods are preferable. The previous three section are based on Devroye (2002). Aldous (1991) showed that the number V&in of subtrees of size precisely equal to A; in a random binary search tree is in probability asymptotic to 2/(fc + 2)(fc + 1). The limit law for Vkn (Theorem 4.2) is due to Devroye (1991). The latter result follows also from the previous section if we take as toll function f(a) — l{|CT|=fc}. Recently, there has been some interest in the logarithmic toll function /() = log |er| (Grabner & Prodinger, 2002) and the harmonic toll function f(a) = Y^i 1/i (Panholzer & Prodinger, 2002). The authors in these papers are mainly concerned with precise first and second moment asymptotics. Fill (1996) obtained the central limit theorem for the case /(
16. The toll function ladder The various types of limit behavior can best be illustrated by considering the toll function g(n) = na. Define the random variable
Xn = £>(«)» u
where N(u) is the size of the subtree rooted at node u. For a = 1, we have Xn = n + Ln, as can easily be verified. We can use (yet another!) representation of a random binary search tree in terms of the sizes of left and right subtrees of the root, [nil], and [n(l — U)\, respectively, where U
284
Luc Devroye
is uniform [0,1]. A different independent uniform random variable is thus associated with each node. For a first intuition, we can drop the truncation function.
Fig. 16.1. The root splits a random binary search tree into two subsets, of sizes N and n-i-N respectively, where N is distributed as [nU\, and u is uniform [o,i].
Call the (random) binary search tree T. Augment the tree T by associating with each node the size of the subtree rooted at that node, and call the augmented tree T". The root of T" has value n. Since the rank of the root element of T is equally likely to be 1 , . . . , n, the number N of nodes in the left subtree of the root of T is uniformly distributed o n { 0 , l , . . . , n — 1}. A moment's thought shows that N =c [nU\ , where U is uniformly distributed on [0,1]. Also, the size of the right subtree of the root of T is n — 1 — N, which is distributed as \_n(l — U)\. All subsequent splits can be represented similarly by introducing independent uniform [0,1] random variables. Once again, we have used an embedding argument: we have identified a new fictitious collection of random variables Ui, U2, • •., and we can derive all the values of nodes in T' from it. This in turn determines T. However, the tree T obtained in this manner is only distributed as the T we are studying—it may be very different from the (random) instance of T presented to us. The rule is simply this: in an infinite binary tree, give the root the value n. Also, associate with each node an independent copy of U. If a node has value V, and its assigned copy of U is U' (say), then the value of the two children of the node are
285
Random binary search trees
[VU'\ and [V^l - U')\ respectively. Thus, the value of any node at distance k from the root of T" is distributed as [-••[\nU1\U2\---Uk\
,
where U\,..., Uk are i.i.d. uniform [0,1].
Fig. 16.2. The augmented tree T': the values associated with the nodes are the sizes of the subtrees of corresponding nodes in T. Only the values of the nodes two levels away from the root are shown (modulo some floor functions). The random splits in the tree are controlled by independent uniform [0,1] random variables (/1,(/2,...
For a > 1, the total contribution to Xn from the root level is na. From the next level, we collect na(Ua + (1 - U)a), from the next level na(Ua(U^ + {l-U1)a)+(l-U)a(U^ + (l-U2)a)), and so forth, where all the UiS are independent uniform [0,1] random variables. Taking expectations, it is easy to see that the expected contribution from the i-th level is about
\a+lj This decreases geometrically with i, the distance from the root. Therefore, the t o p few levels, which are all full, are of primary importance. It is rather easy to show t h a t
where Y may be described either as 1 + t / f + (1 -Ux)a
+ C/f(C/2Q + (1 - U2)a) + (1 - C/i) Q (t/ 3 a + (1 - U3)a) + • • • ,
286
Luc Devroye
or as the unique solution of the distributional identity
Y =c 1 + UaY + (1 - U)aY' where Y, Y', U are independent and Y =c Y'. Taking expectations in this recurrence, we see that E{y} = (a + l)/(a - 1), and that this tends to oo as a i 1. At a = 1, we have a transition to the next behavior, because Xn/n —> oo in probability. The contributions from the first few full levels is n, so that we expect Xn to be of the order of nlogn. In fact, as we already know, Xn/(n log n) —> 2 in probability. To discover a limit law, we have to subtract the mean, and, as we now know,
Xn-E{Xn} n
^
y
where Y has the quicksort limit distribution. See the next section for a description and derivation. For a < 1/2, we have seen (Theorem 13.3) that Xn-E{Xn}
^
2
for a certain a1 > 0 depending upon a. The main contributions to Xn all come from the bottom levels, and normality is obtained because they are slightly dependent. In the range 1/2 < a < 1, we have an intermediate behavior, with
xn-nxn} where Y satisfies the distributional identity given above. The details are given by Hwang & Neininger (2002).
17. The contraction method In some cases, a random variable Xn can be represented in a recursive manner as a function of similarly distributed random variables of smaller parameter. A limit law for Xn, properly normalized, can sometimes be obtained by the contraction method. This requires several steps. First, we need to settle on a metric on the space of distributions of interest. Then, we may attempt to normalize Xn in such a way that the limit distribution can be guessed as a stable point of a distributional identity (which one gets from the recursive description of Xn). Then one establishes that the recursion
Random binary search trees
287
yields a contraction for the given metric, that is, a fractional reduction in the distance between two distributions. This implies in many cases that the stable point is unique. Finally, the most work is usually related to the proof that Xn or its normalized version tends to that unique limit law in the given metric. We will illustrate this methodology on the number of comparisons in quicksort. For other examples, we refer to Rosier & Riischendorf (2001) or Neininger & Riischendorf (2004). Let Ln denote the internal path length of a random binary search tree Tn. We recall that this is equal to the number of comparisons needed to sort a random permutation of {1,2, ...,n} by QUICKSORT. We write =c to denote equality in distribution. The root recursion of Tn implies the following: Ln =c LZn-i + L'n_Zn + n - 1, n > 2, where Zn is uniformly distributed on {1,2,...,n}, L, =c L't, and Zn,Li,L\,\ < i < n, are independent. Furthermore, LQ = L\ = 0. We know that E{L n } ~2nlogn from earlier remarks, so some normalization is necessary to establish a limit law. We define _ L»-E{L»} ' n
—
•
n Regnier (1989) was the first one to prove by a martingale argument that Yn —>c Y, where Y is a non-degenerate non-normal random variable, and —>£ denotes convergence in distribution. Rosier (1991, 1992) characterized the law of Y by a fixed-point equation: Y is the unique solution of the distributional identity Y=cUY + (l-U)Y' + ip(U), where U is uniform on [0,1], Y =c Y', and U, Y, Y' are independent, and
288
Luc Devroye
Step 1: a space and a metric. Let T> denote the space of distribution functions with zero first moment and finite second moment. Then the Wasserstein metric d2 is defined by
Fig. 17.1. Illustration of optimal coupling for the Wasserstein metric.
If d2(F,G) = 0, then there exists a coupling such that X — Y with probability one, and thus, F = G. It is known that (T>, d2) is a complete metric space. Observe furthermore that a sequence Fn converges to F in T> if and only if Fn converges weakly to F and if the second moments of Fn converge to the second moments of F.
Random binary search trees
289
Step 2: prove that there is a unique fixed point. We begin by showing that the key distributional identity has a unique solution satisfying E{y} = 0. Lemma 17.1: Define a map M : T> —> V as follows: if Y and Y' are identically distributed with distribution function F of zero mean, U is uniform [0,1], and U, Y and Y' are independent, then M(F) is the distribution function of UY + {l-U)Y'
+ ip(U).
Then we have a contraction with respect to the Wasserstein metric: d2(M(F),M(G))<^d2(F,G). (In other words, M(-) is a contraction, and thus, there is a unique fixed point F eV with M(F) = F. Indeed, set Fo = F, Fn+1 = M(Fn), and note that d2(Fn+1),Fn) < ^/2/3d2(Fn,Fn-1) < (2/3)nd2(FuF0) -^Oasn^oo. But then Fn converges in (T>, d2) to some distribution function F. There is at least one solution. There can't be two, as d2(M(F), M(G)) = d2(F, G).) Proof: Let X, X' both have distribution function F, and let Y, Y' both have distribution function G, and assume that the means are zero. Furthermore, let U be as above, let (X,Y),(X',Y'),U be independent. Then, as M(F) is the distribution function of UX + (1 - U)X' +
m
290
Luc Devroye
is a contraction. Step 3: prove convergence to the fixed point. So, we are done if we can show that d,2(Fn,F) —* 0, where Fn is the distribution function of Yn and F is the distribution function of Y. This will be the first time we will need the actual form of ip. We note throughout that Y and Yn have zero mean and that both have finite variances. Prom the recursive distributional identity of Ln, we deduce these distributional identities: YQ = Y\ = 0, and furthermore, for n > 2, Ln--E{Ln} n LZn-i+L'n_Zn+n-\ E{£z n -i|Z n } + E { i ; _ z J Z n } + (n - 1) E{L Zn -i|Z n } -E{L Z n -i} +E{K_ z JZ n } - E { L ; _ Z J n ^ Zn - 1 , n — Zn =C Yzn-1 ^Yn-Zn \-Vn\Zn) |
where Mj)
=
=
nLj-i}
+ E{L;_3-} - ( E { L ; _ Z J + E { L Z n - 1 »
E{L,--x} + EiL;.,-} - QE{Ln ~{n- 1)}) n
Let us do a quick back-of-the-envelope calculation, replacing Zn by nil, n - Zn by n(l - [/) and E{Lj} by 2i logi If the Yn's indeed converge to Y in law, then we have that Y should be approximately distributed as UY + (l-U)Y' + ip{U). Theorem 17.2: lim d2(Fn,F) = 0 .
n—>oo
Proof: We let (Y',Y^) be an independent copy of (Y, Yn) for all n, and let these pairs be independent of U. It is critical to pick versions of (Yj, Y) and {Y-,Y') such that
E{(^- - Yf) = E{(*y - Y')2} = <%(FjtF).
291
Random binary search trees
Also, the sequence of Yj's is independent, and similarly for YL Since Zn is uniform on { 1 , . . . , n}, it is distributed as \nU], where U is uniform [0,1]. Thus,
Use one step in the distributional identity of Y to note the following:
n[/ 1 fi(P F-\
TTY UY ++rY' n_{nU-l " ~
^
-(l-U)Y' +
where we picked the same Y and Y' in both distributional identities. By the fact that Y, Y', Yn,Y^ have all zero mean, and the independence of (Y, Yn), (Y',Yn) and U, the right-hand-side is equal to
2vf(YlnU]JnU]n~l
-UYy\+-E{(
(17.1)
Here we used the symmetry in the first two terms. We treat each term in the upper bound separately, starting with the first one. Note that
n
n
so that, writing 8 for any quantity in absolute value less than one, the first
292
Luc Devroye
term is not more than
-i§»{(«-^^)"} ^X>(^)2E{«-,-^} <-11 (^r) 4iFs-uF) + i T E { |yi K ., - n> + 2=i£l <-supdi(F J -,.F) + -supd 2 (F J -,F) > /E{y5}+
l
2
J
(by the Cauchy-Schwarz inequality) (2
4J-E{Y2}\
^3
n
J j
2,
.
4V/E{^}
2E{Y 2 }
n
n2
The second term of (17.1) is bounded from above by the square of sup \
0
We recall that, if Hn denotes the n-th harmonic number, then we have Hn = logn + 7 4- (l/2n) + O(l/n 2 ), where 7 = 0.57721566490153286060... is the Euler-Mascheroni constant, so that E{L n } = 2nHn + 2Hn-4n = 2nlogn + n(2f-i)
+ 2logn + 2'y+l + O(l/n).
Thus, uniformly over 2 < j < n — 1, n - 1 + EfLj,!} + E{L n _j} - E{L n } _ n + 2j log(j - 1) + 2(n - j + 1) log(n - j) - 2(n + 1) logn + O(l) n
n
\ n J
n
\ n J
\nj
Denoting by 6 any number between 0 and 1, we then have uniformly over
293
Random binary search trees
2 < j < n - 1, and over (j — l)/n < u < j/n, \
= o( -) V \n/
J^R
n
\nj +^
n
log(tLL\ + >±zl±H log f!izi) \ nJ
log (Lil\
2
n
\ n J
_ ("-J + *) log("-* + *)
\ n) n \ n \ nJ n \ n J
n J
^ log f^i) - fcil iog (n-j + e) n \ n) n \ nJ
+;K^)RK^)RK^)| +
2(n - j) n
/ n - j \ + 61ogn \n — j + 9 J n
For j' = 1 and j = n, ywQ-)
= T t - 1 ^ E { L "n- l } - E { L " } = l + o (\r ^n> ) J ,
and for the same j , with u = (j — 9)/n,
sup |^(U) _ i| = o C^iny Combining this shows that sup 0
Mnu\)-ip{u)\=0(^\. n
\
J
We conclude that for some universal constants $C$ and $D$,
\[
d_2^2(F_n, F) \le \Bigl(\frac{2}{3} + \frac{C}{n}\Bigr)\sup_{j<n} d_2^2(F_j, F) + \frac{D}{n}.
\]
It is left as an easy exercise to show that this implies that $\lim_{n \to \infty} d_2^2(F_n, F) = 0$. $\blacksquare$
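For completeness, here is one possible route through the exercise (a sketch; it uses the weighted bound obtained before passing to the supremum above). Write $a_n = d_2^2(F_n, F)$. First, the displayed inequality gives boundedness: once $n$ is large enough that $2/3 + C/n \le 3/4$, we have $a_n \le \frac{3}{4}\sup_{j<n} a_j + D$, from which $M = \sup_n a_n < \infty$ follows by induction. Next, the proof above actually established
\[
a_n \le \frac{2}{n}\sum_{j=1}^{n}\Bigl(\frac{j-1}{n}\Bigr)^2 a_{j-1} + \varepsilon_n, \qquad \varepsilon_n = O\Bigl(\frac{1}{n}\Bigr).
\]
Fix $\delta \in (0,1)$ and split the sum at $j = \delta n$: the early terms contribute at most $(2/3)\delta^3 M$, and the late terms at most $(2/3)\sup_{j \ge \delta n - 1} a_j$. Taking $\limsup_{n \to \infty}$ and then letting $\delta \downarrow 0$ yields $L \le (2/3)L$ for $L = \limsup_n a_n$, whence $L = 0$.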
Acknowledgements: Research of the author was sponsored by NSERC Grant A3456 and by FCAR Grant 90-ER-0291. The referee's comments are much appreciated.